GENERATIVE FOUNDATION MODEL ASSISTED PRIVACY-ENHANCING COMPUTING IN
HUMAN-CENTERED MACHINE INTELLIGENCE
by
Tiantian Feng
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
May 2024
Copyright 2024 Tiantian Feng
Dedication
This dissertation is dedicated to my wife and family, and also to my grandfather, in loving memory.
Acknowledgements
I am delighted to have finished one of the last parts of this thesis, which I decided to write on the last day of my Ph.D. study. Throughout my entire Ph.D. journey, I gained valuable insights into conducting rigorous, innovative, and responsible research. I am also grateful for the support of fellow students, collaborators, and advisors who helped me throughout this journey, leading me to learn how to connect with others, help them, and share my experiences with them.
In particular, I would like to thank my advisor, Dr. Shrikanth Narayanan, for providing me with an enormous amount of support throughout the program. I learned how to be a responsible researcher, scientist, and scholar, and came to believe that designing technology that works for everyone is important. Moreover, his dedication to his students keeps reminding me of how important it is for researchers to offer their knowledge and experiences to others, especially young researchers. I would also like to express my gratitude to my committee members, Dr. Morteza Dehghani, Dr. Kristina Lerman, and Dr. Aiichiro Nakano, for providing valuable suggestions to make this dissertation better.
Furthermore, I would like to extend my gratitude to my colleagues at the SAIL lab. I appreciate the help from Dr. Brandon M. Booth, Dr. Karel Mundnich, and Amrutha Nadarajan, who offered me numerous suggestions and mentorship throughout the TILES study. In addition, I am thankful for the discussions with and suggestions from Rajat Hebbar, Digbalay Bose, Kleanthis Avramidis, Rimita Lahiri, Anfeng Xu, and many other amazing SAILers throughout my Ph.D. study. Moreover, I appreciate my collaborators at various universities, Dr. Murali Annavaram, Dr. Emilio Ferrara, Dr. Mi Zhang, Dr. Mara Mather, Dr. Salman Avestimehr, Tuo Zhang, Samiul Alam, and many others, for their valuable feedback and guidance. Lastly, I would like to express my thanks to my industry collaborators and mentors, Dr. Rahul Gupta, Dr. Anil Ramakrishna, Dr. Dimitrios Dimitriadis, Dr. Ju Lin, and many others, for helping me with internship projects and university research collaborations.
I would also like to thank Tanya Acevedo-Lam, Lizsl De Leon, and Andy Chen for their incredible
support over the past few years. Finally, I would like to thank my wife, parents, and friends, who have
always been my greatest source of support. I dedicate this dissertation to them.
Table of Contents
Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
    1.1 Human-centered Machine Learning Systems
    1.2 Privacy in Human-centered Machine Learning
    1.3 Generative Foundation Models For Privacy-Enhancing Computing
    1.4 Research Directions
        1.4.1 Data Perturbation
        1.4.2 Federated Learning
        1.4.3 Foundation Model Assisted Privacy-enhancing ML
    1.5 Roadmaps
Chapter 2: Privacy-enhancing Computing Background
    2.1 Privacy Threats
    2.2 Privacy Mitigation
    2.3 Machine Learning with Synthetic Data
Chapter 3: Data Perturbation For Privacy-Enhancing Human-centered ML
    3.1 Introduction
    3.2 Human-centered Application - Speech Emotion Recognition
    3.3 Speech Emotion Recognition Datasets
    3.4 Noise Perturbation For Privacy-enhancing Computing
        3.4.1 Problem Definition
        3.4.2 Relevant Feature Selection - Cloak
        3.4.3 Combining selective noise perturbation with Adversarial Training
        3.4.4 Training Procedures
        3.4.5 Primary task model and secondary task model
    3.5 Results on Feature Perturbation
    3.6 Challenges of Data Perturbation in Privacy-enhancing Computing
    3.7 Conclusion
Chapter 4: Federated Learning For Privacy-enhancing Human-centered ML
    4.1 Introduction
    4.2 Benchmarking FL with Multi-modal Applications
        4.2.1 Human-centered Datasets and Applications
        4.2.2 End-to-end Multi-modal Federated Learning Framework
        4.2.3 Experimental Details
    4.3 FL Benchmark Performance on Human-centered Applications
    4.4 Challenges of FL in Human-centered Applications
        4.4.1 Performance Gap to Centralized Learning
        4.4.2 Performance Degradation to Real-world Noises
            4.4.2.1 Categorizations of Data Degradation
            4.4.2.2 Experimental setup
            4.4.2.3 Lower-level Data degradation on FL Performance
            4.4.2.4 Higher-level Data degradation on FL Performance
    4.5 Conclusion
Chapter 5: Privacy-enhancing Unimodal Learning Through Foundation Models
    5.1 Introduction
    5.2 Application - Automatic Speech Understanding (ASU)
    5.3 ASU Datasets
    5.4 What Research Questions Are We Interested In?
    5.5 Foundation Model Assisted Low-resource ASU Training
        5.5.1 Speech Generation
        5.5.2 ASU Modeling
    5.6 Experiment Details
    5.7 Results
        5.7.1 Zero-shot performance using synthetic speech for training ASU models
        5.7.2 Comparing limited speech training with synthetic speech training for ASU
        5.7.3 Combining synthetic speech training with limited speech training for ASU
        5.7.4 Heuristic investigation on the data generation
    5.8 Conclusion
Chapter 6: Privacy-enhancing Multimodal Learning Through Foundation Models
    6.1 Introduction
    6.2 Application - Multimedia Action Recognition
    6.3 Multimedia Action Recognition Datasets
    6.4 Missing Modality Problem Formulation
        6.4.1 Problem Notation
        6.4.2 Visual-Modality Missing in Training Data
        6.4.3 Visual-Modality Missing in Any Data
    6.5 GTI-MM: Foundation Model to Assist Sensitive Modality Missing
        6.5.1 Visual Data Generation
        6.5.2 Diversity Enrichment in Data Generation
        6.5.3 Multi-modal Learning with Visual Imputation
    6.6 Experimental Details
        6.6.1 Visual Data Generation
        6.6.2 Pre-trained Multi-modal Model
        6.6.3 Model Training and Evaluation
    6.7 GTI-MM with Visual-Missing in Training
        6.7.1 Would audio be enough for action recognition?
        6.7.2 Zero-shot results with synthetic visual data
        6.7.3 GTI-MM with limited visual data
        6.7.4 Is GTI-MM effective when more visual data is available?
        6.7.5 Quantity of Visual Generation
        6.7.6 Diversity of Visual Generation
        6.7.7 Complexity of the Text Prompt in Text-to-image Generation
    6.8 GTI-MM with Visual-Missing in Any Data
        6.8.1 Dropout Training with GTI-MM
        6.8.2 Prompt Learning with GTI-MM Dropout
    6.9 Conclusion
Chapter 7: Extending Foundation Model to Assist Federated Learning
    7.1 Introduction
    7.2 Experimental Datasets
    7.3 Generative Foundation Model Assisted FL
        7.3.1 Synthetic Data Generation
        7.3.2 Pre-training Downstream Model on Synthetic Data
        7.3.3 Finetune Trained Downstream Model on Private Client Data with FL
    7.4 Experimental Details
    7.5 Can we use synthetic zero-shot learning to replace FL?
    7.6 Can Pre-training with Synthetic Data Improve FL Performance?
    7.7 Conclusion
Chapter 8: Conclusions and Future Works
    8.1 Conclusion
    8.2 Future Works
Bibliography
List of Tables
3.1 Prediction results for both primary task model and secondary task model using original input x ∈ De. We report results for whole De and also each subcorpus in De.
3.2 Prediction results for both primary task model and secondary task model using original input x ∈ De with 5% of data. We report results for whole De and also each subcorpus in De.
4.1 Overview of the 4 datasets included in the FL benchmark.
4.2 The table presents the benchmarking performance. Text colors in blue indicate the best performance using Attention-based Fusion.
4.3 FL Benchmarking performance compared to centralized learning baselines.
5.1 Summary of dataset statistics used in this sub-work.
6.1 Summary of dataset statistics used in this sub-work.
6.2 Comparisons among complete audio, complete visual, and complete multi-modal models across different datasets. The visual missing ratio p = 0% in training complete visual and complete multi-modal models.
6.3 Performance comparisons between GTI-MM and other baselines across different datasets with low-resource visual data presents.
6.4 Performance comparisons between label prompt and LLM-assist prompt in image generation.
6.5 Comparisons between MM-Dropout and GTI-MM Dropout. p and q are training and testing visual missing ratios.
6.6 Comparing GTI-MM DT with and without prompt learning on the condition where visual modality is missing in test data. p = 99% for MiT datasets, while p = 95% for other datasets.
7.1 Accuracy performance of the generated downstream model and standard FL on benchmark datasets. "1x Synthetic" represents the size of synthetic data with the same size as the real data. FedAvg and FedOpt are both trained with complete data.
7.2 Accuracy comparison between generated downstream model, standard FL, and GPT-FL. "∆Metric" represents the accuracy increment by GPT-FL on top of the generated downstream model.
List of Figures
2.1 The figure presents the general membership inference attack. In this attack, the privacy attack starts with collecting a set of shadow training datasets that are used to train the shadow models. Following the shadow training, the attacker trains an MIA classifier through the model outputs from the target and shadow models. At the attacking stage, the attacker infers the membership property given an input posterior from the target model with the trained MIA classifier. This figure is from our prior review article in [32].
3.1 Training procedures of Cloak which include perturbation training and suppression-value training. The picture is from our work in [31].
3.2 Architecture of the Cloak + Adversarial training setup. The picture is from our work in [31].
3.3 Architecture of the primary task model in predicting emotion and secondary task model in predicting gender. The picture is from our work in [31].
3.4 Prediction results for both primary and secondary task models using noise perturbation ϕ(x). The UAR represents the UAR performance on the combined dataset.
4.1 The figure demonstrates the overall architecture of our proposed multimodal federated learning framework. The figure is from our work in [30].
4.2 The architecture of the basic model.
4.3 Relative performance changes with 10% data corrupted (missing modalities vs. missing labels vs. erroneous labels).
4.4 Relative performance changes with 30% data corrupted (missing modalities vs. missing labels vs. erroneous labels).
4.5 FL performance over training rounds on UCF101 dataset with missing modality, missing labels, and erroneous labels.
5.1 The proposed synthetic data generation and ASU training framework. We begin with generating transcriptions of spoken utterances through label information. The transcripts are then fed into the text-to-speech model, creating synthetic speech data. Finally, we perform end-to-end ASU training using pre-trained WavLM models using synthetic and limited human data. The figure is from our work in [29].
5.2 ASU fine-tuning performance between real and synthetic speech data.
5.3 Fine-tuning performance between low-resource real speech data and synthetic speech data.
5.4 Limited speech training with pre-training on synthetic speech.
5.5 Fine-tuning performance between random initialization and synthetic data assisted model initialization in low-resource speech training. The x-axis represents the available real speech data ratio presented for low-resource training.
6.1 Problem formulation of missing modalities in this work with audio-visual recognition as the example. The missing modality includes cases in training data alone or any data. The figure is from our work [35].
6.2 Learning framework of GTI-MM: Imputing missing visual modality with synthetic visual content for robust multi-modal learning. The figure is from our work [35].
6.3 Visual data generation process in GTI-MM. The figure is from our work [35].
6.4 Comparisons between GTI-MM and zero-shot learning with synthetic visual data.
6.5 Performance comparisons among GTI-MM and other baselines at different training visual modality ratios.
6.6 Impact of generation quantity on GTI-MM performance.
6.7 Impact of generation tricks on GTI-MM performance.
6.8 Impact of varying action performers in image generation on GTI-MM performance.
6.9 Comparisons on testing visual modality missing between MM-Zero Imputation and GTI-MM Dropout, where training visual modality missing ratio p equals testing visual modality missing ratio q.
7.1 Overview of GPT-FL. In this thesis, we focus on the speech and audio application, while the visual recognition results can be referenced in [103].
7.2 The figure shows the learning curve of the global model during training on the Google speech commands dataset. The FL algorithm used is FedAvg, and the figure is from our work [103].
7.3 The figure shows the smoothed gradient diversity of client updates during training on the Google speech commands dataset. The FL algorithm used is FedAvg, and the figure is from our work [103].
8.1 Visual generation examples for classes of cooking and hiking. The image examples are from our work in [35].
Abstract
Human-centered machine intelligence has revolutionized many leading domains, providing more intelligent services and applications in transportation, healthcare, and education. The advances in these fields
profoundly change how people live, work, and interact with each other. These systems frequently utilize
state-of-the-art machine learning (ML) algorithms to achieve a comprehensive understanding of human
conditions, including how people perceive, feel, and interact with others, which provides possibilities to create technologies that increasingly augment human experiences. Despite the promise human-centric ML systems deliver, they create critical risks of leaking sensitive information that AI practitioners should consider protecting. Such sensitive information can include demographics (e.g., age, gender), human states (e.g., health, emotions), or biometric fingerprints. In this thesis, I explore privacy-enhancing computation associated with human-centered ML. My thesis investigates established approaches to preserving privacy in diverse human-centered applications. However, we identify that these approaches are frequently ineffective when encountering limited data due to privacy restrictions on sensing, storing, and using such data. Concurrently, generative foundation models constitute a rapidly evolving research field, leading to the success of modern generative AI capable of creating realistic and high-fidelity digital content. These advances in foundation models and generative AI also present opportunities for privacy-enhancing computing, as high-quality generated content can serve as training data. This leads us to explore using the foundation model to generate training data to mitigate the limited training conditions encountered with sensitive data in human-centered applications. Our extensive experiments demonstrate the potential of the generative foundation model in assisting limited training caused by privacy constraints on obtaining human-centered signals. Moreover, we show that the generative foundation model can provide benefits to distributed learning algorithms, such as Federated Learning.
Chapter 1
Introduction
1.1 Human-centered Machine Learning Systems
In the last few years, we have witnessed numerous remarkable advances in machine learning (ML), particularly deep learning, in fields such as natural language processing [23], image classification [43], video understanding [22], healthcare analysis [68], and intelligent agents [87]. Modern deep learning models, such as foundation models, are typically optimized with massive datasets from the Internet through unsupervised or self-supervised learning objectives using large-scale computational infrastructures [9]. The majority of these models rely on transformer architectures combined with several MLP layers for various downstream applications. These pre-trained models provide substantial improvements in classification, retrieval, and segmentation performance over traditional modeling approaches, benefiting the understanding of human behaviors and conditions and thereby enhancing human experiences. This thesis focuses on studying human-centered ML that utilizes deep learning to understand
human states, traits, behaviors, and interactions.
Specifically, modern human-centered AI systems utilize advanced sensor technology and state-of-the-art ML algorithms to provide a comprehensive understanding of the human condition, including individual traits, behavior states, and interaction patterns, which provides possibilities to create technologies that increasingly support and augment human experiences. In general, human-centered AI systems require
sensing, processing, and understanding multi-modal data from people in diverse contexts, including environments such as schools, vehicles, restaurants, workplaces, hospitals, and homes. One example is our
recent large-scale experimental studies of > 400 hospital workers (TILES-2018 [71] and TILES-2019 [99])
using wearable sensors. In both these studies, we gathered multi-modal human-centric data (including physiological, environmental, proximity, and egocentric audio data) from clinical workers in a highly sensitive
hospital environment with minimal intrusion for several weeks (10 weeks for TILES-2018 and 3 weeks
for TILES-2019). Using advanced data mining techniques, e.g., motif discovery and time-series clustering,
our prior works [34, 28] showed promising results in identifying unique behavior patterns among hospital
workers from these human-centered signals collected in the wild. This large-scale human-centered study
presents unique opportunities for researchers to understand human behavioral patterns and their impact
on wellness.
Another example of the human-centered system involves speech-centric applications. Speech is a
natural way for humans to express themselves, communicate with others, and connect with society. Automatic speech understanding involves processing human speech to understand not only the content but
the intent, the topic, and the emotions conveyed through speech. More precisely, automatic speech understanding includes several fundamental speech modeling tasks: automatic speech recognition [66], speech emotion recognition [2], and automatic speaker recognition [47]. The popularity of automatic speech understanding has enabled smart assistants, such as Siri, Google Voice, and Alexa, which provide benefits that enhance human experiences and quality of life. Moreover, recent developments in speech-centric machine learning have led to various speech models, such as wav2vec 2.0 [7], Whisper [78], WavLM [16], and Conformer [40], that can understand human speech even in extremely noisy environments. These advancements in speech-centric modeling create unique opportunities and introduce novel research topics in human behavior understanding, human-computer interaction (HCI) [19], and social media analysis.
1.2 Privacy in Human-centered Machine Learning
Although modern ML systems have shown their capabilities and promise in a wide range of human-centered applications, a unique challenge related to privacy is present in most of these systems. The
success of deep learning models relies predominantly on acquiring data and information from people in
potentially sensitive environments and contexts, like homes, workplaces, hospitals, and schools. Collecting these human-centered signals frequently raises significant concerns about compromising user privacy.
It is widely known that data, such as physiological data, can directly or indirectly reveal sensitive information that people might want to keep private [63], including name, demographics, and health status.
Moreover, many modalities, including video and speech, carry sensitive information that can directly reveal personally identifiable information (PII), prompting legislation to safeguard them, including the recently introduced GDPR [94]. Therefore, it is critical for AI researchers and practitioners to include
privacy protection schemes in developing modern ML systems to prevent user data from unjustifiable and
unauthorized uses.
One major challenge AI researchers encounter is creating privacy-enhancing computing without compromising the ability to sense, interpret, and illuminate the rich diversity across people and contexts.
Conventionally, most ML practitioners evaluate ML models using performance-based metrics, such as F1,
accuracy, precision, and recall. At the same time, few systems measure the privacy risks involved in training and deploying these models. In the past few years, we have witnessed a significant amount of research on privacy-enhancing AI and ML, and the objective of this thesis is to provide a novel perspective on using generative foundation models to enhance the privacy of existing ML systems.
1.3 Generative Foundation Models For Privacy-Enhancing Computing
The recent progress in training multi-modal transformer models with large-scale web-based multi-modal data sources like LAION-5B [82], Conceptual Captions [84], and WIT [91] has opened up new frontiers in foundation models. These foundation models are typically optimized with pre-training objectives involving reconstruction, contrastive, or predictive tasks, creating models that can comprehend data from diverse backgrounds and contexts. Increasingly, the foundation model has become an emerging technique that empowers many evolving technologies, notably generative AI (e.g., ChatGPT and DALL-E 2 from OpenAI, https://openai.com/), enabling the creation of high-fidelity content across images, audio, and natural language based on user-input
requirements or prompts. Beyond their applications in content creation, advanced generative AI models
also demonstrate promise for privacy-enhancing computing. The high-quality content these models are
capable of generating can serve as effective training data, reducing privacy risks associated with collecting user data. For example, researchers studying text-to-image and large language models (LLMs) have
demonstrated that synthetic data yields remarkable zero-shot performance in tasks related to computer
vision and language understanding [95, 44].
1.4 Research Directions
In this dissertation, I propose to address the aforementioned privacy challenges in human-centered applications. Specifically, I present privacy-enhancing computations involving data perturbation for centralized
learning systems. Moreover, I study privacy-enhancing learning in decentralized systems using Federated
Learning. We identify that most existing privacy-enhancing computation approaches require complete and high-quality data, which becomes challenging under modern privacy regulations. This leads us to investigate the use of foundation models to assist privacy-enhancing computing in both centralized and federated learning.
1.4.1 Data Perturbation
Centralized learning requires collecting human-centered data in raw form. In this setting, these human-centered signals are typically sampled at the client device and then transferred to the service provider's server for post-processing. If service providers are untrusted, they may infer the aforementioned private information from the human-centered data. We propose to study data perturbation techniques for obfuscating
the sensitive attribute from the human-centered signals while minimizing the utility loss from the original
application.
1.4.2 Federated Learning
As an alternative to the conventional methods of training machine learning on a single server, federated
learning (FL) employs a server-client model such that the data never leaves the client. Instead, the objective
of the server is to aggregate all the model updates from the clients. Unlike centralized training, the clients
train the model on a local dataset and transmit the updated parameters instead of the raw data. In this
thesis, we explore the use of Federated Learning in diverse human-centered machine-learning applications.
1.4.3 Foundation Model Assisted Privacy-enhancing ML
Even though many efforts have been made in privacy-enhancing computing, we argue that the assumption that complete data is available in conventional privacy-enhancing computation research is impractical
due to the limitations in sensing, storing, and using data. For example, the current data privacy legislation restricts AI practitioners and researchers from accessing and using many data sources for training
their machine learning systems, leading to low-resource training problems. We identify that most existing privacy-enhancing algorithms cannot achieve reasonable utility with low-resource training. As mentioned earlier, foundation models have enabled applications capable of understanding human-centered signals, leading to advanced text reasoning and high-fidelity digital content. Therefore,
we reformulate the privacy-enhancing problem into a low-resource training problem, and we investigate the capabilities of the foundation model in improving centralized learning under low-resource conditions.
On the other hand, federated learning encounters data heterogeneity and computation challenges, resulting in substantial performance degradation compared to centralized learning. With the development
of foundation models, we aim to explore the ability of these models to assist federated learning with constrained computation and extreme data distributions. I want to highlight that this thesis is based on my
previous works in [32, 102, 103, 30, 33, 29, 35, 31].
1.5 Roadmaps
The remaining parts of this dissertation are structured as follows:
• Chapter 2 describes the background information about conventional privacy attacks and commonly used
mitigation approaches.
• Chapter 3 presents the use of data perturbation for privacy-enhancing human-centered machine learning applications. We explore the disadvantages of using perturbation-based approaches in protecting
privacy.
• Chapter 4 explores federated learning on various human-centered ML tasks. Similarly, we study the
challenges in federated learning.
• Chapters 5 and 6 aim to reconceptualize privacy-enhancing computation from the system perspective, where data availability is constrained by device failures and data privacy legislation. This leads to substantial utility degradation with conventional privacy-enhancing computational approaches. Specifically, in Chapter 5, we focus on studying how the generative foundation model can support privacy-enhancing computing in unimodal learning, while Chapter 6 investigates its application in multimodal
learning.
• Chapter 7 aims to extend the findings from foundation model-assisted privacy-enhancing ML in centralized learning to federated learning. We demonstrate that leveraging foundation models for data generation can effectively enhance the performance and convergence speed of standard federated
learning.
• Chapter 8 concludes the contributions of this thesis and describes the ongoing and proposed future work.
Chapter 2
Privacy-enhancing Computing Background
In this chapter, we review conventional approaches to measuring and mitigating privacy risks. This chapter is based on our prior review article in [32].
2.1 Privacy Threats
In order to assess privacy risks associated with training or deploying an ML system, researchers commonly
create relevant privacy attacks that the system might encounter in real-life scenarios. In the context of
privacy attacks, adversaries aim to obtain unintended information, including sensitive attributes, demographics, and biometric fingerprints. In this review, we classify privacy risks into two primary categories: Property Inference Attacks (PIA) and Membership Inference Attacks (MIA). Both attacks can be applied to reveal private information in human-centered applications. A complete review of other privacy attacks can be found in [32].
Property Inference Attacks (PIA): PIA is a commonly studied attack in which the adversary aims to infer sensitive attributes that are unrelated to the primary task. Notable sensitive attributes include demographic information such as age, gender, and race, as well as appearance attributes like hair color, height, and weight. For example, prior studies in this research thread involved obfuscating gender information from speech [31] and visual data [69]; [69] also presents privacy attacks regarding appearance detection from visual data.

Figure 2.1: The figure presents the general membership inference attack. In this attack, the privacy attack starts with collecting a set of shadow training datasets that are used to train the shadow models. Following the shadow training, the attacker trains an MIA classifier through the model outputs from the target and shadow models. At the attacking stage, the attacker infers the membership property given an input posterior from the target model with the trained MIA classifier. This figure is from our prior review article in [32].
Membership Inference Attacks (MIA): Membership Inference Attacks (MIAs) [86], as demonstrated in Figure 2.1, are among the most popular privacy attacks in the ML community, with the aim of identifying whether a particular data instance was used in training the designated model. This attack rests on the assumption that the attacker can access the posterior output of a data sample by querying the target model. In the context of MIAs, the attacker typically initiates a training process known as shadow training, which mimics the target training. To perform shadow training, the attacker first needs to compile a set of shadow training datasets. Typically, these shadow datasets share a similar data distribution or format with the target training data. Subsequently, the attacker trains a classifier to perform MIAs using a collection of posterior probabilities derived from both training and shadow data. During the attack stage, the attacker obtains the posterior output to infer whether the queried data was part of the training data.
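To make the shadow-training pipeline described above more concrete, the sketch below outlines the attack in Python; the helper train_model, the shadow dataset objects, and the choice of a logistic-regression attack classifier are illustrative assumptions rather than the exact setup in [86] or our review [32].

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def build_mia_classifier(shadow_datasets, train_model):
    """Shadow training: fit an attack classifier that maps posterior
    outputs to a member / non-member label (illustrative sketch)."""
    features, labels = [], []
    for shadow_in, shadow_out in shadow_datasets:
        # 1. Train a shadow model that mimics the target model.
        shadow_model = train_model(shadow_in)
        # 2. Posteriors on the shadow model's own training data -> members.
        features.append(shadow_model.predict_proba(shadow_in.x))
        labels.append(np.ones(len(shadow_in.x)))
        # 3. Posteriors on held-out shadow data -> non-members.
        features.append(shadow_model.predict_proba(shadow_out.x))
        labels.append(np.zeros(len(shadow_out.x)))
    attack_clf = LogisticRegression(max_iter=1000)
    attack_clf.fit(np.vstack(features), np.concatenate(labels))
    return attack_clf

def infer_membership(attack_clf, target_model, query_x):
    # Attack stage: query the target model and classify its posterior output.
    posterior = target_model.predict_proba(query_x)
    return attack_clf.predict(posterior)  # 1 = likely part of the training data
```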
2.2 Privacy Mitigation
In this section, we present two popular approaches to privacy mitigation: differential privacy (DP)
and adversarial training. These two approaches have been studied in our prior explorations.
Differential Privacy (DP): DP essentially perturbs the input data in a way that limits the attacker's ability to infer the real distribution of the input data [26]. Specifically, DP adds noise perturbation so that the attacker cannot distinguish whether any particular data point is present in the original dataset. The typical way to implement differential privacy is through Gaussian noise addition. More concretely, given the privacy parameters ϵ and δ, we can define DP as follows:
Definition 2.2.1 ((ϵ, δ)-DP). A random mechanism M satisfies (ϵ, δ)-DP, where ϵ > 0 and δ ∈ [0, 1), if and only if for any two adjacent datasets D and D′ in universe X, we have:

Pr(M(D)) ≤ e^ϵ Pr(M(D′)) + δ (2.1)
The parameter ϵ > 0 in the above equation defines the privacy budget in DP, with a lower ϵ corresponding to larger perturbation and stronger privacy protection [27]. For example, as ϵ approaches 0, we are essentially adding an infinite amount of noise, which prevents privacy attacks but also destroys the utility of the data.
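As a minimal illustration of the Gaussian noise addition mentioned above, the sketch below privatizes a vector-valued statistic; the calibration σ = sqrt(2 ln(1.25/δ)) · Δ2/ϵ is the standard analysis for the Gaussian mechanism (valid for ϵ < 1) and is used here as an assumption about the desired noise scale, not a result derived in this chapter.

```python
import numpy as np

def gaussian_mechanism(value, l2_sensitivity, epsilon, delta, rng=None):
    """Release `value` with (epsilon, delta)-DP via the Gaussian mechanism.

    Uses the standard calibration sigma = sqrt(2 * ln(1.25 / delta)) *
    l2_sensitivity / epsilon (an assumption; valid for epsilon in (0, 1)).
    """
    rng = rng if rng is not None else np.random.default_rng()
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) * l2_sensitivity / epsilon
    noise = rng.normal(loc=0.0, scale=sigma, size=np.shape(value))
    return np.asarray(value) + noise

# Example: a lower epsilon yields a larger sigma, i.e., stronger perturbation.
stat = np.array([0.42, 0.57, 0.61])
private_stat = gaussian_mechanism(stat, l2_sensitivity=0.01,
                                  epsilon=0.5, delta=1e-5)
```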
Adversarial Training: The concept of adversarial training is based on mutual information. Typically, to protect a sensitive attribute z, given the perturbation function h(·), adversarial training trains an adversary to maximize the mutual information I(h(x); z). At the same time, we aim to learn the perturbation h(·) so that the adversary cannot identify the attribute from the perturbed data. To preserve the utility of the data, it is common to combine the target training objective with the adversarial training objective. More concretely, adversarial training aims to optimize the following objectives:
min_ψ max_ϕ L(advψ(hϕ(x)), z) (2.2)
Our chapter on data perturbation (Chapter 3) uses this concept of adversarial training.
2.3 Machine Learning with Synthetic Data
Training with synthetic data has recently become a popular research topic in machine learning. The motivation for studying the use of synthetic data in this thesis is to minimize the collection of human-related data. The emergence of this research topic relies heavily on the success of generative foundation models. These models are capable of understanding complex human-centered signals, bringing substantial advances in generating high-fidelity content in the form of text, images, and audio. Prior work in [44] has demonstrated the massive potential of using such synthetic content as a training resource, providing remarkable zero-shot performance that can even exceed that obtained with curated real-life datasets. In this thesis, we present our recent explorations of this topic based on [29, 103, 35].
Chapter 3
Data Perturbation For Privacy-Enhancing Human-centered ML
3.1 Introduction
As we stated in previous chapters, attribute inference attacks and property inference attacks pose unique risks of leaking private attributes about a person, such as demographics (e.g., age and gender) [38, 70]. These undesired or unauthorized uses of data may occur when the service provider is not trustworthy or when an intruder attacks the cloud system [24, 53, 92]. In this chapter, we present a study of data perturbation approaches to obfuscate sensitive information, using the speech emotion recognition (SER) task as an example human-centered application. This chapter is based on our prior work in [31].
3.2 Human-centered Application - Speech Emotion Recognition
Speech understanding is an important human-centered application that can enhance user experiences in
applications such as virtual assistants. Emotion understanding from speech is one major research direction in speech understanding that identifies the emotional states expressed through vocalizations. These
systems find widespread application in various domains, including smart virtual assistants [56], medical
diagnoses [79, 10], and education [60]. While understanding emotions from speech provides advantages in
delivering appropriate assistance, it is important to acknowledge that the speech signal also contains extensive information about individual demographics. This thesis explores strategies for preserving privacy
in speech while minimizing any significant compromise to SER performance.
3.3 Speech Emotion Recognition Datasets
IEMOCAP The IEMOCAP database [11] was collected using multi-modal sensors that capture motion,
audio, and video of acted human interactions. The corpus contains 10,039 utterances from 10 subjects targeting the expression of categorical emotions. In addition, the utterances are divided into improvised and scripted conditions based on whether the utterance is from a fixed script. We choose to remove data from the scripted conditions as suggested in previous work [104].
CREMA-D The CREMA-D [14] corpus is a multi-modal database of emotional speech collected from 91
actors, 48 of whom are male and 43 are female. The set contains 7,442 speech recordings that simulate
emotional expressions including happy, sad, anger, fear, and neutral.
MSP-Improv The MSP-Improv [12] corpus was created to study naturalistic emotions captured from
improvised scenarios. The corpus includes audio and visual data of utterances spoken in a natural condition
(2,785 utterances), target condition (652 target utterances in the improvised scenario), improvised condition
(4,381 utterances from the remainder of the improvised scenario), and read condition (620 utterances). We
decided to use the data only from the improvised scenarios.
3.4 Noise Perturbation For Privacy-enhancing Computing
3.4.1 Problem Definition
ML tasks: In this chapter, we formulate our experiments with a labeled speech emotion dataset D including speech samples x1, ..., xn, where xi ∈ R^m, and emotion labels y1, ..., yn. Moreover, we specifically define our secondary task (gender prediction) labels as z1, ..., zn. Without loss of generality, we define fθ as the emotion classifier with parameters θ that predicts the emotion labels yi. Similarly, we denote fg as the adversary classifier with parameters g that predicts the gender labels zi.
Conductive/Non-conductive Features: The findings from a previous study [69] have shown that the
features needed for a specific classification task may differ, with some features playing a critical role in the
primary prediction task, while others can potentially disclose private information. Building on the insights
and setup from this study [69], we similarly define conductive and non-conductive features from speech
data as c ∈ x and u ∈ x, respectively. Here, the conductive features are considered the more relevant features for emotion classification, while the non-conductive feature set represents features of lesser importance for emotion prediction.
Learning Objective: The primary objective of our work is to train a noise perturbation function ϕ to transform the speech data xi into ϕ(xi), aiming to prevent the adversary classifier fg from predicting the gender label zi. Most importantly, we want to ensure that conductive features are kept to preserve the capability of emotion recognition.
3.4.2 Relevant Feature Selection - Cloak
In this work, we follow a recently proposed framework called Cloak [69] as the foundation to achieve our learning objective. Specifically, Cloak aims to identify the conductive features c and non-conductive features u in x with respect to the primary prediction task using mutual information. Subsequently, Cloak learns a function ϕ to add noise to the non-conductive features u as shown below:
ϕ(x) = x + r, where r ∼ N (µ, Σ) (3.1)
Figure 3.1: Training procedures of Cloak which include perturbation training and suppression-value training. The picture is from our work in [31].
Then, we can denote the information leakage as the mutual information I between the raw data x and the noise perturbation ϕ(x) [21, 69, 41, 42, 20]. One objective of Cloak is then to maximize the mutual information I(ϕ(x); c) between ϕ(x) and c. In contrast, Cloak also aims to minimize the mutual information I(ϕ(x); u) between ϕ(x) and u. We want to highlight that the mean µ and the parameters σ of the diagonal covariance matrix Σ in ϕ(x) are two trainable tensors in the training process.
How to maximize I(ϕ(x); c)? Maximizing I(ϕ(x); c) essentially amounts to maximizing a lower bound on this mutual information. According to the proof in [69], this is equivalent to minimizing the empirical
cross-entropy loss over the whole training data. Therefore, we can rewrite this learning objective as the
following cross-entropy loss function L:
min_{σ,µ} L(fθ(ϕ(x)), y) (3.2)
How to minimize I(ϕ(x); u)? This optimization problem is the same as minimizing an upper bound on the term I(ϕ(x); u). Since u is a subset of x, the following holds:
I(ϕ(x); u) ≤ I(ϕ(x); x) = H(ϕ(x)) − H(ϕ(x)|x) (3.3)
Minimizing the upper bound in the above equation has been proved to be equivalent to minimizing the following loss function Lc [69]. We want to highlight that the lemma and accompanying proof for the upper bound of I(ϕ(x); u) can be found in [69]. Specifically, Lc is defined as below:
Lc = − log( (1/m) Σ_{j=0}^{m} σ_j^2 ) (3.4)
Combined Optimization Objective: Finally, we combine the losses Lc and L, with λ in Equation 3.5 indicating a hyper-parameter that sets the focus on minimizing the first term, i.e., I(ϕ(x); u). The combined loss is defined as below:
min_{σ,µ} λLc + L(fθ(ϕ(x)), y) (3.5)
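To make Equations 3.2, 3.4, and 3.5 concrete, the following sketch computes the combined perturbation-training loss in PyTorch, using the reparameterization introduced later in Section 3.4.4; the tensor shapes, variable names, and the frozen classifier fθ are illustrative assumptions rather than the exact implementation in [69] or [31].

```python
import torch
import torch.nn.functional as F

def cloak_perturbation_loss(f_theta, x, y, mu, rho, lam):
    """Combined Cloak loss (Eq. 3.5): lambda * Lc + cross-entropy utility term.

    mu, rho: trainable tensors with the same feature shape as x.
    f_theta: the (frozen) primary-task classifier.
    """
    sigma = (1.0 + torch.tanh(rho)) / 2.0        # Eq. 3.9, keeps sigma in [0, 1]
    eps = torch.randn_like(x)
    phi_x = x + sigma * eps + mu                 # reparameterized phi(x) = x + r
    l_c = -torch.log(torch.mean(sigma ** 2))     # Eq. 3.4: encourages larger noise
    ce = F.cross_entropy(f_theta(phi_x), y)      # Eq. 3.2: preserves primary utility
    return lam * l_c + ce

# Only mu and rho receive gradients during perturbation training; f_theta stays frozen.
```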
3.4.3 Combining selective noise perturbation with Adversarial Training
We want to highlight that in Cloak, the secondary task does not play a role in defining the objective function. To further decrease the accuracy of the secondary classifier, we propose to minimize the mutual information I(ϕ(x); z) between the noise representation ϕ(x) and the secondary task label z. Since it is infeasible
to estimate the mutual information between two arbitrary distributions, this problem is typically turned
into the following adversarial training objectives as suggested in [89]:
min_ψ L(advψ(ϕ(x)), z); max_{σ,µ} L(advψ(ϕ(x)), z) (3.6)
Here advψ represents an adversary that infers the secondary task. The objective of training ϕ(x) is to create a noise representation that is minimally informative of the secondary task. In practice, this optimization problem is usually implemented using the gradient-reversal layer (GRL) [36]. Here, the gradient reversal layer gα is inserted between ϕ(x) and advψ. During the forward pass, the GRL acts as the identity function, while it scales the gradients passed through it by −α in the back-propagation stage. Consequently, we aim to minimize the loss function below:
min_{ψ,σ,µ} L(advψ(gα(ϕ(x))), z) (3.7)
In the end, we aim to combine the Cloak framework with adversarial training to select the most important features for the primary task, while minimizing the attribute inference for the secondary task and
potentially other tasks. We follow the training procedure that performs the perturbation training and
suppression-value training in sequence. The model architecture is shown in Figure 3.2. Formally, we aim
to minimize the following loss function:
min_{ψ,σ,µ} λLc + L(fθ(ϕ(x)), y) + L(advψ(gα(ϕ(x))), z) (3.8)

Figure 3.2: Architecture of the Cloak + Adversarial training setup. The picture is from our work in [31].
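A minimal PyTorch sketch of the gradient reversal layer gα used in Equations 3.7 and 3.8 is shown below; implementing the GRL as a custom autograd function is a common approach, and the usage comments are assumptions about how it would plug into the combined objective rather than the exact code from [31].

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; scales gradients by -alpha in the backward pass."""

    @staticmethod
    def forward(ctx, x, alpha):
        ctx.alpha = alpha
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back into phi(x).
        return -ctx.alpha * grad_output, None

def grl(x, alpha=1.0):
    return GradReverse.apply(x, alpha)

# Sketch of the combined objective (Eq. 3.8) inside one training step:
#   logits_y = f_theta(phi_x)                   # primary (emotion) task
#   logits_z = adversary(grl(phi_x, alpha))     # secondary (gender) task through GRL
#   loss = lam * l_c + cross_entropy(logits_y, y) + cross_entropy(logits_z, z)
```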
3.4.4 Training Procedures
It is easy to observe that the optimization objective in Equation 3.8 is not trivial to implement in practice, as there is no direct way to back-propagate through ϕ(x) with respect to σ and µ. To address this, we define the noise r via the reparameterization r = σ ⊙ e + µ, where e ∼ N(0, 1). As the variance cannot be negative, we further constrain σ as follows:
σ = (1 + tanh(ρ)) / 2 (3.9)
However, the training process above only aims to find the noise mapping for ϕ(x); in practice, we are still unsure about the amount of noise to add to the non-conductive features. To address this, we perform suppression-value training to learn the representations that replace the non-conductive features. Specifically, we can identify the non-conductive features through the noise function ϕ(x). For example, we argue that features with lower σj are more important to the primary task, as a small perturbation of these features could decrease the primary task performance. Therefore, we can define a threshold T to differentiate the conductive and non-conductive features. Finally, we define a masked noise learning objective as:
min_{µs} L(fθ(hµs(x)), y) (3.10)
From the above objective, we aim to learn the replacement values µs, where ϕ(x) is redefined as ϕ(x) = (x + r) ⊙ m + µs. Here, the mask is defined as mj = 0 when σj > T, and mj = 1 otherwise. Therefore, we perform this optimization to learn the values that replace the non-conductive features. We want to highlight that we freeze the parameters of fθ during training and only take the gradients with respect to σ and µ at the input.
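The sketch below ties together the reparameterization in Equation 3.9, the mask m, and the imputation ϕ(x) = (x + r) ⊙ m + µs; choosing the threshold T as a percentile of the learned σ follows the suppression-ratio setup described later in Section 3.5. The variable names are illustrative, and applying µs only at the masked positions is one reading of the formula (equivalently, assuming µs is zero at conductive positions).

```python
import torch

def suppressed_phi(x, mu, rho, mu_s, suppression_ratio):
    """Noise the conductive features and replace the non-conductive ones.

    suppression_ratio (SR) is the fraction of features treated as
    non-conductive; a lower SR corresponds to a higher threshold T.
    """
    sigma = (1.0 + torch.tanh(rho)) / 2.0               # Eq. 3.9, sigma in [0, 1]
    r = sigma * torch.randn_like(x) + mu                 # r = sigma ⊙ e + mu
    t = torch.quantile(sigma, 1.0 - suppression_ratio)   # threshold T
    m = (sigma <= t).float()                             # m_j = 0 when sigma_j > T
    return (x + r) * m + mu_s * (1.0 - m)                # phi(x) = (x + r) ⊙ m + mu_s
```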
3.4.5 Primary task model and secondary task model
In this work, we use a convolutional and RNN architecture for the modeling experiments. We first preprocess the input speech data into a mel-spectrogram. We then feed this time-frequency feature to a set of convolutional layers with batch normalization and ReLU activation functions in between. We subsequently send the learned representations from the convolutional layers to the RNN layers. Finally, we flatten the RNN output and perform the classification using MLPs. In this experiment, we use the datasets Dp and Dadv to train the primary task model fθ and the secondary task model fg, respectively. The complete model architecture is shown in Figure 3.3.

Figure 3.3: Architecture of the primary task model in predicting emotion and secondary task model in predicting gender. The picture is from our work in [31].

Table 3.1: Prediction results for both the primary task model and the secondary task model using original input x ∈ De. We report results for whole De and also each subcorpus in De.

Dataset       Emotion Model (fθ)        Gender Model (fg)
              Acc       UAR             Acc       UAR
Combined      57.7%     58.9%           97.2%     96.8%
IEMOCAP       57.6%     59.7%           96.3%     96.6%
CREMA-D       69.8%     66.9%           93.5%     93.8%
MSP-Improv    45.8%     44.9%           99.6%     99.7%
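A minimal PyTorch sketch of the convolutional-plus-RNN classifier described above is shown below; the layer sizes, the choice of a GRU, the use of the final hidden state, and the mel-spectrogram dimensions are assumptions for illustration and do not reproduce the tuned architecture in [31].

```python
import torch
import torch.nn as nn

class SpeechClassifier(nn.Module):
    """CNN + RNN classifier over mel-spectrogram input (illustrative sizes)."""

    def __init__(self, n_mels=80, n_classes=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.rnn = nn.GRU(input_size=32 * (n_mels // 4), hidden_size=128,
                          batch_first=True)
        self.head = nn.Sequential(nn.Linear(128, 64), nn.ReLU(),
                                  nn.Linear(64, n_classes))

    def forward(self, mel):                    # mel: (batch, 1, time, n_mels)
        z = self.conv(mel)                     # (batch, 32, time/4, n_mels/4)
        z = z.permute(0, 2, 1, 3).flatten(2)   # (batch, time/4, 32 * n_mels/4)
        _, h = self.rnn(z)                     # final hidden state: (1, batch, 128)
        return self.head(h[-1])                # class logits
```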
We first present the baseline performance of the primary task and the secondary task. Here, the primary task is speech emotion prediction, and the secondary task is gender prediction from speech. We perform the training using a learning rate of 10−4 and a dropout rate of 0.2. More details about the training setup are listed in [31]. We aggregate all the training samples from the IEMOCAP, CREMA-D, and MSP-Improv datasets and report the aggregate performance.
The metrics used to evaluate the prediction performance are accuracy and unweighted average recall
(UAR). Specifically, the baseline performance of emotion recognition and gender prediction is demonstrated in Table 3.1. The results indicate a decent emotion prediction performance with a UAR of 58.9%.
On the other hand, we identify that it is extremely easy to perform gender prediction from speech, with an aggregate gender prediction accuracy of 97.2%. This finding highlights the privacy risks associated with demographics, such as gender, in speech understanding, urging researchers to develop effective privacy-enhancing ML approaches for speech applications.
Figure 3.4: Prediction results for both primary and secondary task models using noise perturbation ϕ(x). The UAR represents the UAR performance on the combined dataset. (Left panel: Emotion Prediction; right panel: Gender Prediction; both panels plot UAR against the feature noise ratio in %.)
3.5 Results on Feature Perturbation
Experimental Details: We perform two training experiments, with and without adversarial training. As discussed earlier, our training in both scenarios involves learning the noise mapping of µ and σ. Specifically, we set ρ = −10 in σ = (1 + tanh(ρ))/2 to control the magnitude of the noise, and we use µ = 0 to initialize the noise function ϕ(x). After obtaining the noise mapping, we performed the second stage of training, which is suppression-value training. As it is typically challenging to choose the optimal threshold value T that maximizes the primary prediction performance while minimizing the secondary prediction performance, we empirically study threshold values T set at the 20%, 40%, 60%, and 80% percentiles of the learned σ. We define these percentile values as the suppression ratio (SR); a lower SR (a higher threshold value T) means fewer features are replaced in suppression-value training. For the adversarial training variant, we add the adversarial training objective to the first stage of training to select non-conductive features that are more representative of gender.
We plot the comparison between emotion and gender prediction using Cloak-based adversarial training
in Figure 3.4. We want to highlight that an SR = 0% indicates that no noise is added to the data samples, as there are no suppression values replacing the non-conductive features. As we increase the feature noise ratio, we observe only a small emotion prediction performance drop, from 58.9% to 57.7%, even with SR = 60%. This indicates that our selected features are still representative of the primary prediction task. Meanwhile, we find that the combined noise training can substantially harm the gender prediction performance, decreasing the gender prediction accuracy from over 95% to 65.8%. However, we observe that both emotion and gender prediction suffer substantial performance drops when the feature noise ratio increases to 80%.
Table 3.2: Prediction results for both the primary task model and the secondary task model using the original input x ∈ De with 5% of the data. We report results for the whole De and also for each subcorpus in De.

Dataset      Emotion Prediction (fθ)   Gender Prediction (fg)
IEMOCAP      42.3%                     81.3%
CREMA-D      54.1%                     82.1%
MSP-Improv   38.1%                     80.2%
3.6 Challenges of Data Perturbation in Privacy-enhancing Computing
Although positive findings are reported with the proposed data perturbation framework, this problem
formulation assumes that the complete data is available to AI practitioners and researchers. Nevertheless,
in real-world settings, acquiring complete data for training human-centered ML systems is increasingly
challenging as the evolving legislation and laws emphasize the importance and urgency of protecting
data privacy. This results in limited data available for training ML systems. Here, we experiment with
only 5% of data available, as demonstrated in Table 3.2. The results indicate that the utility of humancentered machine learning systems can suffer significantly under low-resource setting scenarios, while
the possibility of privacy inference remains strong. This suggests that the data perturbation approach
would not be effective in such settings, given the already limited baseline utility of the model.
3.7 Conclusion
In this chapter, we presented our experimental results using data perturbation to enhance privacy in speech
emotion recognition systems. Our perturbation method combines relevant feature selection and adversarial training to learn a noise injection function ϕ(x) that balances the utility of SER and the inference privacy of
sensitive demographic information. Our results show that by adding the noise with the relevant feature selection and adversarial training, our perturbation approach can effectively prevent demographic attributes,
such as gender, from being inferred. Meanwhile, the results show that the noise injected into the original input
data does not substantially decrease the emotion recognition performance. However, we recognize that the data perturbation approach may prove impractical in low-resource training scenarios, which can arise from evolving
privacy laws and legislation.
Chapter 4
Federated Learning For Privacy-enhancing Human-centered ML
4.1 Introduction
What is Federated Learning? ML practitioners have developed Federated Learning (FL) as an alternative paradigm to centralized training for building models without the need to transfer user data from
edge devices [52]. In FL, models are trained locally using locally stored data, and only the updated
parameters, rather than the raw data, are transmitted to the server. FL thus allows clients to train a model
collaboratively without sharing their local data, making it one of the most prominent emerging privacy-enhancing
learning paradigms in ML research [48].
Development of Federated Learning: Previous works in FL have primarily focused on designing robust
and efficient algorithms for federated model training. FedAvg [67] was the earliest FL optimization algorithm for training models in a distributed manner. In FedAvg, each client executes local model updates
before submitting the updates to the server. Even though FedAvg offers possibilities for deploying FL in the
wild, it often encounters slow convergence as a consequence of gradient drifting from data heterogeneity.
As such, researchers have proposed algorithms such as the Stochastic Controlled Averaging algorithm (SCAFFOLD) [49] and FedProx [59] to minimize the impact of gradient drift on heterogeneous data. For example,
SCAFFOLD accelerates the training speed through control variates, preventing the client gradients from
drifting away from the global optima. Similarly, [80] introduced adaptive optimization algorithms, FedOpt,
that allow server-side optimization through momentum. This chapter is based on our work in [30].

Table 4.1: Overview of the 4 datasets included in the FL benchmark.

Task  Dataset     Partition   Client Num.  Modalities    Metric  Total Instances
ER    MELD        Natural     86           Audio, Text   UAR     9,718
MAR   UCF101      Synthetic   100          Audio, Video  Acc     6,837
HAR   UCI-HAR     Synthetic   105          Acc, Gyro     F1      8,979
SM    CrisisMMD   Synthetic   100          Image, Text   F1      18.1K
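As a brief illustration of the FedAvg-style aggregation described above, the sketch below averages client model parameters weighted by local data size; client selection, local training, and the SCAFFOLD/FedProx/FedOpt variants are omitted, and the helper name is our own.

import copy

def fedavg_aggregate(client_states, client_sizes):
    # Weighted average of client model parameters, weighting each client by its
    # local sample count; this is the core server step of FedAvg.
    total = float(sum(client_sizes))
    global_state = copy.deepcopy(client_states[0])
    for key in global_state:
        global_state[key] = sum(
            state[key] * (size / total) for state, size in zip(client_states, client_sizes)
        )
    return global_state

# Usage sketch: after each communication round of local training,
# global_model.load_state_dict(fedavg_aggregate(states, sizes))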
4.2 Benchmarking FL with Multi-modal Applications
In this chapter, we present our FL benchmark work, FedMultimodal, which provides comprehensive experiments with multi-modal human-centered applications. Our motivation to benchmark FL with multimodal datasets is that many real-world human-centered applications are associated with multimodal data
streams. Specifically, we present a subset of the experiments from our study in [30].
4.2.1 Human-centered Datasets and Applications
We select 4 representative datasets related to human-centered applications in this thesis, while the complete benchmark results are presented in [30]. The selected multimodal datasets cover four diverse humancentered applications, including Emotion Recognition, Multimedia Action Recognition, Human Activity
Recognition, and Social Media classification. Below is a brief description of the dataset in each task.
Emotion recognition (ER), such as speech emotion recognition described in the previous chapter, is
widely applied in modern human-computer interfaces, such as virtual assistants. Beyond these novel human-centered applications, emotion recognition can also assist in understanding human
behaviors. Emotion recognition involves using sensitive human information such as speech and images,
leading to increased concerns over privacy. Therefore, investigating emotion recognition in the context
of Federated learning can facilitate future studies in developing privacy-enhancing emotion recognition
systems. In this chapter, we include MELD [76] as an experimental dataset in our FedMultimodal benchmark. The MELD dataset is collected to study multiparty interactions with multi-modal data from the
Friends TV series. We choose data with emotions from neutral, happy (joy), sad, and angry, leading to
9,718 data samples in our experiments. In this benchmark, we perform emotion recognition using speech
and text modality.
Multimodal Action Recognition (MAR) involves using multimodal data to classify performed human actions. In this chapter, we utilize the UCF101 dataset in the following experiments. There are 6,837
videos with both visual and audio information, and 51 unique labels are associated with this audio-visual subset.
Human Activity Recognition (HAR) typically uses wearable data streams, such as accelerometers and
gyroscopes, to classify human postures and activities. This chapter implements the FL benchmark over
UCI-HAR dataset [4]. This data set is one of the most studied datasets in evaluating model performance
in human activity recognition. The complete datasets consist of data from 30 subjects (19-48 years old).
The participants were instructed to wear mobile sensors to perform six daily activities: walking, walking
upstairs, walking downstairs, sitting, standing, and lying.
Social media (SM) contains large-scale multimodal data in diverse topics. Notably, social media have
become an efficient tool and platform in emergencies, making it convenient for people to track important
updates and information during situations involving natural disasters. In this chapter, our benchmark experiments explore disaster information classification, including one social-media-based multimodal dataset
called CrisisMMD [3]. This dataset contains tweet information regarding seven prominent natural disasters, such as wildfires and earthquakes. Our FL task is to predict the impact (e.g., property damage, injury,
and death) of the disasters from image and text data from tweets.
Figure 4.1: The figure demonstrates the overall architecture of our proposed multimodal federated learning
framework. The figure is from our work in [30].
4.2.2 End-to-end Multi-modal Federated Learning Framework
A typical FL benchmark involves non-IID data partitioning, model design, and FL optimizer. In addition
to these existing components widely presented in previous FL benchmarks like LEAF [13], we introduce
fusion schemes and real-world noise factor emulators that are particularly important to benchmark FL
performance for multimodal learning. Real-world noise factor emulators offer unique opportunities to
facilitate FL research in multimodal learning that encounters missing modalities, missing labels, or noisy
labels. We present the overall architecture of our proposed FedMultimodal benchmark in Figure 4.1∗.
Model Design: In contrast to centralized learning, FL occurs frequently in mobile and edge devices with
limited computation and data storage capabilities. Therefore, we chose lightweight models whose total parameter counts are in the range of a few million. Specifically, we construct
ML models mainly based on the 1D Conv-RNN/MLP architecture, as demonstrated in Figure 4.2. For example, the multimodal model in our benchmark consists of a feature encoder, a modality fusion, and a
∗ Figure 4.1 uses image sources from https://openmoji.org/
Figure 4.2: The architecture of the basic model.
downstream MLP classifier. Our audio encoders involve Conv+RNN architecture, while video and text
encoders follow the RNN-only architecture. Moreover, we apply the Conv+RNN architecture to model
accelerometer and gyroscope datastreams. The modality-specific representation from feature encoders is
then fused to the multimodal representation, which is subsequently fed through the MLP classifier.
Fusion Schemes: Although our complete FedMultimodal benchmark in [30] includes concatenation-based
fusion and attention-based fusion, we identify that attention-based fusion consistently yields better performances in most presented multimodal datasets. Therefore, this chapter focuses on presenting results using
attention-based fusion. Specifically, the attention-based fusion concatenates the temporal output from each
modality, following an attention mechanism similar to hierarchical attention [98]. Given the concatenated
multimodal data h, the attention output is calculated as follows:
u = tanh(W h + b),    a = softmax(u^T c),    v = Σ_i a_i h_i
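For illustration, a minimal PyTorch sketch of this attention-based fusion is shown below, assuming c is a learnable context vector as in hierarchical attention [98]; the exact dimensions and any masking of padded time steps in FedMultimodal may differ.

import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    # Implements u = tanh(W h + b), a = softmax(u^T c), v = sum_i a_i h_i
    def __init__(self, hidden_dim):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, hidden_dim)         # W and b
        self.context = nn.Parameter(torch.randn(hidden_dim))  # c (assumed learnable)

    def forward(self, h):
        # h: (batch, time, hidden_dim), temporal outputs concatenated across modalities
        u = torch.tanh(self.proj(h))                 # (batch, time, hidden_dim)
        a = torch.softmax(u @ self.context, dim=-1)  # attention weights over time steps
        v = torch.sum(a.unsqueeze(-1) * h, dim=1)    # (batch, hidden_dim)
        return v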
4.2.3 Experimental Details
Setup. In our benchmark experiments, we set the number of convolutional layers in the feature encoder as
three. Moreover, we use a fixed filter kernel size of 5×5 and perform the search for the optimal number of
filters in {16, 32, 64}. On the other hand, we experiment with a hidden layer size of 128 in RNN encoders. We
chose the number of attention heads as 6 in all experiments and used the ReLU as the activation function.
The dropout rate is set to 0.2 in all training experiments. As FL operates in mobile devices with limited
computations, we set the batch size in training as 16 and the local training epoch as 1. Finally, we perform
the training for 200 epochs in all experiments. We perform the FL experiments using FedAvg, FedProx,
and FedOpt algorithms.
Evaluation Metrics. Table 4.1 shows the details regarding our evaluation metrics. Our multimodal FL
benchmark assesses FL performances using commonly used evaluation metrics for each dataset. Specifically, we evaluate FL performance on Crisis-MMD and UCI-HAR datasets using the F1 metric, the MELD
dataset using unweighted average recall (UAR), and the UCF101 dataset using accuracy. Additionally,
we conduct the FL experiments on the MELD and Crisis-MMD using a pre-defined partition for training/validation/testing. These experiments are repeated five times with different seeds. For UCF101, we
perform three-fold experiments using its pre-defined data splits. Finally, we conduct 5-fold cross-validation
on the UCI-HAR dataset.
4.3 FL Benchmark Performance on Human-centered Applications
The benchmark comparisons between different FL optimizers are presented in Table 4.2. As mentioned,
we exclusively present benchmark performance with the attention-based fusion mechanism, as it outperforms concatenation-based fusion in most datasets in our work in [30]. From the table, it is evident that
the FedOpt algorithm consistently outperforms other FL algorithms across various experimental conditions. When comparing performance across different datasets, the HAR task is associated with the highest
performance scores, suggesting the potential of deploying FL in real-life HAR applications. However, our
findings reveal challenges and difficulties in training effective models for social media classification on
CrisisMMD datasets. For example, F1 scores on the CrisisMMD dataset are consistently below 30%, implying the need to deploy large-scale pre-trained models for feature extraction from raw image and text
data. Overall, our FL benchmark results underscore the significant potential of deploying FL in human-centered applications such as human activity recognition and multimedia action recognition. However,
we observe that FL encounters difficulties when applied to social media classification tasks.

Table 4.2: Benchmarking performance using attention-based fusion.

Partition   Dataset     α    Metric  FedAvg  FedProx  FedOpt
Natural     MELD        -    UAR     54.37   54.67    55.37
Synthetic   UCF101      5.0  Acc     75.13   74.51    75.89
Synthetic   CrisisMMD   5.0  F1      39.11   39.36    38.74
Synthetic   UCI-HAR     5.0  F1      77.75   77.38    85.17
Synthetic   UCF101      0.1  Acc     74.53   74.71    75.05
Synthetic   CrisisMMD   0.1  F1      8.49    25.31    27.59
Synthetic   UCI-HAR     0.1  F1      76.66   76.58    79.80
4.4 Challenges of FL in Human-centered Applications
4.4.1 Performance Gap to Centralized Learning
In this subsection, we compare the multi-modal FL performance to centralized learning, where large-scale
pre-trained models could be deployed. We used the ImageBind model [37] for visual modality, Audio Spectrogram Transformer [39] for audio modality, and Whisper [78] for speech modality. We used the same
model architecture for the human activity recognition task as there are no existing large-scale pre-trained
models for the modalities in those applications. From the results, we can identify that centralized learning
can outperform FL by a substantial margin using large-scale pre-trained models, suggesting the capability
of these models to provide competitive performances in human-centered ML tasks. In contrast, we find that the difference between FL and centralized learning is small in human activity recognition, highlighting the potential to deploy FL in such computation-friendly applications.

Table 4.3: FL benchmarking performance compared to centralized learning baselines.

Datasets    Metric  Centralized  Federated
MELD        UAR     56.23        55.37
UCI-HAR     UAR     92.75        85.17
UCF101      Acc     91.32        71.41
CrisisMMD   F1      55.17        38.74
4.4.2 Performance Degradation to Real-world Noises
While our benchmark results show promise in deploying FL in multimodal applications such as multimedia action recognition, it is important to notice that there is no consideration of the robustness of our
multimodal FL benchmark. Real-world settings introduce data degradation challenges such as missing
modalities, absent labels, and erroneous labels, all of which can decrease the quality of the multimodal
data used for FL. Therefore, in this chapter, we investigate the impact of these data degradations on the
performance of multimodal FL.
4.4.2.1 Categorizations of Data Degradation
Missing Modality. The first scenario in data degradation is associated with missing modalities. Instances
of missing modality are commonly presented in real-life human-centered applications, often caused by
factors such as device malfunction, network instability, or privacy constraints. Consequently, it is common
to encounter missing modalities in FL, as formulated in [101, 15], where each client may have its own set of available modalities. In
alignment with previous research in [15], our benchmark simulates missing modalities on each client,
where the missing rate of an individual modality on a client follows a Bernoulli distribution with the missing
rate set to q.
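For illustration, a minimal sketch of this missing-modality simulation is shown below, assuming an independent per-sample, per-modality Bernoulli draw with missing rate q; the exact simulation in FedMultimodal may differ in detail.

import numpy as np

def simulate_missing_modalities(num_samples, modalities=("audio", "visual"), q=0.1, seed=0):
    # Draw a Bernoulli(q) missing indicator for every modality of every sample on a client;
    # missing_mask[m][i] is True when modality m of sample i is treated as missing.
    rng = np.random.default_rng(seed)
    return {m: rng.random(num_samples) < q for m in modalities}

# Example: with q = 0.3, roughly 30% of the visual entries on this client are dropped;
# the dropped features are later filled with zeros before training (Section 4.4.2.2).
mask = simulate_missing_modalities(num_samples=100, q=0.3)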
Missing Labels. The second data degradation scenario in our multimodal FL benchmark is the challenge
of missing labels. The occurrence of missing labels is a common issue in real-life FL scenarios, given
the difficulty of obtaining annotations or ground-truths for data samples. For example, instructing users
to annotate action labels for hundreds of videos is not practical. Therefore, real-world FL settings often
involve scenarios with limited labeled data. To reflect this, we implement the simulation of missing labels
in our proposed multimodal FL benchmark.
Erroneous Labels. Our last data degradation scenario in our multimodal FL benchmark is erroneous
labels. In contrast to missing labels or missing modalities, erroneous labels are frequently associated with
subject bias, skill differences, and labeling mistakes. Drawing inspiration from our previous work in [102],
we incorporate the erroneous label simulation process by defining the erroneous label ratio through a
transition matrix Q. The size of Q is equal to the unique label size. Specifically, for a given ground-truth
label i, we represent Q_{i,j} = P(ỹ = j | y = i) as the probability of erroneously labeling i as label j, where ỹ denotes the observed (possibly erroneous) label.
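The sketch below gives one way to realize this label-noise simulation; the exact construction of Q in [102] may differ, so the helper names and the mapping from the sparsity parameter to the number of candidate wrong labels are assumptions for illustration.

import numpy as np

def make_transition_matrix(num_classes, error_rate, sparsity, seed=0):
    # Build Q with Q[i, j] = P(observed label = j | true label = i): each row keeps
    # (1 - error_rate) on the diagonal and spreads the error mass over a subset of
    # wrong classes whose size is controlled by the sparsity parameter.
    rng = np.random.default_rng(seed)
    Q = np.eye(num_classes) * (1.0 - error_rate)
    for i in range(num_classes):
        candidates = [j for j in range(num_classes) if j != i]
        k = max(1, int(round((1.0 - sparsity) * len(candidates))))
        targets = rng.choice(candidates, size=k, replace=False)
        Q[i, targets] += error_rate / k
    return Q

def corrupt_labels(labels, Q, seed=0):
    # Sample a possibly erroneous label for each sample from row Q[y].
    rng = np.random.default_rng(seed)
    return np.array([rng.choice(len(Q), p=Q[y]) for y in labels])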
4.4.2.2 Experimental setup
In this section, we present a subset of results related to data degradation in FL in [30]. To simulate the
missing modality, we assume the missing modality in a client follows a Bernoulli distribution with a missing ratio of q. We present experiments with q ∈ {0.1, 0.3}. Specifically, we fill the missing data with a
constant of 0 following prior work in [73]. Our multimodal fusion implements the scheme to mask missing
data in calculating attention scores. Moreover, we implement a missing label simulation to emulate the
missing label rate l ∈ {0.1, 0.3}. Our training over missing labels is to perform FL with labeled data while
we leave the effort of mitigating this issue using semi-supervised learning or self-supervised learning as
future works.
Figure 4.3: Relative performance changes with 10% data corrupted (missing modalities vs. missing labels
vs. erroneous labels).
Finally, we introduce the simulation of erroneous labels with a ratio denoted as l ∈ {0.1, 0.3}, where l
represents the proportion of data associated with erroneous labels. Following our prior work in [102], we
set the sparsity of the erroneous label transition matrix Q at 0.4. In particular, a lower sparsity indicates
a higher likelihood of a label being inaccurately assigned to a larger number of unique labels. We want to highlight that
we set the data degradation ratio to {0.1, 0.3} to assess the impact of these data degradations under
both mild and severe conditions.
4.4.2.3 Lower-level Data degradation on FL Performance
In this subsection, we present the FL performance under lower-level data degradation, with 10% of the
data subject to missing modalities, missing labels, or erroneous labels. The complete results of the relative
model performance changes at the data corrupted ratio of 10% are shown in Figure 4.3. The comparison
indicates that the relative performance changes with missing modalities ratio and missing label ratio at
10% are small, implying that a small number of missing modalities or missing labels in FL present minimal
risks to the final performance. However, we identify a relatively larger performance decrease with the
erroneous labels. For example, we observe 8.6% and 12.7% decreases in FL performance on UCI-HAR and
CrisisMMD datasets, respectively. Overall, we find that the impact of lower-level data degradation on FL
performance is relatively low, with erroneous labels presenting the greater concern for FL performance degradation.

Figure 4.4: Relative performance changes with 30% data corrupted (missing modalities vs. missing labels vs. erroneous labels).

Figure 4.5: FL performance over training rounds on UCF101 dataset with missing modality, missing labels, and erroneous labels.
4.4.2.4 Higher-level Data degradation on FL Performance
In this subsection, we further study how data degradation impacts FL performance when faced with a
higher level of degradation at 30%. Similarly, we perform the three experiments with 30% of the data subject to missing modalities, missing labels, and erroneous labels. Figure 4.4 presents the comparisons of relative model performance changes at a 30% data corruption ratio. The results reveal that data corruption has a minimal impact
on FL performance in emotion recognition. However, it is worth noting that the baseline performance for
emotion recognition is relatively low. On the other hand, in the CrisisMMD dataset, the missing modalities
have a considerably greater impact on FL performance, underscoring the importance of modeling with diverse modalities in social media understanding tasks. Moreover, we observe that erroneous labels decrease
not only the overall FL performance but also the convergence speed of FL compared to missing modalities
and labels. In summary, we notice that FL performance starts to decline when the data degradation ratio
reaches 30%, implying the importance of ensuring higher data quality in the context of FL.
4.5 Conclusion
In this chapter, we presented our experimental framework for multi-modal federated learning, named FedMultimodal, which enables federated learning in multi-modal human-centered applications. We further
established a reproducible benchmark of results for 4 representative human-centered datasets for future
comparisons. We compared the FL performances to centralized training where large-scale pre-trained
models are accessible. We also benchmarked results on model robustness to missing modalities, missing
labels, and noisy labels in each task. From the extensive experimental results, we found that FL is promising
in diverse human-centered applications, with reasonably high performance in multimedia action recognition and human activity recognition. However, we identify that FL struggles to compete with
centralized learning when large-scale pre-trained models are deployed. Our results also indicate that federated learning is sensitive to data degradation, where missing and incorrect
labels can cause a substantial drop in FL performance.
Chapter 5
Privacy-enhancing Unimodal Learning Through Foundation Models
5.1 Introduction
What are the challenges with data perturbation in privacy-enhancing computing? In our previous
chapters, we investigated data perturbation to enhance privacy in human-centered applications. While
the data perturbation approach proves to be effective in cases where the complete data is available, this
approach fails in the context of low-resource training. In such cases, the trained model frequently provides
limited capabilities in inferring desired information from the data.
Transforming the Privacy Problem into a Low-resource Training Challenge. It is worth noting that challenges associated with low-resource training have become increasingly prominent in recent years due to privacy constraints, which prevent AI practitioners and researchers from storing, utilizing, and exchanging human-centric
signals. Specifically, various modalities, such as speech, images, and videos, carry sensitive information
that can reveal personally identifiable information (P.I.I.), prompting many legislations to safeguard them,
including the recently introduced GDPR [94].
Why would the Foundation model be beneficial? As we discussed earlier, the foundation model is an
emerging field of research involving the development of generative AI models. The user can prompt these
generative AI models, such as ChatGPT and DALL-E 2 (https://openai.com/), with prompt instructions to generate desired
content. Moreover, these models can process multimodal information to create high-quality content in the
form of audio, text, or image. These generated data not only facilitate content creation but also present
unique opportunities in privacy-enhancing computing, as high-fidelity generated content can be regarded
as an efficient approach to creating training data. For example, previous studies have demonstrated the
potential of using pure synthetic data as training data in diverse modalities [44, 95, 46]. The positive
findings from these prior works motivate us to study the use of generative foundation models to generate
human-centered training data to assist privacy-enhancing computing in human-centered applications.
What do we aim to achieve with the foundation model? Specifically, this thesis explores using the
foundation model to assist low-resource training or zero-shot learning in the context of human-centered
signals. It is worth noting that the limitations of mobile sensors, privacy laws, and other constraints
in real-life applications can all lead to low-resource data situations. In particular, privacy laws, such as
the EU’s General Data Protection Regulation (GDPR) [94], have stated the importance of protecting user
privacy, preventing service providers from collecting user data or training their business models without the consent
of users. This chapter investigates generative foundation models for assisting low-resource automatic
speech understanding tasks. This chapter is based on our work in [29].
5.2 Application - Automatic Speech Understanding (ASU)
This chapter studies using generative foundation models to assist automatic speech understanding (ASU)
tasks, including speech emotion recognition (SER) and spoken language understanding (SLU). Our motivation for investigating ASU is that previous research has extensively studied the use of synthetic
images or texts as training data, while limited studies have explored the use of foundation models to enhance speech understanding modeling. Moreover, we argue that ASU is a natural human-centered
application as speech is an important signal in daily life. ASU typically involves interpreting and comprehending human speech, making it a popular area of research in machine learning, AI, and human-computer
interfaces. ASU systems have also empowered widely used virtual assistants, such as Amazon Alexa, Apple Siri, and Google Assistant. On the other hand, speech signals carry substantial information about a person, as we discussed earlier, posing privacy risks in deploying ASU applications. Therefore, it is valuable to investigate privacy-enhancing machine learning in ASU applications.

Table 5.1: Summary of dataset statistics used in this sub-work.

Datasets     Unique Speakers  Classes  Total Utterances
IEMOCAP      10               4        5,531
MSP-Improv   12               4        7,798
SLURP        177              46       72,277

Specifically, the two tasks
that we are interested in studying are:
Speech Emotion Recognition, introduced in the previous chapter [54], is a popular speech understanding task that aims to infer categorized emotions from spoken utterances.
Spoken Language Understanding [93] is another widely studied ASU task involving intent classification, topic understanding, and slot filling tasks. This chapter investigates end-to-end recipes [83] to
perform SLU-related tasks directly from speech signals.
5.3 ASU Datasets
Table 5.1 displays data statistics for the three datasets included in this work. We use the IEMOCAP and MSP-Improv datasets for SER experiments and the SLURP dataset for SLU training.
Speech Emotion Recognition IEMOCAP [11], as used in this sub-work, includes recordings of scripted
human interactions. The complete dataset contains multimodal data from 10 participants, with half of
the participants being male. In addition to IEMOCAP, we experiment with MSP-Improv, a similar speech
emotion dataset designed to capture naturalistic emotions in improvised situations. This dataset includes 12
participants, evenly split between males and females. We follow the previous literature [74] to form the
speech emotion classification on neutral, angry, happy, and sad emotions, as these are the most presented
emotions in both datasets.
Spoken Language Understanding We select SLURP [8] to evaluate the spoken language understanding.
The complete dataset includes 72,277 utterances from over 100 participants reading text transcripts related
to human speech commands to control in-home personal robot assistants. The transcription was acquired
from Amazon Mechanical Turk (AMT) workers by asking them how they would communicate a certain intent. In
this work, we focus on investigating the intent classification task.
5.4 What Research Questions Are We Interested In?
As we discussed earlier, our focus is to study the use of generative foundation models to assist privacy-enhancing human-centered applications. In privacy-sensitive scenarios, a common setting is that human data may be missing due to privacy constraints. Thus, we would like to answer the following research
questions related to severe data scarcity caused by privacy constraints:
• Our first assumption is that training with pure synthetic data provides the strongest privacy protections,
requiring the AI practitioner to build ML models without collecting user data. Consequently, our first
research question is: can synthetic speech provide enough zero-shot ASU performance?
• Apart from the zero-shot learning with pure synthetic data, it is more reasonable to assume that limited
human data exists for training the model. Therefore, we are particularly interested in whether
synthetic speech content can be used to improve model performance under low data settings
in ASU.
5.5 Foundation Model Assisted Low-resource ASU Training
5.5.1 Speech Generation
Our proposed synthetic generation approach was inspired by multiple prior works [103, 44], where we
aim to generate training data relying on the label information. Our complete data generation pipeline is
demonstrated in Figure 5.1. Specifically, we break down speech synthesis into two parts: the first part
generates spoken texts through LLMs, and the other synthesizes speech data using the text-to-speech
(TTS) model.
• Spoken Text Generation To synthesize the speech data, we first need to acquire the associated text
information. To achieve this, we propose to prompt the modern LLMs to generate the wanted spoken
text content. For example, we prompt the LLM to generate emotional spoken texts using the following
prompt: Generate a spoken utterance with _ emotion. Here, we feed the prompt with the emotion labels
during the generation. Similarly, we replace the prompt message with "Generate a spoken utterance
with intent to _. " to generate spoken text with intent. Our generation process is illustrated in Figure 5.1.
• Text-to-Speech (TTS) After obtaining the spoken text content, we synthesize speech data using the
text-to-speech models. In this work, we utilize the SpeechT5 model [6] for text-to-speech generation, as sketched below.
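To illustrate the two steps above, here is a minimal sketch using the Hugging Face transformers library; the FLAN-T5 checkpoint size shown here, the sampling arguments, and the random placeholder speaker embedding are illustrative assumptions (the actual configuration, including x-vectors sampled from CMU Arctic, is described in Section 5.6).

import torch
from transformers import (pipeline, SpeechT5Processor,
                          SpeechT5ForTextToSpeech, SpeechT5HifiGan)

# Step 1: prompt an instruction-tuned LLM to produce label-guided spoken text.
text_gen = pipeline("text2text-generation", model="google/flan-t5-xl")
prompt = "Generate a spoken utterance with happy emotion."
spoken_text = text_gen(prompt, max_new_tokens=32, do_sample=True,
                       temperature=1.0)[0]["generated_text"]

# Step 2: synthesize speech from the generated text with SpeechT5.
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
tts = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

inputs = processor(text=spoken_text, return_tensors="pt")
speaker_embedding = torch.randn(1, 512)  # placeholder; in practice an x-vector sampled from CMU Arctic
waveform = tts.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)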
5.5.2 ASU Modeling
This work uses the pre-trained speech models to train the end-to-end ASU systems. Specifically, we apply WavLM Base+, with approximately 90M parameters, as our proposed ASU backbone. Our modeling
approach follows the approach in [74]. Specifically, we train learnable weights to combine hidden outputs
from all encoder layers. The weighted output is then fed into 1D pointwise convolutional layers with an
average pooling. Finally, we apply two-layer MLP for ASU classification. We want to highlight that we
froze the pre-trained speech encoder during our training. Apart from training the downstream models, we integrate LoRA (Low-Rank Adaptation) [45] to further enhance the ASU performance, as demonstrated in [61, 33]. LoRA adapts the model updates with low-rank matrices during the training phase, bringing the benefits of lower inference latency and improved model performance.

Figure 5.1: The proposed synthetic data generation and ASU training framework. We begin with generating transcriptions of spoken utterances through label information. The transcripts are then fed into the text-to-speech model, creating synthetic speech data. Finally, we perform end-to-end ASU training using pre-trained WavLM models on synthetic and limited human data. The figure is from our work in [29].
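To make the downstream head described above concrete, below is a minimal PyTorch sketch of the architecture: learnable weights over the frozen encoder's layer outputs, a pointwise 1D convolution, average pooling over time, and a two-layer MLP classifier. The MLP hidden size and other dimensions are illustrative assumptions.

import torch
import torch.nn as nn

class ASUHead(nn.Module):
    # Weighted combination of frozen encoder layer outputs, followed by a pointwise
    # 1D convolution, average pooling over time, and a two-layer MLP classifier.
    def __init__(self, num_layers, hidden_dim, num_classes):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))
        self.conv = nn.Conv1d(hidden_dim, hidden_dim, kernel_size=1)
        self.mlp = nn.Sequential(nn.Linear(hidden_dim, 256), nn.ReLU(),
                                 nn.Linear(256, num_classes))

    def forward(self, hidden_states):
        # hidden_states: (num_layers, batch, time, hidden_dim) from the frozen WavLM encoder
        w = torch.softmax(self.layer_weights, dim=0).view(-1, 1, 1, 1)
        x = (w * hidden_states).sum(dim=0)     # (batch, time, hidden_dim)
        x = self.conv(x.transpose(1, 2))       # (batch, hidden_dim, time)
        x = x.mean(dim=-1)                     # average pooling over time
        return self.mlp(x)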
5.6 Experiment Details
We choose FLAN-T5 [18] as the LLM to generate spoken text. This model has 11.3B parameters. We
set the generation temperature to 1.0. Furthermore, we constrain the output length to a maximum
of 32 tokens, as spoken utterances typically involve simple phrases. Moreover, we prompt the FLAN-T5
to generate 4,000 text transcripts for emotion recognition, with each emotion category including 1,000
spoken texts. Similarly, we generate 4,600 transcriptions for intent classification, with 100 spoken texts per intent class. We then synthesize the speech samples using the SpeechT5 model, where we augment
the speaker information by sampling the x-vector [88] from The CMU Arctic dataset [51].
We perform 5-fold and 6-fold experiments on IEMOCAP and MSP-Improv datasets, where each session
is regarded as one test fold in the experiments. Meanwhile, we apply the standard training, validation, and
testing splits in experimenting with the SLURP dataset. We evaluate the performance of speech emotion
recognition and intent classification tasks using the unweighted average recall (UAR) and Macro-F1 scores,
respectively. Our training experiments use a learning rate of 0.0005 and a batch size of 64. As most of the
utterances in SLURP are short, with a duration of around 3 seconds, we set the maximum audio duration to 3 seconds when training on the SLURP dataset. At the same time, we set the maximum audio
duration as 6 seconds in other experimental datasets. We use the pre-trained WavLM checkpoints from
HuggingFace [97] and conduct the training with A40 GPUs.
5.7 Results
5.7.1 Zero-shot performance using synthetic speech for training ASU models
We first investigate the zero-shot performance using only synthetic speech data. Specifically, we compare
the synthetic and real speech training, as shown in Figure 5.2. The results show that real speech training
performs substantially better than using label-guided synthetic speech. Notably, we find that synthetic
speech yields close to random guess performance in speech emotion recognition, while it provides a much
higher score than random guess in the intent classification task. Overall, the results indicate the challenge
of relying on synthetic speech content to replace real speech data, and real speech data are still needed for
training ASU.
Figure 5.2: ASU fine-tuning performance between real and synthetic speech data.
Figure 5.3: Fine-tuning performance between low-resource real speech data and synthetic speech data.
5.7.2 Comparing limited speech training with synthetic speech training for ASU
We investigate whether zero-shot learning with synthetic speech can outperform limited speech training.
In this experiment, as there are only a few hundred training examples in training emotion recognition, we
set the learning rate in training SER models to 0.0001 to prevent overfitting. The comparisons between
zero-shot performance with synthetic speech and performance using limited speech samples are presented
in Figure 5.3. The results show that, even with 10% of real speech data, limited real speech training consistently outperforms synthetic speech training. This finding indicates the importance of incorporating
real speech data in training. However, we observe that synthetic speech training can outperform limited
speech training in intent classification when only 5% of real speech from SLURP is available.
Figure 5.4: Limited speech training with pre-training on synthetic speech.
5.7.3 Combining synthetic speech training with limited speech training for ASU
Our previous findings revealed that zero-shot learning, particularly with label-guided synthetic speech
content training, is challenging in ASU modeling. Therefore, in this subsection, we aim to explore whether
augmenting limited speech training with synthetic speech training can improve the ASU performance.
Based on our prior work in [103], we perform pre-training for ASU classification using synthetic speech
data, where the pre-trained classification model is used as the initialization for subsequent limited speech
training. The proposed training approach is illustrated in Figure 5.4. The comparisons between limited
speech training and synthetic speech-assisted limited speech training are shown in Figure 5.5. The results reveal that synthetic speech pre-training consistently enhances performance compared to scenarios
without pre-training. We can identify that pre-training benefits ASU training with varying percentages of
limited real speech (5%, 10%, 15%, and 20%) across all datasets. Specifically, this performance enhancement
is particularly substantial in intent classification, leading to over 50% and 20% improvements at 5% and
10% real speech, respectively. These results highlight that while relying solely on synthetic speech delivers limited ASU performance, incorporating pre-training on synthetic speech followed by limited speech
training can consistently improve ASU performance.
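A minimal sketch of this two-stage recipe is shown below; the training loop is generic, the function and loader names are placeholders, and the real hyperparameters are those listed in Section 5.6.

import torch
import torch.nn as nn

def train(model, loader, epochs, lr, device="cuda"):
    # Plain supervised loop reused for both stages.
    model = model.to(device)
    model.train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for audio, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(audio.to(device)), labels.to(device))
            loss.backward()
            optimizer.step()
    return model

# Stage 1: pre-train the ASU classifier on label-guided synthetic speech.
# Stage 2: fine-tune the same weights on the limited real speech data.
# model = train(model, synthetic_loader, epochs=n_pre, lr=lr_pre)
# model = train(model, real_loader, epochs=n_ft, lr=lr_ft)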
Figure 5.5: Fine-tuning performance between random initialization and synthetic data assisted model initialization in low-resource speech training. The x-axis represents the available real speech data ratio presented for low-resource training.
5.7.4 Heuristic investigation on the data generation
The previous subsections demonstrate the training performance using synthetic speech, while it is unclear
what the generated data looks like. In this subsection, we present heuristic examples of the generated
spoken text to demonstrate the quality of the spoken utterances. Below are the generated text examples for emotion recognition, with one example in each emotion category:
• Neutral: "Okay, can you finish a glass of wine, please."
• Happy: "We had so much fun in Florida."
• Sad: "All I want to do is cry."
• Angry: "Why can’t you just admit that you broke my car!"
Moreover, we present examples for the intent classes set alarms, query contact, mute the volume, and tell a joke:
• Set Alarms: "You want the alarm to go off in 3 hours."
• Query contact: "I’d like to get a list of contacts who are in the San Francisco area."
• Mute the volume: "Turn your volume down.",
• Tell a joke: "Hey man, it’s raining again."
These generation examples show that LLMs can provide related responses to user prompts. For example, when the prompt is to generate intent to set alarms, the LLM can output transcriptions of "You want
the alarm to go off in 3 hours.", which heuristically matches with how a human would provide spoken
instructions to a virtual assistant. However, we identify that our chosen LLM may still have limited
capabilities in language understanding, failing to respond to the instruction to generate the intent class of
telling a joke. For example, in our experiments, we find the LLM output "Hey man, it’s raining again."
when we instruct it to give an intent message to tell a joke. Moreover, we discover that some generated
text can be highly patterned. For example, in intent classes like setting alarms, the LLM outputs similar
text like "You want the alarm to go off in 3 hours.", "You want the alarm to go off in 2 hours." and "You want
the alarm to go off in 5 hours." Overall, this encourages us to explore more advanced LLMs, such as GPT-4
and Gemini.
5.8 Conclusion
In this chapter, we studied the use of generative foundation models to assist privacy-enhancing human-centered ML when human data is missing due to privacy constraints. Specifically, we explore ASU applications in emotion recognition and intent classification. We propose a two-stage generation approach that
combines an LLM and a text-to-speech model to generate speech samples from only domain or label information. Our study finds that zero-shot training on purely synthetic speech delivers limited
performance compared to real speech training or limited real speech training, indicating potential quality issues in the synthetic speech corpus. However, we identify that pre-training ASU classification models
with synthetic speech content yields consistently better performance than training without pre-training in
limited speech settings. These results show the promise that synthetic speech content can bring to ASU
training, suggesting the possibility of collecting less private data while achieving competitive performance.
Finally, we also identify potential generation quality issues in the current experiments; we aim to improve
generation quality by using more advanced LLMs and, in addition to language
generation, plan to incorporate more advanced TTS models to improve speech generation quality.
Chapter 6
Privacy-enhancing Multimodal Learning Through Foundation Models
6.1 Introduction
Why do we need to study multi-modal learning? In addition to uni-modal systems, multi-modal
systems are prevalent in human-centered applications, as presented in the previous chapter of Federated
Learning. The multimodal learning systems combine diverse input in audio, text, or videos, providing
capabilities to understand complex events happening around humans [62]. The recent advances in selfsupervised learning [17, 77] have offered the ML community substantial opportunities to develop advanced
multimodal AI systems to perform complex tasks like question-answering [1], captioning [58], and cross-modal retrieval [50]. Therefore, it is beneficial to extend our privacy-enhancing approaches from unimodal settings to
multimodal scenarios.
How should we formulate privacy-enhancing computing problems in multimodal learning? Unlike unimodal learning scenarios where the model takes input from one modality, the multimodal models
require input from multiple modalities. In this learning scenario, one challenge is the availability of the modalities for training and inference. Particularly in sensitive conditions, such as hospitals, private meetings,
and homes, instances of missing modalities can easily occur due to privacy concerns [94]. Moreover, visual
modalities like images and videos can frequently introduce substantial privacy concerns regardless of the
recording context. The visual modality can directly provide information about the gender, age, appearance, and body shape of an individual. Therefore, we study the privacy-enhancing modeling in multimodal
learning as missing modalities, where some modalities may often be unavailable due to the privacy constraints in collecting such data. Specifically, we focus on investigating multimodal learning with visual
modality missing.
What do we aim to achieve with the generative Foundation Model? The major goal of this chapter is
to investigate the use of generative foundation models to assist privacy-enhancing modeling in multimodal
learning. Inspired by our previous discussion and experiments in the uni-modal setting, we propose a
multimodal learning framework called GTI-MM: Generative-Transformer Imputation approach for MultiModal learning with visual modality missing. Similar to our approach in unimodal settings, we generate
missing visual data using the text-to-image generative models. Specifically, we prompt the text-to-image
model with domain information, such as labels, to generate an associated visual modality to impute the
multimodal data with the missing visual modality. We want to highlight that this chapter is based on our
work in [35].
6.2 Application - Multimedia Action Recognition
In this sub-work, we explore our proposed GTI-MM framework on a popular visual understanding task:
multimedia human action recognition. Human action recognition is the task of classifying the actions that
a human performs from video data involving audio and visual information. This vision recognition task
matches our problem formulation, as most of the visual data contains humans, leading to privacy concerns about
leaking information like body shapes, facial geometries, and other biometric fingerprints.
Figure 6.1: Problem formulation of missing modalities in this work with audio-visual recognition as the
example. The missing modality includes cases in training data alone or any data. The figure is from our
work [35].
6.3 Multimedia Action Recognition Datasets
Our experiments include three popular multimedia action recognition datasets, summarized in Table 6.1. UCF101 [90]
is the first action recognition dataset used in this work. This dataset contains over 10k sports-related videos
from web sources. However, only about 7,000 videos across 51 labels contain both visual and audio data.
The second dataset used in this chapter is ActivityNet, which contains more diverse human actions
other than sports-based actions. There is a total of 18,976 videos with both audio and visual data, resulting
in 200 action categories. Lastly, we experiment with a recently released dataset called Moments in Time
(MiT). The MiT dataset includes approximately 1 million short videos covering 339 human actions. The
MiT includes videos in diverse styles, including camera recordings, animation, screencasts, and montages.
Due to the intrinsic difficulty in modeling the whole corpus, we decide to perform our experiments with
a subset of this dataset. Specifically, we create MiT51, which contains the 51 most frequent labels from the original MiT dataset. Due to computation limitations, we truncate the audio to a maximum duration of 5 seconds in the UCF101 and ActivityNet datasets.

Table 6.1: Summary of dataset statistics used in this sub-work.

Datasets     Video Style                          Classes  Data Size
UCF101       Camera                               51       6,837
ActivityNet  Camera                               200      18,976
MiT51        Camera, Animation, Screencast, etc.  51       163,038
6.4 Missing Modality Problem Formulation
As we described in the dataset and application subsections, this chapter focuses on human action recognition from videos involving audio-visual information. Moreover, as visual data may frequently be associated
with P.I.I. that people wish to keep private, we decided to study the scenarios with visual information
missing. Our complete problem setup in multimodal learning with visual data missing is presented in Figure 6.1. We want to stress that we explored scenarios of missing audio modality later in the experiments
for study completeness.
6.4.1 Problem Notation
We first introduce the notation used in this problem formulation. We define the multimodal training dataset as D, which includes modality-complete samples DC = {x_i^A, x_i^V, y_i} and visual-modality-missing samples DA = {x_i^A, y_i}, where i ∈ N, and A and V represent the audio and visual modalities, respectively. More concretely, we denote the multimodal dataset as D = {DC, DA}.
6.4.2 Visual-Modality Missing in Training Data
Our first problem aims to study the case of visual modality missing in the training data. In this scenario, the use of generative foundation models can also be regarded as a way to improve data efficiency when the visual modality is missing.
Given the notation above, we aim to improve multimodal training with the dataset D = {DC, DA}.
Specifically, we denote the visual modality missing rate in training data as p, where p = 0 indicates
complete modality training. Moreover, most existing studies explore training scenarios with p < 90%,
while we argue that the visual modality missing rate can be higher than 90% due to its sensitivity to
privacy. Therefore, we explore the visual missing modality with p > 90%. Specifically, we aim to answer
the research questions below for visual modality missing in training data:
• Similar to unimodal learning, our first research question is related to the zero-shot ability of unimodal
learning on synthetic visual data. Specifically, we aim to investigate whether synthetic visual data from
generative models can replace real visual modality.
• In addition to zero-shot learning, it is more prevalent to have limited visual data presented in multimodal
learning. Therefore, we want to explore whether combining synthetic visual data with limited real visual
data can enhance multimodal training performance.
• It is natural to assume that the quality, diversity, and complexity of the generation can impact the imputation of multimodal learning. So, we are wondering how the quantity, complexity, and diversity of the
synthetic data impact multi-modal training with visual modality missing.
6.4.3 Visual-Modality Missing in Any Data
Apart from investigating missing modalities in the training data, we study cases where any data sample
can have the visual modality missing. In this problem setup, we can also view visual imputation as the way
to enhance the model robustness, as the modality missing in test can degrade the multimodal performance
in the inference stage. Specifically, we define the visual modality missing rate in testing as q. In summary, we aim to
answer the following research questions related to visual modality missing in both training and testing:
• There has been a wide range of work on enhancing model robustness against missing modalities in the inference stage, such as modality dropout training. Consequently, we would like to explore whether GTI-MM can be combined with these existing approaches to enhance model robustness?
• In addition to modality dropout training, multimodal prompt learning is a recently proposed approach to improve model robustness for visual recognition tasks. Therefore, we are interested in whether combining GTI-MM with multimodal prompt learning can enhance model robustness?

Figure 6.2: Learning framework of GTI-MM: Imputing missing visual modality with synthetic visual content for robust multi-modal learning. The figure is from our work [35].
6.5 GTI-MM: Foundation Model to Assist Sensitive Modality Missing
In this section, we present details about our proposed GTI-MM framework.
6.5.1 Visual Data Generation
Our proposed GTI-MM framework starts with visual data generation, as demonstrated in Figure 6.3. Specifically, we prompt the text-to-image models with the domain information, such as labels, to generate the
synthetic images.

Figure 6.3: Visual data generation process in GTI-MM. The figure is from our work [35].

In order to perform the image generation, we apply the following approaches to generate the text prompt:
Label Prompt: The first prompt approach is simply a label-guided prompt. This approach has been
applied in prior work [44] exploring the zero-shot capabilities of synthetic images for image classification.
This simple approach has demonstrated effectiveness in image generation for training image classification
models. Unlike prior works that generate images of object-related classes, generating human actions requires augmenting the input prompt with different human action performers. Specifically, we define the set of ACTION_PERFORMER as S = {a man, a woman, a child, a person, a group of people}, where we sample one action performer type in each data generation. Therefore, given the input ACTION_NAME from the class set C = {c1, ..., cn}, we create the prompt message t = "A photo of ACTION_PERFORMER ACTION_NAME", where ACTION_PERFORMER is randomly sampled from S.
LLM-assisted Prompt: Our second approach to creating text prompts is to leverage advanced LLMs. To achieve this, we prompt the LLMs to provide five text descriptions for each human action: "Provide 5 definitions of action class ACTION_NAME". The goal is to increase the complexity of the prompt message so that it provides a more detailed description for image generation. A minimal sketch of both prompt-construction approaches is given below.
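The following is a minimal Python sketch of the two prompt-construction strategies; the ACTION_PERFORMERS list mirrors the set S above, while llm_descriptions is a hypothetical placeholder for the descriptions collected from an LLM such as ChatGPT.

```python
import random

# Action performers from the set S defined above.
ACTION_PERFORMERS = ["a man", "a woman", "a child", "a person", "a group of people"]

def label_prompt(action_name: str) -> str:
    """Label-guided prompt: 'A photo of <performer> <action>'."""
    performer = random.choice(ACTION_PERFORMERS)
    return f"A photo of {performer} {action_name}"

def llm_assisted_prompt(action_name: str, llm_descriptions: dict) -> str:
    """LLM-assisted prompt: sample one of the LLM-provided action descriptions."""
    return random.choice(llm_descriptions[action_name])

# Example usage (the description below is illustrative, not the one used in this work).
descriptions = {"Knitting": ["The craft of creating fabric by interlocking loops of yarn."]}
print(label_prompt("Knitting"))
print(llm_assisted_prompt("Knitting", descriptions))
```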
6.5.2 Diversity Enrichment in Data Generation
Prior research [85, 44] has shown that increasing generation diversity can provide advantages in model training. Specifically, the authors showed that increasing diversity by randomly setting the unconditional guidance scale (UGC) of the text-to-image model between 1 and 5 can effectively improve training performance. This approach enhances generation diversity because the guidance scale controls the creativity of the image generation, with a higher value indicating more creative generation. In addition to randomizing the guidance scale, augmenting the text prompt with multi-domain knowledge is another widely used approach to enrich generation diversity. Instead of generating only photo-style images, the multi-domain knowledge approach augments the visual generation by sampling from a predefined style list. The possible styles include drawing, sketching, 3D renderings, posters, digital arts, painting, and rock drawings. A hedged sketch of both diversity-enrichment tricks is shown below.
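The following is a minimal sketch of these two diversity tricks, assuming a Stable Diffusion checkpoint served through the diffusers library; the checkpoint name and the style list are illustrative rather than the exact configuration used in this work.

```python
import random
import torch
from diffusers import StableDiffusionPipeline

# Illustrative style list for the multi-domain knowledge trick.
STYLES = ["a photo", "a drawing", "a sketch", "a 3D rendering",
          "a poster", "a digital art", "a painting", "a rock drawing"]

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def generate_diverse_image(action_prompt: str):
    # Multi-domain knowledge: prepend a randomly sampled visual style.
    style = random.choice(STYLES)
    prompt = f"{style} of {action_prompt}"
    # UGC trick: randomly sample the guidance scale between 1 and 5.
    guidance = random.uniform(1.0, 5.0)
    return pipe(prompt, guidance_scale=guidance).images[0]

image = generate_diverse_image("a person Drinking Coffee")
```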
6.5.3 Multi-modal Learning with Visual Imputation
After the visual data generation, we perform the imputation with the synthetic visual data, as shown
in Figure 6.2. Specifically, our core idea is to randomly replace each missing visual sample with a synthetic sample belonging to the same label. For example, given the generated visual dataset $\mathcal{D}^{V'} = \{x^{V'}_j, y_j\}$, with $x^{V'}_j$ and $y_j$ denoting the synthetic visual data and its associated label, our imputation fills the visual-modality-missing samples $\mathcal{D}^{A}$ with synthetic data of the same label $y_j$. Here, $V'$ denotes the synthetic visual modality. More concretely, we define the imputed modality-complete data as $\hat{\mathcal{D}}^{A} = \{x^{A}_i, x^{V'}_i, y_i\}$. Finally, our training dataset becomes $\hat{\mathcal{D}} = \{\mathcal{D}^{C}, \hat{\mathcal{D}}^{A}\}$, with complete modalities in all data. As shown in Figure 6.2, with the complete-modality dataset, we can integrate most multimodal learning algorithms for increasing data efficiency and model robustness, such as dropout training. A minimal sketch of this imputation step is given below.
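The following is a minimal sketch of this imputation step; the per-sample dictionary layout and helper names are assumptions for illustration.

```python
import random
from collections import defaultdict

# Each training sample is assumed to be a dict with keys "audio", "visual"
# (None when the visual modality is missing), and "label".

def impute_missing_visual(train_samples, synthetic_images):
    """Replace missing visual data with a random synthetic image of the same label."""
    # Index synthetic images by their label for fast sampling.
    by_label = defaultdict(list)
    for image, label in synthetic_images:
        by_label[label].append(image)

    for sample in train_samples:
        if sample["visual"] is None:                     # visual modality missing
            sample["visual"] = random.choice(by_label[sample["label"]])
    return train_samples                                 # now modality-complete
```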
6.6 Experimental Details
6.6.1 Visual Data Generation
In this sub-work, we apply the widely studied text-to-image model, the Latent Diffusion Model [81], for visual modality generation. Although the authors of [44] experiment with over one million generated images, we find that image generation is expensive, taking an average of 10 seconds per image on an A40 GPU. Therefore, we decided to generate less visual data than [44]. Specifically, for each experimental dataset, we synthesize 100 images per action category, leading to 5,100, 20,000, and 5,000 synthetic images for UCF101, ActivityNet, and MiT51, respectively. We conduct the generation using the label-guided prompt and the LLM-assisted prompt. As imputing both training and testing visual data might cause artifacts in the evaluation process, we impute missing data only in the training set.
6.6.2 Pre-trained Multi-modal Model
In this experiment, we utilize a pre-trained audio-visual model for human action modeling. Specifically, we apply the recently developed ImageBind [37] multimodal model. ImageBind leverages self-supervised learning objectives to learn multimodal representations by aligning different modalities to the visual modality. The aligned modalities include text, audio, depth, thermal, and IMU data. In this work, we use the visual and audio backbones for human action recognition, where the visual backbone and audio backbone use the Vision Transformer (ViT) [25] and Audio Spectrogram Transformer (AST) [39], respectively.
6.6.3 Model Training and Evaluation
Our training setting follows the prior work [55] on studying missing modalities in visual recognition tasks: instead of fine-tuning the whole pre-trained backbones, we only train the downstream classifiers. Another reason not to fine-tune the pre-trained backbones is the substantial memory and compute required for training. Specifically, we use late fusion to combine the audio and visual streams in this experiment, as it yields consistently better performance than other fusion algorithms in our experiments. Our evaluation metric for human action recognition is Top-1 accuracy, and we perform the training with the default train, validation, and test splits from each dataset. We repeat the experiments with multiple seeds when training on ActivityNet and report the average accuracy. In contrast, we train on MiT51 with a fixed seed due to the large size of the dataset. A minimal sketch of the frozen-backbone late-fusion classifier is given below.
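The following is a minimal sketch of the frozen-backbone late-fusion classifier described above; the embedding dimensions, MLP layout, and backbone interfaces are illustrative assumptions rather than the exact configuration used with ImageBind.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, audio_backbone, visual_backbone,
                 audio_dim=768, visual_dim=1024, num_classes=101):
        super().__init__()
        self.audio_backbone = audio_backbone
        self.visual_backbone = visual_backbone
        # Freeze the pre-trained backbones; only the fusion MLP is trained.
        for p in self.audio_backbone.parameters():
            p.requires_grad = False
        for p in self.visual_backbone.parameters():
            p.requires_grad = False
        self.fusion_mlp = nn.Sequential(
            nn.Linear(audio_dim + visual_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, audio, visual):
        # Extract embeddings with the frozen backbones, then fuse late.
        with torch.no_grad():
            a = self.audio_backbone(audio)    # (batch, audio_dim)
            v = self.visual_backbone(visual)  # (batch, visual_dim)
        return self.fusion_mlp(torch.cat([a, v], dim=-1))
```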
6.7 GTI-MM with Visual-Missing in Training
Our first experiment setting is when visual data is missing in training. Specifically, we control the missing visual modality ratio to 90% and above, mimicking extreme cases in which the missing visual modality ratio can occur in real life due to privacy constraints. We set the visual missing ratio p to 95% when training on the UCF101 and ActivityNet datasets. Moreover, due to the large data size of the MiT51 dataset, we set the visual missing ratio p to 99%. As discussed earlier, the test set in this experiment has complete modalities, and our approach essentially increases data efficiency under severe visual data missing in multimodal learning.
6.7.1 Would audio be enough for action recognition?
In addition to data imputation in audio-visual learning with visual modality missing, one natural question is whether audio data alone is sufficient for predicting human actions. To answer this question, we perform the training with the following approaches for comparison:
Complete Audio Training: In this training approach, we assume that only the audio modality is available for model training and no visual data exists. This baseline tests whether relying on audio data alone can provide sufficient performance for human action recognition.
Table 6.2: Comparisons among complete audio, complete visual, and complete multi-modal models across
different datasets. The visual missing ratio p = 0% in training complete visual and complete multi-modal
models.
Dataset       Complete Audio   Complete Visual   Complete Multi-modal
UCF101        50.74            91.08             88.75
ActivityNet   16.66            70.72             65.66
MiT51         33.42            58.95             60.24
Complete Visual Data Training: Our second training experiment is based on complete visual data. This baseline shows human action recognition performance using only the visual modality.
Complete Multimodal Data Training: Our last baseline is trained with complete multimodal data. We use this experiment to test whether multimodal data can outperform unimodal training.
Findings: We present the comparisons among complete audio, visual, and multimodal training in Table 6.2. The comparisons show that the visual modality provides substantial advantages in human action recognition compared to complete audio training. More interestingly, we identify that complete multimodal learning does not always outperform complete visual training, indicating that the visual modality carries the most informative and least ambiguous cues for human action recognition. In summary, the visual modality is more indicative of human actions, and relying on audio alone provides limited accuracy.
6.7.2 Zero-shot results with synthetic visual data
As we presented earlier, one research question we are interested in is whether pure synthetic data can
provide competitive performance compared to real data training. To answer this question, we perform
two training experiments leveraging the GTI-MM with visual missing ratio p = 100%:
ZS-Visual Model: Our first experiment is the visual modality zero-shot learning baseline. We experiment
by training the visual classifier with pure synthetic visual data.
Figure 6.4: Comparisons between GTI-MM and zero-shot learning with synthetic visual data.
ZS-Multi-modal Model: We extend our first experiment by combining pure synthetic data with the audio data present in the original dataset. We randomly pair a synthetic image with audio data from the same label class during training.
Findings: We plot the comparisons between the zero-shot learning baselines and complete multimodal training in Figure 6.4. The results show that both ZS-Visual and ZS-Multi-modal underperform the complete multimodal baseline. In particular, this performance gap is largest on the MiT51 dataset, which has the most data samples, indicating the importance of using real data. On the other hand, we discover that the audio modality can benefit zero-shot learning with synthetic visual data, improving ZS-Multi-modal performance over ZS-Visual performance on the UCF101 and MiT51 datasets. In summary, this experiment suggests the importance of using real data and additional modalities alongside pure synthetic visual data.
6.7.3 GTI-MM with limited visual data
Although GTI-MM in zero-shot learning scenarios underperforms complete multimodal training, it outperforms complete audio training by a large margin, showing great promise to boost training efficiency with
limited real visual data. To explore whether GTI-MM can increase training performance with limited visual
data, we perform the experiments comparing with the following baselines:
Table 6.3: Performance comparisons between GTI-MM and other baselines across different datasets when low-resource visual data is present.

Dataset       Visual Missing   Limited Visual   Limited Multi-modal   Zero-filling Imputation   GTI-MM
              Ratio (p)        Training         Training              Multi-modal Training      (Ours)
UCF101        95%              67.77            32.10                 53.59                     78.17
ActivityNet   95%              16.66            6.44                  17.49                     57.98
MiT51         99%              33.42            29.62                 40.18                     49.70
Limited Visual Data Training: Our first training experiment is based on the limited visual data available in the experiments. For example, when the visual modality missing ratio is p = 95% in the UCF101 dataset, we train the human action recognition model with only the 5% of visual data that remains.
Limited Multi-modal Data Training: Our second experiment in this subsection trains the multimodal model with the limited multimodal data available in training. Similar to the example above, with a visual modality missing ratio of p = 95% in the UCF101 dataset, this approach trains the multimodal model with only the 5% of multimodal data remaining in the UCF101 dataset.
Multi-modal Training with Zero-filling Imputation: Our last training experiment imputes missing visual data with zeros. Unlike limited visual data or multimodal data training, zero-filling imputation is a simple but effective approach to utilizing all the available data for training. This approach has been widely used as a baseline comparison in prior works [55, 65].
Findings: We compare GTI-MM with the other baseline approaches in Table 6.3. The comparisons show that our proposed GTI-MM consistently outperforms approaches relying on limited visual data. Moreover, GTI-MM also provides substantial performance improvements compared to the zero-filling imputation approach. These improvements are consistent across all experimental datasets, with the largest increase of approximately 40% (absolute) on the ActivityNet dataset. To sum up, the results suggest that synthetic visual data provides sufficient quality as training data, enhancing multimodal human action recognition performance by a substantial margin.
Figure 6.5: Performance comparisons among GTI-MM and other baselines at different training visual modality ratios.
6.7.4 Is GTI-MM effective when more visual data is available?
The previous results have demonstrated the effectiveness of applying our proposed GTI-MM in extreme
visual modality missing scenarios in training. One further research question we are interested in is whether
GTI-MM is effective when more visual data is available in the training stage. We answer this research
question by training GTI-MM with lower visual-modality missing ratios. More specifically, we perform the
GTI-MM training with p ∈ {70%, 80%, 90%, 95%} in UCF101 and ActivityNet datasets. We conduct the
GTI-MM training with p ∈ {80%, 90%, 95%, 99%} in the MiT51 dataset. The comparisons in Figure 6.5
Figure 6.6: Impact of generation quantity on GTI-MM performance.
show that the performance advantages that GTI-MM delivers decrease as the visual modality missing ratio decreases. When the visual modality missing ratio reaches 70% in the UCF101 dataset, we find that limited visual data training can even start to surpass GTI-MM. Overall, the results indicate that when more real visual data is available, the benefits of synthetic data decrease, although synthetic visual data can still bring smaller advantages compared to training without it.
6.7.5 Quantity of Visual Generation
As we discussed at the beginning of this chapter, the quantity of visual generation can impact training performance, with larger generation quantities tending to increase training performance. However, in practice, we identify that visual generation, such as text-to-image generation, requires substantial computation cost in academic settings. Our empirical experiments find that a single text-to-image generation can take around 12 seconds on an A40 GPU. This motivates us to understand how many unique generations we need in GTI-MM to achieve competitive training performance. To verify this, we conduct the GTI-MM experiments by generating different numbers of images per class in {100, 20, 5, 1}. The comparisons among different numbers of image generations are presented in Figure 6.6. The plot shows that although one generation per class decreases training performance by a large margin compared to 100 generations per class, GTI-MM with one generation per class can still yield substantial improvements over complete audio training. Moreover, we find that even with 20 generations per class, GTI-MM shows only a modest performance decrease
Figure 6.7: Impact of generation tricks on GTI-MM performance.
compared to 100 generations per class. Overall, our results show that even with five unique generations per class, GTI-MM can provide competitive training performance compared to other baseline approaches.
6.7.6 Diversity of Visual Generation
In addition to investigating the impact of the quantity of visual generation on multimodal learning, we further explore the diversity of visual generation. In this sub-work, we investigate the impact of generation diversity on imputation from two angles: prompt tricks and action performer diversity.
Prompt Tricks: We first compare GTI-MM with different prompt tricks. The first trick we experimented with is the multi-domain prompt, where we augment the generation with domains including photos, screencasts, rock drawings, and montages. In addition, we experiment with the UGC approach that augments the generation by randomly sampling the guidance scale between 1 and 5. Lastly, we conduct the experiment combining multi-domain prompts and UGC. We present the comparisons in Figure 6.7. The results in Figure 6.7 demonstrate that incorporating prompt tricks is not always associated with better imputation, as the multimodal performance on ActivityNet and UCF101 decreases when adding these prompt tricks. On the other hand, adding prompt tricks in image generation improves the model performance on MiT51 with visual modality missing in training. We suspect the reason behind this is the diverse video styles in the MiT51 dataset, where adding multi-domain knowledge in the generation benefits the
Figure 6.8: Impact of varying action performers in image generation on GTI-MM performance.
final model performance. However, as both the UCF101 and ActivityNet datasets mainly consist of natural video recordings, adding prompt tricks has little or even a negative impact on the model performance.
Action Performers: Apart from increasing diversity using prompt tricks, we argue that supplementing the generation with different demographic information can increase generation diversity. Specifically, our baseline generation approach creates the text prompt through a set of action performers, and we wondered whether restricting the visual generation to certain demographics would decrease the model performance. To verify this, we perform the visual generation by limiting the action performer to "a man" or "a woman", as shown in Figure 6.8. The comparison results demonstrate that augmenting the action performers with diversified demographic information consistently improves the model performance. This result suggests the importance of considering demographic diversity when using synthetic data as training sources, as synthetic data covering limited demographics can decrease overall performance.
6.7.7 Complexity of the Text Prompt in Text-to-image Generation
Finally, we argue that the complexity of the text prompt can impact the visual generation. Specifically, we utilize LLMs to provide descriptions of the domain information to prompt the text-to-image generation. We compare GTI-MM using the label-guided prompt and the LLM-assisted prompt in image generation. For the LLM-assisted prompt, we obtain 5 descriptions of each human action using ChatGPT. We compare the two prompt complexities in Table 6.4. From the results, we can discover that increasing
Table 6.4: Performance comparisons between label prompt and LLM-assist prompt in image generation.
Method              UCF101   ActivityNet   MiT51
Label prompt        78.17    57.98         49.87
LLM-assist prompt   78.42    58.78         50.23
the complexity of the prompt consistently increases the performance of GTI-MM. However, we find that the performance increase from using LLMs to create prompts is small, suggesting that simple domain information, such as label names, is sufficient to provide competitive multimodal performance under visual modality missing.
6.8 GTI-MM with Visual-Missing in Any Data
In this subsection, we extend our experimental conditions from visual modality missing in training to visual modality missing in both training and testing. This is a more severe missing condition, as we need to both increase training data efficiency and improve model robustness. Specifically, we experiment with test missing ratios q ∈ {50%, 70%, 90%}. We aim to investigate whether popular training algorithms for increasing model robustness can be combined with GTI-MM.
6.8.1 Dropout Training with GTI-MM
We first aim to investigate whether dropout training can be applied with GTI-MM. We want to highlight that dropout training [65] is widely applied to enhance model robustness in multimodal learning against missing modalities during inference. Specifically, we provide the experiments under the following conditions:
MM-Dropout: Our first experiment is the typical multimodal dropout training. This baseline has been used in different studies for comparison. Specifically, we set the dropout ratio in dropout training to be the same as the testing visual missing ratio q. This baseline uses complete modalities in the training dataset.
Table 6.5: Comparisons between MM-Dropout and GTI-MM Dropout. p and q are training and testing
visual missing ratios.
                                       Test Missing Ratio (q)
Dataset       Method            p      50      70      90
UCF101        MM-Dropout        0      62.3    56.9    51.7
              GTI-MM Dropout    90     60.8    57.7    53.8
              GTI-MM Dropout    95     59.4    56.8    52.5
ActivityNet   MM-Dropout        0      33.2    22.9    17.8
              GTI-MM Dropout    90     32.7    26.4    20.3
              GTI-MM Dropout    95     31.7    26.0    20.1
MiT51         MM-Dropout        0      46.1    41.5    35.7
              GTI-MM Dropout    90     43.2    39.3    34.2
              GTI-MM Dropout    99     40.3    37.7    34.1
MM-Zero Imputation: The second natural baseline in our experiment imputes missing training visual data with zeros. This approach is a special case of MM-Dropout training, where the missing visual samples are effectively dropped out during training. It has been used as a baseline in [55]; here, we set the training visual missing ratio p = q. A minimal sketch of modality dropout, of which zero-filling imputation is a special case, is given below.
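The following is a minimal sketch of modality dropout applied to visual embeddings before fusion; variable names are illustrative, and MM-Zero Imputation corresponds to applying the zeroing only to the samples whose visual data is actually missing.

```python
import torch

def modality_dropout(visual_emb: torch.Tensor, dropout_ratio: float, training: bool):
    """Randomly zero out visual embeddings per sample during training."""
    if not training or dropout_ratio <= 0.0:
        return visual_emb
    # Keep each sample's visual embedding with probability (1 - dropout_ratio).
    keep = torch.rand(visual_emb.size(0), 1, device=visual_emb.device) > dropout_ratio
    return visual_emb * keep.float()

# Example (inside the training loop, before late fusion):
# v = modality_dropout(v, dropout_ratio=q, training=model.training)
# logits = model.fusion_mlp(torch.cat([a, v], dim=-1))
```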
Can GTI-MM Dropout outperform MM-Dropout? We present the comparisons between MM-Dropout and GTI-MM Dropout in Table 6.5. In this experiment, we choose the training visual missing ratio p as 90% and 95%; for the MiT51 dataset, we use p = 99% instead of 95%. The comparisons indicate that when the test missing ratio q is above 70%, combining our proposed GTI-MM approach with dropout training at p ≥ 95% can outperform complete modality dropout training. This performance advantage starts to decrease as q decreases, and complete modality dropout training begins to outperform dropout training with GTI-MM when the test visual modality missing ratio is q = 50%. One plausible explanation is that only a few percent of the training visual modality is present in GTI-MM, leading to lower model performance than complete modality dropout training, where 50% of the complete visual data is available in each training epoch.
Is GTI-MM Dropout effective when p = q? Here, we compare GTI-MM Dropout with MM-Zero Imputation by controlling p = q. Our results in Figure 6.9 reveal that GTI-MM Dropout provides substantial
Figure 6.9: Comparisons on testing visual modality missing between MM-Zero Imputation and GTI-MM
Dropout, where training visual modality missing ratio p equals testing visual modality missing ratio q.
advantages in performance over MM-Zero Imputation on UCF101 and ActivityNet, and a marginal performance gain on MiT51, which uses a larger p, demonstrating the effectiveness of GTI-MM in different visual missing scenarios.
6.8.2 Prompt Learning with GTI-MM Dropout
In addition to dropout training, we are interested in whether GTI-MM is compatible with recent training approaches for improving model robustness. Specifically, we explore missing-aware prompt learning [55], a simple but effective approach to enhance model robustness. This approach applies
Table 6.6: Comparing GTI-MM dropout training with and without prompt learning when the visual modality is missing in test data. p = 99% for the MiT51 dataset and p = 95% for the other datasets.

                                          Test Missing Ratio (q)
Dataset       Prompt Learning   p (%)     50      70      90
UCF101        ✗                 95        59.4    56.8    52.5
              ✓                 95        60.7    57.3    53.7
ActivityNet   ✗                 95        31.7    26.0    20.1
              ✓                 95        32.2    26.8    20.6
MiT51         ✗                 99        40.3    37.7    34.1
              ✓                 99        42.1    37.7    34.4
to settings where modalities are missing in both training and testing. In our experiment, we only insert the learnable prompts into the last layer of the pre-trained ImageBind model due to its large parameter size; a minimal sketch of this idea is given below. We compare GTI-MM with and without prompt learning in Table 6.6. The table shows that adding prompt learning to GTI-MM consistently improves model robustness. Combining this result with the dropout training results, we conclude that our proposed GTI-MM generalizes to diverse training approaches for improving model robustness against visual modality missing at the inference stage.
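The following is a minimal, heavily simplified sketch of appending learnable prompt tokens before a frozen final transformer block; it omits the missing-pattern conditioning of the full missing-aware prompt learning method [55], and all names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class LastLayerPrompt(nn.Module):
    def __init__(self, last_block: nn.Module, embed_dim=1024, num_prompts=4):
        super().__init__()
        self.last_block = last_block                       # frozen transformer block
        # Learnable prompt tokens, the only new trainable parameters here.
        self.prompts = nn.Parameter(torch.randn(num_prompts, embed_dim) * 0.02)

    def forward(self, tokens: torch.Tensor):
        # tokens: (batch, seq_len, embed_dim) from the frozen earlier layers.
        batch = tokens.size(0)
        prompt = self.prompts.unsqueeze(0).expand(batch, -1, -1)
        tokens = torch.cat([prompt, tokens], dim=1)        # prepend learnable prompts
        out = self.last_block(tokens)
        return out[:, self.prompts.size(0):]               # drop the prompt positions
```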
6.9 Conclusion
In this chapter, we study the use of generative foundation models to assist privacy-enhancing ML in scenarios where some modalities can be unavailable due to privacy sensitivity. Specifically, we study the human-centered multimodal application of human action recognition, where the visual modality can be lost because videos and photos frequently carry direct information about a person. To address this challenge, we propose GTI-MM, which uses generative foundation models to synthesize images to impute the missing visual modality for multi-modal learning. Our experiments demonstrate that it is possible to collect only a few percent of the modality-complete data and still achieve competitive human action recognition through synthetic data imputation. We find that GTI-MM is effective with simple label-guided prompts, a few generations per class, and simple prompt tricks. Moreover, we identify that increasing the diversity, the quantity, and the prompt complexity can further increase the model performance with imputation. We want to highlight that although this chapter demonstrates the promise of using generative foundation models to impute missing visual data, our complete study in [35] shows that this approach encounters challenges in audio data imputation, implying the need to develop more advanced audio generation models.
Chapter 7
Extending Foundation Model to Assist Federated Learning
7.1 Introduction
In the previous chapters, we have demonstrated the effectiveness of generative foundation models in assisting privacy-enhancing ML for human-centered applications with data missing due to privacy-related laws and regulations. As we also presented in the previous chapter, although federated learning has shown promise in providing privacy-enhancing solutions for model training without sharing user data, its performance is often constrained by the computation resources on the client devices and the data heterogeneity within the system. Consequently, FL often exhibits lower performance than its centralized learning counterparts. In this chapter, we continue to explore the use of generative foundation models to assist privacy-enhancing computing in FL scenarios. This chapter is based on our work in [103].
7.2 Experimental Datasets
As image-related FL has been widely studied, we focus on FL using audio and speech modalities in this
chapter. Specifically, we experiment with Google Command and ESC-50 datasets:
Google Command. Our first dataset is the Google Command dataset [96]. This dataset contains speech recordings of 35 common words in daily conversations, such as "Yes," "Left," and "Up." This dataset and application are suitable for FL as the task requires relatively little computation. The complete dataset includes over 100k audio recordings, each approximately 1 second long, from over 2k unique speakers. The training set consists of 2,112 speakers, the validation set involves 256 speakers, and the test set contains 250 speakers. Based on our FL benchmark in [102], we train the model using a mobile-friendly model architecture consisting of two convolutional layers, one GRU layer, and an MLP classifier. As the preprocessing step, we extract Mel-spectrograms with a feature dimension of 128 and feed them to the model; a minimal sketch of this preprocessing and model is given below. More details regarding the experiment setup can be found in [102].
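The following is a minimal sketch of the Mel-spectrogram preprocessing and the mobile-friendly keyword-spotting model described above; kernel sizes, channel counts, and hidden sizes are illustrative assumptions rather than the exact FedAudio configuration.

```python
import torch
import torch.nn as nn
import torchaudio

# 128-bin Mel-spectrogram front end, matching the feature dimension in the text.
mel_transform = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=128)

class KeywordSpottingModel(nn.Module):
    def __init__(self, num_classes=35):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.gru = nn.GRU(input_size=32 * 32, hidden_size=128, batch_first=True)
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, waveform):                    # (batch, num_samples)
        mel = mel_transform(waveform).unsqueeze(1)  # (batch, 1, 128, time)
        feat = self.conv(mel)                       # (batch, 32, 32, time/4)
        feat = feat.permute(0, 3, 1, 2).flatten(2)  # (batch, time/4, 32*32)
        _, hidden = self.gru(feat)
        return self.classifier(hidden[-1])
```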
ESC-50. The second experimental dataset used in this chapter is the ESC-50 dataset [75]. This dataset includes sound recordings in 50 categories of animal, natural, human non-speech, domestic, and urban sounds. The whole dataset consists of 2,000 recordings, each 5 seconds long. Similar to the Google Command dataset, we conduct the same audio preprocessing and rely on a lightweight model for FL.
7.3 Generative Foundation Model Assisted FL
As we showed earlier, the generative foundation models can generate human-related data to enhance
centralized learning performance. Likewise, we wonder whether the generative foundation models can
be used to improve model performance in FL. Similar to the approach in the previous section, we aim to
investigate if we can generate synthetic data to assist FL. With this objective, we propose the GPT-FL as
demonstrated in Figure 7.1. We want to highlight that the complete experiments with image classification
are presented in [103]. The GPT-FL begins by prompting the generative foundation models to generate
the data of interest. In our example, we prompt the text-to-audio models to generate the audio sound
based on the domain information. Here, the sound class labels are the domain information in the audio
Figure 7.1: Overview of GPT-FL. In this thesis, we focus on the speech and audio application, while the
visual recognition results can be referenced in [103].
sound classification, and the words are the domain information in the keyword spotting task. Following the synthetic data generation, we pre-train the downstream model with the synthetic dataset. In the last stage, we deploy the downstream model pre-trained on synthetic data as the initialization in the typical FL framework.
7.3.1 Synthetic Data Generation
The first step in GPT-FL is synthetic data generation. As shown in previous sections, a simple label-guided prompt can generate reliable data for ML. Therefore, we create the text prompts based on the class name and feed them into the generative models to generate audio and speech. For example, we generate the speech commands dataset by prompting the text-to-speech model with the target word. Here, we use the SpeechT5 model [5] for text-to-speech generation. On the other hand, when generating audio sounds, like "sea waves," we prompt the text-to-audio models with the text prompt "a sound of sea waves." Specifically, we rely on the AudioLDM model [64] to generate the audio data. A hedged sketch of this generation step is shown below. The remaining results regarding image generation can be found in [103].
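The following is a hedged sketch of the label-guided generation step using publicly available SpeechT5 and AudioLDM checkpoints from the Hugging Face hub; the checkpoint names, the speaker-embedding source, and the generation settings are assumptions for illustration and may differ from the exact GPT-FL setup.

```python
import torch
from datasets import load_dataset
from diffusers import AudioLDMPipeline
from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

# Text-to-speech for keyword spotting: synthesize a spoken command word.
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
tts_model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
xvectors = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_emb = torch.tensor(xvectors[0]["xvector"]).unsqueeze(0)

inputs = processor(text="yes", return_tensors="pt")
speech = tts_model.generate_speech(inputs["input_ids"], speaker_emb, vocoder=vocoder)

# Text-to-audio for sound classification: synthesize a clip for a class label.
audio_pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm-s-full-v2")
audio = audio_pipe("a sound of sea waves", num_inference_steps=50,
                   audio_length_in_s=5.0).audios[0]
```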
7.3.2 Pre-training Downstream Model on Synthetic Data
After obtaining the synthetic data, our second step in GPT-FL involves pre-training the downstream model with the synthetic dataset. Specifically, we choose not to distribute the synthetic data to client devices, as doing so would increase the communication overhead and storage requirements on the clients. We then distribute the downstream model, pre-trained on the synthetic data at the server, to the clients participating in FL. This initialization concept is simple and requires no extra client communication or storage cost.
7.3.3 Finetune Trained Downstream Model on Private Client Data with FL
Our last stage is the same as standard FL. In this process, the server distributes the pre-trained downstream model to clients at the start of FL, and the clients perform local training with their private data after receiving the model weights. After local training, the updated models are sent back to the server for aggregation. This process is repeated until convergence. A minimal sketch of the overall GPT-FL flow is given below.
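The following is a minimal sketch of the three-stage GPT-FL flow with FedAvg-style aggregation; the helper functions train_epochs, sample_clients, and local_train are hypothetical placeholders for standard training utilities, and the hyperparameters are illustrative.

```python
import copy
import torch

def gpt_fl(model, synthetic_loader, clients, rounds=800, clients_per_round=10):
    # Stage 2: pre-train the downstream model on synthetic data at the server.
    model = train_epochs(model, synthetic_loader, epochs=5)

    # Stage 3: standard federated rounds initialized from the pre-trained model.
    for _ in range(rounds):
        selected = sample_clients(clients, clients_per_round)
        local_states, sizes = [], []
        for client in selected:
            local_model = copy.deepcopy(model)
            local_model = local_train(local_model, client.private_loader)
            local_states.append(local_model.state_dict())
            sizes.append(len(client.private_loader.dataset))

        # FedAvg aggregation: weighted average of client model parameters.
        total = sum(sizes)
        new_state = {
            k: sum(w * s[k].float() for w, s in zip(sizes, local_states)) / total
            for k in local_states[0]
        }
        model.load_state_dict(new_state)
    return model
```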
7.4 Experimental Details
In this experiment, we first partition the speech and audio data across clients. We partition ESC-50 with a non-IID data distribution, a widely used approach to investigate FL performance in real-life scenarios. We follow the procedure in the FedAudio benchmark to partition ESC-50 into 100 subsets using a Dirichlet distribution Dir_K(α) with α = 0.1; a minimal sketch of this partitioning is given below. Moreover, we partition the clients in the Google Speech Command dataset using the speaker IDs, as this is a natural way to partition the client data. This leads to 2,112 clients for this dataset, corresponding to the number of unique speakers. Our training hyperparameters follow the FedAudio benchmark [102]. Moreover, we report the best F1 score for all experiments. We perform the FL using the FedAvg and FedOpt algorithms.
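The following is a minimal sketch of a Dirichlet-based non-IID partition in the spirit of Dir_K(α); it is not the FedAudio implementation, and variable names are illustrative.

```python
import numpy as np

def dirichlet_partition(labels, num_clients=100, alpha=0.1, seed=0):
    """Split sample indices into num_clients subsets with Dirichlet label skew."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]

    for cls in np.unique(labels):
        cls_idx = rng.permutation(np.where(labels == cls)[0])
        # Sample the proportion of this class assigned to each client.
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(proportions)[:-1] * len(cls_idx)).astype(int)
        for client_id, split in enumerate(np.split(cls_idx, cuts)):
            client_indices[client_id].extend(split.tolist())
    return client_indices

# Example: partition the 2,000 ESC-50 labels into 100 highly skewed client subsets.
# clients = dirichlet_partition(esc50_labels, num_clients=100, alpha=0.1)
```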
Table 7.1: Accuracy performance of the generated downstream model and standard FL on benchmark
datasets. "1x Synthetic" denotes a synthetic dataset of the same size as the real data. FedAvg and FedOpt are both trained with the complete real data.
Dataset 1x Synthetic 2x Synthetic 3x Synthetic FedAvg FedOpt
Google Command 24.78 (± 0.04) 25.65 (± 0.07) 26.24 (± 0.01) 73.68 (± 0.49) 83.01 (± 0.23)
ESC-50 6.89 (± 0.29) 8.68 (± 0.35) 12.72 (± 0.31) 22.76 (± 1.01) 32.49 (± 0.57)
7.5 Can we use synthetic zero-shot learning to replace FL?
Similar to previous research threads, our first question is whether synthetic zero-shot learning can replace FL. To answer this question, we conduct two sets of training: one using the private data to perform standard FL, and the other relying on the synthetic data to conduct centralized training. We present the comparison between zero-shot learning and FL in Table 7.1. From the comparisons, we can identify that zero-shot learning with synthetic data consistently underperforms its FL counterparts, implying that synthetic data may not provide enough quality and diversity to serve as the only training data. In addition, we find that increasing the size of the synthetic data can improve model performance, but this performance increase falls short compared to FL. We want to highlight that these findings align with a previous study [57] that utilizes pure synthetic speech data to train an automatic speech recognition system, leading to considerable performance degradation. The relatively lower performance using synthetic data can be associated with the knowledge used to train these generative models. For example, many text-to-speech models are trained on English book readings, which leads to the generation of simple spoken words like "house" failing in our experiments; the generated samples of the word "house" are mostly unintelligible. In addition, we find that the text-to-audio models used in our experiments fail to understand many sound categories; for example, we empirically discover that audio generated for water sounds often sounds like music. Therefore, improving text-to-speech and text-to-audio generation quality is critical for future research utilizing generative foundation models.
Table 7.2: Accuracy comparison between the generated downstream model, standard FL, and GPT-FL. "∆Metric" represents the accuracy improvement of the best GPT-FL variant over the generated downstream model (3x Synthetic).
Dataset 3x Synthetic FedAvg FedOpt GPT-FL w/ FedAvg GPT-FL w/ FedOpt ∆Metric
Google Command 26.24 (± 0.01) 73.68 (± 0.49) 83.01 (± 0.23) 81.90 (± 0.20) 83.46 (± 0.11) ↑ 57.22
ESC-50 12.72 (± 0.31) 22.76 (± 1.01) 32.49 (± 0.57) 41.80 (± 0.32) 43.46 (± 0.30) ↑ 30.74
Figure 7.2: The figure shows the learning curve of the global model during training on the Google speech
commands dataset. The FL algorithm used is FedAvg, and the figure is from our work [103].
7.6 Can Pre-training with Synthetic Data Improve FL Performance?
Moreover, we explore whether pre-training with synthetic data can improve FL performance. Specifically,
we implement the GPT-FL experiments with commonly used FL optimizers, including FedAvg and FedOpt. We compare the results of our proposed GPT-FL with standard FedAvg and FedOpt algorithms, as
demonstrated in Table 7.2.
Effectiveness of Private Data in FL. The comparison between different experiments shows that combining private data with FL can substantially improve the model performance compared to zero-shot learning
with synthetic data. This performance improvement is consistent in both speech command classification
and audio event classification. Specifically, we can find that GPT-FL with FedOpt could achieve 2-3 times
the best test accuracy compared to zero-shot learning. Moreover, we identify that GPT-FL can consistently
outperform standard FL. For example, we discover that there is an accuracy increase of more than 10% in
Figure 7.3: The figure shows the smoothed Gradient diversity of client updates during training on the
Google speech commands dataset. The FL algorithm used is FedAvg, and the figure is from our work
[103].
audio event classification. Overall, the experimental results indicate that our proposed GPT-FL can effectively leverage the private user data with FL, increasing the final model performance compared to standard
FL.
Generated Downstream Model Enhances Convergence Speed and FL Optimization. In addition to evaluating the model performance of GPT-FL, we are interested in why pre-training on synthetic data benefits the final performance. To study this, we compare the gradient diversity of client updates between models initialized by GPT-FL and by random initialization, as suggested by [72]. Here, the definition of gradient diversity follows [100]: given the set of clients $S$ participating in FL and the model update $\Delta_i$ of client $i$, the gradient diversity $\Delta_S$ is:
$$\Delta_S = \frac{\sum_{i \in S} \lVert \Delta_i \rVert^2}{\bigl\lVert \sum_{i \in S} \Delta_i \bigr\rVert^2} \qquad (7.1)$$
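As a reference, the following is a minimal sketch of computing this quantity from the per-round client updates; the update format (a dictionary of parameter tensors per client) is an assumption for illustration.

```python
import torch

def gradient_diversity(client_updates):
    """Return sum_i ||Delta_i||^2 / ||sum_i Delta_i||^2 for one FL round."""
    # Flatten each client's update into a single vector.
    flat = [torch.cat([p.flatten() for p in update.values()])
            for update in client_updates]
    sum_sq_norms = sum(f.pow(2).sum() for f in flat)          # numerator
    norm_of_sum_sq = torch.stack(flat).sum(dim=0).pow(2).sum()  # denominator
    return (sum_sq_norms / norm_of_sum_sq).item()
```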
Specifically, we plot the test F1-score and the gradient diversity over communication rounds for GPT-FL with FedAvg and standard FedAvg on the speech commands dataset in Figure 7.2 and Figure 7.3, respectively. The learning curve with the F1-score shows that GPT-FL has a much faster convergence speed than FedAvg, and this convergence difference is particularly large during the first 250 training rounds of FL. On the other hand, from the gradient diversity plot in Figure 7.3, we can identify that the gradient diversity is lower in GPT-FL than in FedAvg at the beginning of training. As the training rounds increase beyond 500, the gradient diversity levels of GPT-FL and standard FedAvg training become similar. The observations from the diversity plot correspond well to the learning curve, as the lower gradient diversity at the beginning of GPT-FL training may lead to faster convergence in FL training. This demonstrates the effectiveness of pre-training on synthetic data in GPT-FL, minimizing client drift and leading to faster convergence and higher overall performance. We want to highlight that our observation aligns with prior findings on model initialization in FL [72].
7.7 Conclusion
In this chapter, we study the use of generative foundation models to assist federated learning. Specifically, we design a generative pre-trained model-assisted federated learning framework called GPT-FL. Similar to the work in the previous sections, GPT-FL combines generated synthetic data with private data to boost FL performance. More concretely, GPT-FL generates synthetic data from domain information, such as labels, using text prompts. GPT-FL then pre-trains the downstream model on the synthetic data, which is used as the initialization model in standard FL. Our experiments highlight that GPT-FL is simple but effective, outperforming standard FL by a large margin. Moreover, GPT-FL converges faster than standard FL due to the lower gradient diversity between clients at the start of training. In the future, we are interested in further investigating the use of GPT-FL in multimodal FL.
Chapter 8
Conclusions and Future Works
8.1 Conclusion
This dissertation investigates conventional approaches to addressing privacy concerns in the context of human-centered applications. Specifically, we provide examples of privacy-enhancing computing using data perturbation techniques and federated learning. Although these methods deliver promising results against privacy attacks, the existing literature formulates these approaches in scenarios where data is complete and of high quality. In contrast, we show that conventional privacy-enhancing approaches encounter significant challenges with limited data, a scenario that occurs frequently due to privacy restrictions on collecting or using human-related data.
To address the limited-data training challenges associated with data sensitivity in human-centered applications, we propose to leverage foundation models to generate training samples and thereby minimize the need to collect human data. We demonstrate that generating training data with foundation models effectively increases data efficiency in a machine learning system, reducing the need to collect user data, and that the proposed approach applies to both unimodal and multimodal training.
Moreover, we extend our findings from foundation model-assisted centralized learning to federated learning. Specifically, we present GPT-FL, a generative pre-trained model-assisted federated learning framework. Our proposed GPT-FL uses the generative pre-trained model to generate synthetic data that matches the domain of interest before FL training. Our experimental results demonstrate the efficacy of GPT-FL, improving the performance and convergence of conventional FL approaches by a large margin.
Overall, our thesis showcases the great potential of foundation models in generating training data to replace real human data in ML systems. These findings suggest that relying on synthetic training data makes it feasible for AI practitioners to minimize data collection or sharing while achieving similar performance. Alongside these promising results in privacy-enhancing ML with foundation models, our findings also imply that a limited amount of real human data remains critical for building a reliable model.
8.2 Future Works
Although training with limited human-centered data can obtain substantial benefits from synthetic data generated by foundation models, the quality of the generated content remains largely unexplored. For example, through our inspection of the generated examples, as demonstrated in Figure 8.1, the generations for classes like hiking share extensive similarities, with most humans in the pictures facing away from the camera. Moreover, we identify relatively lower generation quality in audio samples compared to image data. For example, using human inspection, we discovered that the TTS model fails to synthesize simple spoken words like "house". One plausible explanation is the relatively small training data size (approximately 400M sentences) and constrained domain knowledge (a book corpus) compared to other generative pre-trained models like Stable Diffusion. On the other hand, our manual inspection revealed that the audio generation model frequently encounters difficulties with certain sound categories; for example, generated audio for water sounds often resembles music. This issue could be primarily associated with the relatively small pre-training data size compared to other foundation models. One direction for future work is to conduct comprehensive studies of the quality of the generated content.
In addition to investigating the quality of the synthetic data, we aim to develop generative foundation models for human-centered data streams such as wearable data. As we highlight in the Introduction of this thesis, we have carried out large-scale studies in the hospital environment to understand the stress of hospital workers ([99, 71]), involving a massive amount of human-centered signals from wearable sensors. These wearable data, often presented as time series, have received limited attention in the development of related foundation models. Our future goal is to extend foundation models and generative foundation models to broader domains of human-centered signals, particularly wearable sensing.
Figure 8.1: Visual generation examples for classes of cooking and hiking. The image examples are from
our work in [35].
Bibliography
[1] Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, and
Boqing Gong. “Vatt: Transformers for multimodal self-supervised learning from raw video, audio
and text”. In: Advances in Neural Information Processing Systems 34 (2021), pp. 24206–24221.
[2] Mehmet Berkehan Akçay and Kaya Oğuz. “Speech emotion recognition: Emotional models,
databases, features, preprocessing methods, supporting modalities, and classifiers”. In: Speech
Communication 116 (2020), pp. 56–76.
[3] Firoj Alam, Ferda Ofli, and Muhammad Imran. “Crisismmd: Multimodal twitter datasets from
natural disasters”. In: Twelfth international AAAI conference on web and social media. 2018.
[4] Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra Perez, and Jorge Luis Reyes Ortiz. “A
public domain dataset for human activity recognition using smartphones”. In: Proceedings of the
21th international European symposium on artificial neural networks, computational intelligence
and machine learning. 2013, pp. 437–442.
[5] Junyi Ao, Rui Wang, Long Zhou, Shujie Liu, Shuo Ren, Yu Wu, Tom Ko, Qing Li, Yu Zhang,
Zhihua Wei, Yao Qian, Jinyu Li, and Furu Wei. “SpeechT5: Unified-Modal Encoder-Decoder
Pre-Training for Spoken Language Processing”. In: Annual Meeting of the Association for
Computational Linguistics. 2021.
[6] Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li,
Yu Zhang, et al. “Speecht5: Unified-modal encoder-decoder pre-training for spoken language
processing”. In: arXiv preprint arXiv:2110.07205 (2021).
[7] Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. “wav2vec 2.0: A
framework for self-supervised learning of speech representations”. In: Advances in neural
information processing systems 33 (2020), pp. 12449–12460.
[8] Emanuele Bastianelli, Andrea Vanzo, Pawel Swietojanski, and Verena Rieser. “SLURP: A spoken
language understanding resource package”. In: arXiv preprint arXiv:2011.13205 (2020).
[9] Yoshua Bengio, Yann Lecun, and Geoffrey Hinton. “Deep learning for AI”. In: Communications of
the ACM 64.7 (2021), pp. 58–65.
[10] Daniel Bone, Chi-Chun Lee, Theodora Chaspari, James Gibson, and Shrikanth Narayanan. “Signal
Processing and Machine Learning for Mental Health Research and Clinical Applications”. In: IEEE
Signal Processing Magazine 34.5 (Sept. 2017), pp. 189–196.
[11] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim,
Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. “IEMOCAP: Interactive emotional
dyadic motion capture database”. In: Language resources and evaluation 42.4 (2008), pp. 335–359.
[12] Carlos Busso, Srinivas Parthasarathy, Alec Burmania, Mohammed AbdelWahab,
Najmeh Sadoughi, and Emily Mower Provost. “MSP-IMPROV: An acted corpus of dyadic
interactions to study emotion perception”. In: IEEE Transactions on Affective Computing 8.1 (2016),
pp. 67–80.
Sebastian Caldas, Sai Meher Karthik Duddu, Peter Wu, Tian Li, Jakub Konečný,
H Brendan McMahan, Virginia Smith, and Ameet Talwalkar. “Leaf: A benchmark for federated
settings”. In: arXiv preprint arXiv:1812.01097 (2018).
[14] Houwei Cao, David G Cooper, Michael K Keutmann, Ruben C Gur, Ani Nenkova, and
Ragini Verma. “Crema-d: Crowd-sourced emotional multimodal actors dataset”. In: IEEE
Transactions on Affective Computing 5.4 (2014), pp. 377–390.
[15] Jiayi Chen and Aidong Zhang. “FedMSplit: Correlation-Adaptive Federated Multi-Task Learning
across Multimodal Split Networks”. In: Proceedings of the 28th ACM SIGKDD Conference on
Knowledge Discovery and Data Mining. 2022, pp. 87–96.
[16] Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li,
Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al. “Wavlm: Large-scale self-supervised
pre-training for full stack speech processing”. In: IEEE Journal of Selected Topics in Signal
Processing 16.6 (2022), pp. 1505–1518.
[17] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. “A simple framework
for contrastive learning of visual representations”. In: International conference on machine
learning. PMLR. 2020, pp. 1597–1607.
[18] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li,
Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. “Scaling instruction-finetuned
language models”. In: arXiv preprint arXiv:2210.11416 (2022).
[19] Leigh Clark, Philip Doyle, Diego Garaialde, Emer Gilmartin, Stephan Schlögl, Jens Edlund,
Matthew Aylett, João Cabral, Cosmin Munteanu, Justin Edwards, et al. “The state of speech in
HCI: Trends, themes and challenges”. In: Interacting with Computers 31.4 (2019), pp. 349–371.
[20] Thomas M Cover. Elements of information theory. John Wiley & Sons, 1999.
[21] Paul Cuff and Lanqing Yu. “Differential privacy as a mutual information constraint”. In:
Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. 2016,
pp. 43–54.
[22] James Davidson, Benjamin Liebald, Junning Liu, Palash Nandy, Taylor Van Vleet, Ullas Gargi,
Sujoy Gupta, Yu He, Mike Lambert, Blake Livingston, et al. “The YouTube video recommendation
system”. In: Proceedings of the fourth ACM conference on Recommender systems. 2010, pp. 293–296.
[23] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “Bert: Pre-training of deep
bidirectional transformers for language understanding”. In: arXiv preprint arXiv:1810.04805 (2018).
[24] Josep Domingo-Ferrer, Oriol Farras, Jordi Ribes-González, and David Sánchez.
“Privacy-preserving cloud computing on sensitive data: A survey of methods, products and
challenges”. In: Computer Communications 140 (2019), pp. 38–60.
[25] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, et al. “An image is
worth 16x16 words: Transformers for image recognition at scale”. In: arXiv preprint
arXiv:2010.11929 (2020).
[26] Cynthia Dwork. “Differential privacy: A survey of results”. In: International conference on theory
and applications of models of computation. Springer. 2008, pp. 1–19.
[27] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. “Calibrating noise to sensitivity
in private data analysis”. In: Theory of cryptography conference. Springer. 2006, pp. 265–284.
[28] Tiantian Feng, Brandon M Booth, Brooke Baldwin-Rodríguez, Felipe Osorno, and
Shrikanth Narayanan. “A multimodal analysis of physical activity, sleep, and work shift in nurses
with wearable sensor data”. In: Scientific reports 11.1 (2021), p. 8693.
[29] Tiantian Feng, Digbalay Bose, Xuan Shi, and Shrikanth Narayanan. “Unlocking Foundation
Models for Privacy-Enhancing Speech Understanding: An Early Study on Low Resource Speech
Training Leveraging Label-guided Synthetic Speech Content”. In: arXiv preprint arXiv:2306.07791
(2023).
[30] Tiantian Feng, Digbalay Bose, Tuo Zhang, Rajat Hebbar, Anil Ramakrishna, Rahul Gupta,
Mi Zhang, et al. “FedMultimodal: A Benchmark For Multimodal Federated Learning”. In: arXiv
preprint arXiv:2306.09486 (2023).
[31] Tiantian Feng, Hanieh Hashemi, Murali Annavaram, and Shrikanth S Narayanan. “Enhancing
Privacy Through Domain Adaptive Noise Injection For Speech Emotion Recognition”. In: ICASSP
2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE.
2022, pp. 7702–7706.
[32] Tiantian Feng, Rajat Hebbar, Nicholas Mehlman, Xuan Shi, Aditya Kommineni, and
Shrikanth Narayanan. “A Review of Speech-centric Trustworthy Machine Learning: Privacy,
Safety, and Fairness”. In: APSIPA Transactions on Signal and Information Processing 12.3 (2023).
doi: 10.1561/116.00000084.
[33] Tiantian Feng and Shrikanth Narayanan. “PEFT-SER: On the Use of Parameter Efficient Transfer
Learning Approaches For Speech Emotion Recognition Using Pre-trained Speech Models”. In:
arXiv preprint arXiv:2306.05350 (2023).
[34] Tiantian Feng and Shrikanth S Narayanan. “Discovering optimal variable-length time series
motifs in large-scale wearable recordings of human bio-behavioral signals”. In: ICASSP 2019-2019
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2019,
pp. 7615–7619.
[35] Tiantian Feng, Daniel Yang, Digbalay Bose, and Shrikanth Narayanan. “Can Text-to-image Model
Assist Multi-modal Learning for Visual Recognition with Visual Modality Missing?” In: arXiv
preprint arXiv:2402.09036 (2024).
[36] Yaroslav Ganin and Victor Lempitsky. “Unsupervised domain adaptation by backpropagation”. In:
International conference on machine learning. PMLR. 2015, pp. 1180–1189.
[37] Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, et al. “ImageBind: One embedding space to bind
them all”. In: Proceedings of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition. 2023,
pp. 15180–15190.
[38] Neil Zhenqiang Gong and Bin Liu. “Attribute inference attacks in online social networks”. In:
ACM Transactions on Privacy and Security (TOPS) 21.1 (2018), pp. 1–30.
[39] Yuan Gong, Yu-An Chung, and James Glass. “Ast: Audio spectrogram transformer”. In: arXiv
preprint arXiv:2104.01778 (2021).
[40] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han,
Shibo Wang, Zhengdong Zhang, Yonghui Wu, et al. “Conformer: Convolution-augmented
transformer for speech recognition”. In: arXiv preprint arXiv: 2005.08100 (2020).
[41] Chuan Guo, Awni Hannun, Brian Knott, Laurens van der Maaten, Mark Tygert, and Ruiyu Zhu.
“Secure multiparty computations in floating-point arithmetic”. In: arXiv preprint arXiv:2001.03192
(2020).
[42] Hanieh Hashemi, Yongqin Wang, Chuan Guo, and Murali Annavaram. “Byzantine-Robust and
Privacy-Preserving Framework for FedML”. In: arXiv preprint arXiv:2105.02295 (2021).
[43] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep residual learning for image
recognition”. In: Proceedings of the IEEE conference on computer vision and pattern recognition.
2016, pp. 770–778.
[44] Ruifei He, Shuyang Sun, Xin Yu, Chuhui Xue, Wenqing Zhang, Philip Torr, Song Bai, and
Xiaojuan Qi. “Is synthetic data from generative models ready for image recognition?” In: arXiv
preprint arXiv:2210.07574 (2022).
[45] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang,
and Weizhu Chen. “Lora: Low-rank adaptation of large language models”. In: arXiv preprint
arXiv:2106.09685 (2021).
[46] Ting-Yao Hu, Mohammadreza Armandpour, Ashish Shrivastava, Jen-Hao Rick Chang,
Hema Koppula, and Oncel Tuzel. “Synt++: Utilizing imperfect synthetic data to improve speech
recognition”. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE. 2022, pp. 7682–7686.
[47] Amna Irum and Ahmad Salman. “Speaker verification using deep neural networks: A”. In:
International Journal of Machine Learning and Computing 9.1 (2019).
[48] Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis,
Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings,
et al. “Advances and open problems in federated learning”. In: Foundations and Trends® in
Machine Learning 14.1–2 (2021), pp. 1–210.
[49] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and
Ananda Theertha Suresh. “Scaffold: Stochastic controlled averaging for federated learning”. In:
International Conference on Machine Learning. PMLR. 2020, pp. 5132–5143.
[50] Wonjae Kim, Bokyung Son, and Ildoo Kim. “ViLT: Vision-and-Language Transformer Without
Convolution or Region Supervision”. In: International Conf. on Machine Learning. 2021.
[51] John Kominek and Alan W Black. “The CMU Arctic speech databases”. In: Fifth ISCA workshop on
speech synthesis. 2004.
[52] Jakub Konečný, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and
Dave Bacon. “Federated learning: Strategies for improving communication efficiency”. In: arXiv
preprint arXiv:1610.05492 (2016).
[53] P Ravi Kumar, P Herbert Raj, and P Jelciana. “Exploring data security issues and solutions in
cloud computing”. In: Procedia Computer Science 125 (2018), pp. 691–697.
[54] Chul Min Lee and Shrikanth S Narayanan. “Toward detecting emotions in spoken dialogs”. In:
IEEE Transactions on speech and audio processing 13.2 (2005), pp. 293–303.
[55] Yi-Lun Lee, Yi-Hsuan Tsai, Wei-Chen Chiu, and Chen-Yu Lee. “Multimodal Prompting with
Missing Modalities for Visual Recognition”. In: IEEE Conference on Computer Vision and Pattern
Recognition (CVPR). 2023.
[56] Ming-Che Lee, Shu-Yin Chiang, Sheng-Cheng Yeh, and Ting-Feng Wen. “Study on emotion
recognition and companion Chatbot using deep neural network”. In: Multimedia Tools and
Applications 79.27 (2020), pp. 19629–19657.
[57] Jason Li, Ravi Gadde, Boris Ginsburg, and Vitaly Lavrukhin. “Training neural speech recognition
systems with synthetic speech augmentation”. In: arXiv preprint arXiv:1811.00707 (2018).
[58] Junnan Li, Dongxu Li, Caiming Xiong, and Steven C. H. Hoi. “BLIP: Bootstrapping
Language-Image Pre-training for Unified Vision-Language Understanding and Generation”. In:
International Conference on Machine Learning. 2022. url:
https://api.semanticscholar.org/CorpusID:246411402.
[59] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith.
“Federated optimization in heterogeneous networks”. In: Proceedings of Machine Learning and
Systems 2 (2020), pp. 429–450.
[60] Wu Li, Yanhui Zhang, and Yingzi Fu. “Speech emotion recognition in e-learning system based on
affective computing”. In: Third International Conference on Natural Computation (ICNC 2007).
Vol. 5. IEEE. 2007, pp. 809–813.
[61] Yingting Li, Ambuj Mehrish, Rishabh Bhardwaj, Navonil Majumder, Bo Cheng, Shuai Zhao,
Amir Zadeh, Rada Mihalcea, and Soujanya Poria. “Evaluating Parameter-Efficient Transfer
Learning Approaches on SURE Benchmark for Speech Understanding”. In: ICASSP 2023-2023 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2023, pp. 1–5.
[62] Paul Pu Liang, Amir Zadeh, and Louis-Philippe Morency. “Foundations and recent trends in
multimodal machine learning: Principles, challenges, and open questions”. In: arXiv preprint
arXiv:2209.03430 (2022).
[63] Bo Liu, Ming Ding, Sina Shaham, Wenny Rahayu, Farhad Farokhi, and Zihuai Lin. “When
machine learning meets privacy: A survey and outlook”. In: ACM Computing Surveys (CSUR) 54.2
(2021), pp. 1–36.
[64] Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo P. Mandic, Wenwu Wang,
and Mark D. Plumbley. “AudioLDM: Text-to-Audio Generation with Latent Diffusion Models”. In:
arXiv preprint arXiv:2301.12503 (2023).
[65] Mengmeng Ma, Jian Ren, Long Zhao, Davide Testuggine, and Xi Peng. “Are multimodal
transformers robust to missing modality?” In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition. 2022, pp. 18177–18186.
[66] Mishaim Malik, Muhammad Kamran Malik, Khawar Mehmood, and Imran Makhdoom.
“Automatic speech recognition: a survey”. In: Multimedia Tools and Applications 80.6 (2021),
pp. 9411–9457.
[67] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas.
“Communication-efficient learning of deep networks from decentralized data”. In: Artificial
intelligence and statistics. PMLR. 2017, pp. 1273–1282.
[68] Riccardo Miotto, Fei Wang, Shuang Wang, Xiaoqian Jiang, and Joel T Dudley. “Deep learning for
healthcare: review, opportunities and challenges”. In: Briefings in Bioinformatics 19.6 (2018),
pp. 1236–1246.
[69] Fatemehsadat Mireshghallah, Mohammadkazem Taram, Ali Jalali, Ahmed Taha Elthakeb,
Dean Tullsen, and Hadi Esmaeilzadeh. “Not all features are equal: Discovering essential features
for preserving prediction privacy”. In: Proceedings of the Web Conference 2021. 2021, pp. 669–680.
[70] Fatemehsadat Mireshghallah, Mohammadkazem Taram, Praneeth Vepakomma, Abhishek Singh,
Ramesh Raskar, and Hadi Esmaeilzadeh. “Privacy in deep learning: A survey”. In: arXiv preprint
arXiv:2004.12254 (2020).
[71] Karel Mundnich, Brandon M Booth, Michelle L’Hommedieu, Tiantian Feng, Benjamin Girault,
Justin L’Hommedieu, Mackenzie Wildman, Sophia Skaaden, Amrutha Nadarajan,
Jennifer L Villatte, et al. “TILES-2018, a longitudinal physiologic and behavioral data set of
hospital workers”. In: Scientific Data 7.1 (2020), p. 354.
[72] John Nguyen, Jianyu Wang, Kshitiz Malik, Maziar Sanjabi, and Michael G. Rabbat. “Where to
Begin? On the Impact of Pre-Training and Initialization in Federated Learning”. In: arXiv preprint
arXiv:2210.08090 (2022).
[73] Srinivas Parthasarathy and Shiva Sundaram. “Training strategies to handle missing modalities for
audio-visual expression recognition”. In: Companion Publication of the 2020 International
Conference on Multimodal Interaction. 2020, pp. 400–404.
[74] Leonardo Pepino, Pablo Riera, and Luciana Ferrer. “Emotion Recognition from Speech Using
wav2vec 2.0 Embeddings”. In: Proc. Interspeech 2021. 2021, pp. 3400–3404. doi:
10.21437/Interspeech.2021-703.
[75] Karol J. Piczak. “ESC: Dataset for Environmental Sound Classification”. In: Proceedings of the 23rd
Annual ACM Conference on Multimedia. Brisbane, Australia: ACM Press, Oct. 13, 2015,
pp. 1015–1018. isbn: 978-1-4503-3459-4. doi: 10.1145/2733373.2806390.
[76] Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and
Rada Mihalcea. “Meld: A multimodal multi-party dataset for emotion recognition in
conversations”. In: arXiv preprint arXiv:1810.02508 (2018).
[77] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal,
Girish Sastry, et al. “Learning transferable visual models from natural language supervision”. In:
International Conference on Machine Learning. PMLR. 2021, pp. 8748–8763.
[78] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever.
“Robust speech recognition via large-scale weak supervision”. In: OpenAI Blog (2022).
[79] Srinivasan Ramakrishnan and Ibrahiem MM El Emary. “Speech emotion recognition approaches
in human computer interaction”. In: Telecommunication Systems 52.3 (2013), pp. 1467–1478.
[80] Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečný,
Sanjiv Kumar, and H Brendan McMahan. “Adaptive federated optimization”. In: arXiv preprint
arXiv:2003.00295 (2020).
[81] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer.
“High-Resolution Image Synthesis with Latent Diffusion Models”. In: 2022 IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR). 2022, pp. 10674–10685.
[82] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman,
Mehdi Cherti, et al. “LAION-5B: An open large-scale dataset for training next generation
image-text models”. In: arXiv preprint arXiv:2210.08402 (2022). url:
https://api.semanticscholar.org/CorpusID:252917726.
[83] Dmitriy Serdyuk, Yongqiang Wang, Christian Fuegen, Anuj Kumar, Baiyang Liu, and
Yoshua Bengio. “Towards end-to-end spoken language understanding”. In: 2018 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2018, pp. 5754–5758.
[84] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. “Conceptual Captions: A
Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning”. In: Proceedings of
the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
Melbourne, Australia: Association for Computational Linguistics, July 2018, pp. 2556–2565. doi:
10.18653/v1/P18-1238.
[85] Jordan Shipard, Arnold Wiliem, Kien Nguyen Thanh, Wei Xiang, and Clinton Fookes. “Diversity
is Definitely Needed: Improving Model-Agnostic Zero-shot Classification via Stable Diffusion”.
In: 2023.
[86] Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. “Membership inference
attacks against machine learning models”. In: 2017 IEEE symposium on security and privacy (SP).
IEEE. 2017, pp. 3–18.
[87] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre,
George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam,
Marc Lanctot, et al. “Mastering the game of Go with deep neural networks and tree search”. In:
Nature 529.7587 (2016), pp. 484–489.
[88] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur.
“X-vectors: Robust DNN embeddings for speaker recognition”. In: 2018 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2018, pp. 5329–5333.
[89] Jiaming Song, Pratyusha Kalluri, Aditya Grover, Shengjia Zhao, and Stefano Ermon. “Learning
controllable fair representations”. In: The 22nd International Conference on Artificial Intelligence
and Statistics. PMLR. 2019, pp. 2164–2173.
[90] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. “UCF101: A dataset of 101 human
actions classes from videos in the wild”. In: arXiv preprint arXiv:1212.0402 (2012).
[91] Krishna Srinivasan, Karthik Raman, Jiecao Chen, Mike Bendersky, and Marc Najork. “WIT:
Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning”. In: Proceedings
of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval.
2021. url: https://arxiv.org/abs/2103.01913.
[92] Hamed Tabrizchi and Marjan Kuchaki Rafsanjani. “A survey on security challenges in cloud
computing: issues, threats, and solutions”. In: The journal of supercomputing 76.12 (2020),
pp. 9493–9532.
[93] Gokhan Tur and Renato De Mori. Spoken language understanding: Systems for extracting semantic
information from speech. John Wiley & Sons, 2011.
[94] Paul Voigt and Axel Von dem Bussche. The EU General Data Protection Regulation (GDPR): A
Practical Guide. 1st ed. Cham: Springer International Publishing, 2017.
[95] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and
Hannaneh Hajishirzi. “Self-Instruct: Aligning Language Model with Self Generated Instructions”.
In: arXiv preprint arXiv:2212.10560 (2022).
[96] Pete Warden. “Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition”. In:
arXiv preprint arXiv:1804.03209 (2018).
[97] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi,
Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer,
Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao,
Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. “Transformers:
State-of-the-Art Natural Language Processing”. In: Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing: System Demonstrations. Oct. 2020, pp. 38–45.
[98] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. “Hierarchical
attention networks for document classification”. In: Proceedings of the 2016 Conference of the North
American Chapter of the Association for Computational Linguistics: Human Language Technologies.
2016, pp. 1480–1489.
[99] Joanna C Yau, Benjamin Girault, Tiantian Feng, Karel Mundnich, Amrutha Nadarajan,
Brandon M Booth, Emilio Ferrara, Kristina Lerman, Eric Hsieh, and Shrikanth Narayanan.
“TILES-2019: A longitudinal physiologic and behavioral data set of medical residents in an
intensive care unit”. In: Scientific Data 9.1 (2022), p. 536.
[100] Dong Yin, Ashwin Pananjady, Max Lam, Dimitris Papailiopoulos, Kannan Ramchandran, and
Peter Bartlett. “Gradient diversity: a key ingredient for scalable distributed learning”. In:
International Conference on Artificial Intelligence and Statistics. PMLR. 2018, pp. 1998–2007.
[101] Qiying Yu, Yimu Wang, Ke Xu, Yang Liu, and Jingjing Liu. “Multimodal Federated Learning via
Contrastive Representation Ensemble”. In: International Conference on Learning Representations.
2023. url: https://openreview.net/forum?id=Hnk1WRMAYqg.
[102] Tuo Zhang, Tiantian Feng, Samiul Alam, Sunwoo Lee, Mi Zhang, Shrikanth S Narayanan, and
Salman Avestimehr. “Fedaudio: A federated learning benchmark for audio tasks”. In: ICASSP
2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE.
2023, pp. 1–5.
[103] Tuo Zhang, Tiantian Feng, Samiul Alam, Mi Zhang, Shrikanth S Narayanan, and
Salman Avestimehr. “Gpt-fl: Generative pre-trained model-assisted federated learning”. In: arXiv
preprint arXiv:2306.02210 (2023).
[104] Yuanyuan Zhang, Jun Du, Zirui Wang, Jianshu Zhang, and Yanhui Tu. “Attention-based fully
convolutional network for speech emotion recognition”. In: 2018 Asia-Pacific Signal and
Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE. 2018,
pp. 1771–1775.