BEATBOXING PHONOLOGY
by
Gifford Edward Reed Blaylock
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(LINGUISTICS)
August 2022
Copyright 2022 Gifford Edward Reed Blaylock
For Ellen, sine qua non.
Acknowledgments
There are not enough pages to give everyone who made this dissertation a reality the thanks
they deserve, but I’ll try anyway.
The first thanks goes to Louis Goldstein for support and endless patience. His aptly
timed pearls of wisdom and nuggets of clarity have triggered more major shifts in my
thinking than I can count. And when I’ve finished reeling from a mental sea change (or even
when the waters are calm), little has been more comforting than his calm demeanor and
readiness to help me accept the new situations I find myself in. Louis, thank you for taking
the time to drag me toward some deeper understanding of language and life.
As for the other members of my committee, I am grateful to Khalil Iskarous for
showing me how to reconceptualize complicated topics into simple problems and for
reinforcing the understanding that the superficial differences we see in the world are only
illusions. And I am grateful to Jason Zevin for letting me contribute to his lab meetings
despite not knowing what I was talking about and for offering me summer research funding
even though I’m pretty sure the project moved backward instead of forward because of me.
And thanks to my committee as a whole who together, though perhaps without realizing it,
did something I would never in a million years have predicted: they sparked my interest in
history—a subject which only a few years ago I cared nothing at all for but which I now find
indispensable. Thanks to all three of you for helping me make it this far.
I have been lucky to have the guidance of USC Linguistics department faculty (some
now moved on to other institutions) like Dani Byrd, Elsi Kaiser, Rachel Walker, Karen
Jesney, and Mary Byram Washburn. Substantial credit for any of my accomplishments at
USC goes to Guillermo Ruiz: he has always worked hard to help me, even when I seemed
determined to shoot myself in the foot; he can never be compensated enough. Many of the
insights in this dissertation can be traced back to conversations with my fellow beatboxing
scientists: Nimisha Patil and Timothy Greer at USC and Seunghun Lee, Masaki Fukuda, and
Kosei Kimura at International Christian University in Tokyo. I have also greatly benefited
from the camaraderie and insights of many of my fellow USC graduate students in
Linguistics including Caitlin Smith, Brian Hsu, Jessica Campbell, Jessica Johnson, Yijing Lu,
Ian Rigby, Jesse Storbeck, Luis Miguel Toquero Perez, Adam Woodnutt, Yifan Yang, Hayeun
Jang, Miran Oh, Tanner Sorensen, Maury Courtland, Binh Ngo, Samantha Gordon Danner,
Alfredo Garcia Pardo, Yoonjeong Lee, Ulrike Steindl, Mythili Menon, Christina Hagedorn,
Lucy Kim, and Ben Parrell.
Special thanks to my cohort-mate Charlie O’Hara who gave me massive amounts of
support and whom I hope I supported in kind. Charlie encouraged my nascent
teaching-community endeavors and showed me—through both his teaching and
research—that it’s possible to actually do things and not just talk about them. Outside of
academia, I’m indebted to Charlie for coaxing me out of my reclusive comfort zone in our
first few grad school years with invitations to his holiday parties and improv performances;
Los Angeles had been intimidating, but Charlie made it less so and opened the door to
almost a decade of confident exploration (though I’ve barely scratched the surface of LA).
Thanks to the USC Speech Production and Articulation kNowledge (SPAN)
group—especially Shri Narayanan and Asterios Toutios—for the research opportunities, for
giving me some chances to flex my coding skills a little, and for collecting the beatboxing
data used in this dissertation (not to mention for securing the NIH and NSF grants that
made this work possible in the first place). Special thanks to Adam Lammert for teaching me
just enough about image analysis to be dangerous during my first year of grad school; our
more recent conversations about music, language, and teaching have been a treat and I hope
to continue them for as long as we can.
The members of the LSA’s Faculty Learning Community (FLC) have shown me time
and time again the importance of community for thoughtful, justice-anchored teaching and
for staying sane during a global pandemic. Our meetings have been the only stable part of
my schedule for years now, and if they ever end I doubt I’ll know what to do with myself.
Thanks to Kazuko Hiramatsu, Christina Bjorndahl, Evan Bradley, Ann Bunger, Kristin
Denham, Jessi Grieser, Wesley Leonard, Michael Rushforth, Rosa Vallejos, and Lynsey Wolter
for helping me understand more than I thought I could. I am particularly indebted to Michal
Temkin Martinez, my academic big sister, who generously created a me-sized opening in the
FLC and who has been an unending font of encouragement since the moment we met.
My colleagues from the USC Ballroom Dance Team are responsible for years of
happiness and growth. I am grateful to Jeff Tanedo, Sanika Bhargaw, Katy Maina, Kim
Luong, Alison Eling, Dana Dinh, Andrew Devore, Alex Hazen, Max Pflueger, Eric
Gauderman, Mark Gauderman, Ashley Grist, Rachel Adams, Sayeed Ahmed, Zoe Schack,
Queenique Dinh, Michael Perez, and so many others for their leadership, support, and
camaraderie during our time together. Tasia Dedenbach was a superb dance partner and
friend; she gave me the confidence to trust my eye in my academic visualizations and also
gave me my first academic poster template which I abuse to this day. Alexey Tregubov is an
absolute gem of a human being who, simply by demonstrating his own work ethic, is the
reason I was able to actually finish any of my dissertation chapters. Sara Kwan has been the
most thoughtful friend a person could ask for. She is a brilliant and patient sounding-board
for all of my overthinking—whether it be about personal crises, professional crises, or crises
that develop as we indulge our mutual fondness for board games—and her well-timed snack
deliveries (like the mint chocolate Milano cookies I’m eating at this very moment) are always
appreciated. Lorena Bravo and Jonathan Atkinson have gone from being just my dance
teachers to being cherished friends. Thank you for the encouragement, the life lessons, and
for showing me how to be an Angeleno.
Sarah Harper deserves special recognition, as all who know her can attest. There have
been many consequences to her decade-long scheme of unabashedly hijacking my social
circles at two different universities. Along with all the good times we’ve had with the friends
we now share, she has also had perhaps the most punishing job of any of my friends—having
to deal with me in my crankiest moods. Sarah, thank you for your unwavering support
through it all.
Thanks to Joyce Soares, Ed Soares, Kethry Soares, and Gunnar Jaffarian for treating
me like family even when I wasn’t officially family yet. Lane Stowell, thank you for being a
true friend to Erin for these last several years, and for taking care of her when I have been
unable. And thanks to Angela Boyer, who holds the record for being friends with me the
longest despite three time zones and a bunch of miles coming between us when I moved to
California. It was Angela who suggested I take my first introductory Linguistics class, which
in turn triggered all the events that led me to writing these words.
Because this is a Linguistics dissertation, this is the paragraph where I am supposed to
thank my parents for instilling in me an early passion for language and learning and thank
my grandmother for teaching me to love reading—all of which is perfectly true and for which
I am indeed grateful. But in the context of this particular dissertation, perhaps even more
credit is owed to them for passing along to me their love of music.
Mom and Dad, I know you were embarrassed, when I was small and you took me to a
concert where Raffi started asking kids what kind of music they listened to at home, because
you thought that we didn’t listen to very much music at all. But my life has been filled with
music thanks to you: listening to Dad’s tuba on his bands’ cassette tapes or at Tuba
Christmas; hearing Mom ring out the descant of my favorite hymns; listening to the two of
you harmonizing on songs from the ancient past; singing in Mom-Mom’s choir at Christmas;
learning musical mnemonics in children’s choir that haunt me to this day; and watching in
awe (and listening in some agony) as Dad started learning the violin during a mid-life stroke
of inspiration. All of this, plus the fourteen years of piano lessons you paid for—for what
little use I made of them—and now a dissertation about vocal music. Altogether, I’d say you
can safely put any worries of my musical impoverishment to rest.
That leaves two very important women to thank. Erin Soares, thank you for being
patient with me every time I moved the goalposts on you; I am happy to report with some
confidence that my dissertation is finally, truly finished. I owe that to you: you have given me
a lifestyle that makes me feel safe and comfortable enough to write a dissertation—no small
feat. With you I am confident, capable, and loved, and I can’t wait to spend the rest of my life
making you feel the same. And Mairym Llorens Monteserín, thank you for… everything. I
couldn’t have done this without you.
Table of contents
Dedication ii
Acknowledgments iii
List of tables x
List of figures xii
Abstract xvii
Chapter 1: Introduction 1
Chapter 2: Method 21
Chapter 3: Sounds 44
Chapter 4: Theory 124
Chapter 5: Alternations 153
Chapter 6: Harmony 176
Chapter 7: Beatrhyming 242
Chapter 8: Conclusion 286
References 293
Appendix 308
List of tables
Table 1. Notation and descriptions of the most frequent beatboxing sounds. 67
Table 2. The most frequent beatboxing sounds displayed according to constrictor
(top) and airstream (left). 67
Table 3. The most frequent sounds displayed according to constrictor (top) and
constriction degree (left). 68
Table 4. The most frequent sounds displayed according to constrictor (top) and
musical role (left). 68
Table 5. Notation and descriptions of the medium-frequency beatboxing sounds. 77
Table 6. High and medium frequency beatboxing sounds displayed by constrictor
(top) and airstream mechanism (left). 78
Table 7. High and medium frequency sounds displayed by constrictor (top) and
constrictor degree (left). Medium frequency sounds are bolded. 78
Table 8. High and medium frequency beatboxing sounds displayed by constrictor
(top) and musical role (left). Medium frequency sounds are bolded. 79
Table 9. Notation and description of the low-frequency beatboxing sounds. 89
Table 10. High, medium, and low (bolded) frequency sounds displayed by
constrictor (top) and airstream mechanism (left). 90
Table 11. High, medium, and low (bolded) frequency sounds displayed by
constrictor (top) and constriction degree (left). 91
Table 12. High, medium, and low (bolded) frequency sounds displayed by
constrictor (top) and musical role (left). 91
Table 13. Notation and descriptions for the lowest frequency beatboxing sounds. 104
Table 14. All the described beatboxing sounds that could be placed on a table,
arranged by constrictor (top) and airstream mechanism (left). 106
Table 15. All the described beatboxing sounds that could be placed on a table,
arranged by constrictor (top) and constriction degree (left). 107
Table 16. All the described beatboxing sounds that could be placed on a table,
arranged by constrictor (top) and musical role (left). 108
Table 17. 22 beatboxing sounds/sound families, 37 minimal differences. 116
Table 18. 21 sounds with maximal dispersion, 20 minimal differences. 116
Table 19. 23 English consonants, 57 minimal differences ([l] conflated with [r]).
Voiceless on the left, voiced on the right. 117
Table 20. Summary of the minimal sound pair and entropy (place) analyses for
beatboxing, a hypothetical maximally distributed system, and English
consonants. 117
Table 21. Non-exhaustive lists of state-, parameter-, and graph-level properties for
dynamical systems used in speech. 132
Table 22. Unforced Kick Drum environments. 174
Table 23. Kick Drum environment type observations. 174
Table 24. Kick Drum token observations. 175
Table 25. Summary of the five beat patterns analyzed. 188
Table 26. The beatboxing sounds used in this chapter. 191
Table 27. Sounds of beatboxing used in beat pattern 5. 192
Table 28. Sounds of beatboxing used in beat pattern 9. 199
Table 29. Sounds of beatboxing used in beat pattern 4. 202
Table 30. Sounds of beatboxing used in beat pattern 10. 207
Table 31. Sounds of beatboxing used in beat pattern 1. 213
Table 32. Non-exhaustive lists of state-, parameter-, and graph-level properties for
dynamical systems used in speech. 231
Table 33. Sounds of beatboxing used in this chapter. 248
Table 34. Contingency table of beatboxing sound constrictors (top) and the speech
sounds they replace (left). 266
List of figures
Figure 1. A simple hierarchical tree structure with alternating strong-weak nodes. 24
Figure 2. Hierarchical strong-weak alternations in which one level (“beats”) is
numbered. 24
Figure 3. Hierarchical strong-weak alternations. 25
Figure 4. Two levels below the beat level have further subdivisions. 26
Figure 5. A metrical grid of the rhythmic structure of the first two lines of an
English limerick. 26
Figure 6. A metrical grid representation of the metrical structure of Figure 4. 27
Figure 7. A metrical grid representation in which each beat has three subdivisions. 27
Figure 8. A metrical grid in which beats 1 and 3 have four subdivisions while beats
2 and 4 have three subdivisions. 28
Figure 9. A metrical grid of the beatboxing sequence {B t PF t B B B PF t}. 28
Figure 10. A drum tab representation of the beat pattern in Figure 9, including a
label definition for each sound. 29
Figure 11. A simplification of a drum tab from Chapter 5: Alternations. 30
Figure 12. Waveform, spectrogram, and text grid of three Kick Drums produced at
relatively long temporal intervals. 33
Figure 13. LAB region, unfilled during a Vocalized Tongue Bass (left) and filled
during the Kick Drum that followed (right). 37
Figure 14. LAB2 region filled during a Liproll (left) and empty after the Liproll is
complete (right). 38
Figure 15. COR region, filled by an alveolar tongue tip closure for a Closed Hi-Hat
{t} (left), filled by a linguolabial closure {tbc} (center), and empty (right). 38
Figure 16. DOR region, filled by a tongue body closure during a Clickroll (left) and
empty when the tongue body is shifted forward for the release of an
Inward Snare (right). 39
Figure 17. FRONT region for Liproll outlined in red, completely filled at the
beginning of the Liproll (left) and empty at the end of the Liproll (right). 39
Figure 18. VEL region demonstrated by a Kick Drum, completely empty while the
velum is lowered for the preceding sound (left) and filled while the Kick
Drum is produced (right). 40
Figure 19. LAR region demonstrated by a Kick Drum (an ejective sound),
completely empty before laryngeal raising (left) and filled at the peak of
laryngeal raising (right). 40
Figure 20. Beatboxing sounds organized by maximal dispersion in a continuous
phonetic space (top) vs organization along a finite number of phonetic
dimensions (bottom). 47
Figure 21. Rank-frequency plot of beatboxing sounds. 55
Figure 22. Histogram of the residuals of the power law fit. 55
Figure 23. Scatter plot of the residuals of the power law fit (gray) against the
expected values (black). 56
Figure 24. Log-log plot of the token frequencies (gray) against the power law fit
(black). 56
Figure 25. The discrete cumulative density function for the token frequencies of the
sounds in this data set (gray) compared to the expected function for
sounds following a power law distribution (black). 57
Figure 26. The discrete cumulative density function (token frequency) of sounds in
this beat pattern (gray, same as Figure 25) against the density function of
the same sounds re-ordered by beat pattern frequency order (black). 58
Figure 27. The forced Kick Drum. 61
Figure 28. The PF Snare. 62
Figure 29. The Inward K Snare. 62
Figure 30. The unforced Kick Drum. 63
Figure 31. The Closed Hi-Hat. 64
Figure 32. The dental closure. 70
Figure 33. The linguolabial closure (dorsal). 70
Figure 34. The linguolabial closure (non-dorsal). 71
Figure 35. The alveolar closure. 71
Figure 36. The alveolar closure (frames 1-2) vs the Water Drop (Air). 71
Figure 37. The Spit Snare. 72
Figure 38. The Throat Kick. 73
Figure 39. The Inward Liproll. 74
Figure 40. The Tongue Bass. 75
Figure 41. Humming. 80
Figure 42. The Vocalized Liproll, Inward. 80
Figure 43. The Closed Tongue Bass. 81
Figure 44. The Liproll. 82
Figure 45. The Water Drop (Tongue). 82
Figure 46. The (Inward) PH Snare. 83
Figure 47. The Inward Clickroll. 84
Figure 48. The Open Hi-Hat. 84
Figure 49. The lateral alveolar closure. 85
Figure 50. The Sonic Laser. 85
Figure 51. The labiodental closure. 86
Figure 52. The Clop. 92
Figure 53. The D Kick. 93
Figure 54. The Inward Bass. 93
Figure 55. The Low Liproll. 94
Figure 56. The Hollow Clop. 94
Figure 57. The Tooth Whistle. 95
Figure 58. The Voiced Liproll. 95
Figure 59. The Water Drop (Air). 96
Figure 60. The Clickroll. 96
Figure 61. The D Kick Roll. 97
Figure 62. The High Liproll. 97
Figure 63. The Inward Clickroll with Liproll. 98
Figure 64. The Lip Bass. 98
Figure 65. tch. 99
Figure 66. The Liproll with Sweep Technique. 99
Figure 67. The Sega SFX. 100
Figure 68. The Trumpet. 100
Figure 69. The Vocalized Tongue Bass. 101
Figure 70. The High Tongue Bass. 101
Figure 71. The Kick Drum exhale. 102
Figure 72. Histogram of 10,000 random sound pair trials in a 6 x 7 x 2 matrix. 118
Figure 73. Histogram of 10,000 random sound pair trials in a 4 x 7 x 2 matrix. 119
Figure 74. A lip closure time function for a spoken voiceless bilabial stop [p], taken
from real-time MRI data. 133
Figure 75. Schematic example of a spring restoring force point attractor. 134
Figure 76. Schematic example of a critically damped mass-spring system. 135
Figure 77. Schematic example of a critically damped mass-spring system with a
soft spring. 136
Figure 78. Position and velocity time series for labial closures for a beatboxing Kick
Drum {B} (left) and a speech voiceless bilabial stop [p] (right). 144
Figure 79. Parameter values tuned for a specific speech unit are applied to a point
attractor graph, resulting in a gesture. 150
Figure 80. Speech-specific and beatboxing-specific parameters can be applied
separately to the same point attractor graph, resulting in either a speech
action (a gesture) or a beatboxing action. 150
Figure 81. Forced/Classic Kick Drum. Larynx raising, no tongue body closure. 156
Figure 82. Unforced Kick Drum. Tongue body closure, no larynx raising. 158
Figure 83. Spit Snare vs Unforced Kick Drum. 159
Figure 84. Forced Kick Drum beat patterns. 165
Figure 85. Unforced Kick Drum beat patterns. 166
Figure 86. Beat patterns with both forced and unforced Kick Drums. 168
Figure 87. An excerpt from a PointTier with humming. 171
Figure 88. A sequence of a lateral alveolar closure {tll}, unforced Kick Drum {b},
and Spit Snare {SS}. 176
Figure 89. A beat pattern that demonstrates the beatboxing technique of humming
with simultaneous oral sound production. 180
Figure 90. This beat pattern contains five sounds: a labial stop produced with a
tongue body closure labeled {b}, a dental closure {dc}, a lateral closure
{tll}, and a lingual egressive labial affricate called a Spit Snare {SS}. All of
the sounds are made with a tongue body closure. 181
Figure 91. Drum tab of beat pattern 5. 193
Figure 92. Regions for beat pattern 5. 194
Figure 93. Time series of vocal tract articulators used in beat pattern 5, captured
using a region of interest technique. 195
Figure 94. Time series and rtMRI snapshots of forced and unforced Kick Drums. 196
Figure 95. Drum tab of beat pattern 9. 200
Figure 96. Time series and gestures of beat pattern 9. 200
Figure 97. Drum tab notation for beat pattern 4. 203
Figure 98. Regions used to make time series for the Liproll beat pattern. 204
Figure 99. Time series of the beat pattern 4 (Liproll showcase). 206
Figure 100. Drum tab for beat pattern 10. 208
Figure 101. The regions used to make the time series for beat pattern 10. 210
Figure 102. Time series of beat pattern 10. 211
Figure 103. Drum tab notation for beat pattern 1. 214
Figure 104. Regions for beat pattern 1 (Clickroll showcase). 216
Figure 105. Time series of beat pattern 1. 217
Figure 106. The DOR region for the Clickroll showcase (beat pattern 1) in the first
{CR dc B ^K}. 218
Figure 107. Each forced Kick Drum in the beat pattern in order of occurrence. 218
Figure 108. Time series and real-time MRI snapshots of forced and unforced Kick
Drums. 219
Figure 109. A schematic coupling graph and gestural score of a Kick Drum and Spit
Snare. 234
Figure 110. A schematic coupling graph and gestural score of a Kick Drum,
humming, and a Spit Snare. 235
Figure 111. A schematic coupling graph and gestural score of a {b CR B ^K}
sequence. 237
Figure 112. Waveform, spectrogram, and text grid of the beatrhymed word
“dopamine”. 248
Figure 113. Bar plot of the expected counts of constrictor matching with no task
interaction. 251
Figure 114. Bar plot of the expected counts of constrictor matching with task
interaction. 251
Figure 115. Bar plots of the expected counts of K Snare constrictor matching with
no task interaction. 253
Figure 116. Bar plots of the expected counts of K Snare constrictor matching with
task interaction. 253
Figure 117. Serial and hierarchical representations of a 16-bar phrase (8 lines with 2
measures each). 256
Figure 118. Example of a two-line beat pattern. 263
Figure 119. Bar plot showing measured totals of constrictor matches and
mismatches. 265
Figure 120. Bar plots with counts of the actual matching and mismatching
constrictor replacements everywhere except the back beat. 268
Figure 121. Bar plot with counts of the actual matching and mismatching
constrictor replacements on just the back beat. 269
Figure 122. Four lines of beatrhyming featuring two replacement mismatches
(underlined). 270
Figure 123. Counts of replacements by beatboxing sounds (bottom) against the
manner of articulation of the speech sound they replace (left). 272
Figure 124. Counts of replacements by beatboxing sounds (bottom) against the
speech sound they replace (left). 272
Figure 125. Four 16-bar beatboxing (sections B and D) and beatrhyming (sections
C and E) phrases with letter labels for each unique sound sequence. 275
Figure 126. Beat pattern display and repetition ratio calculations for sections B, C,
D, and E. 276
Figure 127. Tableau in which a speech labial stop is replaced by a K Snare on the
back beat. 283
Figure 128. Tableau in which a speech labial stop is replaced by a Kick Drum off
the back beat. 283
Figure 129. Waveform, spectrogram, and text grid of the beatrhymed word “move”
with a Kick Drum splitting the vowel into two parts. 287
Figure 130. Waveform, spectrogram, and text grid of the beatrhymed word “sky”
with a K Snare splitting the vowel into two parts. 288
Figure 131. The anthropophonic perspective. 296
Abstract
Beatboxing is a type of non-linguistic vocal percussion that can be performed as an
accompaniment to linguistic music or as a standalone performance. This dissertation is the
first major effort to begin to probe beatboxing cognition—specifically beatboxing
phonology—and to develop a theoretical framework relating representations in speech and
beatboxing that can account for phonological phenomena that speech and beatboxing share.
In doing so, it contributes to the longstanding debate about the domain-specificity of
language: because hallmarks of linguistic phonology like contrastive units (Chapter 3),
alternations (Chapter 5), and harmony (Chapter 6) also exist in beatboxing, beatboxing
phonology provides evidence that beatboxing and speech share not only the vocal tract but
also organizational foundations, including a certain type of mental representation and
the coordination of those representations.
Beatboxing has phonological behavior based in its own phonological units and
organization. One could choose to model beatboxing with adaptations of either features or
gestures as its fundamental units. But as Chapter 4: Theory discusses, a gestural approach
captures both domain-specific aspects of phonology (learned targets and parameter settings
for a given constriction) and domain-general aspects (the ability of gestural representations
to contrast, to participate in class-based behavior, and to undergo qualitative changes).
Gestures have domain-specific meaning within their own system (speech or beatboxing)
while sharing a domain-general conformation with other behaviors. Gestures can do this by
explicitly connecting the tasks specific to speech or to beatboxing with the sound-making
potential of the vocal substrate they share; this in turn creates a direct link between speech
gestures and beatboxing gestures. This link is formalized at the graph level of the dynamical
systems by which gestures are defined.
The direct formal link between beatboxing and speech units makes predictions about
what types of phonological phenomena beatboxing and speech units are able to
exhibit—including phonological alternations and harmony mentioned above. It also predicts
that the phonological units of the two domains will be able to co-occur, with beatboxing and
speech sounds interwoven together by a single individual. This type of behavior is known as
“beatrhyming” (Chapter 7: Beatrhyming).
These advantages of the gestural approach for describing speech, beatboxing, and
beatrhyming underscore a broader point: that regardless of whether phonology is modular or
not, the phonological system is not encapsulated away from other cognitive domains, nor
impermeable to connections with other domains. On the contrary, phonological units are
intrinsically related to beatboxing units—and, presumably, to other units in similar
systems—via the conformation of their mental representations. As beatrhyming helps to
illustrate, the properties that the phonological system shares with other domains are also the
foundation of the phonological system’s ability to flexibly integrate with other (e.g., musical)
domains.
CHAPTER 1: INTRODUCTION
Beatboxing is a type of non-linguistic vocal percussion that can be performed as an
accompaniment to linguistic music (e.g. rapping or a cappella singing) or as a standalone
performance—the latter being primarily the focus here. Beatboxers are increasingly
recognized in both scientific and popular literature as artists who push the limits of the vocal
tract with unspeechlike vocal articulations that have only recently been captured with
modern imaging technology. Scientific study of beatboxing is valuable on its own merits,
especially for beatboxers hoping to teach and learn beatboxing more effectively. But much of
beatboxing science also serves as a type of speech and linguistic science, aimed at
incorporating beatboxing into innovative speech therapy techniques or understanding the
nature of speech.
This dissertation contributes to both beatboxing science and speech science. As a
piece of beatboxing science, the contribution is the first major effort (that I know of) to
begin to probe beatboxing cognition—specifically beatboxing phonology, including the
discovery of phonological alternations, phonological harmony, and the development of a
theoretical framework relating representations in speech and beatboxing that can account for
these findings. As a type of linguistic science, the dissertation contributes to the longstanding
debate about the domain-specificity of language: because hallmarks of linguistic phonology
like alternations and harmony also exist in beatboxing, beatboxing phonology provides
further evidence that phonology is rooted in domain-general cognition (rather than existing
as, say, a unique and innate component of a modular language faculty).
Section 1 introduces the art of beatboxing and briefly summarizes the current state of
beatboxing science, particularly with an eye to beatboxing cognition. Section 2 establishes
the context for how research on a distinctly non-linguistic behavior like beatboxing can be
considered relevant to linguistic inquiry.
1. The art and science of beatboxing
1.1 Beatboxing art
The foundation of beatboxing lies in hip hop. The “old school” of beatboxing began as
human mimicry of the sounds of a beat box, a machine that synthesizes percussion sounds
and other sound effects. The beat box created music that an MC could rap over; when a beat
box wasn’t available, a human could perform the role of a beat box by emulating it vocally.
The two videos below demonstrate how beatboxing was used by early artists like Doug E
Fresh and Buffy to give other MCs a beat to rap over.
Doug E. Fresh and Slick Rick, “La Di Da Di” (1985)
https://www.youtube.com/watch?v=FYHm-B0tnCs
(The beat pattern starts in earnest around 0:48. Before that, you can hear Doug E. Fresh
using his signature Clickrolls—a lateral lingual ingressive trill.)
Fat Boys, “Human Beat Box” (1984)
https://www.youtube.com/watch?v=jJewbFZHI34
(Buffy was well-known for his “bass-heavy breathing technique” (source) that you can hear
from 0:10-0:15.)
The last four decades have given beatboxers plenty of time to innovate in both artistic
composition and beatbox battles that demonstrate mechanical skill. Modern beatboxing
performances often stand alone: if there are any words, they are only occasional and woven
by the beatboxer into the beat pattern rather than said by a second person. (There are art
forms like beatrhyming where singing/rapping and beatboxing are fully integrated, but this is
a different vocal behavior; see Chapter 7: Beatrhyming. Combining words or other vocal
behaviors into beatboxing is sometimes called multi-vocalism.) The next two videos show
that beat patterns in the “new school” of beatboxing may be faster, reflecting contemporary
popular music styles.
Reeps One “Metal Jaw” (2013)
https://www.youtube.com/watch?v=uRnqu2YE95A
Mina Wening (2017)
https://www.youtube.com/watch?v=eCIbkozHj4A
Beatboxing evolves through innovation of new sounds or sound variations, patterns (e.g.,
combinations of sounds or styles of breathing), and integration with other behaviors (e.g.,
beatboxing flute, beatboxing cello, beatrhyming, beatboxing with other beatboxers). For
novice beatboxers, the goal is to learn how to sound as good as experts; for expert
beatboxers, the goal is to create art through innovation while keeping up with trends. This
innovation is constrained by both physical and cultural forces. The major physical constraint
is the vocal tract itself which limits the speed and quality (i.e., constriction degree and
location) of possible movements; new beatboxing sounds and patterns are thought to arise
from testing these physical limitations. As for cultural forces, both the musical genres that
inspire beatboxing and the preferences of beatboxers themselves have a role. Three examples
follow.
First, beatboxing started without words, and today most beatboxers still rarely speak
during a beatboxing performance. Though it is not uncommon to hear a word or short
phrase during a beat pattern, usually with non-modal phonation, the fact that beatrhyming
has its own name to distinguish it from beatboxing implies that it is not the same art form.
Second, since the initial role of beatboxing was to provide a clear beat by emulating drum
sounds, non-continuant stops and affricates became very common while continuants like
vowels are almost never used. When drawing on inspiration from other musical sources,
related genres like electronic dance music would have been appealing for their percussive
similarities. Contemporary beat patterns keep the percussive backbone, though some
sustained sounds (i.e., modal or non-modal phonation for pitch) can be used concurrently as
well. Third, and more broadly, beatboxing shares musical properties with a wide range of
(Western) genres, resulting in common patterns. One common property is 4/4 time, which
signifies that the smallest musical phrases each contain four main events (which can be
thought of as grouped into pairs). Another common property is the placement
of emphasis on the “back beat” (beat 3 in 4/4 time) via snare sounds (Greenwald, 2002).
These types of properties, together with the vocal modality, shape the musical style and
evolution of beatboxing. Innovation in beatboxing is done within these constraints.
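The metrical constraint just described can be sketched as a short list of beat slots. This is only an illustrative toy (the particular sequence of sounds is hypothetical); the one property taken from the text above is that a measure of 4/4 has four main events, with a snare sound on beat 3:

```python
# A hypothetical 4/4 measure: four main events, grouped into pairs.
# Per the back-beat convention described above, a snare sound lands on beat 3.
measure = ["kick", "hi-hat", "snare", "hi-hat"]

def backbeat_position(measure):
    """Return the 1-indexed beat carrying the snare, or None if absent."""
    for i, sound in enumerate(measure, start=1):
        if sound == "snare":
            return i
    return None

print(backbeat_position(measure))  # → 3
```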
Common advice in beatboxing pedagogy is to learn incrementally. New aspiring
beatboxers are encouraged to start by drilling the fundamentals: basic sounds like Kick
Drums {B} [p’], Closed Hi-Hats {t} [t’], and PF Snares {PF} [pf’] should become familiar
first in isolation, then in combos and beat patterns to practice them in a rhythmic context.
(Curly bracket notation indicates a beatboxing sound, while square bracket notation
indicates International Phonetic Alphabet notation.) Once the relatively small set of sounds
is secure, it is time to learn new sounds that facilitate breath management—this is important
for performing progressively more complex and intensive beat patterns that demand more
air regulation. At the same time, new beatboxers also need to focus on “technicality”, a jargon
word in the beatboxing community that refers to how accurately and precisely a sound is
performed. Reference to and imitation of other beatboxers is common for establishing ideals
and task targets. All of these basics are the foundations from which a beatboxer can start to
innovate by making novel sounds and beat patterns; and beatboxers continue to revisit these
different facets of their art to make improvements at multiple time scales (i.e., improving one
sound, improving a combination of sounds, developing a flow or a new style). As a
consequence of all this, beatboxers are often aware of or focusing on some facet of their
beatboxing as they perform in a way that fluent speakers of a language may not be aware of
their own performance; moreover, beatboxers at different stages in the learning process (or
even at the same stage) may beatbox very differently depending on the sounds they know
and which facet of beatboxing they are practicing.
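The three-way notation introduced above (a community name, a curly-bracket beatboxing symbol, and an IPA transcription in square brackets) can be sketched as a small lookup table. The entries below are only the three basic sounds named in the text; everything else is illustrative scaffolding:

```python
# Minimal sketch of the notation described above: each basic sound has a
# community name, a curly-bracket beatboxing symbol, and an IPA transcription.
BASIC_SOUNDS = {
    "Kick Drum":     {"symbol": "{B}",  "ipa": "[p']"},
    "Closed Hi-Hat": {"symbol": "{t}",  "ipa": "[t']"},
    "PF Snare":      {"symbol": "{PF}", "ipa": "[pf']"},
}

def ipa_for(name):
    """Look up the IPA transcription for a named beatboxing sound."""
    return BASIC_SOUNDS[name]["ipa"]

print(ipa_for("PF Snare"))  # → [pf']
```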
All of these details are important for later chapters. The fact that beatboxers are
aiming for particular sound qualities and flow patterns means that we should expect to find
beatboxing patterns that balance aesthetics and motor efficiency (Chapter 6: Harmony). The
lack of words in beatboxing, the interest in imitating instruments/sound effects, and the
drive to innovate through the use of new sounds are all hints that beatboxing phonology is
not a variation of speech phonology but a sound organization system in its own right. The
metrical patterning of sounds (e.g., Snares on 3) frames observations about beatboxing sound
alternations (Chapter 5: Alternations) and the relationship between speech and beatboxing
sounds in beatrhyming (Chapter 7: Beatrhyming). And the fact that beatboxers are actively
focusing on different things and cultivating different styles goes a long way to explaining
qualitative variation among beatboxers, including differences in their sound inventories and
the productions of individual sounds (Chapter 3: Sounds).
1.2 Beatboxing science
A guiding theme in beatboxing science is the study of vocal agility and capability
(Dehais-Underdown, 2021). The complex unspeechlike sounds and patterns of beatboxing
inform our understanding of what kinds of vocal sound-producing movements and patterns
can be performed efficiently—and sometimes surprise us when we see articulations that we
didn’t think were possible. This in turn offers a better general phonetic framework for
studying the relationship between linguistic tasks, cognitive limitations, physical limitations,
and motor constraints in the evolution of speech.
Likewise, knowing more about the physical abilities of the vocal tract also informs
our understanding of disordered or otherwise non-normative speech production strategies.
Some researchers advocate for using beatboxing for speech therapy (Pillot-Loiseau et al.,
2021). The BeaTalk strategy has been used to improve speech in adults (Icht, 2018, 2021; Icht
& Carl, 2022); and beatboxers Martin & Mullady (n.d.) use beatboxing in their work with
children. (See also Himonides et al., 2018; Moors et al., 2020.) Although beatboxing
interventions for therapeutic purposes are still quite new, the tantalizingly obvious
connection between beatboxing and speech as vocal behaviors has been generating interest
within the beatboxing and academic communities.
Crucial to both these branches of inquiry but almost completely undeveloped within
the field is a theory of beatboxing cognition. The literature offers just three claims about
beatboxing cognition so far, none of which are firmly established: one about the intentions of
beatboxers, and two about the fundamental units of beatboxing. There is a general consensus
that, based on the origins of beatboxing as a tool for supporting hip hop emcees, a
beatboxer’s primary intention is to imitate the sounds of a drum kit, electronic beat box, and
a variety of other sound effects (Lederer, 2005; Stowell & Plumbley, 2008; Pillot-Loiseau et
al., 2020). But treating beatboxing as simple imitation is reductive and fails to do justice to
the originality of the art form (Woods, 2012). Even in the earliest days, old school beatboxers
established distinctive vocal identities that were surely not just attempts to mimic different
electronic beat boxes. The new school of beatboxing has come a long way since then and
shows rapidly evolving preferences in artistic expression that a drive for pure imitation seems
unlikely to motivate.
As for the cognitive representations of the sounds themselves, Evain et al. (2019) and
Paroni et al. (2021) posit the notion of a “boxeme” by analogy to the phoneme—an
acoustically and articulatorily distinct building block of a beatboxing sequence. While they
imply that boxemes are meant to be a hypothesis of cognitive units, they do not address
other questions raised by the phoneme analogy (Dehais-Underdown, 2021). Are boxemes
the smallest compositional units or are they composed of even smaller elements, as
phonemes are thought to be composed of features? Does beatboxing exhibit phonological
patterns that require a theory with some degree of abstraction? And are boxemes symbolic
units, action units, or something else? Separately, Guinn & Nazarov (2018) argue for the
active role of phonological features in beatboxing based on evidence from variations in beat
patterns and phonotactic place restrictions (an absence of beatboxing coronals in prominent
metrical positions). They do not link features back to larger (i.e., segment-sized) units; while
they offer the possibility that speech and beatboxing features are linked (perhaps in the same
way that the features of a language learned later in life are linked to the features of a
language spoken from birth), it remains unclear whether or how speech representations and
beatboxing representations (whatever they are) should be considered cognitively linked.
The lack of work on beatboxing cognition is understandable: the field of beatboxing
science is still in its infancy with less than 20 years of research, and the few scientists
involved in the field have had their hands full with other more tractable questions. But it will
be difficult to use beatboxing to inform an account of the physical and cognitive factors that
shape speech without both physical and cognitive accounts of beatboxing. And while the
viability of beatboxing as a tool for speech therapy is ultimately an empirical question, a
theory of beatboxing cognition that is explicit about whether and how speech and
beatboxing sounds are cognitively related should help decide what interventions are more or
less likely to work.
This dissertation’s major contribution to beatboxing science is the initiation of a
systematic inquiry into beatboxing cognition—specifically, a hypothesis about the
fundamental units of beatboxing phonology. Chapter 3: Sounds describes beatboxing sounds
and the articulatory properties along which they are organized. Chapter 4: Theory lays out
the hypothesis that those articulatory properties can be formalized as the fundamental
cognitive units of beatboxing, akin to the fundamental linguistic gestures of Articulatory
Phonology (Browman & Goldstein, 1986, 1989). Rooting beatboxing cognition in gesture-like
units offers two benefits: the same types of empirically-testable predictions as Articulatory
Phonology, and a theoretical link between the cognitive units of speech and beatboxing. Both
benefits are advantageous for developing theories of speech informed by beatboxing and for
developing therapeutic beatboxing interventions. Chapter 5: Alternations and Chapter 6:
Harmony support this hypothesis with an example of beatboxing phonology—phonological
harmony complete with triggers, undergoers, and blockers—and offer an account based on
gestures. Finally, Chapter 7: Beatrhyming goes a step further to provide evidence for a direct
link between the cognitive units of speech and beatboxing via the art of simultaneous
production of beatboxing and singing known as beatrhyming.
2. Beatboxing as a lens for linguistic inquiry
With respect to linguistic inquiry, the longstanding debate addressed here is one of
domain-specificity: Does the human capacity for language consist only of a specialized
composite of other cognitive systems, or is there some component that is unique to language
and cannot be attributed to specialization of other cognitive systems (Anderson, 1981)? The
question has been central in the development of major linguistic paradigms over the last
several decades, including the Minimalist program that views the human language faculty as
only minimally domain-specific (the language faculty in the narrow sense) and otherwise
composed of a unique assembly of other cognitive functions (e.g., Hauser et al., 2002; Collins,
2017 provides an overview).
One of the strongest theories of domain-specificity in cognition comes from Fodor
(1983) who offers a modular approach in which a cognitive domain constitutes its own
system. In the original conception, modules are low-level (mostly sensory input) systems
which are likely to be encapsulated, automatic, innate, and which perform computations
exclusively over inputs relevant to their domain—hence, domain-specific. Modules are
distinct from general-purpose central cognitive processing. Liberman &
Mattingly’s (1985) Motor Theory couched speech perception as a linguistic module built
around the relationship between intended phonetic gestures and their acoustic output. The
Motor Theory proposes that speech perception is a parallel system to general auditory
processing, a claim supported by duplex perception tasks (Liberman et al., 1981; Mann &
Liberman, 1983). Modularity has been conceived of many different ways by now, and
whether or not a system like language shows all of the typical traits (e.g., encapsulation,
innateness) is open to empirical testing, but domain-specificity remains key to the modular
theory (Coltheart, 1999). Even when phonology is not considered a module in the strictest
sense, it is still common to make reference to the modular “interface” between phonetics and
phonology which implies that the linguistic system of sounds is distinct from the physical
implementation of sounds (cf. Ohala, 1990).
One of the key arguments in favor of domain-specificity is tied up with innateness:
there are substantial barriers for the infant attempting to learn language, including lack of
segmentability and lack of invariance in the acoustic signal of the ambient language(s); given
how quickly and effectively newborns learn speech production and perception, it stands to
reason that humans may be born with a language faculty that provides a universal starting
point for the acquisition process. This language faculty is domain-specific insofar as the
innate cognitive scaffolding is tailored to address linguistic issues. Werker & Tees (1984) and
related work demonstrated that infants are born with the ability to distinguish speech
sounds across the same categorical boundaries that adults use.
Others argue in favor of accounting for speech patterns using only
language-independent, domain-general information, without relying on an innate,
species-specific language capacity (Universal Grammar) (e.g., Lindblom, 1983; Archangeli &
Pulleyblank, 2015, 2022). This approach has foregrounded major questions in phonology
over the last few decades, all shaped around developing an understanding of how phonetics
shapes phonology. Quantal Theory (Stevens, 1989; Stevens & Keyser, 2010) derives common
phonological categories from quantal regions in the vocal tract where coarticulation is less
likely to interfere with perception. The Theory of Vowel Dispersion (Liljencrants &
Lindblom, 1972; Lindblom et al., 1979) generates typologically common vowel patterns using
the principle of maximal contrast but without presupposing any particular phonological
categories. Likewise, proponents of the Auditory Enhancement Hypothesis (Diehl &
Kluender, 1989; Diehl et al., 1991) argue that the common covariation of certain phonological
features is explained by their mutual compatibility in enhancing perceptual contrasts. The
frame/content theory (MacNeilage, 1998) posits that the origins of speech come not from a
spontaneous mutation but rather evolved from homeostatic motor functions; in this case,
phonological syllable structure (the frame) descended from the chewing action.
The question of domain-specificity is an undercurrent of much research in cognitive
science and evolutionary psychology and often involves comparing speech and language to
other types of human or non-human cognition (Hauser et al., 2002). Categorical perception
has been found in chinchillas (Kuhl & Miller, 1978) and crickets (Wyttenbach et al., 1996), as
well as for human perception of non-speech sounds (Fowler & Rosenblum, 1990) and faces
(e.g., Beale & Keil, 1995). Language and music share certain rhythmic (see Ravignani et al.,
2017 for a recent discussion), syntactic (Lerdahl & Jackendoff, 1983), and neurological
qualities (Maess et al., 2001), with other apparently cross-domain ties (Feld & Fox, 1994;
Bidelman et al., 2011). Comparison of neurotypical speech and disordered speech contributes
to a neurological aspect of the discussion such as whether the motor planning in speech uses
specialized or domain-general circuitry (Ballard et al., 2003; Ziegler, 2003a, 2003b).
Despite the evidence suggesting that language and phonology may not have a
domain-specific component, domain-specific generative models remain the norm in much of
phonological theory. This dissertation’s contribution to the domain-specificity conversation
is to argue that domain-general models of phonology have more predictive power than
domain-specific models for modeling phonological behavior that exists both in and outside
of speech.
Models of a theory help scientists describe and explain natural phenomena, and in
doing so also predict what related phenomena we should expect to find. Domain-specific
models are meant to describe and predict only phenomena within their own domain: in a
domain-specific computational phonological model, for example, the inputs and outputs are
exclusively linguistic and the grammar operates only over those linguistic elements. If the
same model were used to try to account for the inputs and outputs of a different cognitive
domain, then by definition the model would either fail or be subject to alterations that make
it no longer domain-specific.¹ And when the model predicts phenomena that are not
observed within its domain, the model is said to be imperfect because it overgenerates. As a
consequence, domain-specific models are unable to describe, explain, or predict phenomena
outside of their domain.
The domain-specificity of computational phonological models was entrenched at
least as early as the divorcing of phonetics from phonology (de Saussure, 1916; Baudouin de
Courtenay, 1972) which led to interest in only those aspects of phonology which are
essentially linguistic (Sapir, 1925; Hockett, 1955; Ladefoged, 1989). In programs descended
from this tradition, the features and grammar of phonological theory are domain-specific
because they deal exclusively with phonological inputs, outputs, and processes. The inputs
¹ If a domain-specific model needs to be used to account for the phenomena in a different domain,
domain-specificity can be preserved by copying the model’s form and adapting its units/computations to the
new context. This would result in two non-overlapping domain-specific models. This might happen in a case of
cognitive parasitism; see below for more discussion on this point.
and outputs are typically expressed as phonological features—atomic representations of
linguistic information defined by their relationship with each other, whose purpose it is to
encode meaningful contrast, and which are the basis of phonological change (Dresher, 2011;
Mielke, 2011). Phonological features are meant to be representations of linguistic meaning
and organization—they are crucially not meant to be representations of any other domain.
Depending on the strictness of a model’s commitment to domain-specificity,
sometimes explanation in phonology may come from outside language. Widespread interest
in the relationship between phonetics and phonology was renewed with the advent of
acoustically-grounded distinctive features (Jakobson et al., 1951) and the mapping of gradient
phonetic features to scalar phonetic (phonological) features in SPE (Chomsky & Halle, 1968;
see Keating, 1996 for the dual role of phonetics in SPE). Phonological grammars commonly
use phonetic grounding to constrain their outputs (Prince & Smolensky, 1993/2004; Hayes,
Kirchner, & Steriade, 2004). On the other hand, other programs based on strict
domain-specific modularity argue that phonetics should have no role in the makeup of the
grammar (e.g., Hale & Reiss, 2000). But in neither case is phonology expected to explain
anything about phonetics—except perhaps at the phonetics-phonology interface where
outputs from the phonological system are transduced into the inputs of the phonetic system
(Keating, 1996; Cohn, 2007). Even then, the interface is not intended to account for any
phonetic phenomenon that is not clearly the result of a linguistic intent, nor is it capable of
doing so without becoming a domain-general model. Regardless of whether there is an overt
commitment to an innate Universal Grammar, the resulting phonological systems are
domain-specific by design.
A domain-specific model can of course be of great practical benefit in the interest of
developing a scientific account of phonology. But the issue of the domain-specificity of
language is a hypothesis (not a fact) about the relationship between language and the rest of
human cognition. If we were to discover that phonological phenomena typically described
with a domain-specific approach are also present in another nonlinguistic behavior, then a
single model that encompasses both domains may be preferable to two domain-specific
models that provide separate accounts of their shared phenomena. For this dissertation, the
search for nonlinguistic phonological behavior takes place in the domain of beatboxing.
Beatboxing is particularly useful in the search for the nonlinguistic presentation of
phonology because beatboxing and speech have many qualitative articulatory properties in
common. For both beatboxing and speech, sound is produced when the vocal tract
articulators make constrictions that manipulate air pressure. As discussed in Chapter 3:
Sounds, many of these articulations have similar constriction locations and degrees to
articulations in speech. Most beatboxing sounds require coordination among multiple
articulators. Like speech sounds, beatboxing sounds have a domain-specific classification
system, in this case based on their musical function (e.g., “snare”, “bass”, “kick”) and their
articulation (see Chapter 3: Sounds). The sounds of beatboxing can be combined and
recombined into an unlimited number of different beat patterns—hierarchically structured
phrases of beatboxing sounds produced sequentially—but with certain phonotactic
restrictions as discussed earlier (e.g., “beat 3 must have a snare sound”). And, some common
beatboxing sounds resemble speech sounds enough that they can replace speech sounds in
an utterance (Chapter 7: Beatrhyming). Given the articulatory and organizational similarities
between beatboxing and speech, beatboxing is an ideal nonlinguistic behavior against which
to compare speech in the search for phenomena that are unique to phonology (if any).
Assuming for the moment that beatboxing does exhibit phonology-like patterns (a
claim which this dissertation attempts to support), the different approaches to
domain-specificity and domain-generality in phonology described above offer two
explanations for how beatboxing ended up looking phonological. One way starts with
domain-generality as a baseline assumption: phonology and beatboxing are grounded in the
same cognitive capacities, so whatever their shared capacities provide as a publicly available
resource (e.g., phonological harmony) will automatically be available to both phonology and
beatboxing—though not every language or beatboxer will necessarily use it.
On the other hand, phonology could be a domain-specific system from which
beatboxing copies cognitive properties. In this view, beatboxing is parasitic on phonology.
Evidence from this dissertation shows that the strongest sense of parasitism, where
beatboxing copies the actual phonological representations and grammar from phonology,
cannot be true: though there are similarities in the composition of sounds and phonological
behavior, the beatboxing sound system uses cognitive representations that are not used as
phonological units (neither in the beatboxer’s language nor in any universal feature system).
The beatboxing system must be more innovative than strict parasitism allows for.
The weaker hypothesis of parasitism is that beatboxing might take certain qualities of
phonological units and grammar—like the combinatorial nature of the representations and
the framework of a computational grammar (e.g., Optimality Theory)—and re-use them to
create beatboxing representations and beatboxing grammar. Beatboxing would not be
constrained to be essentially identical to speech as in the strong parasitic hypothesis, but its
beatboxing-phonological phenomena would be constrained by the limitations of the
representations and grammar whose form it borrowed. Those aspects which beatboxing
borrowed would then technically be domain-general, at least for those two domains, even if
they did not start that way. The weaker parasitic hypothesis is more plausible than the strong
one. Neophyte beatboxers commonly learn beatboxing sound patterns from adaptations of
speech phrases (e.g., “boots and cats” → {B t ^K t}; see Chapter 3: Sounds for a description of
the symbols). Using the physical vocal apparatus to perform similar maneuvers (Chapter 3:
Sounds, Chapter 4: Theory) could in some sense “unlock” access to phonological potential.
(Hauser, Chomsky, & Fitch [2002] suggest that recursion may have similarly been adopted
into speech from domain-specific use in another cognitive domain like navigation.)
This dissertation makes no attempt to provide evidence that distinguishes between
the domain-general hypothesis and the weaker parasitic hypothesis. The difference doesn’t
matter because both approaches arrive at the same (almost paradoxical) conclusion: that
beatboxing and speech share many properties and yet are qualitatively completely different
behaviors governed by non-overlapping intentions and tasks. Instead, this dissertation
focuses on developing a single-model approach that encompasses both domains and predicts
their shared behavior (as opposed to creating two purely domain-specific models). The
starting point for this model is Articulatory Phonology.
Articulatory Phonology (Browman & Goldstein, 1986, 1989) is the hypothesis that the fundamental cognitive
units of phonology are not symbolic features, but actions called “gestures”. Gestures have
been argued to be advantageous for phonological theory because they unite the discrete,
context-invariant properties usually attributed to phonological units with the dynamic,
continuous, context-dependent properties observed in speech. These two sides of gestures
are encoded together in the language of dynamical systems: the system parameters are
invariant during the execution of a speech action, but the state of the system changes
continuously (Fowler, 1980). Chapter 4: Theory argues that dynamical systems also
simultaneously contain domain-specific and domain-general properties. This is because, as
actions, gestures are not unique to speech but they are specialized for speech: by design, the
dynamical equations in the task dynamic framework of motor control can characterize any
goal-oriented action from any domain (Saltzman & Munhall, 1989). This means that gestures
are on the one hand domain-general because the dynamical system that defines them can
serve as the basis for any goal-oriented action, but on the other hand domain-specific
because a given gesture is specialized for a speech-specific (and language-specific) goal
(Browman & Goldstein, 1991:314-315):
“Second, we should note that the use of dynamical equations is not restricted
to the description of motor behavior in speech but has been used to describe
the coordination and control of skilled motor actions in general (Cooke, 1980;
Kelso, Holt, Rubin, & Kugler, 1981; Kelso & Tuller, 1984a, 1984b; Kugler, Kelso,
& Turvey, 1980). Indeed, in its preliminary version the task dynamic model we
are using for speech was exactly the model used for controlling arm
movements, with the articulators of the vocal tract simply substituted for those
of the arm. Thus, in this respect the model is not consistent with Liberman
and Mattingly’s (1985) concept of language or speech as a separate module,
with principles unrelated to other domains. However, in another respect, the
central role of the task in task dynamics captures the same insight as the
“domain-specificity” aspect of the Modularity hypothesis—the way in which
vocal tract articulators is yoked is crucially affected by the task to be achieved
(Abbs, Gracco, & Cole, 1984; Kelso, Tuller, Vatikiotis-Bateson, & Fowler,
1984).”
For an approach to phonological theory that can also describe non-linguistic behaviors,
dynamical action units should be preferred over features (or other purely domain-specific
phonological units) because they have domain-general roots but can be specialized for any
domain. When specialized for speech, these action units are gestures; when specialized for
another domain, they are the gesture-like building blocks of that domain instead.
Beyond their descriptive power, however, gestures can also make predictions about
the organization of sounds in other domains whereas features cannot. Assuming that
beatboxing has gesture-like fundamental units of cognition, any behavior of gestures
determined by their domain-general side is predicted to be relevant to beatboxing as well
(Chapter 4: Theory). Chapter 6: Harmony demonstrates this in the phenomenon of
beatboxing harmony: beatboxing harmony has signature traits of speech harmony including
trigger, undergoer, and blocker sounds, the behavior of all of which is predicted by gestural
approaches to harmony. The gestural model also predicts the possibility of multi-tasking by
using speech and beatboxing gestures simultaneously. Chapter 7: Beatrhyming shows not
only that beatboxing and speech can be produced simultaneously, but also that their
fundamental cognitive units are cognitively related to each other through their tasks of
making constrictions in the vocal apparatus they share.
In contrast, domain-specific phonological models make no predictions about whether
beatboxing harmony could exist or what traits it might have because the features and
grammar are designed only to target linguistic information. Generative linguistic grammars
also cannot generate beatrhyming because they cannot deal with non-linguistic sounds. Of
course, there are ways around these limitations—new models can be constructed that use
beatboxing features and beatboxing grammars to generate beatboxing harmony, and
speech-beatboxing cognitive interfaces can be postulated that do computations over the joint
domain of speech and beatboxing sounds. But ultimately all these strategies require making
multiple separate models to account for phenomena that speech and beatboxing share;
compared to a gestural approach that accounts for both speech and beatboxing without any
additional theoretical overhead, the domain-specific starting point is inferior.
CHAPTER 2: METHOD
1. Participants and data acquisition
Two novice beatboxers, one intermediate beatboxer, and two expert beatboxers were asked
to produce beatboxing sounds in isolation and in musical rhythms (“beat patterns”), and to
speak several passages while lying supine in the bore of a 1.5 T MRI magnet. Skill level
designations were given by the intermediate beatboxer who had also contacted the
beatboxers, was present for the collection of their data, and provided a beatboxer’s insight at
several points in the earlier stages of analysis. Of those five beatboxers, the productions of
just one expert are reported in the present study. The two novices and the intermediate
beatboxer are not discussed because the aim of this dissertation is to characterize expert
beatboxing, not beatboxing acquisition. (See Patil et al., 2017 for a brief study of the basic
sounds of all five beatboxers.) Data from the second expert beatboxer are not reported
because the beatboxer exhibited large head movements during image acquisition, making
kinematic analysis using the methods described below impossible. The beatboxer studied
here reported being a monolingual speaker of English.
Each beatboxer was asked in advance to provide a list of sounds they know written
with orthographic notation they would recognize. During the scanning session, each sound
label they had written was presented back to them as a visual stimulus. For each sound,
beatboxers were asked to produce the sound three times slowly and three times quickly, and
then to produce the sound in a beat pattern (sometimes referred to hereafter as a “showcase”
beat pattern). The beatboxers were also invited to perform beat patterns of their choosing
that were not meant to showcase any particular sound. For the analyzed expert beatboxer,
there were over 50 different showcase or freestyle beat patterns. The beatboxers were paid
for participation in the experiment.
Data were collected using an rtMRI protocol developed for the dynamic study of
vocal tract movements, especially during speech production (Narayanan et al., 2004; Lingala
et al., 2017). The subjects’ upper airways were imaged in the midsagittal plane using a
gradient echo pulse sequence (TR = 6.004 ms) on a conventional GE Signa 1.5 T scanner
(Gmax = 40 mT/m; Smax = 150 mT/m/ms), using an 8-channel upper-airway custom coil.
The slice thickness for the scan was 6 mm, located midsagittally over a 200 mm × 200 mm
field-of-view; image size in the sagittal plane was 84 × 84 pixels, resulting in a spatial
resolution of 2.4 × 2.4 mm. The scan plane was manually aligned with the midsagittal plane
of the subject’s head. The frames were retrospectively reconstructed to a temporal resolution
of 12 ms (2 spirals per frame, 83 frames per second) using a temporal finite difference
constrained reconstruction algorithm (Lingala et al., 2017) and an open-source library
(BART). Audio was recorded at a sampling frequency of 20 kHz inside the MRI scanner
while the subjects were imaged, using a custom fiber-optic microphone system. The audio
recordings were noise-canceled, then reintegrated with the reconstructed MR-imaged video
(Bresch et al., 2008). The result allows for dynamic visualization and synchronous audio of
the performers’ vocal tracts.
2. Annotation methods
Beat patterns from the real-time MR videos were annotated using a concise plaintext
percussion notation called “drum tabs” and point tier TextGrids in Praat (Boersma &
Weenink, 1992-2022). Beat patterns are performed with a rhythmic structure related to a
musical meter, so each annotation included labels for the beat pattern sounds and the
metrical position of that sound. This section explains how each annotation style was created,
beginning with an introduction to musical meter.
2.1 Musical meter
Just as a sequence of syllables in languages with alternating stress can be grouped
hierarchically into prosodic feet, words, and phrases, so too is musical meter composed of
strong-weak alternations hierarchically grouped into measures and phrases. But music and
beatboxing are performed isochronously, meaning that there is roughly consistent temporal
spacing between events at the same level of the hierarchy.
The rhythmic structure of the beatboxing under consideration here can be
represented as a binary tree structure resulting in strength alternations (Lerdahl &
Jackendoff, 1983; Palmer & Kelly, 1992; Figure 1). Each branch has two end nodes: a Strong
node (S) on the left, and a Weak node (W) on the right. And, each node can be the parent of
another Strong-Weak pair.
Figure 1. A simple hierarchical tree structure with alternating strong-weak nodes.
S
/ \
/ \
/ \
S W
/ \ / \
S W S W
Strong and Weak events at a certain level are sometimes called “beats” and are often marked
with the numbers 1, 2, 3, and 4; the process of finding these beats, say in order to move to
them in dance, is sometimes called beat-induction (Large, 2000). Musical phrases often last
for more than four beats, but it is customary to reset the count back to 1 instead of
continuing on to 5 (Figure 2). When counting music at this level, a musician is likely to say
“one, two, three, four, one, two, three, four, one…”. Each beat 1 is the beginning of a musical
chunk called a “measure.” Since counting the beat resets to 1 after every 4, musicians reading
musical notation might refer to a specific beat in the meter by both measure number and
beat number, as in “measure 2, beat 3.”
Figure 2. Hierarchical strong-weak alternations in which one level (“beats”) is numbered.
/ \ / \
/ \ / \
S W S W
/ \ / \ / \ / \
/ \ / \ / \ / \
/ \ / \ / \ / \
S W S W S W S W
1 2 3 4 1 2 3 4
Each beat can be further divided into sub-beats in which the Strong node retains the
numerical label of its parent and the Weak node is called “and” (here abbreviated to “+”)
(Figure 3). When speaking the meter aloud at this level, a musician would say “one and two
and three and four and one and two and three and four and one and…”.
Figure 3. Hierarchical strong-weak alternations. The beat level is numbered as in Figure 2.
The child nodes of that level inherit the same numbering on the strong nodes and a + on the
weak nodes.
/ \ / \
/ \ / \
S W S W
/ \ / \ / \ / \
/ \ / \ / \ / \
/ \ / \ / \ / \
S W S W S W S W
1 2 3 4 1 2 3 4
/ \ / \ / \ / \ / \ / \ / \ / \
S W S W S W S W S W S W S W S W
1 + 2 + 3 + 4 + 1 + 2 + 3 + 4 +
These sub-beats can be divided even more. In these sub-sub-beats, the Strong nodes once
again retain the label of the parent node, while the Weak nodes are given different names
(Figure 4). The Weak sub-sub-beat between the beat node (a number) and the “and” node is
called “y” (pronounced [i]), and the Weak sub-sub-beat between the “and” and the next beat
node is called “a” (pronounced [ə]). When a musician speaks the meter at this level of
granularity, they say “one y and a two y and a three y and a four y and a one y and a two y
and a three y and a four y and a…”.
Figure 4. Two levels below the beat level have further subdivisions.
/ \ / \
/ \ / \
S W S W
/ \ / \ / \ / \
/ \ / \ / \ / \
/ \ / \ / \ / \
S W S W S W S W
1 2 3 4 1 2 3 4
/ \ / \ / \ / \ / \ / \ / \ / \
S W S W S W S W S W S W S W S W
1 + 2 + 3 + 4 + 1 + 2 + 3 + 4 +
|\ |\ |\ |\ |\ |\ |\ |\ |\ |\ |\ |\ |\ |\ |\ |\
S W S W S W S W S W S W S W S W S W S W S W S W S W S W S W S W
1 y + a 2 y + a 3 y + a 4 y + a 1 y + a 2 y + a 3 y + a 4 y + a
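The counting scheme at each subdivision level can be generated mechanically. The following Python sketch is illustrative only (the function name and the level encoding are inventions for this example, not part of any beatboxing or music tool); it produces the spoken count labels described above:

```python
def count_labels(n_beats=4, level=2):
    """Spoken count labels for one measure at a given subdivision level.

    level 0: beats only      -> 1 2 3 4
    level 1: sub-beats       -> 1 + 2 + ...
    level 2: sub-sub-beats   -> 1 y + a 2 y + a ...
    """
    sub_names = {0: [""], 1: ["", "+"], 2: ["", "y", "+", "a"]}[level]
    labels = []
    for beat in range(1, n_beats + 1):
        for sub in sub_names:
            # the strong node keeps the beat number; weak nodes get sub-names
            labels.append(sub if sub else str(beat))
    return labels

print(" ".join(count_labels(level=2)))
# 1 y + a 2 y + a 3 y + a 4 y + a
```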
Metrical Phonology uses a more compact representation for hierarchical metrical structure, a
notation with stacks of Xs called a metrical grid (Liberman & Prince, 1977; Hayes, 1984;
Figure 5). In each column, the number of Xs represents the strength of a metrical position
relative to the other metrical positions in the same phrase. In the example below, the lowest
row of Xs corresponds to the syllables, the Xs above those to the head of each trisyllabic
and the top Xs to binary groups of feet.
Figure 5. A metrical grid of the rhythmic structure of the first two lines of an English
limerick.
x x x x
x x x (x) x x x
x x x x x x x x x (x) (x) (x) x x x x x x x x x (x)
There once was a man from Nantucket who kept all his cash in a bucket
The example in Figure 6 below is the metrical grid notation of the metrical tree example in
Figure 4.
Figure 6. A metrical grid representation of the metrical structure of Figure 4.
x
x x
x x x x
x x x x x x x x
x x x x x x x x x x x x x x x x
x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x
1 y + a 2 y + a 3 y + a 4 y + a 1 y + a 2 y + a 3 y + a 4 y + a
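Under strictly binary subdivision, the height of each grid column follows directly from the position index: each factor of two in the index adds one level of strength. A small illustrative Python function (an invention for this example, not part of any cited toolbox) makes this regularity explicit for a single measure:

```python
def grid_strength(i, positions_per_measure=16):
    """Number of Xs in the metrical-grid column for position i
    (0-indexed sixteenth-note slot) within a single measure,
    assuming strictly binary subdivision throughout."""
    if i == 0:
        # the downbeat sits at the top of every level
        return positions_per_measure.bit_length()  # 16 -> 5
    strength = 1
    while i % 2 == 0:  # each factor of two adds one grid row
        strength += 1
        i //= 2
    return strength

print([grid_strength(i) for i in range(16)])
# [5, 1, 2, 1, 3, 1, 2, 1, 4, 1, 2, 1, 3, 1, 2, 1]
```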
Just as speech can have trisyllabic feet, in some cases a sub-division of a beat has three
terminal nodes instead of two (or instead of sub-dividing further to four nodes). These
sequences of three are called “triplets” and are counted by musicians as “one and a two and a
three and a four and a one and a…”. For the purposes of this research, it is not important
whether a subdivision with three terminal nodes is a ternary-branching tree or a structure
with two levels; but for simplicity in the metrical grid the two weaker sub-beats in a triplet
are marked as equally weak (Figure 7).
Figure 7. A metrical grid representation in which each beat has three subdivisions.
x
x x
x x x x
x x x x x x x x x x x x
1 + a 2 + a 3 + a 4 + a
If triplets occur in a beat pattern in this research, they are often mixed in among binary
divisions. In the example in Figure 8, beats 1 and 3 have full binary branching while beats 2
and 4 branch into triplets.
Figure 8. A metrical grid in which beats 1 and 3 have four sub-divisions while beats 2 and 4
have three sub-divisions.
x
x x
x x x x
x x x x x x
x x x x x x x x x x x x x x
1 y + a 2 + a 3 y + a 4 + a
The preceding description of musical structure has focused on metrical positions—slots
of abstract time. But not all metrical positions are necessarily used in a beatboxing
performance. For example, in the beat pattern in Figure 9 below each beat (1, 2, 3, or 4) holds
a musical event, but the available metrical positions after each beat (“y + a”) are silent—with
the exception of the “a” of the first 4 on which musical event {B} is produced just before
another {B} on the next beat 1. ({B}, {t}, and {PF} are the beatboxing sounds Kick Drum,
Closed Hi-Hat, and PF Snare, respectively; beatboxing shorthand is denoted by curly
brackets as described in Chapter 3: Sounds.)
Figure 9. A metrical grid of the beatboxing sequence {B t PF t B B B PF t}. All sounds except
the second {B} are produced on a major beat; the second {B} is produced on the fourth
sub-division of beat 4 of the first measure.
x
x x
x x x x
x x x x x x x x
x x x x x x x x x x x x x x x x
x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x
1 y + a 2 y + a 3 y + a 4 y + a 1 y + a 2 y + a 3 y + a 4 y + a
B t PF t B B B PF t
2.2 Drum tab notation
Metrical grids are useful for representing the relative strength of each metrical position
compared to the others in its phrase. But since the relative strengths of positions in the
metrical structure of beatboxing are highly regular (1 > 3 > {2, 4} > “+” > {“y”, “a”}), a more
consolidated type of metrical notation can be used. For beatboxing and some other
percussive music that does not require pitch to be encoded, a drum tab may be used (e.g.,
Figure 10).
Figure 10. A drum tab representation of the beat pattern in Figure 9, including a label
definition for each sound.
B |x--------------x|x---x-----------
t |----x-------x---|------------x---
PF|--------x-------|--------x-------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
B=Kick Drum (labial ejective)
t=Closed Hi-Hat (coronal ejective)
PF=PF Snare (labial ejective affricate)
Drum tablature (or drum tabs) is an unstandardized form of drum beat/pattern notation
(Drum tablature, 2022; DrumTabs, n.d.). Each drum tab (the whole figure) represents a
musical utterance. Except for the last row, which marks out the meter, each drum tab row
indicates the timing of a particular musical event in the meter. Drum tab notation has two
major advantages over metrical grid notation. First, the metrical pattern of each sound is
easier to see because it sits alone on its tier. Second, multiple events can be marked as
occurring on the same metrical position—a common occurrence in many musical
performances including beatboxing. (The metrical grid notation, on the other hand, only
permits a single musical event per metrical position).
The first symbol of each row (except the last row) is the abbreviation for a beatboxing
sound in Standard Beatbox Notation or a different notation if no Standard Beatbox Notation
exists for that sound (Stowell, 2003; Tyte & SPLINTER, 2014). The names of the sounds
corresponding to each symbol are listed beneath the drum tab in a key. The symbol x on a
drum tab row marks the occurrence of a sound, and the symbol - (hyphen) indicates that the
sound represented in that row is not performed at that metrical position. When a sound is
sustained, the initiation of the sound is marked with an x and the duration of its sustainment
is marked with ~ (tilde). For example, the Liproll {LR} in the drum tab in Figure 11
(simplified from a longer and more complicated sequence for illustrative purposes) is
sustained for a full beat or slightly longer each time it is produced. (The sounds {b} and {pf}
are alternants of {B} and {PF} as discussed in Chapter 5: Alternations.)
Figure 11. A simplification of a drum tab from Chapter 5: Alternations. Sounds sustained
across multiple beat sub-divisions are marked by tildes “~”.
b |x-----x-----x---|--x---x-----x---|x-----x-----x---|--x-------x---x-
pf|--------x-------|--------x-------|--------x-------|------x---------
LR|x~~~~~------x~~~|~~----------x~~~|x~~~~~------x~~~|~~--------------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
The bottom row of the drum tab shows the metrical positions available in the musical
utterance. The first beat of the meter is marked with the number 1. The rest of the beats of
the tactus are marked 2, 3, and 4, with the “+” of each beat evenly spaced between them. As
described in the previous section, each beat can be divided as much as required. Generally in
this research, the labels for the “y” and “a” of each beat are omitted in an attempt to improve
overall legibility of the meter, but their positions exist in the space between the numbered
beats and their “+”s. Pipes (|) visually separate each group of four beats from the next
(separate “measures”) but do not have any role in the meter.
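The slot-per-character layout makes drum tabs straightforward to manipulate programmatically. The sketch below is a hypothetical helper, not part of the annotation pipeline used in this dissertation; it assumes sixteen sixteenth-note slots per measure and no sustain (~) symbols:

```python
def parse_tab_row(row):
    """Parse one drum tab row like 'B |x---...|x---...' into events.

    Returns (sound, [(measure, slot, meter_label), ...]), assuming 16
    sixteenth-note slots per measure and binary subdivision labels.
    """
    sub = ["1", "y", "+", "a", "2", "y", "+", "a",
           "3", "y", "+", "a", "4", "y", "+", "a"]
    sound, _, rest = row.partition("|")
    events = []
    for m, measure in enumerate(rest.split("|"), start=1):
        for slot, ch in enumerate(measure):
            if ch == "x":  # an 'x' marks the onset of the sound
                events.append((m, slot, sub[slot]))
    return sound.strip(), events

# the Kick Drum row from Figure 10
print(parse_tab_row("B |x--------------x|x---x-----------"))
# ('B', [(1, 0, '1'), (1, 15, 'a'), (2, 0, '1'), (2, 4, '2')])
```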
A drum tab transcription was created for each beat pattern in the data set by repeated
audio-visual inspection of each beat pattern’s real-time MRI video. Portions of beat patterns
with rapid or unclear articulations were examined frame by frame using the Vocal Tract ROI
Toolbox (Blaylock, 2021). Articulations in the beat pattern were matched to the articulations
of sounds the beatboxer had named and performed in isolation (see Chapter 3: Sounds) in
order to establish which sound labels to use in the drum tab. In many cases, it was easiest to
start by identifying the sounds at the beginning of a phrase (which were often Kick Drums)
and the snare sounds (which fall on the back beat, notated in this dissertation as beat 3),
then look at the sounds in between. Sounds in the beat pattern that did not clearly match a
sound the beatboxer had performed in isolation were identified by cross-reference to
beatboxing tutorial videos and insight from other beatboxers; in cases where the sound could
not be identified, a new symbol and descriptive name was created for it. Initial drum tab
transcriptions were revised based on feedback from spectrogram and waveform
visualizations of the audio while making text grids in Praat and from time series created from
regions of interest in the rtMR videos (see below).
2.3 Praat TextGrid
After creating transcriptions of the beat patterns in drum tabs, MIR Toolbox (v1.7.2)
(specifically the mirevents(..., ‘Attack’) function) was used to automatically find acoustic
events in the audio channel of each video in the data set (Lartillot et al., 2008; n.d.). These
events were converted into points on a Praat PointTier using mPraat (Bořil & Skarnitzl,
2016). MIR Toolbox sometimes identified events that did not correspond to beatboxing
sounds, mostly because the MRI audio (or its reconstruction) led to many sounds having an
“echo” in the signal. For example, Figure 12 shows that the acoustic release of a Kick Drum
was followed by several similar but lower amplitude pulses which were not related to any
articulatory movements and which create the illusion that there are several quieter Kick
Drums. Events determined not to be associated with the articulation of a beatboxing sound
(including these duplicate/extra events) were manually removed, keeping only the event with
the highest amplitude (which was also usually the first).
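The de-duplication step can be illustrated with a small sketch. This is not the procedure's actual implementation, and the 50 ms clustering window is an invented value for illustration:

```python
def dedupe_events(events, window=0.05):
    """Collapse acoustic events closer together than `window` seconds,
    keeping only the highest-amplitude event in each cluster.

    `events` is a list of (time_s, amplitude) pairs; the 50 ms window
    is an illustrative choice, not a parameter from the study.
    """
    kept = []
    for t, amp in sorted(events):
        if kept and t - kept[-1][0] < window:
            if amp > kept[-1][1]:  # echo louder than kept event: replace
                kept[-1] = (t, amp)
        else:
            kept.append((t, amp))
    return kept

# a Kick Drum followed by two reconstruction "echoes", then another sound
events = [(1.000, 0.9), (1.020, 0.4), (1.045, 0.2), (2.000, 0.8)]
print(dedupe_events(events))
# [(1.0, 0.9), (2.0, 0.8)]
```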
Less commonly, MIR Toolbox sometimes failed to identify low-amplitude events.
When a sound was made but no event was found by MIR Toolbox, a point was manually
placed on the Praat PointTier by selecting a portion of the spectrogram that corresponded to
the sound in question (confirmed by audio inspection). The intensity of that selection was
then extracted, the time point of maximum intensity queried (in Praat, using Praat defaults),
and a point placed on the PointTier at that time. (In a small sample of comparisons between
the result of this method and points that MIR toolbox had already found, this Praat method
placed points 1-3 ms after the points placed by MIR Toolbox.) If this method failed (either
because the selection window was too small or because Praat’s intensity signal lacked a
maximum), a point was manually placed by visual inspection of the waveform at the highest
amplitude point of a stop/affricate release or the middle of a brief moment of phonation.
A label was added to each event in the PointTier corresponding to the appropriate
sound in the drum tab transcription, resulting in a one-to-one correspondence between
drum tab and PointTier labels. A second point tier with meter labels for each musical event
in a beat pattern was created automatically using mPraat: for each beatboxing event, the time
of the event in the label point tier was duplicated onto a meter PointTier and assigned the
corresponding beat value from the drum tab transcription.
In some cases, one beat was judged to correspond to multiple events. For example,
a Kick Drum and the beginning of a Liproll might both occur on beat 1. In all such cases it
was possible to annotate distinct acoustic events for each sound on that beat. On the meter
tier, the beat (1 here) would be used for both events—in this example, both the Kick Drum
label point and Liproll label point.
Figure 12. Waveform, spectrogram, and text grid of three Kick Drums produced at relatively
long temporal intervals. The text grid label of each sound is associated with the true acoustic
release of the sound; the subsequent smaller bursts are artefacts from audio reconstruction.
3. Kinematic visualizations
3.1 Time series from regions of interest
Time series were created from rtMR video pixel intensities using a region of interest method
(Lammert et al., 2010; Blaylock, 2021). Regions of interest reduce the complexity of image
processing by isolating relatively small sets of pixels for analysis. The regions distill the
intensities (brightnesses) of all their pixels into a single value (or in the case of a centroid
method, two values). In a video, the region of interest is static but its pixel intensities change
frame by frame; assembling the frame-by-frame intensity aggregates into a list creates a time
series. Regions are generally devised so that pixel intensity changes reflect changes in the
state of a single constriction type relevant to the articulation of a sound. For example, a Kick
Drum {B} is a labial ejective stop (see Chapter 3: Sounds) and so requires a region for lip
aperture and another for larynx height. As the tissue of the relevant articulator(s) moves into
the space encoded by the pixels in a region, the region’s overall pixel intensity increases.
The region of interest analysis technique is versatile, with different region shape types
and time series calculation methods that can be highly effective when used appropriately.
The VocalTract ROI Toolbox (Blaylock, 2021) offers three region shapes: rectangular regions,
pseudocircular regions (Lammert et al., 2013), and regions formed by automatically finding
groups of pixels which covary in intensity (Lammert et al., 2010). (Pseudocircular regions are
“pseudo”-circular because actual circles cannot be constructed from arrangements of square
pixels.) In this dissertation, rectangular regions were used for articulator movements
designated as horizontal or vertical with respect to their absolute orientation in the video,
pseudocircular regions were used for oblique articulator movements, and statistically
correlated regions were used for especially large tongue body movements (i.e., in the Liproll).
Time series calculation methods include averaging the intensities of all the pixels in the
region, transforming the intensities into a binary mode, and tracking the centroid of tissue
within the region (Oh & Lee, 2018). This dissertation uses only average pixel intensity time
series.
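As a concrete illustration of the average-intensity method, here is a minimal sketch (not the VocalTract ROI Toolbox implementation; the function name and toy data are inventions for this example):

```python
import numpy as np

def roi_time_series(frames, top, left, height, width):
    """Average pixel intensity inside a rectangular region of interest
    for each frame of a video.

    frames: array of shape (T, H, W); returns a length-T time series.
    """
    region = frames[:, top:top + height, left:left + width]
    return region.reshape(frames.shape[0], -1).mean(axis=1)

# toy video: intensity in the region rises as "tissue" moves into it
video = np.zeros((3, 10, 10))
video[1, 2:4, 2:4] = 0.5
video[2, 2:4, 2:4] = 1.0
print(roi_time_series(video, top=2, left=2, height=2, width=2))
# [0.  0.5 1. ]
```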
When regions of interest tracking average pixel intensity are used for capturing the
kinematics of movement along a given vocal dimension (see below), each region needs to be
placed so that it covers the widest aperture of its intended tract variable. At the lowest
average pixel intensity in the region, the relevant articulator(s) should be just outside the
region; pixel intensity will then increase as the relevant articulator(s) move into the region,
up to maximum intensity at the narrowest/tightest constriction. For laryngeal movements
used in glottalic egressive sounds, the region should have maximum intensity when the
arytenoids are at their maximum height; in their lowest position, the arytenoids should be
just below the lower edge of the region. Defining the regions in this way ensures that the
time series will capture—as accurately as possible—the temporal landmarks corresponding to
the start of an articulator’s movement into a constriction, the maximum velocity of the
articulator as it moves into its constriction, and the moment of maximum constriction.
Regions were placed manually as follows:
LAB. (Figure 13.) A rectangular region to measure lip aperture. Vertically, the region
was arranged so that the upper and lower lip were just outside the region at their widest
aperture. Horizontally, the region was wide enough to include the full width of the lips
during bilabial closures as well as the protrusion of the lips during labiodental closures.
LAB2. (Figure 14.) A rectangular region for measuring labial constrictions in which
the lips are pulled inward between the upper and lower teeth. The region is placed adjacent
to, posterior of, and non-overlapping with LAB. The width and height of the region
encompassed the pixels of the retracted portions of the upper and lower lip.
COR. (Figure 15.) A rectangular region for measuring alveolar, dental, and
linguolabial tongue tip constrictions. The region is placed so that the anterior edge is
adjacent to the lips and the posterior edge is far enough posterior that the tongue tip remains
outside the region while the tongue is pulled back or down. The upper edge of the region is level with the
alveolar ridge.
DOR. (Figure 16.) A pseudocircular region for measuring tongue body constrictions
near the velum. The region is placed adjacent to the lowered velum such that the region is
filled when the tongue body connects with the lowered velum or for narrow tongue body
constrictions while the velum is raised.
FRONT. (Figure 17.) A region for the most anterior tongue body position of the
Liproll. The region was designed so that the anterior edge of the region traced the anterior
edge of the tongue body during its most anterior Liproll constriction, the upper edge of the
region traced the air-tissue boundary along the palate, and the lower/posterior edge traced
the anterior edge of the tongue body at its most posterior Liproll constriction. This shape
was most successfully generated from the aggregate of two adjacent regions of statistically
correlated pixels, one of which contained the front of the tongue body in only its most
anterior Liproll constriction and the other of which contained the front of the tongue body
only while the tongue was in the velar closure posture it adopted during that beat pattern
(see Chapter 6: Harmony).
VEL. (Figure 18.) A region for tracking velum height. This was a pseudocircular region
of radius 2 pixels placed over the pixels that contained the velum in its most raised position
and adjacent to the pixels containing the velum in its most lowered state.
LAR. (Figure 19.) A rectangular region placed on the pixels containing the arytenoid
cartilages in their most elevated position.
A default subset of regions was created from inspection of the first few beat patterns
JR performed including beat patterns that highlighted the Kick Drum and Closed Hi-Hat.
These regions were modified for other videos as needed—usually to account for head
movement between videos.
Figure 13. LAB region, unfilled during a Vocalized Tongue Bass (left) and filled during the
Kick Drum that followed (right).
Figure 14. LAB2 region filled during a Liproll (left) and empty after the Liproll is complete
(right).
Figure 15. COR region, filled by an alveolar tongue tip closure for a Closed Hi-Hat {t} (left),
filled by a linguolabial closure {tbc} (center), and empty (right).
Figure 16. DOR region, filled by a tongue body closure during a Clickroll (left) and empty
when the tongue body is shifted forward for the release of an Inward K Snare (right).
Figure 17. FRONT region for Liproll outlined in red, completely filled at the beginning of the
Liproll (left) and empty at the end of the Liproll (right).
Figure 18. VEL region demonstrated by a Kick Drum, completely empty while the velum is
lowered for the preceding sound (left) and filled while the Kick Drum is produced (right).
Figure 19. LAR region demonstrated by a Kick Drum (an ejective sound), completely empty
before laryngeal raising (left) and filled at the peak of laryngeal raising (right).
3.2 Gestural scores
In Articulatory Phonology, gestural scores represent the temporal organization of
fundamental phonological elements called “gestures” (Browman & Goldstein, 1986, 1989).
Gestures are defined with respect to a dynamical system (Chapter 4: Theory). At the level
they can be observed, gestures typically involve the motion of a single constriction system
called a vocal tract variable (like the lips or tongue tip) toward a task-relevant goal—often a
spatial target in the vocal tract in terms of some constriction location (where a constriction is
being made in the vocal tract) and degree (how constricted the vocal tract is in that
location). A gesture has a finite life span; while a gesture is active, the dynamical system
parameters that determine a gesture’s behavior (like its intended spatial goal) remain
invariant, but its influence over a tract variable causes continuous articulatory changes
(Fowler, 1980).
Gestural scores are visual representations of the gestures active in a given utterance. A
gestural score often includes two things: a kinematic time series for each tract variable that
estimates the continuous change of that tract variable; and, inferences about when a gesture
is thought to be active and exerting control over a tract variable—its finite duration,
represented by a box or shading accompanying the time series. Gestural scores are here used
to visualize beatboxing movements, though in this case the “gestures” found are intended to
represent only the interval of time during which a constriction is formed and released within
a given region of interest—these constriction intervals do not necessarily correspond to
theoretical beatboxing gestures (though Chapter 4: Theory argues in favor of this
interpretation).
Gestures were found semi-automatically from time series generated by the region of
interest method (Blaylock, 2021). Each beatboxing sound was associated with one or more
regions of interest in a lookup table; for example, the Kick Drum is a glottalic egressive
bilabial stop and so was associated with the LAB (labial) and LAR (laryngeal) regions. Each
beatboxing sound in a beat pattern was marked by a point on a Praat point tier as described
earlier. For each sound, the point was used as the basis for automatic use of the DelimitGest
function (Tiede, 2010) on each of the time series associated with that sound. The algorithm
defines seven temporal landmarks for each gesture based on the velocity of the time series
(calculated via the central difference) within a specified search range—in this case, the entire
time series was the search range. The time of maximum constriction (MAXC) is the time of
the velocity minimum nearest that sound’s time point from the point tier. The times of peak
velocity into (PVEL) and out of (PVEL2) the constriction are the times of the nearest
velocity maxima greater than 10% of the maximum velocity of the search range before and
after MAXC, respectively. The time of the onset of movement (GONS) is the time at which
movement velocity is 20% of the range of velocities between the peak velocity into the
constriction and the nearest preceding local velocity minimum; the time of movement end
(GOFFS) was calculated the same way but for the range of velocity between the peak
velocity out of the constriction and the nearest following velocity minimum. Finally, the time
of constriction attainment (NONS) was calculated as the time at which the velocity was 20%
of the range between the peak velocity into a constriction and the minimum velocity
associated with the time of MAXC; the time at which a constriction began to be released was
likewise calculated as the time of the same velocity threshold but between the velocity
associated with the time MAXC and the peak velocity out of the constriction.
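The landmark logic can be sketched in simplified form. The function below illustrates only the general velocity-threshold idea; it omits the search-range handling, the 10% peak-velocity criterion, and the offset landmarks of DelimitGest, and all names and the synthetic data are illustrative:

```python
import numpy as np

def gesture_landmarks(x, anchor, onset_frac=0.2):
    """Simplified gestural landmark finder for an ROI intensity
    time series `x` that rises into a constriction and falls again.

    Returns (GONS, PVEL, MAXC) frame indices. This is a sketch of the
    velocity-threshold idea, not the DelimitGest algorithm itself.
    """
    v = np.gradient(x)  # central-difference velocity
    # MAXC: local intensity maximum nearest the acoustic anchor frame
    peaks = [i for i in range(1, len(x) - 1)
             if x[i] >= x[i - 1] and x[i] >= x[i + 1]]
    maxc = min(peaks, key=lambda i: abs(i - anchor))
    # PVEL: peak velocity into the constriction
    pvel = int(np.argmax(v[:maxc]))
    # GONS: walk back from PVEL until velocity drops below 20% of peak
    gons = pvel
    while gons > 0 and v[gons - 1] > onset_frac * v[pvel]:
        gons -= 1
    return gons, pvel, maxc

t = np.arange(101)
x = np.exp(-((t - 50) ** 2) / 200.0)  # synthetic constriction curve
gons, pvel, maxc = gesture_landmarks(x, anchor=48)
# maxc lands on the intensity peak at frame 50; gons < pvel < maxc
```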
In some cases the automatic gesture-finding algorithm defined gestural landmarks
that were grossly unaligned with the actual articulator movement. Often this was because the
MAXC time point taken from the point tier was placed on a local minimum in pixel intensity
rather than a local maximum. Those gestures were manually corrected via the MviewRT GUI
(Tiede, 2010) using the same DelimitGest function and parameters by selecting different
starting frames than the ones generated from the Praat point tier. Some of those manually
placed gestures had temporal landmarks that were still grossly unaligned with their expected
relative pixel intensity values, as when a gestural offset landmark was placed halfway into the
constriction of a later gesture; these landmarks were corrected in MviewRT by eye. Gestural
scores and their time series were plotted for dissertation figures in MATLAB using a branch
of the VocalTract ROI Toolbox. In these plots, manually-adjusted gestures are marked by a
black box around the gesture.
CHAPTER 3: SOUNDS
This chapter introduces some of the most frequent sounds of beatboxing and identifies
critical phonetic dimensions along which the inventory of beatboxing sounds appears to be
distributed. There are three major conclusions. First, the sounds of beatboxing have a
roughly Zipf’s Law (power law) token frequency distribution, a pattern that has been
identified for word frequency in texts and corpora but not for sound frequency; this is
interpreted as a reflection of the status of individual beatboxing sounds as meaningful
vocabulary items in the beatboxer’s sound inventory. Second, beatboxing sounds are
organized combinatorially insofar as they can largely be described as combinations of a
relatively small set of articulatory dimensions, though the organization of these dimensions
is not as periodic or economic as the organization of sounds in speech. Third, beatboxing
sounds are contrastive with one another because changing one of the articulatory
components of a sound generally leads to a change in the sound’s meaning. Speech and
beatboxing therefore appear to share not just the vocal apparatus but also distributional and
compositional properties—even though the beatboxing sounds have no meaningful relation
to a beatboxer’s phonological or lexical knowledge.
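The power-law (rank-frequency) claim can be checked with a simple log-log regression. The sketch below uses invented token counts, not the dissertation's data:

```python
import numpy as np

def zipf_slope(counts):
    """Slope of log(frequency) against log(rank); a slope near -1
    is the classic Zipfian signature."""
    freqs = np.sort(np.asarray(counts, dtype=float))[::-1]
    ranks = np.arange(1, len(freqs) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return slope

# invented token counts, frequency roughly proportional to 1/rank
toy_counts = [1000 // r for r in range(1, 21)]
print(round(zipf_slope(toy_counts), 2))  # prints a value close to -1
```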
1. Introduction
At least at some level of representation, beatboxing sounds have intrinsic meaning. The
meaning of a sound often refers to the musical referent it imitates, whether that be part of a
drum kit like a Kick Drum {B} or a synthetic sound effect like a laser (e.g., Sonic Laser
{SonL}). One could therefore compile a list of beatboxing sounds and structure it so that
sounds with similar musical roles are listed near each other: kicks, hi-hats, snares, basses,
rolls, sound effects, and more. Catalogs of sounds like this have been assembled by
beatboxers. The boxeme of Paroni et al. (2021) seems to refer to sounds at this level of
granularity which experienced beatboxers are likely able to distinguish and use appropriately
in the context of a beat pattern—a beatboxing utterance. Beatboxing sounds are cognitively
organized by their musical function, at the very least.
But perhaps there is more to the organization of beatboxing sounds than just their
musical function. Other cognitive domains have been described as using a sort of “mental
chemistry” (Schyns et al., 1998:2) in which a few domain-relevant dimensions are variously
combined to create a myriad of representations. Speech is one such system: the sounds of a
language are composed of discrete choices along a relatively small set of phonetic
dimensions like voicing, place, and duration; these dimensions are thought to encode
linguistic meaning through contrast, and are often considered to be the compositional,
cognitive building blocks of speech (i.e., features or gestures).
Abler (1989; see also Studdert-Kennedy & Goldstein, 2003) describes three properties
shared by self-diversifying systems like the systems of speech sounds, genes, and chemical
elements: multiple levels of organization, sustained variation via combinations instead of
blending, and periodicity—the repeated use of a relatively small set of dimensions in
different combinations (referred to by other scholars as feature economy; Ohala, 1980, 2008;
Clements, 2003; Dunbar & Dupoux, 2016). Beatboxing does have at least two levels of
organization—the meanings of the sounds themselves in terms of musical roles (e.g., kick,
snare) and their organization into hierarchically structured beat patterns. Less clear is
whether beatboxing sounds are composed of combinatorial, periodically organized units.
If meaningful beatboxing sounds are also composed of smaller units, they should be
classifiable over repeated use of a small set of dimensions—some of which may happen to
overlap with the dimensions along which speech sounds are classified because they share the
same phonetic potential via the vocal tract. Alternatively, if beatboxing sounds are not
composed combinatorially, they might instead be organized dispersively throughout the
vocal tract so that each sound is maximally distinct from the others. This would be
reminiscent of Ohala’s (1980:184) “deliberately provocative” suggestion that, if consonants
are maximally dispersed within a language’s phonological inventory as vowels often seem to
be (Liljencrants & Lindblom, 1972; Lindblom, 1986), then consonant systems like [ɗ k’ ts ɬ m
r ǀ] should be typologically common (which they are not; see Lindblom & Maddieson, 1988).
If beatboxing sounds are organized to be distinctive but not combinatorial, then beatboxing
sounds should not be classifiable over repeating dimensions. Figure 20 schematically
demonstrates these types of organization.
Note that being able to classify beatboxing sounds along articulatory dimensions is
not enough to claim that those dimensions constitute cognitive beatboxing units. The
properties of compositionality and periodicity do not guarantee that the composite
dimensions play a role in the cognitive representation and functioning of the system. In
linguistics, evidence for the cognitive reality of organizing atomic features comes from the
different behavioral patterns speech sounds exhibit depending on which features they are
composed of. This chapter only goes so far as to describe and analyze the articulatory
dimensions along which beatboxing sounds appear to be dispersed; later chapters revisit the
question of the cognitive status of some of these dimensions.
Figure 20. Beatboxing sounds organized by maximal dispersion in a continuous phonetic
space (top) vs organization along a finite number of phonetic dimensions (bottom).
This chapter presents two novel analyses of beatboxing sound organization. The first
(Analysis 1) measures the token and beat pattern frequency of beatboxing sounds, providing
the first quantitative account of beatboxing sound frequency. The second (Analysis 2) builds
on the first by evaluating whether higher frequency beatboxing sounds can be analyzed as
composites of choices from a relatively small set of phonetic dimensions. In the process, the
chapter contributes to the still-expanding literature of the phonetic documentation of
beatboxing sound production.
2. Method
Describing the organization of beatboxing sounds and assessing whether they are composed
combinatorially requires first making a list of beatboxing sounds to analyze. New beatboxing
sounds continue to be invented, and there is no fully comprehensive documentation of
beatboxing in which to find a list of all the sounds that have been invented so far (though
resources like humanbeatbox.com offer an attempt). The list of sounds for this analysis was
assembled through inspection of real-time MRI videos of a single expert beatboxer,
supplemented by discussion with other beatboxers and YouTube tutorials of beatboxers
explaining how to produce various sounds. Two particular methodological concerns about
this process merit discussion in advance: how to decide which of the articulations a beatboxer
produces are and are not beatboxing sounds, and how to determine which of those sounds to include
in an analysis of beatboxing sound organization.
The decision of what counts as a beatboxing sound is rooted in the observations and
opinions of the beatboxer and the analyst (who may be the same but in this case are not). In
the process of data collection for this study, each beatboxer was asked to make a list of
sounds they can produce and then showcase each one in a beat pattern. But more sounds
might be used in those beat patterns than were listed and showcased by the beatboxer, either
because the beatboxer forgot to list them or does not overtly recognize them as a distinct
sound. Likewise, a beatboxer might distinguish between two or more sounds that the analyst
detects no difference between—because the differences were either not imageable,
nonexistent, or not detected. And, some sounds may be different only in ornamentation,
with secondary articulations used to create different aesthetics without fundamentally
changing the nature of the sound. The analyst must choose to either rely only on a
beatboxer’s overt knowledge of their sound inventory or add and remove sounds in the list
based on the analysis of their usage. Thus a catalog of a beatboxer’s sounds is biased by the
beatboxer’s knowledge and the analyst’s assumptions, and therefore not likely to be a
complete or fully accurate representation of a beatboxer’s cognitive sound inventory.
The second methodological issue is deciding which of those beatboxing sounds to
include in the analysis, as the sounds of a beatboxer’s inventory might not all have the same
status—just as not all the sounds of a language are equally contrastive (Hockett, 1955). Some
beatboxing sounds may just be entering or leaving the inventory, and some may be less
common than others. If the whole sound inventory is analyzed equally, less stable beatboxing
sounds may throw off the analysis by muddying the dimensions that compose more stable
sounds.
At the same time, if beatboxing sounds are organized combinatorially, beatboxing
sound inventories may fill open “holes” in the inventory over time: given the current state of
sounds in a beatboxer’s inventory, they may be more likely to next learn a sound that is
composed of phonetic dimensions already under cognitive control than to learn a sound
requiring the acquisition of one or more new phonetic dimensions. There is not sufficient
diachronic data in the corpus to measure this directly; however, if we assume that a
beatboxing sound’s corpus frequency is proportional to how early it was learned (higher
frequency indicating earlier acquisition) then cataloging beatboxing sounds from high
frequency to low frequency should yield a growing phonetic feature space. The highest
frequency sounds would be expected to differ along relatively few phonetic dimensions; as
sounds of lesser frequency are added to the inventory, we would expect to find that they tend
to fill gaps in the existing phonetic dimension space when possible before opening new
phonetic dimensions. But if the sounds are dispersed non-combinatorially, we may instead
expect to find that even the earliest or most frequent sounds make use of as many phonetic
dimensions as possible to maximize their distinctiveness, with the rest of the sounds fitting
into the spaces between.
The initial list of sounds was designed to be as encompassing as possible in this study.
The 39 sounds which the beatboxer overtly identified and 16 more which were determined
by the analyst to have qualitatively distinct articulation were combined into a list of 55
sounds. The frequency of each sound was calculated by counting how many times it
appeared in the data set overall (token frequency) and how many separate beat patterns it
appeared in (beat pattern frequency). The token frequency distribution analysis of the full
set of sounds is presented in section 3.1. To minimize the impact of infrequent sounds on the
featural/dimensional organization analysis (section 3.2), the list of sounds is presented in
four groups from high to low beat pattern frequency.
Details about the acquisition and annotation of beatboxing real-time MR videos can
be found in Chapter 2: Method. Counts of each beatboxing sound were collected from 46
beat patterns based on their drum tab annotations. Each ‘x’ in a drum tab corresponded to
one token of a sound. Each sound was counted according to both its token frequency (how
many times it shows up in the whole data set) and its beat pattern frequency (how many
different beat patterns it shows up in).
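The counting procedure above can be sketched in a few lines of Python. The tab format used here is a simplified assumption (one line of characters per sound per pattern, with ‘x’ marking a token); it is not the annotation scheme of the corpus itself, and the pattern contents are invented for illustration.

```python
from collections import Counter

def tab_frequencies(beat_patterns):
    """Token and beat pattern frequency from drum tab annotations.

    Each beat pattern maps a sound label to a drum tab line in which every
    'x' marks one token of that sound. A sound's token frequency sums its
    'x' marks across all patterns; its beat pattern frequency counts the
    number of distinct patterns it appears in.
    """
    token_freq = Counter()
    pattern_freq = Counter()
    for pattern in beat_patterns:
        for sound, tab in pattern.items():
            hits = tab.count('x')
            if hits:
                token_freq[sound] += hits
                pattern_freq[sound] += 1  # counted once per pattern
    return token_freq, pattern_freq

# Two hypothetical beat patterns in a simplified tab notation:
patterns = [
    {'B': 'x---x---', 't': '--x---x-', 'PF': '----x---'},
    {'B': 'x--x----', '^K': '----x---'},
]
tokens, beats = tab_frequencies(patterns)
print(tokens['B'], beats['B'])  # → 4 2
```

Sorting `tokens` by count then reproduces the rank-frequency ordering used in the analyses below.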
Certain sounds and articulations were labeled in drum tabs but excluded from the
analyses: audible exhalations and inhalations, lip licks (linguolabial tongue tip closures)
which presumably help to regulate moisture on the lips but not to make sound,
non-sounding touches of the tongue to the teeth or alveolar ridge, and lip
spreading/constricting, akin to lip rounding in speech and useful for raising and lowering the
frequency of higher amplitude spectral energy. None of these were identified by the
beatboxer as distinct sounds, nor were they clearly associated with the articulation of any
nearby sounds.
3. Results
Section 3.1 examines the overall frequency distribution of the beatboxing sounds in the data
set. Section 3.2 digs further into the production of the most frequent sounds in order to
evaluate whether they are organized combinatorially.
3.1 Frequency distribution
Figure 21 shows the token frequency of each beatboxing sound in decreasing order of
frequency. Lighter shaded bars show the token frequency for sounds that only occurred in
one beat pattern in the data set, and the darker bars are sounds that occurred in two or more
beat patterns. Beat pattern frequency does not factor into the power law fitting procedure,
but will be used in section 3.2. The most frequent sound appears much more often than any
of the others; the next few most frequent sounds rapidly decrease in frequency from there.
The bulk of the sounds have relatively low and gradually decreasing frequency.
There are many different types of frequency distributions, but one commonly
associated with language that results in a similar distribution is Zipf's Law—a discrete power
law (zeta distribution) with a particular relationship among the relative frequencies of the
items observed (Zipf, 1949). A distribution is Zipfian when the second most frequent item is
half as frequent as the most frequent item, the third most frequent item is one third as
frequent as the most frequent item, and so on. To put numbers to it, if there were 100
instances of the most common item in a corpus, the second most common item should occur
50 times, the third most common item 33 times, the fourth 25, and so on. With respect to
language, Zipf’s Law is known for describing the frequency distribution of words in a corpus:
function words tend to be very frequent, accounting for large portions of the token
frequency, while other words have relatively low frequency. On the other hand, the
distribution of sound types (phones) in linguistic corpora is non-Zipfian: Zipf's Law
overestimates the frequencies of both the highest and lowest frequency phones while
under-estimating the frequencies of phones in the middle (Lammert et al., 2020).
Zipf’s Law is expressed mathematically below, where r represents an item’s frequency rank
(i.e., the third most frequent item has r = 3) and f_r represents the frequency of the rth item.

f_r = f_1 · (1/r)
With respect to this data set, a Zipfian rank-frequency distribution of beatboxing sounds is
predicted to fit the equation above with f_1 = 330, because there were 330 instances of the
most frequent sound, the Kick Drum:

f_r = 330 · (1/r)
Power laws take the more general form in the equation below; Zipf’s Law is the special case
where b = 1 (and a = f_1). In this form, the parameters a and b can be estimated by
non-linear least squares regression using MATLAB’s fit function set to “power1”.

f(r) = a · r^(−b)

It is difficult to demonstrate conclusively that the frequency distribution of beatboxing
sounds actually follows Zipf’s Law, or even that it follows a power law versus, say, a sum of
exponentials or a log-normal distribution—any distribution that is similar in general but with
somewhat different mathematical properties. Even so, estimating the parameters a and b from
the data as described above yields a = 325.6 (95% CI: 316.9, 334.3) and b = 1.025 (95% CI:
0.996, 1.054), putting the hypothesized Zipf’s Law parameters a = 330 and b = 1 within the
95% confidence intervals of both parameter estimates. The fit has a sum-squared error of
1152.1 and a root-mean-square error of 4.66, with R² = 0.9914 (adjusted R² = 0.9912,
dfe = 53). A visualization of the Zipf’s Law parameters is overlaid on the token frequencies
in Figure 21 as a black line.
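As a sketch of the fitting idea (the analysis itself used MATLAB’s fit with “power1”), the Python snippet below estimates a and b by ordinary linear regression in log-log space, where a power law becomes a straight line. The input frequencies are illustrative, exactly Zipfian stand-ins with f_1 = 330; they are not the real token counts.

```python
import math

def fit_power_law(freqs):
    """Estimate (a, b) for f(r) = a * r**(-b) by least squares in log-log space.

    A power law is a straight line in log-log coordinates:
        log f(r) = log a - b * log r
    so linear regression of log f on log r recovers both parameters. (This
    linearization is a simpler stand-in for the non-linear least squares fit
    used in the text; the two can give slightly different estimates.)
    """
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
            sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)

# Illustrative, exactly-Zipfian frequencies with f_1 = 330 (not the real counts):
freqs = [330 / r for r in range(1, 56)]
a, b = fit_power_law(freqs)
print(round(a, 1), round(b, 3))  # → 330.0 1.0
```

On real, noisy counts the recovered b would deviate from 1, which is exactly what the confidence intervals reported above quantify.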
The goodness of fit to the power law can be evaluated from other graphs. Figures 22
and 23 show the residuals of the fit: the fitted model slightly underestimates tokens of frequency
rank 11-21, then slightly over-estimates the rest of the sounds in the long tail of the
distribution. The systematicity of the residuals suggests that the model may not be an ideal
fit, though overestimating the frequency of items in the tail is a relatively common finding in
other domains where Zipf's Law is said to apply. Figure 24 shows the log-log plot of the
frequency distribution and the Zipf's Law fit. Power laws plotted this way resemble a straight
line with a slope equal to the exponent in power law notation; distributions with Zipf’s Law
should therefore resemble a line with a slope of -1. Figure 25 shows the cumulative
probability of the sounds, representing for each sound type (x axis) what proportion of all
the tokens in the data set is that sound or a more frequent sound. The benefit of the
cumulative probability graph is to quickly estimate how much of the data can be accounted
for with groups of sounds of a certain frequency or higher; for example, the five most
frequent sounds account for over 50% of tokens. Again, the first few most frequent sounds
are disproportionately represented in the data while the majority of sound types appear only
rarely. The figure also shows the cumulative probability of the power law fit to the data,
again as a black line.
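The cumulative probability curve described above is straightforward to compute; a minimal Python sketch follows, with made-up Zipf-like counts standing in for the real data.

```python
def cumulative_probability(token_counts):
    """Discrete cumulative probability over sounds sorted by token frequency.

    Returns, for each frequency rank, the proportion of all tokens accounted
    for by that sound together with all more frequent sounds.
    """
    counts = sorted(token_counts, reverse=True)
    total = sum(counts)
    cdf, running = [], 0
    for c in counts:
        running += c
        cdf.append(running / total)
    return cdf

# Hypothetical Zipf-like counts with f_1 = 330 (not the dissertation's data):
counts = [round(330 / r) for r in range(1, 56)]
cdf = cumulative_probability(counts)
# Rank at which cumulative probability first exceeds 50%:
print(next(i + 1 for i, p in enumerate(cdf) if p > 0.5))
```

Reading the curve this way gives the kind of statement made above, e.g. how many of the most frequent sounds are needed to cover half of all tokens.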
Regardless of the specific nature of the distribution, the frequency distribution of
beatboxing sounds seems to resemble the Zipfian frequency distribution of words, but not
the frequency distribution of phones.
Figure 21. Rank-frequency plot of beatboxing sounds. Beatboxing sound frequencies roughly
follow a power law: the few most frequent sounds are very frequent and most of the sounds
are much less frequent.
Figure 22. Histogram of the residuals of the power law fit. Most sounds have a token
frequency within 5 tokens of their expected frequency.
Figure 23. Scatter plot of the residuals of the power law fit (gray) against the expected values
(black). The middle-frequency sounds are a little under-estimated and the lower-frequency
sounds are a little over-estimated.
Figure 24. Log-log plot of the token frequencies (gray) against the power law fit (black).
Figure 25. The discrete cumulative density function for the token frequencies of the sounds
in this data set (gray) compared to the expected function for sounds following a power law
distribution (black).
3.2 Sounds and compositionality
In this section, beatboxing sounds are presented in decreasing order of beat pattern
frequency instead of token frequency under the premise that the most stable and flexible
beatboxing sounds will occur in multiple beat patterns. Sounds with low beat pattern
frequency often have low token frequency, but certain high token frequency sounds that
were performed in only one pattern are omitted (like a velar closure {k}, which is the 7th most
frequent token) or deferred until a later section in accordance with their beat pattern
frequency (like the Clop {C} which is the 12th most frequent token). Whenever reference is
made to a sound's relative frequency or to the cumulative frequency of a set of sounds,
however, those high token frequency sounds are still part of the calculation. Figure 26 shows
a revision of the cumulative probability distribution in which sounds are ordered by beat
pattern frequency (black) instead of token frequency (lighter gray).
Figure 26. The discrete cumulative density function of the token frequency of sounds in this
data set (gray, same as Figure 25) against the density function of the same sounds
re-ordered by beat pattern frequency order (black).
The analysis of the compositionality of beatboxing sounds is presented in five parts. Sections
3.2.1-3.2.4 introduce beatboxing sounds with articulatory descriptions, then summarize the
phonetic dimensions involved in making those sounds. The sounds are presented according
to their beat pattern frequency: section 3.2.1 presents the five sounds that appear in more
than 10 beat patterns each, covering more than 50% of the cumulative token frequency of the
data set; section 3.2.2 adds seven sounds that appear in four or more beat patterns; and,
section 3.2.3 introduces ten sounds that each appear in two or more beat patterns. Section 3.2.4
adds another 20 lowest-frequency sounds for a total of 43 sounds. Section 3.2.5 summarizes
with an account of the overall compositional makeup of all the presented beatboxing sounds.
Articulatory descriptions of each sound are accompanied by images from real-time
MRI videos representing stages in the articulation of the sound (see Chapter 2: Method for
details of video acquisition and sound elicitation). Usually the images come from one
instance of the sound performed in isolation; some sounds were only performed in beat
patterns, so for those sounds the images come from one instance of a sound in a beat pattern.
Some of the videos from which these images were taken are available online at
https://sail.usc.edu/span/beatboxingproject.
While most articulatory descriptions will rely on well-established phonetic
terminology, the phonetic dimension of constriction degree will involve three terms that are
not usually used or may be unfamiliar: compressed, contacted, and narrow. A compressed
constriction degree involves a vocal closure in which an articulator pushes itself into another
surface (or in the case of labial sounds, the lips may push each other). Compressed
constriction degree is used for many speech stops and affricates, and will be a key property of
many beatboxing sounds as well. Contacted constriction degree refers to a lighter closure in
the vocal tract which results in a trill when air is passed through it. Narrow constriction
degree refers to a constriction that is sufficiently tight to cause airflow to become turbulent;
it is used the same way in Articulatory Phonology (Browman & Goldstein, 1989).
Abbreviations for the sounds are provided in two notation formats: IPA and BBX.
Transcription in IPA notation incorporates symbols from the extensions to the International
Phonetic Alphabet for disordered speech (Duckworth et al., 1990, Ball et al., 2018b) and the
VoQS System for the Transcription of Voice Quality (Ball et al., 1995; Ball et al., 2018a). The
BBX notation (an initialism deriving from the word “beatbox”) is the author’s variant of
Standard Beatbox Notation (SBN; Stowell, 2003; Tyte & SPLINTER, 2014). At the time of
writing, Standard Beatbox Notation does not include annotations for many newer or less
common sounds. BBX is not meant to contribute to standardization, but simply to provide
functional labels for the sounds under discussion. In a few cases, BBX uses alternative labels
for sounds that SBN already has a symbol for (for example, the Inward Liproll in SBN is
{BB^BB} and in BBX is {LR}). BBX and SBN notations are indicated with curly brackets {}.
Unlike IPA transcriptions in which a single symbol is intended to correspond to a single
sound, BBX and SBN annotations frequently use multiple symbols to denote a single sound
(e.g., {PF} to represent a single PF Snare).
3.2.1 High-frequency sounds
3.2.1.1 Articulatory description of high-frequency sounds
Table 1 at the end of this section summarizes the high frequency beatboxing sounds in list
form. Tables 2-4 show the organization of the sounds based on their place of articulation,
constriction degree, airstream mechanism, and musical role. Unless otherwise indicated, the
MRI images presented in the figures below represent a sequence of snapshots at successive
temporal stages in the production of a sound.
Kick Drum
Figure 27. The forced Kick Drum.
The Kick Drum {B} mimics the kick drum sound of a standard drum set. It is one of the most
well-studied sounds in beatboxing science literature, and consistently described as a voiceless
glottalic egressive bilabial plosive (Proctor et al., 2013; de Torcy et al., 2014; Blaylock et al.,
2017; Patil et al., 2017; Dehais-Underdown, 2019). First a complete closure is made at the lips
and glottis, then larynx raising increases intraoral pressure so that a distinct “popping” sound
is produced when lip compression is released. The high-frequency rank of the Kick Drum is
likely due to a variety of factors: it is common in the musical genres on which beatboxing is
based; it replaces the [b] in the “boots and cats” phrase commonly used to introduce new
English beatboxers to their first beat pattern; and, it is frequently co-produced with other
sounds like trills (basses and rolls).
PF Snare
Figure 28. The PF Snare.
The PF Snare {PF} is a labial affricate; it begins with a full labial closure, then transitions to a
brief labio-dental fricative. That the PF Snare is a glottalic egressive sound is evidenced by
the raised larynx height in the third image.
Inward K Snare
Figure 29. The Inward K Snare.
The Inward K {^K} (sometimes referred to simply as a K Snare due to its high frequency) is a
voiceless pulmonic ingressive lateral velar affricate. In producing the Inward K, the tongue
body initially makes a closure against the palate. It then shifts forward, with at least one side
lowering to produce a moment of pulmonic ingressive frication. The lateral quality is not
directly visible in these midsagittal images; however, laterality can be deduced by observing
that the tongue body does not lose contact with the palate in the midsagittal plane: if the
tongue is blocking the center of the mouth, then air can only enter the mouth past the sides
of the tongue.
Unforced Kick Drum
Figure 30. The unforced Kick Drum.
The Kick Drum is sometimes referred to as a “forced” sound. An “unforced” version of the
Kick Drum has also been observed in some beatboxing productions. This unforced Kick
Drum {b} has no observable larynx closure and raising like that of the forced Kick Drum;
instead, it is produced with a dorsal closure along with the closure and release of the lips.
Note however that the tongue body does not generally shift forward or backward during the
production of this unforced Kick Drum; the airstream is therefore neither lingual egressive
nor lingual ingressive, but neutral—a “percussive”, a term for a sound lacking airflow
initiation due to pressure or suction buildup. The source of the sound in a percussive is the
noise produced by the elastic compression then release of the contacting surfaces (Catford,
1977). Section 3.2.2 expands the scope of percussive sounds slightly in the context of
beatboxing to include sounds with a relatively small amount of tongue body retraction which
signals the presence of lingual ingressive airflow (“relatively” here compared to other lingual
ingressive sounds which have much larger tongue body retraction).
The extensions to the IPA (Ball et al., 2018) offer the symbol [ʬ] for bilabial
percussives. The unforced Kick Drum is likely a context-dependent alternative form of the
more common forced Kick Drum, as discussed at greater length in Chapter 5: Alternations.
(The same chapter also includes an articulatory comparison between three compressed
bilabial sounds—the forced Kick Drum, the unforced Kick Drum, and the Spit Snare.)
Closed Hi-Hat
Figure 31. The Closed Hi-Hat.
The Closed Hi-Hat {t} is a voiceless glottalic egressive apical alveolar affricate. The tongue tip
rises to the alveolar ridge to make a complete closure while the vocal folds close and the
larynx lifts to increase intraoral pressure.
3.2.1.2 Composition summary of high-frequency sounds
Tables 2-4 present the first five most common beatboxing sounds in this data set, all of
which appear in at least 10 beat patterns and which collectively make up more than 50% of
the cumulative token frequency. These frequently used sounds are spread across three
primary constrictors: labial (bilabial, labio-dental), coronal (alveolar), and dorsal. Three of
the sounds {B, PF, t} are glottalic egressive, one {^K} is pulmonic ingressive, and one {b} is
percussive (Table 2). Some beatboxers also use glottalic egressive dorsal sounds (e.g.,
Rimshot), but the Inward K Snare is commonly used as a way to inhale while vocalizing. The
unforced Kick Drum appears to be a context-dependent variety of Kick Drum (see Chapter
5: Alternations), indicating that the glottalic egressive Kick Drum is the default form. With
respect to airstreams, this effectively places the most common beatboxing sounds along two
airstreams: glottalic egressive for the majority, with pulmonic ingressive for the important
inhalation function of the Inward K Snare.
Of the same sounds, three {PF, t, ^K} are phonetically affricates, and two {B, b} are
stops (Table 3). Proctor et al. (2013) describe the Kick Drum {B} as another affricate; its
production may vary among beatboxers. But the phonological distinction between affricate
and stop that exists in some languages does not have as clear a role in beatboxing; with only
five sounds under consideration so far that mostly vary by constrictor, a simpler description
is that all of these sounds are produced with a compressed closure similar to what both stops
and affricates in speech require. The nature of the release—briefly sustained or not—likely
enhances the similarity of each sound to its musical referent on the drum kit, but may not be
a phonetic dimension along which beatboxing sounds vary.
If beatboxing sounds were organized to maximize distinctiveness without any other
organizational constraints, these five most frequent sounds should be expected to be
completely different with respect to common articulatory dimensions like constriction
degree (similar to manner of articulation), constrictor (place of articulation), coordination of
primary and pressure-change-initiator actions (airstream mechanism), as well perhaps as
duration, nasality, voicing, or other phonetic dimensions. Instead, the sounds vary by
constrictor but share the same qualitative constriction degree, lack of nasality, lack of
voicing, and all but one share the same airstream mechanism.
Table 1. Notation and descriptions of the most frequent beatboxing sounds.

Sound name | BBX | IPA | Description | Token frequency | Cumulative probability | Beat pattern frequency
Forced Kick Drum | {B} | [p’] | Voiceless glottalic egressive bilabial stop | 330 | 23.44% | 34
PF Snare | {PF} | [p͡f'] | Voiceless glottalic egressive labiodental affricate | 136 | 33.10% | 23
Inward K Snare | {^K} | [k͡ʟ̝̊↓] | Voiceless pulmonic ingressive lateral velar affricate | 91 | 39.56% | 16
Unforced Kick Drum | {b} | [ʬ] | Voiceless percussive bilabial stop | 117 | 47.87% | 14
Closed Hi-Hat | {t} | [ts’] | Voiceless glottalic egressive alveolar affricate | 70 | 52.94% | 12
Table 2. The most frequent beatboxing sounds displayed according to constrictor (top) and
airstream (left).

Airstream | Bilabial | Labiodental | Coronal (alveolar) | Dorsal
Glottalic egressive | B | PF | t |
Pulmonic ingressive | | | | ^K
Percussive | b | | |
Table 3. The most frequent sounds displayed according to constrictor (top) and constriction
degree (left).

Constriction degree | Bilabial | Labiodental | Coronal (alveolar) | Dorsal
Compressed | B, b | PF | t | ^K
Table 4. The most frequent sounds displayed according to constrictor (top) and musical role
(left).

Musical role | Bilabial | Labiodental | Coronal (alveolar) | Dorsal
Kick | B, b | | |
Hi-Hat | | | t |
Snare | | PF | | ^K
3.2.2 Medium-frequency sounds
3.2.2.1 Articulatory description of medium frequency sounds
Dental closure, linguolabial closure, and alveolar closure
The dental closure, linguolabial closure, and alveolar closure were not identified as distinct
sounds by this beatboxer, and therefore were not given names referring to any particular
musical effect. They are each categorized as a percussive coronal stop, made with the tongue
tip just behind the teeth (dental), touching the alveolar ridge (alveolar), or placed between
the lips (linguolabial).
“Percussive” may be somewhat misleading for these sounds. Each of these sounds is
produced with a posterior dorsal constriction, just like the percussive unforced Kick Drum.
But unlike the unforced Kick Drum, in each of these sounds there is a relatively small
amount of tongue body retraction. This makes them phonetically lingual ingressive sounds
rather than true percussives which are described as sounds produced without inward or
outward airflow. (The linguolabial closure is also found without a dorsal closure, and in
those cases is definitely not lingual ingressive.)
Earlier, the choice was made to not distinguish between constriction release types
stop and affricate because there is no evidence here that beatboxing requires such a
distinction. For the dental, linguolabial, and alveolar clicks, however, there is evidence to
suggest that they should not be grouped with other lingual ingressive sounds that will enter
the sound inventory in section 3.2.3. Articulatorily, there is a great difference between these
“percussives” and other lingual ingressive sounds with respect to the magnitude of their
tongue body retraction. The image sequence in Figure 36 shows the production of an alveolar
closure followed immediately by a Water Drop (Air). Both sounds have tongue body
retraction that indicates a lingual ingressive airstream, but the movement of the tongue body
in the alveolar closure (frames 1-2) is practically negligible compared to the movement of the
tongue body in the Water Drop (Air) (frames 3-4). The same holds for the other sounds
coded as lingual ingressive in this chapter. In later chapters, we will also see evidence that the
dental closure and perhaps some other of these “percussive” sounds are context-dependent
variants of other more common sounds (the Closed Hi-Hat and PF Snare).
Figure 32. The dental closure.
Figure 33. The linguolabial closure (dorsal).
Figure 34. The linguolabial closure (non-dorsal).
Figure 35. The alveolar closure.
Figure 36. The alveolar closure (frames 1-2) vs the Water Drop (Air). The jaw lowering and
tongue body retraction for the alveolar closure is of lesser magnitude.
Spit Snare
Figure 37. The Spit Snare.
The Spit Snare corresponds to the Humming Snare of Paroni et al. (2021), which seems to
have two variants in the beatboxing community: the first, which Paroni et al. (2021)
reasonably describe as a lingual egressive bilabial stop with a brief high frequency trill
release; and the second, sometimes also called a Trap Snare, BMG Snare, or Döme Snare
(due to its popularization by beatboxing artists BMG and Döme (Park, 2017)), which appears
to be a bilabial affricate. The latter articulation is the one described here.
This Spit Snare is a lingual egressive bilabial affricate, produced by squeezing air
through a tight lip compression, creating a short spitting/squirting sound reminiscent of a
hand clap. To create the high oral air pressure that pushes the air through the lip closure, the
volume of the oral cavity is quickly reduced by tongue body fronting and jaw raising. The
lips appear to bulge slightly during this sound, either due to the high air pressure or to the
effort exerted in creating the lip compression.
The IPA annotation for the Spit Snare is composed of the symbol for a bilabial click
(lingual ingressive) tied to the symbol for a voiceless bilabial fricative (pulmonic egressive)
followed by an upward arrow. The upward arrow was part of the extensions to the IPA until
the 2008 version, meant to be used as a diacritic in combination with pre-existing click
symbols to represent “reverse clicks” (Ball et al., 2018:159), but was removed in later versions
because such articulations are rarely encountered even in disordered speech (Ball et al.,
2018). The same notation of a bilabial click with an upward arrow was used by Hale & Nash
(1997) to represent the lingual egressive bilabial “spurt” attested in the ceremonial language
Damin. Note that the downward arrow is not complementarily used for lingual ingressive
sounds; instead, its use both in the extensions to the IPA and here is to mark pulmonic
ingressive sounds (designated “Inward” sounds by beatboxers) like the Inward K Snare.
Throat Kick
Figure 38. The Throat Kick.
Another member of the Kick family of sounds is the Throat Kick (also called a Techno Kick,
808 Kick, or Techno Bass,
https://www.humanbeatbox.com/techniques/sounds/808-kick/). Throat Kicks are placeless
implosives: while there is always an oral closure coproduced with glottal adduction, lowering,
and voicing, it does not seem to matter where the oral constriction is made. In isolation, this
beatboxer produces the Throat Kick with full oral cavity closure from lips to velum; in the
beat pattern showcasing the Throat Kick, the oral closure is an apical alveolar one. (This
latter articulation is the origin of the chosen IPA notation for this sound, an unreleased
alveolar implosive [ɗ̚]). Supralaryngeal cavity expansion (presumably to aid the brief voicing
and also to create a larger resonance chamber) is achieved through tongue root fronting,
slight retraction of the pharynx, and lowering of the larynx.
Inward Liproll
Figure 39. The Inward Liproll.
The Inward Liproll is a voiceless pulmonic ingressive bilabial trill. It is usually performed
with lateral labial contact. Note that in this example, as in others, the Inward Liproll is
initiated by a forced Kick Drum. Frames 1-3 show the initial position of the vocal tract, the
initiation of the Kick Drum, and the release of the Kick Drum. In frame 4, the
lips—particularly the lower lip—have been pulled inward over the teeth. Frame 5 shows the
final position the tongue body adopts during this sound.
Tongue Bass
Figure 40. The Tongue Bass.
The Tongue Bass is a pulmonic egressive alveolar trill. The tongue tip makes loose contact
with the alveolar ridge, then air is expelled from the lungs through the alveolar closure,
causing the tongue tip to vibrate. The arytenoid cartilages appear to be in frame in the later
images, but the thyroarytenoid muscles (which would appear as a bright spot separating the
trachea from the supralaryngeal airway) are not; this means that the sound is voiceless. This
beatboxer distinguishes between the Tongue Bass here and a Vocalized Tongue Bass which
does have voicing (as well as a High Tongue Bass in which the thyroarytenoid muscles are
even clearer).
3.2.2.2 Composition summary of medium-frequency sounds
Table 5 adds the next 7 most common beatboxing sounds (a total of 12 sounds), all of which
appear in four or more beat patterns in the data set and comprise about 70% of the
cumulative token frequency. The introduction of these seven sounds expands the system
along three dimensions relative to the earlier most frequent five. First, a new constriction
degree: in addition to the earlier compressed closures, light contact that results in trills is
now used as well. Second, whereas the tongue tip was previously responsible for only one
sound (an alveolar closure), it now performs five sounds: three alveolar, and two with
different constriction location targets. Third, the glottalic ingressive, pulmonic egressive,
and lingual egressive airstreams are added for the Throat Kick, Tongue Bass, and Spit Snare
respectively.
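As a concrete illustration of how the cumulative token frequency figures reported in these tables arise, the running coverage can be computed from raw token counts. The sketch below uses invented counts and an invented total, not the actual data set:

```python
def cumulative_probability(counts, total_tokens):
    """Running share of all tokens covered as sounds are added in
    descending order of token frequency."""
    shares = []
    running = 0
    for count in counts:
        running += count
        shares.append(running / total_tokens)
    return shares

# Hypothetical token counts for the five most frequent sounds of a toy
# inventory, out of a hypothetical 1,000-token data set.
coverage = cumulative_probability([300, 150, 100, 80, 70], total_tokens=1000)
```

Reading off the last element shows what share of the data set a frequency-ranked prefix of the inventory covers, which is how statements like "about 70% of the cumulative token frequency" are grounded.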
Five of the seven sounds use the same compressed constriction degree type as the
most frequent sounds while filling out different constriction location options—though
bilabial and coronal sounds are more popular than the others. The Tongue Bass and Inward
Liproll open a new constriction degree value of light contact but capitalize on the bilabial
and alveolar constrictor locations that already host the most compressed sounds, doubling
down on these two particular constriction locations.
Airstream mechanism is expanded by these sounds. Whereas the five most common
sounds used three airstreams (and only two if you don’t count the percussive unforced Kick
Drum, because it almost always occurs in restricted environments), adding the new sounds
increases the number of airstream mechanism types to six (or five, if you again treat the
percussives as alternants of other sounds). The airstream expansions do not follow any
particular trend: the glottalic ingressive sound is a laryngeal kick, the pulmonic egressive
sound is a coronal bass, and the lingual egressive sound is a bilabial snare.
Overall, places of articulation and the compressed constriction degree established by
the highest frequency sounds continue to be used by the medium frequency sounds, but the
new sounds also expand the system’s dimensions in a few directions.
Table 5. Notation and descriptions of the medium-frequency beatboxing sounds.
Sound name | BBX | IPA | Description | Token frequency | Cumulative probability | Beat pattern frequency
Dental closure | {dc} | [k͜ǀ] | Voiceless percussive dental stop | 37 | 55.47% | 9
Linguolabial closure | {tbc} | [ʘ̺, t̼] | Voiceless percussive linguolabial stop | 23 | 57.10% | 9
Spit Snare | {SS} | [ʘ͡ɸ↑] | Voiceless lingual egressive bilabial affricate | 29 | 59.16% | 6
Throat Kick | {u} | [ɗ̚] | Voiced glottalic ingressive unreleased placeless stop | 50 | 62.71% | 5
Inward Liproll | {^LR} | [ʙ̥↓] | Voiceless pulmonic ingressive bilabial trill | 31 | 64.91% | 5
Tongue Bass | {TB} | [r̥] | Voiceless pulmonic egressive alveolar trill | 27 | 66.83% | 5
Alveolar closure | {ac} | [k͜ǃ] | Voiceless percussive alveolar stop | 27 | 68.75% | 4
Table 6. High and medium frequency beatboxing sounds displayed by constrictor (top) and
airstream mechanism (left). Medium frequency sounds are marked with an asterisk (*); the
linguolabial, dental, and alveolar columns are coronal.
Airstream | Bilabial | Labiodental | Linguolabial | Dental | Alveolar | Dorsal | Laryngeal
Glottalic egressive | B | PF | | | t | |
Glottalic ingressive | | | | | | | u*
Pulmonic egressive | | | | | TB* | |
Pulmonic ingressive | ^LR* | | | | | ^K |
Lingual egressive | SS* | | | | | |
Percussive | b | | tbc* | dc* | ac* | |
Table 7. High and medium frequency sounds displayed by constrictor (top) and constriction
degree (left). Medium frequency sounds are marked with an asterisk (*); the linguolabial,
dental, and alveolar columns are coronal.
Constriction degree | Bilabial | Labiodental | Linguolabial | Dental | Alveolar | Dorsal | Laryngeal
Compressed | B, b, SS* | PF | tbc* | dc* | t, ac* | ^K | u*
Contacted | ^LR* | | | | TB* | |
Table 8. High and medium frequency beatboxing sounds displayed by constrictor (top) and
musical role (left). Medium frequency sounds are marked with an asterisk (*); the
linguolabial, dental, and alveolar columns are coronal.
Musical role | Bilabial | Labiodental | Linguolabial | Dental | Alveolar | Dorsal | Laryngeal
Kick | B, b | | | | | | u*
Hi-Hat | | | (tbc*) | (dc*) | t, (ac*) | |
Snare | SS* | PF | | | | ^K |
Roll | ^LR* | | | | | |
Bass | | | | | TB* | |
3.2.3 Low-frequency sounds
3.2.3.1 Articulatory description of low-frequency sounds
Humming
Figure 41. Humming.
Humming is phonation that occurs when there is a closure in the oral cavity but air can be
vented past a lowered velum through the nose. This beatboxer did not identify humming as a
distinct sound per se, but did identify a beat pattern that featured “Humming while
Beatboxing” which is discussed more in Chapter 6: Harmony. The example of humming
shown here was co-produced with an unforced Kick Drum.
Vocalized Liproll, Inward
Figure 42. The Vocalized Liproll, Inward.
This sound is a voiced pulmonic ingressive labial trill. Like some other trills in this data set, it
generally begins with a Kick Drum (frames 1-3).
Closed Tongue Bass
Figure 43. The Closed Tongue Bass.
The Closed Tongue Bass is a glottalic egressive alveolar trill performed behind a labial
closure. As with phonation (or any other vibration of this nature), air pressure behind the
closure must be greater than air pressure in front of the closure. Egressive trills usually have
higher air pressure behind the trilling constriction because atmospheric pressure is relatively
low; for the Closed Tongue Bass, the area between the lips and the tongue tip is where
relatively low pressure must be maintained. This appears to be accomplished by allowing the
lips (and possibly cheeks) to expand, increasing the volume of the chamber while it fills with
air. In the beat pattern that features the Closed Tongue Bass, the beatboxer also uses glottalic
egressive alveolar trills with spread lips, presumably as a non-closed variant of the Closed
Tongue Bass.
Liproll
Figure 44. The Liproll.
The Liproll is a lingual ingressive bilabial trill. It begins with the lips closed together and
the tongue body pressed into the palate. The tongue body then shifts backward, creating a
partial vacuum into which air flows across the lips, initiating a labial trill.
Water Drop (Tongue)
Figure 45. The Water Drop (Tongue).
The Water Drop (Tongue) is one of two strategies in this data set for producing a water drop
sound effect, the other being the Water Drop (Air). The Water Drop (Tongue) is a lingual
ingressive palatoalveolar stop with substantial lip rounding. With rounded lips, the tongue
body makes a closure by the velum, and the tongue tip makes a closure at the alveolar ridge;
the tongue tip constriction is then released, mimicking the sound of the first strike of a water
droplet. The narrow rounding of the lips may create a turbulent sound, similar to whistling.
(Inward) PH Snare
Figure 46. The (Inward) PH Snare.
The (Inward) PH Snare or Inward Classic Snare is a pulmonic ingressive bilabial affricate. In
these beat patterns, it was always followed by an Inward K Snare. A PH Snare closely
followed by an Inward K Snare is sometimes referred to as a PK Snare, and the beatboxer in
this study only explicitly identified the PK Snare as a sound they knew, not the PH Snare.
The choice was made to identify the PH Snare as a distinct sound because the few other
combination sounds in this data set—like the D Kick Roll and Inward Clickroll with
Whistle—also have their component pieces identified separately. (Note: the alternative
choice to treat the combo of PH Snare and Inward K Snare as a single PK Snare would
reduce the number of Inward K Snares in the data set from 91 to 78; re-assessing the power
law fit yields a slightly stronger correlation [R-squared = 0.9957, adjusted R-squared =
0.9956] but an exponent of b=1.032 [confidence interval (1.011, 1.053)] which is slightly larger
than the theoretical b=1).
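The power-law re-fit described in the note above can be sketched as an ordinary least-squares regression of log frequency on log rank. The function below is a minimal stand-in for that procedure; the counts fed to it are invented for illustration and are not the dissertation's token counts:

```python
import math

def fit_power_law(freqs):
    """Fit freq ~ C * rank^(-b) by least squares on log rank vs log freq.
    Returns (b, r_squared)."""
    freqs = sorted(freqs, reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return -slope, 1 - ss_res / ss_tot

# Illustrative token counts only (not the dissertation's data set)
counts = [180, 91, 60, 45, 37, 31, 29, 27, 23, 19, 16, 13, 12, 8, 7, 6]
b, r_squared = fit_power_law(counts)
```

With real counts, the returned exponent plays the role of b and the returned R-squared the role of the goodness-of-fit values reported in the note.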
Inward Clickroll
Figure 47. The Inward Clickroll.
The Inward Clickroll (also called Inward Tongue Roll) is a voiceless pulmonic ingressive
central sub-laminal retroflex trill. The tongue tip curls backward so that the underside is
against the palate, and the sides of the tongue press against the side teeth so that the only air
passage is across the center of the tongue. The lungs expand, pulling air from outside the
body between the underside of the tongue blade and the palate, initiating a trill.
Open Hi-Hat
Figure 48. The Open Hi-Hat.
The Open Hi-Hat is a voiceless central alveolar affricate with a sustained release. The initial
closure release is ejective, but the part of the release that is sustained to produce frication is
pulmonic egressive.
Lateral alveolar closure
Figure 49. The lateral alveolar closure.
The lateral alveolar closure is a percussive lateral alveolar stop.
Sonic Laser
Figure 50. The Sonic Laser.
The Sonic Laser is a pulmonic egressive bilabial fricative with an initial apical alveolar
tongue tip closure followed by a narrow palatal constriction of the tongue body during the
fricative.
Labiodental closure
Figure 51. The labiodental closure.
The labiodental closure is a voiceless percussive labiodental affricate. It is usually
accompanied by the tongue moving forward toward an alveolar closure, though it is not clear
if this tongue movement is related to the labiodental closure or the alveolar closure that
typically follows the labiodental closure. Later chapters suggest that the labiodental closure is
a percussive variant of the PF Snare.
3.2.3.2 Composition summary of low-frequency sounds
Of these new sounds, all but one use the same two constriction degrees introduced by the
high and medium frequency sounds—compressed (stops/affricates) and contacted (for trills).
The remaining sound is the Sonic Laser {SonL}, which, along perhaps with the Water Drop
(Tongue) {WDT}, uses a narrow constriction degree akin to that of speech fricatives. The majority
(7/12) of these sounds are bilabial or alveolar constrictions, following the trend from the
previous section that those two constriction locations hold more sounds than the others.
Labiodental and laryngeal constrictions were also augmented, but only one new place was
added (retroflex). This set of sounds also added the final airstream type, lingual ingressive.
Less obvious in Tables 10-12 is that these sounds introduce new phonetic dimensions
that apply to certain sound pairs. The lateral alveolar closure {tll} and alveolar closure differ
by laterality, not by place, constriction degree, or airstream. Likewise, the Inward Liproll
{^LR} and Vocalized Inward Liproll {^VLR} differ by voicing, while the Closed Hi-Hat {t}
and Open Hi-Hat {ts} differ by duration (with the latter adopting a secondary pulmonic
egressive airstream to support its length). These three dimensional additions—laterality,
voicing, and duration—are not leveraged distinctively by most beatboxing sounds.
The difficulty of capturing all the phonetic dimensions a sound uses when placing it
in an IPA-style table (or in this case, tables) is more than an issue of convenience. Using a
tabular structure for sounds is sometimes a useful proxy for assessing their periodicity
(Abler, 1989)—the degree to which sounds can be organized into groups that share similar
behavior—but relies on a certain degree of procrusteanism (Catford, 1977)—a willingness to
force the sounds into a predetermined pattern at the expense of nuanced descriptions, and a
strategy that only becomes less adequate as the beatboxing sound inventory expands. Some
consonants on the IPA table suffer from the same issue: double-articulated sounds like [w]
and non-pulmonic sounds (clicks, ejectives, implosives) do not fit into the reductive
single-articulator, pulmonic-only structure of the major IPA consonants table.
Of the sounds in this section, the Water Drop (Tongue), Sonic Laser, Open Hi-Hat,
and Closed Tongue Bass all use two values on some phonetic dimension which makes them
impossible to place on these tables. The Water Drop (Tongue), Sonic Laser, and Closed
Tongue Bass all use multiple constriction locations, and the Open Hi-Hat uses both glottalic
egressive and pulmonic egressive airstream. Sounds of this nature can be left out of the
tables, like [w] in the IPA. Otherwise, there are three ways to include these sounds on the
tables. The first way is to add a sound to multiple locations on the table to show its
multiple-articulation; this helps somewhat in small doses, but quickly gets confusing when
many sounds must be placed on the table two or more times. The second way is to add new
rows or columns or slots for double-valued dimensions; this might be a new “glottalic
egressive + pulmonic egressive” row in the airstream mechanism dimension, or a new “labial
+ coronal” column for the constrictor/place of articulation dimension. But double-valued
dimensions miss the point of having tables in the first place: the aim of the game is to look
for repetition of phonetic features in sounds, but adding new rows and columns only creates
more sparseness and hides repetition. The third way of adding double-valued sounds to the
tables is to assume that one of the dimension values is more important than the other(s) and
place the sound accordingly. This is the epitome of procrusteanism, and for simplicity it is
also the approach adopted in this chapter.
The point here, and even more importantly going forward into the lowest frequency
sounds, is that hard-to-place sounds often flesh out combinatorial possibilities by using
articulations that are already in the system to produce entirely novel sounds. But this will
sometimes not show up in analyses of the IPA-style tables because the sounds cannot be
represented adequately this way.
Table 9. Notation and description of the low-frequency beatboxing sounds.
Sound name | BBX | IPA | Description | Token frequency | Cumulative probability | Beat pattern frequency
Humming | {hm} | [C̬] | Pulmonic egressive nasal voicing | 32 | 71.02% | 2
Vocalized Liproll, Inward | {^VLR} | [ʙ↓] | Voiced pulmonic ingressive bilabial trill | 23 | 72.66% | 2
Closed Tongue Bass | {CTB} | [r'̚] | Voiceless glottalic egressive alveolar trill with optional labial closure | 19 | 74.01% | 2
Liproll | {LR} | [ʙ̥↓] | Voiceless lingual ingressive bilabial trill | 19 | 75.36% | 2
Water Drop (Tongue) | {WDT} | [ǂʷ] | Voiceless lingual ingressive labialized palatoalveolar stop | 16 | 76.49% | 2
(Inward) PH Snare | {^Ph} | [p͡ɸ↓] | Voiceless pulmonic ingressive bilabial affricate | 13 | 77.41% | 2
Labiodental closure | {pf} | [ʘ̪] | Voiceless percussive labiodental stop | 12 | 78.27% | 2
Inward Clickroll | {^CR} | [ɽ↓] | Voiceless pulmonic ingressive retroflex trill | 8 | 78.84% | 2
Open Hi-Hat | {ts} | [t's:] | Voiceless glottalic egressive alveolar affricate with sustained pulmonic egressive release | 8 | 79.40% | 2
Lateral alveolar closure | {tll} | [ǁ] | Voiceless percussive lateral alveolar stop | 7 | 79.90% | 2
Sonic Laser | {SonL} | | Pulmonic egressive labiodental fricative with a narrow tongue body constriction | 6 | 80.33% | 2
Table 10. High, medium, and low frequency sounds displayed by constrictor (top) and
airstream mechanism (left). Low frequency sounds are marked with an asterisk (*); the
linguolabial, dental, alveolar, and retroflex columns are coronal.
Airstream | Bilabial | Labiodental | Linguolabial | Dental | Alveolar | Retroflex | Dorsal | Laryngeal
Glottalic egressive | B | PF | | | t, CTB*, ts* | | |
Glottalic ingressive | | | | | | | | u
Pulmonic egressive | | SonL* | | | TB | | | hm*
Pulmonic ingressive | ^LR, ^VLR*, ^Ph* | | | | | ^CR* | ^K |
Lingual egressive | SS | | | | | | |
Lingual ingressive | LR* | | | | WDT* | | |
Percussive | b | pf* | tbc | dc | ac, tll* | | |
Table 11. High, medium, and low frequency sounds displayed by constrictor (top) and
constriction degree (left). Low frequency sounds are marked with an asterisk (*); the
linguolabial, dental, alveolar, and retroflex columns are coronal.
Constriction degree | Bilabial | Labiodental | Linguolabial | Dental | Alveolar | Retroflex | Dorsal | Laryngeal
Compressed | B, b, SS, ^Ph* | PF, pf* | tbc | dc | t, ts*, ac, WDT*, tll* | | ^K | u
Contacted | ^LR, ^VLR*, LR* | | | | TB, CTB* | ^CR* | | hm*
Narrow | | SonL* | | | | | |
Table 12. High, medium, and low frequency sounds displayed by constrictor (top) and
musical role (left). Low frequency sounds are marked with an asterisk (*); the linguolabial,
dental, alveolar, and retroflex columns are coronal.
Musical role | Bilabial | Labiodental | Linguolabial | Dental | Alveolar | Retroflex | Dorsal | Laryngeal
Kick | B, b | | | | | | | u
Hi-Hat | | | (tbc) | (dc) | t, ts*, (ac, tll*) | | |
Snare | SS, ^Ph* | PF, pf* | | | | | ^K |
Roll | ^LR, ^VLR*, LR* | | | | | ^CR* | |
Bass | | | | | TB, CTB* | | |
Sound Effect | | SonL* | | | WDT* | | | hm*
3.2.4 Lowest-frequency sounds
The previous three sections assigned categorical phonetic descriptions to the set of
beatboxing sounds that appear in more than one beat pattern in this data set. Part of the aim
of doing so was to show what types of sounds are used most frequently in beatboxing, to
avoid making generalizations that weigh a Kick Drum equally with, say, a trumpet sound
effect. This section tests the generalizations of the previous three sections by looking at
another 20 sounds, bringing the total number of sounds described from 23 to 43 (out of a
total 55 sounds, the remainder of which could not be satisfactorily articulatorily described).
If beatboxing sounds are using a somewhat limited set of the many phonetic dimensions
available to a beatboxer, then the same most common phonetic dimensions should be
re-used by these next 20 beatboxing sounds.
3.2.4.1 Articulatory description of lowest-frequency sounds
Clop
Figure 52. The Clop.
The Clop is a voiceless lingual ingressive palatal stop.
D Kick
Figure 53. The D Kick.
The D Kick is a voiceless glottalic egressive retroflex stop. The underside of the tongue tip
presses against the alveolar ridge, flipping back to an upright position upon release.
Inward Bass
Figure 54. The Inward Bass.
The Inward Bass is pulmonic ingressive voicing. The base of the tongue root participates in
the constriction, which may indicate that some structure other than (or in addition to) the
vocal folds is vibrating, such as the ventricular folds. The sound is akin to a growl. In this
case, the pulmonic airflow is directed through the nose rather than the mouth.
Low Liproll
Figure 55. The Low Liproll.
The Low Liproll is a voiceless glottalic ingressive bilabial trill. The vocal airway is quite wide,
lowering the overall resonance behind the trill to create a deeper sound. Frames 1-2 show the
forced Kick Drum that occurs at the beginning of this sound; frames 3-4 show the lips
retracted and the tongue body pulled back.
Hollow Clop
Figure 56. The Hollow Clop.
The Hollow Clop is a glottalic ingressive alveolar stop. It appears to function similarly to a
click (e.g., the Water Drop Tongue) with the tongue tip making an alveolar closure as the
front part of a seal. In this case, however, the back of the seal is glottalic, not lingual.
Retraction of the tongue and lowering of the larynx expand the cavity directly behind the
seal, resulting in the distinctive position of the tongue tip sealed to the alveolar ridge (frame
3) just before it releases quickly into a wide, open vocal posture.
Tooth Whistle
Figure 57. The Tooth Whistle.
The Tooth Whistle is a labiodental whistle, which in this analysis is treated along with
fricatives as a narrow constriction.
Voiced Liproll
Figure 58. The Voiced Liproll.
The Voiced Liproll is a voiced glottalic ingressive bilabial trill, similar to the Low Liproll and
High Liproll. The tongue body retracts during the Voiced Liproll and creates a large cavity
behind the labial constriction.
Water Drop (Air)
Figure 59. The Water Drop (Air).
The Water Drop (Air) is a voiceless lingual ingressive palatal stop with subsequent tongue
body fronting. The tongue front and tongue body make a closure, then the tongue body
moves backward to eventually pull the tongue front away from its closure as expected for a
click. Following the release of the tongue front closure, however, the tongue body shifts
forward again. This, combined with lip rounding throughout, creates the sound of a water
drop from a pop that starts with a low resonant frequency and quickly shifts to a higher
resonant frequency.
Clickroll
Figure 60. The Clickroll.
The Clickroll is a voiceless lingual egressive alveolar trill. The tongue tip and tongue body
make a closure as they would for a click. Instead of the tongue body shifting backward or
down to widen the seal, the tongue gradually fills the seal to push air past the alveolar
contact, initiating vibration.
D Kick Roll
Figure 61. The D Kick Roll.
The D Kick Roll is a combination of the D Kick and a Closed (but in this case not actually
closed) Tongue Bass. It begins with a voiceless glottalic egressive retroflex stop (the D Kick).
When the tongue tip flips upright again, it makes light contact against the alveolar ridge; the
larynx continues to rise during this closure, pushing air through to make a trill.
High Liproll
Figure 62. The High Liproll.
The High Liproll is a voiceless glottalic ingressive bilabial trill. The vocal tract airway is narrow
for the duration of the trill, raising the resonant frequencies behind the trill for a higher
sound.
Inward Clickroll with Liproll
Figure 63. The Inward Clickroll with Liproll.
The Inward Clickroll with Liproll is a combination of the Inward Clickroll and an Inward
Liproll. The Inward Clickroll begins the sound as a pulmonic ingressive retroflex trill; the lips
subsequently curl inward to make another trill vibrating over the same pulmonic ingressive
airflow.
Lip Bass
Figure 64. The Lip Bass.
The Lip Bass is a pulmonic egressive bilabial trill.
tch
Figure 65. tch.
The tch is a voiceless glottalic egressive laminal alveolar stop. The connection between the
tongue and the alveolar ridge begins with just an apical constriction but quickly transitions
to a laminal closure. The larynx rises at that point, pushing air past the closure into the tch
snare.
Sweep Technique
Figure 66. The Liproll with Sweep Technique.
The Sweep Technique is a Liproll variant in which the tongue tip connects with the
underside of the lower lip to change the frequency of the bilabial vibration.
Sega SFX
Figure 67. The Sega SFX.
The Sega SFX (abbreviation for sound effect) is composed of an Inward Clickroll and a
labiodental fricative. The lower lip is pulled farther back across the lower teeth during the
course of the sound to change the fricative frequency.
Trumpet
Figure 68. The Trumpet.
The Trumpet is a voiced pulmonic egressive bilabial (or possibly labiodental with the
connection between the upper teeth and the back of the lower lip) fricative. The tongue tip
makes intermittent alveolar closures to separate the Trumpet into notes with distinct onsets
affiliated with the musical meter.
Vocalized Tongue Bass
Figure 69. The Vocalized Tongue Bass.
The Vocalized Tongue Bass is a voiced pulmonic egressive alveolar trill.
High Tongue Bass
Figure 70. The High Tongue Bass.
The High Tongue Bass is a voiced pulmonic egressive alveolar trill, made with a higher
laryngeal position and narrower airway to raise the resonant frequency behind the trill.
Kick Drum exhale
Figure 71. The Kick Drum exhale.
The Kick Drum exhale is a forced Kick Drum produced with pulmonic egressive airflow in
addition to the usual glottalic egressive airflow. There are only two tokens of it in the data
set, and they might both be more appropriately analyzed as a true forced Kick Drum (frames
1-2) followed by a bilabial or labiodental fricative (frame 3).
3.2.4.2 Composition summary of lowest-frequency sounds
Many of the new sounds fill in gaps left by the earlier sounds. The additions of the Vocalized
Liproll {VLR} and Lip Bass {LB} fill out the bilabial place column, while the additions of the
Hollow Clop {HC} and Clickroll {CR} put a sound in every airstream of the alveolar place
column except pulmonic ingressive (which may be a practically unusable combination—the
Inward Clickroll {^CR} might be better treated typologically as an alveolar that manifests as
retroflex because of the aerodynamics required to make an ingressive trill).
Just as in the previous section, several of the sounds introduced in this section do not
fit into distinctive slots in the IPA-style tables we have established so far. The tch {tch} is a
glottalic egressive alveolar sound like the Closed Hi-Hat {t} except that it uses a laminal
closure instead of an apical closure. (It may also have a release qualitatively similar to a [tʃ].)
The Low Liproll {LLR}, High Liproll {HLR}, and Vocalized Liproll {VLR} differ with respect
to the area of the vocal airway behind the labial constriction, as do the Tongue Bass {TB} and
High Tongue Bass {HTB}. The Clop {C} and Water Drop (Air) {WDA} differ by the absence
or presence of a tongue fronting movement. These were placed in the tables procrusteanly by
ignoring the apical/laminal distinction and constrictions that one might judge as secondary
by comparison with speech sounds—this is for convenience of a tabular representation only
and not to be taken as an assumption about the actual nature of beatboxing sounds.
Six of the lowest frequency sounds were not placed on Tables 14-16 because they were
clearly composed of two major tongue and lip constrictions and were judged not to be able
to fit into a single cell: D Kick Roll {DR}, Inward Clickroll and Whistle {^CRW}, Sega SFX
{SFX}, Trumpet {T}, Loud Whistle {LW}, and Sweep Technique {st}. Each involves
constrictions from both the tongue tip and the lips.
Table 13. Notation and descriptions for the lowest frequency beatboxing sounds.
Sound name | BBX | Description | Token frequency | Beat pattern frequency
Clop | C | Voiceless lingual ingressive palatal stop | 28 | 1
D Kick | D | Voiceless glottalic egressive retroflex stop | 17 | 1
Inward Bass | IB | Pulmonic ingressive phonation | 16 | 1
Low Liproll | LLR | Voiceless glottalic ingressive bilabial trill | 13 | 1
Hollow Clop | HC | Voiceless glottalic ingressive alveolar stop | 12 | 1
Tooth Whistle | TW | Voiceless pulmonic egressive labiodental whistle | 12 | 1
Voiced Liproll | VLR | Voiced glottalic ingressive bilabial trill | 10 | 1
Water Drop (Air) | WDA | Voiceless lingual ingressive palatal stop | 8 | 1
Clickroll | CR | Voiceless lingual egressive alveolar trill | 6 | 1
D Kick Roll | DR | Voiceless glottalic egressive retroflex stop with alveolar trill | 6 | 1
High Liproll | HLR | Voiceless glottalic ingressive bilabial trill | 6 | 1
Inward Clickroll with Liproll | ^CRL | Voiceless pulmonic ingressive retroflex trill and bilabial trill | 6 | 1
Lip Bass | LB | Pulmonic egressive bilabial trill | 6 | 1
tch | tch | Voiceless glottalic egressive laminal alveolar stop | 6 | 1
Sweep technique | st | | 4 | 1
Sega SFX | SFX | Voiceless pulmonic ingressive retroflex trill with labial fricative | 4 | 1
Trumpet | T | | 4 | 1
Vocalized Tongue Bass | VTB | Voiced pulmonic egressive alveolar trill | 4 | 1
High Tongue Bass | HTB | Voiced pulmonic egressive alveolar trill with narrowed airway behind the constriction | 3 | 1
Kick Drum exhale | Bx | Voiceless pulmonic egressive bilabial stop | 2 | 1
Table 14. All the described beatboxing sounds that could be placed on a table, arranged by
constrictor (top) and airstream mechanism (left). The lowest-frequency sounds are marked
with an asterisk (*); the linguolabial, dental, alveolar, and retroflex columns are coronal, and
the palatal column is the front constrictor.
Airstream | Bilabial | Labiodental | Linguolabial | Dental | Alveolar | Retroflex | Palatal | Dorsal | Laryngeal
Glottalic egressive | B | PF | | | t, CTB, ts, tch* | D* | | |
Glottalic ingressive | LLR*, VLR*, HLR* | | | | HC* | | | | u
Pulmonic egressive | LB*, Bx* | SonL, TW* | | | TB, VTB*, HTB* | | | | hm
Pulmonic ingressive | ^LR, ^VLR, ^Ph | | | | | ^CR | | ^K | IB*
Lingual egressive | SS | | | | CR* | | | |
Lingual ingressive | LR | | | | WDT | | C*, WDA* | |
Percussive | b | pf | tbc | dc | ac, tll | | | |
Table 15. All the described beatboxing sounds that could be placed on a table, arranged by
constrictor (top) and constriction degree (left). The lowest-frequency sounds are marked
with an asterisk (*); the linguolabial, dental, alveolar, and retroflex columns are coronal, and
the palatal column is the front constrictor.
Constriction degree | Bilabial | Labiodental | Linguolabial | Dental | Alveolar | Retroflex | Palatal | Dorsal | Laryngeal
Compressed | B, b, SS, ^Ph, Bx* | PF, pf | tbc | dc | t, ts, ac, WDT, tll, HC*, tch* | D* | C*, WDA* | ^K | u
Contacted | ^LR, ^VLR, LR, LLR*, VLR*, HLR*, LB* | | | | CTB, TB, CR*, VTB*, HTB* | ^CR | | | hm, IB*
Narrow | | SonL, TW* | | | | | | |
Table 16. All the described beatboxing sounds that could be placed on a table, arranged by
constrictor (top) and musical role (left). The lowest-frequency sounds are marked with an
asterisk (*); the linguolabial, dental, alveolar, and retroflex columns are coronal, and the
palatal column is the front constrictor.
Musical role | Bilabial | Labiodental | Linguolabial | Dental | Alveolar | Retroflex | Palatal | Dorsal | Laryngeal
Kick | B, b, Bx* | | | | | D* | | | u
Hi-Hat | | | (tbc) | (dc) | t, ts, (ac, tll) | | | |
Snare | SS, ^Ph | PF, pf | | | tch* | | | ^K |
Roll | ^LR, ^VLR, LR, LLR*, VLR*, HLR* | | | | CR* | ^CR | | |
Bass | LB* | | | | TB, CTB, VTB*, HTB* | | | | IB*
Sound Effect | | SonL, TW* | | | WDT, HC* | | C*, WDA* | | hm
3.2.5 Quantitative periodicity analysis
Section 1 highlighted the difference between a system that is organized periodically with
combinatorial units (like speech) and a system that is organized to maximize distinctiveness
without repeated use of a small set of elements. So far we have seen that beatboxing sounds
do make repeated use of some phonetic properties. This means that beatboxing sounds are
combinatorial, and it also suggests that the sounds are not organized to maximize
distinctiveness by minimizing phonetic overlap. However, we have not established whether
the sounds are arranged periodically—that is, whether they appear to maximize the use of a
relatively small set of phonetic properties or appear to be distributed randomly in the
phonetic space they occupy.
The following quantitative assessment compares the periodicity of beatboxing sounds
against the periodicity of Standard American English consonants. The English consonant
system was chosen for convenience and because it has a similar number of sounds (22
beatboxing sounds will be used in this analysis; see below) and major phonetic dimensions:
23 English consonants spread across four manners of articulation, seven places of
articulation, and two voicing types (Table 19). The sound [l] is usually the 24th sound and
assumed to contrast with [r] in laterality, but since it is the only sound contrasting in
laterality it is set aside.
If beatboxing sounds are arranged periodically, then at least some sounds should be
expected to differ along only a single phonetic dimension. Two sounds that differ along only
a single dimension are a minimal sound pair. English minimal sound pairs include [p/b],
[b/m], and [t/s]. In beatboxing, the Kick Drum {B} is a minimal sound pair with the PF
Snare {PF}, Closed Hi-Hat {t}, and D Kick {D} in constrictor/place of articulation: all are
glottalic egressive and formed with a compressed constriction degree, but each is made with
different points of contact in the vocal tract. The Kick Drum is also in a minimal sound pair
with the Spit Snare {SS} and the Inward PH Snare {^Ph} along the dimension of airstream
mechanism. The first analysis (section 3.2.5.1) compares the minimal sound pair counts of
beatboxing and the English consonant system.
Periodic organization may also manifest as relatively high concentrations of sounds
along some phonetic dimensions and relatively few sounds in others. In a maximally
distributed system, on the other hand, no phonetic dimension should be used more than the
others. The second analysis (section 3.2.5.2) uses Shannon entropy as a metric of how
distributed sounds are along different phonetic dimensions.
These analyses set aside some of the beatboxing sounds that arguably constitute
varieties of a single sound. The Open Hi-Hat {ts} could be considered a variety of Closed
Hi-Hat {t} that differs only in duration of the release. The unforced Kick Drum {b}, as well as
the percussives {pf} and {dc, ac}, are argued in Chapter 5: Alternations and Chapter 6:
Harmony to be context-dependent alternants of the glottalic egressive forced Kick Drum {B},
PF Snare {PF}, and Closed Hi-Hat {t}, respectively. Vocalized Liprolls (Inward or Outward),
as well as high/low Liprolls, are voiced variations on the theme of Liproll and Inward Liproll
(though Vocalized Liproll, High Liproll, and Low Liproll all require the Liproll to be
performed as glottalic ingressive rather than as lingual ingressive). The same goes for the
Vocalized Tongue Bass and High Tongue Bass as variants of the Tongue Bass. All sound sets
like these were consolidated into a single sound for these analyses. In the interest of more
closely matching the speech sound dimensions, the two narrow sounds Sonic Laser {SonL}
and Tooth Whistle {TW} were removed. Thus, the two-way voicing contrast of English
consonants matches the now-two-valued beatboxing constriction degree dimension. The
Water Drop (Air) {WDA} was also removed as it was not distinguishable from the Clop {C}
in this reduced feature system, as was {tch} for its similarity to {t}. From the set of sounds in
section 3.2.4, this analysis excludes {SonL, TW, b, pf, tbc, dc, ac, tll, ts, LLR, HLR, VTB, HTB,
^VLR, WDA, tch}. The 22 beatboxing sounds used in this analysis are shown in Table 17.
These final sound systems sacrifice some nuance. Many of the excluded beatboxing
sounds could be analyzed as genuine minimal sound pairs with each other and the remaining
sounds; their exclusion is meant to make the analysis as conservative as possible while
simplifying the minimal sound pair search method by trimming rarely used phonetic
dimensions. Likewise, there are simplifications to both the speech and beatboxing feature
spaces. Phonetically in speech, [f, v] are labiodental while [p, b, m] are bilabial, and [tʃ, dʒ]
are affricates not stops; consolidating them into labial and stop categories reduces the
number of dimensions available in the analysis. Similar choices were made throughout this
chapter for the beatboxing sounds—for example, the Spit Snare {SS} and PF Snare {PF} have
qualitatively different releases compared to the Kick Drum {B} but all are grouped under the
compressed constriction degree. In future analyses, it would be important to explore the
beatboxing dimension space more thoroughly.
3.2.5.1 Minimal sound pairs
Consider a hypothetical maximally distributed system of 21 sounds in a three-dimensional
phonetic system (a 6 x 7 x 2 matrix of airstream x place x constriction degree). Maximal
dispersion can be created by linearizing the three-dimensional space into an 84-element
one-dimensional vector, then assigning the 21 elements to the vector at every fourth location.
That is, starting with the first position, [ X _ _ _ X _ _ _ X _ …]. The vector is then
de-linearized back into a 6 x 7 x 2 matrix, resulting in the arrangement of elements shown in
Table 18. Minimal sound pairs are found by taking the Hamming distance of each element’s
three properties: airstream, place, and constriction degree. The Hamming distance counts
how many properties of two elements are different. For example, the first two elements
assigned into the maximally distributed matrix are a compressed glottalic egressive bilabial
sound and a compressed glottalic egressive palatal sound; since they differ only by the place
dimension, their Hamming distance would be 1 and they would be listed as a minimal sound
pair. (In the matrix these are encoded as [1 1 1] and [1 5 1], respectively; the only difference is
the middle number.) The third element assigned is a compressed glottalic ingressive
labiodental sound ([2 2 1] in the matrix) which has a Hamming distance of 2 with each of
the first two sounds—no minimal sound pairs there. The maximally distributed system yields
20 minimal sound pairs from 21 sounds in a 6 x 7 x 2 space (Table 18).
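The distance computation described above can be sketched in a few lines of Python. This is a minimal illustration, not the dissertation's actual script; the triples follow the bracketed [airstream, place, constriction degree] encodings from the worked example in the text:

```python
from itertools import combinations

def hamming(a, b):
    # Count how many of the two sounds' properties differ.
    return sum(x != y for x, y in zip(a, b))

def minimal_pairs(sounds):
    # A minimal sound pair differs along exactly one dimension.
    return sum(hamming(a, b) == 1 for a, b in combinations(sounds, 2))

# Encodings from the worked example: [airstream, place, constriction degree]
s1 = (1, 1, 1)  # compressed glottalic egressive bilabial
s2 = (1, 5, 1)  # compressed glottalic egressive palatal
s3 = (2, 2, 1)  # compressed glottalic ingressive labiodental

print(hamming(s1, s2))  # 1: a minimal sound pair
print(hamming(s1, s3))  # 2: not a minimal pair
```

Running `minimal_pairs` over a full inventory encoded this way yields the pair counts reported below.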
Calculated using the same distance technique, the actual distribution of 22
beatboxing sounds in the same 6 x 7 x 2 space yields 37 minimal sound pairs (Table 17). The
Standard American English consonant system has 23 sounds in a 4 x 7 x 2 (manner x place x
voicing) space with a total of 57 minimal sound pairs (Table 19). The speech system has
fewer dimensions and more sounds, both of which increase the likely number of minimal
sound pairs. Even so, just these three minimal sound pair counts on their own do not give a
sense of whether the beatboxing and English consonant sound systems are more periodic
than if they were arranged by chance. To gain a better sense of the periodicity, random sound
distributions were created to find the likelihood of the beatboxing and speech systems
having 37 and 57 minimal sound pairs, respectively, given the number of sounds and
dimensions in their systems.
Ten thousand (10,000) random sound systems were created for each domain using
the same method as the maximally distributed system except that the elements were placed
randomly instead of at every fourth location. For simulations of beatboxing sound
distributions, 22 sounds were arranged randomly in a 6 x 7 x 2 matrix; for simulations of
speech sound distributions, 23 sounds were randomly distributed in a 4 x 7 x 2 matrix.
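The Monte Carlo procedure can be sketched as follows. This is a minimal reimplementation, assuming (as the analysis implies) that the randomly placed sounds occupy distinct cells of the feature matrix; the function and variable names are illustrative:

```python
import random
from itertools import combinations, product

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def count_minimal_pairs(sounds):
    return sum(hamming(a, b) == 1 for a, b in combinations(sounds, 2))

def simulate(n_sounds, dims, n_trials, rng):
    # Each trial scatters n_sounds into distinct cells of the
    # airstream x place x constriction-degree matrix, then counts
    # how many pairs differ along exactly one dimension.
    cells = list(product(*(range(d) for d in dims)))
    return [count_minimal_pairs(rng.sample(cells, n_sounds))
            for _ in range(n_trials)]

rng = random.Random(0)
trials = simulate(22, (6, 7, 2), 2_000, rng)
p_value = sum(t >= 37 for t in trials) / len(trials)
```

With 10,000 trials this reproduces the shape of the histograms below; the p-value estimates how often a random arrangement matches or exceeds the observed 37 minimal pairs.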
Figures 72 (beatboxing) and 73 (English consonants) show histograms of how many
minimal sound pairs were found across all trials. The purple bar in each figure marks the
actual number of minimal sound pairs calculated from Tables 17 (beatboxing) and 19
(speech). The probability of the beatboxing sound system having 37 or more minimal sound
pairs is 17.69% (about 1 standard deviation from the mean); the probability of the English
consonant system having 57 or more minimal sound pairs is 0.16% (about 3 standard
deviations from the mean). Though not marked, the hypothetical maximally dispersed
system (~20 minimal sound pairs in Figure 72) is roughly as unlikely as the number of
minimal sound pairs in the English consonant system.
The number of minimal sound pairs found in beatboxing sounds (37) is somewhat
higher than the expected value of minimal sound pairs (mean=33). Compared to the
hypothetical maximally distributed system, this beatboxer’s sound system errs on the side of
more periodic. However, the distribution of beatboxing sounds has far fewer minimal sound
pairs than expected compared to the well-ordered system of English consonants. (For the
beatboxing system to be as periodic as the English consonant system in this analysis, there
would have needed to be 45 minimal beatboxing sound pairs.) Assuming that other
languages’ consonant systems share a similar well-orderedness (as has often been claimed),
beatboxing sounds are distributed less periodically than speech consonants.
3.2.5.2 Shannon entropy
Entropy is sometimes used as a metric for the diversity of a system, with higher values
representing greater dispersion (less predictability) (Shannon, 1948). As Table 17 shows, the
22 beatboxing sounds are mostly concentrated into labial (8 sounds) and alveolar (6 sounds)
constrictions, with the remaining 8 sounds spread across labiodental (1 sound), retroflex (2
sounds), palatal (1 sound), dorsal (1 sound), and laryngeal (3 sounds) constrictions.
Compared to the other systems’ place distributions, beatboxing has the lowest entropy (2.36
bits) which means it re-uses place features the most. The English consonants are slightly less
predictable (2.56 bits), and the maximally dispersed system has the greatest entropy (2.81
bits).
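The entropy comparison can be reproduced directly from the place-of-articulation counts. The beatboxing counts come from the text above; the English counts are my own tally of Table 19 (with [r] counted as alveolar); the maximally dispersed system spreads its 21 sounds evenly, three per place:

```python
from math import log2

def shannon_entropy(counts):
    # H = -sum(p * log2(p)) over the categories with nonzero counts
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c)

beatboxing = [8, 1, 6, 2, 1, 1, 3]  # labial ... laryngeal (22 sounds)
english = [5, 2, 6, 4, 1, 4, 1]     # labial ... glottal (23 consonants)
maximal = [3] * 7                   # 21 sounds, 3 per place

print(round(shannon_entropy(beatboxing), 4))  # 2.3565 bits
print(round(shannon_entropy(english), 4))     # 2.5618 bits
print(round(shannon_entropy(maximal), 4))     # 2.8074 bits
```

These values match the summary in Table 20.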
It is not clear whether entropy is a useful metric of comparison for the other phonetic
dimensions. Take constriction degree as an example: the most straightforward comparison
would be between beatboxing’s three major constriction degrees (compressed, contacted,
and narrow; 1.33 bits) and a similar three-way system for English consonants—compressed
(stops, affricates, and nasals), narrow (fricatives), and approximants (1.42 bits). (This brings
the {SonL} and {TW} sounds back into the mix for a total of 24 beatboxing sounds.) This
comparison suggests that beatboxing sounds are slightly more predictable/less evenly
distributed along the constriction degree dimension. But the set of English consonants is
arguably more informative along the dimension of manner of articulation, not constriction
degree, and it makes less sense to compare the distribution of two different parameter spaces.
The same goes for voicing (which English consonants often use contrastively but beatboxing
sounds do not) and airstream mechanism (where beatboxing sounds are distributed along
6-7 values while English consonants have one).
The safest conclusion to draw is that this beatboxer’s beatboxing sounds are more
unevenly distributed along the place dimension than the set of English consonants are,
suggesting that beatboxing has some periodicity but that it manifests more strongly along
some dimensions than others.
Table 17. 22 beatboxing sounds/sound families, 37 minimal differences. Compressed on the
left, contacted on the right.
Glottalic egressive: B, PF, t, CTB, D
Glottalic ingressive: VLR, HC, u
Pulmonic egressive: Bx, LB, TB, hm
Pulmonic ingressive: ^Ph, ^LR, ^CR, ^K, IB
Lingual egressive: SS, CR
Lingual ingressive: LR, WDT, C
Table 18. 21 sounds with maximal dispersion, 20 minimal differences. Compressed on the
left, contacted on the right.
Glottalic egressive: compressed at bilabial and palatal; contacted at alveolar and laryngeal
Glottalic ingressive: compressed at labiodental and dorsal; contacted at retroflex
Pulmonic egressive: compressed at alveolar and laryngeal; contacted at bilabial and palatal
Pulmonic ingressive: compressed at retroflex; contacted at labiodental and dorsal
Lingual egressive: compressed at bilabial and palatal; contacted at alveolar and laryngeal
Lingual ingressive: compressed at labiodental and dorsal; contacted at retroflex
Table 19. 23 English consonants, 57 minimal differences ([l] conflated with [r]). Voiceless on
the left, voiced on the right.
Stop: p, b (labial); t, d (alveolar); tʃ, dʒ (postalveolar); k, g (velar)
Nasal: m (labial); n (alveolar); ŋ (velar)
Fricative: f, v (labial); θ, ð (dental); s, z (alveolar); ʃ, ʒ (postalveolar); h (glottal)
Approximant: r (alveolar); j (palatal); w (velar)
Table 20. Summary of the minimal sound pair and entropy (place) analyses for beatboxing, a
hypothetical maximally distributed system, and English consonants.
System # Sounds # Min. sound pairs Phonetic dimensions Place entropy (bits)
Beatboxing 22 37 7 place x 6 airstream x 2 constriction degree 2.3565
Maximally distributed 21 20 7 place x 6 airstream x 2 constriction degree 2.8074
English consonants 23 57 7 place x 4 manner x 2 voicing 2.5618
Figure 72. Histogram of 10,000 random minimal sound pair trials in a 6 x 7 x 2 matrix. The
probability of a random distribution of 22 sounds having 37 (purple) or more (darker gray)
minimal sound pairs is 17.69% (95% confidence interval: 17.08–18.30%).
Range: 20-53. Mean: 33.34. Median: 33. Standard deviation: 3.95. Skewness: 0.31. Kurtosis: 3.20.
Figure 73. Histogram of 10,000 random minimal sound pair trials in a 4 x 7 x 2 matrix. The
probability of a random distribution of 23 sounds having 57 (purple) or more (darker gray)
minimal sound pairs is 0.16% (95% confidence interval: 0.14–0.19%). (The colors are not
visible because the bars counting random distributions with 57 minimal sound pairs are
vanishingly small.)
Range: 36-69. Mean: 46. Median: 46. Standard deviation: 3.73. Skewness: 0.38. Kurtosis: 3.26.
4. Discussion
4.1 Summary of analyses
Two analyses were performed to investigate the organization of beatboxing sounds: a
frequency distribution analysis and a phonetic feature analysis. The sounds of this
beatboxer’s beat patterns form a Zipfian frequency distribution, similar to the Zipfian
distribution of words in language corpora. Both systems rely on a few high-frequency items
that support the rest of the utterance. In English, these are function words (e.g., “the” or “and”)
that can be deployed in a wide variety of utterances and are likely to be used multiple times
in a single utterance. Words with lower frequency, on the other hand, are more informative
because they are less predictable—words like “temperature” are typically used in a relatively
restricted set of conversational contexts. In beatboxing, the most frequent sounds are the
Kick Drum, Closed Hi-Hat, PF Snare, and Inward K Snare. These sounds form the backbone
of musical performances and can be used flexibly in many different beat patterns. Infrequent
sounds like the Inward Clickroll add variety to beat patterns but may not be suitable
aesthetically for all beat patterns or prolonged use.
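The qualitative picture here, a few workhorse items covering most of the data, falls out of any Zipfian distribution. The sketch below uses an idealized 1/rank distribution, not the beatboxer's actual counts:

```python
def zipf_shares(n_items, s=1.0):
    # Idealized Zipf: the frequency of the item at rank r is
    # proportional to 1 / r**s.
    weights = [1 / r ** s for r in range(1, n_items + 1)]
    total = sum(weights)
    return [w / total for w in weights]

shares = zipf_shares(40)
top4 = sum(shares[:4])  # share covered by the four most frequent items
```

For 40 items, the top four ranks already cover nearly half of all tokens, mirroring how the Kick Drum, Closed Hi-Hat, PF Snare, and Inward K Snare dominate the beat patterns.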
As for the phonetic frequency analysis, the primary aim was to determine whether or
not beatboxing sounds are composed combinatorially—and the answer seems to be that they
are. As described by Abler (1989), hallmarks of self-diversifying systems like speech and
chemical elements include sustained variation via combinations of elements (instead of
blending) and periodicity of those elements. This study does not provide evidence about
whether or how beatboxing sounds sustain variation, but it does provide evidence that
beatboxing sounds are composed of combinations of phonetic features. Beatboxing has
existed for at least two broadly defined generations (the old school and the new school) to
say nothing of the rapid rate at which beatboxing developed as an art form with cycles of
teaching and learning; since the beatboxer studied here is from the new school of
beatboxing, we can conclude that either the system has recently developed into a
combinatorial one or that the old school of beatboxing was also combinatorial and has
remained so over time. At the very least, no sounds in the inventory are a blend (i.e., an
average) of other sounds; on the contrary, sounds like the D Kick Roll and Inward Clickroll
with Liproll demonstrate that new sounds can be created by non-destructively combining
two existing sounds. That is, the components involved in the sounds separately are still
observable when the sounds are combined.
Section 3.2.5 showed that while beatboxing sounds are not organized with maximal
dispersion, they are also not nearly as periodic as the set of English consonants. In some
sense, the periodicity of the system diminishes as lower frequency sounds are added: the
most frequent sounds are all compressed sounds arranged neatly along major places of
articulation, and all but one (or two, if you count the unforced Kick Drum) are glottalic
egressive; the pulmonic ingressive outlier, the Inward K Snare, only deviates from the others
because it has a crucial homeostatic role to play. Although later sounds do tend to pattern into either a bilabial or alveolar constrictor and a compressed or contacted constriction degree, the initial phonetic dimensions are nevertheless broadened and more dimensions are added without filling all the available phonetic space.
One reason for this may be that beatboxers do not learn beatboxing sounds like they
learn speech sounds. Speech is ubiquitous in hearing culture; when a child learns one or
more languages, they have an abundance of examples to learn from. Beatboxing is not
ubiquitous, so someone trying to learn beatboxing must usually actively seek out new
vocabulary items to add to their beatboxing inventory; and since it seems many beatboxers
do not start learning to beatbox until at least adolescence, the process of learning even a
single sound may be relatively slow. For a beatboxer who learns this way, their sound
inventory is likely to be less periodic because there is no overt incentive to learn
minimal sound pairs. On the contrary, in the interest of broadening their beatboxing sound
inventory a beatboxer may be more motivated to learn sounds less like the others they
currently know.
As previewed at the ends of sections 3.2.3 and 3.2.4, a major shortcoming of this
periodicity analysis is the reliance on a fixed table structure. Sounds like the Water Drop
(Tongue), Water Drop (Air), Sonic Laser, D Kick Roll, Inward Clickroll with Liproll, Sweep
Technique, Sega SFX, and Trumpet use multiple constrictions that are relatively common
among the sounds but do not manifest in a tabular periodicity measurement. To take the
Water Drop (Tongue) as an example: it uses both labial and coronal constrictors with a
lingual ingressive (tongue body closure and retraction) airstream. Placing it in only the
coronal constrictor column causes the analysis to under-count the labial articulation; but
placing the sound in the labial column too would inflate the number of sounds that use
lingual ingressive airstream. Rather than looking for periodicity in whole sounds, it would be
better in the future to look for periodicity among individual vocal constrictions. Chapter 4:
Theory discusses this issue more and the possibility of treating these combinatorial
constrictions as cognitive gestures.
4.2 Implications
4.2.1 Contrastiveness in beatboxing
The notion that speech sounds have a relationship with each other—and are in fact defined
by this relationship—is a major insight of pre-generative phonology. Sapir (1925) for example
emphasized that speech sounds (unlike non-speech sounds) form a well-defined set within
which each speech sound has a “psychological aloofness” (1925:39) from the others, creating
relational gaps that encode linguistic information through contrast. Many phonological
theories assume that the fundamental informational units of speech are aligned to specific
phonetic dimensions and combine to make a larger unit (a segment). We have seen that
beatboxing sounds have meaning and that there is even a Zipfian organization to the use of
beatboxing sounds which implies that they have word-like meanings—that is, their meanings
are directly accessible by the speaker or beatboxer, as opposed to the featural or segmental
information of speech sounds which speakers generally do not have awareness of. Since
beatboxing sounds are combinatorial, does that make the individual phonetic dimensions
contrastive? Cognitive?
Beatboxing sounds clearly do not encode the same literal information as speech
sounds because beatboxing cannot be interpreted as speech. But 37 minimal sound pairs
were identified in a reductive three-dimensional framework of 22 beatboxing sounds, and the
less reductive system of over 40 sounds includes minimal differences in parameters like
voicing, double articulations, and double airstreams. The analysis in section 3.2.5.1 may not
have found evidence for robust periodicity, but it did find that there are far more minimal
sound pairs in this beatboxer’s inventory than if the sounds were carefully arranged to
minimize compositionality. Changing one phonetic property of a beatboxing sound may
change the meaning of that sound just as changing one phonetic property of a word may
change the meaning of the word (e.g., changing the nasality of the final sound in “ban”
[bæn] results in “bad” [bæd]). In this sense, yes: the sounds of beatboxing are in a
contrastive relationship with each other.
Because the sounds of this beatboxer are not arranged very periodically, the contrasts
are not as neatly arranged as they are in speech. But even in speech contrast is a gradient
rather than a categorical phenomenon (Hockett, 1955): sounds may encode contrasts to
different degrees depending on the phonetic dimensions involved (e.g., the laterality of [l] in
English applies to only that one sound) or their role in larger constructions (e.g., [ŋ] only
occurs word-finally in English and so is only contrastive word-finally whereas [n] is
contrastive word-initially and word-finally). Beatboxing sounds can contrast with each other
even if the contrasting system is not as dimensionally-efficient as a language’s contrastive
sound system.
Less clear is whether the differences between beatboxing sounds are also cognitive
differences. The answer depends in part on whether beatboxing sounds have phonological
patterning that is predictable based on certain phonetic dimensions. For example, velum
lowering is generally considered a cognitive gesture for nasality because nasality is active in
phonological behavior (e.g., spreading in phonological harmony); the velum raising that
makes oral sounds possible, on the other hand, is often considered inert because it does not
appear to play a role in phonological behavior. Whether any of the combinatorial dimensions
of beatboxing sounds are cognitive is taken up in detail in Chapter 6: Harmony.
4.2.2 Domain-general explanation for similar phonetic dimensions
Despite their phonetic similarities, beatboxing is not an offshoot of or parasite on phonology
(cf. the vocal art form scatting, which does draw on phonological well-formedness conditions
for the production of non-linguistic music; Shaw, 2008). For one thing, the lack of vowels
precludes the possibility that the near-universal CV syllable could exist in beatboxing. For
another, if beatboxing sounds were composed of linguistic phonological units then there
would be no pulmonic ingressive or lingual egressive beatboxing sounds because those do
not exist in language either (Eklund, 2008; cf. Hale & Nash, 1997 for lingual egressive sounds
in Damin).
Even so, we have seen conspicuous overlap between the combinatorial phonetic
dimensions leveraged by speech and beatboxing—shared constriction locations and
constriction degrees, some use of voicing and laterality, and overlapping airstream
mechanisms. This may be easily explained by domain-general approaches to speech (and
beatboxing) cognition. For example, the Quantal Theory (Stevens, 1989; Stevens & Keyser,
2010) deduces common phonological features by searching for regions in the vocal tract that
afford stable relationships between articulation and acoustics; the apparent universality of
features in speech is thus explained as arising from humans sharing the same vocal tract
physiology. But the relationship between articulation and acoustics in the vocal tract is not
special to speech—it is simply a property of the human vocal instrument, and so could just as
easily apply to beatboxing. The prediction would be that beatboxing and speech would share
many of the same phonetic features, which is indeed what we found here. Auditory theories
of speech could likewise apply to beatboxing audition, though to my knowledge there is no
work on beatboxing perception.
Chapter 4: Theory offers an explicit gesture-based approach to phonology and
beatboxing which capitalizes on the domain-general properties the systems share. That
chapter also includes a brief discussion of how a gestural description might encode
beatboxing contrast more effectively than the procrustean tables of sounds used here. Since
speech and beatboxing units are informationally unrelated to each other, purely
domain-specific theories of phonology cannot offer any explanation for why beatboxing and
speech might have similar structural units.
CHAPTER 4: THEORY
This chapter introduces a theoretical framework under which speech and beatboxing
phonological units are formally linked. Specifically, in the context of the task-dynamics
framework of skilled motor control, speech and beatboxing are argued to have atomic units
that share the same graph (that is, the same fundamental architecture) but may differ
parametrically in task-driven ways. Under the hypothesis from Articulatory Phonology that
information-bearing action units are the fundamental cognitive (phonological) units of
language, the graph-level link between speech and beatboxing actions becomes a cognitive
relationship. This cognitive link permits the formation of hypotheses about similarities and
differences between beatboxing and speech actions.
1. Introduction
At present, there is no theory of the cognitive structure of beatboxing or its fundamental
(motor) units. Therefore, there is no theoretically-motivated basis for drawing comparisons
between the atoms of speech and beatboxing or their organization. This chapter aims to
sketch such a theory of beatboxing fundamental units and their organization that can
provide a way of formally relating units in speech and beatboxing.
Dynamical systems are here used as the basis for understanding beatboxing units and
organization. The framework of task dynamics (Saltzman & Munhall, 1989) is commonly
used in Articulatory Phonology (Browman & Goldstein, 1986, 1989) to model the
coordination of a set of articulators in achieving the motor tasks (gestures) into which
speech can be decomposed. These task-based gestures are hypothesized to be isomorphic
with the fundamental cognitive units of speech. The coordination of the multiple units
composing speech is in turn modeled by coupling the activation dynamics of these units
(Nam & Saltzman, 2003; Goldstein et al., 2009; Nam et al., 2009). But task dynamics and the
coupling model are not speech-specific; they are inspired by nonlinguistic behaviors and can
be used to model any skilled motor task. Section 2 introduces concepts from dynamical
systems that will be the foundation of the link between speech and beatboxing. Section 3
argues that beatboxing sounds may be composed of gestures, and section 4 illustrates the
specific hypothesis that the fundamental units of beatboxing and speech share the same
domain-general part of the equations of task dynamics (the graph level). This establishes a
formal link between the cognitive units of speech and beatboxing that can serve as the basis
for comparison and hypothesis testing.
2. Dynamical systems and their role in speech
Articulatory Phonology hypothesizes that the fundamental units of phonology are action
units called “gestures” (Browman & Goldstein, 1986, 1989). Unlike symbolic features which
make no reference to time and only reference the physical vocal tract abstractly (if at all),
gestures as phonological action units vary in space and over time according to an invariant
differential equation (Saltzman & Munhall, 1989) that predicts directly observable
consequences in the vocal tract. While a gesture is active, it exerts control over a vocal tract
task variable (e.g., lip aperture) through coordinated activity in a set of articulators, in order
to accomplish some phonological task (e.g., a complete labial closure for the production of a
labial stop) as specified by the parameters of its differential equation. Phonological
phenomena that are stipulated through computational processes in other models emerge in
Articulatory Phonology from the coordinated overlap in time of gestures in an utterance.
Section 2.1 describes dynamical systems in terms of state, parameter, and graph levels.
Section 2.2 explains different point attractor dynamical systems and the usefulness of point
attractors as phonological units in speech.
2.1 State, parameter, and graph levels
The dynamical systems used to model phonological units and their organization can be
characterized with three levels: the state level, the parameter level, and the graph level
(Farmer, 1990; Saltzman & Munhall, 1992; see Saltzman et al., 2006 for a more thorough
introduction). Consider the dynamical system in Equation 1, which characterizes the movement of a damped mass-spring and is commonly used as the basic equation for gestures in Articulatory Phonology (Saltzman & Munhall, 1989):
Equation 1. ẍ = −bẋ − k(x − x₀)
State level. In Equation 1, the variables x, ẋ, and ẍ encode the instantaneous values of the state variable(s) of the system: x represents its position, ẋ its velocity, and ẍ its acceleration. The state variables generally are the vocal
tract task variables referred to above, such as the distance between the tongue body and the
palate or pharynx (tongue body constriction degree) or the distance between the upper and
lower lip (lip aperture). The values of those state variables change continuously as vocal tract
articulators move to achieve the goal of the system.
Parameter level. The task goal of the system (x₀) is defined at the parameter level: it does not change while this gesture is active. Other parameters in this equation are b (a damping coefficient) and k (which determines the stiffness of the system—that is, how fast the system moves toward its goal). Each phonological gesture is associated with its own distinct parameters. For example, the lip aperture gestures for a voiceless bilabial stop [p] and a voiceless bilabial fricative [ɸ] are different primarily in their aperture goal x₀: the goal of the stop is lip compression (parameterized as a negative value for x₀), while the goal of the fricative is a light closure or slight space between the lips (parameterized as a value for x₀ near 0). (Parameter values change more slowly over time, such as when a person moves to a new community and adapts to a new variety of their language). Thus, Equation 1 states that a fixed relation defined by the phonological parameters holds among the physical state variables at every moment in time that the gesture is active. This fixed relationship defines a phonological unit.
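A gesture of this kind can be simulated by numerically integrating Equation 1. The parameter values below are illustrative, not taken from the dissertation; critical damping (b = 2√k) is the standard choice that lets the state settle on its target without oscillating:

```python
def simulate_gesture(x_start, x_target, k=100.0, dt=0.001, duration=1.0):
    # Semi-implicit Euler integration of Equation 1:
    #   x'' = -b*x' - k*(x - x0)
    b = 2.0 * k ** 0.5  # critical damping
    x, v = x_start, 0.0
    for _ in range(int(duration / dt)):
        acc = -b * v - k * (x - x_target)
        v += acc * dt
        x += v * dt
    return x

# A lip aperture gesture driven from 10 mm (open) toward 0 mm (closure):
final_aperture = simulate_gesture(10.0, 0.0)
```

By the end of the activation interval the state has converged on the target x₀, which is how a point attractor realizes a constriction goal.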
Graph level. The graph level is the architecture of the system. Part of the architecture is the relationship between states and parameters in an equation (Saltzman et al., 2006). For example, notice that the term for the spring restoring force k(x − x₀) in the mass-spring system above is subtracted from the damping term −bẋ; if it were added instead, that would be a change in the graph level of this dynamical system. The system’s graph architecture also includes the number and composition of the equations in a system (Saltzman & Munhall, 1992). With respect to speech, composition crucially includes the specification of which tract variables are active at any time.
Different graphs can result in qualitatively different behaviors. Changing the number
of equations in a system can create entirely different sounds. For example, the graph for an
oral labial stop [b] uses a lip aperture tract variable, but the graph for a nasal labial stop [m]
uses tract variables for both lip aperture and velum position. Alternatively, changing the
relationship between terms in an equation can affect how the same effector moves. Equation
2 shows the graph of a periodic attractor (Saltzman & Kelso, 1987); this type of dynamical
system describes the behavior of a repetitive action like rhythmic finger tapping or turning a
crank, which is qualitatively different from a point attractor system with a goal of a single
point in space. The graph for the periodic attractor in Equation 2 is modified from the
damped mass-spring system in Equation 1 by the addition of the term f(x, ẋ), which adds or
removes energy from the system to sustain the intended movement amplitude.
Equation 2. ẍ = f(x, ẋ) − bẋ − k(x − x₀)
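One concrete choice for the energy-regulating term f(x, ẋ) is a van der Pol-style escapement, a textbook periodic attractor; this is an illustration of the graph type, not necessarily the form used in task-dynamic implementations, and all constants below are illustrative.

```python
# Sketch of Equation 2 with a van der Pol-style escapement:
# f(x, v) = mu * (1 - x**2) * v pumps energy in when |x| < 1 and
# removes it when |x| > 1, sustaining a limit cycle.
# Constants are illustrative; linear damping b is folded into f here.

mu, b, k, x0 = 2.0, 0.0, 1.0, 0.0
x, v, dt = 0.1, 0.0, 0.001
history = []
for _ in range(100_000):                 # 100 s of simulated time
    a = mu * (1 - x**2) * v - b * v - k * (x - x0)
    x, v = x + v * dt, v + a * dt
    history.append(x)

late = history[50_000:]                  # discard the transient
print(round(max(late), 2))               # sustained amplitude, near 2
```

Unlike the point attractor, the state never settles: the escapement term keeps the oscillation going at a stable amplitude, which is the qualitative signature of a periodic attractor graph.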
Taken together, state, parameter, and graph levels characterize a dynamical system. In
Articulatory Phonology and task-dynamics, the mental units of speech and their organization
are dynamical systems, and so the state, parameter, and graph levels characterize a speaker’s
phonology. Table 21 summarizes the roles of state, parameter, and graph levels in gestures
(i.e., Equation 1).
Table 21. Non-exhaustive lists of state-, parameter-, and graph-level properties for dynamical
systems used in speech.

System type: Gesture
State level: position; velocity; acceleration; activation strength
Parameter level: target state; stiffness; strength of other movement forces (e.g., damping);
blending strength
Graph level: system topology (e.g., point attractor); tract variable selection; selection of
and relationship between parameter and state variables
2.2 Point attractors
Vocal tract movements in speech have certain characteristics that suggest what an
appropriate dynamical topology (graph) may be. Speech is produced by forming an
overlapping series of constrictions and releases in the vocal tract; each constriction affects
the acoustic properties of the vocal instrument in a specific, unique way such that
constrictions of different magnitudes and at different locations in the vocal tract create
distinctive acoustic signals. Each speech action can therefore be characterized as having a
relatively fixed spatial target for the location and degree of constriction. Moreover, speech
movements exhibit equifinality: they reach their targets regardless of the initial states of the
articulators creating the constriction or perturbations from external forces—as long as there
is enough time for the constriction to be completed and the same articulators are not being
used to try to achieve two incompatible constrictions simultaneously. Figure 74 demonstrates
how position and velocity change as a function of time during a spoken labial closure.
Figure 74. A lip closure time function for a spoken voiceless bilabial stop [p], taken from
real-time MRI data.
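The equifinality property described above can be checked with a small numerical sketch: a critically damped point attractor (Equation 1) reaches the same target from quite different starting positions. All constants here are illustrative, not fitted to the MRI data.

```python
# Equifinality sketch: a critically damped mass-spring point attractor
# converges on the same target x0 from different initial states,
# provided it stays active long enough. Constants are illustrative.

def settle(x, v=0.0, b=20.0, k=100.0, x0=0.0, dt=0.001, steps=5000):
    for _ in range(steps):
        a = -b * v - k * (x - x0)
        x, v = x + v * dt, v + a * dt
    return x

endpoints = [settle(x_init) for x_init in (5.0, 12.0, -3.0)]
print([round(e, 3) for e in endpoints])   # all endpoints near the target 0.0
```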
Point attractor dynamical systems generate precisely these qualities. Several different
differential equations can be used to model point attractor dynamics, and their goodness of
fit to the data can be assessed by comparing the model kinematics against the real-world
kinematics. For example, consider the first-order point attractor in Equation 3 in which is the current spatial state of the system, is the system velocity, is the system’s spatial ̇ 0
target, and is a constant that determines how quickly the system state changes. 0 < < 1
Regardless of the starting value of , the state always moves (asymptotically) toward —that 0
is, it is attracted to the target point . 0
Equation 3. ̇ =− ( − 0
)
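A short simulation (with illustrative values for k and x₀) makes the first-order system's kinematic signature explicit: its speed is greatest at the very first instant and only decreases from there.

```python
# Sketch of Equation 3: x_dot = -k * (x - x0).
# Speed is maximal at onset and decays monotonically, unlike real
# speech movements, which start from rest. Values are illustrative.

k, x0, dt = 0.5, 0.0, 0.01
x = 10.0
speeds = []
for _ in range(2000):
    v = -k * (x - x0)
    speeds.append(abs(v))
    x += v * dt

print(round(speeds[0], 2))                      # peak speed at the first step
print(speeds == sorted(speeds, reverse=True))   # monotonically decreasing
```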
Figure 75. Schematic example of a spring restoring force point attractor.
But comparing the spring restoring force (Figure 75) to an actual speech movement (Figure
74) reveals that the details of the velocity state variable for this first-order point attractor are
not a good fit for speech kinematics. Speech movements generally start with 0 velocity and
have increasing velocity until they reach peak velocity sometime in the middle of the
movement trajectory, but this first-order spring system in Equation 3 begins at maximum
velocity and has only decreasing velocity over time. A kinematic profile that starts at peak
velocity is therefore not an accurate portrayal of speech kinematics, which tend to start at 0 velocity.
A better choice for modeling the dynamics of speech atoms is the damped
mass-spring system from Equation 1. When critically damped (b = 2√k), the damped
mass-spring system acts as a point attractor: regardless of the initial starting state, the state
of the system will converge toward its goal x₀ and stay at that goal for as long as the system is
active. The position time series for a critically damped mass-spring system (Figure 76) results
in a somewhat better fit to the characteristic kinematic properties of speech movements
(Figure 74): velocity starts at 0, increases until it peaks, then gradually decreases again as x
approaches x₀. However, the observed speech movement exhibits a more symmetric velocity
profile, with peak velocity about halfway through the gesture; the time of peak velocity for
the mass-spring equation in Equation 1 (Figure 76) is much earlier, indicating that a
different equation may have a better fit.
Figure 76. Schematic example of a critically damped mass-spring system.
A third point attractor with a different graph is the damped mass-spring system with a “soft
spring” that has been suggested to more accurately model the kinematics of vocal movement
(Sorensen & Gafos, 2016). This equation (Equation 4) has the same pieces as Equation 1,
plus a cubic term d(x − x₀)³ that weakens the spring restoring force when the current state
is relatively far from the target state. In other words, the system won’t move as quickly
toward its target state at the beginning of its trajectory.
Equation 4. ẍ = −bẋ − k(x − x₀) + d(x − x₀)³
Figure 77. Schematic example of a critically damped mass-spring system with a soft spring.
One of the most noticeable differences between the damped mass-spring systems with
(Figure 77) and without (Figure 76) the soft spring is the difference in the relative timing of
peak velocity. Both systems start out with 0 velocity and gradually increase velocity until
velocity reaches its peak; however, the system with the soft spring reaches its peak velocity
later than the system without the soft spring, which Sorensen and Gafos (2016) show is a
better fit to speech data (compare for example against the speech labial closure in Figure 74).
The critically-damped mass-spring system without the soft spring can result in this
kinematic profile if gestures have ramped activation—that is, rather than treating gestures as
if they turn on and off like a light switch, increasing a gesture’s control over the vocal tract
gradually like a dimmer switch also delays the time to peak velocity (Kröger et al., 1995; Byrd
& Saltzman, 1998, 2003). Sorensen & Gafos (2016) argue that the dynamical system with the
soft spring term should be preferred to the simpler damped mass-spring system with ramped
activation to preserve a gesture’s intrinsic timing (see Fowler, 1980).
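The delayed velocity peak can be verified with a small simulation comparing Equations 1 and 4. This is a sketch of the comparison, not Sorensen and Gafos's (2016) implementation; b, k, and d are illustrative, with d chosen small enough that the cubic term never overwhelms the linear restoring force.

```python
# Compare time-to-peak-velocity for the critically damped system
# (Equation 1, d = 0) and the soft-spring variant (Equation 4, d > 0).
# Constants are illustrative only.

def peak_velocity_time(d, b=20.0, k=100.0, x0=0.0, x=1.0, dt=0.0001):
    v, t, best_t, best_speed = 0.0, 0.0, 0.0, 0.0
    for _ in range(100_000):                      # 10 s of simulated time
        a = -b * v - k * (x - x0) + d * (x - x0)**3
        x, v, t = x + v * dt, v + a * dt, t + dt
        if abs(v) > best_speed:
            best_speed, best_t = abs(v), t
    return best_t

plain = peak_velocity_time(d=0.0)
soft = peak_velocity_time(d=60.0)
print(soft > plain)   # the soft spring reaches peak velocity later
```

Because the cubic term weakens the restoring force early in the movement, the soft-spring system accelerates more gently at first and its velocity peak lands later in the trajectory, matching the qualitative claim above.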
Details of equation architecture (graph) aside, point attractor dynamical systems are
useful as speech units for many reasons. A variety of speech phenomena can be accounted
for by specifying the temporal interval during which a point attractor exerts control in the
vocal tract. For example, if a gesture’s dynamical system does not last long enough for the
gesture to reach its goal, gestural undershoot might lead to phonological alternation, e.g.,
between a stop and a flap or fricative (Parrell & Narayanan, 2018). Alternatively, if a gesture
remains active after reaching its target state, the gesture is prolonged; this is one account of
the articulation of geminates (Gafos & Goldstein, 2011). The temporal coordination of two or
more gestures can result in spatio-temporal overlap that may account for certain types of
phonological contrasts and alternations (Browman & Goldstein, 1992). Some types of
phonological assimilation, harmony, and epenthesis can be described as resulting from the
temporal overlap of gestures (Browman & Goldstein, 1992). And when two gestures are active
at the same time over the same vocal tract variable(s), those gestures blend together,
resulting in coarticulation.
All in all, point attractor topologies are advantageous as models of gestures for a
variety of reasons. Section 4 argues that the point attractor systems that gestures share are
advantageous for beatboxing as well.
3. Beatboxing units as gestures
Beatboxers have mental representations of beatboxing sounds. Beatboxers are highly aware
of the beatboxing sounds in their inventory and the differences between many of those
sounds. They give names to most of the sounds—though the names may differ from language
to language or beatboxer to beatboxer, as they do for the “power kick” (Paroni et al., 2021)
and “kick drum”, which are both names for a bilabial ejective that fulfills a particular musical
role. Likewise, a crucial component of beatboxing pedagogy is associating a name and a
sound; beatboxers who want to learn a sound they heard someone else perform first need to
know the name of the sound so they can ask for instruction. The naming and identification
of beatboxing sounds suggest that skilled beatboxers can distinguish a wide variety of vocal
articulations within the context of beatboxing. (Unsupervised classifier models can also
reliably group beatboxing sounds into clusters based on the acoustic signal via MFCCs;
Paroni et al., 2021). As distinct objects, each associated with some meaning and available to
abstract thought, beatboxing sounds can be thought of as segment-sized mental
representations.
Chapter 3: Sounds motivates looking at the phonological-level patterning of
beatboxing in terms of gestures rather than more traditional phonetic dimensions. Gestures
in speech are cognitive representations that fill the role of the most elementary abstract,
compositional units of phonological information. Information implies phonological contrast:
changing, removing, adding, or re-timing a gesture can often change the meaning of a word.
Chapter 3: Sounds argued that beatboxing sounds are compositional in the same
way that speech sounds are—though without the same degree of periodicity/feature
economy—and that there are a not insubstantial number of minimal sound pairs for which
changing one articulator task, one gesture, can change the meaning of the sound.
Superficially, at least, this is similar to the use of speech gestures.
Gestures are particularly advantageous for describing beatboxing patterns because a
gestural description of a sound can incorporate time and complex multi-articulator
combinations that are difficult or impossible to manage in a symbolic, IPA chart-type
phonetic notation (though foregoing symbols and charts sacrifices some brevity). But the
actual number of tract variable gestures used to create the different beatboxing sounds is
relatively small. Sounds like the D Kick Roll and Open Hi-Hat do not fit into a table simply
because they use too many tract variables to fit conveniently in any one cell of a table.
Others like the Water Drop (Air) re-use the same tract variable multiple times (in this case,
the tongue body constriction location). A gestural approach to describing beatboxing can
account for these types of cases in a way that looking at the sounds in a table cannot. If the
number of tract variables used really is small, then a gestural perspective might even show
that the periodicity of beatboxing sounds is comparable to the periodicity of speech sounds.
Beyond the descriptive convenience, evidence for gestures as primitives of
information in the cognitive system underlying beatboxing would require a more complete
inventory of contrastive beatboxing gestures as well as evidence that these gestures play a
role in defining natural classes for characterizing beatboxing alternations (on analogy to
phonological alternations in speech). The role of gestures in characterizing beatboxing
natural classes is investigated in Chapter 5: Alternations and Chapter 6: Harmony. A
complete inventory is beyond the scope of this work, and the task might in principle be
impossible—beatboxing sounds appear to be an open set, and the gestures themselves might
come and go as a beatboxer’s sound inventory changes.
That said, it is possible to hazard some educated guesses about what a set of
beatboxing gestures might include. Frequently-used constrictors like the lips and the tongue
tip are likely to be associated with beatboxing gestures at compressed and contacted
constriction degrees (that can be encoded in task dynamics as constriction degree goals of a
negative value and 0, respectively). Since beatboxing involves a wider range of mechanisms
for controlling pressures in the vocal tract than does speech, contrasting gestures to initiate
pressure changes in the vocal tract would seem to be required: pulmonic, laryngeal, and
lingual tasks, along with contrasts in the goal value of such gestures (increased or decreased
pressure). Voicing seems to make a difference in some sounds too, and at the very least is the
only clearly identifiable property of humming and the Inward Bass.
Suspiciously, almost all of these hypothetical gestures use the same vocal tract
variables as speech, with ultimately similar pressure-control aims, though not necessarily
with speechlike targets or configurations. The pulmonic task (for increased vs. decreased
pressure) is the only one of these not attested to be used contrastively in speech, although
pulmonic ingressive airflow is used somewhat commonly around the world as a sort of
pragmatic contrast to non-ingressive speech (Eklund, 2008). But the point is not that speech
and beatboxing are built on the same set of gestures—the contrastive use of pulmonic
airstreams as well as the use of lateral labial constrictions (not reported in this dissertation
because there was only a midsagittal view) rules out the possibility that beatboxing is a
reconfiguration of speech units. Rather, the point is that a gesture-based account may work
well for both speech and beatboxing. In this sense, just as Articulatory Phonology is based
around the hypothesis that gestures are the fundamental units of speech production and
perception, a similar hypothesis for beatboxing phonology is that beatboxing actions
controlled by task dynamic differential equations are the fundamental units of beatboxing
production and perception. The next sections are dedicated to developing an understanding
of how gestures are recruited differently for speech and beatboxing while simultaneously
linking the two domains through the potential of the vocal instrument.
4. A dynamical link between speech and beatboxing
The state, parameter, and graph levels of the differential equations in task dynamics provide
an explicit way to formally compare and link speech and beatboxing sounds. Beatboxing and
speech actions use the same vocal tract articulators to create sound, which means they are
constrained by the same physical limits of the vocal apparatus. In task dynamics, these
limitations constrain the graph dynamics and parameter space of actions available to a vocal
system. Functional speech-specific and beatboxing-specific constraints can further delineate
and refine each domain’s graph dynamics and parameter space; even so, as this section
argues, the actions in both domains appear to use the same point attractor topologies, tract
variables, and coordination, all of which indicate that speech and beatboxing share the same
graph.
4.1 Hypothesis: speech and beatboxing share the same graph
These graph properties appear to be shared by speech and beatboxing: the individual actions
are point attractors (section 4.1.1) operating mostly over the same tract variables as speech
gestures (section 4.1.2) with similar timing relationships (section 4.1.3). In addition, coupled
oscillator models of prosodic structure have been used to account for both speech and
musical timing, making them a good fit for beatboxing as well (section 4.1.4).
4.1.1 Point attractor topology
Point attractors have been used as models of action units for behaviors other than speech,
even behaviors without the kinds of phonological patterns that speech has (Shadmehr, 1998;
Flash & Sejnowski, 2001). Goldstein et al. (2006:218) “view the control of these units of
action in speech to be no different from that involved in controlling skilled movements
generally, e.g., reaching, grasping, kicking, pointing, etc.”
Beatboxing and speech sounds leverage the same vocal tract physics: wider
constrictions (as in sonorants) alter the acoustic resonances of the vocal tract, and narrower
constrictions or closures obstruct the flow of air to create changes in mean flowrate and
intraoral pressure and generate acoustic sources. Moreover, beatboxing and speech both have
discrete sound categories, like a labial stop [p] in speech and a labial stop Kick Drum {B} in
beatboxing. Creating discrete sounds requires vocal constrictions with specific targeted
constriction locations and degrees (Browman & Goldstein, 1989). As discussed in section 2.2,
point attractors are ideal for such actions.
The kinematics of many beatboxing movements bear a strong resemblance to the
kinematics created by speech point attractor gestures: they start slow, increase velocity until a
peak somewhere near the middle of the movement, and slow down again as the target
constriction is attained (Figure 78). This suggests that beatboxing actions share both
qualitative and quantitative point attractor graphs with speech gestures.
4.1.2 Tract variables
Part of the graph level is the specification of which tract variables are active at any time.
Beatboxing and speech operate over the same vocal tract organs and therefore have access to
(and are limited to) the same vocal tract variables. In Chapter 3: Sounds it was established
that many beatboxing sounds resemble speech sounds in constriction degree and location.
The specific tract variables used by each behavior may not completely overlap. Beatboxers,
for example, sometimes use lateral bilabial constrictions, but there are no speech tract
variables for controlling laterality—due partly to the difficulty of acquiring relevant lateral
variables for controlling laterality—due partly to the difficulty of acquiring relevant lateral
data to know what a lateral task variable might be, but also to the fact that laterals in speech
are always coronal and can be modeled by adding an appropriate dorsal gesture to a coronal
gesture. Such a strategy would not work for modeling lateral labials. Overall, though, speech
and beatboxing movements are more similar than they are different.
Figure 78. Position and velocity time series for labial closures for a beatboxing Kick Drum {B}
(left) and a speech voiceless bilabial stop [p] (right). Movements were produced by the same
individual, tracked using the same rectangular region of interest that encompassed both the
upper and lower lips. Average pixel intensity time series in the region of interest were
smoothed using locally weighted linear regression (kernel = 0.9; Proctor et al., 2011; Blaylock,
2021), and velocity was calculated using the central difference theorem as implemented in
the DelimitGest function (Tiede, 2010). Both movements were extracted from a longer,
connected utterance (a beatbox pattern with the Kick Drum and the phrase “good pants”
from a sentence produced by the same beatboxer). See Chapter 2: Method for details of data
acquisition.
The physics of sound manipulation in the vocal tract are the same for speech and
beatboxing: different constriction magnitudes and locations along the vocal tract result in
different acoustics. Some regions of the vocal tract are more stable than others, meaning that
variation of constriction location within some regions results in little acoustic change; these
stable regions are argued to shape the set of distinctive contrasts in a language so that
coarticulation does not dramatically alter the acoustic signal and lead to unwanted percepts
(Stevens, 1989; Stevens & Keyser, 2010). Though beatboxing does not have linguistically
contrastive features to convey, parity must still be achieved between an expert beatboxer and
a novice beatboxer in order for learning to occur. Beatboxers exploit the same vocal physics
to maximize transmission of the beatboxing signal, resulting in beatboxers leveraging the
same vocal tract variables.
4.1.3 Intergestural coupling relationships
The relative timing of two speech gestures can make a meaningful difference within a word.
For example, the timing of a velum lowering gesture makes all the difference between “mad”
[mæd] (velum timed to lower at the beginning of the word) and “ban” [bæn] (velum timed to
lower closer to the end of the word). Timing between gestures can be contrastive even
within a single segment, like the relative timing of the oral closure gesture and laryngeal
lowering that distinguishes voiced plosives and voiced implosives (Oh, 2021).
A common model of intergestural timing in Articulatory Phonology is a system of
coupled periodic timing oscillators or “clocks” (Nam & Saltzman, 2003; Goldstein et al.,
2009; Nam et al., 2009). While a clock is running, its state (the phase of the clock)
continually changes just like the hands on a clock move around a circle. These clocks are
responsible for triggering the activation of their associated gesture(s) in time; the triggering
occurs when a clock’s state is equal to a particular activation phase. Thinking back to the
graph level, coupling two oscillators means that the dynamical equation for each oscillator
includes a term corresponding to the state (phase) of the oscillator(s) to which it is coupled;
thus, the phase of each oscillator at any time depends on the phases of the other oscillators
to which it is coupled. This inter-clock dependency is a major advantage of the oscillator
model of intergestural timing: the phases of coupled clocks settle into different modes like
in-phase (0 degree difference in phase) or anti-phase (180 degree difference in phase) that
result in gestures being triggered synchronously or sequentially, respectively. The state,
parameter, and graph components of the coupled oscillator model are given in Table 22.
Table 22. Non-exhaustive lists of state-, parameter-, and graph-level properties for coupled
timing oscillators (periodic attractors).

System type: Coupled oscillators
State level: phase
Parameter level: activation/deactivation phase; oscillator frequency; coupling strength &
direction
Graph level: coupling type (in-phase, anti-phase); number of tract variables; intergestural
coupling; selection of and relationship between parameter and state variables
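The settling of coupled clocks into stable modes can be sketched with two phase oscillators whose rate of change depends on the other's phase (a Kuramoto-style sketch; the coupling term and all constants are illustrative, not the specific coupling function used in the models cited above).

```python
import math

# Two coupled timing "clocks": each phase advances at frequency omega
# plus a coupling term that depends on the other clock's phase.
# With a > 0 the relative phase settles to 0 (in-phase, synchronous
# triggering); with a < 0 it settles to pi (anti-phase, sequential).
# Values are illustrative.

def settle_relative_phase(a, omega=2 * math.pi, dt=0.001, steps=20_000):
    th1, th2 = 0.0, 2.0              # arbitrary initial phase offset
    for _ in range(steps):
        th1 += (omega + a * math.sin(th2 - th1)) * dt
        th2 += (omega + a * math.sin(th1 - th2)) * dt
    return (th1 - th2) % (2 * math.pi)

in_phase = settle_relative_phase(a=5.0)     # near 0 (mod 2*pi): in-phase
anti_phase = settle_relative_phase(a=-5.0)  # near pi: anti-phase
print(round(in_phase, 2), round(anti_phase, 2))
```

The same graph (mutually coupled clocks) yields synchronous or sequential triggering depending only on the coupling parameter, which is the sense in which syllable-like timing modes fall out of the oscillator architecture.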
In-phase coupling between a consonant gesture and a vowel gesture results in a CV syllable
or mora; it is also used intrasegmentally for consonants with more than one gesture, for
example a voiceless stop with both an oral constriction gesture and a glottal opening gesture.
Anti-phase coupling between a vowel and consonant typically results in a sequential
nucleus-coda syllable structure. Anti-phase coupling may also exist in some languages
between consonants in an onset cluster, with all the consonants coupled in-phase to the
vowel but anti-phase to each other, resulting in what has been described as the C-Center
effect (Browman & Goldstein, 1988).
The specific timing relations needed to model beatboxing are unclear at the moment,
and it is not clear if beatboxing needs a coupled oscillator model of timing per se. On the one
hand, beatboxing does not usually feature wide vowel-like constrictions, so there does not
appear to be anything quite like a CV syllable in beatboxing, much less something like a
syllable coda; in general, beatboxing sounds are coordinated with the alternating rhythmic
beats (section 4.1.4), so intergestural coupling relations might usually be relevant only among
the component gestures of a given beatboxing sound. On the other hand, there is clear
evidence for intra-segmental timing relationships that may benefit from a coupled oscillator
approach. Some of the most common beatboxing sounds are ejectives, and these require
careful coordination between the release of an oral constriction and the laryngeal
closing/raising action that increases intraoral pressure (Oh, 2021); the same is likely true for
lingual and pulmonic beatboxing sounds. In addition, some beat patterns feature two
beatboxing sounds coordinated to the same metrical beat, resulting in sound clusters like a
Kick Drum followed closely by some kind of trill. This kind of relationship between sounds
and the meter suggests that the beatboxing sounds in these clusters may be coupled with
each other in some way.
4.1.4 Coupled timing oscillators
Hierarchical prosodic structure in speech has also been modeled using coupled oscillators,
including syllable- and foot-level oscillators (Cummins & Port, 1998; Tilsen, 2009; Saltzman
et al., 2008; O’Dell & Nieminen, 2009). The cyclical nature of oscillators matches the ebb and
flow of prominence in some languages, including stress languages that alternate (more or
less regularly) stressed and unstressed syllables.
In Chapter 2: Method, it was shown that the musical meter in styles related to
beatboxing has highly regular, hierarchically nested strong-weak alternations. Coupled
oscillators are well-suited for modeling these types of rhythmic alternations in music (e.g.,
Large & Kolen, 1994): each oscillator contributes to alternations at one level of the hierarchy,
and the oscillators to which it is coupled have either half its frequency (hierarchically
“above”, with slower alternations) or double its frequency (hierarchically “below”, with
faster alternations), yielding a stable 1:2 frequency coupling relationship between each level.
Other rhythmic structures like triplets can be modeled by temporarily changing oscillator
frequencies (a parameter level change).
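The nested strong-weak alternation produced by a stack of 1:2 frequency-locked oscillators can be sketched discretely: each level contributes a pulse on every other pulse of the level below, so a position's metrical strength is the number of levels whose pulses coincide there. The function name and encoding are illustrative.

```python
# Sketch of hierarchically nested 1:2 meter: level 0 pulses on every
# position, level 1 on every second position, level 2 on every fourth,
# and so on. A position's strength counts the coinciding levels.

def beat_strengths(n_positions, n_levels):
    return [sum(1 for level in range(n_levels) if i % (2 ** level) == 0)
            for i in range(n_positions)]

print(beat_strengths(8, 4))   # [4, 1, 2, 1, 3, 1, 2, 1]
```

The output reproduces the familiar strong-weak grid of a 4/4 bar: the downbeat is strongest, beat 3 next, beats 2 and 4 weaker, and the offbeats weakest.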
4.2 Tuning the graph with domain-specific parameters
Speech and beatboxing share the same set of vocal organs, each of which has its own
mechanical potential and limitations for any movement. Tasks are constrained by the
physical abilities of the effectors that implement them; in the task dynamics model, this is
represented as a constraint on the range of values of each dynamical parameter that fits into
a given graph. Therefore, speech and beatboxing share both their graph structures and a
physically-constrained parameter space.
Within that physically-constrained parameter space, the difference between two
speech gestures that use the same tract variable is encoded by different parameter values.
Different constriction targets (represented as x₀ in Equation 1) can lead to different manners
of articulation, with a narrow constriction target for a fricative, a lightly closed constriction
target for a trill, or a compression target for a stop. For a given sound, the selection of a tract
variable (or tract variables) and the associated learned parameter values are part of a
person’s knowledge about their language (and may differ slightly from person to person for a
given language).
Gestures can be viewed as available “pre-linguistically” (Browman & Goldstein, 1989):
the action units that become gestures are not inherently linguistic, but are harnessed by the
language-user to be used as phonological units. This is accomplished by tuning the
parameters of a gesture to generate a movement pattern appropriate for the information
contrasts relevant to language being spoken. The same pre-linguistic actions can be
harnessed for non-linguistic purposes, including beatboxing; they may simply require
different parameter tuning associated with their domain-specific function.
This tuning of speech gestures to functionally-determined values is spelled out within
the task-dynamics framework (Saltzman & Munhall, 1989) as the specification of values at
the parameter level of a dynamical equation described in section 2.1. When a gesture is
implemented, the task-specific parameter values for that gesture are applied to the system
graph. This application is depicted in Figure 79. The point attractor graph space on the left
represents the untuned dynamical system that is (by hypothesis) the foundational structure
of every gesture (Saltzman & Munhall, 1989). Learned parameters associated with a
particular speech task are summoned from the phonological lexicon to tune the dynamical
system, like the intention to form a labial constriction for a /b/, represented in the figure as
an unfilled (dark) circle. The result of this tuning is a speech action—a phonological gesture,
represented in the figure as a filled (light) circle.
Figure 79. Parameter values tuned for a specific speech unit are applied to a point attractor
graph, resulting in a gesture.
Figure 80. Speech-specific and beatboxing-specific parameters can be applied separately to
the same point attractor graph, resulting in either a speech action (a gesture) or a beatboxing
action. Applying appropriately tuned parameters to a graph specializes the action for one
domain or another.
As argued above, speech and beatboxing actions can both be described as point attractors
operating over a shared set of tract variables, though the use of those tract variables
sometimes differs between the two domains. With respect to parameter tuning in
task-dynamics, this simply means that beatboxing actions use the same point attractor graph
as speech but with beatboxing-specific parameter values (Figure 80). This is one way of
establishing a formal link between beatboxing and speech in task dynamics: the atomic
actions of each behavior share the same graph, but differ by domain-specific parameter
values.[2]
What determines the parameter values for speech sounds and beatboxing sounds?
The answer lies in the intention behind each behavior: beatboxing actions create musical
information, speech actions communicate lexico-semantic information, and the dynamical
parameters are tuned accordingly. For example, beatboxing and some languages both feature
bilabial ejectives in their system of sounds. A beatboxing bilabial ejective is a Kick Drum, and
has a particular aesthetic quality to convey, so its labial and laryngeal gestures may have
different targets, stiffnesses, and inter-gestural coordination compared to a speech bilabial
ejective which contributes to the communication of a linguistic message.
Beatboxing phonology—including its fundamental contrastive and cognitive units and
the interplay between those units—arises from the interaction between the physiological
constraints of vocal sound production and the broader tasks of beatboxing, just as the
fundamental contrastive and cognitive units of speech and the interplay between those units
arise from the interaction between the same constraints and the tasks of speech. Gestures are
a useful way of modeling this interaction in both domains because they encode both
domain-specific intention and domain-general abilities/constraints. The possible parameter
values for a given gesture are constrained both by the physical limitations of the system and
by domain-specific task requirements.

[2] As noted earlier, an alternative hypothesis is that beatboxing is “parasitic” on speech, recombining whole
speech gestures—including existing phonological parameterizations—into the set of beatboxing sounds. This
seems unlikely because the tract variables and target values used by speech and beatboxing do not fully overlap.
Beatboxing does not adopt the speech gestures used for making approximants and vowels. More to the point,
English-speaking beatboxers use lateral labial gestures, constrictions that make trills, and a variety of
non-pulmonic-egressive airstreams, none of which are attested in the phonology of English. Even if one were to
assume an innate, universal set of phonological elements for beatboxing to pilfer from, the lack of attestation of
phonologically contrastive pulmonic ingressive and lingual egressive units rules them out from the set of
universal features—since beatboxing has them, it must have gotten them from somewhere else besides speech.
For illumination by comparison: there are vocal music genres like scatting (Shaw, 2008) that do seem to be
parasitic on speech gestures and phonological patterns; these behaviors sound speechlike, and beatboxing does
not.
This approach to speech and beatboxing is in some sense a formalization of the
anthropophonic perspective of speech sound. The term anthropophonics originated with Jan
Baudouin de Courtenay as part of the distinction between the physical (anthropophonic)
and the psychological (psychophonic) properties of speech sounds. Catford (1977) defines
anthropophonics as a person’s total sound-producing potential, referring to all the vocal
sound possibilities that can be described (general phonetics) of which the whole set of
speech possibilities is only a subset (linguistic phonetics). Lindblom (1990) adopted the
anthropophonic perspective as part of the broader program of deducing the properties of
speech from non-speech phonetic principles, specifically with respect to the question of how
to define a possible sound of speech (cf. Ladefoged, 1989). Particularly as used in the vein of
Catford and Lindblom, anthropophonics is about taking domain-general vocal potential—all
of the possible vocal sound-making strategies and configurations—and understanding how
domain-specific tasks filter all that potential into a coherent system. The dynamical
formalization accomplishes this by encoding domain-general possibilities at the graph level
and domain-specific tasks in the control parameters.
5. Predictions of the shared-graph hypothesis
The argument so far is that speech and beatboxing are domain-specific tunings of a shared
graph. Moreover, by the hypothesis of Articulatory Phonology that the actions composing
speech are also the fundamental cognitive units of speech, the graph-level link between
speech and beatboxing is a domain-general cognitive link between speech and beatboxing
sounds. This is how similarities and differences between speech and beatboxing phonology
can be predicted: any phenomenon that could emerge due to the nature of the graph in one
domain is fair game for the other (but task-specific phenomena, including which units are
selected for production and the task-specific parameters of those units, are not). Likewise,
any hypothesis made about the speech graph may therefore manifest in the beatboxing graph
as well, and vice versa. For example, the Gestural Harmony Model (Smith, 2018)
hypothesizes two new graph elements: a persistence parameter that allows a gesture to have
no specified ending activation phase, and an inhibitive type intergestural coupling
relationship by which one gesture inhibits the activation of another. In doing so, the model
also makes predictions about the parameter space and coupling graph options that
beatboxing has access to. Beatboxing fulfills these predictions as
described in Chapter 6: Harmony.
The proposed graph-level link also introduces a new behavioral possibility: that
speech and beatboxing sounds may co-mingle and be coordinated as part of the same motor
plan. After all, no part of the framework outlined above precludes the simultaneous use of a
point attractor with speech parameters and a point attractor with beatboxing parameters.
People do not spontaneously or accidentally beatbox in the middle of a typical sentence, but
during vocal play speakers may for fun mix sounds that are otherwise unattested in their
language variety into their utterances; and beatboxers sometimes use words or phrases
as part of their music. But the clearest evidence for the existence of speech-and-beatboxing
behavior (and support for the graph-level link) is the art form known as beatrhyming, the
simultaneous production of speech (i.e., singing or rapping) and beatboxing by an individual.
Beatrhyming shows that humans can take full advantage of the flexibility of the motor
system to blend two otherwise distinct tasks into a brand new task. Beatrhyming is discussed
more thoroughly in Chapter 7: Beatrhyming.
There are alternatives to gestures as the fundamental beatboxing units. Paroni et al.
(2021) suggest the term boxeme be used to mean a distinct unit of beatboxing sound,
analogous to a phoneme. Boxemes are posited as the building blocks of beatboxing
performances; since beatboxers explicitly refer to these individual sounds in the composition
of a beat pattern, the notion seems to be that every sound that can be differentiated from
another sound (by name, acoustics, or articulation) is a boxeme candidate. Given the
evidence that beatboxing sounds are composites of smaller units, a phoneme-like boxeme
could be said to be composed of symbolic beatboxing features. (Paroni et al., 2021, do not
commit to either a symbolic or dynamical approach, and “boxeme” may simply be a useful,
theory-agnostic way to refer to a meaningful segment-sized beatboxing sound unit; for the
sake of argument, we assume that the clear connection to “phoneme” is meant to imply a
symbolic perspective.)
As mental representations for speech, gestures and phonemes are two very different
hypotheses for the encoding of abstract phonological information: phonemes are purely
domain-specific, abstract, symbolic representations composed of atomic phonological
features that are not deterministic with respect to the physical manifestation of a sound.
Gestures on the other hand are simultaneously abstract and concrete (domain-specific and
domain-general) by virtue of their dynamical representation—a specific differential equation
that is predicted to be observably satisfied at every point in time during which a gesture is
being produced. Gestures are particularly advantageous for treating timing relationships (at
multiple time scales) as part of a person’s phonological knowledge. In this sense, the
difference between a beatboxing gesture and a beatboxing feature would similarly be a
difference between units that are both domain-specific and domain-general and units that
are purely domain-specific. As discussed in Chapter 1: Introduction, gestures are the
preferred choice of representation when attempting to draw comparisons between speech
and beatboxing because their partly domain-general nature creates explicit, testable links
between the domains. Symbolic boxemes and phonemes, on the other hand, have no basis
for comparison with each other, no intrinsic links to each other, and no basis for one making
predictions about the other because they are defined purely with respect to their own
domain.
CHAPTER 5: ALTERNATIONS
This section addresses whether “forced” {B} and “unforced” {b} varieties of Kick Drum are
cognitively distinct sound categories or cognitively related, context-dependent alternatives of
a single sound category. It is shown that forced and unforced Kick Drums fulfill the same
rhythmic role in a beat pattern, with unforced Kick Drums generally occurring between
sounds with dorsal constrictions and forced Kick Drums generally occurring elsewhere. The
forced and unforced Kick Drums therefore appear to be context-dependent alternations of a
single Kick Drum category, similar to phonological alternations observed in speech.
1. Introduction to Kick Drums
The Kick Drum mimics the kick drum sound of a standard drum set. It is typically
performed as a voiceless glottalic egressive bilabial plosive, also known as a bilabial ejective
(de Torcy et al. 2013, Proctor et al. 2013, Blaylock et al. 2017, Patil et al. 2017, Underdown
2018). Figure 81 illustrates how one expert beatboxer from the rtMRI beatboxing corpus
produces a classic, ejective Kick Drum. First a complete closure is made at the lips and glottis
(Figure 81a), then larynx raising increases intraoral pressure so that a distinct “popping”
sound is produced when lip compression is released (Figure 81b).
Figure 81. Forced/Classic Kick Drum. Larynx raising, no tongue body closure.
Many labial articulations produced by this beatboxer during connected beatboxing
utterances (“beat patterns”) were clearly identifiable as classic ejective Kick Drums during
the transcription process based on observations of temporally proximal labial closures and
larynx raisings. These Kick Drums in beat patterns qualitatively matched the production of
the Kick Drum in isolation (albeit with some quantitative differences, e.g., in movement
magnitude of the larynx).
However, some sounds produced with labial closures in the beat patterns of this data
set did not match the expected Kick Drum articulation—nor were they the same as other
labial articulations like the PF Snare (a labio-dental ejective affricate) or Spit Snare (a
buccal-lingual egressive bilabial affricate). These “mystery” sounds had labial closures and
release bursts most similar to those of the Kick Drum, but were generally produced with a
tongue body closure and without any larynx raising. These differences are visible in a
comparison of Figure 81 (the Kick Drum) with Figure 82 (the mystery labial): in Figure 81,
the tongue body never makes a constriction against the palate or velum, and the bright spot at
the top of the trachea indicates that the vocal folds are closed; but in Figure 82, the tongue
body is pressed against a lowered velum, and the lack of a bright spot indicates that the vocal
folds are spread apart.
Figure 82. Unforced Kick Drum. Tongue body closure, no larynx raising.
Based both on consultation with beatboxers and on the analysis that follows below, this
mystery labial sound has been identified as what is known in the beatboxing community as
an “unforced Kick Drum”—a “weaker” alternative to the more classic ejective “forced” Kick
Drum, and which does not have a common articulatory definition (compared to the forced
Kick Drum, which beatbox researchers have established is commonly an ejective) (Tyte &
SPLINTER, 2014; Human Beatbox, 2018). Given the clear dorsal closure, one might expect
that the unforced Kick Drum would be performed as a lingual (velaric) ingressive (clicklike)
or egressive sound. However, preliminary analysis suggests that the unforced Kick Drum is a
“percussive” (Pike, 1943), referring to the lack of any ingressive or egressive airstream during the
production of the sound (a phonetic term not to be confused with musical percussion). Figure
83 illustrates this via comparison to the Spit Snare, a lingual egressive bilabial sound: the Spit
Snare reduces the volume of the chamber in front of the tongue through tongue fronting and
jaw raising (Figure 83, left), whereas the unforced Kick Drum does neither (Figure 83, right).
Figure 83. Spit Snare vs Unforced Kick Drum. The Spit Snare (left) and unforced Kick Drum
(right) are both bilabial obstruents made with lingual closures. The top two images of each
sound are frames representing time of peak velocity into the labial closure and initiation of
movement out of the labial closure (found with the DelimitGest function of Tiede [2010]).
The difference between frames (bottom) was generated using the imshowpair function in
MATLAB’s Image Processing Toolbox. In both images, purple pixels near the lips indicate
that the lips are closer together in the later frame than in the first. For the Spit Snare, the
purple pixels near the tongue indicate that the tongue moved forward between the two
frames, and the green pixels near the jaw indicate that the jaw rose. For the unforced Kick
Drum, the relative lack of color around the tongue and jaw indicates that the tongue and jaw
did not move much between these two frames.
Not all beatboxers appear to be aware of the distinction between forced and unforced Kick
Drums—or if they are aware, they do not necessarily feel the need to specify which type of
Kick Drum they are using. Hence, while the expert beatboxer in this study did not identify
the difference between forced and unforced Kick Drums and chose to produce only forced
Kick Drums in isolation, they made liberal use of both Kick Drum types in beat patterns
throughout the data acquisition session, as shown in Chapter 3: Sounds.
For another example of beatboxers not distinguishing between forced and unforced
Kick Drums: during an annotation session in the early days of this research, a
researcher-beatboxer of self-assessed intermediate skill involved with this project
demonstrated a beat pattern featuring only sounds with dorsal articulations (a common
strategy used for the practice of phonating while beatboxing, as discussed in Chapter 6:
Harmony). In the beat pattern, she produced several of what we now recognize as unforced
Kick Drums—sounds that act as Kick Drums but have a dorsal articulation instead of an
ejective one. But when asked to name the sound, she simply called it “a Kick Drum,” not
specifying whether it was forced or unforced and apparently not noticing (or caring about,
for that beat pattern) the difference.
The parallel to similar observations about speech is striking. English speakers who
have a sense that words are composed of sounds can often recognize the existence of a
category of sounds like /t/, but may not be aware that it manifests differently (sometimes
very differently) in production depending on a variety of factors including its phonological
environment. In the same way, beatboxers are aware of the Kick Drum sound category but
may not always be aware of the different ways it manifests in production. In symbolic
approaches to phonology, this type of observation has been used to argue for the existence of
abstract phonological categories (e.g., phonemes) with context-dependent alternants
(allophones). In Articulatory Phonology, much of allophony is accounted for by gestural
overlap: instead of categorical changes from one allophone to another depending on context,
the gestures for a given sound are invariant and only appear to change when co-produced
with gestures from another sound (Browman & Goldstein, 1992; see Gafos & Goldstein, 2011
for a review). In either approach, there is a single sound category (a phoneme or gestural
constellation) the manifestation of which varies predictably and unconsciously based on the
sounds in its environment.
Do beatboxers treat forced and unforced Kick Drums as alternate forms of the same
sound category? If so, forced and unforced Kick Drums would be expected to be members of
the same class of sounds and to occur in complementary distributions conditioned by their
phonetic environments. Articulatory Phonology’s account of allophony via temporal overlap
furthermore predicts that the constriction that makes the difference between the sounds will
come from a nearby sound’s gesture. Assuming that the forced Kick Drum is the default
sound because it was the one produced in isolation by the beatboxer, the tongue body
closure characterizing the unforced Kick Drum is predicted to be a gesture associated with
another sound nearby. Establishing the first criterion, that the forced and unforced Kick
Drums are members of the same class of sounds, is done with a musical analysis. A
subsequent phonetic analysis looks for evidence that the two Kick Drums are in
complementary distribution, conditioned by tongue body closures of nearby sounds. Both
analyses are summarized below.
The musical analysis takes into account that beatboxing sounds are organized into
meaningful musical classes. Musical classes of sounds have aesthetically-conditioned metrical
constraints that can be satisfied by any sound in the class; for example, although snare
sounds as a class are generally required on beat 3 (the back beat) of any beatboxing
performance, the requirement can be accomplished with any sound from the class of snares
including a PF Snare, a Spit Snare, or an Inward K Snare. The members of a musical class of
sounds are not necessarily alternations of the same sound—PF Snares and Inward K Snares
are not argued here to be context-dependent variants of an abstract snare category. But for
forced and unforced Kick Drums to be alternants of the same category, they minimally must
belong to the same musical class. Because sounds in a musical class have metrical occurrence
restrictions, a test of musical class membership is to observe whether forced and unforced
Kick Drums are performed with the same rhythmic patterns and metrical distributions. If
they are not, then they are not members of the same musical class and therefore cannot be
alternants of a single abstract category.³
(The names of the sounds clearly imply that
beatboxers treat the forced Kick Drum and unforced Kick Drum as two members of the Kick
Drum musical class; the musical analysis below illustrates this relationship in detail.)
The phonetic analysis is to note the phonetic environment of each Kick Drum type
and look for patterns in the gestures of those environments. Complementary distribution is
found if the phonetic environments of the two types of Kick Drum are
non-overlapping—that is, the selection of a forced or unforced Kick Drum should be
predictable based on its phonetic environment. This type of analysis is performed in many
introductory phonology classes where complementary distribution is often taken as evidence
for the existence of phonemes with multiple allophones.
Sections 2 and 3 below establish that in this data set, forced and unforced Kick Drums
are in fact environmentally-conditioned alternations of a Kick Drum sound category: they
share the same rhythmic patterning (Section 2.1), but unforced Kick Drums are mostly found
between two dorsal sounds whereas forced Kick Drums have a wider distribution (Section
2.2). The unforced Kick Drum therefore appears to be a Kick Drum that has assimilated to
an inter-dorsal environment (and lost its laryngeal gesture in the process). This account of
the data will be reinforced in Chapter 6: Harmony, where it is shown that unforced Kick
Drums often emerge due to tongue body harmony.

³ It may be useful in future analyses to consider the possibility that some sounds vary by metrical position or
otherwise exhibit positional allophony. Guinn & Nazarov (2018) suggest phonotactic restrictions on place
that prevent coronals from occurring in metrically strong positions; perhaps those restrictions are part of a
broader pattern of allophony.
2. Analyses
Beat patterns were transcribed into drum tab notation from real-time MRI videos as
described in Chapter 2: Method. Based on those transcriptions, section 2.1 shows that
unforced Kick Drums have a similar rhythmic distribution to forced Kick Drums,
particularly beat 1 of a beat pattern. Section 2.2 shows that unforced Kick Drums appear to
have a fairly restricted environment, occurring mostly between two dorsal sounds. The two
findings combined suggest that forced and unforced Kick Drums are alternative
contextually-conditioned manifestations of a Kick Drum category (discussed in Section 3).
From this point forward, the ejective (classic/forced Kick Drum) version will be
written in Standard Beatbox Notation {B}, whereas the unforced Kick Drum will be written
in Standard Beatbox Notation {b} (Tyte & SPLINTER, 2014). (Note that uppercase vs
lowercase in Standard Beatbox Notation cannot always be interpreted as a forced vs unforced
distinction. For example, the Closed Hi-Hat is considered a forced sound, but is written with
a lowercase {t}.)
2.1. Rhythmic patterns of Kick Drums
Forty beat patterns were identified as containing a forced Kick Drum, unforced Kick Drum,
or both. One beat pattern with forced Kick Drums was omitted because it also included
unusually breathy (possibly Aspirated) Kick Drums, which are not the subject of this analysis.
Of the remaining thirty-nine beat patterns, all but six were exactly four measures long; for
this analysis, the six longer beat patterns were truncated to just the first four measures. An
exception was made for beat pattern 38 (Figure 86) which comes from the same
performance as beat pattern 28 (Figure 84). The originating beat pattern was 32 measures
long; the first section (measures 1-4, beat pattern 28) used forced Kick Drums whereas the
last section (measures 29-32, beat pattern 38) used both forced and unforced Kick Drums,
and the two sections were judged to have sufficiently distinctive beat patterns that they could
both be included in the analysis.
A total of 40 four-measure Kick Drum patterns were sorted into three groups: 28 beat
patterns that only contain forced Kick Drums (Figure 84), 7 beat patterns that only contain
unforced Kick Drums (Figure 85), and 5 beat patterns that contain both forced and unforced
Kick Drums (Figure 86).
There are many possible forced Kick Drum patterns (Figure 84), but three particular
details will facilitate comparison to unforced Kick Drums. First, in all beat patterns but one
the forced Kick Drum occurs on the very first beat of the very first measure (27/28 cases,
96.4%, beat patterns 2-28). Second, in several cases the Kick Drum occurs on beats 1, 2+, and
4 of the first and third measures (9/28 cases, 32.1%, beat patterns 18-26). And third, 7 of those
same 9 beat patterns feature Kick Drums on 1+ and 2+ of measure 2, with similar patterns in
measure 4 (beat patterns 19-25). There are fewer beat patterns that use unforced Kick Drums
to the exclusion of forced Kick Drums (Figure 85), but the unforced Kick Drums in these
beat patterns show distributions similar to those just described for forced Kick Drums.
First, in all but one beat pattern the unforced Kick Drum occurs on beat 1 of measure 1 (6/7
cases, 85.7%, beat patterns 30-35). Second, the Kick Drum tends to also occur on beats 1, 2+,
and 4 of the first and third measures (5/7 cases, 71.4%, beat patterns 31-35). And third, 4 of
those same 5 beat patterns feature Kick Drums on 1+ and 2+ of measure 2, with similar
patterns in measure 4 (beat patterns 32-35).
Figure 84. Forced Kick Drum beat patterns.
1) B|------x---------|--x---x---------|------x---------|--x---x---------
2) B|x---------------|----------------|x-----------x---|----------------
3) B|x---------------|----x-----------|x---------------|----x-----------
4) B|x---------------|----x-----------|x---------------|----x-----------
5) B|x---------------|x---------x-----|x---------------|x-----x---------
6) B|x--------------x|x---x-----------|x--------------x|x---x-----------
7) B|x--------------x|x---x-----------|x--------------x|x---x-----------
8) B|x--------------x|x---x-----------|x--------------x|x---x-----------
9) B|x--------------x|x---x-----------|x--------------x|x---x-----------
10) B|x-------------x-|----x-----------|x-------------x-|----x-----------
11) B|x-----------x---|----------------|x-----------x---|----------------
12) B|x-----------x---|----------------|x-----------x---|----------------
13) B|x-----------x---|----x-----------|x-----------x---|----x-----------
14) B|x-----------x---|----x-----------|x-----------x---|----x-----------
15) B|x-----------x---|----x-----------|x-----------x---|----x-----------
16) B|x-----------x---|----x-------x---|x-----------x---|----x-----------
17) B|x-----------x---|----x-----x---x-|------------x---|----x-----------
18) B|x-----x-----x---|--x-------------|x-----x-----x---|--x-x-----------
19) B|x-----x-----x---|--x---x---------|x-----x-----x---|--x---x---------
20) B|x-----x-----x---|--x---x-----x---|x-----x-----x---|--x---x-----x---
21) B|x-----x-----x---|--x---x-----x---|x-----x-----x---|--x---x-----x---
22) B|x-----x-----x---|--x---x-----x---|x-----x-----x---|--x---x-----x---
23) B|x-----x-----x---|--x---x---x-----|x-----x-----x---|--x---x---x-----
24) B|x-----x-----x---|--x---x---x---x-|x-----x-----x---|--x-------x-----
25) B|x-----x-----x---|--x---x---x---x-|x-----x-----x---|--x---x---x-----
26) B|x-----x-----x---|-x----x---------|x-----x-----x---|------x---------
27) B|x---x-----x---x-|--x-x-----x-----|x---x-----x---x-|--x-x-----x-x---
28) B|x---x---x---x---|x---x---x---x---|x---x---x---x---|x---x---x---x---
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
| Measure 1 | Measure 2 | Measure 3 | Measure 4
Figure 85. Unforced Kick Drum beat patterns.
29) b|------x---------|--x---x---------|x-----x-----x---|--x---x---x-x---
30) b|x---------------|x---------------|x---------------|x---------------
31) b|x-----x-----x---|----x-------x---|x-----x-----x---|----x-------x---
32) b|x-----x-----x---|--x---x---------|x-----x-----x---|--x---x---x---x-
33) b|x-----x-----x---|--x---x-------x-|x-----x-----x---|--x---x---------
34) b|x-----x-----x---|--x---x---x---x-|x-----x-----x---|--x---x---x---x-
35) b|x-----x-----x---|--x---x-----x---|x-----x-----x---|--x-------x---x-
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
| Measure 1 | Measure 2 | Measure 3 | Measure 4
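The beat positions cited in these counts can be read mechanically off a drum tab line. The sketch below is illustrative only; the even-slot labels follow the "1 + 2 + ..." footers of the drum tabs, while the "e"/"a" names for the sixteenth-note offbeat slots are an assumption borrowed from conventional drum notation.

```python
def hit_beats(measure):
    """Return the beat labels of the slots marked "x" in one 16-slot
    drum tab measure.  Even slots carry the eighth-note labels shown in
    the drum tab footers ("1", "1+", "2", ...); odd slots get the
    conventional sixteenth-note "e"/"a" names (an assumption, since the
    drum tabs themselves label only the eighth-note grid).
    """
    labels = ["1", "1e", "1+", "1a", "2", "2e", "2+", "2a",
              "3", "3e", "3+", "3a", "4", "4e", "4+", "4a"]
    return [labels[i] for i, ch in enumerate(measure) if ch == "x"]

# Measure 1 of beat patterns 18-26: Kick Drums on beats 1, 2+, and 4.
print(hit_beats("x-----x-----x---"))  # ['1', '2+', '4']
```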
Even the two beat patterns in which a Kick Drum does not occur on beat 1 of measure 1
(beat pattern 1 of Figure 84 and beat pattern 29 of Figure 85) are similar: both have a single
Kick Drum on beat 2+ of measure 1, followed by two Kick Drums on beats 1+ and 2+ of
measure 2. (These beat patterns without Kick Drums on the first beat seem exceptional
compared to the rest of the beat patterns that do have Kick Drums on beat 1. Examining the
real-time MRI reveals that there are, in fact, labial closures on beats 1 and 4 of measure 1 in
both of these beat patterns, mimicking the common pattern of Kick Drums on beats 1, 2+,
and 4 of the first measure. The labial closures on beats 1 and 4 are co-produced with other
sounds on the same beat—a Lip Bass in the case of the forced Kick Drum (Figure 84, beat
pattern 1), and a Duck/Meow sound effect in the case of the unforced Kick Drum (Figure 85,
beat pattern 29). While many of the other beat patterns also feature Kick Drums
co-produced with other sounds on the same beat, the labial closures on beats 1 and 4 in these
two exceptional beat patterns have no acoustic release corresponding to the sound of a Kick
Drum, and so are absent from the drum tab transcription.)
Figure 86 shows five cases of beat patterns with both forced and unforced Kick
Drums. Each beat pattern is presented with both forced {B} and unforced {b} Kick Drum
drum tab lines as well as a “both” drum tab line that is the superposition of the two types of
Kick Drum. Notice that the two types of Kick Drum never interfere with each other (i.e., by
occurring on the same beat); on the contrary, they are spaced apart from each other in ways
that create viable Kick Drum patterns. This is especially noticeable in beat patterns 36, 37,
and 40: the Kick Drums collectively create a pattern of Kick Drums on beats 1, 2+, and 4 of the
first measure, one of the common patterns described above (Figure 84, patterns 18-26); but
neither the forced nor the unforced Kick Drums accomplish this pattern alone—the pattern
is only apparent when the two Kick Drum types are combined on the same drum tab line.
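The "both" drum tab lines in Figure 86 are simple overlays of the {B} and {b} lines. A minimal sketch (the helper name is hypothetical, and it assumes "x" marks a hit and "-" a rest):

```python
def superpose(forced_line, unforced_line):
    """Overlay two drum tab lines of equal length: a slot is a hit if
    either line has a hit there; otherwise the forced line's character
    (a rest or a bar line) is copied through.
    """
    assert len(forced_line) == len(unforced_line)
    return "".join(
        "x" if "x" in (f, u) else f
        for f, u in zip(forced_line, unforced_line)
    )

# Measure 1 of beat pattern 36 (Figure 86):
forced = "x-----------x---"
unforced = "------x---------"
print(superpose(forced, unforced))  # x-----x-----x---
```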
Beat patterns 38 and 39 demonstrate that even inconsistent selection of forced and
unforced Kick Drums can still yield an appropriate Kick Drum beat pattern. In beat pattern
38, the first two measures feature mostly forced Kick Drums while the second two measures
feature mostly unforced Kick Drums; despite this, the resulting Kick Drum beat pattern is
clearly repeated with Kick Drums on beats 1, 2+, and 4 of the first and third measures as well
as beats 1+ and 2+ of the second and fourth measures. Likewise in beat pattern 39: even
though the penultimate Kick Drum is the only unforced Kick Drum, it contributes to
repeating the beat pattern from the first two measures.
Figure 86. Beat patterns with both forced and unforced Kick Drums.
36) B|x-----------x---|----x-------x---|x-----------x---|----x-------x---
b|------x---------|----------------|------x---------|----------------
both|x-----x-----x---|----x-------x---|x-----x-----x---|----x-------x---
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
37) B|x-----------x---|--x-------------|x-----------x---|--x-------------
b|------x---------|----------------|------x---------|----------------
both|x-----x-----x---|--x-------------|x-----x-----x---|--x-------------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
38) B|x-----x-----x---|--x-------------|x---------------|----------------
b|----------------|------x---------|------x-----x---|--x---x---------
both|x-----x-----x---|--x---x---------|x-----x-----x---|--x---x---------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
39) B|x-----x-----x---|--x---x---x-----|x-----x-----x---|------x---------
b|----------------|----------------|----------------|--x-------------
both|x-----x-----x---|--x---x---x-----|x-----x-----x---|--x---x---------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
40) B|------x---------|------x---x-----|------x---------|------x---x-----
b|x-----------x---|--x-----------x-|x-----------x---|--x-------------
both|x-----x-----x---|--x---x---x---x-|x-----x-----x---|--x---x---x-----
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
In summary: forced and unforced Kick Drums fill the same metrical positions. When they
occur together in the same beat pattern, their joint patterning resembles typical Kick Drum
patterns—that is, they fill in each other’s gaps. For a beatboxer, this finding is probably
unsurprising. After all, the sounds are both just varieties of “Kick Drum”, so it makes sense
that their occurrences in musical performances would be similar.
But notice now that out of 40 beat patterns, only 5 used both forced and unforced
Kick Drums to build Kick Drum patterns; the remaining 35 beat patterns used either forced
or unforced Kick Drums, but not both. In fact, even in 3 of the 5 beat patterns with both
types of Kick Drums, the metrical distribution of Kick Drums is highly regular. For example,
in beat pattern 36 of Figure 86, unforced Kick Drums only occur on beat 2+ of measures 1
and 3. If forced and unforced Kick Drums are both fulfilling the role of Kick Drum in these
beat patterns, why do they not appear together in the same beat pattern more often? Why do
they not occur in free variation? The next section demonstrates that although forced Kick
Drums and unforced Kick Drums are members of the same musical class, their distribution
is conditioned by the articulations of the musical events around them—similar to some
phonological alternations.
2.2 Phonological environment
2.2.1 Method
Beat patterns were encoded as PointTiers as described in Chapter 2: Method. The PointTier
linearizes beat pattern events into sequences, even when two events are metrically on the
same beat. Most of the time this is desirable; even though a Kick Drum and Liproll may
occur on the same beat, the Kick Drum is in fact produced first in time followed quickly by
the Liproll. However, this linearization is undesirable for laryngeal articulations like
humming, which may in fact be simultaneous with co-produced oral sounds, not sequential.
Figure 87 shows a sample waveform and spectrogram in which acoustic noise and the release
of oral closures may hide the true onset of voicing. Humming articulations that were
annotated in drum tabs as co-occurring on the same beat as an oral sound were removed,
leaving only oral articulations. Each beat pattern’s PointTier representation was converted to
a string in MATLAB using mPraat (Bořil & Skarnitzl, 2016).
Environment types. Similar to some classical phonological analyses, trigram
environments were created from these beat patterns (i.e., {C X D}, where {C} and {D} are two
beat pattern events and {X} is a forced or unforced Kick Drum). Each unique trigram in the
corpus of beat patterns is called an environment type. To ensure that each Kick Drum was in
the middle of an environment type, each beat pattern was prefixed with an octothorpe (“#”)
to represent the beginning of a beat pattern and suffixed with a dollar sign (“$”) to represent
the end of a beat pattern. An utterance-initial unforced Kick Drum before a Clickroll {CR}
might therefore appear as the trigram {# b CR}, and an utterance-final forced Kick Drum
after a Closed Hi-Hat would be {t B $}. The set of unique environment types was generated
from the Text Analytics MATLAB toolbox. Forced Kick Drums were found in 141
environment types; unforced Kick Drums were found in 54 environment types.
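The trigram construction can be sketched as follows; this is an illustrative Python reimplementation of the MATLAB procedure described above, not the original code.

```python
def trigram_environments(events, targets=("B", "b")):
    """Return (before, X, after) trigrams for each Kick Drum in a beat pattern.

    `events` is a beat pattern linearized into a sequence of labels; "#" and
    "$" mark the beginning and end of the pattern, as in the text.
    """
    padded = ["#"] + list(events) + ["$"]
    return [(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)
            if padded[i] in targets]

# An utterance-initial unforced Kick Drum before a Clickroll, and an
# utterance-final forced Kick Drum after a Closed Hi-Hat:
print(trigram_environments(["b", "CR", "t", "B"]))
# [('#', 'b', 'CR'), ('t', 'B', '$')]
```

Collecting the unique trigrams over the whole corpus then yields the set of environment types.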
Environment classes. Since a major articulatory difference between the forced and
unforced Kick Drums appears to be the presence (for unforced Kick Drums) or absence (for
forced Kick Drums) of a dorsal articulation, the unique trigram environment types were
grouped into environment classes[4] based on the dorsal-ness of the sounds adjacent to the
Kick Drum. These environment classes are generalizations that highlight the patterns of Kick
Drum distribution with respect to dorsal-ness.

[4] Linguists would traditionally be looking for “natural” classes here. The term “environment class” skates around issues of “naturalness” in speech and beatboxing, but the methodological approach to classifying a sound’s phonological environment is essentially the same.
Figure 87. An excerpt from a PointTier with humming. In this beat pattern, the oral
articulators produce the sequence {b dc tbc b SS}, where {b} is an unforced Kick Drum, {dc}
and {tbc} are dental and interlabial clicks, and {SS} is a Spit Snare. The initial unforced Kick
Drum {b} and the interlabial click {tbc} are both co-produced with an upward pitch sweep
marked as {hm} and called “humming”. These hums were removed for this analysis, leaving
only the oral articulations. (Note that this audio signal was significantly denoised from its
original recording associated with the real-time MRI data acquisition, but a few artefacts
remain as echoes that follow most sounds in the recording.)
For example, consider two hypothetical trigram environment types: {SS b dc}, which is an
unforced Kick Drum between a Spit Snare {SS} and dental closure {dc}, and {^K b LR},
which is an unforced Kick Drum between an Inward K Snare {^K} and a Liproll {LR}. The
Spit Snare, dental closure, Inward K Snare, and Liproll all involve dorsal articulations, so the
environment types {SS b dc} and {^K b LR} would both be members of the environment
class {[+ dorsal] __ [+ dorsal]}. (The +/- binary feature notation style used here is for
convenience to represent the existence or absence of a dorsal closure and should not be
taken as an implication that this is a symbolic featural analysis). The options [+ dorsal], [-
dorsal], and utterance-boundary (“#” or “$”) can occur in both the before and after positions
for a Kick Drum environment, resulting in nine (3 * 3 = 9) logically possible Kick Drum
environment classes; two of these nine did not have any Kick Drum tokens in them, leaving
seven Kick Drum environment classes listed in Tables 23 and 24. Not all environment classes
were used by either type of Kick Drum.
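Grouping trigrams into environment classes can be sketched like this; the DORSAL membership set is a hypothetical stand-in for the articulatory classifications of Chapter 3: Sounds, not the dissertation's actual list.

```python
# Hypothetical dorsal membership set, standing in for the articulatory
# classifications of Chapter 3: Sounds.
DORSAL = {"SS", "CR", "LR", "WDT", "WDA", "dc", "tbc", "tll", "^K"}

def environment_class(trigram):
    """Map a (before, X, after) trigram onto its environment class label."""
    def side(label):
        if label in ("#", "$"):
            return label  # utterance boundary
        return "[+ dorsal]" if label in DORSAL else "[- dorsal]"
    before, x, after = trigram
    return f"{side(before)} {x} {side(after)}"

print(environment_class(("SS", "b", "dc")))  # [+ dorsal] b [+ dorsal]
print(environment_class(("t", "B", "$")))    # [- dorsal] B $
```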
2.2.2 Results
Tables 21 and 22 present the forced and unforced Kick Drum frequency distributions across
environments by token frequency (how many Kick Drums of a given kind were in each
environment class) and type frequency (how many unique trigram environment types of a
given Kick Drum kind were in each environment class). Table 21 shows the results of the
analysis for the forced Kick Drum environments, and Table 22 shows the results for the
unforced Kick Drum environments.
Table 21 summarizes the distribution of 330 forced Kick Drum tokens across 141
unique trigram environment types, which generalize to six environment classes. The
majority of forced Kick Drum tokens and environment types did not include proximity to a
dorsal sound ("Not near a dorsal" in Table 21). The forced Kick Drums that did occur near
dorsals tended to have a non-dorsal sound on their opposite side (i.e., {[- dorsal] B [+
dorsal]} or {[+ dorsal] B [- dorsal]}). As shown in Table 22, the vast majority (93.9%) of
unforced Kick Drum tokens occurred in environment classes that included one or more
dorsal sounds near the unforced Kick Drum (the “Near a dorsal” classes), with most of those
(83.3%) featuring dorsal sounds on both sides of the unforced Kick Drum. This is essentially
the reverse of the distribution of forced Kick Drums which were highly unlikely to occur
between dorsal sounds.
Tables 23 and 24 show contingency tables for observations of forced and unforced
Kick Drum environment types (Table 23) and tokens (Table 24). Fisher’s exact tests on these
tables were significant (p < 0.001 in both cases), meaning that the frequency distribution of
Kick Drums in these environments deviated from the expected frequencies—that is, Kick
Drum types appeared often in some environments and sparsely in others. Tables 23 and 24
highlight in green the cells with the highest frequencies and which correspond to the
observations in Tables 21 and 22: forced Kick Drums tend to occur between non-dorsal
sounds while unforced Kick Drums tend to occur between dorsal sounds.
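For illustration, a self-contained two-sided Fisher's exact test can be run on a 2x2 collapse of the token counts in Table 24 (between two dorsals vs all other environments). The dissertation's tests were run on the full contingency tables, so this sketch only approximates that comparison; the function is written from the standard hypergeometric definition rather than taken from the original analysis.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins whose probability does not exceed that of the observed table.
    """
    n = a + b + c + d
    row1, col1 = a + b, a + c

    def prob(x):
        # Probability of a same-margins table with x in the top-left cell.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = prob(a)
    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))

# Token counts collapsed from Table 24 for illustration:
# forced vs unforced Kick Drums, between two dorsals vs any other environment.
p = fisher_exact_2x2(18, 330 - 18, 95, 114 - 95)
print(p < 0.001)  # True: the distributions differ, as reported above
```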
Table 21. Forced Kick Drum environments.

                                               Number of             Tokens in
                   Environment class           environment types     environment class
Near a dorsal      [+ dorsal] B [+ dorsal]       8     5.7%           18     5.5%
                   # B [+ dorsal]                1     0.7%            1     0.3%
                   [+ dorsal] B [- dorsal]      28    19.9%           60    18.2%
                   [- dorsal] B [+ dorsal]      20    14.2%           42    12.7%
Not near a dorsal  [- dorsal] B [- dorsal]      63    44.7%          183    55.5%
                   # B [- dorsal]               21    14.9%           26     7.9%
Total                                          141     100%          330     100%
Table 22. Unforced Kick Drum environments.

                                               Number of             Tokens in
                   Environment class           environment types     environment class
Near a dorsal      [+ dorsal] b [+ dorsal]      42    76.4%           95    83.3%
                   [- dorsal] b [+ dorsal]       1     1.8%            1     0.9%
                   # b [+ dorsal]                5     9.1%            7     6.1%
                   [+ dorsal] b [- dorsal]       2     3.6%            2     1.8%
                   [+ dorsal] b $                2     3.6%            2     1.8%
Not near a dorsal  [- dorsal] b [- dorsal]       3     5.5%            7     6.1%
Total                                           55     100%          114     100%
Table 23. Kick Drum environment type observations. Forced Kick Drum trigram
environment types were most likely to be of the {[- dorsal] B [- dorsal]} environment class,
while unforced Kick Drum environment types were most likely to be of the {[+ dorsal] b [+
dorsal]} environment class.

                           Forced Kick Drum     Unforced Kick Drum
Environment class          environment types    environment types     Total
[+ dorsal] X [+ dorsal]          8                   41                 49
[+ dorsal] X [- dorsal]         28                    2                 30
[- dorsal] X [+ dorsal]         20                    1                 21
[- dorsal] X [- dorsal]         63                    3                 66
# X [+ dorsal]                   1                    5                  6
# X [- dorsal]                  21                    0                 21
[+ dorsal] X $                   0                    2                  2
Total                          141                   54                195
Table 24. Kick Drum token observations. Forced Kick Drum tokens were most likely to occur
in the {[- dorsal] B [- dorsal]} environment class, while unforced Kick Drum tokens were
most likely to occur in the {[+ dorsal] b [+ dorsal]} environment class.

                           Forced Kick Drum     Unforced Kick Drum
Environment class          token frequency      token frequency       Total
[+ dorsal] X [+ dorsal]         18                   95                113
[+ dorsal] X [- dorsal]         60                    2                 62
[- dorsal] X [+ dorsal]         42                    1                 43
[- dorsal] X [- dorsal]        183                    7                190
# X [+ dorsal]                   1                    7                  8
# X [- dorsal]                  26                    0                 26
[+ dorsal] X $                   0                    2                  2
Total                          330                  114                444
Figure 88 shows the time series for a sequence of a lateral alveolar closure, unforced Kick
Drum, and Spit Snare {tll b SS}. The sounds surrounding the unforced Kick Drum both have
tongue body closure: the lateral alveolar closure is a percussive like the unforced Kick Drum,
which in this case means it has tongue body closure but no substantial movement of the
tongue body forward or backward to cause a change in air pressure; the Spit Snare, on the
other hand, is a lingual egressive sound, requiring a tongue body closure and subsequent
squeezing of air past the lips. The tongue body maintains a high closure throughout the
sequence, as reflected by consistently high pixel intensity values in the DOR region,
indicating that the Kick Drum may be unforced because of gestural overlap with one or more
tongue body closures intended for a nearby sound like the Spit Snare. The LAR time series
for larynx height is also included to confirm that there is no ejective-like action here that
would correspond to a forced Kick Drum.
Figure 88. A sequence of a lateral alveolar closure {tll}, unforced Kick Drum {b}, and Spit
Snare {SS}. The DOR region of the tongue body has relatively high pixel intensity
throughout the sequences, and the LAR region of the larynx has low pixel intensity.
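The region-of-interest time series behind Figure 88 can be sketched schematically. The real analysis operates on rtMRI pixel data; the function name, frame layout, and toy values below are purely illustrative assumptions.

```python
def roi_intensity(frames, mask):
    """Mean pixel intensity inside a region of interest, one value per frame.

    `frames` is a list of 2-D pixel grids (one per video frame) and `mask`
    a same-shaped boolean grid marking the region (e.g. DOR or LAR). A
    persistently high DOR trace indicates a sustained tongue body closure.
    """
    roi = [(r, c) for r, row in enumerate(mask)
           for c, inside in enumerate(row) if inside]
    return [sum(frame[r][c] for r, c in roi) / len(roi) for frame in frames]

# Toy 2x2 frames: the left column (the "region") stays bright, then dims.
frames = [[[1.0, 0.0], [1.0, 0.0]],
          [[0.9, 0.1], [0.9, 0.1]],
          [[0.2, 0.0], [0.2, 0.0]]]
mask = [[True, False], [True, False]]
print(roi_intensity(frames, mask))  # [1.0, 0.9, 0.2]
```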
3. Conclusion
Forced and unforced Kick Drums are in complementary distribution: unforced Kick Drums,
which were described earlier as having a dorsal articulation in addition to a labial closure,
tend to occur near dorsal sounds; forced Kick Drums do not share this dorsal articulation,
and tend to occur near non-dorsal sounds. Based on this context-dependent complementary
distribution and their similar rhythmic patterning, the forced and unforced Kick Drums
seem to be cognitively related as alternations of a single Kick Drum category.
Given the matching dorsal or non-dorsal quality of a Kick Drum and its
surroundings, it seems likely that the alternations are specifically participating in a
phonological agreement/assimilation phenomenon. The tongue body does not appear to
release its closure between the unforced Kick Drum and the sound before or after it. In a
traditional phonological analysis, one could posit a phonological rule to characterize this
distribution such as: “Kick Drums are unforced (dorsal) between dorsal sounds and forced
(ejective) elsewhere.” (Forced Kick Drums are the elsewhere case because their occurrence is
distributed somewhat more evenly over more environment classes.)
{B} → {b} / [+ dorsal] ___ [+ dorsal]
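As a toy illustration, the rule above can be implemented symbolically over a linearized beat pattern; the DORSAL set here is a hypothetical stand-in for the dorsal classification used in this chapter.

```python
# Hypothetical dorsal membership set for illustration only.
DORSAL = {"SS", "CR", "LR", "dc", "tbc", "tll"}

def apply_kick_rule(events):
    """Rewrite forced Kick Drums {B} as unforced {b} between dorsal sounds."""
    out = list(events)
    for i, e in enumerate(out):
        if (e == "B" and 0 < i < len(out) - 1
                and out[i - 1] in DORSAL and out[i + 1] in DORSAL):
            out[i] = "b"
    return out

print(apply_kick_rule(["tll", "B", "SS", "t", "B", "t"]))
# ['tll', 'b', 'SS', 't', 'B', 't']
```

Only the Kick Drum flanked by dorsal sounds alternates; the one between non-dorsal sounds surfaces as the forced (elsewhere) variant.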
The Articulatory Phonology analysis is roughly the same, if not so featural: Kick Drums are
unforced if they overlap with a tongue body closure. These interpretations assume a causal
relationship in which the Kick Drum is altered by its environment, but an alternative story
reverses the causation: forced and unforced Kick Drums are distinct sound categories that
trigger dorsal assimilation in the sounds nearby. The analysis of beatboxing phonological
harmony in Chapter 6: Harmony provides further evidence that the Kick Drum is subject to
change depending on the sounds nearby—including non-adjacent dorsal harmony
triggers—and not the other way around.
Kick Drums are not the only sounds in the data set to show this type of pattern,
though their relatively high token frequency makes them the only sounds to show it so
robustly. As Chapter 3: Sounds listed, there are two labio-dental compression sounds: a
glottalic egressive PF Snare and a percussive labio-dental sound. As its name implies, the PF
Snare fulfills the musical role of a snare by occurring predominantly on the back beat of a
beat pattern. Suspiciously, the labio-dental percussive also appears on the back beat in the
two beat patterns it occurs in, and just like the unforced Kick Drum it occurs surrounded by
sounds with tongue body closures. The same goes for the Closed Hi-Hat and some of the
coronal percussives, though the pattern is confounded somewhat by the percussives being
distributed over several places of articulation while the Closed Hi-Hat is a distinctly alveolar
sound. Taking the Kick Drum, PF Snare, and Closed Hi-Hat together suggests that the
phenomenon discussed in this chapter is actually part of a general pattern that causes some
ejectives to become percussives when other sounds with tongue body closures are nearby.
Again, Chapter 6: Harmony addresses this in more detail.
CHAPTER 6: HARMONY
Some beatboxing patterns include sequences of sounds that share a tongue body closure, a
type of agreement that in speech might be called phonological harmony. This chapter
demonstrates that beatboxing harmony has many of the signature attributes that
characterize harmony in phonological systems in speech: sounds that are harmony triggers,
undergoers, and blockers. In beatboxing, the function of a sound in harmony is predictable
based on the phonetic dimension of airstream initiator. This analysis of beatboxing harmony
provides the first evidence for the existence of sub-segmental cognitive units of beatboxing
(vs whole segment-sized beatboxing sounds). These patterns also show that the harmony
found in spoken phonological systems is not unique to speech.
1. Introduction
A common type of beat pattern in beatboxing involves the simultaneous production of
obstruent beatboxing sounds and phonation (which may not always be modal). This type of
"humming while beatboxing" beat pattern is well-known by beatboxers and treated as a skill
to be developed in the pursuit of beatboxing expertise (Stowell & Plumbley, 2008; Park, 2016;
WIRED, 2020).
Figure 89. A beat pattern that demonstrates the beatboxing technique of humming with
simultaneous oral sound production. This beat pattern contains five sounds: an unforced
Kick Drum {b}, a dental closure {dc}, a linguolabial closure {tbc}, a Spit Snare {SS}, and brief
moment of phonation/humming {hm}. In this beat pattern, humming co-occurs with other
beatboxing sounds on most major beats (i.e., 1, 2, 3, and 4, but not their subdivisions).
b |x-----x-----x---|--x---x-------x-|x-----x-----x---|--x---x---------
dc |--x-----------x-|----------------|--x-----------x-|------------x---
tbc|----x-----------|x---x-------x---|----x-----------|x---x-----------
SS |--------x-------|--------x-------|--------x-------|--------x-------
hm |x---x---x---x---|x---x---x---x---|x---x---x---x---|x---x---x---x---
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
Without knowing the articulation in advance, a humming while beatboxing beat pattern is
a pneumatic paradox: humming requires a lowered velum to keep air pressure low above
the vocal folds while they vibrate and to allow air to escape through the nose, but glottalic
and pulmonic obstruents—which many beatboxing sounds are (see Chapter 3:
Sounds)—require a raised velum so air pressure can build up behind an oral closure. The
production of voiced stops in speech comes with similar challenges; languages with voiced
stops use a variety of strategies such as larynx lowering to decrease supraglottal pressure
(Catford, 1977; Ohala, 1983; Westbury, 1983). Real-time MRI examples later in this chapter
show that beatboxers use a different strategy to deal with the humming vs obstruent
antagonism: separating the vocal tract into two uncoupled chambers with a tongue body
closure (see also Dehais-Underdown et al., 2020; Paroni, 2021b). Behind the tongue body
closure, the velum is lowered and phonation can occur freely with consistently low
supraglottal pressure. In front of the tongue body closure, air pressure is manipulated by the
coordination of the tongue body and the lips or tongue tip. In speech, a similar articulatory
arrangement is used for the production of voiced or nasal clicks.
The examples above of speech remedies for voiced obstruents operate over a
relatively short time span near when voicing is desired. Notice, however, that phonation {hm}
in the beat pattern from Figure 89 is neither sustained nor co-produced with every oral
beatboxing sound, yet every sound in the pattern is produced with a tongue body closure. It
turns out that other beat patterns like the one in Figure 90 also feature many sounds with
tongue body closures even when the beat pattern has no phonation at all; the humming
while beatboxing example is just one of several beat pattern types in which multiple sounds
share the property of being produced with a tongue body constriction. When multiple
sounds share the same attribute in speech, the result is phonological “harmony”.
This chapter demonstrates the existence of harmony in beatboxing, and in doing so
offers deep insights about the makeup of the fundamental units of beatboxing cognition. The
remainder of this section provides a basic overview of local (vowel-consonant) harmony in
speech (section 1.1) and previews some of the major theoretical issues at stake in the
description of tongue body closure harmony for beatboxing (section 1.2).
Figure 90. This beat pattern contains four sounds: a labial stop produced with a tongue body
closure labeled {b}, a dental closure {dc}, a lateral closure {tll}, and a lingual egressive labial
affricate called a Spit Snare {SS}. All of the sounds are made with a tongue body closure.
b |x-----x-----x---|----x-------x---|x-----x-----x---|----x-------x---
dc |----x-----------|----------------|----x-----------|----------------
tll|----------------|x---------------|----------------|x---------------
SS |--------x-------|--------x-------|--------x-------|--------x-------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
1.1 Speech harmony
Harmony in speech occurs when multiple distinct phonological segments can be said to
“agree” with each other by expressing the same particular phonological property. There are a
few different types of harmony patterns in speech, but the most relevant to this study is
“local harmony” in which the sounds that agree with each other occur in an uninterrupted
sequence. (Local harmony is also known as “vowel-consonant harmony” because it affects
both vowels and consonants). Rose & Walker (2011) describe a few types of local harmony
including nasal harmony, emphatic (pharyngeal) harmony, and retroflex harmony.
As a phonological phenomenon, harmony is ultimately governed by the goals of
speech—specifically, the task of communicating a linguistic message. Part of accomplishing
this task is to create messages that have a high likelihood of being accurately recovered by
someone perceiving the message. Harmony is one of several mechanisms that have been
hypothesized for strengthening contrasts that may otherwise be perceptually weak:
perceptually weak phonological units are more likely to be heard if they last longer and
overlap with multiple segments (Kaun, 2004; Walker 2005; Kimper 2011). Less teleologically,
others suppose that (local) harmony is the diachronic phonologization of coarticulation
(Ohala, 1994) or stochastic motor control variation (Tilsen, 2019), which may have
perceptual benefits. In either view, local harmony is initiated by a “trigger” segment which
has some phonological property to spread (i.e., a feature or gesture). Through harmony, that
property is shared with other nearby segments (“targets” or “undergoers”) so that they end
up expressing the same phonological information as the trigger segment.
The same overarching task of producing a perceptually recoverable message which
may motivate harmony also constrains which phonological properties of a sound will spread
and how. Harmony must be unobtrusive enough that it does not destroy other crucial
phonological contrasts; tongue body closure harmony, for example, is unattested in speech
because it would destroy too much information by turning all vowels and consonants into
velar stops (Gafos, 1996; Smith, 2018). Likewise, sounds that would be disrupted by harmony
should be able to resist harmonizing and prevent its spread; these types of sounds are called
“blockers”. In other languages, some sounds might be “transparent” instead, meaning that
they neither undergo nor block the harmony.
Theoretical accounts generally treat local harmony as the spreading of a single
phonological property to other sounds (Rose & Walker 2011). In featural accounts, this is
often done by formally linking a feature to adjacent segments according to some rule,
constraint, or other grammatical or dynamical force. In gestural accounts, local harmony has
been modeled as maintaining a particular vocal tract constriction over the course of multiple
segments (Gafos, 1996; Walker et al., 2008; Smith, 2018).
In sum, local/vowel-consonant harmony in speech is observed when multiple sounds
in a row share the same feature or gesture. Harmony is analyzed as a feature or gesture
spreading from a trigger unit onto or through adjacent segments called undergoers, though
some segments may also block harmony or be transparent to it. To the extent that harmony
is goal-oriented, it is likely motivated by a speech goal of promoting perceptual recoverability
of a linguistic message; harmony supports this goal by providing a listener more
opportunities to perceive what might otherwise be a perceptually weak feature or gesture.
1.2 Beatboxing harmony
Figures 89 and 90 provided examples of beatboxing sequences in which each sound has a
tongue body closure. While these beat patterns may be harmonious in the sense that the
sounds agree on some property, it does not mean that beatboxing harmony has the same
traits as speech harmony. The overarching goals of beatboxing are more aesthetic than
communicative, so beatboxing harmony may be related to less meaningful—but still
perceptually salient—aesthetic goals. For example, the humming while beatboxing pattern
described earlier allows the beatboxer to add melody to a beat pattern. Even without
phonation, it may sometimes be desirable to make many sounds with a tongue body closure
to create a consistent sound quality from the shorter resonating chamber in front of the
tongue body. Given the completely different tasks that drive speech and beatboxing
harmonies, they could in principle arise from completely distinct motivations using
completely distinct mechanisms, such that any resemblance between them is purely
superficial.
One way to determine whether beatboxing harmony bears only superficial similarity
to harmony in speech or a deeper one based on a partly shared cognitive system
underlying sequence production is to see whether or not beatboxing harmony exhibits the
signature properties of speech harmony beyond the existence of sequences that share some
properties, namely: triggers, undergoers, and blockers. For example, consider a beatboxing
sequence like *{CR WDT SS WDT}. (The asterisk on that beat pattern indicates that it is not
a sequence found in this data set, which is not quite the same thing as saying that it is an
ill-formed beatboxing sequence.) In that sequence, each sound requires a tongue body
closure, so there may be a separate tongue body closure for each sound rather than a
prolonged tongue body closure that would be expected in speech harmony. Either way, none
of the sounds would have to trigger or undergo a tongue body closure assimilation to create
harmony because they all have tongue body closures in any context in which they appear;
and if there is no evidence for triggers, there could be no evidence for blockers either.
Alternatively, evidence could suggest that harmony in speech and beatboxing share
some deeper principles. Local harmony in speech involves prolonged constrictions; since
plenty of other nonspeech behaviors involve holding a body part in one place for an
extended period of time, beatboxing could do that too in order to create a prolonged
aesthetic effect. And if a beatboxer holds a tongue body closure for an extended period of
time during a beat pattern, the closure would temporally overlap with other sounds and
ensure that they are made with a tongue body closure too—even if they weren’t necessarily
selected to have one and wouldn’t have the tongue body closure in other contexts. Thus,
beatboxing might have triggers and undergoers (alternants, as in Chapter 5: Alternations).
Furthermore, if some beatboxing sounds in the same pattern cannot be produced with a
tongue body closure without radically compromising their character, those sounds might
block the tongue body closure harmony. Beatboxing harmony might present all the same
signature properties as speech harmony but for different aims.
Finding evidence in beatboxing for sustained constrictions and sounds with signature
harmony properties is not enough to claim that beatboxing harmony is like speech harmony.
Phonological harmony is a sound pattern. It’s predictable. Triggers, undergoers, and blockers
are classes of sounds organized by sub-segmental properties they share. If beatboxing has the
same type of harmony, then the sounds of beatboxing harmony must be organized along
similarly sub-segmental lines. Chapter 3: Sounds used analytic dimensions to describe the
phonetic organization of beatboxing sounds. The aim of the current chapter is to test
whether any of these dimensions play a role in the active cognitive patterning of beatboxing.
If beatboxing can be shown to exhibit harmony, then the roles of the sounds in a harmony
pattern—triggers, undergoers, blockers—should be predictable by some phonetic dimension
along which they are distributed. In turn, those same phonetic dimensions must be
sub-segmental cognitive units for beatboxing.
In the context of the larger question of domain-specificity of language cognition, the
analyses of this chapter aim at answering whether or not harmony is unique to language.
Theories of phonological harmony are designed only to account for language data; but if
beatboxing also has harmony, then a theory is needed that accounts for the shared or
overlapping cognitive structures of speech and beatboxing. The shared-graph hypothesis in
Chapter 4: Theory represents an initial attempt to do that.
In summary, beatboxing harmony may resemble speech harmony one of two
different ways: in only the superficial sense that sequences of sounds share similar properties,
or in the more profound sense that harmony is governed by phonological principles similar
to those found for speech. In the latter case, beatboxing sounds that participate in harmony
patterns should be reliably classifiable into roles like trigger, undergoer, and blocker.
Furthermore, if these roles can be predicted by one or more phonetic attributes, then
harmony in beatboxing is also evidence for the existence of cognitive sub-segmental
beatboxing units. Like speech harmony, beatboxing harmony should then be able to be
accounted for using phonological models of harmony.
Section 2 introduces the method by which the beatboxing corpus was probed to
discover and analyze beatboxing harmony examples. Section 3 describes a subset of the
harmony examples in terms of the evidence for triggers, undergoers, and blockers. Section 4
argues for the existence of cognitive sub-segmental beatboxing elements relating to airflow
initiators and provides an account of beatboxing harmony patterns using gestures made
possible via Chapter 4: Theory.
2. Method
See Chapter 2: Method for details of how the rtMR videos were acquired and annotated then
converted to time series and gestural scores for the analysis below.
The videos and drum tabs of each beat pattern were visually inspected in order to
identify those which had sequences of sounds produced with tongue body closures. Eleven
such beat patterns were identified. For this analysis, each of those 11 beat patterns was
examined more closely to evaluate the constriction state of the tongue body during and
between the articulation of sounds in the beat pattern. These observations were
supplemented and corroborated by region-of-interest time series analysis.
Most of the beat patterns in the database were performed to showcase a particular
beatboxing sound. Seven of the eleven beat patterns exhibiting persistent tongue body
closure were from these showcase beat patterns, each of which features a sound that is
produced with a tongue body closure: Clickroll {CR}, Clop {C}, Duck Meow SFX, Liproll
{LR}, Spit Snare {SS}, Water Drop Air {WDA}, and Water Drop Tongue {WDT}. Two other
beat patterns showcasing the Inward Bass and the Humming while Beatboxing pattern were
also performed with a persistent tongue body closure; both of these beat patterns included
the Spit Snare {SS}. The final two beat patterns did not showcase any beatboxing sound in
particular: one was a long beat pattern featuring the Spit Snare, in which the last few
measures were made with a persistent tongue body closure; the other includes both the Spit
Snare and the Water Drop Tongue.
3. Results: Description of beatboxing harmony patterns
Five of the eleven beat patterns with harmony are discussed in this section to illustrate how
beatboxing harmony manifests and to test the hypothesis that beatboxing harmony exhibits
some of the signature properties of speech harmony discussed above.
These five are the Spit Snare {SS} showcase (beat pattern 5), the humming while beatboxing
pattern (beat pattern 9), the Clickroll {CR} showcase (beat pattern 1), the Liproll {LR}
showcase (beat pattern 4), and a freestyle beat pattern that was not produced with the
intention of showcasing any particular beatboxing sound (beat pattern 10). As summarized
in Table 25, these beat patterns depict a beatboxing harmony complete with sounds that
trigger the bidirectional spreading of a lingual closure, sounds that undergo alternations due
to that closure, and sounds that block the spread of harmony.
Table 25. Summary of the five beat patterns analyzed.

Section 3.1. Beat pattern 5 — Spit Snare {SS} showcase
Observation: The tongue body rises into a velar closure at the beginning of the utterance and stays there until the end of the utterance. Kick Drums in the scope of this velar closure lose their larynx raising movement.
Analysis: The Spit Snare triggers bidirectional tongue body closure harmony. Kick Drums in the environment of the harmony lose their larynx raising movement when they gain their tongue body closure, and therefore exhibit an alternation from a glottalic egressive to percussive airstream.

Section 3.2. Beat pattern 9 — Humming while beatboxing
Observation: A velar tongue body closure splits the vocal tract into two chambers so that percussion and voicing can be produced independently.
Analysis: Tongue body closure harmony is triggered again by the Spit Snare. It does not restrict all laryngeal activity—it allows vocal fold adduction for voicing (humming), but eliminates the larynx raising movements associated with Kick Drums.

Section 3.3. Beat pattern 4 — Liproll {LR} showcase
Observation: Tongue body harmony is again achieved by maintaining a closure against the upper airway. However, the location of that closure moves back and forth between the palate and the uvula as required by the Liproll. When the Liproll is not active, the tongue body adopts a velar position.
Analysis: Tongue body closure harmony does not require a static tongue posture; it allows variability in constriction location so long as the constriction degree remains a closure. The Liproll is the harmony trigger this time, and PF Snares undergo harmony.

Section 3.4. Beat pattern 10 — Freestyle pattern 1
Observation: Some sequences of sounds agree in tongue body closure, but these groups are separated from each other by sounds without tongue body closure including the Inward Liproll and High Tongue Bass. Kick Drums near these two sounds retain their larynx raising movements.
Analysis: The Spit Snare is once again a harmony trigger, but the Inward Liproll {^LR} and High Tongue Bass {HTB} block the spread of harmony. Both blocking sounds are pulmonic, indicating that harmony is blocked by pulmonic airflow. Temporal proximity to the harmony blockers prevents the Kick Drums from harmonizing.

Section 3.5. Beat pattern 1 — Clickroll {CR} showcase
Observation: Brief sequences agreeing in tongue body closure are broken up by forced Kick Drums and Inward K Snares. The tongue body is elevated during the forced Kick Drums but an air channel over the tongue is created by raising the velum.
Analysis: The Clickroll triggers tongue body closure harmony and the pulmonic Inward K Snare blocks harmony. As with beat pattern 10, Kick Drums close to the harmony blocker are not susceptible to harmonizing. The elevated tongue body position during forced Kick Drums is argued to be anticipatory coarticulation from the Inward K Snare.
As for the other six beat patterns not discussed: the Clop {C} showcase (beat pattern 2) was
not analyzed because it only contains one oral sound—the Clop {C}; the Duck Meow SFX
{DM} showcase was not analyzed because a complete phonetic description of the Duck
Meow SFX could not be given in Chapter 3: Sounds, making an articulatory analysis infeasible. The remaining beat patterns for the Water Drop Air {WDA} showcase
(beat pattern 6), Water Drop Tongue {WDT} showcase (beat pattern 7), Inward Bass {IB}
showcase (beat pattern 8), and second freestyle pattern (beat pattern 11) all exhibit
bidirectional spreading like beat pattern 5. Beat pattern 7 is additionally confounded by the
presence of two sounds that use tongue body closures when performed in isolation.
Table 26 lists the beatboxing sounds used in the remainder of this chapter, along with
their transcription in BBX notation (see Chapter 3: Sounds). Transcription in notation from the International Phonetic Alphabet is also provided, which incorporates symbols from the extensions to the International Phonetic Alphabet for disordered speech (Duckworth et al., 1990; Ball et al., 2018) and the VoQS System for the Transcription of Voice Quality (Ball et
al., 1995; Ball et al., 2018). An articulatory description of each sound is also given in prose.
The table groups the sounds by their role in beatboxing harmony (which the subsequent
analysis provides evidence for). Note that “percussives” are sounds made with a posterior
tongue body closure but without the tongue body fronting or retraction associated with
lingual airstream sounds.
Table 26. The beatboxing sounds used in this chapter.
Name BBX IPA Description
Triggers
Spit Snare {SS} [ʘ͡ɸ↑] Voiceless lingual egressive bilabial affricate
Clickroll {CR} [*] Voiceless lingual egressive alveolar trill
Liproll {LR} [ʙ̥↓] Voiceless lingual ingressive bilabial trill
Blockers
Inward Liproll {^LR} [ʙ̥↓] Voiceless pulmonic ingressive bilabial trill
High Tongue Bass {HTB} [r] Voiced pulmonic egressive alveolar trill (high pitch)
Inward K Snare {^K} [k͡ʟ̝̊↓] Voiceless pulmonic ingressive lateral velar affricate
Undergoers (alternants of other sounds)
Unforced Kick Drum {b} [ʬ] Voiceless percussive bilabial stop
Labiodental closure {pf} [ʘ̪] Voiceless percussive labiodental stop
Dental closure {dc} [k͜ǀ] Voiceless percussive dental stop
Other
Kick Drum {B} [p’] Voiceless glottalic egressive bilabial stop
Closed Hi-Hat {t} [t’] Voiceless glottalic egressive alveolar stop
Humming {hm} [C̬] Pulmonic egressive nasal voicing
Linguolabial closure {tbc} [ʘ̺] Voiceless percussive linguolabial stop
Dental-alveolar closure {dac} Voiceless percussive laminal dental stop
Alveolar closure {ac} [k͜ǃ] Voiceless percussive alveolar stop
Lateral alveolar closure {tll} [ǁ] Voiceless percussive lateral alveolar stop
3.1 Beat pattern 5—Spit Snare showcase
Beat pattern 5 showcases the Spit Snare {SS}. Section 3.1.1 demonstrates how the tongue
body makes a velar closure throughout the entire performance, making this a relatively
simple case of tongue body closure harmony. The tongue body closure results in alternations
from forced (ejective) to unforced (percussive) sounds as well as a lack of laryngeal
movement associated with ejectives. Section 3.1.2 analyzes the pattern in terms of a tongue
body harmony trigger and undergoers. Table 27 re-lists the beatboxing sounds used in beat
pattern 5 for reference.
Table 27. Sounds of beatboxing used in beat pattern 5.
Name BBX IPA Description
Unforced Kick Drum {b} [ʬ] Voiceless percussive bilabial stop
Dental closure {dc} [k͜ǀ] Voiceless percussive dental stop
Spit Snare {SS} [ʘ͡ɸ↑] Voiceless lingual egressive bilabial affricate
Lateral alveolar closure {tll} [ǁ] Voiceless percussive lateral alveolar stop
3.1.1 Description of beat pattern 5
Beat pattern 5 is a relatively simple example of tongue body closure harmony in beatboxing.
As the drum tab (Figure 91) and time series (Figure 93) below show, the tongue body makes
a closure against the velum for the entire duration of the beat pattern.
3.1.1.1 Drum tab
The Spit Snare is metrically positioned as expected on the back beat (beat 3 of each
measure), and the unforced Kick Drum occurs in a relatively common pattern on beats 1, 2+,
and 4 of the first measure and beats 2 and 4 of the second measure, repeating the two-measure pattern for measures 3 and 4. The dental closure occurs on beat 2 of the first and
third measures, and the lateral alveolar closure occurs on beat 1 of the second and fourth
measures. All the sounds in this beat pattern share the trait of being made with a tongue
body closure. Agreement like this in speech would likely be considered a type of local
harmony.
Figure 91. Drum tab of beat pattern 5.
b |x-----x-----x---|----x-------x---|x-----x-----x---|----x-------x---
dc |----x-----------|----------------|----x-----------|----------------
tll|----------------|x---------------|----------------|x---------------
SS |--------x-------|--------x-------|--------x-------|--------x-------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
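Drum tab rows like these encode a simple data structure: each character after the instrument label is one sixteenth-note slot (sixteen per measure), with "x" marking an onset, "-" silence, and "~" a sustain. As a purely illustrative sketch (Python; the function name and return format are hypothetical, not part of the dissertation's methods), the onset positions of a row can be recovered programmatically:

```python
# Hypothetical sketch: parse one drum-tab row into sixteenth-note onsets.
# Each measure holds 16 slots; "x" marks an onset, "-" silence, "~" sustain.

def parse_tab_row(row):
    """Return (label, [(measure, slot), ...]) for every onset in one row."""
    label, _, grid = row.partition("|")
    slots = [ch for ch in grid if ch != "|"]  # drop bar lines, keep slot order
    onsets = [(i // 16 + 1, i % 16) for i, ch in enumerate(slots) if ch == "x"]
    return label.strip(), onsets

row = "SS |--------x-------|--------x-------|--------x-------|--------x-------"
name, onsets = parse_tab_row(row)
print(name, onsets)  # each Spit Snare falls on slot 8 of its measure
```

Slot 8 of a sixteen-slot measure (four slots per beat) corresponds to beat 3, which is how the back-beat placement of the Spit Snare reads off the tab.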
3.1.1.2 Time series
The articulator movements for beat pattern 5 are illustrated in Figure 93 with four time
series: one for labial closures (LAB), one for alveolar closures (COR), one for tongue body
closures (DOR), and one for larynx height (LAR). The labial (LAB) time series includes the
gestures for the unforced Kick Drum {b} and the Spit Snare {SS}, while the coronal (COR)
time series features the gestures for the dental closure {dc} and the lateral alveolar closure
{tll}. The tongue body (DOR) time series shows that the tongue body stays raised
throughout the beat pattern: the tongue body starts from a lower position at the very
beginning of the beat pattern, represented by low pixel intensity (close to the bottom of the
y-axis), but it quickly moves upward at the beginning of the beat pattern to make a closure
(high pixel intensity, closer to the top of the y-axis) in time for the first unforced Kick Drum
{b}.
Figure 92. Regions for beat pattern 5. From top to bottom: the labial (LAB) region for the
unforced Kick Drum {b} and Spit Snare {SS}; the coronal (COR) region for the dental
closure {dc} and lateral alveolar closure {tll}; the dorsal (DOR) region to show tongue body
closure and the laryngeal (LAR) region to show lack of laryngeal activity.
Unforced Kick Drum and Spit Snare
Dental closure and lateral alveolar closure
Dorsal closure during Spit Snare and empty larynx region during unforced Kick Drum
Figure 93. Time series of vocal tract articulators used in beat pattern 5, captured using a
region of interest technique. From top to bottom, the time series show average pixel intensity
for labial (LAB), coronal (COR), dorsal (DOR), and laryngeal (LAR) regions.
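As a hypothetical illustration of the region-of-interest technique named in the caption (Python with NumPy; the function and toy data are assumptions, not the dissertation's actual analysis code), each image frame is reduced to the mean pixel intensity inside a region mask, and the resulting series is scaled to [0, 1] as described for Figure 94:

```python
import numpy as np

# Hypothetical sketch of the region-of-interest technique: each vocal tract
# region (LAB, COR, DOR, LAR) is a boolean mask over the image, and the
# time series is the mean pixel intensity inside that mask, one value per
# frame, scaled [0, 1] relative to the rest of the series.

def roi_time_series(frames, mask):
    """frames: (T, H, W) array of image frames; mask: (H, W) boolean region."""
    series = frames[:, mask].mean(axis=1)  # mean intensity per frame
    lo, hi = series.min(), series.max()
    return (series - lo) / (hi - lo)       # scale to [0, 1]

# Toy example: a bright blob (e.g., tongue tissue) enters the region.
frames = np.zeros((3, 4, 4))
frames[1, :2, :2] = 0.5
frames[2, :2, :2] = 1.0
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True
print(roi_time_series(frames, mask))  # rises from 0 toward 1
```

High values thus correspond to tissue filling the region (a closure), and low values to an open airway, which is how the DOR and LAR traces in these figures are read.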
The time series in Figure 93 capture the results of the alternation of forced Kick Drums to
unforced Kick Drums. As discussed in Chapter 5: Alternations, the default forced Kick
Drums are ejectives, which means the laryngeal time series of Kick Drums would show an
increase from low intensity to high intensity as a rising larynx enters the region of interest.
The alternative Kick Drum form, the unforced Kick Drum, is made in front of a tongue body
closure, so it is expected to exhibit activity in the dorsal time series. Tongue body closures are
not antithetical to laryngeal movement: they may occur at the same time, and often do for
dorsal ejectives in speech. Yet beat pattern 5 shows that the Kick Drums do have a tongue
body closure but do not have a laryngeal movement (Figure 94).
Figure 94. Upper left: Labial and laryngeal gestures for an ejective/forced Kick Drum at the
beginning of a beat pattern. Upper right: Labial gesture for a non-ejective/unforced Kick
Drum at the beginning of beat pattern 5. A larynx raising gesture occurs with the forced Kick
Drum, but not the unforced Kick Drum. (Pixel intensities for each time series were scaled
[0-1] relative to the other average intensity values in that region; the labial closure of the
forced Kick Drum looks smaller than the labial closure of the unforced Kick Drum because it
was scaled relative to other sounds in its beat pattern with even brighter pixel intensity
during labial closures. Both labial gestures in this figure are full closures.) Lower left: At the
time of maximum labial constriction for the ejective Kick Drum, the vocal folds are closed
(visible as tissue near the top of the trachea) and the airway above the larynx is open; the
velum is raised. Lower right: At the time of maximum labial constriction for the non-ejective
unforced Kick Drum, the vocal folds are open and the tongue body connects with a lowered
velum to make a velar closure.
Forced (ejective) Kick Drum Unforced (lingual) Kick Drum
Forced (ejective) Kick Drum Unforced (lingual) Kick Drum
From the perspective of aerodynamic mechanics, this is sensible: laryngeal movement behind
the tongue body closure has no effect on the size of the chamber between the lips and the
tongue body, so it makes no difference whether the larynx moves or not; better to save
energy and not move the larynx. From the perspective of beatboxing phonology, this
example is illuminating: if one assumes based on Chapter 5: Alternations that the forced Kick
Drum was selected for this beat pattern and undergoes an alternation into an unforced Kick
Drum, then the phonological model must provide not only a way to spread the tongue body
closure but also a way to get rid of the larynx raising. (Section 4 addresses this in more
detail.)
3.1.2 Analysis of beat pattern 5
Harmony patterns in speech are defined by articulations that spread from a single trigger
sound to other sounds nearby, causing them to undergo assimilation to that articulation. In
beat pattern 5, the Spit Snare is the origin of a lengthy tongue body closure gesture and other
sounds like the Kick Drum assimilate to that dorsal posture as well. The sounds
agree by sharing a tongue body closure, and in this sense they are harmonious.
3.1.2.1 Harmony undergoers
As established in Chapter 5: Alternations, the unforced Kick Drum is an alternation of the
Kick Drum that mostly appears in environments with surrounding dorsal closures. This was
implicitly characterized as local agreement: the unforced Kick Drum adopts a tongue body
closure when adjacent sounds also have a tongue body closure. Looking beyond the unforced
Kick Drum’s immediate environment, however, and considering the pervasive tongue body closure, the Kick Drum alternation in this beat pattern seems more aptly described as the result of tongue body harmony: the Kick Drum is not just accidentally
sandwiched between two dorsal sounds—all the sounds, nearby and not, have tongue body
closures. The unforced Kick Drum is a forced Kick Drum that undergoes tongue body
closure harmony.
3.1.2.2 Harmony trigger
Of all the sounds in a beat pattern, only the ones that are always produced with a tongue
body closure, even in isolation, could be triggers of harmony. Of the sounds in this particular
beat pattern, only the Spit Snare was ever performed in isolation or identified as a distinct
beatboxing sound by the beatboxer; as the only sound in this beat pattern known to require a
tongue body closure, the Spit Snare is therefore the most likely candidate for a harmony
trigger. In fact, the Spit Snare is associated with long tongue body closures in all the beat
patterns it appears in, and in most cases is the only sound in that pattern known to be
produced with a tongue body closure.
Assuming the Spit Snare is a harmony trigger, then the tongue body closure harmony
in this beat pattern extends bidirectionally: it is regressive from beat 2 of the first measure to
begin with the first unforced Kick Drum {b}, but also progressive from beat 4 of the last
measure to co-occur with the final unforced Kick Drum.
3.2 Beat pattern 9—Humming while beatboxing
Beat pattern 9 is an example of the “humming while beatboxing” described at the beginning
of this chapter. Section 3.2.1 describes this humming while beatboxing pattern with drum tab
notation and articulatory time series. The humming is intermittent in this particular beat
pattern, and there is no need to keep a tongue body closure when humming is not
active—yet as the time series shows, the tongue body closure persists for the entire beat
pattern, suggesting a sustained posture like the ones exhibited in speech harmony. This is
discussed in section 3.2.2 in terms of triggers (the Spit Snare) and undergoers (the
non-humming sounds). For reference, the sounds of this beat pattern are listed in Table 28.
Table 28. Sounds of beatboxing used in beat pattern 9.
Name BBX IPA Description
Unforced Kick Drum {b} [ʬ] Voiceless percussive bilabial stop
Dental closure {dc} [k͜ǀ] Voiceless percussive dental stop
Spit Snare {SS} [ʘ͡ɸ↑] Voiceless lingual egressive bilabial affricate
Linguolabial closure {tbc} [ʘ̺] Voiceless percussive linguolabial stop
Humming {hm} [C̬] Pulmonic egressive nasal voicing
3.2.1 Description of beat pattern 9
3.2.1.1 Drum tab
Beat pattern 9 showcases the strategy of humming {hm} while beatboxing (Figure 95). As in
beat pattern 5, the four supralaryngeal sounds in this beat pattern are the unforced Kick
Drum {b}, a Spit Snare {SS}, and two additional percussive closures—one dental {dc} and one
linguolabial {tbc}. The additional humming {hm} sound is a brief upward pitch sweep that
occurs on most beats. (If humming occurs with the first three Spit Snares, it is acoustically
occluded in the audio data of this beat pattern and therefore was not marked.)
Figure 95. Drum tab of beat pattern 9.
b |x-----x-----x---|--x---x-------x-|x-----x-----x---|--x---x---------
dc |--x-----------x-|----------------|--x-----------x-|------------x---
tbc|----x-----------|x---x-------x---|----x-----------|x---x-----------
SS |--------x-------|--------x-------|--------x-------|--------x-------
hm |x---x-------x---|x---x-------x---|x---x-------x---|x---x---x---x---
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
3.2.1.2 Time series
The time series were generated by the same regions used in section 3.1 (the Spit Snare
showcase). The DOR time series shows that the tongue body is raised consistently
throughout the beat pattern. Laryngeal activity on most major beats (LAR time series)
corresponds to voicing {hm}. There is also activity during the three Spit Snares that are not
marked for voicing in the drum tab; if this is voicing, it may not be apparent in the acoustic
signal due to some combination of the noise reduction method used in audio processing and
the high amplitude of the Spit Snare itself.
Figure 96. Time series and gestures of beat pattern 9.
3.2.2 Analysis of beat pattern 9
The main point of note in this beat pattern is that the larynx is not necessarily inactive
during tongue body closure harmony. The description of beat pattern 5 in section 3.1 noted
that when forced Kick Drums undergo tongue body closure harmony, their unforced
alternants do not have a larynx raising gesture. A phonological model needs to be able to
“turn off” the larynx movement of the forced Kick Drums to generate the observed unforced
Kick Drums. But as beat pattern 9 shows, a blanket ban on laryngeal activity during tongue
body closure harmony would not be an appropriate choice for the phonological model
because the vocal folds can still phonate.
The musical structures of beat patterns 5 and 9 are different in sounds and rhythms,
but the rest of the analysis is essentially the same. Once again, the tongue body closure that
persists throughout the beat pattern is most likely to be associated with the Spit Snares: none
of the other sounds in this beat pattern were produced in isolation by the beatboxer, which
suggests that they are tongue-body alternations of sounds without tongue body gestures (like
the Unforced Kick Drum is an alternation of the Kick Drum) or sounds that are
phonotactically constrained to only occur in the context of a sound with a tongue body
closure—in either case, not independent instigators of a sustained tongue body closure.
Again, the harmony would be bidirectional, spreading leftward to the first sounds of the beat
pattern and rightward until the end.
3.3 Beat pattern 4—Liproll showcase
Beat pattern 4 showcases the Liproll {LR}. The Liproll triggers tongue body harmony just
like the Spit Snare did in the previous examples; but unlike the Spit Snare, the tongue body
constriction location changes dramatically during the Liproll’s
production—from the front of the palate all the way to the uvula in one smooth glide.
Tongue body closure harmony is maintained during the Liproll because the constriction
degree of the tongue body stays at a constant closure. When the Liproll is not being
produced, the tongue body adopts a static velar closure. Section 3.3.1 presents the beat
pattern in drum tab and time series forms, and section 3.3.2 analyzes the pattern in terms of a
tongue body harmony trigger (the Liproll) and undergoers (everything else).
Table 29. Sounds of beatboxing used in beat pattern 4.
Name BBX IPA Description
Liproll {LR} [ʙ̥↓] Voiceless lingual ingressive bilabial trill
Alveolar closure {ac} [k͜ǃ] Voiceless percussive alveolar stop
Unforced Kick Drum {b} [ʬ] Voiceless percussive bilabial stop
Linguolabial closure {tbc} [ʘ̺] Voiceless percussive linguolabial stop
Labiodental closure {pf} [ʘ̪] Voiceless percussive labiodental stop
Dental closure {dc} [k͜ǀ] Voiceless percussive dental stop
3.3.1 Description of beat pattern 4
3.3.1.1 Drum tab
Beat pattern 4 (Figure 97; split into two parts) is composed of six sounds: the unforced Kick
Drum {b}, the Liproll {LR}, and percussive alveolar {ac}, dental {dc}, labiodental {pf}, and
linguolabial {tbc} closures. The onsets of Liprolls are metrically synchronous with unforced
Kick Drums as represented by the “x” symbols, though the time series shows that they are
not simultaneous—a Kick Drum is made first and a Liproll follows quickly thereafter. The “~”
symbol signifies that the labial trill of the Liproll is extended across multiple beats. The
labiodental closure {pf} serves the role of the snare by occurring consistently and exclusively
on beat 3 of each measure; since it was never produced in isolation by the beatboxer, the {pf}
is analyzed as an alternant of the glottalic egressive {PF} snare.
Figure 97. Drum tab notation for beat pattern 4.
b |x-----x-----x---|--x---x-----x---|x-----x-----x---|--x-------x---x-
ac |----------x-----|----------x-----|----------x-----|--------x-------
dc |----------------|----x-----------|----------------|------------x---
tbc|----------------|----------------|----------------|----x-----------
pf |--------x-------|--------x-------|--------x-------|------x---------
LR |x~~~~~------x~~~|~~----------x~~~|x~~~~~------x~~~|~~--------------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
b |x---x-------x---|x---x-------x---|x---x-------x---|x---x-----------
ac |----------x-----|----------x-----|----------x-----|----------------
dc |----------------|--------------x-|----------------|----------------
tbc|----------------|----------------|----------------|----------------
pf |--------x-------|--------x-------|--------x-------|--------x-------
LR |x~~~x~~~----x~~~|x~~~x~~~--------|x~~~x~~~----x~~~|x~~~x~~~--------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
3.3.1.2 Time series
The time series representation for beat pattern 4 (Figure 99) follows five time series. The first
three (LAB, LAB2, and FRONT) have movements relevant to the production of sounds in
this pattern. Labial closures of the unforced Kick Drum {b} and labiodental closure {pf} are
in the LAB time series; labial closures during which the lips are pulled inward over the teeth
for the Liproll {LR} are in LAB2; and the anterior region of the vocal tract into which the
tongue shifts forward at the beginning of a Liproll is represented by FRONT. (A coronal time
series for the alveolar, dental, and linguolabial closures is not included.) The dorsal DOR and
laryngeal LAR time series are included to show the consistently high tongue body posture
and the lack of laryngeal activity, respectively.
Figure 98. Regions used to make time series for the Liproll beat pattern.
Unforced Kick Drum (left) and labiodental closure (right) in LAB region.
Liproll retraction of lower lip over the teeth into LAB2 region.
Liproll tongue body in (left) and out of (right) the FRONT region
The tongue body makes a closure with the velum in the DOR region during the labiodental
closure (left) and there is no laryngeal activity in the LAR region (right).
Figure 99. Time series of the beat pattern 4 (Liproll showcase).
3.3.2 Analysis of beat pattern 4
The Liproll triggers tongue body closure harmony in beat pattern 4, causing both Kick
Drums and PF Snares to be produced with tongue body closures instead of glottalic egressive
airflow. Figure 98 shows snapshots of the different positions of the tongue body during this
beat pattern: the tongue body adopts a resting position closed against the velum during most
sounds but shifts forward and backward (right image) to create the Liproll.
3.4 Beat pattern 10—Freestyle beat pattern
Beat pattern 10 is a freestyle beat pattern not intended to showcase any particular sound. The
Spit Snare is once again a harmony trigger as it was in beat patterns 5 and 9, but here the
harmony does not spread throughout the whole beat pattern as it did in those earlier ones. In
the first six measures of the beat pattern, tongue body closures triggered by a Spit Snare do
not extend through the Inward Liproll or High Tongue Bass. These two pulmonic sounds are
analyzed as harmony blockers.
Table 30. Sounds of beatboxing used in beat pattern 10.
Name BBX IPA Description
Inward Liproll {^LR} [ʙ̥↓] Voiceless pulmonic ingressive bilabial trill
Kick Drum {B} [p’] Voiceless glottalic egressive bilabial stop
Inward K Snare {^K} [k͡ʟ̝̊↓] Voiceless pulmonic ingressive lateral velar affricate
Spit Snare {SS} [ʘ͡ɸ↑] Voiceless lingual egressive bilabial affricate
Dental-alveolar closure {dac} Voiceless percussive laminal dental stop
Unforced Kick Drum {b} [ʬ] Voiceless percussive bilabial stop
Linguolabial closure {tbc} [ʘ̺] Voiceless percussive linguolabial stop
High Tongue Bass {HTB} [r] Voiced pulmonic egressive alveolar trill (high pitch)
Dental closure {dc} [k͜ǀ] Voiceless percussive dental stop
3.4.1 Description of beat pattern 10
3.4.1.1 Drum tab
The Spit Snare {SS} occurs on beat three of each measure of this beat pattern. In measures 2,
4, 6, and 7-8 the Spit Snare follows a linguolabial closure {tbc} and unforced Kick Drum {b},
indicating that some harmony is occurring. In the same measures, however, there are also
forced Kick Drums and High Tongue Basses that did not undergo harmony. And in measures
1, 3, and 5 the Spit Snare is the only tongue body closure sound around. Only in the final two
measures does the pattern return to a sequence of tongue body closure sounds.
Figure 100. Drum tab for beat pattern 10.
B |x-----x-----x---|--x-------------|x-----x-----x---|--x-------------
^LR|x~~~~~------x~~~|~~--------------|x~~~~~------x~~~|~~--------------
^K |----------------|----------------|----------------|----------------
SS |--------x-------|--------x-------|--------x-------|--------x-------
tbc|----------------|----x-----------|----------------|----x-----------
HTB|----------------|------------x~~~|----------------|------------x~~~
b |----------------|------x---------|----------------|------x---------
dc |----------------|----------------|----------------|----------------
dac|----------------|----------------|----------------|----------------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
B |x-----x-----x---|--x-------------|x---------------|----------------
^LR|x~~~~~------x~~~|~~--------------|----------------|----------------
^K |----------------|----------------|----------------|------------x---
SS |--------x-------|--------x-------|--------x-------|--------x-------
tbc|----------------|----x-----------|----------------|----x-----------
HTB|----------------|------------x~~~|----------------|----------------
b |----------------|------x---------|------x-----x---|--x---x---------
dc |----------------|----------------|--x-----------x-|----------------
dac|----------------|----------------|----x-----------|x---------------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
3.4.1.2 Time series
Nine beatboxing sounds manifest along six time series. The forced Kick Drum {B}, unforced
Kick Drum {b}, and Spit Snare {SS} all go on the labial closure LAB time series. The Inward
Liproll {^LR} goes on the LAB2 time series, which responds to pixel intensity when the lower
lip retracts over the bottom teeth. (It also responds to tongue tip movement in the same
pixels, but there are no meaningful movements highlighted in that case.) The High Tongue
Bass {HTB}, linguolabial closure {tbc}, and dental closure {dc} are in the COR tongue tip
time series. The Inward K Snare {^K} goes on the DOR region, and the LAR region has the
laryngeal movements for the forced Kick Drum. The dental-alveolar closure {dac} was
captured in a separate region that is not pictured. Black boxes surround movements that
were partially or completely manually corrected.
Most of the Kick Drums near the Inward Liproll and High Tongue Bass are marked as
forced because laryngeal closure was apparent when visually inspecting the image frames of
those sounds. A forced Kick Drum was also observed in the production of the Inward Liproll
in isolation. But in this beat pattern, the laryngeal activity during most forced Kick Drums is
minimal. In some instances the laryngeal region brightens for a moment and then darkens
again with no apparent vertical movement. Unusually high pixel brightness near the lips and
tongue tip may drown out the details of whatever laryngeal closure/raising there may be. At
other times, there is clear vertical laryngeal movement during a subsequent Spit Snare; Spit
Snares after forced Kick Drums co-occur with larynx raising, while Spit Snares after unforced
Kick Drums do not.
The relationship between sounds in beatboxing clusters—like the Kick Drums and
Inward Liprolls organized to the same beat—is unknown territory for beatboxing science, so
it is not clear how those Kick Drums should be expected to manifest. For this analysis, the
presence of any laryngeal closure at all during these Kick Drums is taken as indication that
they are forced, and the lack of noticeable vertical movement is attributed to undershoot (not
enough time for noticeable movement). Laryngeal movements marked on the time series
correspond to visual observations of laryngeal activity. At the very least, the Kick Drums just
before linguolabial closures {tbc} have clear laryngeal closure/raising.
As shown in the DOR time series, the tongue body is sometimes raised into an
extended closure and sometimes not. The tongue body is elevated overall because the DOR
region has at least some brightness at all times except during the Inward K Snare {^K} when
the tongue body completely leaves the region. The aperture of tongue body constriction
206
widens during most Inward Liprolls and High Tongue Basses, then decreases again as the
tongue body moves back into its closure before and after Spit Snares.
Figure 101. The regions used to make the time series for beat pattern 10.
Forced Kick Drum (left), unforced Kick Drum (center), and Spit Snare (right) in LAB region.
Inward Liproll in LAB2 region.
High Tongue Bass (left) and linguolabial closure (right) in COR region.
Inward K Snare (left) outside of the DOR region and (local) maximum larynx height during
a forced Kick Drum in the LAR region (right).
Figure 102. Time series of beat pattern 10.
3.4.2 Analysis of beat pattern 10
The domain of the Spit Snare’s harmony extends bidirectionally up to an Inward Liproll
{^LR} or High Tongue Bass {HTB}, then halts. As non-nasal pulmonic sounds, the Inward
Liproll and High Tongue Bass cannot be made with a tongue body closure because a tongue
body closure would prevent the pulmonic airflow from passing over the relevant oral
constriction. In speech harmony, sounds with this kind of physical antagonism to harmony
that also seem to stop the spread of harmony are generally analyzed as harmony blockers.
Alternatively, some sounds are analyzed as transparent to harmony, meaning they do not
prevent harmony from spreading but they also do not undergo a qualitative harmonious shift
either. It could be that the Inward Liproll and High Tongue Bass are transparent—tongue
body closure harmony continues through them, but the need for pulmonic airflow
temporarily trumps the tongue body closure.
The blocking analysis works slightly better here because of the presence of forced
Kick Drums. As we have seen in every other beat pattern so far, tongue body closure
harmony seems to trigger a qualitative shift in which forced Kick Drums become unforced,
losing their laryngeal closure/raising gestures and gaining a tongue body closure. Here
however there are some forced Kick Drums near pulmonic sounds. If harmony were not
blocked, then the Kick Drums should undergo harmony; since they do not, either they
are exceptional Kick Drums that are intrinsically resistant to harmony or they are defended
from harmony by other sounds that block harmony.⁵ There is no other reason to think that any Kick Drums should be exceptional compared to others. A phonological analysis with unexplained exceptionality is less appealing than an analysis that explains everything, so blocking is the preferred analysis over transparency here. The beat pattern in section 3.5 reinforces the blocking analysis.

⁵ This would be a problem in a traditional phonological analysis that treats sounds as sequential symbol strings. Consider the sequence {... ^LR B tbc b SS HTB …} in which tongue body harmony has spread regressively from the Spit Snare {SS} to its undergoers. In this format, blocking from the Inward Liproll must “jump” over the forced Kick Drum to stop the harmony from affecting the forced Kick Drum and making *{... ^LR b tbc b SS HTB …}. In theories where sounds do exist in time and can overlap, however, this is not as big an issue. If those Kick Drums are sufficiently temporally proximal to the blockers—and indeed many of the Kick Drums in this beat pattern partially overlap with the pulmonic sounds—then the harmonizing tongue body closure may simply not have yet been unblocked.
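The trigger/undergoer/blocker logic developed across these beat patterns can be summarized procedurally. The sketch below (Python; purely hypothetical, and deliberately sequential even though, as footnote 5 notes, real beatboxing sounds overlap in time) spreads a trigger’s tongue body closure bidirectionally until a blocker is reached:

```python
# Hypothetical sketch of bidirectional harmony spreading with blocking.
# Sounds are treated as a sequence of symbols; triggers spread a tongue
# body closure leftward (regressive) and rightward (progressive) until
# a pulmonic blocker is reached. This sequential view is a deliberate
# simplification of the overlapping gestures described in the text.

TRIGGERS = {"SS", "CR", "LR"}    # Spit Snare, Clickroll, Liproll
BLOCKERS = {"^LR", "HTB", "^K"}  # pulmonic sounds that block harmony

def harmonized(pattern):
    """Return the set of indices covered by tongue body closure harmony."""
    covered = set()
    for i, sound in enumerate(pattern):
        if sound not in TRIGGERS:
            continue
        covered.add(i)
        for step in (-1, 1):  # spread in both directions from the trigger
            j = i + step
            while 0 <= j < len(pattern) and pattern[j] not in BLOCKERS:
                covered.add(j)
                j += step
    return covered

# Harmony from the Spit Snare reaches the percussives beside it but
# stops at the Inward Liproll and High Tongue Bass on either side.
pattern = ["^LR", "tbc", "b", "SS", "b", "HTB"]
print(sorted(harmonized(pattern)))  # → [1, 2, 3, 4]
```

On this simplification the blockers themselves never join the harmony span, matching the analysis that the pulmonic sounds halt spreading rather than undergo it.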
3.5 Beat pattern 1—Clickroll showcase
Beat pattern 1 is a Clickroll {CR} showcase beat pattern. Section 3.5.1 presents the beat
pattern in drum tab and time series forms, illustrating an example of tongue body harmony
that is periodically interrupted by Inward K Snares. Section 3.5.2 analyzes the pattern in
terms of a tongue body harmony trigger (the Clickroll), undergoers (the unforced Kick
Drum and dental closure), and a blocker (Inward K Snare).
Table 31. Sounds of beatboxing used in beat pattern 1.
Name BBX IPA Description
Clickroll {CR} [*] Voiceless lingual egressive alveolar trill
Kick Drum {B} [p’] Voiceless glottalic egressive bilabial stop
Unforced Kick Drum {b} [ʬ] Voiceless percussive bilabial stop
Inward K Snare {^K} [k͡ʟ̝̊↓] Voiceless pulmonic ingressive lateral velar affricate
Closed Hi-Hat {t} [t’] Voiceless glottalic egressive alveolar stop
Dental closure {dc} [k͜ǀ] Voiceless percussive dental stop
3.5.1 Description of beat pattern 1
3.5.1.1 Drum tab
Beat pattern 1 (Figure 103) is composed of six sounds: the unforced and forced Kick Drums
{b} and {B}, Closed Hi-Hat {t}, dental closure {dc}, Inward K Snare {^K}, and Clickroll {CR}.
The Kick Drums follow a two-measure pattern of occurrence—beats 1, 2+, and 4 of the first
measure, then the “and”s of each beat in the second measure. The pattern repeats in the latter
half of the beat pattern except that the final Kick Drum is replaced by an Inward K Snare.
Inward K Snares additionally appear on beat 3 of each measure. Clickrolls in this beat
pattern are always co-produced on the same beat as an unforced Kick Drum, though the
reverse is not true (i.e., an unforced Kick Drum at the end of the second measure is not
co-produced with a Clickroll). The dental closure also follows a two-measure pattern with
occurrences on the 2 and 3+ of the first measure and beats 1, 2, and 4 of the second measure;
this pattern repeats in the latter half of the beat pattern, but a Closed Hi-Hat occurs where
the last dental closure is expected.
Figure 103. Drum tab notation for beat pattern 1.
b |x-----------x---|--x-----------x-|x-----------x---|--x-------------
B |------x---------|------x---x-----|------x---------|------x---x-----
t |----------------|----------------|----------------|------------x---
dc|----x-----x-----|x---x-------x---|----x-----x-----|x---x-----------
^K|--------x-------|--------x-------|--------x-------|--------x-----x-
CR|x~~~--------x~~~|--x~------------|x~~~--------x~~~|--x~------------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
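As an illustrative aside (this parser is a hypothetical sketch, not part of the dissertation), the drum tab notation in Figure 103 can be read programmatically: each character after the sound label is one sixteenth-note slot, with `x` marking an onset, `~` marking the continuation of a roll, and `-` marking silence.

```python
# Hypothetical parser for the Figure 103 drum tab. Measures are separated
# by "|"; each measure holds sixteen slots (four beats of four sixteenths).
tab = {
    "b":  "x-----------x---|--x-----------x-|x-----------x---|--x-------------",
    "B":  "------x---------|------x---x-----|------x---------|------x---x-----",
    "t":  "----------------|----------------|----------------|------------x---",
    "dc": "----x-----x-----|x---x-------x---|----x-----x-----|x---x-----------",
    "^K": "--------x-------|--------x-------|--------x-------|--------x-----x-",
    "CR": "x~~~--------x~~~|--x~------------|x~~~--------x~~~|--x~------------",
}

def onsets(line):
    """Return (measure, sixteenth-slot) pairs for each 'x' onset in a line."""
    out = []
    for m, measure in enumerate(line.split("|"), start=1):
        for i, ch in enumerate(measure):
            if ch == "x":
                out.append((m, i))
    return out

# The Inward K Snare falls on beat 3 (slot 8) of every measure,
# plus the extra snare near the end of measure 4:
assert onsets(tab["^K"]) == [(1, 8), (2, 8), (3, 8), (4, 8), (4, 14)]
```

Reading the tab this way makes the two-measure periodicity described above easy to verify mechanically, e.g. the single Closed Hi-Hat surfaces only in the last measure.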
3.5.1.2 Time series
The time series representation for beat pattern 1 (Figure 105) comprises five distinct time series:
labial closures (LAB), alveolar closures (COR), dorsal closures (DOR), velum position
(VEL), and larynx height (LAR). Note that in this beat pattern, the dental closure is usually
the release of a coronal closure caused by a Clickroll or Inward K Snare and does not have its
own closing action. The DOR time series illustrates that the tongue body is raised near the
velum throughout beat pattern 1 except during the Inward K Snare and after the penultimate
Inward K Snare. Figure 104 shows that even what appears to be tongue body lowering during
the Inward K Snares is actually tongue fronting; whether for the Inward K Snare or for
harmony, the tongue body is elevated until close to the end.
Surprisingly, there are several forced Kick Drums in the beat pattern despite the
consistently raised tongue body posture. Tongue body closure and larynx raising are not
physically impossible to produce together, but every example thus far has shown that tongue
body closures cancel larynx closure/raising gestures during harmony. Here, forced Kick
Drums before Inward K Snares are produced with both laryngeal raising and a raised tongue
body. The velum (VEL) time series shows how a Kick Drum can be forced even when the
tongue body is high. In this beat pattern, persistent tongue body closures are made by the
tongue body and velum coming together; among other things, this allows the beatboxer to
breathe through the nose while simultaneously using the mouth to create sound. During the
forced Kick Drums, harmony ends not by lowering the tongue body but rather by raising the
velum in preparation for the pulmonic ingressive Inward K Snare (Figure 105). This directs
laryngeal air pressure manipulations through the air channel now open over the tongue.
The resulting Kick Drums therefore have larynx raising without tongue body closure, giving
them the form typically expected for Kick Drums. The last forced Kick Drum differs from
the rest: it is the only one for which the tongue body does not appear to be raised toward the
velum—nor does it appear to be making any particular constriction at all (Figure 107).
Figure 104. Regions for beat pattern 1 (Clickroll showcase).
Labial (LAB) closures for forced Kick Drum (left) and unforced Kick Drum (right).
Tongue tip (COR) closures for the Clickroll (left), dental closure (center), and Closed
Hi-Hat (right).
The tongue body is out of the DOR region during the Inward K Snare.
Larynx (LAR) region filled at the MAXC of the larynx raising associated with a forced Kick
Drum (left). The right image was taken from the PVEL2 of the tongue tip release (COR time
series).
Figure 105. Time series of beat pattern 1.
Figure 106. The DOR region for the Clickroll showcase (beat pattern 1) in the first {CR dc B
^K}. Left: The tongue body is raised and the velum is lowered during the Clickroll {CR},
leaving no air channel over the tongue body; pixel intensity in the region is high. Center: The
tongue body is raised during a forced Kick Drum, but the velum is also raised so there is a
gap between the tongue body and the velum through which air can pass; pixel intensity in
the region is high. Right: The tongue body is shifted forward during the lateral release of an
Inward K Snare; pixel intensity in the region is low.
Figure 107. Each forced Kick Drum in the beat pattern in order of occurrence. The image was
taken from the frame of the LAB region’s peak velocity (change in pixel intensity)
corresponding to PVEL2 as described in Chapter 2: Method. The final Kick Drum (far right)
signals that harmony has ended because the tongue body is not making a narrow velar
constriction.
Figure 108. Upper left: Labial and laryngeal gestures for an ejective/forced Kick Drum before
an Inward K Snare in beat pattern 1. Upper right: Labial gesture for a non-ejective/unforced
Kick Drum in beat pattern 1. Lower left: Near the time of maximum labial constriction for
the ejective Kick Drum, the vocal folds are closed (visible as tissue near the top of the
trachea) and the airway above the larynx is open, including a narrow passage over the tongue
body which is raised but not making a closure with the velum; the velum is raised. Lower
right: At the time of maximum labial constriction for the non-ejective unforced Kick Drum,
the vocal folds are open and the tongue body connects with a lowered velum to make a velar
closure.
Forced (ejective) Kick Drum Unforced (lingual) Kick Drum
3.5.2 Analysis of beat pattern 1
3.5.2.1 Harmony trigger
Harmony was readily apparent in beat patterns 5 (section 3.1) and 9 (section 3.2) in the form
of a spreading tongue body closure triggered by a Spit Snare, and in beat pattern 4 (section
3.3) from a Liproll trigger. In those beat patterns, the tongue body constriction degree is
consistent throughout—every sound is made with a tongue body closure. Beat pattern 1 is
more challenging to analyze as harmony because the tongue body closure is frequently
interrupted, making it relatively more difficult to spot prolonged tongue body closures or to
know what sounds might have triggered them. For this beat pattern the Clickroll is the only
sound made with a tongue body closure when performed in isolation, making it the most
likely trigger.
3.5.2.2 Harmony undergoers
Chapter 5: Alternations used metrical patterning to motivate the analysis that the unforced
Kick Drum is a Kick Drum alternation that occurs in dorsal environments. This beat pattern
provides evidence that the dental closure {dc} may be an alternation of the Closed Hi-Hat.
The drum tab in Figure 103 shows that the Closed Hi-Hat appears at the end of the
performance in precisely the metrical position where a dental closure is expected. This would
make three clear harmony undergoers from the beat patterns analyzed so far: the forced Kick
Drum to the unforced Kick Drum, the PF Snare to a labiodental closure, and the Closed
Hi-Hat to a dental closure.
But given the frequent tongue body closure interruptions and use of several forced
Kick Drums, it is not clear whether the unforced Kick Drums and dental closure alternants
should be considered the result of harmony or more simply a consequence of local
assimilation. All the unforced Kick Drums in this beat pattern save one are produced in the
same metrical position as a Clickroll; the tongue body must be raised in anticipation of the
Clickroll, so the Kick Drum on the same beat as a Clickroll must be co-produced with a
tongue body closure—there is not enough time between the release of the labial closure and
the onset of trilling to raise the tongue body and create the necessary pocket of air between
the tongue body and the tongue tip. Likewise, the starting conditions for most dental
closures are the result of tongue closures for a Clickroll or Inward K Snare. Dental closures
may occur as alternants of Closed Hi-Hats in this environment simply because they are
mechanically advantageous given the current state of the vocal tract, not because of a
harmonizing tongue body closure.
Two aspects of the data suggest that this is a harmony pattern. First, the absence of
laryngeal closure/raising gestures: if the Kick Drums simply became percussives because of a
concurrent but non-inhibiting tongue body closure gesture, there should still be larynx
raising—which there is not. Second, there is the sequence {... ^K B dc b | b-CR ...}
which begins from beat 3 of the second measure (the pipe character | indicates the divide
between measures 2 and 3, and the hyphen in {b-CR} indicates that the sounds are made on
the same beat). This sequence features a dental closure {dc} and unforced Kick Drum {b}
(both underlined) that are made without an adjacent Clickroll or even an adjacent Inward K
Snare—that is, without any sounds nearby that require a tongue body closure. If the
alternations from forced Kick Drum to unforced Kick Drum and from Closed Hi-Hat to
dental closure in this beat pattern were due only to coproduction, then these particular
dental closure and unforced Kick Drum should have been ejective Closed Hi-Hat and forced
Kick Drum instead. The presence of tongue body closure here, despite there being no
immediate coarticulatory origin for it, indicates harmony. Extrapolating this to the rest of the
beat pattern, the unforced Kick Drums and dental closures in this pattern can be described
as the result of the same bidirectional tongue body closure harmony that appeared in beat
patterns 5, 9, and 4.
3.5.2.3 Harmony blocker
As noted earlier, there are forced Kick Drums here which by definition are produced without
a tongue body closure; there are also Inward K Snares which move the tongue body closure
to a different location, lateralize it, and bring air flowing pulmonic ingressively through the
mouth and over the tongue. Neither sound is participating in harmony, either as a trigger or
as an undergoer. Section 3.4 suggested that pulmonic sounds like the Inward Liproll and
High Tongue Bass are harmony blockers that defend the Kick Drums from harmonizing too;
if so, then the pulmonic ingressive Inward K Snare can also be analyzed the same way. Just as
in section 3.4, these forced Kick Drums are close enough temporally to the Inward K Snare
that they can also benefit from the blocking of the tongue body harmony.
The last measure of this beat pattern provides perhaps the clearest demonstration
that harmony is blocked by the Inward K Snare. Figure 107 illustrates that all but the last
forced Kick Drum are co-produced with a tongue body constriction. Suspiciously, all but the
last forced Kick Drum also fall somewhere between a Clickroll and an Inward K Snare,
precisely where harmony is predicted to be trying to spread, whereas the last forced Kick
Drum has no Clickroll in its vicinity. The penultimate Inward K Snare blocks harmony for
the last time and so all following sounds are made without influence from a tongue body
closure. This notably includes the Closed Hi-Hat, which has so far never appeared in a
harmony pattern but occurs frequently outside of harmony (see Chapter 3: Sounds); its
appearance in this beat pattern is another indicator that harmony has ended.
4. Theoretical accounts and implications
The introduction of this chapter posed two main questions. First, descriptively, does
beatboxing exhibit signature properties of phonological harmony like triggers, undergoers,
and blockers? And second, what can be concluded about beatboxing cognition from the
description of beatboxing harmony? With respect to the first question, section 3 found that
there are indeed beat patterns with sustained tongue body closure that can be described as
bidirectional harmony. Those patterns include sounds associated with those tongue body
closures that act as triggers, sounds that undergo qualitative change because of the harmony,
and sounds that block the spread of harmony. The remainder of this chapter addresses the
second question about the implications for beatboxing cognition in two parts: the evidence
for cognitive sub-segmental units (section 4.1), and a discussion of how beatboxing harmony
might be implemented in gestural and featural frameworks (section 4.2).
4.1 Evidence for cognitive sub-segmental beatboxing units
It is hopefully uncontroversial that beatboxing sounds are (or have) cognitive
representations. Beatboxers learn categories of sounds and overtly or covertly organize them
by their musical role; they can also name many of the sounds they can produce, and likewise
produce a sound they know when prompted with its name. All of this knowledge is necessary
and inevitable for skilled beatboxers. Less clear is the nature and composition of those
representations. The question at hand is whether there is evidence for cognitive units
different from whole beatboxing sounds (sub-segmental units), like gestures.
Chapter 3: Sounds characterized beatboxing sounds along a relatively small set of
phonetic dimensions, but cautioned that finding observable dimensions does not imply the
cognitive reality of those dimensions. The atoms of speech—units the size of features or
gestures—are argued to be cognitive because of many years of observational data and more
recent (40-50) years of experimental data showing that sounds pattern along these phonetic
dimensions. In almost all cases the patterns occur for sounds of a particular “natural” class,
which is to say that the sounds involved share one or more phonetic properties.
If there is any cognitive reality to the phonetic dimensions of beatboxing sounds, then
beatboxing sounds belonging to a given class defined by one or more phonetic dimensions
should share a certain pattern of behavior. Beatboxing harmony provides a window through
which to assess the possibility of sub-segmental cognitive beatboxing units. Triggers,
undergoers, and blockers have complementary behavior in harmony; if they also have
complementary phonetic dimensions relevant to the harmony, then those dimensions will
satisfy the criteria above for being cognitively real.
Table 26 (reprinted). The beatboxing sounds involved in harmony organized by their
harmony role.
Name BBX IPA Description
Triggers
Spit Snare {SS} [ʘ͡ɸ↑] Voiceless lingual egressive bilabial affricate
Clickroll {CR} [*] Voiceless lingual egressive alveolar trill
Liproll {LR} [ʙ̥↓] Voiceless lingual ingressive bilabial trill
Blockers
Inward Liproll {^LR} [ʙ̥↓] Voiceless pulmonic ingressive bilabial trill
High Tongue Bass {HTB} [r] Voiced pulmonic egressive alveolar trill (high pitch)
Inward K Snare {^K} [k͡ʟ̝̊↓] Voiceless pulmonic ingressive lateral velar affricate
Undergoers (alternants of other sounds)
Unforced Kick Drum {b} [ʬ] Voiceless percussive bilabial stop
Labiodental closure {pf} [ʘ̪] Voiceless percussive labiodental stop
Dental closure {dc} [k͜ǀ] Voiceless percussive dental stop
Other
Kick Drum {B} [p’] Voiceless glottalic egressive bilabial stop
Closed Hi-Hat {t} [t’] Voiceless glottalic egressive alveolar stop
Humming {hm} [C̬] Pulmonic egressive nasal voicing
Linguolabial closure {tbc} [ʘ̺] Voiceless percussive linguolabial stop
Dental-alveolar closure {dac} Voiceless percussive laminal dental stop
Alveolar closure {ac} [k͜ǃ] Voiceless percussive alveolar stop
Lateral alveolar closure {tll} [ǁ] Voiceless percussive lateral alveolar stop
Table 26 (reprinted) lists the sounds that participate in the five analyzed beat patterns with
harmony according to their function in the harmony pattern. The sounds in the “other”
group are sounds which were either prevented from undergoing harmony by nearby blocking
sounds (the forced Kick Drum and Closed Hi-Hat) or for which there is not sufficient
evidence to say what their role is (humming, and some percussives). Within each group, the
sounds do not belong to the same musical category (e.g., snare, kick, roll) and do not have the
same primary constrictors. Though the undergoers all happen to be made with compressed
oral closures (i.e., as stops), neither the triggers nor the blockers pattern by constriction
degree within their groups. The only phonetic dimension along which all three groups
pattern complementarily is their general airstream type: triggers have a lingual airstream,
undergoers are percussives, and blockers have pulmonic airstream.
As discussed in section 3 and in Chapter 5: Alternations, the percussive undergoers
were never identified by this beatboxer as distinctive sounds, are restricted to occurring near
other sounds with tongue body closures, and pattern metrically like their glottalic egressive
counterparts {B}, {PF} (glottalic egressive labiodental affricate), and {t}. (The four coronal
percussives in the “Other” group in Table 26 may also be alternants of a coronal sound like
the Closed Hi-Hat {t}, but there is not enough metrical data to be sure.) Based on this, the
sounds that undergo harmony are likely intended to be glottalic sounds but because of the
harmony are produced with a tongue body closure and without a laryngeal gesture.
Re-phrasing the airstream conclusion from the previous paragraph: triggers have lingual
airstream, undergoers shift from glottalic airstream to percussive, and blockers have
pulmonic airstream.
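The role-by-airstream generalization can be restated as a small consistency check. The snippet below is an illustrative sketch using a hypothetical encoding of the Table 26 sounds (it is not analysis code from the dissertation): each sound is tagged with its general airstream type and harmony role, and the check confirms that each airstream type maps to exactly one role.

```python
# Hypothetical encoding of Table 26: general airstream type and harmony role.
# (The percussive undergoers are intended glottalic sounds realized as
# percussives under harmony, per the discussion above.)
sounds = {
    "SS":  ("lingual",    "trigger"),
    "CR":  ("lingual",    "trigger"),
    "LR":  ("lingual",    "trigger"),
    "^LR": ("pulmonic",   "blocker"),
    "HTB": ("pulmonic",   "blocker"),
    "^K":  ("pulmonic",   "blocker"),
    "b":   ("percussive", "undergoer"),
    "pf":  ("percussive", "undergoer"),
    "dc":  ("percussive", "undergoer"),
}

# Group harmony roles by airstream type.
role_by_airstream = {}
for name, (airstream, role) in sounds.items():
    role_by_airstream.setdefault(airstream, set()).add(role)

# The three groups pattern complementarily along the airstream dimension:
# every airstream type corresponds to exactly one harmony role.
assert all(len(roles) == 1 for roles in role_by_airstream.values())
```

The check would fail if, say, some lingual sound acted as a blocker, which is exactly the kind of counterexample the natural-class argument predicts should not occur.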
An equivalent way to characterize the pattern is that the triggers are all composed of
a tongue body closure gesture and another more anterior constriction whereas the rest of the
sounds do not have tongue body closures—and in the case of the Inward K Snare, do not
have an additional simultaneous anterior constriction. Pulmonic sounds, the blockers, may
override tongue body closure harmony because they fulfill both musical and homeostatic
roles (keeping the beatboxer alive long enough to finish their performance). The remaining
sounds, which happen to be glottalic, would not benefit homeostatically from blocking the
spread of the tongue body closure (since they do not afford breathing in any case due to
their usual glottal closure) and in undergoing the harmony they lose their laryngeal raising
since it is rendered inert with respect to pressure regulation by the tongue body closure.
The criteria for a phonetic dimension to be counted as a cognitive dimension were
that there must 1) be a class of sounds sharing that dimension which 2) collectively
participate in some behavior. Not only do the trigger sounds analyzed in these five beat
patterns all share the lingual airstream dimension, but so also do the showcased sounds in
the beat patterns not analyzed above—the Clop, Duck Meow SFX, Water Drop Air, and
Water Drop Tongue are all either lingual egressive or lingual ingressive and are the most
likely candidates for triggering harmony in their beat patterns. These seven lingual sounds
are also the complete set of lingual airstream sounds for this beatboxer: every lingual
airstream sound performed by this beatboxer is a likely trigger for tongue body closure
harmony. (See the appendix for drum tabs of every harmony-containing beat pattern.) The
triggers therefore constitute a natural class within the set of beatboxing sounds this
beatboxer knows. With respect to the two criteria, harmony triggers 1) share the dimension
of lingual airstream and 2) collectively trigger harmony. There is not enough data to say for
sure whether every pulmonic sound is a harmony blocker, but the evidence so far predicts as
much. Tongue body closure can therefore be analyzed as a sub-segmental cognitive
representation because it places the trigger sounds in a cognitive relation with each other; in
doing so, it also places the triggers (lingual), blockers (pulmonic), and undergoers (other) in
a complementary cognitive relationship with each other. Section 4.2 offers a theoretical
account of tongue body closure harmony in a gesture-based framework, notably positing
pulmonic gestures for beatboxing which act as blockers of tongue body closure gestures.
4.2 Gestural implementation of beatboxing harmony
Having established that there are sub-segmental cognitive units of beatboxing, the next step
is to develop a theoretical account of harmony based on those units. A theoretical account of
beatboxing harmony needs to account for the behavior of triggers and their prolonged
tongue body closures, undergoers which lose a laryngeal raising gesture when the extended
tongue body closure spreads through them, and pulmonic blockers that disrupt the
spreading of the tongue body closure. This section compares two gestural
accounts—the Gestural Harmony Model (Smith, 2018) and Tilsen’s (2019) extension to the
selection-coordination model—and, briefly, a symbolic account.
4.2.1 Gestural Harmony Model
Chapter 4: Theory provides the basis for an action-based account of beatboxing phonology.
Speech and beatboxing movements share certain constriction tasks and kinematic properties,
suggesting that the fundamental cognitive units of beatboxing are the same types of actions
as speech units—albeit with different purposes. In the language of dynamical systems, this
equivalence is expressed through the graph level which speech gestures and beatboxing
actions are hypothesized to share.
The Gestural Harmony Model (Smith, 2018) provides the means for generating these
beatboxing harmony phenomena. The Gestural Harmony Model extends the gestures of task
dynamics with a new parameter for persistent or non-persistent activation, and extends the
coupled oscillator model of intergestural coordination with an option for intergestural
inhibition.⁶
In speech, persistent activation allows a gesture to last until it is inhibited by
another gesture or until the end of the word containing the gesture. These additions to the
model are new parameters; because the graph level deals with selection of dynamical system
parameters and the relationship of those parameters to each other and to dynamical state
variables, the addition of new parameters to the model is a graph-level change (Table 32).
Under the shared-graph hypothesis, the Gestural Harmony Model’s revisions to speech
graphs must also be reflected in the graphs of beatboxing actions and their coordination. All
of the different gestural arrangements possible under the Gestural Harmony
Model—including pathological patterns that are unattested in speech, as discussed
below—are predicted to be available to beatboxing as well. So just like speech, beatboxing
can have persistent actions which last until they are inhibited by another beatboxing action
or until the end of a musical phrase. The next few paragraphs schematize how the Gestural
Harmony Model accounts for the harmony patterns in beatboxing.

⁶ The coupled oscillator model does not have a mechanism for starting a tongue body closure early, stretching it regressively. Typically a gesture’s activation is associated with a particular phase of its oscillator; the oscillators settle into a stable relative phase relationship based on their couplings before an utterance is produced, giving later activation times to gestures associated with later sounds. The Gestural Harmony Model uses constraints in an OT grammar to shift the onset of activation of a persistent gesture earlier in an utterance. A similar strategy could be used for beatboxing harmony, or else a more dynamical method of selecting coupling relationships. In either case, the force that causes harmony to happen in a theoretical model must be related to the aesthetic principles that shape beatboxing—here perhaps the drive to create a cohesive aesthetic through a consistently sized resonance chamber (the oral chamber in front of the tongue body). The formalization of that force is left for future work.
4.2.1.1 Harmony triggers and undergoers
To start, consider the simplest sequence of two sounds: a Kick Drum and a Spit Snare (see
beat pattern 5 for an example). The Kick Drum is an ejective composed of a labial
compression action and a laryngeal closing and raising action, and the Spit Snare is a lingual
egressive sound composed of a labial compression action and a tongue body compression
action. These compositions are laid out in a coupling graph at the top of Figure 109 with
coupling connections between the paired actions for each sound—the specific nature of these
connections in a coupled oscillator model determines the relative timing of these actions and
contributes to the perception of multiple gestures as part of the same segment; for present
purposes, however, the important coupling relationship to watch for is the inhibitory
coupling.
Section 3 characterized the Spit Snare as a harmony trigger, so the tongue body
closure of the Spit Snare needs to turn the Kick Drum into an unforced Kick Drum via
temporal overlap. This is accomplished by flagging the Spit Snare’s tongue body closure
gesture as persistent—marked with arrow heads on the top and bottom of the oscillator in
the coupling graph—causing it to extend temporally as far as possible both forward and
backward. By extending backward, the tongue body closure is activated before or around the
same time as the labial closure of the Kick Drum, resulting in the production of a Kick Drum
that has adopted a tongue body closure (an unforced Kick Drum). The gestural score below
the coupling graph in Figure 109 shows this temporal organization.
Table 32. Non-exhaustive lists of state-, parameter-, and graph-level properties for dynamical
systems used in speech from Chapter 4: Theory. Parameter additions to the system from the
Gestural Harmony Model are underlined. Because the graph level is responsible for the
selection of and relationship between parameter and state variables, the addition of
persistence and inhibition to the parameter space is a graph-level change.
System type: Gesture
State level: Position; Velocity; Acceleration; Activation strength
Parameter level: Target state; Stiffness; Strength of other movement forces (e.g., damping); Blending strength; Persistence
Graph level: System topology (e.g., point attractor); Tract variable selection; Selection of and relationship between parameter and state variables

System type: Coupled oscillators
State level: Phase
Parameter level: Activation/deactivation phase; Oscillator frequency; Coupling strength & direction; Coupling type (in-phase, anti-phase); Inhibition
Graph level: Number of tract variables; Intergestural coupling; Selection of and relationship between parameter and state variables
Section 3 also showed that the laryngeal raising gesture of the Kick Drum disappears when it
harmonizes to the tongue body closure of the Spit Snare. This can be accomplished with an
inhibitory coupling relationship: if an inhibitor sound is scheduled to activate before an
inhibitee sound to which it is coupled, then the inhibitee is prevented from activating at all.
The coupling graph in Figure 109 shows an inhibitory coupling relationship between the
tongue body closure of the Spit Snare and the larynx raising gesture of the Kick Drum; since
the tongue body closure starts before the laryngeal gesture, the laryngeal gesture never
activates. The gestural score in Figure 109 shows the “ghost” of the laryngeal gesture as a
visual indication that it was intended but never produced.
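The persistence-and-inhibition mechanism can be schematized computationally. The sketch below is a toy illustration under assumed timings (the gesture names, time values, and interval-based inhibition rule are hypothetical simplifications, not the Gestural Harmony Model's actual coupled-oscillator dynamics): a persistent gesture stretches over the whole phrase, and an inhibitee whose scheduled onset falls inside an active inhibitor's interval never activates.

```python
from dataclasses import dataclass

@dataclass
class Gesture:
    name: str            # e.g. "SS:TBCD" (Spit Snare tongue body closure)
    onset: float         # scheduled activation time (arbitrary units)
    offset: float        # scheduled deactivation time
    persistent: bool = False
    active: bool = True  # set False if cancelled by an inhibitor

def apply_inhibition(gestures, couplings, span=(0.0, 4.0)):
    """couplings: (inhibitor_name, inhibitee_name) pairs. A persistent
    gesture stretches over the whole span, forward and backward; an
    inhibitee whose onset falls inside an active inhibitor's interval
    is prevented from activating at all."""
    by_name = {g.name: g for g in gestures}
    for g in gestures:
        if g.persistent:
            g.onset, g.offset = span  # extend both forward and backward
    for inhibitor, inhibitee in couplings:
        a, b = by_name[inhibitor], by_name[inhibitee]
        if a.active and a.onset <= b.onset < a.offset:
            b.active = False  # the "ghost" gesture: intended, never produced
    return gestures

# A Kick Drum (LAB + LAR) followed by a Spit Snare (LAB + TBCD); the
# Spit Snare's TBCD is persistent and inhibits the Kick Drum's LAR.
score = [
    Gesture("B:LAB", 0.0, 0.5),
    Gesture("B:LAR", 0.0, 0.5),
    Gesture("SS:LAB", 1.0, 1.5),
    Gesture("SS:TBCD", 1.0, 1.5, persistent=True),
]
apply_inhibition(score, [("SS:TBCD", "B:LAR")])
# B:LAR is now inactive, so the Kick Drum surfaces with a tongue body
# closure and no larynx raising: an unforced Kick Drum.
```

The backward extension of `SS:TBCD` is what stands in for regressive spreading here; in the model proper, this timing shift is derived rather than stipulated.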
Why does this inhibitory relationship exist in the first place? Laryngeal activity isn’t
antagonistic to tongue body closure—dorsal ejectives are well-attested in languages, so
clearly laryngeal closure/raising and tongue body closure action can even be collaborative.
And, we have seen that this canceling of the laryngeal closure/raising gesture is not a blanket
inhibition on all laryngeal activity. Figure 110 depicts the same relationship between Kick
Drum and Spit Snare as Figure 109, but with the addition of a humming phonation gesture
from the humming-while-beatboxing pattern (section 3.2). The persistent tongue body
closure from the Spit Snare inhibits the laryngeal raising gesture of the Kick Drum, just as it
did in Figure 109; however, the humming gesture has no inhibitory coupling relations, so it is
free to manifest at the appropriate time. The result is an unforced Kick Drum coarticulated
with humming and followed by a Spit Snare. (The humming gesture is depicted with
in-phase coupling to the labial closure of the Kick Drum as a way of showing that the
humming and the Kick Drum occur together on the same beat. In a more expansive account,
they might not be coupled to each other directly but instead share the activation phase of
some metrical oscillator.)
One answer is that closing the vocal folds reduces opportunities to manage the
volume of air in the lungs. Expert beatboxing requires constant breath management because
the ability to produce a given sound in an aesthetically pleasing manner requires specific air
pressure conditions. We have seen that beat patterns can include sound sequences with many
different types of airflow; in planning the whole beat pattern (or chunks of the beat pattern),
beatboxers must be prepared to produce a variety of airstream types and so are likely to try
to maintain breath flexibility. Laryngeal closures prevent the flow of air into and out of the
lungs for breath management purposes, and therefore are antagonistic not to the tongue
body closure but to breath control. By this explanation, the inhibition of the laryngeal
closure/raising gesture by the tongue body closure gesture is a formalization of a qualitative
shift in how airflow is managed in the vocal tract.
A different explanation is that cancellation of the laryngeal closure/raising gesture is
an adaptive coordinative pattern that dynamical approaches to speech predict as a hallmark
of skill (Pouplier, 2012). The coordination of the body’s end effectors changes as the
organism’s intentions change, sometimes resulting in qualitative shifts in coordinative
patterns; this has been notably recognized in quadrupeds like horses which switch into
distinct but roughly equally efficient gaits for different rates of movement (Hoyt & Taylor,
1981). In this case of laryngeal closure and raising in beatboxing, expert beatboxers are likely
to recognize that the laryngeal gesture they usually associate with a forced Kick Drum (or
other glottalic sounds that undergo harmony) has no audible consequence during tongue
body closure harmony. From this feedback, a beatboxer would learn a qualitative shift in
behavior—to not move the larynx while the tongue is making a closure. A similar thing
happens in speech in the context of assimilation due to overlap: Browman & Goldstein
(1995) provide measurements showing that when a speaker produces the phrase “tot puddles” there is
wide variation in the magnitude of the final [t] tongue tip constriction gesture, including
effective deletion. In this example, the speaker reduces or deletes their gesture when it would
have no audible consequence anyway. The same could be said of the laryngeal closure and
raising gesture in beatboxing when overlapped with tongue body closure harmony.
Figure 109. A schematic coupling graph and gestural score of a Kick Drum and Spit Snare.
The tongue body closure (TBCD) gesture of the Spit Snare overlaps with and inhibits the
closure and raising gesture of the larynx (LAR).
Figure 110. A schematic coupling graph and gestural score of a Kick Drum, humming, and a
Spit Snare. The tongue body closure (TBCD) gesture overlaps with and inhibits the closure
and raising gesture of the larynx (LAR) as in Figure 109, but the humming LAR gesture is
undisturbed.
4.2.1.2 Harmony blockers
The apparent blocking behavior of the Inward K Snare can also be accounted for with
inhibitory coupling. Figure 111 shows an example from beat pattern 1 with a {b CR B ^K}
sequence. The Inward K Snare requires a tongue body closure and lung expansion to draw
air inward over the sides of the tongue body, which is incompatible with a full tongue body
closure triggered by the Clickroll. This lung expansion action ends the persistent tongue
body gesture associated with a harmony trigger—if it didn’t, then the tongue body closure
would block the inward airflow and the Inward K Snare couldn’t be produced. Inhibiting the
persistent tongue body closure also prevents the persistent tongue body closure gesture from
inhibiting the laryngeal gesture of the Kick Drum between the Clickroll and the Inward K
Snare. As a result, the first Kick Drum that does overlap with the persistent tongue body
closure gesture has its laryngeal closure/raising gesture inhibited, but the second Kick Drum
will not.
Positing a breathing task is a major departure from the typical tract variables of
Articulatory Phonology. Lung movements are not considered a gesture in Articulatory
Phonology, and reasonably so—no language uses pulmonic ingressive airflow to make a
phonological contrast (Eklund, 2008). Pulmonic egressive airflow, on the other hand, is
practically ubiquitous in speech which means that it does not really operate contrastively
either. Either way, there has been no need to posit any kind of pulmonic gesture for speech.
But in beatboxing, pulmonic activity is contrastive (see Beatboxing sound frequencies)
and appears to contribute to productive sound patterns, indicating that it is cognitive too.
Figure 111. A schematic coupling graph and gestural score of a {b CR B ^K} sequence. The
tongue body closure (TBCD) gesture of the Clickroll overlaps with and inhibits the laryngeal
closing and raising gesture (LAR) of the first Kick Drum. The lung expansion (PULM)
gesture coordinated with the Inward K Snare inhibits the TBCD gesture of the Clickroll
before the TBCD gesture can inhibit the second LAR gesture.
The shared-graph hypothesis of the Theory chapter predicts that beatboxing and speech will
exhibit similar patterns of behavior permitted by the dynamical graph structures they use.
The Gestural Harmony Model augments the graphs of the task dynamics framework and the
coupling graph system to include gestural persistence and inhibition options; any predictions
of possible action patterns made by the Gestural Harmony Model should therefore also be
predictions about possible beatboxing patterns. The finding that beatboxing harmony exists
in such a speechlike form provides evidence in favor of both the shared-graph hypothesis
and the Gestural Harmony Model.
The support is all the stronger because the gestural analysis of beatboxing harmony
includes patterns that are predicted by the Gestural Harmony Model but unattested in
speech. As Smith (2018:204-206) discusses, intergestural inhibition may not be constrained
enough for speech: inhibition is introduced specifically so that an inhibitor gesture can block
the spread of a persistent inhibitee gesture, but it is equally possible in the model that a
persistent gesture could inhibit non-persistent gestures—even though such a thing never
appears to occur in speech. Within the narrow domain of speech, the Gestural Harmony
Model over-generates inhibition patterns. But beatboxing uses those patterns when
persistent tongue body closure gestures inhibit laryngeal raising gestures; under the
shared-graph hypothesis, the predictions of the Gestural Harmony Model are met with
evidence. (It is of course possible, maybe even likely, that the lack of attestation of this
particular inhibitory pattern is simply due to a relative scarcity of articulatory
investigations of the types of speech harmony that could exhibit this. If this pattern were
found in speech, it would mean that the Gestural Harmony Model does not over-generate
patterns and that speech and beatboxing harmony have one more thing in common.)
4.2.2 Extension to selection-coordination-intention
Tilsen (2019) offers two different gesture-based accounts for the origins of non-local
phonological processes as emerging from stochastic motor control variability within an
extension to the selection-coordination-intention model (Tilsen, 2018)—that is, in this
system, harmony is thought to start off more or less accidentally because of how
domain-general motor control works and to later become phonologized into a regular part of
speech. In this model, the selection of gestures or groups of gestures is modeled as a
competitive gating process of activation in a dynamical field: when a group of gestures is
selected, their excitation level ramps up high enough to trigger the dynamic motor processes
associated with that set of gestures; other gestures that have not been selected yet or which
were already selected and subsequently “demoted” are still present but are not excited
strongly enough to be selected or to influence motor planning fields in any way. The
continuous process of selection is discretized into static “epochs” that describe a snapshot
view of the whole state of the system and the gestures therein. One cause of demotion is
inhibition—gestures are conceived of as having both excitatory and inhibitory manifestations
in the dynamical field. Gestural antagonism is formalized as one gesture’s excitatory side
conflicting with another’s inhibitory side; when two antagonistic gestures would be selected
into the same epoch, the inhibitory gesture demotes the excited gesture from the selection
pool.
In Tilsen’s account, local spreading harmony (which we have argued above is the nature of
beatboxing harmony) arises from “selectional dissociation”, a theoretical mechanism
by which a gesture may be selected early or de-selected late relative to the epoch it would
normally be selected into. Blocking in this model occurs when a gesture which has been
selectionally dissociated conflicts with an inhibitory gesture in another epoch. In the case of
nasal harmony, for example, a velum lowering gesture might fail to be suppressed causing it
to remain selected in later epochs; this would be progressive/rightward spreading of a
gesture. The velum lowering would be blocked if it were ever extended into an epoch in
which an antagonistic, inhibiting velum closing gesture was also selected: the inhibitory
velum raising gesture would demote the velum lowering gesture, causing the lowering
gesture to slip below the threshold of influence over the vocal tract.
The tongue body closure spreading and pulmonic airflow blocking in beatboxing can
be accounted for by similar means, with the tongue body closure gesture being anticipatorily
de-gated (selected early) and remaining un-suppressed unless it conflicts with the selection
of an antagonistic pulmonic gesture (e.g., from an Inward K Snare) in a later epoch. This has
the advantage of providing an explicit explanation for why some Kick Drums in sequences
like {CR dc B ^K} do not undergo harmony: if the Kick Drum and Inward K Snare are
selected during the same epoch, then the Inward K Snare’s pulmonic gesture blocks the
spread of harmony during that whole epoch, effectively defending the Kick Drum from
harmony.
As hinted above, the selection-coordination-intention model offers a second
mechanism for dealing with non-local phonological agreement patterns: “leaky” gating. A
gesture that is gated is not selected and therefore exerts no influence on the tract variable
planning fields—and therefore, has no influence on the vocal tract. But if a gesture is
imperfectly gated, its influence can leak into the tract variable planning field even though it
hasn’t been selected. Leaky gating cannot be blocked because blocking is formalized as a
co-selection restriction; since the leaking gesture has not actually been selected, it cannot be
demoted. Local spreading harmony—including the beatboxing examples above—often
features blocking behavior, which makes leaky gating inappropriate for the crux of a
spreading harmony analysis (but useful as an analysis of long-distance agreement harmony
which is generally not blocked). But there is nothing to say that leaky gating can’t be used
with selectional dissociation; on the contrary, if a spreading gesture has an intrinsically high
excitation level, it would be all the more likely to lurk beneath the selection threshold,
leaking its influence into the vocal tract without antagonizing the currently selected gestures.
This could explain why the tongue body remains elevated during most of the forced Kick
Drums in the complex example in section 3.5.1: the pulmonic gesture of the Inward K Snare
blocks the spreading tongue body closure gesture by demoting it to sub-selection levels, but
the tongue body closure gesture leakily lingers and keeps the tongue body relatively high.
Only near the end of the beat pattern does the tongue body closure gesture stop leaking,
presumably because there are no more Clickrolls to reinforce its excitation.
So far as we can tell, however, the loss of the laryngeal closing/raising gestures during
Kick Drums and other harmony-undergoer sounds cannot be accounted for in this model.
Laryngeal closure/raising is not physically antagonistic to tongue body closure, so there is no
reason to posit a pre-phonologized inhibitory relationship between a tongue body gesture
and a laryngeal closure/raising gesture. If antagonism is defined phonologically instead of
phonetically, the complementary behavior of triggers, undergoers, and blockers may be
enough to set up phonological antagonism between their respective airstream initiator
gestures—but it is not clear in the model what the nature of the antagonism is or how this
type of antagonism might be learned without a physical antagonism first.
4.2.3 Domain-specific approaches
To conclude this theoretical accounting of beatboxing harmony, recall from section 1 that
models of phonological harmony that only account for linguistic harmony should be
dispreferred to models that can accommodate beatboxing harmony as well. What about a
more traditional, purely domain-specific phonological framework based around symbolic
features instead of gestures?
Most computational approaches are likely to be able to provide an account of
beatboxing harmony, though great care would need to be taken in order to define sensible
features for beatboxing. One might posit a set of complementary airstream features {+
pulmonic} and {+ lingual} for sounds with either pulmonic or lingual airstream. An Inward K
Snare would be defined as {+ pulmonic} for airstream and, because it is made with the
tongue body, {+ dorsal} for its place feature (the primary constrictor, when mapped to
phonetics). Because pulmonic and lingual airstreams are complementary, the Inward K Snare
would also be {- lingual}. Though not a deal-breaker per se, it would be a little strange in a
phonetically grounded model for a sound to be both {+ dorsal} and {- lingual}: there is no
qualitative distinction between a tongue body closure used for a pulmonic dorsal sound on
the one hand and a tongue body closure for a lingual egressive, lingual ingressive, or
percussive airstream sound on the other—in either case, the tongue body’s responsibility is to
stop airflow between the oral cavity and the pharynx. There would also need to be a
mechanism for preventing boxeme inputs that are simultaneously {+ lingual, + dorsal}
because the tongue body can’t manipulate air pressure behind itself. The gestural approach
has none of these issues: both an Inward K Snare and a lingual airstream sound simply
use a tongue body constriction degree gesture.
To the main point, there is the question of whether most featural accounts of
linguistic harmony have any justification for extending to beatboxing harmony. We have
seen already that gestures are defined by both their domain-specific task and the
domain-general system for producing constriction actions in the vocal tract; by the
hypothesis laid out in Chapter 4: Theory, the domain-general capacity of the graph level to
implement linguistic harmony predicts that gesture-ish beatboxing units should come with
the same ability. Beatboxing harmony is thus predicted from linguistic harmony in a gestural
framework. But computational features are traditionally defined domain-specifically: the
features are concerned exclusively with their encoding of linguistic contrast and linguistic
patterns, and are historically removed from phonetics and the physical world by design
(though they have become more and more phonetically-grounded over time). The grammars
that operate over those features are intended to operate exclusively over linguistic inputs and
outputs. Phonological features and grammar could be adapted to beatboxing, but every part of
their nature suggests that they should not be.
5. Conclusion
Phonological harmony is not unique to speech: common beat patterns in beatboxing like the
humming while beatboxing pattern have the signature properties of phonological harmony
including triggers, undergoers, and blockers. This suggests that phonology (or at least
phonological harmony) is not a special part of language but rather a specialization of a
domain-general capacity for harmonious patterns. The existence of beatboxing harmony
provides evidence for sub-segmental cognitive units in beatboxing. The articulatory
manifestation of beatboxing harmony is amenable to an analysis based on gestures. The
notion that speech and beatboxing phonology are each specializations of a domain-general
harmony ability follows naturally in a gestural framework, because gestures are essentially
domain-general action units specialized for a particular behavior.
CHAPTER 7: BEATRHYMING
Beatrhyming is a type of multi-vocalism performed by simultaneous production of
beatboxing and speech (i.e., singing or rapping) by a single individual. This case study of a
beatrhyming performance demonstrates how the tasks of beatboxing and speech interact to
create a piece of art. Aside from being marvelous in its own right, beatrhyming offers new
insights that challenge the fundamentals of phonological theories built to describe talking
alone.
1. Introduction
1.1 Beatrhyming
One of many questions in contemporary research in phonology is how the task of speech
interacts with other concurrent motor tasks. Co-speech manual gestures (Krivokapić, 2014;
Danner et al., 2019), co-speech ticcing from speakers with vocal Tourette’s disorder (Llorens,
in progress), and musical performance (Hayes & Kaun, 1996; Rialland, 2005; Schellenberg,
2013; Schellenberg & Gick, 2020; McPherson & Ryan, 2018) are just a few examples of
behaviors which may not be under the purview of speech in the strictest traditional sense
but which all collaborate with speech to yield differently organized speech performance
modalities. Studying these and other multi-task behaviors illuminates the flexibility of
speech units and their organization in a way that studying talking alone cannot.
This chapter introduces beatrhyming, a type of speech that has not previously been
investigated from a linguistic perspective (see Blaylock & Phoolsombat, 2019 for the first
presentation of this work, and also Fukuda, Kimura, Blaylock, and Lee, 2021). Beatrhyming is
a type of multi-vocalism performed by simultaneous production of beatboxing and speech
(i.e., singing or rapping) by a single individual. Notable beatrhyming performers include Kid
Lucky, Rahzel, and Kaila Mullady, though more and more beatboxers are taking up
beatrhyming as well. In terms of tasks, beatrhyming is an overarching task for artistic
communication that is composed of a beatboxing task and a speech task. The question at
hand is: how do the speech and beatboxing tasks interact in beatrhyming?
A beatrhyming performance contains words and beatboxing sounds interspersed with
each other. Artists use beatboxing sounds differently in their beatrhyming. For
example, Rahzel’s beatrhyme performance “If Your Mother Only Knew” (an adaptation of
Aaliyah’s “If Your Girl Only Knew”) uses mostly Kick Drums
(https://www.youtube.com/watch?v=ifCwPidxsqA), whereas Kaila Mullady (whose
beatrhyming is analyzed in this chapter) more often uses a variety of beatboxing sounds in
her beatrhyming.
Words and beatboxing sounds may interact in different ways in beatrhyming. In
some cases, words and beatboxing sounds are produced sequentially. Taking the word “got”
[gat] as an example, a sequence of beatboxing and speech sounds would be transcribed as
{B}[gat] (a Kick Drum, followed by the word “got”). In other cases, words and beatboxing
sounds may overlap, as in {K}[at] (with a Rimshot completely replacing the intended [g] in
“got”) or [ga]{^K}[at] (with a K Snare interrupting the [a] vowel of “got”).
Complete replacement is illustrated by two examples in the acoustic segmentation of
the word “dopamine” /dopəmin/ in Figure 112: the Closed Hi-Hat {t} replaces the intended
speech sound /d/ and Kick Drum {B} replaces the intended speech sound /p/. In both cases,
the /d/ and /p/ were segmented on the phoneme (“phones”) tier with the same temporal
interval as the replacing beatboxing sound (on the “beatphones” tier). The screenshot also
features one example of partial overlap, a K Snare {^K} that begins in the middle of an [i]
(annotated "iy").
For reference, Table 33 below lists the five main beatboxing sounds that will be
referred to in this chapter. Each beatboxing sound is presented with both Standard Beatbox
Notation (SBN) (TyTe & Splinter, 2019) in curly brackets and IPA in square brackets. (The
IPA notation for the Inward K Snare uses the downward arrow [↓] from the extIPA symbols
for disordered speech to indicate pulmonic ingressive airflow, and should not be confused
with the similar arrow in IPA that indicates downstep.)
Sections 1.2-1.3 present hypotheses and predictions about how beatboxing and
speech may (or may not) cooperate to support the achievement of their respective tasks.
Section 2 presents the method used for analysis and section 3 describes the results. Finally,
section 4 suggests that more studies of musical speech and other understudied linguistic
behaviors can offer new insights that challenge phonological theories based solely on talking.
Figure 112. Waveform, spectrogram, and text grid of the beatrhymed word “dopamine”.
Table 33. Sounds of beatboxing used in this chapter.
Name SBN IPA Description
Kick Drum {B} [p’] Voiceless ejective bilabial stop
PF Snare {PF} [p͡f’] Voiceless ejective labio-dental affricate
Closed Hi-Hat {t} [t’] Voiceless ejective alveolar stop
Rimshot {K} [k’] Voiceless ejective velar stop
(Inward) K Snare {^K} [k͡ʟ̝̊↓] Voiceless pulmonic ingressive lateral velar
affricate
1.2 Hypotheses and predictions
1.2.1 Constrictor-matching
Depending on the nature of the replacements, cases like the complete replacement of /d/ and
/p/ in the word “dopamine” from Figure 112 could be detrimental to the tasks of speech
production. In the production of the word "got" [gat], the [g] is intended to be performed as
a dorsal stop. If the [g] were replaced by a beatboxing dorsal stop, perhaps a velar ejective
Rimshot {K}, at least part of the speech task could be achieved while simultaneously
beatboxing. On the other hand, replacing the intended [g] with a labial Kick Drum {B}
would deviate farther from the intended speech tasks for [g]. If the difference were great
enough, making replacements that do not support the intended speech goals might lead to
listeners misperceiving beatrhyming lyrics—in this case, perhaps hearing “bot” [bat] instead of
“got”.
So then, if the speech task and the beatboxing task can influence each other during
beatrhyming, the speech task may prefer that beatrhyming replacements match the intended
speech signal as often as possible and along as many phonetic dimensions as possible. This
chapter investigates whether replacements support the speech task by making replacements
that match intended speech sounds in constrictor type (i.e., the lips, the tongue tip, the
tongue body) and constriction degree (approximated by manner of articulation). Lederer
(2005) offers the similar hypothesis that beatboxing sounds collectively sound as
un-speechlike as possible to differentiate beatboxing from speech—except during
simultaneous beatboxing and singing when perception might be maximized if beatboxing
sounds have the same place of articulation as the speech sounds they replace.
To summarize: the main hypothesis is that speech and beatboxing interact with each
other in beatrhyming in a way that supports the accomplishment of intended speech tasks.
This predicts that beatboxing sounds and the intended speech sounds they replace are likely
to match in constrictor and constriction degree. Conversely, the null hypothesis is that the
two systems do not interact in a way that supports the intended speech tasks, predicting that
beatboxing sounds replace speech sounds with no regard for the intended constrictor or
constriction degree.
The predictions of these hypotheses for constrictor matching are depicted in Figures
113 and 114. Imagine a beatrhyming performance in which 90 intended speech sounds—30
intended speech labials, 30 intended speech coronals, and 30 intended speech dorsals—are
replaced by beatboxing sounds. The replacing beatboxing sounds come from a similar
distribution: 30 beatboxing labials, 30 beatboxing coronals, and 30 beatboxing dorsals. If
replacements are made with no regard to the constrictor of intended speech sounds
(following from the null hypothesis), constrictor matches should occur at chance. Each
replacement would have a 1 in 3 chance of having a constrictor match, resulting in 10
constrictor matches and 20 constrictor mismatches per intended constrictor as depicted in
Figure 113. But if replacements are sensitive to the intended constrictor (following from the
main hypothesis), then most beatboxing sounds should match the constrictor of the
intended speech sound they replace (Figure 114).
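The chance baseline just described can be made concrete with a short computation. This is a minimal sketch, not part of the study's analysis, using the hypothetical 30/30/30 counts from the text:

```python
# Sketch of the chance (null-hypothesis) baseline described above.
# Under the null hypothesis, a replacement's constrictor is independent
# of the intended constrictor, so expected cell counts follow the
# product of the marginal distributions.
constrictors = ["labial", "coronal", "dorsal"]
intended = {c: 30 for c in constrictors}   # intended speech sounds
replacing = {c: 30 for c in constrictors}  # replacing beatboxing sounds

total = sum(replacing.values())
expected = {
    (i, r): intended[i] * (replacing[r] / total)
    for i in constrictors
    for r in constrictors
}  # each of the 9 cells comes out to 10.0

matches = sum(v for (i, r), v in expected.items() if i == r)
mismatches = sum(v for (i, r), v in expected.items() if i != r)
print(matches, mismatches)  # 30.0 60.0
```

This reproduces the figures cited in the text: 10 matches and 20 mismatches per intended constrictor, i.e., a 1-in-3 match rate overall.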
Figure 113. Bar plot of the expected counts of constrictor matching with no task interaction.
Figure 114. Bar plot of the expected counts of constrictor matching with task interaction.
Consider also the predicted distributions for any single beatboxing constrictor (Figure 115).
For example, if 30 dorsal beatboxing replacements (i.e., K Snares) are made with no regard to
intended speech constrictor (following from the null hypothesis), then 10 of those
replacements should mismatch to intended speech labials, 10 should mismatch to intended
speech coronals, and 10 should match to intended speech dorsals. But if replacements are
sensitive to intended constrictor (following from the main hypothesis), then all 30
beatboxing dorsals are expected to replace intended speech dorsals (Figure 116).
The prediction of constriction degree matching follows a similar line of thinking. If
beatrhyming replacements are made with an aim of satisfying speech tasks, then
replacements are more likely to occur between speech sounds and beatboxing sounds that have
similar constriction degrees. Since beatboxing sounds are stops and trills (see Beatboxing
sound frequencies), and since “Dopamine” is performed in a variety of
English that has no phonological trills, the prediction of the main hypothesis is that speech
stops will be replaced more frequently than speech sounds of other manners of articulation.
On the other hand, the null hypothesis would be supported by finding that beatboxing
sounds replace all manners of speech sounds equally.
Figure 115. Bar plots of the expected counts of K Snare constrictor matching with no task
interaction.
Figure 116. Bar plots of the expected counts of K Snare constrictor matching with task
interaction.
1.2.2 Beat pattern repetition
As established in earlier chapters, beatboxing beat patterns have their own predictable sound
organization within a beat pattern. The presence of a snare drum sound on the back beat
(beat 3 of each measure) of a beat pattern in particular is highly consistent, but beat patterns
are also often composed of regular repetition at larger time scales. Speech utterances are
highly structured as well, but the sequence of words (and therefore sounds composing those
words) is determined less by sound patterns and more by syntax (cf. Shih & Zuraw, 2017).
However, artistic speech (i.e., poetry, singing) is sometimes composed alliteratively or with
other specific sound patterns in mind, leveraging the flexibility of language to express similar
ideas with a variety of utterances.
There are (at least) two ways beatboxing and speech could interact while maximizing
constrictor matching as hypothesized in section 1.2.1. First, the words of the song could be
planned without any regard for the resulting beat pattern. Any co-speech beatboxing sounds
would be planned based on the words of the song, prioritizing faithfulness to the intended
spoken utterance. Alternatively, the lyrics could be planned around a beatboxing beat
pattern, prioritizing the performance of an aesthetically appealing beat pattern. The counts
of constrictor matches described in section 1.2.1 could look the same either way, but the two
hypotheses predict that the resulting beat patterns will be structured differently. Specifically,
prioritizing the beatboxing beat pattern predicts that beatrhyming will feature highly
regular/repetitive beatboxing sound sequences characteristic of beatboxing music, whereas
prioritizing the speech structure would lead to irregular/non-repeating beatboxing sound
sequences. The rest of this section discusses these predictions in more detail.
A sequence of beatboxing sounds often repeats itself after just two measures of
music—that is, a two-measure or “two-bar” phrase (and also in this study, a “line” of music)
might be performed several times. For example, Figure 117 shows a sixteen-bar beatboxed
(non-lyrical) section of “Dopamine”. As a sixteen-bar phrase, it is composed of eight smaller
two-bar phrases. Each two-bar phrase could be distinct from the others, but in fact there are
only two types of two-bar phrases: AB and AC, where A, B, and C each refer to a sequence of
sounds in a single measure of music. The two-bar phrase AB occurs six times in the beat
pattern on lines 1, 2, 3, 5, 6, and 7. Lines 4 and 8 of the beat pattern feature the two-bar
phrase AC.
The depiction of the sixteen-bar phrase in Figure 117 appears sequential, but is in fact
hierarchical: pairs of two-bar phrases compose four-bar phrases, pairs of four-bar phrases
compose eight-bar phrases, and a pair of eight-bar phrases composes the entire sixteen-bar
phrase. In fact, one way to model the creation of this structure is to merge progressively
larger repeating units. That is, given an initial two-bar phrase, a four-bar phrase can be
created by assembling two instances of that two-bar phrase into a larger unit. Likewise, an
eight-bar phrase can be thought of as copy-and-merge of a four-bar phrase with itself.
There is room for variation here, and lines may change based on the artist’s musical
choices. In Figure 117, the end of the first eight-bar phrase deviates from the rest of the
pattern, possibly to musically signal the end of the phrase. In this case, the whole eight-bar
phrase is then copied to create a sixteen-bar phrase, resulting in repetition of that deviation
at the end of both eight-bar phrases.
This hierarchical composition can be used to predict where repeating two-bar phrases
are most likely to be found in a sixteen-bar beat pattern. The initial repetition of a two-bar
phrase to make a four-bar phrase predicts that lines 1 & 2 should be similar (where each line
is a two-bar phrase). Likewise, repetition of that four-bar phrase to make an eight-bar phrase
would predict repetition between lines 3 & 4; at a larger time scale, this would also predict
that lines 1 & 3 should be similar to each other, as should lines 2 & 4. In the sixteen-bar
phrase composed of two repeating eight-bar phrases, the repetition relationships from the
previous eight-bar phrase would be copied over (line pairs 5 & 6, 7 & 8, 5 & 7, and 6 & 8);
repetition would also be expected between corresponding lines of these two eight-bar
phrases, predicting similarity between lines 1 & 5, 2 & 6, 3 & 7, and 4 & 8.
Figure 117. Serial and hierarchical representations of a 16-bar phrase (8 lines with 2 measures
each).
Beat | 1 1.5 2 2.5 3 3.5 4 4.5 1 1.5 2 2.5 3 3.5 4 4.5
| ------------------------------------------------------------------------------------------------------------
Line 1 | B t t ^K th B in B
Line 2 | B t t ^K th B in B
Line 3 | B t t ^K th B in B
Line 4 | B t t ^K B h ^K t t
|
Line 5 | B t t ^K th B in B
Line 6 | B t t ^K th B in B
Line 7 | B t t ^K th B in B
Line 8 | B t t ^K B h ^K t t
Hierarchical structure (an asterisk marks a deviation from the repeated pattern):
  16-bar phrase = two 8-bar phrases
  8-bar phrase  = two 4-bar phrases
  4-bar phrase  = two 2-bar phrases (lines)
  Lines 1-8 by measure: 1 = A B, 2 = A B, 3 = A B, 4 = A C*, 5 = A* B, 6 = A B, 7 = A B, 8 = A C
Because deviations from the initial two-bar pattern are expected to occur in the interest of
musical expression, some pairs of two-bar phrases are more likely to exhibit clear repetition
than others. Consider a four-bar phrase composed of two two-bar phrases AB and AC—their
first measures (A) are identical, but their second measures (B and C) are different. If this
four-bar phrase is repeated to make an eight-bar phrase, the result would be AB-AC-AB-AC.
In this example, lines 1 & 3 match as do lines 2 & 4, but lines 2 & 3 and 1 & 4 do not. For this
study, the discussion of repetition in beatrhyming is limited to just those pairs of lines
described earlier which are most likely to feature repetition (“cross-group” refers to
corresponding lines in two different eight-bar phrases):
● Adjacent two-bar phrases—lines 1 & 2, 3 & 4, 5 & 6, and 7 & 8
● Alternating two-bar phrases—lines 1 & 3, 2 & 4, 5 & 7, and 6 & 8
● Cross-group two-bar phrases—lines 1 & 5, 2 & 6, 3 & 7, and 4 & 8
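These pair groupings fall out mechanically from the doubling (copy-and-merge) structure. The following sketch (a hypothetical helper, not from the original study) enumerates them for a 16-bar phrase of eight two-bar lines:

```python
# Sketch: enumerate the line pairs predicted to repeat in a 16-bar
# phrase built by copy-and-merge (8 two-bar lines, two 8-bar groups).
def predicted_pairs():
    # adjacent lines pair up within each 4-bar phrase
    adjacent = [(i, i + 1) for i in range(1, 9, 2)]
    # alternating lines pair up within each 8-bar group (lines 1-4, 5-8)
    alternating = [(i, i + 2) for i in (1, 2, 5, 6)]
    # cross-group pairs relate corresponding lines of the two 8-bar groups
    cross_group = [(i, i + 4) for i in range(1, 5)]
    return adjacent, alternating, cross_group

adjacent, alternating, cross_group = predicted_pairs()
print(adjacent)     # [(1, 2), (3, 4), (5, 6), (7, 8)]
print(alternating)  # [(1, 3), (2, 4), (5, 7), (6, 8)]
print(cross_group)  # [(1, 5), (2, 6), (3, 7), (4, 8)]
```

The three printed lists correspond exactly to the adjacent, alternating, and cross-group pairs in the bullet list above.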
If beatboxing structure is prioritized in beatrhyming—either because beatboxing and speech
aren’t sensitive to each other at all or because the speech system accommodates beatboxing
through lyrical choices that result in an ideal beat pattern—then sequences of co-speech
beatboxing sounds should have similarly high repetitiveness compared to beat patterns
performed without speech. But if speech structure is prioritized, then the beat pattern is
predicted to sacrifice repetitiveness in exchange for supporting the speech task by matching
the intended constrictor and constriction degree of any speech segments being replaced.
1.2.3 Summary of hypotheses and predictions
The main hypothesis is that speech and beatboxing interact during beatrhyming to
accomplish their respective tasks, and the null hypothesis is that they do not. Support for the
first hypothesis could appear in two different forms, or possibly both at the same time. First,
if beatrhyming replacements are sensitive to the articulatory goals of the intended speech
sound being replaced, then the beatboxing sounds that replace speech sounds are likely to
match their targets in constrictor and constriction degree. Second, if beatboxing sequencing
patterns are prioritized in beatrhyming, then sequences of beatrhyming sound replacements
should exhibit the same structural repetitiveness as non-lyrical beatboxing sequences. Failing
to support either of these predictions would support the null hypothesis and the notion that
speech and beatboxing have no cognitive relationship during beatrhyming.
Note that different beatrhyming performances may feature different relationships
between speech and beatboxing depending on the artist’s musical aims. The results of this
study should be taken as an account of one way that beatrhyming has been performed, but
not necessarily the only way to beatrhyme.
2. Method
This section describes how the data were collected and coded (section 2.1) and analyzed
(2.2).
2.1 Data
The data in this study come from a beatrhyming performance called "Dopamine", created
and performed by Kaila Mullady and made publicly available on YouTube (Mullady, 2017).
The composition of "Dopamine" includes sections of beatboxing in isolation and beatboxing
concurrently with speech. The lyrics of "Dopamine" were provided by Mullady over email.
An undergraduate in the Linguistics program at USC performed manual acoustic
segmentation of the beatrhymed portions of “Dopamine” using Praat (Boersma, 2001).
Segmentation was performed at the level of words, phonemes (“phones”), beatboxing sounds
(“beatphones”), and the musical beat (“beats”) on which beatboxing sounds were performed.
For complete sound replacements, the start and end of the annotation for the interval of the
intended speech phone were the same as the start and end of the beatboxing beatphone
interval.
Five beatboxing sounds were used in the beatrhymed sections of "Dopamine": Kick
Drum {B}, Closed Hi-Hat {t}, PF Snare {PF}, Rimshot {K’}, and K Snare {^K}. (It was not clear
from the acoustic signal whether the K Snares were Inward or Outward; a choice was made
to annotate them consistently as Inward {^K}. The choice of Inward or Outward does not
affect the outcome of this study, which addresses only constrictor—which Inward and
Outward K Snares share.) Each beatboxing sound was coded by its major constrictor: {B} and
{PF} were coded as “labial”, {t} was coded as “coronal” (tongue tip), and {K’} and {^K} were
coded as “dorsal” (tongue body). Finally, the metrical position of each replacement was
annotated with points on a PointTier aligned to the beginning of beatboxing sound intervals.
2.2 Analysis
2.2.1 Constrictor-matching analysis
The mPraat software (Bořil & Skarnitzl, 2016) for MATLAB was used to count the number
of complete one-to-one replacements (excluding partial replacements or cases where one
beatboxing sound replaced two speech sounds) (n = 88). The constrictor of the originally
intended speech sound was then compared against the constrictor for the replacing
beatboxing sound, noting whether the constrictors were the same (matching) or different
(mismatching).
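The matching tally is straightforward to automate. A minimal sketch in Python follows; the constrictor coding mirrors section 2.1, but the data layout and the sample speech-phone labels are hypothetical illustrations, not the actual annotation format:

```python
# Constrictor codes from section 2.1 for the beatphones, plus a few
# illustrative speech phones; real data would cover the full inventory.
CONSTRICTOR = {
    'B': 'labial', 'PF': 'labial',       # Kick Drum, PF Snare
    't': 'coronal',                      # Closed Hi-Hat
    "K'": 'dorsal', '^K': 'dorsal',      # Rimshot, K Snare
    'b': 'labial', 'f': 'labial',
    'n': 'coronal', 'd': 'coronal',
    'k': 'dorsal', 'g': 'dorsal',
}

def tally_matches(replacements):
    """Count constrictor matches and mismatches over
    (intended speech phone, replacing beatphone) pairs."""
    matches = sum(CONSTRICTOR[phone] == CONSTRICTOR[beatphone]
                  for phone, beatphone in replacements)
    return matches, len(replacements) - matches

tally_matches([('b', 'B'), ('k', '^K'), ('n', '^K')])  # -> (2, 1)
```

Run over all 88 complete replacements, a tally of this form yields the match/mismatch counts reported in section 3.1.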
Constriction degree matching was likewise measured by counting how many speech
sounds of different constriction degrees were replaced—or in this case, different manners of
articulation. All the beatboxing sounds that made replacements were stops {B} or affricates
{PF, t, K’, (^)K}; higher propensity for constriction degree matching would be found if the
speech sounds being replaced were more likely to also be stops and affricates instead of
nasals, fricatives, or approximants.
2.2.2 Repetition analysis
Four sixteen-bar sections labeled B, C, D, and E were chosen for repetition analysis.
(“Dopamine” begins with a refrain, section A, that was not analyzed because it has repeated
lyrics that were expected to inflate the repetition measurements. The intent of the ratios is to
assess whether beat patterns in beatrhyming are as repetitive as beat patterns without lyrics,
not how many times the same lyrical phrase was repeated.) Sections B and D were
non-lyrical beat patterns (no words) between the refrain and the first verse and between the
first and second verses, respectively. Sections C and E were the beatrhymed (beatboxing with
words) first and second verses, respectively. The second verse was 24 measures long, but was
truncated to 16 measures for the analysis.
Repetitiveness was assessed using two different metrics. The first metric counted how
many unique measure-long sequences of beatboxing sounds were performed as part of a
section of music. The more unique measures are found, the less repetition there is. Rhythmic
variations within a measure were ignored for this metric to accommodate artistic flexibility
in timing. For example, Figure 118 contains two two-bar phrases; of those four measures, this
metric would count three of them as unique: {B t t ^K}, {th PF ^K B}, and {B ^K B}. The first
measures of each two-bar phrase would be counted as the same because the sequence of
sounds in the measure is the same despite use of triplet timing on the lower line (using beats
1.67 and 2.33 instead of beats 1.5 and 2). This uniqueness metric provides an integer value
representing how much repetition there is over a sixteen-bar section; if beatrhyming beat
patterns resemble non-lyrical beatboxing patterns, each section’s uniqueness metric should
be approximately the same.
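In sketch form, the uniqueness metric reduces to counting distinct sequences. The encoding below, in which each measure is stored as the ordered list of its sound labels with beat timing already stripped, is a hypothetical illustration rather than the actual annotation format:

```python
def count_unique_measures(measures):
    """Count distinct measure-long sound sequences, rhythm ignored."""
    return len({tuple(m) for m in measures})

# The four measures of Figure 118, with beat timing removed; the two
# {B t t ^K} measures collapse into one despite their different rhythms.
measures = [
    ['B', 't', 't', '^K'], ['th', 'PF', '^K', 'B'],   # upper two-bar phrase
    ['B', 't', 't', '^K'], ['B', '^K', 'B'],          # lower two-bar phrase
]
count_unique_measures(measures)  # -> 3
```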
The second metric is a proportion called the repetition ratio. For a given pair of
two-bar phrases, the number of beats that had matching beatboxing sounds was divided by
the number of beats that hosted a beatboxing sound across both two-bar phrases. This
provides the proportion of beats in the two phrases that were the same, normalized by the
number of beats that could have been the same, excluding beats for which neither two-bar
phrase had a beatboxing sound.
For example, the two two-bar phrases in Figure 118 are the same for 4/10 beats,
resulting in a repetition ratio of 0.4. In measure 1 the sounds of beats 1 and 3 match, but the
second two sounds of the first phrase are on beats 1.5 and 2 whereas the second two sounds
of the second phrase are performed with triplet timing on beats 1.67 and 2.33. Therefore in
the first measure, six beats have a beatboxing sound in either phrase—beats 1, 1.5, 1.67, 2, 2.33,
and 3—but only two of those beats have matching sounds. In the second measure, four beats
have a beatboxing sound in either phrase—beats 1, 2, 3, and 4. While two of those beats have
the same beatboxing sound in both phrases, beat 1 only has a sound in the first phrase and
beat 2 has a PF Snare in the first phrase but a Kick Drum in the bottom phrase. Looking at
the phrases overall, ten beats carry a beatboxing sound in either phrase but only four beats
have the same sound repeated in both phrases for a repetition ratio of 0.4.
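The calculation just illustrated can be sketched in Python. Each phrase is encoded, hypothetically, as a mapping from (measure, beat) to the sound performed there, so beats with no sound in either phrase simply never enter the denominator:

```python
def repetition_ratio(phrase_a, phrase_b):
    """Proportion of sound-bearing beats on which both phrases
    perform the same beatboxing sound."""
    beats = set(phrase_a) | set(phrase_b)   # beats with a sound in either phrase
    matches = sum(phrase_a.get(b) is not None and
                  phrase_a.get(b) == phrase_b.get(b)
                  for b in beats)
    return matches / len(beats)

# The two phrases of Figure 118; the lower line uses triplet timing
# (beats 1.67 and 2.33) where the upper line uses beats 1.5 and 2.
top = {(1, 1): 'B', (1, 1.5): 't', (1, 2): 't', (1, 3): '^K',
       (2, 1): 'th', (2, 2): 'PF', (2, 3): '^K', (2, 4): 'B'}
bottom = {(1, 1): 'B', (1, 1.67): 't', (1, 2.33): 't', (1, 3): '^K',
          (2, 2): 'B', (2, 3): '^K', (2, 4): 'B'}
repetition_ratio(top, bottom)  # -> 0.4, i.e. 4 matching beats out of 10
```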
This calculation penalizes cases like the first half of the example in Figure 118 in
which the patterns are identical except for a slightly different rhythm. The high sensitivity to
rhythm of this repetition ratio measurement was selected to complement the rhythmic
insensitivity of the previous technique for counting how many unique measures were in a
beat pattern. In practice, this penalty happened to only lower the repetition ratio for phrases
that were beatboxed without lyrics (co-speech beat patterns rarely had patterns with the
same sounds but different rhythms, so there were few opportunities to be penalized in this
way); despite this, the repetition ratios for beatrhymed patterns were still lower than the
repetition ratios for beatboxed patterns in the same song (see section 3.3 for more details).
Figure 118. Example of a two-line beat pattern. Both lines have a sound on beats 1 and 3 of
the first measure and beats 2, 3, and 4, of the second measure.
1 1.5 2 2.5 3 3.5 4 4.5 1 1.5 2 2.5 3 3.5 4 4.5
------------------------------------------------------------------------------
B t t ^K th PF ^K B
B t t ^K B ^K B
Within each section, the repetition ratio was calculated for three types of two-bar phrase
pairs: adjacent pairs (phrases 1 & 2, 3 & 4, 5 & 6, and 7 & 8), alternating pairs (phrases 1 & 3,
2 & 4, 5 & 7, and 6 & 8), and cross-group pairs (phrases 1 & 5, 2 & 6, 3 & 7, and 4 & 8).
Additionally, repetition ratio was calculated between sections B & D and between sections C
& E to see if musically related sections used the same beat pattern. Repetition ratios
measured for the beatboxed and beatrhymed sections were then compared pairwise to assess
whether the beatrhymed sections were as repetitive as the beatboxed sections.
A transcription of the beatboxing sounds of “Dopamine” was used for both
measurement techniques. This transcription excluded phonation and trill sounds during the
beatboxing patterns because they extend over multiple beats and would inflate the number
of beats counted in the calculation of the repetition ratio. (The excluded beatboxing sounds
were repeated as consistently as the other sounds in the beatboxing section.)
3. Results
Section 3.1 measures the extent to which the beatrhyming replacements were
constrictor-matched and section 3.2 does likewise for manner of articulation; both assess
whether the selection of beatboxing sounds accommodates the speech task. Section 3.3
quantifies the degree of repetition during beatrhyming to determine whether the selection of
lyrics accommodated the beatboxing task.
3.1 Constrictor-matching
Section 3.1.1 shows that replacements are constrictor-matched overall. Section 3.1.2 considers
replacements in two groups, showing that there is a high degree of constrictor matching off
the back beat but little constrictor matching on the back beat. Section 3.1.3 offers possible
explanations for the few exceptional replacements that were off the back beat and not
constrictor-matched.
3.1.1 All replacements
Figure 119 shows the number of times an intended speech sound was replaced by a
beatboxing sound of the same constrictor (blue bars, the left of each pair) or by a beatboxing
sound of a different constrictor (orange bars, the right of each pair) for every complete
replacement in “Dopamine.”
The intended speech dorsals were predominantly replaced by beatboxing dorsals,
appearing to support the hypothesis that speech and beatboxing interact in beatrhyming. But
while the majority of intended labials and intended coronals were also replaced by
beatboxing sounds with matching labial or coronal constrictors, there was still a fairly large
number of mismatches for each (10/28 mismatches for labials, 10/31 mismatches for
coronals). This degree of mismatching is less than the levels of chance predicted by a lack of
interaction between beatboxing and speech—the expectation at chance was 20 mismatches
per constrictor, not 10.
Figure 119. Bar plot showing measured totals of constrictor matches and mismatches.
Table 34 shows the contingency table of replacements by constrictor. Highlighted cells along
the upper-left-to-bottom-right diagonal represent constrictor matches; all other cells are
constrictor mismatches. Reading across each row reveals how many times an intended
speech constriction was replaced by each beatboxing constrictor. For example, intended
speech labials were replaced by beatboxing labials 18 times, by beatboxing coronals 0 times,
and by beatboxing dorsals 10 times. A chi-squared test over this table rejects the null
hypothesis that beatboxing sounds replace intended speech sounds at random (χ² = 79.15, df
= 4, p < 0.0001).
Table 34. Contingency table of beatboxing sound constrictors (top) and the speech sounds
they replace (left).
Intended speech sound    Replacing beatboxing sound        Total
                         Labial    Coronal    Dorsal
Labial                     18         0         10           28
Coronal                     2        21          8           31
Dorsal                      2         0         27           29
Total                      22        21         45           88
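As a check on the reported statistic, the chi-squared value can be recomputed directly from the counts in Table 34. The plain-Python sketch below makes no library assumptions; a routine such as scipy.stats.chi2_contingency would return the same value:

```python
# Recompute the chi-squared statistic for Table 34 (rows: intended speech
# constrictor; columns: replacing beatboxing constrictor).
observed = [
    [18, 0, 10],   # labial
    [2, 21, 8],    # coronal
    [2, 0, 27],    # dorsal
]
n = sum(map(sum, observed))                        # 88 complete replacements
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]

# Sum (O - E)^2 / E over all cells, with E = row total * column total / N.
chi2 = sum(
    (obs - exp) ** 2 / exp
    for i, row in enumerate(observed)
    for j, obs in enumerate(row)
    for exp in [row_totals[i] * col_totals[j] / n]
)
df = (len(observed) - 1) * (len(observed[0]) - 1)
print(round(chi2, 2), df)                          # prints: 79.15 4
```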
3.1.2 Replacements on and off the back beat
All 10 labial mismatches and 8/10 coronal mismatches were made by a dorsal beatboxing
sound replacement. Each of those mismatches also happens to occur on beat 3 of the meter,
and the replacing beatboxing sound is always a K Snare {^K}. In beatboxing, beat 3
corresponds to the back beat and almost always features a snare. This conspiracy of so many
dorsal replacements being made on the back beat suggests that it would be more informative
to split the analysis into two pieces.
A distinction can be made between replacements that occurred on beat 3 (n = 30) and
replacements made on any other beat or subdivision (n = 58). Figure 120 shows the counts of
matching and mismatching replacements excluding the back beat. With the inviolable back
beat K Snare out of the picture, 54 of 58 replacements have matching constrictor. This
distribution more closely matches the main hypothesis. Looking at just the replacements
made on the back beat (n = 30), however, appears to support the null hypothesis. Beatboxing
sounds on the back beat in "Dopamine" are restricted to the dorsal constrictor for the K
Snare {^K}. The replacements are fairly evenly distributed across all intended speech
constrictors, resembling the idealized prediction of no interaction between beatboxing
constrictions and intended speech constrictors (Figure 121). Taken together with the
previous result, this provides evidence for a trading relationship: the speech task is achieved
during replacements under most circumstances, but not on the back beat.
One smaller finding obfuscated by the coarse constrictor types is that speech labials
and labiodentals tended to be constrictor-matched to the labial Kick Drum {B} and
labiodental PF Snare {PF}, respectively. PF Snares only ever replaced /f/s, and 4 out of 6
replaced /f/s were replaced by PF Snares. (The other two were on the back beat, and so
replaced by K Snares.) There were two /v/s off the back beat, both of which were in the same
metrical position and in the word "of", and both of which were replaced by Kick Drums.
Labio-dentals were grouped with the rest of the labials to create a simpler picture about
constrictor matching and because the number of labio-dental intended speech sounds was
fairly small. However, for future beatrhyming analysis, it may be useful to separate bilabial
and labio-dental articulations into separate groups rather than covering them with “labial”.
Figure 120. Bar plots with counts of the actual matching and mismatching constrictor
replacements everywhere except the back beat.
Figure 121. Bar plot with counts of the actual matching and mismatching constrictor
replacements on just the back beat.
3.1.3 Examining mismatches more closely
There are four constrictor mismatches not on the back beat: two in which a labial beatboxing
sound replaces an intended speech coronal, and two in which a labial beatboxing sound
replaces an intended speech dorsal.
Both labial-on-coronal cases are of a Kick Drum replacing the word "and", which we
assume (based on the style of the performance) would be pronounced in a reduced form like
[n]. Acoustically, the low frequency burst of a labial Kick Drum {B} is probably a better
match to the nasal murmur of the intended [n] (and thus the manner of the nasal) than the
higher frequency bursts of a Closed Hi-Hat {t}, K Snare {^K}, or Rimshot {K’}. All the other
nasals replaced by beatboxing sounds were on the back beat and therefore replaced by the K
Snare {^K}.
The two cases where a Kick Drum {B} replaced a dorsal sound can both be found in
the first four lines of the second verse (Figure 122). In one case, a {B} replaced the [g] in "got"
on the first beat 1 of line 3 (underlined in Figure 122). The reason may be a general
preference for having a Kick Drum on beat 1. Only 3 replacements were made on beat 1 in
"Dopamine", and all of them featured a Kick Drum {B}. (The overall scarcity of beat 1
replacements is due at least in part to the musical arrangement and style resulting in
relatively few words on beat 1.) The other case also involved a Kick Drum {B} replacing a
dorsal, this time the [k] in the word “come” on the second beat 2 of line 3 (also underlined).
The replacing {B} in this instance was part of a small recurring beatboxing pattern of {B B}
that didn't otherwise overlap with speech—it occurred on beats 1.5 and 2 of the second
measure of lines 1-3 as well as in the first measure of line 4.
Figure 122. Four lines of beatrhyming featuring two replacement mismatches (underlined).
1 1.5 2 2.5 3 3.5 4 4.5 1 1.5 2 2.5 3 3.5 4 4.5
----------------------------------------------------------------------------------------------------------------
{B t t B} {^K}an't you see {B B} {^K}ou are li- {K'}a
{B}mid- night s{^K}um- mer's {t}ream {B B} {^K}on- ly you
{B}o- {t}em {t}weet {^K}e- lo- {t}ies {B B}ome and {^K}lay with me
{B B} {^K}et's see what the {B}sky {t}urns {^K}in- to {B}
In short, tentative explanations are available for the few constrictor mismatches that occur
off the back beat: two mismatches could be because intended nasal murmur likely matches
the low frequency burst of a Kick Drum better than the burst of the other beatboxing sounds
available, and the other two could be due to established musical patterns specific to this
performance.
3.2 Constriction degree (manner of articulation) matching
If constriction degree is also matched, then the replacing beatboxing sounds, which are all
stops or affricates, should mostly replace speech stops and affricates. Figure 123 shows that
this is what happens. The sounds that made constrictor-matching
replacements—the Kick Drum {B}, PF Snare {PF}, Closed Hi-Hat {t}, and Rimshot {K’}—
collectively replaced 43 stops but replaced 0 approximants and only 2 nasals and 10 fricatives.
No affricates were replaced at all in the data set. The K Snare {^K} replaced 16 stops but also 7
nasals, 8 fricatives, and 2 approximants. For comparison, Figure 124 breaks down the
replacements by intended speech segment, ordered from top to bottom: stops [p b t d k g],
nasals [m n] and [ŋ] (written as “ng”), fricatives [f v s z] and [ð] (written as “dh”), and
approximants [l j].
In a future study, it would be good to check if the non-replaced beatboxable sounds
have a uniform distribution or if stops are disproportionately high frequency across the
board. If many stops were in positions to be replaced by a beatboxing sound but were not
replaced, this finding would carry less weight. As of the time of writing, however, it was not
clear how to define which sounds in this song should be expected to be beatboxed; and as
this is the first major beatrhyming study, there was no precedent to draw from.
Figure 123. Counts of replacements by beatboxing sounds (bottom) against the manner of
articulation of the speech sound they replace (left).
Figure 124. Counts of replacements by beatboxing sounds (bottom) against the speech sound
they replace (left).
3.3 Repetition
3.3.1 Analysis 1: Unique measure count
The number of unique measures of beatboxing sound sequences in a 16-bar phrase indicates
how much overall repetition there is in that phrase. Sections B and D, the two 16-bar phrases
without lyrics (just beatboxing), had a combined total of just 3 unique measure-long
beatboxing sound sequences: the same three sound sequences were used over and over again.
Section C, the first beatrhymed verse, had 16 unique measures (no repeated measures), and
Section E, the second beatrhymed verse, had 13 unique measures (3 measures were repeated
once each). The beatrhymed sections therefore had far less repetition of measures than the
beatboxed sections. The unique sequences in each section are shown in Figure 125.
This is not to say that there was no repetition at all in the beatrhyming. Portions of
some beatboxed measures were repeated as subsets of some beatrhymed measures. The
beatboxed sequence A {B t t ^K}, for example, is also part of the beatrhymed sequences
D {B t t ^K ^K}, L {B t t ^K B}, and N {B t t ^K K’}; similarly, sequence F {B B ^K}
can also be found in sequences G {B B ^K t K’}, O {B B ^K B B}, U {B B ^K K’}, and W {t B B
^K}. But it turns out that even these subsequences are brief non-lyrical chunks within larger
beatrhyming sections, which means that the repetition of sequences here is not related to the
organization of constrictor-matching or -mismatching replacements. The {B t t} portions of
sequences L and N (and partly of D) are not attached to any beatrhymed lyrics, and the
{^K}s are not constrictor-matching. Likewise, the {B B} of F, G, O, and U also have no lyrics
and the {^K}s do not necessarily constrictor-match with the sound of the lyrics they replace.
3.3.2 Analysis 2: Repetition ratio
The complete set of two-bar lines for each of the four analyzed sections and their
corresponding repetition ratios are presented in Figure 126. The repetition ratios of
beatrhyming sections were much lower than the repetition ratios for beatboxing sections.
The repetition ratios for the beatboxed sections B & D are greater than the pairwise
corresponding repetition ratios for the beatrhymed sections C & E in all but one comparison
(31/32 comparisons). The mean repetition ratios calculated for verses C and E were 0.35
and 0.30, respectively, with a mean cross-section repetition ratio of 0.29. The mean repetition
ratios for the beatboxed sections B and D were 0.68 and 0.70, respectively, with a mean
cross-section repetition ratio of 0.96. The low repetition ratios for the beatrhymed sections
corroborate the observation from the unique measure count analysis that there is relatively
little repetition among beatboxing sounds during beatrhyming.
Figure 125. Four 16-bar beatboxing (sections B and D) and beatrhyming (sections C and E)
phrases with letter labels for each unique sound sequence. Only three measure types were
used between both beatboxing sections.
Section B - first beatboxing section
A: {B t t ^K}
B: {th B in B}
C: {B h ^K t t}
Section C - first verse
D: {B t t ^K ^K}
E: {PF ^K}
F: {B B ^K}
G: {B B ^K t K’}
H: {B t t B ^K K’ B}
I: {B ^K B t}
J: {K’ B ^K t B}
K: {B ^K}
L: {B t t ^K B}
M: {B t ^K K’}
N: {B t t ^K K’}
O: {B B ^K B B}
P: {B ^K K’ t}
Q: {t t ^K K’}
R: {B K’ K’ ^K}
S: {t ^K t}
Section D - second beatboxing section
A: {B t t ^K}
B: {th B in B}
C: {B h ^K t t}
Section E - second verse
T: {B t t B ^K}
U: {B B ^K K’}
V: {B ^K t}
F: {B B ^K}
A: {B t t ^K}
W: {t B B ^K}
F: {B B ^K}
X: {B t ^K B}
Y: {B ^K B}
Z: {B ^K B t PF}
Z: {B ^K B t PF}
K: {B ^K}
Y: {B ^K B}
AA: {PF ^K B}
BB: {t t ^K t}
CC: {B}
Section B Beatboxed A B A B A B A C’ A’ B A B A B A C
Section C Beatrhymed D E F G H I J K L M N O P Q R S
Section D Beatboxed A B A B A B A C’ A’ B A B A B A C’
Section E Beatrhymed T U V F A W F X Y Z Z K Y AA BB CC
Figure 126. Beat pattern display and repetition ratio calculations for sections B, C, D, and E.
Section B (first beatboxing section)
Beat | 1 1.5 2 2.5 3 3.5 4 4.5 1 1.5 2 2.5 3 3.5 4 4.5
| ------------------------------------------------------------------------------------------------------------
Line 1 | B t t ^K th B in B
Line 2 | B t t ^K th B in B
Line 3 | B t t ^K th B in B
Line 4 | B t t ^K B h ^K t t
|
Line 5 | B t t ^K th B in B
Line 6 | B t t ^K th B in B
Line 7 | B t t ^K th B in B
Line 8 | B t t ^K B h ^K t t
Section C (first beatrhymed verse)
Beat | 1 1.5 2 2.5 3 3.5 4 4.5 1 1.5 2 2.5 3 3.5 4 4.5
| ------------------------------------------------------------------------------------------------------------
Line 1 | B t t ^K ^K PF ^K
Line 2 | B B ^K B B ^K t K'
Line 3 | B t t B ^K K' B B ^K B t
Line 4 | K' B ^K t B B ^K
|
Line 5 | B t t ^K B B t ^K K'
Line 6 | B t t ^K K' B B ^K B B
Line 7 | B ^K K' t t t ^K K'
Line 8 | B K' K' ^K t ^K t
                 Adjacent pairs               Alternating pairs            Cross-group pairs
                 1&2   3&4   5&6   7&8       1&3   2&4   5&7   6&8       1&5   2&6   3&7   4&8
B (mean=0.68)    8/8   4/10  6/10  4/9       8/8   4/10  6/10  4/9       6/10  8/8   8/8   7/11
                 1.00  0.40  0.60  0.44      1.00  0.40  0.60  0.44      0.60  1.00  1.00  0.64
C (mean=0.35)    3/10  5/11  6/12  3/11      5/12  3/11  3/12  3/11      5/10  5/12  3/13  3/10
                 0.30  0.45  0.50  0.27      0.42  0.27  0.25  0.27      0.50  0.42  0.23  0.30
Section D (second beatboxing section)
Beat | 1 1.5 2 2.5 3 3.5 4 4.5 1 1.5 2 2.5 3 3.5 4 4.5
| ------------------------------------------------------------------------------------------------------------
Line 1 | B t t ^K th B in B
Line 2 | B t t ^K th B in B
Line 3 | B t t ^K th B in B
Line 4 | B t t ^K B h ^K t t
|
Line 5 | B t t ^K th B in B
Line 6 | B t t ^K th B in B
Line 7 | B t t ^K th B in B
Line 8 | B t t ^K B h ^K t t
Section E (second beatrhymed verse)
Beat | 1 1.5 2 2.5 3 3.5 4 4.5 1 1.5 2 2.5 3 3.5 4 4.5
| ------------------------------------------------------------------------------------------------------------
Line 1 | B t t B ^K B B ^K K'
Line 2 | B ^K t B B ^K
Line 3 | B t t ^K t B B ^K
Line 4 | B B ^K B t ^K B
|
Line 5 | B ^K B B ^K B t PF
Line 6 | B ^K B t PF B ^K
Line 7 | B ^K B PF ^K B
Line 8 | t t ^K t B
                 Adjacent pairs               Alternating pairs            Cross-group pairs
                 1&2   3&4   5&6   7&8       1&3   2&4   5&7   6&8       1&5   2&6   3&7   4&8
D (mean=0.70)    8/8   4/10  6/10  4/10      8/8   4/10  6/10  4/10      6/10  8/8   8/8   9/9
                 1.00  0.40  0.60  0.40      1.00  0.40  0.60  0.40      0.60  1.00  1.00  1.00
E (mean=0.30)    5/10  2/11  3/11  2/7       5/12  2/10  3/10  2/9       4/12  3/9   3/10  2/9
                 0.50  0.18  0.27  0.29      0.42  0.20  0.30  0.22      0.33  0.33  0.30  0.22

                     Cross-section pairs
                     1&1   2&2   3&3   4&4   5&5   6&6   7&7   8&8
B & D (mean=0.96)    8/8   8/8   8/8   9/9   8/8   8/8   8/8   7/11
                     1.00  1.00  1.00  1.00  1.00  1.00  1.00  0.64
C & E (mean=0.29)    5/9   5/9   4/15  2/10  2/12  3/12  2/10  1/9
                     0.56  0.56  0.27  0.20  0.17  0.25  0.20  0.11
4. Discussion
The analysis above investigated whether beatboxing and speech do (the main hypothesis) or
do not (the null hypothesis) interact during beatrhyming in a way that supports both speech
and beatboxing tasks being achieved. The results provide evidence for the main hypothesis.
Speech tasks are achieved, in a local sense, in beatrhyming by generally selecting
replacement beatboxing sounds that match the speech segment in vocal tract constrictor and
manner/constriction degree. This presumably serves to help the more global task of
communicating the speech message. But achieving the speech task comes at the cost of
inconsistent beat patterns during lyrical beatrhyming. Theoretically, both the speech task
and the beatboxing repetition task could have been achieved by careful selection of lexical
items whose speech-compatible beatboxing replacement sound would also satisfy repetition,
but this did not happen. Thus, beatboxing sounds are generally selected in such a way as to
optimize speech task achievement, but lexical items are not being selected so as to optimize
beatboxing repetition. That said, the task demands of other aspects of beatboxing do affect
beatboxing sound selection—this is the inviolable use of K Snares {^K} on beat 3 of each
measure to establish the fundamental musical rhythm, even at the expense of the dorsal
constriction of the K Snare not matching the constriction of the intended speech sound it
replaces. Thus the tasks do interact such that one or the other task achievement has priority
over the other at different moments in time.
4.1 Task interaction
Beatrhyming is the union of a beatboxing system and a speech system. Each system is
goal-oriented, defined by aesthetic tasks related to the musical genre, communicative tasks,
motor efficiency, and other tasks. These tasks act as forces that shape the organization of the
sounds of speech, beatboxing, and beatrhyming.
Ultimately, a core interest in the study of speech sounds is to understand how forces
like these influence speech. When answering questions of why sounds in a language pattern
a particular way, we turn to explanations of effective message transmission and motor
efficiency almost axiomatically. But until we understand how these tasks manifest under a
wider variety of linguistic behaviors, we will not have a full sense of the speech system’s
flexibility or limitations. To that end, the contribution of this chapter is to show how the goal
of message transmission is active in the linguistic behavior of beatrhyming: it is satisfied during
beatrhyming sound replacements by matching the constrictor of an intended speech sound
and the beatboxing sound replacing it, and dissatisfied when aesthetic beatboxing tasks take
priority on the back beat.
To close, section 4.2 demonstrates one way this musical linguistic behavior can
impact phonological theory by briefly introducing a simple phonological model of
beatrhyming.
4.2 Beatrhyming phonology
The results show that when speech and beatboxing are interwoven in beatrhyming, the
selection of beatboxing sounds to replace a speech sound is generally constrained by the
intended speech task and overrides the constraints of the beatboxing task, except in one
environment (beat 3) in which the opposite is true. Given that the selection of lexical items
does not appear to be sensitive to location in the beatboxing structure, the achievement of
both tasks simultaneously is not possible. The resulting optimization can therefore be
modeled by ranking the speech and beatboxing tasks differently in different environments,
which is exactly what Optimality Theory (Prince & Smolensky, 1993/2004) has been
designed to do.
In Optimality Theory, ranked constraints guide the prediction of a surface output
representation from an underlying input representation. The representations and constraints
used in Optimality Theory are designed specifically to operate in the domain of speech and
phonology, so representations and constraints involving beatboxing sounds are not
appropriate for a typical phonological model. This approach assumes that this grammar
specialized for beatrhyming exists separately from grammars specialized for speech or
beatboxing but draws on the representations from both systems—that is, speech and
beatboxing representations are the same as they would be in a speech or beatboxing
phonology, but the constraints and their rankings are different from any other domain. Based
on this chapter’s interpretation that beatboxing sounds replace speech sounds in
beatrhyming, the grammar takes speech representations as inputs and returns surface forms
composed of both beatboxing and speech representations as output candidates. For the
purposes of this simple illustration, the computations are restricted to the selection of a
single beatboxing sound that replaces a single speech segment. (Presumably there are
higher-ranking constraints that determine which input speech segment representations
should be replaced by a beatboxing sound in the output.)
Because the analysis requires reference to the metrical position of a sound, input
representations are tagged with the associated beat number as a subscript. The input /b₃/,
for example, symbolizes a speech representation for a voiced bilabial stop on the third beat
of a measure. Output candidates are marked with the same beat number as the
corresponding input; the input-output pairs /b₃/ ~ {B₃} and /b₃/ ~ {^K₃} are both possible
in the system because they share the same subscript, but the input-output pair /b₃/ ~ {B₂} is
never generated as an option because the input and output have different subscripts. We can
use two loosely defined constraints:
*BackbeatWithoutSnare - Assign a violation to outputs on beat three ({X₃})
that are not snares.
*PlaceMismatch - Assign a violation to an output whose Place feature does not
match the Place feature of the corresponding input.
(“Place” feature corresponds to the abstract conception of the constrictor: labial, coronal, and
dorsal.) The tableaux in Figures 127 and 128 demonstrate how possible input-output pairs
like the ones just introduced might be selected differently by the grammar depending on the
beat associated with the input sound. *BackbeatWithoutSnare is ranked above
*PlaceMismatch to ensure that beat 3 always has a K Snare. Given an input voiced bilabial
stop on beat 3 /b₃/ in Figure 127, the output candidate {B₃} is constrictor-matched to the
input and satisfies *PlaceMismatch but violates high-ranking *BackbeatWithoutSnare; the
alternative output {^K₃} violates *PlaceMismatch, but is a more optimal candidate than {B₃}
based on this constraint ranking. On the other hand, for an input /b₁/ which represents a
voiced bilabial stop on beat 1, the constrictor-matched candidate {B₁} violates no constraints
and harmonically bounds {^K₁}, which violates *PlaceMismatch (Figure 128).
Figure 127. Tableau in which a speech labial stop is replaced by a K Snare on the back beat.

    /b₃/           *BackbeatWithoutSnare   *PlaceMismatch
    a.    {B₃}     *!
    b. ☞ {^K₃}                             *
Figure 128. Tableau in which a speech labial stop is replaced by a Kick Drum off the back
beat.

    /b₁/           *BackbeatWithoutSnare   *PlaceMismatch
    a. ☞ {B₁}
    b.    {^K₁}                            *!
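The two tableaux can be emulated with a small constraint-evaluation sketch. The encoding here is hypothetical and not part of the analysis: candidates are (sound, beat) pairs, constraints are functions returning violation counts, and strict ranking is implemented as lexicographic comparison of violation profiles:

```python
# Place coding for the two sounds used in the tableaux; /b/ is the
# intended speech stop, {B} the Kick Drum, {^K} the K Snare.
PLACE = {'b': 'labial', 'B': 'labial', '^K': 'dorsal'}

def backbeat_without_snare(inp, out):
    sound, beat = out
    return int(beat == 3 and sound != '^K')   # only the K Snare counts as a snare here

def place_mismatch(inp, out):
    return int(PLACE[inp[0]] != PLACE[out[0]])

# Ranked order: *BackbeatWithoutSnare dominates *PlaceMismatch.
CONSTRAINTS = [backbeat_without_snare, place_mismatch]

def optimal(inp, candidates):
    # Lexicographic comparison of violation profiles implements strict ranking.
    return min(candidates, key=lambda c: [con(inp, c) for con in CONSTRAINTS])

optimal(('b', 3), [('B', 3), ('^K', 3)])   # -> ('^K', 3), as in Figure 127
optimal(('b', 1), [('B', 1), ('^K', 1)])   # -> ('B', 1), as in Figure 128
```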
This phonological formalism is simple but effective: just these two constraints produce the
desired outcome for 95% (84/88) of the replacements in this data set. The remaining 5%
described in section 3.1.3 may be accounted for either by additional constraints designed to
fit more specific conditions, by a related but more complicated MaxEnt model (Hayes &
Wilson, 2008), or by gradient symbolic representations (Smolensky et al., 2014) that permit
more flexibility in the input-output place relationships. It is with this optimism in mind that
we suggest below two reasons not to use symbolic representations in models of beatrhyming:
the arbitrariness of speech-beatboxing feature mappings and the impossibility of splitting an
atomic unit.
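The two-constraint evaluation just described can be sketched computationally. The sketch below is illustrative only, not part of the dissertation's analysis; the candidate field names (place, beat, is_snare) are invented for the example. Strict constraint ranking is implemented by comparing violation profiles lexicographically.

```python
# Minimal sketch of the two-constraint OT evaluation described above.
# Field names (place, beat, is_snare) are illustrative, not the
# dissertation's notation.

def backbeat_without_snare(inp, out):
    # *BackbeatWithoutSnare: violated by a beat-3 output that is not a snare.
    return 1 if out["beat"] == 3 and not out["is_snare"] else 0

def place_mismatch(inp, out):
    # *PlaceMismatch: violated when input and output Place features differ.
    return 1 if inp["place"] != out["place"] else 0

# Strict ranking: *BackbeatWithoutSnare >> *PlaceMismatch.
RANKED = [backbeat_without_snare, place_mismatch]

def optimal(inp, candidates):
    # Lexicographic comparison of violation profiles implements strict
    # ranking: one violation of a higher constraint outweighs any number
    # of violations of lower ones.
    return min(candidates, key=lambda out: [c(inp, out) for c in RANKED])

B = {"name": "B", "place": "labial", "is_snare": False}   # Kick Drum
K = {"name": "^K", "place": "dorsal", "is_snare": True}   # K Snare

for beat in (3, 1):
    inp = {"place": "labial", "beat": beat}
    cands = [dict(c, beat=beat) for c in (B, K)]
    print(beat, optimal(inp, cands)["name"])  # beat 3 -> ^K, beat 1 -> B
```

Running the loop reproduces both tableaux: the K Snare wins on beat 3 despite its place mismatch, and the constrictor-matched Kick Drum wins on beat 1.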
In most symbolic models of phonology, the vocal constriction plan executed by the
motor system is not part of a phonological representation. The purpose of a phonological
place feature like [labial] (or if not privative, [±labial]) is to encode linguistic information,
and that information is defined by the feature’s contrastive relationship to other features
within the same linguistic system. Different phonological theories propose stronger or
weaker associations between a mental representation like [labial] and the physical lips
themselves, but there is an inherent duality that separates abstract phonological
representations from the concrete phonetic constrictors that implement them.
It is not clear what a mental representation of beatboxing should look like—especially
compared to speech representations—because beatboxing sounds do not encode contrastive
meaning. But say that a language-like beatboxing {labial} feature did exist, defined according
to some iconic relationship with other beatboxing features and, like a linguistic [labial]
feature, associated to some degree with physical constriction of the lips. This {labial}
beatboxing feature and [labial] phonological feature would have no meaningful
correspondence or inherent connection because they would be defined by completely
different information-bearing roles within their respective systems. Mapping the abstract
feature [labial] to {labial} would be arbitrary and just as computationally efficient as mapping
[labial] to {dorsal} or {ingressive}. The only reason to map [labial] to {labial} is that
they share an association to the physical lips. But in that case, the crux of the mapping—the
only property shared by both units—is a phonetic referent; the abstract symbolic units
themselves are moot. Given that the model is intended to be a phonological one, it seems
undesirable for the phonological units to have less importance than their phonetic output.
The second issue with symbols is that they are notoriously static, existing invariantly
outside of real time. When timing must be encoded in symbolic approaches, the
representations are laid out either in sequence or in other timing slots like autosegmental
tiers (Goldsmith, 1976). And segments are temporally indivisible—they cannot start at one
time, pause for a bit, then pick up again where they left off. This is not a problem for
phonological models of talking or many other varieties of speech, but Figure 129 illustrates a
beatrhyming example of precisely this kind of split-segment behavior. In this case, the word
“move” [muv] is pronounced [mu]{B}[uv], with a Kick Drum temporarily interrupting the
[u] vowel. The same phenomenon is shown in Figure 130 with the word “sky” pronounced as
[skak͡ʟ̝̊↓a] (the canonical closure to [i] is not apparent in the spectrogram). Figure 112 from
the beginning of this chapter shows a related example of the [i] in “dopamine” prematurely
cut off in the pronunciation of the word as {t}[o]{B}[əmi]{^K}[n]. These cases of beatboxing
sounds that interrupt speech sounds are impossible to represent in a symbolic phonological
model because in many cases they would require splitting an indivisible representation into
two parts to achieve the appropriate output representation.
Even theories that permit a certain amount of intra-segment temporal flexibility
struggle with beatrhyming interruptions. Q-Theory (Shih & Inkelas 2014, 2018) may come
the closest: it innovates on traditional segments by splitting them into three quantal
sub-segmental pieces. These sub-segments roughly correspond articulatorily to the onset of
movement, target achievement, and constriction release for a given sound, and are especially
useful for representing a sound that has complex internal structure like a triphthong or a
three-part tone contour. It would be possible to represent the /u/ in “move” /muv/ as having
three sub-segmental divisions [u] [u] [u]. But based on our understanding of Q-Theory, it is
not possible to replace the middle sub-segment [u] with an entire and entirely different
segment {B}. Given enough time, it is inevitable that someone will devise some phonetic
implementation rules or a different flavor of symbolic representation that generates these
kinds of interruptions. In the meantime, we consider these interruptions and the
speech-beatboxing constrictor mapping discussed earlier as evidence against symbolic units
and in favor of gestural units as described next.
Articulatory Phonology is the hypothesis that the fundamental units of language are
action units, called “gestures” (Browman & Goldstein, 1986, 1989). Unlike symbolic features
which are time-invariant and only reference the physical vocal tract abstractly (if at all),
gestures as phonological units are spatio-temporal entities with deterministic and directly
observable consequences in the vocal tract. Phonological phenomena that are stipulated
through computational processes in other models emerge in Articulatory Phonology from
the coordination of gestures in an utterance. Gestures are commonly defined as dynamical
systems in the framework of task dynamics (Browman & Goldstein, 1989; Saltzman &
Munhall, 1989). While a gesture is active, it exerts control over a vocal tract variable (e.g., lip
aperture) to accomplish some linguistic task (e.g., a complete labial closure for the
production of a labial stop) as specified by the parameters of the system.
Constrictor-matching emerges from a gestural framework because gestures are
defined by the vocal tract variable—and ultimately, the constrictor—they control. Gestures
are motor plans that leverage and tune the movement potential of the vocal articulators for
speech-specific purposes, but speech gestures are not the only action units that can control
the vocal tract. The vocal tract variables used for speech purposes are publicly available to
any other system of motor control, including beatboxing. This allows for a non-arbitrary
relationship between the fundamental phonological units of speech and beatboxing: a speech
unit and a beatboxing unit that both control lip aperture are inherently linked in a
beatboxing grammar because they control the same vocal tract variable.
Figure 129. Waveform, spectrogram, and text grid of the beatrhymed word “move” with a
Kick Drum splitting the vowel into two parts.
Figure 130. Waveform, spectrogram, and text grid of the beatrhymed word “sky” with a K
Snare splitting the vowel into two parts.
The cases in which a beatboxing sound temporarily interrupts a vowel can be modeled in
task dynamics with a parameter called gestural blending strength. When two gestures that
use the same constrictor overlap temporally, the movement plan during that time period
becomes the average of the two gestures’ spatial targets (and their time constants or
stiffness) weighted by their relative blending strengths. A stronger gesture exerts more
influence, and a gesture with very high relative blending strength will effectively override any
co-active gestures. For beatrhyming, the interrupting beatboxing sounds could be modeled as
having sufficiently high blending strength that the vowels they co-occur with are overridden
by the beatboxing sound; when the gestures for a beatboxing sound end, control of the vocal
tract returns solely to the vowel gesture. The Gestural Harmony Model (Smith, 2018) uses a
similar approach to account for transparent segments in phonological harmony.
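The blending computation described above can be sketched numerically. This is an illustrative toy, not the task-dynamics implementation; the targets and strengths below are hypothetical numbers chosen only to show how a very high blending strength lets one gesture override another.

```python
# Illustrative sketch of gestural blending: when gestures on the same
# tract variable overlap, the effective target is the weighted average
# of their spatial targets, weighted by blending strength. Numbers are
# hypothetical, not measured values.

def blended_target(gestures):
    """gestures: list of (spatial_target, blending_strength) pairs
    currently active on one vocal tract variable (e.g., lip aperture)."""
    total = sum(strength for _, strength in gestures)
    return sum(target * strength for target, strength in gestures) / total

vowel = (12.0, 1.0)    # hypothetical lip-aperture target for the vowel
kick = (0.0, 100.0)    # Kick Drum closure with very high blending strength

print(blended_target([vowel]))         # vowel alone controls the lips: 12.0
print(blended_target([vowel, kick]))   # ~0.12: the Kick Drum effectively overrides
```

When the Kick Drum's gestures end, it drops out of the active list and the blended target returns to the vowel's value, matching the description of control returning solely to the vowel gesture.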
5. Conclusion
Vocal music is a powerful lens through which to study speech, offering insights about speech
that may not be accessible from studies of talking. Beatrhyming in particular demonstrates
how the fundamental units of speech can interact with the units of a completely different
behavior—beatboxing—in a complex but organized way. When combined with speech, the
aesthetic goals of musical performance lead to sound patterns that push the limits of
phonological theory and may even cause widely accepted paradigms to break down. This is
the advantage to be gained by building and testing theories based on insights from a more
diverse set of linguistic behaviors.
CHAPTER 8: CONCLUSION
This dissertation applied linguistic methods to an analysis of beatboxing and discovered that
beatboxing has a unit-level phonology rooted in the same types of fundamental mental
representations and organization as the phonology of speech, while embedded in a
performance task whose metrical structure is governed by musical organization principles.
Chapter 3: Sounds argued that beatboxing sounds have meaning and word-like frequency.
Each sound is composed combinatorially from a reusable set of constrictions; because the
sounds have meaning, these constrictions are contrastive—changing a constriction usually
changes the meaning of a sound. This contrastiveness resembles the contrastive organization
of speech sounds within a language. But just like in speech, not every articulatory change is a
contrastive one. Chapter 5: Alternations shows that the Kick Drum and PF Snare, and
perhaps also the Closed Hi-Hat, have different phonetic manifestations depending on their
context: they are glottalic egressive in most contexts, but percussive (made with a tongue
body closure and no glottalic airflow initiation) when performed in proximity to other sounds.
Chapter 6: Harmony shows that these alternations are—like so often in speech—the result of
multiple constrictions overlapping temporally. Here the contrastive airstreams from Chapter
3: Sounds participate actively as triggers, undergoers, and blockers in a process akin to
phonological harmony.
Taken together, the combinatorial contrastiveness of the constrictions that form
beatboxing sounds, the context-dependent alternations of beatboxing sounds, and the
class-based patterning of beatboxing sounds based on their combinatorial constrictions all
indicate that beatboxing has a phonology rooted in the same types of fundamental mental
representations and organization as linguistic phonology. These representations are united
with music cognition through rhythmic patterns, metrical organization, and sound classes
with patterning based on musical function (i.e., regularly placing snare-category sounds in
specific metrical positions).
As discussed in Chapter 1: Introduction, the interaction and overlap of different
cognitive pieces is related to a question of domain specificity, a topic which is sometimes
related to a theoretical dichotomy of modular cognition versus integrated cognition. The
finding that beatboxing exhibits signs of phonological cognition indicates that the
fundamental structure of phonology is not domain-specific. Furthermore, the phonological
foundations of both beatboxing and speech (see below) collaborate with aspects of music
cognition, which indicates that the building blocks of different domains superimpose onto
each other in task-specific ways to create each vocal behavior. This can be accounted for in
both modular and integrated approaches to cognition. A story consistent with a modular
approach to cognition is that beatboxing takes mental representations and grammar from
speech, combines them with musical meaning and metrical organization, and thereby adapts
them to a new use. Borrowing representations enables beatboxing to create phonological
contrasts and to use natural classes as the currency of productive synchronic processes. A
different story consistent with a more integrated approach to cognition is that beatboxing
and phonology both, somewhat independently, are shaped by the interaction of the
capabilities of the vocal tract they share, the recruitment of some domain-general
computations (i.e., combinatorial mental units), and their respective communicative or
aesthetic tasks. Regardless of the interpretation, the inescapable result is that linguistic
phonology is not particularly unique: beatboxing and speech share the same vocal tract and
organizational foundations, including mental representations and coordination of those
representations.
Beatboxing has phonological behavior based in phonological units and organization.
One could choose to model beatboxing with adaptations of either features or gestures as its
fundamental units, and that choice of unit can serve a story of modular cognition or of
integrated cognition. But as Chapter 4: Theory discusses, gestures have the distinction of
explicitly connecting the tasks specific to speech or to beatboxing with the sound-making
potential of the vocal substrate they share, which in turn creates a direct link between speech
gestures and beatboxing gestures. This link is formalized at the graph level of the dynamical
systems by which gestures are defined. The analysis of the graph-level theoretical embedding
in this dissertation focused on individual beatboxing units, their temporal coordination,
and their paradigmatic organization. Future work could formalize the link between speech
and musical prosodic, hierarchical, metrical structure as a different part of the graph level, in
order to better capture the ability of the phonological unit system to integrate in different
ways with music cognition.
The direct formal link between beatboxing and speech units makes predictions about
what types of phonological phenomena beatboxing and speech units are able to
exhibit—including the phonological properties described above. These predictions are borne
out insofar as beatboxing and speech phonological phenomena are both able to be accounted
for by the same theoretical mechanisms (e.g., intergestural timing and inhibition). Moreover,
it predicts that the phonological units of the two domains will be able to co-occur as they do
in Chapter 7: Beatrhyming, where phenomena that are challenging or impossible to
represent with symbolic units are easily represented using gestures.
These advantages of the gestural approach for describing speech, beatboxing, and
beatrhyming underscore a broader point: that regardless of whether phonology is modular or
not, the phonological system is certainly not encapsulated away from other cognitive
domains, nor impermeable to connections with other domains. On the contrary,
phonological units are intrinsically related to beatboxing units—and, presumably, to other
units in similar systems. This appears to fly in the face of conventional wisdom about
phonological units: at least as early as Sapir (1925), phonological units have been defined
exclusively by their psychological linguistic role—by their relationships with each other and
their synchronic patterning, but often without reference to the phonetic or social aspects of
their manifestation and certainly without ties to non-linguistic domains. But the gestural approach
allows phonological units to have domain-specific meaning within their own system while
sharing a domain-general conformation with other behaviors.
The attributes that phonology shares with other domains allow it to manifest
flexibly—to be recruited into a multitude of speech behaviors while robustly fulfilling its
primary directives (e.g., communicating a linguistic message). This is different from, say, the
sensory processing involved in auditory spatial localization, which is arguably a module in the
strongest sense—automatic, innate, and not (so far as we know) able to be tapped into for
different purposes by conscious cognitive thought (Liberman & Mattingly, 1989). Instead, the
conversational or laboratory-style speech that is the subject of the bulk of phonological
research is continuous with many other speech behaviors and at different levels of
phonological structure. Prosodically, conversational speech is continuous with poetry,
rapping, chanting, and singing: just a few small adjustments to rhythm or intonation
transform conversational speech into any of an abundance of genres of vocal linguistic art. A
non-musical speech utterance can even come to be perceived as musical when it is repeated a
few times (the speech-to-song illusion; Deutsch et al., 2011). Speech modality is not limited to
the typically-studied arrangement of vocal articulators: surrogate speech like talking drums
(Beier, 1954; Akinbo, 2019), xylophones (McPherson, 2018), and whistle speech (Rialland,
2005) shift phonological expression to new sound systems which are often integrated with
musical structure. And phonological units and grammar are not only used in speech
contexts—scat singing is utterly non-linguistic but follows phonological restrictions anyway
(Shaw, 2008). And as beatrhyming shows, the conformation of the most elemental
phonological units affords connections to similar units in beatboxing.
These different speech behaviors are collaborations between speech tasks and other
non-linguistic (e.g., musical) tasks, well-organized to maximize the satisfaction of all tasks
involved (or at least to minimize dissatisfaction). For vocal behaviors, these interactions are
constrained by the vocal substrate in which all of the tasks are active. In singing,
conversational speech prosody cannot manifest at the same time as sung musical melody
because they both require use of the larynx. Sustaining a note during a song therefore
requires selecting between a musical and speech-prosodic pitch and rhythm; but the
contrastive information and structure of the speech sound units are unperturbed—syllable
structure, sound selection, and relative sound order largely remain intact because they do not
compete with melody or rhythm. In some cases there is also text-to-tune alignment, where
musical pitch and rhythm reflect the likely prosody of the utterance if it had been spoken
non-musically (Hayes & Kaun, 1996). Similar text-to-tune alignment is active in languages
with lexical tone, with tone contours exerting greater influence on the musical melody to
avoid producing unintended tones (Schellenberg, 2013; McPherson & Ryan, 2018). And in
beatrhyming, the speech and beatboxing tasks share the vocal tract through a relationship
that leverages their shared vocal apparatus to maximize their compatibility when possible
through constrictor matching.
In short, flexibility is a defining characteristic of the phonological system. If there is
anything special about speech, it is the speech tasks themselves and how they leverage all of
human vocal potential to flexibly produce these different behaviors. This is consistent with
an anthropophonic perspective of linguistic inquiry, initially framed by Catford (1977) and
Lindblom (1990) as an ideology for non-circularly defining and explaining which sounds
could be possible speech sounds. It is a deductive approach to explaining speech phenomena
as the result of an interaction between the tasks of speech—in Lindblom (1990), “selection
constraints”—and the total sound-making potential of the vocal tract. With respect to the
question of “What is a possible speech sound?”, the anthropophonic perspective re-frames
the question as “How do the tasks of speech filter the whole vocal sound-making potential
into a smaller, possibly finite set of speech sounds?” (Figure 131). As discussed in Chapter 4:
Theory, gestures as phonological units are in a sense a formalization of the anthropophonic
perspective.
Figure 131. The anthropophonic perspective.
In light of the clear flexibility of the phonological system, however, it must be emphasized
that the selection constraints are not only the tasks of speech. There are many musical and
other non-linguistic tasks which shape behavior too—not to mention the social and affective
forces that incessantly impact speech production and phonological variation. A robust
account of phonology needs to be able to explain how the phonological system interacts
with these other forces via both their shared structures and their shared vocal substrate.
REFERENCES
Abbs, J. H., Gracco, V. L., & Cole, K. J. (1984). Control of Multimovement Coordination.
Journal of Motor Behavior, 16(2), 195–232.
https://doi.org/10.1080/00222895.1984.10735318
Abler, W. (1989). On the particulate principle of self-diversifying systems. Journal of Social
and Biological Systems, 12(1), 1–13. https://doi.org/10.1016/0140-1750(89)90015-8
Akinbo, S. (2019). Representation of Yorùbá Tones by a Talking Drum: An Acoustic Analysis.
Linguistique et Langues Africaines, 5, 11–23. https://doi.org/10.4000/lla.347
Anderson, S. R. (1981). Why Phonology Isn’t “Natural.” Linguistic Inquiry, 12(4), 493–539.
Archangeli, D., & Pulleyblank, D. (2015). Phonology without universal grammar. Frontiers in
Psychology, 6. https://www.frontiersin.org/article/10.3389/fpsyg.2015.01229
Archangeli, D., & Pulleyblank, D. (2022). Emergent phonology (Volume 7). Language Science
Press. https://doi.org/10.5281/zenodo.5721159
Ball, M. J., Esling, J. H., & Dickson, B. C. (2018). Revisions to the VoQS system for the
transcription of voice quality. Journal of the International Phonetic Association, 48(2),
165–171. https://doi.org/10.1017/S0025100317000159
Ball, M. J., Esling, J., & Dickson, C. (1995). The VoQS System for the Transcription of Voice
Quality. Journal of the International Phonetic Association, 25(2), 71–80.
https://doi.org/10.1017/S0025100300005181
Ball, M. J., Howard, S. J., & Miller, K. (2018). Revisions to the extIPA chart. Journal of the
International Phonetic Association, 48(2), 155–164.
https://doi.org/10.1017/S0025100317000147
Ballard, K. J., Robin, D. A., & Folkins, J. W. (2003). An integrative model of speech motor
control: A response to Ziegler. Aphasiology, 17(1), 37–48.
https://doi.org/10.1080/729254889
Baudouin de Courtenay, J. (1972). Selected Writings of Baudouin de Courtenay. E.
Stankiewicz (Ed.). Bloomington: Indiana University Press.
Beale, J. M., & Keil, F. C. (1995). Categorical effects in the perception of faces. Cognition,
57(3), 217–239. https://doi.org/10.1016/0010-0277(95)00669-X
Beier, U. (1954). The talking drums of the Yoruba. African Music: Journal of the International
Library of African Music, 1(1), 29–31.
Bidelman, G. M., Gandour, J. T., & Krishnan, A. (2011). Cross-domain Effects of Music and
Language Experience on the Representation of Pitch in the Human Auditory Brainstem.
Journal of Cognitive Neuroscience, 23(2), 425–434. https://doi.org/10.1162/jocn.2009.21362
Blaylock, R. (2021). VocalTract ROI Toolbox. Available online at
https://github.com/reedblaylock/VocalTract-ROI-Toolbox.
https://zenodo.org/badge/latestdoi/98065485
Blaylock, R., & Phoolsombat, R. (2019). Beatrhyming probes the nature of the interface
between phonology and beatboxing. The Journal of the Acoustical Society of America,
146(4), 3081–3081. https://doi.org/10.1121/1.5137696
Blaylock, R., Patil, N., Greer, T., & Narayanan, S. S. (2017). Sounds of the Human Vocal Tract.
INTERSPEECH, 2287–2291. https://doi.org/10.21437/Interspeech.2017-1631
Boersma, P., & Weenink, D. (1992–2022). Praat: Doing phonetics by computer (6.1.13)
[Computer software]. https://www.praat.org
Boersma, P. (2001). Praat, a system for doing phonetics by computer. Glot International,
5(9/10), 341–345.
Bořil, T., & Skarnitzl, R. (2016). Tools rPraat and mPraat. In P. Sojka, A. Horák, I. Kopeček, &
K. Pala (Eds.), Text, Speech, and Dialogue (Vol. 9924, pp. 367–374). Springer International
Publishing. https://doi.org/10.1007/978-3-319-45510-5_42
Bresch, E., Nielsen, J., Nayak, K., & Narayanan, S. (2006). Synchronized and noise-robust
audio recordings during realtime magnetic resonance imaging scans. The Journal of the
Acoustical Society of America, 120(4), 1791–1794. https://doi.org/10.1121/1.2335423
Browman, C. P., & Goldstein, L. (1986). Towards an Articulatory Phonology. Phonology
Yearbook, 3, 219–252.
Browman, C. P., & Goldstein, L. (1988). Some notes on syllable structure in articulatory
phonology. Phonetica, 45(2–4), 140–155.
Browman, C. P., & Goldstein, L. (1989). Articulatory gestures as phonological units.
Phonology, 6(2), 201–251. https://doi.org/10.1017/S0952675700001019
Browman, C. P., & Goldstein, L. (1991). Gestural Structures: Distinctiveness, Phonological
Processes, and Historical Change. In I. G. Mattingly & M. Studdert-Kennedy (Eds.),
Modularity and the Motor Theory of Speech Perception: Proceedings of a Conference to
Honor Alvin M. Liberman (pp. 313–338).
Browman, C. P., & Goldstein, L. (1992). Articulatory Phonology: An Overview. Phonetica,
49(3–4), 155–180. https://doi.org/10.1159/000261913
Browman, C. P., & Goldstein, L. (1995). Gestural Syllable Position Effects in American
English. In F. Bell-Berti & L. J. Raphael (Eds.), Producing Speech: Contemporary Issues:
For Katherine Safford Harris. AIP Press: New York.
Byrd, D., & Saltzman, E. (1998). Intragestural dynamics of multiple prosodic boundaries.
Journal of Phonetics, 26(2), 173–199. https://doi.org/10.1006/jpho.1998.0071
Byrd, D., & Saltzman, E. (2003). The elastic phrase: Modeling the dynamics of
boundary-adjacent lengthening. Journal of Phonetics, 31(2), 149–180.
https://doi.org/10.1016/S0095-4470(02)00085-2
Catford, J. C. (1977). Fundamental problems in phonetics. Indiana University Press.
Chomsky, N., & Halle, M. (1968). The Sound Pattern of English. Harper & Row.
Clements, G. N. (2003). Feature economy in sound systems. Phonology, 20(3), 287–333.
https://doi.org/10.1017/S095267570400003X
Cohn, A. C. (2007). Phonetics in Phonology and Phonology in Phonetics. Working Papers of
the Cornell Phonetics Laboratory, 16, 1–31.
Collins, J. (2017). Faculties and Modules: Chomsky on Cognitive Architecture. In J.
McGilvray (Ed.), The Cambridge Companion to Chomsky (2nd ed., pp. 217–234).
Cambridge University Press. https://doi.org/10.1017/9781316716694.011
Coltheart, M. (1999). Modularity and cognition. Trends in Cognitive Sciences, 3(3), 115–120.
https://doi.org/10.1016/S1364-6613(99)01289-9
Cooke, J. D. (1980). The Organization of Simple, Skilled Movements. In G. E. Stelmach & J.
Requin (Eds.), Advances in Psychology (Vol. 1, pp. 199–212). North-Holland.
https://doi.org/10.1016/S0166-4115(08)61946-9
Cummins, F., & Port, R. (1998). Rhythmic constraints on stress timing in English. Journal of
Phonetics, 26(2), 145–171. https://doi.org/10.1006/jpho.1998.0070
Danner, S. G., Krivokapić, J., & Byrd, D. (2019). Co-speech movement behavior in
conversational turn-taking. The Journal of the Acoustical Society of America, 146(4),
3082–3082.
Dehais-Underdown, A., Buchman, L., & Demolin, D. (2019, August). Acoustico-Physiological
coordination in the Human Beatbox: A pilot study on the beatboxed Classic Kick Drum.
19th International Congress of Phonetic Sciences.
https://hal.archives-ouvertes.fr/hal-02284132
Dehais-Underdown, A., Vignes, P., Buchman, L. C., & Demolin, D. (2020). Human
Beatboxing: A preliminary study on temporal reduction. Proceedings of the 12th
International Seminar on Speech Production (ISSP), 142–145.
Dehais-Underdown, A., Vignes, P., Crevier-Buchman, L., & Demolin, D. (2021). In and out:
Production mechanisms in Human Beatboxing. 060005. https://doi.org/10.1121/2.0001543
Deutsch, D., Henthorn, T., & Lapidis, R. (2011). Illusory transformation from speech to song.
The Journal of the Acoustical Society of America, 129(4), 2245–2252.
https://doi.org/10.1121/1.3562174
Diehl, R. L. (1991). The Role of Phonetics within the Study of Language. Phonetica, 48(2–4),
120–134. https://doi.org/10.1159/000261880
Diehl, R. L., & Kluender, K. R. (1989). On the Objects of Speech Perception. Ecological
Psychology, 1(2), 121–144. https://doi.org/10.1207/s15326969eco0102_2
Dresher, B. E. (2011). The Phoneme. In The Blackwell companion to phonology (pp.
241–266).
Drum tablature. (2022). In Wikipedia.
https://en.wikipedia.org/w/index.php?title=Drum_tablature&oldid=1085945913
DrumTabs—DRUM TABS. (n.d.). Retrieved June 3, 2022, from http://www.drumtabs.org/
Duckworth, M., Allen, G., Hardcastle, W., & Ball, M. (1990). Extensions to the International
Phonetic Alphabet for the transcription of atypical speech. Clinical Linguistics &
Phonetics, 4(4), 273–280. https://doi.org/10.3109/02699209008985489
Dunbar, E., & Dupoux, E. (2016). Geometric Constraints on Human Speech Sound
Inventories. Frontiers in Psychology, 7.
https://www.frontiersin.org/article/10.3389/fpsyg.2016.01061
Eklund, R. (2008). Pulmonic ingressive phonation: Diachronic and synchronic
characteristics, distribution and function in animal and human sound production and in
human speech. Journal of the International Phonetic Association, 38(3), 235–324.
https://doi.org/10.1017/S0025100308003563
Episode 4 | When Art Meets Therapy. (2019, March 23).
https://www.youtube.com/watch?v=iS4LsXmZpHE
Evain, S., Contesse, A., Pinchaud, A., Schwab, D., Lecouteux, B., & Henrich Bernardoni, N.
(2019). Beatbox Sounds Recognition Using a Speech-dedicated HMM-GMM Based
System.
Farmer, J. D. (1990). A Rosetta stone for connectionism. Physica D: Nonlinear Phenomena,
42(1), 153–187. https://doi.org/10.1016/0167-2789(90)90072-W
Feld, S., & Fox, A. A. (1994). Music and Language. Annual Review of Anthropology, 23,
25–53.
Flash, T., & Sejnowski, T. J. (2001). Computational approaches to motor control. Current
Opinion in Neurobiology, 11, 655–662.
Fodor, J. A. (1983). The Modularity of Mind. MIT Press.
Fowler, C. A. (1980). Coarticulation and theories of extrinsic timing. Journal of Phonetics,
8(1), 113–133. https://doi.org/10.1016/S0095-4470(19)31446-9
Fowler, C. A., & Rosenblum, L. D. (1990). Duplex perception: A comparison of monosyllables
and slamming doors. Journal of Experimental Psychology: Human Perception and
Performance, 16(4), 742–754. https://doi.org/10.1037/0096-1523.16.4.742
Fukuda, M., Kimura, K., Blaylock, R., & Lee, S. (2022). Scope of beatrhyming:
Segments or words. Proceedings of the AJL 6 (Asian Junior Linguists), 59–63.
https://doi.org/10.1121/1.5137696
Gafos, A. I. (1996). The articulatory basis of locality in phonology [Ph.D., The Johns Hopkins
University].
https://www.proquest.com/docview/304348525/abstract/DAAEF1DFEB254E8BPQ/1
Gafos, A. I., & Benus, S. (2006). Dynamics of Phonological Cognition. Cognitive Science,
30(5), 905–943. https://doi.org/10.1207/s15516709cog0000_80
Gafos, A., & Goldstein, L. (2011). Articulatory representation and organization. In A. C. Cohn,
C. Fougeron, & M. K. Huffman (Eds.), The Oxford Handbook of Laboratory Phonology
(1st ed.). Oxford University Press.
https://doi.org/10.1093/oxfordhb/9780199575039.001.0001
Goldsmith, J. A. (1976). Autosegmental phonology [PhD, Massachusetts Institute of
Technology]. http://hdl.handle.net/1721.1/16388
Goldstein, L., Byrd, D., & Saltzman, E. (2006). The role of vocal tract gestural action units in
understanding the evolution of phonology. In M. A. Arbib (Ed.), Action to Language via
the Mirror Neuron System (pp. 215–249). Cambridge University Press.
https://doi.org/10.1017/CBO9780511541599.008
Goldstein, L., Nam, H., Saltzman, E., & Chitoran, I. (2009). Coupled Oscillator Planning
Model of Speech Timing and Syllable Structure. In C. G. M. Fant, H. Fujisaki, & J. Shen
(Eds.), Frontiers in phonetics and speech science (p. 239-249). The Commercial Press.
https://hal.archives-ouvertes.fr/hal-03127293
Greenwald, J. (2002). Hip-Hop Drumming: The Rhyme May Define, but the Groove Makes
You Move. Black Music Research Journal, 22(2), 259–271. https://doi.org/10.2307/1519959
Guinn, D., & Nazarov, A. (2018, January). Evidence for features and phonotactics in
beatboxing vocal percussion. 15th Old World Conference on Phonology, University
College London, United Kingdom.
Hale, K., & Nash, D. (1997). Damin and Lardil phonotactics. Boundary Rider: Essays
in Honor of Geoffrey O’Grady, 247–259. https://doi.org/10.15144/PL-C136.247
Hale, M., & Reiss, C. (2000). Phonology as Cognition. Phonological Knowledge: Conceptual
and Empirical Issues, 161–184.
Hauser, M. D., Chomsky, N., & Fitch, W. T. (2002). The Faculty of Language: What Is It, Who
Has It, and How Did It Evolve? Science, 298(5598), 1569–1579.
Hayes, B. (1984). The Phonology of Rhythm in English. Linguistic Inquiry, 15(1), 33–74.
Hayes, B., & Kaun, A. (1996). The role of phonological phrasing in sung and chanted verse.
The Linguistic Review, 13(3–4). https://doi.org/10.1515/tlir.1996.13.3-4.243
Hayes, B., & Wilson, C. (2008). A maximum entropy model of phonotactics and phonotactic
learning. Linguistic Inquiry, 39(3), 379–440.
Hayes, B., Kirchner, R., & Steriade, D. (Eds.). (2004). Phonetically Based Phonology.
Cambridge University Press.
Himonides, E., Moors, T., Maraschin, D., & Radio, M. (2018). Is there potential for using
beatboxing in supporting laryngectomees? Findings from a public engagement project.
Hockett, C. F. (1955). A manual of phonology (Vol. 21). Indiana University Publications in
Anthropology and Linguistics.
Hoyt, D. F., & Taylor, C. R. (1981). Gait and the energetics of locomotion in horses. Nature,
292(5820), 239–240. https://doi.org/10.1038/292239a0
Human Beatbox. (2014, September 16). Unforced. HUMAN BEATBOX.
https://www.humanbeatbox.com/glossary/unforced/
Icht, M. (2018). Introducing the Beatalk technique: Using beatbox sounds and rhythms to
improve speech characteristics of adults with intellectual disability. International
Journal of Language & Communication Disorders, 54. https://doi.org/10.1111/1460-6984.12445
Icht, M. (2021). Improving speech characteristics of young adults with congenital dysarthria:
An exploratory study comparing articulation training and the Beatalk method. Journal of
Communication Disorders, 93, 106147. https://doi.org/10.1016/j.jcomdis.2021.106147
Icht, M., & Carl, M. (2022). Points of view: Positive effects of the Beatalk technique on speech
characteristics of young adults with intellectual disability. International Journal of
Developmental Disabilities, 1–5. https://doi.org/10.1080/20473869.2022.2065449
Jakobson, R., Fant, C. G., & Halle, M. (1951). Preliminaries to speech analysis: The distinctive
features and their correlates.
Kaun, A. R. (2004). The typology of rounding harmony. In B. Hayes, R. Kirchner, & D.
Steriade (Eds.), Phonetically based phonology (pp. 87–116).
Keating, P. A. (1996). The Phonetics-Phonology Interface. UCLA Working Papers in
Phonetics, 92, 45–60.
Kelso, J. A. S., & Tuller, B. (1984). A Dynamical Basis for Action Systems. In M. S. Gazzaniga
(Ed.), Handbook of Cognitive Neuroscience (pp. 321–356). Springer US.
https://doi.org/10.1007/978-1-4899-2177-2_16
Kelso, J. A. S., Holt, K. G., Rubin, P., & Kugler, P. N. (1981). Patterns of Human Interlimb
Coordination Emerge from the Properties of Non-Linear, Limit Cycle Oscillatory
Processes. Journal of Motor Behavior, 13(4), 226–261.
https://doi.org/10.1080/00222895.1981.10735251
Kelso, J. A., & Tuller, B. (1984). Converging evidence in support of common dynamical
principles for speech and movement coordination. American Journal of
Physiology-Regulatory, Integrative and Comparative Physiology, 246(6), R928–R935.
https://doi.org/10.1152/ajpregu.1984.246.6.R928
Kelso, J. S., Tuller, B., Vatikiotis-Bateson, E., & Fowler, C. A. (1984). Functionally specific
articulatory cooperation following jaw perturbations during speech: Evidence for
coordinative structures. Journal of Experimental Psychology: Human Perception and
Performance, 10(6), 812–832. https://doi.org/10.1037/0096-1523.10.6.812
Kimper, W. A. (2011). Competing Triggers: Transparency and Opacity in Vowel Harmony
[PhD Dissertation]. University of Massachusetts Amherst.
Krivokapić, J. (2014). Gestural coordination at prosodic boundaries and its role for prosodic
structure and speech planning processes. Philosophical Transactions of the Royal Society
B: Biological Sciences, 369(1658), 20130397. https://doi.org/10.1098/rstb.2013.0397
Kröger, B. J., Schröder, G., & Opgen-Rhein, C. (1995). A gesture-based dynamic model
describing articulatory movement data. The Journal of the Acoustical Society of America,
98(4), 1878–1889. https://doi.org/10.1121/1.413374
Kugler, P. N., Kelso, J. A. S., & Turvey, M. T. (1980). On the Concept of Coordinative
Structures as Dissipative Structures: I. Theoretical Lines of Convergence. In G. E.
Stelmach & J. Requin (Eds.), Advances in Psychology (Vol. 1, pp. 3–47). North-Holland.
https://doi.org/10.1016/S0166-4115(08)61936-6
Kuhl, P. K., & Miller, J. D. (1978). Speech perception by the chinchilla: Identification
functions for synthetic VOT stimuli. The Journal of the Acoustical Society of America,
63(3), 905–917. https://doi.org/10.1121/1.381770
Ladefoged, P. (1989). Representing Phonetic Structure (No. 73; Working Papers in Phonetics).
Phonetics Laboratory, Department of Linguistics, UCLA.
Lammert, A. C., Melot, J., Sturim, D. E., Hannon, D. J., DeLaura, R., Williamson, J. R.,
Ciccarelli, G., & Quatieri, T. F. (2020). Analysis of Phonetic Balance in Standard English
Passages. Journal of Speech, Language, and Hearing Research, 63(4), 917–930.
https://doi.org/10.1044/2020_JSLHR-19-00001
Lammert, A. C., Proctor, M. I., & Narayanan, S. S. (2010). Data-Driven Analysis of Realtime
Vocal Tract MRI using Correlated Image Regions. Interspeech 2010, 1572–1575.
Lammert, A. C., Ramanarayanan, V., Proctor, M. I., & Narayanan, S. S. (2013). Vocal tract
cross-distance estimation from real-time MRI using region-of-interest analysis.
Interspeech 2013, 959–962.
Large, E. W. (2000). On synchronizing movements to music. Human Movement Science,
19(4), 527–566. https://doi.org/10.1016/S0167-9457(00)00026-9
Large, E. W., & Kolen, J. F. (1994). Resonance and the perception of musical meter.
Connection Science, 6(1), 177–208.
Lartillot, O., Toiviainen, P., & Eerola, T. (2008). A Matlab Toolbox for Music Information
Retrieval. In C. Preisach, H. Burkhardt, L. Schmidt-Thieme, & R. Decker (Eds.), Data
Analysis, Machine Learning and Applications (pp. 261–268). Springer.
https://doi.org/10.1007/978-3-540-78246-9_31
Lartillot, O., Toiviainen, P., Saari, P., & Eerola, T. (n.d.). MIRtoolbox (1.7.2) [Computer
software]. http://www.jyu.fi/hum/laitokset/musiikki/en/research/coe/materials/mirtoolbox
Lederer, K. (2005/2006). The Phonetics of Beatboxing.
https://www.humanbeatbox.com/articles/the-phonetics-of-beatboxing-abstract/
Lerdahl, F., & Jackendoff, R. (1983/1996). A Generative Theory of Tonal Music. MIT Press.
Liberman, A. M., & Mattingly, I. G. (1985). The motor theory of speech perception revised.
Cognition, 21(1), 1–36. https://doi.org/10.1016/0010-0277(85)90021-6
Liberman, A. M., & Mattingly, I. G. (1989). A Specialization for Speech Perception. Science,
243(4890), 489–494.
Liberman, A. M., Isenberg, D., & Rakerd, B. (1981). Duplex perception of cues for stop
consonants: Evidence for a phonetic mode. Perception & Psychophysics, 30(2), 133–143.
https://doi.org/10.3758/BF03204471
Liberman, M., & Prince, A. (1977). On Stress and Linguistic Rhythm. Linguistic Inquiry, 8(2),
249–336.
Liljencrants, J., & Lindblom, B. (1972). Numerical Simulation of Vowel Quality Systems: The
Role of Perceptual Contrast. Language, 48(4), 839. https://doi.org/10.2307/411991
Lindblom, B. (1983). Economy of Speech Gestures. In P. F. MacNeilage (Ed.), The Production
of Speech (pp. 217–245). Springer New York. https://doi.org/10.1007/978-1-4613-8202-7_10
Lindblom, B. (1986). Phonetic universals in vowel systems. In Experimental phonology (pp.
13–44).
Lindblom, B. (1990). On the notion of “possible speech sound.” Journal of Phonetics, 18(2),
135–152. https://doi.org/10.1016/S0095-4470(19)30398-5
Lindblom, B., & Maddieson, I. (1988). Phonetic universals in consonant systems. In Language,
speech and mind.
Lindblom, B., Lubker, J., & Gay, T. (1979). Formant frequencies of some fixed-mandible
vowels and a model of speech motor programming by predictive simulation. Journal of
Phonetics, 7(2), 147–161. https://doi.org/10.1016/S0095-4470(19)31046-0
Lingala, S. G., Zhu, Y., Kim, Y.-C., Toutios, A., Narayanan, S., & Nayak, K. S. (2017). A fast and
flexible MRI system for the study of dynamic vocal tract shaping. Magnetic Resonance in
Medicine, 77(1), 112–125. https://doi.org/10.1002/mrm.26090
Llorens, M. (In progress). Dissertation, University of Southern California.
MacNeilage, P. F. (1998). The frame/content theory of evolution of speech production.
Behavioral and Brain Sciences, 21(4), 499–511. https://doi.org/10.1017/S0140525X98001265
Maess, B., Koelsch, S., Gunter, T. C., & Friederici, A. D. (2001). Musical syntax is processed in
Broca’s area: An MEG study. Nature Neuroscience, 4(5), 540–545.
https://doi.org/10.1038/87502
Mann, V. A., & Liberman, A. M. (1983). Some differences between phonetic and auditory
modes of perception. Cognition, 14(2), 211–235.
https://doi.org/10.1016/0010-0277(83)90030-6
Martin, M., & Mullady, K. (n.d.). Education. Lightship Beatbox. Retrieved June 6, 2022, from
https://www.lightshipbeatbox.com/education
McPherson, L. (2018). The Talking Balafon of the Sambla: Grammatical Principles and
Documentary Implications. Anthropological Linguistics, 60(3), 255–294.
https://doi.org/10.1353/anl.2019.0006
McPherson, L., & Ryan, K. M. (2018). Tone-tune association in Tommo So (Dogon) folk
songs. Language, 94(1), 119–156. https://doi.org/10.1353/lan.2018.0003
Mielke, J. (2011). Distinctive Features. In The Blackwell Companion to Phonology (pp. 1–25).
John Wiley & Sons, Ltd. https://doi.org/10.1002/9781444335262.wbctp0017
Moors, T., Silva, S., Maraschin, D., Young, D., Quinn V, J., Carpentier, J., Allouche, J., &
Himonides, E. (2020). Using Beatboxing for Creative Rehabilitation After Laryngectomy:
Experiences From a Public Engagement Project. Frontiers in Psychology, 10, 2854.
https://doi.org/10.3389/fpsyg.2019.02854
Mullady, K. (2017, January 25). Beatboxing rapping and singing at the same time [Video].
YouTube. https://www.youtube.com/watch?v=4BCcydkZqUo
Nam, H., & Saltzman, E. (2003). A Competitive, Coupled Oscillator Model of Syllable
Structure. Proceedings of the 15th International Congress of Phonetic Sciences.
Nam, H., Goldstein, L., & Saltzman, E. (2009). Self-organization of Syllable Structure: A
Coupled Oscillator Model. In F. Pellegrino, E. Marsico, I. Chitoran, & C. Coupé (Eds.),
Approaches to Phonological Complexity (pp. 297–328). Walter de Gruyter.
https://doi.org/10.1515/9783110223958
Narayanan, S., Nayak, K., Lee, S., Sethy, A., & Byrd, D. (2004). An approach to real-time
magnetic resonance imaging for speech production. The Journal of the Acoustical Society
of America, 115(4), 1771–1776. https://doi.org/10.1121/1.1652588
Oh, M. (2021). Articulatory Dynamics and Stability in Multi-Gesture Complexes [Ph.D.,
University of Southern California].
https://www.proquest.com/docview/2620982636/abstract/EB4ABE9208A5428EPQ/1
Oh, M., & Lee, Y. (2018). ACT: An Automatic Centroid Tracking tool for analyzing vocal tract
actions in real-time magnetic resonance imaging speech production data. The Journal of
the Acoustical Society of America, 144(4), EL290–EL296. https://doi.org/10.1121/1.5057367
Ohala, J. J. (1980). Moderator’s summary of symposium on “Phonetic universals in
phonological systems and their explanation.” Proceedings of the 9th International
Congress of Phonetic Sciences, 3, 181–194.
Ohala, J. J. (1983). The Origin of Sound Patterns in Vocal Tract Constraints. In P. F.
MacNeilage (Ed.), The Production of Speech. Springer New York.
https://doi.org/10.1007/978-1-4613-8202-7_9
Ohala, J. J. (1990). There is no interface between phonology and phonetics: A personal view.
Journal of Phonetics, 18(2), 153–171. https://doi.org/10.1016/S0095-4470(19)30399-7
Ohala, J. J. (1994). Towards a universal, phonetically-based, theory of vowel harmony. The
3rd International Conference on Spoken Language Processing, ICSLP, Yokohama, Japan.
Ohala, J. J. (2008). Languages’ Sound Inventories: The Devil in the Details. UC Berkeley
Phonology Lab Annual Reports, 4. https://doi.org/10.5070/P76S79B30W
O’Dell, M. L., & Nieminen, T. (1999). Coupled oscillator model of speech rhythm.
Proceedings of the 14th International Congress of Phonetic Sciences, 2, 1075–1078.
O’Dell, M. L., & Nieminen, T. (2009). Coupled oscillator model for speech timing: Overview
and examples. Prosody: Proceedings of the 10th Conference, 179–190.
Palmer, C., & Kelly, M. H. (1992). Linguistic Prosody and Musical Meter in Song. Journal of
Memory and Language, 31(4), 525–542.
Park, J. (2016, September 12). 80 Fitz | Build your basic sound arsenal. HUMAN BEATBOX.
https://www.humanbeatbox.com/lessons/80-fitz-build-your-basic-sound-arsenal/#tutorial3
Park, J. (2017, March 22). Spit Snare. HUMAN BEATBOX.
https://www.humanbeatbox.com/techniques/sounds/spit-snare/
Paroni, A., Henrich Bernardoni, N., Savariaux, C., Lœvenbruck, H., Calabrese, P., Pellegrini, T.,
Mouysset, S., & Gerber, S. (2021). Vocal drum sounds in human beatboxing: An acoustic
and articulatory exploration using electromagnetic articulography. The Journal of the
Acoustical Society of America, 149(1), 191–206. https://doi.org/10.1121/10.0002921
Paroni, A., Lœvenbruck, H., Baraduc, P., Savariaux, C., Calabrese, P., & Bernardoni, N. H.
(2021). Humming Beatboxing: The Vocal Orchestra Within. MAVEBA 2021 - 12th
International Workshop Models and Analysis of Vocal Emissions for Biomedical
Applications, Università degli Studi di Firenze.
Parrell, B., & Narayanan, S. (2018). Explaining Coronal Reduction: Prosodic Structure and
Articulatory Posture. Phonetica, 75(2), 151–181. https://doi.org/10.1159/000481099
Patil, N., Greer, T., Blaylock, R., & Narayanan, S. S. (2017). Comparison of Basic Beatboxing
Articulations Between Expert and Novice Artists Using Real-Time Magnetic Resonance
Imaging. Interspeech 2017, 2277–2281. https://doi.org/10.21437/Interspeech.2017-1190
Pike, K. L. (1943). Phonetics: A Critical Analysis of Phonetic Theory and a Technique for the
Practical Description of Sounds. University of Michigan Publications.
Pillot-Loiseau, C., Garrigues, L., Demolin, D., Fux, T., Amelot, A., & Crevier-Buchman, L.
(2020). Le human beatbox entre musique et parole: Quelques indices acoustiques et
physiologiques [The human beatbox between music and speech: Some acoustic and
physiological cues]. Volume !, 16:2/17:1, 125–143. https://doi.org/10.4000/volume.8121
Pouplier, M. (2012). The gaits of speech: Re-examining the role of articulatory effort in
spoken language. In M.-J. Solé & D. Recasens (Eds.), Current Issues in Linguistic Theory
(Vol. 323, pp. 147–164). John Benjamins Publishing Company.
https://doi.org/10.1075/cilt.323.12pou
Prince, A., & Smolensky, P. (1993/2004). Optimality Theory: Constraint Interaction in
Generative Grammar. Manuscript, Rutgers University and University of Colorado
Boulder. Published 2004 by Blackwell Publishing.
Proctor, M., Bresch, E., Byrd, D., Nayak, K., & Narayanan, S. (2013). Paralinguistic
mechanisms of production in human “beatboxing”: A real-time magnetic resonance
imaging study. The Journal of the Acoustical Society of America, 133(2), 1043–1054.
https://doi.org/10.1121/1.4773865
Proctor, M., Lammert, A., Katsamanis, A., Goldstein, L., Hagedorn, C., & Narayanan, S. (2011).
Direct Estimation of Articulatory Kinematics from Real-Time Magnetic Resonance Image
Sequences. Interspeech 2011, 284–281.
Ravignani, A., Honing, H., & Kotz, S. A. (2017). Editorial: The Evolution of Rhythm
Cognition: Timing in Music and Speech. Frontiers in Human Neuroscience, 11.
https://www.frontiersin.org/article/10.3389/fnhum.2017.00303
Rialland, A. (2005). Phonological and phonetic aspects of whistled languages. Phonology,
22(2), 237–271. https://doi.org/10.1017/S0952675705000552
Roon, K. D., & Gafos, A. I. (2016). Perceiving while producing: Modeling the dynamics of
phonological planning. Journal of Memory and Language, 89, 222–243.
https://doi.org/10.1016/j.jml.2016.01.005
Rose, S., & Walker, R. (2011). Harmony Systems. In The Handbook of Phonological Theory
(pp. 240–290). John Wiley & Sons, Ltd. https://doi.org/10.1002/9781444343069.ch8
Saltzman, E. L., & Munhall, K. G. (1989). A Dynamical Approach to Gestural Patterning in
Speech Production. Ecological Psychology, 1(4), 333–382.
https://doi.org/10.1207/s15326969eco0104_2
Saltzman, E. L., & Munhall, K. G. (1992). Skill Acquisition and Development: The Roles of
State-, Parameter-, and Graph-Dynamics. Journal of Motor Behavior, 24(1), 49–57.
https://doi.org/10.1080/00222895.1992.9941600
Saltzman, E., & Kelso, J. A. (1987). Skilled actions: A task-dynamic approach. Psychological
Review, 94(1), 84–106. https://doi.org/10.1037/0033-295X.94.1.84
Saltzman, E., Nam, H., Goldstein, L., & Byrd, D. (2006). The Distinctions Between State,
Parameter and Graph Dynamics in Sensorimotor Control and Coordination. In M. L.
Latash & F. Lestienne (Eds.), Motor Control and Learning (pp. 63–73). Kluwer Academic
Publishers. https://doi.org/10.1007/0-387-28287-4_6
Saltzman, E., Nam, H., Krivokapic, J., & Goldstein, L. (2008). A task-dynamic toolkit for
modeling the effects of prosodic structure on articulation. Proceedings of the 4th
International Conference on Speech Prosody (Speech Prosody 2008), 175–184.
Sapir, E. (1925). Sound Patterns in Language. Language, 1(2), 37–51.
Schellenberg, M. H. (2013). The Realization of Tone in Singing in Cantonese and Mandarin.
The University of British Columbia.
Schellenberg, M., & Gick, B. (2020). Microtonal Variation in Sung Cantonese. Phonetica,
77(2), 83–106. https://doi.org/10.1159/000493755
Schyns, P. G., Goldstone, R. L., & Thibaut, J.-P. (1998). The development of features in object
concepts. Behavioral and Brain Sciences, 21(1), 1–17.
https://doi.org/10.1017/S0140525X98000107
Shadmehr, R. (1998). The Equilibrium Point Hypothesis for Control of Movements.
Baltimore, MD: Department of Biomedical Engineering, Johns Hopkins University.
Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical
Journal, 27(3), 379–423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x
Shaw, P. A. (2008). Scat syllables and markedness theory. Toronto Working Papers in
Linguistics, 27, 145–191.
Shih, S. S., & Inkelas, S. (2014). A Subsegmental Correspondence Approach to Contour Tone
(Dis)Harmony Patterns. Proceedings of the Annual Meetings on Phonology, 1(1), Article
1. https://doi.org/10.3765/amp.v1i1.22
Shih, S. S., & Inkelas, S. (2018). Autosegmental Aims in Surface-Optimizing Phonology.
Linguistic Inquiry, 50(1), 137–196. https://doi.org/10.1162/ling_a_00304
Shih, S. S., & Zuraw, K. (2017). Phonological conditions on variable adjective and noun word
order in Tagalog. Language, 93(4), e317–e352. https://doi.org/10.1353/lan.2017.0075
Smith, C. M. (2018). Harmony in Gestural Phonology [Ph.D., University of Southern
California].
https://www.proquest.com/docview/2128021312/abstract/149BC12A53B84C53PQ/1
Smolensky, P., Goldrick, M., & Mathis, D. (2014). Optimization and quantization in gradient
symbol systems: A framework for integrating the continuous and the discrete in
cognition. Cognitive Science, 38(6), 1102–1138.
Sorensen, T., & Gafos, A. (2016). The Gesture as an Autonomous Nonlinear Dynamical
System. Ecological Psychology, 28(4), 188–215.
https://doi.org/10.1080/10407413.2016.1230368
Stevens, K. N. (1989). On the quantal nature of speech. Journal of Phonetics, 17(1–2), 3–45.
https://doi.org/10.1016/S0095-4470(19)31520-7
Stevens, K. N., & Keyser, S. J. (2010). Quantal theory, enhancement and overlap. Journal of
Phonetics, 38(1), 10–19. https://doi.org/10.1016/j.wocn.2008.10.004
Stowell, D. (2003). The Beatbox Alphabet.
http://www.mcld.co.uk/beatboxalphabet/
Stowell, D., & Plumbley, M. D. (2008). Characteristics of the beatboxing vocal style (No.
C4DM-TR-08–01; pp. 1–4). Queen Mary, University of London.
Studdert-Kennedy, M., & Goldstein, L. (2003). Launching Language: The Gestural Origin of
Discrete Infinity. In M. H. Christiansen & S. Kirby (Eds.), Language Evolution (pp.
235–254). Oxford University Press.
https://doi.org/10.1093/acprof:oso/9780199244843.003.0013
Tiede, M. (2010). MVIEW: Multi-channel visualization application for displaying dynamic
sensor movements.
Tilsen, S. (2009). Multitimescale Dynamical Interactions Between Speech Rhythm and
Gesture. Cognitive Science, 33(5), 839–879. https://doi.org/10.1111/j.1551-6709.2009.01037.x
Tilsen, S. (2018, March 28). Three mechanisms for modeling articulation: Selection,
coordination, and intention. Cornell Working Papers in Phonetics and Phonology.
Tilsen, S. (2019). Motoric Mechanisms for the Emergence of Non-local Phonological Patterns.
Frontiers in Psychology, 10. https://www.frontiersin.org/article/10.3389/fpsyg.2019.02143
Tyte, G., & SPLINTER. (2002/2004). Standard Beatbox Notation (SBN). HUMAN BEATBOX.
https://www.humanbeatbox.com/articles/standard-beatbox-notation-sbn/
WIRED. (2020, March 17). 13 Levels of Beatboxing: Easy to Complex | WIRED [Video]. YouTube.
https://www.youtube.com/watch?v=E_z9kg2MU
Walker, R. (2005). Weak Triggers in Vowel Harmony. Natural Language & Linguistic Theory,
23(4), 917. https://doi.org/10.1007/s11049-004-4562-z
Walker, R., Byrd, D., & Mpiranya, F . (2008). An articulatory view of Kinyarwanda coronal
harmony. Phonology, 25(3), 499–535. https://doi.org/10.1017/S0952675708001619
Werker, J. F., & Tees, R. C. (1984). Cross-language speech perception: Evidence for perceptual
reorganization during the first year of life. Infant Behavior and Development, 7(1), 49–63.
https://doi.org/10.1016/S0163-6383(84)80022-3
Westbury, J. R. (1983). Enlargement of the supraglottal cavity and its relation to stop
consonant voicing. The Journal of the Acoustical Society of America, 73(4), 1322–1336.
https://doi.org/10.1121/1.389236
Woods, K. J. (2012). (Post)Human Beatbox Performance and the Vocalisation of Electronic
and Mechanically (Re)Produced Sounds.
Wyttenbach, R. A., May, M. L., & Hoy, R. R. (1996). Categorical Perception of Sound
Frequency by Crickets. Science, 273(5281), 1542–1544.
Ziegler, W. (2003a). Speech motor control is task-specific: Evidence from dysarthria and
apraxia of speech. Aphasiology, 17(1), 3–36. https://doi.org/10.1080/729254892
Ziegler, W. (2003b). To speak or not to speak: Distinctions between speech and nonspeech
motor control. Aphasiology, 17(2), 99–105. https://doi.org/10.1080/729255218
Zipf, G. K. (1949). Human Behavior and the Principle of Least Effort. Addison-Wesley
Press, Inc. http://archive.org/details/in.ernet.dli.2015.90211
de Saussure, F. (1916). Cours de linguistique générale [Course in general linguistics]. Payot.
de Torcy, T., Clouet, A., Pillot-Loiseau, C., Vaissière, J., Brasnu, D., & Crevier-Buchman, L.
(2014). A video-fiberscopic study of laryngopharyngeal behaviour in the human beatbox.
Logopedics Phoniatrics Vocology, 39(1), 38–48.
https://doi.org/10.3109/14015439.2013.784801
APPENDIX: Harmony beat pattern drum tabs
Beat pattern 1: Clickroll showcase
b |x-----------x---|--x-----------x-|x-----------x---|--x-------------
B |------x---------|------x---x-----|------x---------|------x---x-----
t |----------------|----------------|----------------|------------x---
dc|----x-----x-----|x---x-------x---|----x-----x-----|x---x-----------
^K|--------x-------|--------x-------|--------x-------|--------x-----x-
CR|x~~~--------x~~~|--x~------------|x~~~--------x~~~|--x~------------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
Beat pattern 2: Clop showcase
C |x---x-x---x-x-x-|x-x---x-x-x-xxx-|x---x-x---x-x-x-|x-x---x-x-x-xxx-
ex|x---x---x---x---|x---x---x---x---|x---x---x---x---|x---x---x---x---
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
Note: exhale may be some kind of voicing, given the larynx activity
Beat pattern 3: Duck Meow SFX showcase
b |------x---------|--x---x---------|x-----x-----x---|--x---x---x-x---
ac |--x-------x---x-|----------------|--x-------x---x-|----x-----------
dc |----x-----------|x---------------|----x-----------|x---------------
tbc|----------------|----x-----------|----------------|----------------
DM |x-----------x---|----------x-----|x-----------x---|------------x---
^K |--------x-------|--------x-------|--------x-------|--------x-------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
Beat pattern 4: Liproll showcase
Bars 1-4
b |x-----x-----x---|--x---x-----x---|x-----x-----x---|--x-------x---x-
ac |----------x-----|----------x-----|----------x-----|--------x-------
dc |----------------|----x-----------|----------------|------------x---
tbc|----------------|----------------|----------------|----x-----------
pf |--------x-------|--------x-------|--------x-------|------x---------
LR |x~~~~~------x~~~|~~----------x~~~|x~~~~~------x~~~|~~--------------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
Bars 5-8
b |x---x-------x---|x---x-------x---|x---x-------x---|x---x-----------
ac |----------x-----|----------x-----|----------x-----|----------------
dc |----------------|--------------x-|----------------|----------------
tbc|----------------|----------------|----------------|----------------
pf |--------x-------|--------x-------|--------x-------|--------x-------
LR |x~~~x~~~----x~~~|x~~~x~~~--------|x~~~x~~~----x~~~|x~~~x~~~--------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
Beat pattern 5: Spit Snare showcase
b |x-----x-----x---|----x-------x---|x-----x-----x---|----x-------x---
dc |----x-----------|----------------|----x-----------|----------------
tll|----------------|x---------------|----------------|x---------------
SS |--------x-------|--------x-------|--------x-------|--------x-------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
Beat pattern 6: Water Drop Air showcase
b |x---------------|x---------------|x---------------|x---------------
ac |--x-------x---x-|--x-------x---x-|--x-------x---x-|--x-------x---x-
WDA|----x~~~---x~~--|----x~~~---x~~--|----x~~~---x~~--|----x~~~---x~~--
pf |--------x-------|--------x-------|--------x-------|--------x-------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
Beat pattern 7: Water Drop Tongue showcase
b |x-----x-----x---|--x---x---x---x-|x-----x-----x---|--x---x---x---x-
WDT|--x-x---------x-|x---x-------x---|--x-x---------x-|x---x-------x---
SS |--------x-------|--------x-------|--------x-------|--------x-------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
Beat pattern 8: Inward Bass showcase
B |x---------------|----------------|----------------|----------------
b |------------x---|----------------|x-----------x---|----------------
SS |--------x-------|--------x-------|--------x-------|----------------
IB |x---x---x---x---|x---x---x---x---|x---x---x---x---|x---x---x---x---
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
Beat pattern 9: Humming while Beatboxing showcase
b |x-----x-----x---|--x---x-------x-|x-----x-----x---|--x---x---------
dc |--x-----------x-|----------------|--x-----------x-|------------x---
tbc|----x-----------|x---x-------x---|----x-----------|x---x-----------
SS |--------x-------|--------x-------|--------x-------|--------x-------
hm |x---x-------x---|x---x-------x---|x---x-------x---|x---x---x---x---
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
Beat pattern 10: Unknown 1
Bars 1-4
B |x-----x-----x---|--x-------------|x-----x-----x---|--x-------------
^LR|x~~~~~------x~~~|~~--------------|x~~~~~------x~~~|~~--------------
^K |----------------|----------------|----------------|----------------
SS |--------x-------|--------x-------|--------x-------|--------x-------
tbc|----------------|----x-----------|----------------|----x-----------
HTB|----------------|------------x~~~|----------------|------------x~~~
b |----------------|------x---------|----------------|------x---------
dc |----------------|----------------|----------------|----------------
dac|----------------|----------------|----------------|----------------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
Bars 5-8
B |x-----x-----x---|--x-------------|x---------------|----------------
^LR|x~~~~~------x~~~|~~--------------|----------------|----------------
^K |----------------|----------------|----------------|------------x---
SS |--------x-------|--------x-------|--------x-------|--------x-------
tbc|----------------|----x-----------|----------------|----x-----------
HTB|----------------|------------x~~~|----------------|----------------
b |----------------|------x---------|------x-----x---|--x---x---------
dc |----------------|----------------|--x-----------x-|----------------
dac|----------------|----------------|----x-----------|x---------------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
Beat pattern 11: Unknown 2
Bars 1-4
hm |x---x-------x---|x---x-------x---|x---x-------x---|x---x---x---x---
b |x-----x-----x---|--x---x---------|x-----x-----x---|--x---x---x---x-
B |----------------|----------------|----------------|----------------
dc |--x-----------x-|----------------|--x-------------|----------------
tll|----x-----------|x---------------|----x-----------|----------------
tbc|----------------|----x-----------|----------------|x---x-----------
SS |--------x-------|--------x-------|--------x-------|--------x-------
WDT|----------------|------------x---|----------------|------------x---
PF |----------------|----------------|----------------|----------------
ta |----------------|----------------|----------------|----------------
^K |----------------|----------------|----------------|----------------
^LR|----------------|----------------|----------------|----------------
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
Bars 5-8
hm |x---x-------x---|x---x-------x---|----------------|----------------
b |x-----x-----x---|--x-----------x-|----------------|----x-x---------
B |----------------|----------------|----------------|----------x-----
dc |--x-------------|----x-------x---|----x-----x---x-|--x-------------
tll|----x-----------|------x---------|----------------|----------------
tbc|----------------|----------------|----------------|----------------
SS |--------x-------|--------x-------|----------------|----------------
WDT|--------------x-|x---------------|----------------|----------------
PF |----------------|----------------|x-----x-----x---|x---------------
ta |----------------|----------------|--x-----x-------|----------------
^K |----------------|----------------|----------------|--------x-------
^LR|----------------|----------------|----------------|----------x~~~~~
|1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 + |1 + 2 + 3 + 4 +
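A note on reading these tabs: each row pairs a sound label with a grid of sixteen sixteenth-note slots per bar, where x marks an onset, - a rest, ~ a sustain of the preceding onset, and | a bar line. As a minimal illustrative sketch (not part of the dissertation; the helper name is hypothetical), that convention can be made explicit in code:

```python
def parse_tab_row(row: str) -> tuple[str, list[int]]:
    """Return (sound label, onset slot indices) for one drum-tab row.

    Slots are numbered from 0 continuously across bars; bar lines '|'
    are skipped and do not count as time slots.
    """
    label, _, grid = row.partition("|")
    onsets = []
    slot = 0
    for ch in grid:
        if ch == "|":      # bar line: not a time slot
            continue
        if ch == "x":      # onset of the sound
            onsets.append(slot)
        slot += 1          # '-', '~', and 'x' each occupy one slot
    return label.strip(), onsets

# The 'b' row of Beat pattern 1, bar 1
label, onsets = parse_tab_row("b |x-----------x---")
```

For instance, the b row of Beat pattern 1, bar 1 yields onsets at slots 0 and 12, i.e., on beats 1 and 4; sustained sounds like CR report only the slot where each x begins.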
Abstract
Beatboxing is a type of non-linguistic vocal percussion that can be performed as an accompaniment to linguistic music or as a standalone performance. This dissertation is the first major effort to probe beatboxing cognition—specifically beatboxing phonology—and to develop a theoretical framework relating representations in speech and beatboxing that can account for the phonological phenomena the two behaviors share. In doing so, it contributes to the longstanding debate about the domain-specificity of language: because hallmarks of linguistic phonology such as contrastive units (Chapter 3), alternations (Chapter 5), and harmony (Chapter 6) also exist in beatboxing, beatboxing phonology provides evidence that beatboxing and speech share not only the vocal tract but also organizational foundations, including a certain type of mental representation and the coordination of those representations.
Beatboxing has phonological behavior based in its own phonological units and organization. One could choose to model beatboxing with adaptations of either features or gestures as its fundamental units. But as Chapter 4: Theory discusses, a gestural approach captures both domain-specific aspects of phonology (learned targets and parameter settings for a given constriction) and domain-general aspects (the ability of gestural representations to contrast, to participate in class-based behavior, and to undergo qualitative changes). Gestures have domain-specific meaning within their own system (speech or beatboxing) while sharing a domain-general conformation with other behaviors. Gestures can do this by explicitly connecting the tasks specific to speech or to beatboxing with the sound-making potential of the vocal substrate they share; this in turn creates a direct link between speech gestures and beatboxing gestures. This link is formalized at the graph level of the dynamical systems by which gestures are defined.
The direct formal link between beatboxing and speech units makes predictions about what types of phonological phenomena beatboxing and speech units are able to exhibit—including phonological alternations and harmony mentioned above. It also predicts that the phonological units of the two domains will be able to co-occur, with beatboxing and speech sounds interwoven together by a single individual. This type of behavior is known as “beatrhyming” (Chapter 7: Beatrhyming).
These advantages of the gestural approach for describing speech, beatboxing, and beatrhyming underscore a broader point: that regardless of whether phonology is modular or not, the phonological system is not encapsulated away from other cognitive domains, nor impermeable to connections with other domains. On the contrary, phonological units are intrinsically related to beatboxing units—and, presumably, to other units in similar systems—via the conformation of their mental representations. As beatrhyming helps to illustrate, the properties that the phonological system shares with other domains are also the foundation of the phonological system’s ability to flexibly integrate with other (e.g., musical) domains.
Asset Metadata
Creator: Blaylock, Reed (author)
Core Title: Beatboxing phonology
School: College of Letters, Arts and Sciences
Degree: Doctor of Philosophy
Degree Program: Linguistics
Degree Conferral Date: 2022-08
Publication Date: 07/26/2022
Defense Date: 06/15/2022
Publisher: University of Southern California (original); University of Southern California. Libraries (digital)
Tag: beatboxing, linguistics, music and language, OAI-PMH Harvest, phonetics, phonology, sound
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Goldstein, Louis (committee chair); Iskarous, Khalil (committee member); Zevin, Jason (committee member)
Creator Email: gblayloc@usc.edu, reed.blaylock@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-oUC111375166
Unique Identifier: UC111375166
Legacy Identifier: etd-BlaylockRe-10998
Document Type: Dissertation
Rights: Blaylock, Reed
Type: texts
Source: 20220728-usctheses-batch-962 (batch); University of Southern California (contributing entity); University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email: cisadmin@lib.usc.edu