May 16, 2018, Joe Toscano and Members of the WRAP Lab
What's going on in this audio clip? Why do some people hear "yanny" and others hear "laurel"? And why does it switch between the two? Here's our lab's explanation.
The short story is that whoever created the sound probably intended to create laurel. But there is a critical cue in the sound that's easy to miss, which results in hearing yanny. There are a lot of factors that affect whether or not you hear the cue, such as which device you are using to listen to it and whether you were expecting to hear laurel or yanny. Read below for a more detailed explanation of how this "illusion" works.
If we look closely at the sounds in the words "yanny" and "laurel", we can gain some insight into why the audio clip is interpreted in two different ways. First, let's look at the difference between the "l" sound in "laurel" and the "y" sound in "yanny". We can write these sounds as /l/ and /j/ using the International Phonetic Alphabet. We can also visualize the sounds using a speech spectrogram, which shows how the energy at each sound frequency changes over time. We created computer-synthesized laurel and yanny sounds in our lab to look at the differences between them. Here's what they sound like:
laurel:
yanny:
Those should sound like relatively clear examples of laurel and yanny. Next, we can look at their spectrograms and compare them. Here is the spectrogram for laurel:
The dark bands in the spectrogram are called formants, which represent frequencies with increased energy caused by resonances in the vocal tract. The red dots are the computer's best guess as to where the formants are. There are five formants in this sound, which we label (from lowest to highest frequency), F1, F2, F3, F4, and F5. Pay particular attention to F3 in this word: It starts out high and then has a rapid drop. That's a good cue that there's an /l/ at the beginning of the word.
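For readers who want to experiment, here is a minimal sketch of how a spectrogram can be computed. It is not the method used for the figures in this post (those were made in Praat); it is a plain short-time Fourier transform in NumPy. The function name, the 5 ms window, the 1 ms hop, and the test signal (a single tone gliding from 3000 Hz down to 2000 Hz, standing in for a falling formant) are all illustrative assumptions.

```python
import numpy as np

def spectrogram(signal, sample_rate, window_s=0.005, hop_s=0.001):
    """Short-time Fourier analysis: returns (times, freqs, power_db).

    window_s and hop_s are assumed values chosen for illustration; a short
    window gives the "wideband" view in which formants appear as dark bands.
    """
    win_len = int(round(window_s * sample_rate))
    hop = int(round(hop_s * sample_rate))
    window = np.hanning(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop
    # Cut the signal into overlapping, windowed frames.
    frames = np.stack([
        signal[i * hop : i * hop + win_len] * window
        for i in range(n_frames)
    ])
    # Energy at each frequency, for each frame.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    freqs = np.fft.rfftfreq(win_len, d=1.0 / sample_rate)
    times = (np.arange(n_frames) * hop + win_len / 2) / sample_rate
    power_db = 10 * np.log10(power + 1e-12)  # small offset avoids log(0)
    return times, freqs, power_db

# Hypothetical test signal: a tone that falls from 3000 Hz to 2000 Hz
# over 0.2 s, loosely imitating a single falling formant.
SAMPLE_RATE = 16000
DURATION_S = 0.2
t = np.arange(int(DURATION_S * SAMPLE_RATE)) / SAMPLE_RATE
f_start, f_end = 3000.0, 2000.0
phase = 2 * np.pi * (f_start * t + (f_end - f_start) * t ** 2 / (2 * DURATION_S))
glide = np.sin(phase)

times, freqs, power_db = spectrogram(glide, SAMPLE_RATE)
# Strongest frequency in each frame: a crude one-formant "tracker".
peak_hz = freqs[power_db.argmax(axis=1)]
```

Plotting `power_db` against `times` and `freqs` gives the familiar picture, and `peak_hz` traces the falling band. Real formant tracking (the red dots in the figures) is considerably more involved than picking the strongest frequency per frame.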
Now, let's look at the spectrogram for yanny:
Again, you see five formants. But there's something different at the beginning of the sound: Instead of F3 showing a big drop, F2 does instead. This falling F2 is a good cue for the sound /j/ (as in yanny).
So far, the story is simple enough: a falling F3 produces /l/ as in laurel, and a falling F2 produces /j/ as in yanny. So, why does the original audio clip lead to differences in perception depending on who is listening and what device they are listening on? To answer that, we can look at the spectrogram for the original sound:
F3 shows a large drop, like in laurel. However, F2 is barely visible on the spectrogram. In fact, the computer has a hard time tracking F2, which is why the line for it is so bumpy. When this sound was created, it looks like F2 was very low in both frequency and amplitude. This makes it almost merge with F1 (the lowest formant) and probably makes it hard to hear the F2 cue at all. As a result, F3 could be mistaken for F2. When this happens, you hear a falling F2 instead of a falling F3, which is perceived as a /j/ as in yanny. A similar effect appears to happen with the "r" and "n" sounds in the middle of the word.
Whether you interpret the falling formant as F2 or F3 (and, in turn, interpret the word as laurel or yanny) probably also depends on how you perceive the pitch and the first formant (F1). Those cues give you information about the size of the speaker's vocal tract, which helps you set up an expectation for whether that falling formant should be in the F2 range (if you interpret the speaker as having a high pitch) or the F3 range (if you interpret them as having a low pitch).
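To make the vocal-tract idea concrete, here is a rough sketch using a textbook simplification that is not part of our analysis: treating the vocal tract as a uniform tube closed at one end, whose resonances fall at (2n - 1) * c / (4 * L). The tract lengths (17.5 cm and 12 cm), the speed of sound (350 m/s), and the 2400 Hz "ambiguous" formant are assumed values chosen only for illustration, and the function names are hypothetical.

```python
SPEED_OF_SOUND_CM_PER_S = 35000.0  # assumed: roughly 350 m/s in warm, moist air

def tube_resonances(tract_length_cm, count=3):
    """Resonances of a uniform tube closed at one end (a crude vocal-tract model)."""
    return [
        (2 * n - 1) * SPEED_OF_SOUND_CM_PER_S / (4.0 * tract_length_cm)
        for n in range(1, count + 1)
    ]

def closest_formant(frequency_hz, tract_length_cm):
    """Label the model formant (F1, F2, or F3) nearest to the given frequency."""
    resonances = tube_resonances(tract_length_cm)
    index = min(range(len(resonances)),
                key=lambda i: abs(resonances[i] - frequency_hz))
    return "F%d" % (index + 1)

AMBIGUOUS_HZ = 2400.0  # hypothetical frequency of the formant in question

# Longer tract (lower-pitched speaker): model formants at 500, 1500, 2500 Hz.
long_tract_label = closest_formant(AMBIGUOUS_HZ, 17.5)
# Shorter tract (higher-pitched speaker): all model formants shift upward.
short_tract_label = closest_formant(AMBIGUOUS_HZ, 12.0)
```

In this toy model, the same 2400 Hz formant sits nearest F3 for the longer tract and nearest F2 for the shorter one, which mirrors the expectation effect described above. Real vocal tracts are not uniform tubes, so the numbers should not be read as measurements.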
This also explains why changing the bass, the pitch, or the device you're listening on changes what many people hear. Speakers on small devices, such as phones, might not transmit the low-frequency information as well, leading to more "yanny" percepts. We can test this by filtering out the low frequencies in the original sound. Here is the original sound:
And here it is with the low frequencies filtered out:
The second one might sound more like yanny to you than the first one. But maybe not! Remember, it depends on lots of different factors, including which device you're listening on, how loud it is, which word you were expecting to hear, and which specific acoustic cues you weight most heavily when perceiving speech.
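If you would like to try this kind of filtering yourself, here is a minimal sketch. It is not the exact filter used for the clip above (the settings are not given in this post); it simply zeroes every frequency component below a cutoff using NumPy's FFT. The function name, the 1000 Hz cutoff, and the two-tone test signal are assumptions made for illustration.

```python
import numpy as np

def remove_low_frequencies(signal, sample_rate, cutoff_hz):
    """Crude high-pass filter: zero all frequency components below cutoff_hz."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    spectrum[freqs < cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))

# Hypothetical test signal: a low tone (300 Hz) plus a high tone (2500 Hz).
SAMPLE_RATE = 16000
t = np.arange(int(0.5 * SAMPLE_RATE)) / SAMPLE_RATE
low_tone = np.sin(2 * np.pi * 300 * t)
high_tone = np.sin(2 * np.pi * 2500 * t)

# With an assumed cutoff of 1000 Hz, only the high tone should survive.
filtered = remove_low_frequencies(low_tone + high_tone, SAMPLE_RATE, cutoff_hz=1000)
```

An abrupt cutoff like this can introduce ringing on real recordings, so audio tools usually use a filter with a gentler slope. The basic effect is the same: the low-frequency part of the signal is removed, much as a small phone speaker fails to reproduce it.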
To summarize, the word that the creator of this sound was trying to make was probably laurel, but due to the complexities of speech recognition and the specific acoustic cues in that sound, it can be perceived as yanny under the right conditions. (Update: It looks like it is laurel, and was from a recording of natural speech on vocabulary.com.) Remarkably, human listeners usually have little difficulty understanding speech in real-world conditions, despite variability like this in the signal. Our lab's research aims to find out how we do this.
If you have any questions, please feel free to contact us at wraplab@villanova.edu. Thanks for your interest!
— Dr. Joe Toscano and the Members of the WRAP Lab
Acknowledgements: Speech samples created using the eSpeak synthesizer implemented in Praat. Spectrograms created in Praat.
As long as we have your attention... Are you interested in more work like this? Our lab at Villanova University studies these and other questions about speech recognition and language processing, including projects studying how the brain responds to differences between words, how we can build computer systems that learn to understand speech the way that human listeners do, and how we can develop better tools for detecting hearing loss based on speech recognition. If you're an undergraduate and this work sounds interesting to you, contact Joe (joseph.toscano@villanova.edu) to find out more about applying to our research-based Master's program.