An auditory illusion has recently gone viral – the recording can be heard either as “yanny” or “laurel”. A number of people have explained the acoustics behind it – but what does this mean for how we actually hear?
(Author’s note: to make this more accessible for laypeople, I have oversimplified; for those of you with a stronger background in phonetics, I’d instead refer you to either:
http://languagelog.ldc.upenn.edu/nll/?p=38274 or https://twitter.com/suzyjstyles/status/996560301548945413)
The first thing to understand is that there are no “absolute” properties of any given sound in speech – rather, we hear everything relative to its context. Because of this, the very same acoustic signal can sound very different when played in one syllable than it would in another. This is closely related to the color-dress controversy: we hear sounds relative to their surrounding context in much the same way that we see colors relative to theirs.
The second important factor is that we hear “patterns” in sounds, not particular frequencies.
For example, imagine the case where there is acoustic energy at three different frequencies: say, 500 Hz, 1500 Hz, and 2500 Hz. To us, that would sound like the vowel we say when we are feeling a bit clueless (“uhhhh….”). But none of those three frequencies by itself would make us think of that sound. Rather, it is the PATTERN across the frequencies that makes us hear it as that sound. So if, instead of three evenly spaced frequencies, we hear one at 500 Hz and then ones at 2000 Hz and 2500 Hz, we might be more likely to hear it as the vowel in “hey!”
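To make the “pattern, not frequency” idea concrete, here is a minimal sketch in Python. It compares an incoming set of three frequencies against the two schematic patterns from the paragraph above (the 500/1500/2500 Hz numbers are the article’s illustrative values, not precise phonetic measurements) and picks whichever overall pattern is closest – so a signal can match “ey” even though no single frequency matches exactly:

```python
# Schematic sketch: a vowel is identified by the overall PATTERN of its
# frequency peaks, not by any one frequency. Values are the article's
# illustrative numbers, not real phonetic data.

VOWEL_PATTERNS = {
    "uh (clueless vowel)": (500, 1500, 2500),  # three evenly spaced peaks
    "ey (as in 'hey')":    (500, 2000, 2500),  # higher second peak
}

def classify_vowel(formants):
    """Return the vowel whose pattern is closest to the input overall."""
    def distance(pattern):
        # Sum of squared differences across all three frequencies
        return sum((f - p) ** 2 for f, p in zip(formants, pattern))
    return min(VOWEL_PATTERNS, key=lambda v: distance(VOWEL_PATTERNS[v]))

print(classify_vowel((500, 1500, 2500)))  # matches the "uh" pattern
print(classify_vowel((480, 1950, 2450)))  # no exact match, but closest to "ey"
```

Note that the second call matches “ey” even though none of its three frequencies is exactly right – it is the joint pattern that decides.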
But what if the pattern is a bit unclear? Or, worse yet, what if you clearly heard sounds at 500 and 2500 Hz, but then heard what might be a really soft sound at 1500 or at 2000 Hz – which sound would you hear?
With natural, clear speech, there would be a variety of other cues that your brain could use to help figure out the real identity of that middle sound.
But with computer-generated speech, or with a poorer sound recording, a lot of those other cues may not be there. And that can make the signal ambiguous.
In essence, this is akin to what is happening in the laurel/yanny case. The English “r” sound typically has a third frequency that is particularly low – which should make it really easy to decide when you hear an “r”. But in this particular recording, it isn’t clear whether what you are hearing is really a low third frequency or a high second frequency – and which interpretation your brain picks determines which sound you hear.
To make things more complicated, this particular recording not only has something that could be a high second or low third frequency, but also some low-volume (technically, low-amplitude) energy just below that point. If you make that faint piece louder, you’re more likely to hear it as the second frequency, and to hear the ambiguous piece as a low third. But if you make that low-volume part softer, you don’t hear it at all, and so you hear the ambiguous piece as a high second rather than a low third.
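The flip described above can be sketched as a toy decision rule. Everything here is an assumption for illustration – the frequencies, the faint-energy level, and the audibility threshold are made-up numbers, not measurements of the actual recording. The point is only the logic: whether the faint energy is loud enough to count as the second frequency determines how the ambiguous peak gets interpreted:

```python
# Toy model of the laurel/yanny flip (all numbers are made up for
# illustration). There is an ambiguous peak that could be a high second
# or a low third frequency, plus faint energy just below it.

def interpret(faint_gain=1.0, audibility_threshold=0.5):
    """Return which word the toy listener hears, given how much the
    faint low-frequency energy is boosted or attenuated."""
    faint_level = 0.3 * faint_gain  # faint energy, scaled by playback EQ
    if faint_level >= audibility_threshold:
        # Faint energy is audible and claims the second-frequency slot,
        # so the ambiguous peak is heard as a LOW THIRD -> "r"-like
        return "laurel"
    # Faint energy is inaudible, so the ambiguous peak is heard as a
    # HIGH SECOND frequency instead
    return "yanny"

print(interpret(faint_gain=2.0))  # boost the faint energy
print(interpret(faint_gain=0.5))  # attenuate it
```

This is also, in miniature, why the interactive frequency-emphasis demos work: changing `faint_gain` (the equalizer setting) pushes the same signal across the decision boundary.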
Ok, you’re thinking – that explains why some interactive demos that allow you to change the frequency emphasis can cause your perception to change. But why don’t we all hear it the same to start with?
This sound is clearly right near the border between two different percepts – and this is where small differences between listeners can really play out. We likely differ from one another in 1) our underlying hearing ability, 2) the quality of the speakers or headphones we’re listening to the illusion on (and the volume we’re playing it at!), and 3) our listening experience, shaped by the accents and speaking styles of our friends and family.
While these differences might not cause us to hear speech all that differently in most cases, they can push our perception just enough in unusual cases like this one.
What does all of this mean for the way we hear in general? First of all, it points to how complicated speech understanding really is. This is one reason we have had such a hard time making machines that really understand speech well – particularly in noisy environments! The computer may not be sure whether the energy at a particular frequency is actually part of the signal, or is an extraneous noise from the environment – and that can affect how it interprets the other sounds.
This can also help demonstrate why people don’t always recognize when they have hearing loss – they may have difficulty interpreting speech when some frequencies are soft, yet hear other sounds clearly, making it seem as if the people talking are simply mumbling. (In fact, listeners with age-related hearing loss tend to hear vowels just fine, but have difficulty with the consonants, which often contain some lower-amplitude sounds.)
Rochelle Newman is Chair of the Department of Hearing and Speech Sciences, as well as Associate Director of the Maryland Language Science Center. She helped found the UMD Infant & Child Studies Consortium and the University of Maryland Autism Research Consortium. She is interested in how the brain recognizes words from fluent speech, especially in the context of noise, and how this ability changes with development.