Do the kids prefer "mp3 sizzle" ? Bullshizzle ! - Mastering Media Blog

Thursday, 12 March 2009


In the last week there has been a lot of attention paid to an informal study made by Jonathan Berger of Stanford University, which he claims shows that some young people (a) prefer the sound of mp3s and (b) that this is becoming more true as time passes.

To be quite frank, Professor Berger should know better.

The report describes how he:

tests his incoming students each year in a similar way. He has them listen to a variety of recordings which use different formats from MP3 to ones of much higher quality. He described the results with some disappointment and frustration, as a music lover might, that each year the preference for music in MP3 format rises. In other words, students prefer the quality of that kind of sound over the sound of music of much higher quality. He said that they seemed to prefer "sizzle sounds" that MP3s bring to music. It is a sound they are familiar with.

Leaving aside all the reasons there are to doubt the testing methods and results of an "informal study" like this, even if it's true that more and more students are choosing the mp3s, that doesn't necessarily mean they prefer the sound overall, especially for longer-term listening.

Why not ? First, the short version:

MP3 encoders typically don't have enough headroom to handle the very high peak level of modern CDs, and so introduce extra clipping distortion as well as all the encoding artefacts - this is the so-called "sizzle".

In a short-term A/B test, I can believe people would respond positively to the extra high-frequency distortion, just as they do to small level increases and quantisation distortion.

But I want to hear long-term testing.

Play those same students the same music for 2 or 3 hours straight - in CD and in mp3. Then, don't ask them if they can hear a difference or which they prefer - ask them how they FEEL.

My prediction is that there will be more irritable, edgy people with headaches in the mp3 pool.

Just like over-compressed high-level music, typical mp3 encodes are fatiguing to listen to, less involving, and sound less "real".

So, even though Professor Berger has observed a short-term preference for mp3-encoded audio, I don't believe it's possible to conclude from this that they would genuinely choose to listen to mp3s rather than CDs - or lossless audio formats like FLAC, for example.

Now, a little more detail:

It's been known for many years that in short-term A/B comparisons of otherwise nearly identical audio, people will prefer the version which is a little louder and as a result seems to have a fraction more bass and treble. This is because of a psycho-acoustic effect known as the Fletcher–Munson or "Smile" Curve, and may be an evolutionary adaptation to make us prioritise sounds which are louder and therefore nearer - and might be a predator.

What does this have to do with mp3s ? In a nutshell - encoding an mp3 from a modern CD release which constantly "maxes out" the level results in an mp3 which is more distorted than the original.

This is a result of "intersample peaks" in the decoded digital signal adding extra clipping distortion in the mp3 encoder - and possibly even more being added by the player itself. Couple this with the swirling, squelchy high-frequency artefacts caused by the data-reduction process of mp3 encoding, and you have the "mp3 sizzle" Professor Berger is talking about.
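Intersample overshoot is easy to demonstrate numerically. Here is a minimal sketch in plain NumPy (not any encoder's actual code) that estimates the true reconstructed peak by bandlimited oversampling; the classic worst case is a sine at a quarter of the sample rate whose samples all sit exactly at 0dBFS, but whose reconstructed waveform peaks about 3 dB higher:

```python
import numpy as np

def true_peak(samples, oversample=8):
    """Estimate the true (intersample) peak of a signal by
    zero-padding its spectrum, i.e. bandlimited interpolation."""
    n = len(samples)
    spec = np.fft.rfft(samples)
    padded = np.zeros(n * oversample // 2 + 1, dtype=complex)
    padded[:len(spec)] = spec
    upsampled = np.fft.irfft(padded, n * oversample) * oversample
    return np.max(np.abs(upsampled))

# A sine at fs/4 with a 45-degree phase offset: every sample lands at
# +/-0.7071, so normalising the *samples* to full scale leaves a
# reconstructed waveform that actually peaks near 1.414 (+3 dB).
n = 4096
k = np.arange(n)
x = np.sin(np.pi * k / 2 + np.pi / 4)
x /= np.max(np.abs(x))            # sample peak = exactly 1.0

print(np.max(np.abs(x)))          # 1.0 - the samples look "safe"
print(true_peak(x))               # ~1.414 - the waveform is not
```

A decoder or DAC that only allows for the sample peak will clip that overshoot, which is where the extra distortion comes from.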

So the students may well choose this slightly toppier, fizzier-sounding encoded version in a short-term A/B test, especially since modern mp3 encodes have fewer obvious artefacts than a few years ago.

But we shouldn't underestimate the acuity of the human ear and brain. A typical mp3 encode discards around 90% of the original data - and it's harder to listen to as a result. Even without intersample encoding distortion and obvious artefacts, mp3s don't sound as good as the originals in other, more subtle ways. There is a loss of "3D" stereo imaging, a blurring and flattening of the audio. Often mp3s sound as if they have less reverb than the original. Complex sounds lose their interest, the audio overall is less rich and involving - the result is harsher and more crude.

Especially with today's over-compressed, heavily processed music.

As a result, the brain has to work harder to "decode" the music. Listening to music in "real life" is an almost effortless process - listening to a CD requires a little more concentration. Right at the other end of the scale, listening to music squawking from the tiny speaker of a mobile phone, it's often a struggle to even pick out the tune. mp3s lie somewhere in the middle of this - thankfully, closer to CD than mobile !

Here's an experiment you can try yourself, though. Spend a day listening to your favourite radio station. Next day listen to the same station, but streamed on the internet. (The data-compression here will typically sound like an extreme version of mp3.)

How do you feel ?

Personally, listening to heavily data-compressed internet streams makes me feel nauseous. Literally - I don't mean that in some namby-pamby audiophile sense - an hour or two of internet radio and I start to feel slightly car-sick.

The same applies to mp3s, to a lesser extent. 

Now obviously not all mp3s are that bad. And some data-compressed audio sounds pretty good - the Ogg Vorbis streams from Spotify, for example, or Apple's AAC codec. (Ironic that everyone blames the iPod for mp3's ills, even though the iPod's own compression codec sounds substantially better than straight mp3.)

But just as mp3 lies somewhere between CD hi-fi and a mobile phone, it also lies somewhere between 24/96 PCM and 64 kbps internet radio - however in this case, closer to the bad end of the scale.

To summarise:

In a short-term test, the distortion/artefact "sizzle" may be appealing to some people, but give them the chance to use a lossless codec like FLAC for long-term listening, and I'm confident they will settle for the better quality.

The great news is that even if I'm wrong, mp3 is already on the way out - it won't be long now before player drive space and internet bandwidth make the requirement for data-compression a thing of the past, and we can all get back to appreciating good audio again.

I mean - does anyone remember how "great" AM radio sounded ?!?


vcmc said...

I'm not sure I entirely agree with your reasoning. With the way most mp3 encoding algorithms work, clipping is primarily introduced in one of two ways: through the removal of high-frequency energy - which, as you know, commonly has the smallest amplitude in most recorded music - or through conversion to mid-side "joint stereo" (vs. the discrete left/right of CD audio), though since the most prominent musical elements are generally mixed close to center, this is not usually a large source of change in absolute amplitude.

In most music, even hypercompressed pop already hard limited to -0.1 dBFS, the amount of signal actually truncated during the mp3 conversion process is generally a fraction of a dB.

I've a/b'ed music hard limited to -0.3db, -0.1db, and -0.01db and run through an mp3 encoder - and could not differentiate between them, though spectral analysis did indicate increased clipping/HF spikes as maximum amplitude increased. Then again, my hearing is only good to about 18k - perhaps on monitors with sufficient response in that range and above, clipping energy is audible to those with younger ears.

It'd be interesting to do a sharp low-pass filter above 17k or so and see if students still had a preference either way...

In any case, I wouldn't be surprised to find that "the kids" do, for some reason, prefer mp3-encoded audio. I'd be disappointed, but not surprised. Naivete comes in many forms...

ianshepherd said...

Thanks for the reply.

It's not my reasoning though, this is well-established. MP3 works by band-pass filtering the audio into 32 distinct bands and dynamically removing the ones which are (supposedly) inaudible. Filtering of this kind needs to be done with great care in order to sound ok, and typically it isn't - this is one place where the dramatic differences in encode quality stem from.

Regardless of the frequency content of the input, this can cause quite major changes in peak level. As another example I see all the time in mastering - adding a high-pass filter at (say) 40 Hz to remove confused sub-bass info will actually increase the peak level.
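A toy sketch may help make this concrete. The code below is an illustration only, not how MP3 actually works (real MP3 uses a 32-subband polyphase filterbank, an MDCT and a psychoacoustic model): it simply splits a spectrum into bands, discards the quietest ones, and shows that the reconstructed waveform is changed - which is why its peak level can move:

```python
import numpy as np

def crude_band_reduction(x, n_bands=32, keep=8):
    """Toy illustration of 'throw away the quiet bands': split the
    spectrum into n_bands slices and keep only the strongest few.
    Real perceptual coders are far more sophisticated; this only
    shows that removing bands reshapes the waveform."""
    spec = np.fft.rfft(x)
    edges = np.linspace(0, len(spec), n_bands + 1, dtype=int)
    energies = np.array([np.sum(np.abs(spec[a:b]) ** 2)
                         for a, b in zip(edges[:-1], edges[1:])])
    out = np.zeros_like(spec)
    for i in np.argsort(energies)[-keep:]:       # keep the loudest bands
        out[edges[i]:edges[i + 1]] = spec[edges[i]:edges[i + 1]]
    return np.fft.irfft(out, len(x))

rng = np.random.default_rng(0)
x = 0.5 * rng.standard_normal(4096)              # broadband stand-in for "music"
y = crude_band_reduction(x)

print(np.sum(y**2) < np.sum(x**2))               # True: energy is removed...
# ...but the sample peak of y is NOT guaranteed to be lower than x's,
# because the remaining components now sum differently in time.
```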

If you take almost any modern release peaking near zero, check it out in AudioLeak, then mp3 encode it and look again - you'll see the peak level increase by as much as 2 dB on some releases. Some players have enough headroom in the D-to-A to cope with that - most don't.

How audible this is depends on the material of course, and the playback equipment.

Even if this clipping isn't the "sizzle" that Prof Berger describes, my argument stands. Whatever superficial differences there are that the students say they prefer, my opinion is that in the longer term they will find the mp3 material less satisfying and ultimately fatiguing.

vcmc said...

Makes sense, my testing was done with high-quality variable bitrate LAME encoding - 128kbps necessarily has to alter the audio more and in doing so is likely to introduce more clipping. (Though I think much of this effect is through a reduction in the bit depth used to express audio information in each band, only removing audio entirely at very low encoder bitrates or for the highest frequencies.)

Perhaps the best encoders also anticipate where filtration/processing might introduce signals above 0dbfs and allocate bits accordingly so as to avoid clipping?
Now I'm curious which encoder was used...

ianshepherd said...

Sorry, no - the clipping is simply due to insufficient headroom in the encoders, and their not allowing for the fact that there may be intersample peaks intrinsic in the original audio. This effect is independent of the bit depth or encode data-rate. Check out the link in the post to read about intersample peaks and inadequate headroom.

Lou Kash said...

Quote: "only removing audio entirely at very low encoder bitrates or for the highest frequencies"

MP3 is surprisingly "lo-fi" even at the highest bitrates. Once last year it just made me wonder, so I made a sonogram comparison of a well produced pop song, converted from the CD to various lossy formats at various bitrates. Even at 320 kbps, MP3 obviously cuts off most of the signal above 16 kHz. (Not that it bothers me much though, as it's unfortunately already out of my range anyway... :)

Richard Tollerton said...

Ian, I've emailed Dr. Berger for clarification on the protocol and will post any responses on the lively HydrogenAudio thread on the subject.

In response to your post:

long-term testing: I'm not aware of any confirmed evidence that long term blind tests are any more sensitive than short term blind tests. In my own experience, switching times must be extremely short - perhaps no longer than a few seconds - to get maximum sensitivity to lossy encoding artifacts. This anecdotal evidence is backed up with extremely extensive research. In terms of emotional responses... well, given that I lived with BladeEnc for a year without ripping my hair out, I'm not sure how effective a long term test based on emotional responses would be. I've heard of no positive results coming out of long term blind tests and one negative.
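For anyone wanting to score the kind of blind tests Richard mentions, the statistics are simple: with random guessing, each forced-choice ABX trial is a coin flip, so the p-value is the chance of doing at least that well by luck. A minimal calculator using only the Python standard library (the 12/16 criterion below is a commonly used threshold, not one prescribed by any of the tests discussed here):

```python
from math import comb

def abx_p_value(correct, trials):
    """Probability of scoring at least `correct` out of `trials`
    by pure guessing in a forced-choice ABX test (binomial, p=0.5)."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

print(round(abx_p_value(12, 16), 4))   # 0.0384 - under the usual 5% bar
print(round(abx_p_value(10, 16), 4))   # 0.2272 - not convincing
```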

MP3s and headroom: I think we can reasonably, feasibly make a distinction here between an intrinsic defect in the format, and defective decoders. Simply put, every decoder without adequate headroom to deal with encoder clipping is defective. And individual decoder defects should not be held against the encoders as a whole.

Everybody decoding with ReplayGain, SoundCheck, or some form of audio stack with built in headroom control and/or compression (read: Vista) is going to satisfy this. Potentially, other devices have some built-in headroom that we just don't know about.

This certainly isn't a majority of listening environments, but it's still a lot of people, and it's a number that should grow. But complaining about the format when it's not the fundamental problem is not going to make this better.

Anyways, IIRC, AAC gives higher peaks than MP3 does at 128k, so I'm kinda suspicious that this is even a clipping problem to begin with. Just because MP3 is lower quality doesn't mean it gives absolutely greater peaks.

Lou: sonograms are only really sensitive to lowpass filtering for evaluating encoder artifacts, which as you've discovered don't matter all that much to begin with. The ears remain the best evaluative tool.

ianshepherd said...

Hi Richard,

I emailed the Stanford address marked for Prof Berger's attention, but so far have had no response. If you have a direct address please pass it on so I can contact him.

To be clear - I'm not attacking mp3 here. I listen to most music on my iPod, which uses AAC, but that's still lossy compression.

I'm criticising the conclusion he draws - or rather, the way he does it. He may be right - perhaps young people are so used to mp3 artefacts that they actually expect them - in my generation the equivalent was getting used to listening to cassettes with no Dolby B, i.e. hearing a great deal of HF compression.

But I don't believe it's valid for him to say "young people prefer mp3" based on this. THAT's my key point - I raised the issue of clipping because I thought readers of this blog would be interested, but I agree it may not be the sole issue.

Although - let's say you're right that many encoders DO have sufficient headroom - how do they handle the additional peak information ? Do they pad the level to take account ? Not in my experience. If not, then further clipping or limiting has to take place. And PLAYERs certainly don't usually have sufficient analogue headroom - not CD players, let alone mp3 players.

Both short-term A/B and long-term tests are useful. Personally I can often hear mp3 artefacts without a reference (even up to 320 kbps), but obviously this varies from person to person.

Try the long-term testing yourself, you may be surprised at how effective it is !

To repeat - personally I dislike mp3, but I'm not suggesting the format is flawed. But I strongly question the logical leap from "some students say they prefer the mp3 versions of the tracks in this test" to "young people prefer the sound of mp3".


Richard Tollerton said...

I used his login address, I'd imagine it's the same one that you used.

I'll wholeheartedly agree with the statement that MP3 is "flawed" insofar as it has well-known encoding deficiencies which have been successfully attacked with ABX tests in the high bitrates, that other, more modern codecs do not have. And, of course, it has very inferior performance at bitrates below 128k. If Berger posed this result for 64k MP3 I actually would believe it a bit more....

Besides that, I completely agree with your premise, and I'll look into seeing what I can do with long-term blind testing.

Encoders generally make no attempt to deal with clipping issues, because they operate almost entirely in floating point up to the point of quantization. LAME IIRC may have a slight attenuation (0.95? 0.99?) to take the edge off of a lot of samples that would otherwise clip.

I assert (admittedly, without evidence) that for the vast majority of listening environments for lossy music, the dynamic range of the DAC (and its output stage) is vastly, incomprehensibly larger than what is required for inaudibility. In fact, I would go so far as to say that every iPod listener who does not use IEMs and does not listen inside a soundproofed room would not be able to tell if their SNR was reduced by 30db. Even on my Ety ER-4Ss in a quiet room, the iPod only has a barely audible amount of hiss for extremely loud playback.

Players have lots of dynamic range to burn here. Hissy iPod docks and CD players, not so much - but given that the headroom work is being done upstream of that, the amplifiers aren't going to care whether it's there or not.

ianshepherd said...

Hi Richard,

I'm not sure you're right about encoders using floating point - some of the biggest names in pro audio didn't, until relatively recently. And sadly, even if they do, it doesn't really help unless decent limiting is used when the final file is created.

As far as the headroom of DACs is concerned, you're absolutely right, there is masses of headroom available - but it's not used, typically. There is a deeply entrenched mindset that in a digital system there is no need to allow for level exceeding 0dBFS - and let's face it, if there were more CDs with sane levels in the world, that would be a perfectly reasonable assumption.

Sadly we know this isn't the case.

As an example, the last time I tried using a "boost" EQ on an iPod, it distorted like crazy. If there was headroom available, it wasn't being used.

All this stuff is perfectly possible, it's just that most of the time, it's not happening.


Dean Whitbread said...

I thought I'd test my own ears, aged 47, and fading, by running an experiment based on environment as a key informer of subjective quality.

I have an audio cassette recorder here, which takes chrome tapes and has Dolby C. I recorded from 12" vinyl straight to tape, and went out wearing a Sony Walkman v1. It sounded great, and I was listening to Heaven 17's "Fascist Groove Thing" feeling nostalgically hip in my local until the tape jammed. I removed the cassette which had become tangled in the mechanism, and using an old trick with a six-sided pencil, re-tensioned the oxide strip.

Nothing like the sound of tape compression, hiss, warble and a song running several BPM slower as the batteries run out, IMHO and it beats MP3 every time.

Richard Tollerton said...

"some of the biggest names in pro audio didn't [use floating point encoding], until relatively recently. And sadly, even if they do, it doesn't really help unless decent limiting is used when the final file is created."

??? Like what names? I'd be shocked if any encoder used in mainstream pro audio operated in fixed point, at any time. At least, LAME is double precision, and I'm 99.9% sure that Fraunhofer, and Quicktime are too (not to mention anything else based on dist10). That covers the entire PC encoder market, right? And pro MP3 encoding is done on PCs instead of DSPs, riiiiiight?

By the time clipping occurs in the encode, the data is already in the frequency domain, so limiting generally is not possible.

On 0dbFS - Given the relative ease of solving a lot of loudness issues by attenuation, I'd argue that an alternative, deeply entrenched mindset needs to be overturned instead: that peak output needs to be anywhere near 0dbFS in the first place. MP3Gain your iPod music and all of your eq distortion issues will probably go away. A lot of hardware would have to change to actually support fixed-point data with a shifted decimal point. It's so much easier to keep the 0dbFS restriction and just attenuate everything.
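The "just attenuate everything" idea is trivial to express in code. A sketch of the principle only - note that MP3Gain itself actually adjusts the global-gain fields inside the MP3 frames losslessly, rather than scaling decoded samples as done here:

```python
import numpy as np

def with_headroom(decoded, headroom_db=6.0):
    """Attenuate decoded float samples so that encoder/intersample
    overshoots of up to `headroom_db` survive conversion to fixed
    point instead of clipping at 0 dBFS."""
    return decoded * 10 ** (-headroom_db / 20)

# Decoded mp3 audio can legitimately overshoot 1.0 (0 dBFS) in float.
decoded = np.array([0.4, 1.12, -1.05, 0.9])
safe = with_headroom(decoded)

print(np.max(np.abs(decoded)) > 1.0)   # True: would clip a 16-bit DAC path
print(np.max(np.abs(safe)) <= 1.0)     # True: overshoot preserved, not clipped
```

The cost is 6 dB of signal-to-noise, which is the trade-off being argued about in this thread.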

Dean: that could just as easily be evidence for each generation being partial to their own portable music technology than for MP3s being intrinsically inferior :) That kinda makes me want to try my old portable CD players sometime.

ianshepherd said...

Well, ProTools, to start with. I'm not even sure it uses floating point now. Not so long ago the internal mixer wasn't even correctly dithered.


As far as 0dBFS goes - Richard, you're not listening. All this stuff is easy to solve but it's not being done.

For one thing, in a world where technical specs are everything, no-one is going to allow 6 dB of headroom and let their signal-to-noise figures look worse than the competition.

And again - all the floating point in the world won't help once the final signal is reconstructed. If it introduces peaks higher than in the original signal - and at today's levels IT WILL - those peaks have to be attenuated, limited, or clipped. And they're not being attenuated.

Try it yourself - use AudioLeak to look at the peak levels of a track before and after mp3 encoding. iTunes routinely introduces peaks of up to +2 dB, for example.
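The measurement itself is just a sample-peak reading in dBFS. A NumPy sketch of the check (the 1.25x factor below is a hypothetical stand-in for a real encode/decode round trip, not the output of any actual codec):

```python
import numpy as np

def peak_dbfs(samples):
    """Sample peak of a float signal, in dBFS (0 dBFS = full scale)."""
    return 20 * np.log10(np.max(np.abs(samples)))

original = np.array([0.2, 0.999, -0.999, 0.5])   # a "maxed out" master
decoded = original * 1.25                        # stand-in for encoder overshoot

print(round(peak_dbfs(original), 2))   # -0.01 dBFS
print(round(peak_dbfs(decoded), 2))    # 1.93 dBFS: ~2 dB over full scale
```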


Richard Tollerton said...

Pro Tools MP3 encoding is in floating point. It uses the Fraunhofer codec and I can assure you that it is at the very least single precision under the hood. I just disassembled MP3.dll (it's available for free off Digidesign's website) and it has quite a lot of x86 and SSE floating-point code in there - that's hard to explain if the signal path is actually fixed point!

Pro Tools's internal signal path is ultimately irrelevant to the internal signal path of the encoder. And it's a lot more effort to build a fixed-point encoder than a floating-point encoder (for questionable performance improvements on PCs) - Fraunhofer or Digidesign would be making great hay of that if it were true.

That said, I think I'm mistaken in the first place with making the connection between clipping and floating point. I think I'm barking up the wrong tree. The MP3 output data is obviously fixed point, and we both know that it can represent >0dbFS. (I've seen AACs with +5dbFS peaks so the issue is not news to me.) It can do this because the data is in the frequency domain. Nothing prevents an encoder from being built entirely in fixed point that encodes the same thing, as long as the input samples are <0dbFS - insofar as the frequency components never exceed 0dbFS.

What I should have said a couple posts ago was what I said last post: encoders can't deal with clipping, because by the time it occurs, it's in the frequency domain. There's no such thing as a peak limiter on a spectrogram. One could imagine schemes to avoid it in the encoder but they may have a grave risk of audibility.

But by the same token, they can't handle intersample peaks, either. But more importantly, they don't have to! That is: if a sine signal has an intersample peak >0dbFS but all the samples are <0, as far as the FFT is concerned, it's still a sine wave, and will happily encode that >0dbFS sine wave. And insofar as the entire process is essentially linear, the same argument applies for arbitrary audio signals. I think that's the more important disagreement here: you think intersample peaks on the input signal are an important failure mode of encoders; I don't. Lowpasses and frequency domain quantization are far more important in affecting output peaks than intersample peaks on the input. (Of course, if significant intersample peaks do exist on the input it is extremely likely that a lot of HF energy exists which will get lowpassed and up the peaks even further - but that is correlation, not causation.)

I think it would be fair for playback manufacturers to attenuate their decoders by 6db (defeatable of course) but still quote their SNR and DR figures without attenuation, as if the headroom was 0db. It's all the same under the hood anyhow. As long as the default headroom setting is quoted, and it is defeatable, there is no ambiguity, and listeners will receive higher quality sound.

ianshepherd said...

Hi Richard,

You're right, the internal signal path of PT is irrelevant, although I was astonished to find it was fixed-point until relatively recently - SADiE has been 32-bit float for over 10 years. My point is, the software is seldom written in the best/most effective/efficient way, and this certainly can't be assumed or relied upon.

And, you may well be correct that any intersample clipping is less significant than the actual mp3 filtering itself.

However again we agree/disagree - I don't see it as an issue of time vs frequency, that's just different ways of representing the same data. Encoders could deal with the issue of peaks, intersample or not - by padding the input. But they don't. I'm not saying this is their failing, necessarily.

Why should they be designed for input audio with such ridiculous properties ? Who would have thought that a medium with over 96dB of dynamic range would have music squashed into the top 6 ?!

Finally, you're right to highlight low-pass filtering as an essential part of a good encode, but again this isn't something that can be relied upon. So many mp3s have not been filtered, and have "twinkly", "swirly" or "swishy" HF artefacts as a result.

All these factors contribute to a greater or lesser extent to the sound of an mp3 - and all might be recognised as familiar and chosen as a preference in short-term trials.

My objections to Prof Berger's conclusions still stand.

Richard Tollerton said...

I can't disagree with that!

Richard Tollerton said...

Did you ever hear back from Dr. Berger? Should one of us give him another ring?

ianshepherd said...

I haven't heard anything, no, and I must admit my attention has been elsewhere... let me know if you can get in touch with him !

James said...

Hmm -- why would a half-generation of listeners not have some sort of romantic attachment to the kinds of sounds they've been listening to for years?

Got to be said that most engineers have some kind of attachment to.. distortion - in some form or another.

I feel your anger! (and fears?) I just don't think the battle is with the results of Dr. Berger's study - surely it just proves that the battle should be directed at the aggregators and mp3 providers that don't offer uncompressed downloads.

Some do -- Bleep, Warp Records' download store, offers FLAC for some releases... why don't iTunes and the rest? .. big surprise that you can't play FLACs in iTunes or on your iPod.

HDD space too expensive?
Bandwidth too expensive?

My meager 2 GB iPod paid for at least a few terabytes.

Listeners can't be expected to demand what they don't know they're missing out on.. so can Apple's arm ever be twisted an inch?


ianshepherd said...

My irritation isn't aimed at the results - it's at his methods, which I think are flawed, and the conclusions, which I think are unwarranted. It's a good point about people not knowing what they want, though - and, of course, as bandwidth and HDD space increase, the debate will be forgotten.

I guess I'm complaining about Bad Science, mainly :-)

On the positive side, iTunes offers Apple Lossless files, I think ?

Dr. Sean Olive said...

I've done some of my own controlled listening tests with high school and college students to determine their preferences for MP3 (Lame 128 kbps) versus CD. The college students preferred CD over MP3 in 72% of the trials, and the high school students in 69% of the trials. There was no evidence that any of the students tested preferred MP3 over CD.

In a double-blind loudspeaker test the same students preferred the most neutral, accurate loudspeaker. So Generation Y prefers good sound over bad sound when given the opportunity to directly compare them under controlled conditions.


Ian Shepherd said...

Hi Sean,

Thanks for that - very heartening ! I'm going to take a look and maybe do a post about this over on my new site.

Thanks again !




Jason Policy said...

I stumbled upon a couple of posts on KVR and Reddit asking for an MP3 "watery effect". One person said it sounded "much cooler" than the original! According to another poster, a distortion effect of this kind gets requested twice a year.

This "soft" sound could perhaps be described as blurring in other words, and it is introduced when the signal is split into bands and they are separately adjusted in some way: in denoisers, harmonizers, pitch shifters. Similar softness is heard in some older recordings that predate DSP; I don't know what causes it there. I don't like this effect one bit.

If by "sizzle" only the harsh sound of clipping is meant, I find it orthogonal to MP3. That was the sound of the past: linear interpolation, sound cards that clipped early, intentionally overdriven samples. With modern loudness levels this low-fidelity sound is returning in the releases of young artists such as "Lights". Given the choice between harsh and watery, I would pick the first. For example, if an early digital recording gets processed with a denoiser (the dequantizer in Stereo Tool), it becomes blurry and watery. The real SNR doesn't improve, of course, but the noise is made unrecognizable. Someone posted a release of MOD music which he had converted to "16-bit" with the aid of a denoiser. Oh dear.

I'm familiar with the concept of intersample peaks, but I can't hear them in real music at all.

As for the internal dynamic range of MP3: I tested encoding at levels from -200 to +60 dB (!) with LAME. Accurate software such as mpglib and Reaper recovered the signal without noticeable artifacts. The coding error stayed relatively constant. SIMD-accelerated decoders (ffmpeg) failed early (accurate range reduced to about -100/+10). The format itself has immense foot- and head-room, with the potential for steganography that won't register even on a 24-bit scale. One could say that range is wasted. On the other hand, iZotope RX5 surprisingly decoded in fixed-point.

Headroom of other formats/implementations: Nero AAC 1.0.7, over +60 dB; Opus 1.1, over +60 dB; Fraunhofer AAC, +18 dB; Apple qAAC, +14 dB; Vorbis 1.3.5, around +20 dB; Dolby AC-3 7.0, around +3 dB. The two AAC encoders seem to have the input samples clamped as a safeguard.

Good old FhG ACM codec and Producer Pro encodes up to the 16-bit limit, the internal precision must be higher. Later encoders, notably FhG FastEnc, have raised ATH, and as a result DR of about 14-bit, which also leads to "watery" sound if the playback level is boosted. Almost all modern programs have the FastEnc codec.

Peak level of MP3 is relatively stable, but only at high bitrate. However, unlike all other codecs, Opus now includes an undefeatable HPF, which, along with potentially adding clicks, changes the shape of the waveform.

AM radio did sound better in the past. It was not bandpassed with unnatural sounding digital brickwall filters. A station was allowed to overlap with neighboring channels, which usually didn't cause any problems because the local station was much stronger.