So with all of this Guitar Hero and Rock Band garbage going around lately, every kid out there wants to make a rhythm based Flash game. It turns out that there have been some fairly complex attempts at this, and it also turns out that I didn’t know that until AFTER I started looking into it on my own.
The story goes that some random person on a forum asked for some help with this. I suggested using Actionscript’s FFT option on the computeSpectrum function. Some other random person acted like the idea wasn’t worth investigating, so I decided to take a stab at it myself to prove this person wrong. My attempt would be pretty trivial and straightforward, but hopefully a “good start”
Turned out that it was really easy to make excellent beat detection this way. And by “excellent”, I mean “absolutely awful”.
You can check out the finished product here. There’s no preloader, so just wait (2.5mb). Not impressed? Might want to skip this. Think it’s at least *sort of* cool and *potentially useful*? Read on, youngin’!
computeSpectrum
The function we’re interested in is a static member of the SoundMixer class called computeSpectrum. The function takes in a byte array and returns raw wav data. The data is scaled to the +/- 1.0 range. The array returned will have 512 slots of this data, where the first two 256 slots are set aside for the left channel, and the second 256 slots are used for the right channel.
SoundMixer.computeSpectrum(some_byte_array);
Using this raw data, you can make a sweet graphical spectrum analyzer, and pretty much every single person who has read about the function immediately does that and posts their own version online. But we’re not looking for pretty pictures here–we want to detect some instruments in a song. The problem with this raw data is that it shows no correlation to the instruments being played. If only there were a way… SOME way, to transform this data into a format which would be better suited for our purposes.
Fast Fourier Transform
So there’s this thing called a “Fourier Transform”. And then there’s this other thing called “fast”. Stick them together and BAM, you have a Fast Fourier Transform.
I’ll go out on a ledge and assume most people understand the concept of “fast”. I’ll go out on another ledge and say that most people probably don’t know what a Fourier Transform is. I mean, if you did, then you probably wouldn’t be reading this!
Long story short, the Fourier transform will take our wav data and transform split it up into a bunch of frequencies. This way, data at one end of the spectrum will represent the low sounds, and data at the other end will represent high sounds. The mathematics for this transform are a lot of fun, but we’ll go ahead and skip them for now as Actionscript has the ability to perform this operation out of the box. I know, I know.. you really wanted to go out and code your own Fourier transform function, but stupid Actionscript has to go and make things easy. Oh well, what can you do?
SoundMixer.computeSpectrum(some_byte_array, true); // true means perform the fft... it's THAT easy
The Fourier transform is great for us because, let’s face it, a lot instruments are locked into a few frequencies. A bass drum, for instance, usually has a pretty low sound. A flute, on the other hand, generally has a pretty high pitched noise. If we had a some wav data which had nothing but a flute and bass drum playing, we could probably sort out which instrument was playing based on the frequencies we got out of a Fourier transform. Killer!
One thing to note, which isn’t really too important for this particular example, is that your data from the transform will be in a different range than your raw data (which was +/- 1.0). I think I read somewhere that the upper bound turns into 1.414… (sqrt(2)), but I’m not really sure.
Our Naive Approach
First, we’ll be trying to recognize “beats” from specific instruments, rather than something like “a guitar”. We do this because we want a simple starting point where we can reliably test our idea. This is also good as these “beats” usually belong to a very discrete set of frequencies. A bass drum drum usually makes the same sound every time, so does a snare, etc. A piano, on the other hand, covers a huge range of frequencies and, if you take into account slurred notes, we start to have some big issues to deal with. We’re not looking to write a dissertation here, just prove that the FFT can be useful for this sort of thing, somehow.
Second, we realize that approaches like “detect two beats and then predict the third based on the time it took between beats” is a pretty poor idea, as music doesn’t usually consist of a guy hitting a snare every 2 seconds exactly. If music was always reliably predictable, it’d be very, very boring and nobody would listen to it. Crazy, experimental techno songs, for instance, are often very chaotic and such an approach would be slaughtered by a song with that much variation.
What we’ll do, then, is pick a specific frequency and monitor it for change. Since this is just a proof of concept and not a scientific experiment, we’ll just randomly pick some values in our Fourier transformed ByteArray until we get some reasonable results. The song we’ll be testing with is called The Butcher and is by a guy named snyak. This is a good song for testing purposes because it starts off fairly quiet and becomes more noisy as time goes on, until it hits a pretty chaotic end–we can see how our instrument detection holds up as the song gets busier.
The instruments we’re aiming for here are the bass drum and the high pitched noise sound–I think it’s a distorted snare. Anyway, they’re the two sounds that alternate back and forth in front of the driving, low synth thing at the start of the song.
Code it up!
First, we grab our FFT ByteArray:
var frequency_data:ByteArray = new ByteArray();
SoundMixer.computeSpectrum(frequency_data, true);
Then, we figure out which spots in our array correspond to the frequencies most affected by the instruments we’re looking for. For a bass drum, the slot will probably be very early in the array, as a bass drum produces a sound with a very low frequency. For the snare, we’ll pick a much higher frequency, as the snare is sort of high pitched. To save everyone else the trouble of randomly stabbing at the FFT ByteArray for reasonable values, I’ll just post what I came up with:
private const BASS_DRUM_OFFSET = 0;
private const HIGH_NOISE_OFFSET = 480;
What this means is that our_byte_array[0] is the data we’ll be looking at for our bass drum, and our_byte_array[480] is what will hopefully contain data which is affected by our high pitched, distorted noise.
There are a few caveats, though. First, even at the mostly-quiet start of the song, there’s that ugly low synth to deal with. What we need is some way to figure out when the bass drum is actually hit.
If we watch our FFT spectrum (because we made one, btw), we can see a noticeable “spike” in the lower portions of our ByteArray when the bass drum is played. So if we can recognize this spike, then we can hopefully assume that the drum is being played.
To do this, we simply reject all values which are below a certain threshold–this way the low noises that drone on throughout the song won’t trigger the bass drum detection, but a loud, low noise will. To further refine this, we would make sure the the noise was “sharp” (to reject, say, a very loud, low synth), but we’ll skip that here for simplicity.
private const BASS_DRUM_THRESHHOLD = 0.9;
Arbitrary value? Hell yeah!
Anyway, the last thing we have to consider is that a high frequency sounds in our FFT byte array are represented with much smaller numbers. While our bass drum might return values of 1.2 or 1.3, our snare might only hit numbers like 0.2 or 0.3. We’ll go ahead and scale the snare drum values up, then, to make the two more comparable:
private const BASS_HIT_SCALE = 50;
private const HIGH_NOISE_HIT_SCALE = 255;
We end up with something along the lines of:
private function update_bass_drum(spectrum_data:ByteArray):void
{
spectrum_data.position = BASS_DRUM_OFFSET;
var magnitude:Number = spectrum_data.readFloat();
if(magnitude > BASS_DRUM_THRESHHOLD)
{
bass_drum.width = BASE_INSTRUMENT_SIZE + (BASS_HIT_SCALE * magnitude);
bass_drum.height = BASE_INSTRUMENT_SIZE + (BASS_HIT_SCALE * magnitude);
}
else
{
shrink_instrument(bass_drum);
}
}
This little ditty will read our “bass drum” value from the ByteArray and, if the value is above our threshold for the bass drum, it’ll increase the size of our bass drum graphic by the value in the byte array multiplied by our scale. If the value is not above the threshold, we send our bass_drum graphic to a function that’ll slowly shirnk the drum until it’s back to its original size.
The little snippets that have been provided probably aren’t good for anything more than a vague idea of what’s going on. You can get the full source right…. HERE
Results
Below are the files you can use to see how things worked out:
Source: View
SWF: View
Remember, no preloader on the swf–just wait. Also, ignore the dubstep stuff. I had originally intended to use a few different songs as examples, but hand tuning for one took so long that I decided not to.
In the end, the beat detection worked fairly well at the start of the song. There were a few false positives and some misses here and there. The “detections” aren’t too smooth–each beat causes the graphic to grow “violently”, as opposed to growing in a single, smooth motion. With a little bit of tinkering, I think those wrinkles could be ironed out.
As the song actually picks up, though, the bass drum detection loses it. This is likely because the bass synth gets much louder, so our threshold is constantly being broken. Turning up the threshold misses many bass hits, so an algorithm to detect “spikes” better would be in order. The bass hits are still being detected, but there are so many false positives that the bass drum graphic acts as if a beat is *constantly* being detected. The snare, on the other hand, is a little bit more successful. Still, by the end of the song it loses control.
So, at the end of the day, the experiment was a pretty big failure. With a lot of research, and maybe a much more finer grained spectrum (256 per channel just isn’t going to cut it, I don’t think), something reasonable could probably be produced.
The problem, though, is that we’re trying to detect beats for a wanna-be Guitar Hero/DDR game. Chances are that it would probably take much less time to listen for the beats by ear and manually record when they happen. This would also allow for hand tuning of the results, easier creation of difficulty levels, etc.
If nothing else, the FFT is still a great tool to use for aesthetic effects. Maybe our beat detection approach sucked for a Rock Band clone, but it’d still be workable as the basis for a subtle color transform, particle effect, etc in just about any game or website.
If you’re still interested in automatic beat detection, and all of this rough arbitrary guessing is way below you, then you might want to check out this article to get started with some more advanced stuff.
James | 06-Apr-08 at 4:40 pm | Permalink
I made “Monsters of Rock” a while back for a contest. You can check it out on my site if you like. That was a trip! It was one of my first trips in ActionScript 3.0 development. I tried the Sound.computeSpectrume() route at first, but it wasn’t until I was almost done getting it right that I realized that looking at the spectrum wasn’t really a smart idea. If the game is supposed to be a true guitar hero clone, then the user should only really care about playing the notes of the guitar. They could probably care less about the back beat, or the accompanying instruments.
I found it was much simpler to implement a pseudo-midi note tracking system. It places a falling note onto the screen that correlates to the time at which a specific guitar string should be, well, plucked. Granted, the downside of this is that I had to manually go through the song’s wav data and document all the notes myself. But in the end, it gave me a bit more control. I was able to implement sustained notes, where the user had to hold down the string button for the desired length of time in order to get a higher score.
Zack Jordan | 24-Apr-08 at 9:13 pm | Permalink
I think your examples are missing (404′ed, that is). Stick ‘em back up- I’d like to see them.
I do find it hard to believe that games like Guitar Hero are using anything like this for gameplay- but I think you came to that conclusion yourself. Audio has so much freaking information in it, and it’s the stuff you don’t think about that freaks an algorithm out Sure, vocals are in the mids, but sibilants can be all over the place. A bass drum is low but the actual transient- the part you use for beat detection- is quite a bit higher. Plus, as James mentioned, those games are adding a lot more information for gameplay then they could possibly get from the audio stream.
Regardless of all that, it’s a fascinating experiment that I may have to play with. Get your examples back up so I can see ‘em.
Douglas Thompson | 24-Apr-08 at 10:00 pm | Permalink
James - Nice game! I’m always impressed with how good your art is alongside your ability to program the stuff. Anyway, I agree with the hand-tuning of the notes. Even if you could perfectly isolate the sounds you wanted to, it’d probably be difficult to ‘automate’ difficulty levels, etc. And in the end, coming up with a perfect instrument recognition algorithm would likely take way more time than just doing it by hand anyway.
Zack - Should be fixed now… I had changed the capitialization on one of my directories and didn’t go around updating links. Unfortunately the example is REALLY REALLY BAD. I had just suggested the FFT to a guy who asked about beat detection on Newgrounds, and someone else basically said “No that’s dumb and useless” and I just couldn’t let it slide, so I made a quick demo in an hour or two. THEN when I threw this blog up, I decided to slap up a bunch of posts I’d made on Newgrounds as ‘filler’ until I got around to updating regularly (which I haven’t yet and probably never will). Pretty good story, huh? At any rate, this is one of the articles I like seeing the least, just because the results were so bad–just good enough to prove my point to some douchebag on a Flash forum who probably never read the post anyway, but not much else. The link I gave at the end of the article is a more serious attempt by someone else.
Also, sorry about the weird spacing in the code–it shows up aligned properly in my IDE. I know it never translates anywhere else properly, but I’ve been too lazy to fix it, heh. Like I said, filler!