Speech Recognition with ChucK

Inventor

I have been interested for some time in creating a simple, crude yet effective speech recognizer or more exactly a phoneme recognizer to help people who have difficulty hearing. My concept is to look at the FFT spectrum of the speech and have pattern recognizers fire off when they think they have heard a phoneme. The output would be something like this: "m-ah-r-ee h-d a l-eh-t-l l-a-m", for "Mary had a little lamb", where the computer does part of the work and the human does the rest of the work. That is, the computer converts the phonemes into text output and the human learns to interpret that crude text output.

The development path is to prototype the system using a software tool, and ChucK is just perfect for that, and then to translate the resulting algorithm to a little PIC chip so that simple portable devices could be constructed to aid the hearing impaired. Until I found and learned some of ChucK, I did not have the skill set to prototype this effectively, but I do now. Take a look at this little video that I made of me saying the phoneme "ee" several times. It is only ten seconds long at 30 FPS, but you can download it and then set your player for looping if you wish to view it for a while.

http://www.freedomodds.com/music/movies/FFTScrollee.MPG

Notice that the sound has the same general characteristic shape each time. It seems to me that it will be quite reasonable to write a set of mathematical rules that look for peaks and valleys and match a pattern to detect the phoneme. I know that neural nets would be perfect for this, but I never really got the hang of the backprop algorithm, so instead I am planning to hand-code rules like: "if there are two or three peaks, the second peak is higher than the first, and the second peak is approximately twice the frequency of the first, then print 'ee'."
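A minimal sketch of the kind of hand-coded rule described above, written in Python rather than ChucK for brevity. The function name, the tolerance on the 2:1 frequency ratio, and the sample peak values are illustrative assumptions, not taken from the thread.

```python
def looks_like_ee(peaks, ratio_tolerance=0.15):
    """Apply the 'ee' rule to a list of (frequency_hz, magnitude) peaks.

    The peaks are expected in order of increasing frequency.
    ratio_tolerance is an assumed slack on "approximately twice".
    """
    if len(peaks) not in (2, 3):
        return False
    (f1, m1), (f2, m2) = peaks[0], peaks[1]
    if f1 <= 0:
        return False
    second_is_higher = m2 > m1
    roughly_double = abs(f2 / f1 - 2.0) <= ratio_tolerance
    return second_is_higher and roughly_double


# Made-up peaks: the second is taller and sits near twice the first frequency.
print(looks_like_ee([(300.0, 0.6), (610.0, 1.0)]))
# Same frequencies, but the first peak is the taller one.
print(looks_like_ee([(300.0, 1.0), (610.0, 0.6)]))
```

One rule per phoneme in this style is easy to read and tune by hand, but every threshold has to be chosen manually.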

At any rate, it is worth a shot to do something like this because of all the people it could help in so many ways. Products would include a hand-held device with a display, a super miniaturized one that fits on a pair of glasses, or a TV subtitle generator. Based on a PIC chip, the products could sell for $20 or less and still have a healthy profit margin.

Anyway, that's the concept and I'll be working on it. Oh, also here is a download of the ChucK files that create the scrolling FFT from microphone input.

Attachments:

rec_in.ck (383 bytes): Just a simple microphone-to-dac listener so you can record audio and/or make an FFT graph with ChucK.

FFTScroll.ck (2.84 KB): Listens to the dac and makes a POV-Ray file that renders into a scrolling FFT.

Kassen · Posted: Thu Nov 08, 2007 12:46 pm Post subject:

Yeah, I like it a lot too

For in-depth questions you may want to cross-post to the list because Perry Cook is on the list. Perry is Ge's teacher/adviser/$academic_term and co-author of ChucK, BUT (and this is where it gets interesting) he also did research/experiments into vocal synthesis.

I know this is the other way around but if you're going to re-express the characteristics of phonemes in ChucK code he's likely your best bet for questions and he's on the list but not the forum (as far as I know).

Inventor · Posted: Fri Nov 09, 2007 4:01 am Post subject:

Kassen · Posted: Fri Nov 09, 2007 8:04 am Post subject:

Coolness.

What kind of game is this?

Inventor · Posted: Fri Nov 09, 2007 9:07 am Post subject:

Kassen · Posted: Fri Nov 09, 2007 9:42 am Post subject:

Hmmmm, I think ChucK could even work as a real voice over IP thingy using OSC with encryption based on a stream-cypher set loose on FFT blobs?

Let me think about this, it would be cool to use an assignment that involves actual decoding. It would also be fun to implement actual strong crypto in ChucK.

Inventor · Posted: Fri Nov 09, 2007 10:50 am Post subject:

Here's an update on the speech recognizer: After a few hours I have gotten it to the point that it can almost recognize an "a" as in "hay" and an "ee" as in "teeth". However, sometimes it recognizes "a" as "eeaee" or some mix like that, and sometimes it does not recognize "a" at all. It does, however, often recognize "ee". This is a promising result for just a few hours of coding, but it looks like there is a long way to go. Well, I did say baby-steps, didn't I?

dewdrop_world · Posted: Sat Nov 10, 2007 4:43 pm Post subject:

Really interesting idea... I've thought about it for SuperCollider but never had time to look in depth.

There must be a huge amount of comp-sci research on phonetic recognition. Rather than reinvent the wheel, you might make progress a lot faster by looking up some algorithms that are known to work.

http://en.wikipedia.org/wiki/Speech_recognition

This one at first glance (found thru google) looks interesting:

http://www.owlnet.rice.edu/~elec431/projects97/Dynamic/main.html

James

Inventor · Posted: Sun Nov 11, 2007 10:41 am Post subject:

Kassen · Posted: Tue Nov 13, 2007 1:39 am Post subject:

Oh. Dear.

A simple Google search told me this is way over my head as well, as of yet. Some rather informal thinking over morning coffee is telling me just plain FFT might not pull it. Assuming we are talking English, I'd say the difference between "sick" and "tick" would be hard to tell by pure spectrum analysis and would also depend on transient detection. If we don't use "normal" transient detection there we will end up with very short FFT frames, which might not be that bad from a spectral resolution point of view (at least not for speech), but it will result in a LOT of data, probably more than is useful to sort the voiced sounds.

I'd say we might be best off first trying to build an algorithm to deal with music recognition. Sorting a picked banjo E from a strummed G would be way easier than sorting "tick" from "sick", but the first would at least bring us closer to the second.

I do have a friend who's a mathematician who tries to model some types of economic behaviour. He may know about HMMs, and I could ask him about where to start, but I fear there really won't be a "for dummies" version.

Inventor · Posted: Tue Nov 13, 2007 4:00 am Post subject:

Sigh, I think you're right, Kassen, there is no easy explanation that I can find on the web for HMMs, but the good news is that in the process I learned more about the problem. One common way to solve the problem (and it has been solved over and over, as dewdrop_world suggests) is feature extraction followed by pattern recognition followed by hidden Markov models. Also, I have learned that the HMMs are used for detecting words from phonemes, not for detecting phonemes. In this example we are offloading the HMM task to the human brain, so - whew - no need to learn about HMMs.

I have a good beginning of feature extraction by peak detection, but I have learned that I must modify my peak detection to recognize adjacent peaks (just a detail). Then the pattern recognition, though best done with a neural net, could be done with rules. I also learned that there are only about 40 phonemes to recognize, and we could probably get away with a subset of those, say 30 or so, because we only want to approximate the speech. Plus, one tidbit I picked up was that 310 ms is the ideal time duration for each FFT sample, so if I set up the FFT to capture roughly 1/3 of a second worth of samples (about 8k or 16k samples at a 44,100 Hz sample rate), then the snapshot of the sound would be the right size.
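The window-size arithmetic above can be checked with a few lines of Python. This is only a sketch: it assumes "8k" and "16k" mean the power-of-two FFT sizes 8192 and 16384, which the post does not state explicitly.

```python
SAMPLE_RATE = 44100   # samples per second, as given in the post
TARGET_MS = 310       # window duration quoted in the post


def window_ms(n_samples, sample_rate=SAMPLE_RATE):
    """Duration in milliseconds of a window of n_samples."""
    return 1000.0 * n_samples / sample_rate


# Number of samples that would cover exactly the target duration.
target_samples = round(SAMPLE_RATE * TARGET_MS / 1000.0)
print("samples for", TARGET_MS, "ms:", target_samples)

# Assumed candidates: the two power-of-two sizes either side of the target.
for n in (8192, 16384):
    print(n, "samples =", round(window_ms(n)), "ms")
```

Under these assumptions the target falls between the two sizes, so 16384 overshoots 310 ms and 8192 covers only a bit more than half of it.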

At this point I'm not overly concerned about the difference between similar words like "sick" and "tick" because the brain should usually be able to determine by context what the meaning is, or so I believe. Sorry I have not done any more coding since I posted last, but over the weekend I suddenly got a lot of traffic on my sports site, which is now averaging over 150 hits a day, so I've been updating the site and I even added a radio show. At least I got a little research done and I feel pretty good about the phoneme recognition project's possibilities. Oh well, they say if you're not moving forward, at least stay pointed in the right direction, haha!

Inventor · Posted: Tue Nov 13, 2007 6:33 am Post subject:

OK, this morning I completed movies for the five vowel sounds: a, e, i, o, and u. I am pleased to report that although the a and e are similar, the others are all quite different. I have a peaks feature extraction that tells me where the normalized harmonics above a threshold are, so I can count the peaks and look for harmonics of the first peak as two of the pattern recognition cues. I can also look for which peak is the maximum peak as a cue.

Now I think I need a centroid-like thingie that tells me where the average value of the spectral content is, then I can compare that as a ratio to the maximum peak to act as a pattern recognition cue. I wonder how to do that. For one thing I remember seeing in an example that I can get a centroid out of the FFT somehow, but I don't know what that centroid means.

Thinking out loud, I think what I need to do is something like in sophomore statics mechanical engineering class, pretend that each frequency bin is a force acting on a lever hinged at f=0 and find the location of the normalized unit force that would equate to all those. So for example loop from f=1 to fmax and sum up the fft(f)*f/fmax components. Then maybe that sum is the unit force location. Not sure. How would you make the calculation?
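For reference, the usual spectral centroid is the magnitude-weighted mean frequency, which matches the lever picture: the sum of the moments divided by the sum of the forces. A minimal Python sketch follows; the function name and the toy spectrum are illustrative assumptions. Note that, unlike the sum proposed in the post, it divides by the total magnitude, which turns the result into a frequency.

```python
def spectral_centroid(magnitudes, sample_rate, fft_size):
    """Magnitude-weighted mean frequency in Hz.

    magnitudes[k] is the FFT magnitude of bin k, whose centre
    frequency is k * sample_rate / fft_size.
    """
    total = sum(magnitudes)              # the sum of the "forces"
    if total == 0:
        return 0.0                       # silent frame: no meaningful centroid
    bin_hz = sample_rate / fft_size
    # The sum of the "moments" about f = 0.
    moments = sum(k * bin_hz * m for k, m in enumerate(magnitudes))
    return moments / total


# Toy spectrum with 100 Hz bins: equal energy at 100 Hz and 300 Hz.
print(spectral_centroid([0.0, 1.0, 0.0, 1.0], sample_rate=800, fft_size=8))
```

With equal energy at 100 Hz and 300 Hz the balance point lands halfway between them.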

Kassen · Posted: Tue Nov 13, 2007 12:40 pm Post subject:

This is exactly the sort of problem that makes me []ick!

*stick tongue out, moves hands next to ears*

Nah, a while back I linked here to an algorithm for tempo-dependent shuffle timing, that came from research from a department that deals with exactly this sort of issue that had decided to first look at music and then move on to speech.

Going to look into your last question in more detail tomorrow.

Inventor · Posted: Tue Nov 13, 2007 1:45 pm Post subject:

Inventor · Posted: Tue Nov 13, 2007 2:03 pm Post subject:

OK, here it is, I just had to add a print statement to make it work. The following "notes" are from my voice when I say "electro" into the microphone:

7 8 6 9 12 8 5 5 5 5 5 9 10 10 11 12 11 10

The numbers are frequency bins, AKA frequency indexes from the FFT, so to really polish off the application I would have to convert that into a frequency and then use the frequency to MIDI function in the libraries to report an actual "note". But anyway, you might give it a try (or someone else) to see if it does the job of recognizing notes properly.
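The conversion described above can be sketched in Python as follows. Bin k sits at k * sample_rate / fft_size Hz, and the standard frequency-to-MIDI mapping is 69 + 12 * log2(f / 440), the same mapping ChucK's Std.ftom provides. The FFT size of 1024 is an assumption, since the post does not say what speech2text2.ck uses, so the printed notes are only illustrative.

```python
import math

SAMPLE_RATE = 44100   # assumed sample rate
FFT_SIZE = 1024       # assumed; the FFT size is not stated in the post


def bin_to_hz(bin_index, sample_rate=SAMPLE_RATE, fft_size=FFT_SIZE):
    """Centre frequency in Hz of an FFT bin."""
    return bin_index * sample_rate / fft_size


def hz_to_midi(freq_hz):
    """Fractional MIDI note number (A4 = 440 Hz = note 69)."""
    if freq_hz <= 0:
        return None   # bin 0 (DC) has no pitch
    return 69 + 12 * math.log2(freq_hz / 440.0)


# The bin sequence reported in the post for the word "electro".
bins = [7, 8, 6, 9, 12, 8, 5, 5, 5, 5, 5, 9, 10, 10, 11, 12, 11, 10]
for b in bins:
    hz = bin_to_hz(b)
    print(b, round(hz, 1), "Hz -> MIDI note", round(hz_to_midi(hz)))
```

At this assumed FFT size neighbouring bins are about 43 Hz apart, which is coarse for low notes. A larger FFT gives finer pitch resolution at the cost of time resolution.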

Note that you must ChucK-up both rec_in.ck and speech2text2.ck to make it work right. Alternatively you could play a chuck file and also add speech2text2.ck to the list of programs and the speech2text2.ck file will just grab your notes off of the dac. Hope it works right.

Oh, and thanks for taking a thoughtful look at my ramblings in the morning, Kassen. Cheers!

Attachments:

rec_in.ck (294 bytes): Microphone listener, just sends the mic to the dac.

speech2text2.ck (3.31 KB): Simplified speech2text ChucK program, prints out the frequency bin of the maximum peak in the FFT.

deknow · Posted: Wed Nov 14, 2007 8:03 am Post subject:

...if doing this as a diy programming project is the point, then i can't be of much help.

...that said, you could easily use naturallyspeaking as an intermediary, and make it a "proof of concept"....after that, there are a number of sr products designed to run in a cell phone, in voicemail, or a car that could eventually be appropriated.

remember that 'tick' and 'sick' are close together, but not as hard to deal with as "excuse me while i kiss the sky" vs "excuse me while i kiss this guy" or "ice cream" vs "i scream".

with naturallyspeaking, you could start with a completely empty vocabulary, and only add the phonemes you want to recognize, and have the output be whatever you want.

Inventor · Posted: Sat Nov 17, 2007 8:25 pm Post subject:

Yes, deknow, I guess using naturallyspeaking would be an excellent and simple way to prototype this concept if I had a copy of it. That way the mechanics underlying the process of the phoneme recognition would be already dealt with - and very well also, allowing us to concentrate on evaluating whether or not a stream of phonemes really does make for an intelligible speech output.

Unfortunately I don't have a copy of a speech recognition tool and I don't know of any freeware ones. It's OK, though, I think I have enough with the rule-based stuff to do 10 or more phonemes and that should suffice for some simple test sentences. Then I'll be able to get a better feel for whether the method makes sense or not, I hope. Also I plan to test it on multiple speakers as soon as possible.

I'm not really trying to develop the next best speech recognizer, just trying to get something passable to act as a working beginning - playing around, really. As for the text accuracy, I'd be happy if "ice cream" and "I scream" were output as "aeyskreem", or even "ahahaheeeeysskeeeeeem" for the moment. It's just a beginning. Thanks for your ideas! :)

Inventor · Posted: Sun Nov 18, 2007 7:00 am Post subject:

Here is a progress report to keep you up to date. I decided to use the Centroid feature extractor that is part of the ChucK language and also to look at the number of peaks. For each phoneme I formed a probability from two Gaussian functions: one of the ratio of the maximum peak to the centroid, and one of the number of peaks. Whichever phoneme has the highest probability gets selected. The results are as follows:
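A minimal Python sketch of this kind of Gaussian scoring. The per-phoneme means and widths are invented for illustration, and multiplying the two Gaussian terms is an assumption, since the post does not say how the two cues were combined.

```python
import math


def gaussian(x, mean, sigma):
    """Unnormalised Gaussian bump: 1.0 at the mean, falling off with distance."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2)


# Hypothetical per-phoneme parameters:
# (ratio mean, ratio sigma, peak-count mean, peak-count sigma)
PHONEME_MODELS = {
    "a": (0.8, 0.2, 4, 1.0),
    "e": (0.3, 0.1, 2, 1.0),
    "i": (0.5, 0.1, 3, 1.0),
}


def classify(peak_to_centroid_ratio, n_peaks):
    """Score every phoneme and return (best_name, all_scores)."""
    scores = {}
    for name, (r_mu, r_sd, n_mu, n_sd) in PHONEME_MODELS.items():
        # Assumed combination rule: product of the two cue scores.
        scores[name] = (gaussian(peak_to_centroid_ratio, r_mu, r_sd)
                        * gaussian(n_peaks, n_mu, n_sd))
    best = max(scores, key=scores.get)
    return best, scores


best, scores = classify(0.3, 2)
print(best, {k: round(v, 3) for k, v in scores.items()})
```

Because every frame gets a winner, a sound that glides between two vowels will print both in turn.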

When I say "a", I get "a i u".
When I say "e", I get "u e".
When I say "i", I get "i ".
When I say "o", I get "a o a o".
When I say "u", I get "e u".

That is for a good run, they are usually somewhat worse than that. So it looks like it is properly recognizing the phonemes pretty well, but it is also adding extra phonemes in with the desired ones. I am concerned that this may be unavoidable using feature extraction alone because for example when I listen to myself say "a", it really does have an "i" sound at the end of it. Similarly, the "u" phoneme kind of has an "e" sound in front of it, as if you were saying "eu".

This may not be such a bad thing, for example once I get a "y" sound as in "you" in place, then perhaps the program will produce "yeu" in response to "you", which would be just fine. I think I need to play around with some more features and more phonemes for a while and it will become at least passable. I'm not posting the code yet because it isn't really reliable yet, but anyway there's a progress report for "yeu". :)

Inventor · Posted: Sun Nov 25, 2007 7:03 am Post subject:

Here is a phoneme recognizer that works on the five vowel sounds: A, E, I, O, and U. After trying with the rule-based stuff for a while, I realized that it would be much more effective to use a neural net to do the pattern recognition. Fortunately there were some simplified articles on the subject and I was able to code up the backprop algorithm that had eluded me in the past.

The algorithm does an FFT and gets the centroid of that, then it normalizes the FFT so that the highest array element is one. It then finds all the peaks above a certain threshold value. At the input of the neural net the program applies the peak values divided by the centroid, the normalized FFT magnitude at those peaks, the centroid divided by 4 kHz, and the number of peaks divided by a maximum number of peaks.
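A rough Python sketch of how such an input vector might be assembled. Several details are assumptions: "peak values" is read as peak frequencies, the maximum of four peaks and the 0.3 threshold are made up, unused peak slots are zero-padded so the vector length stays fixed, and the centroid is computed directly rather than with ChucK's Centroid analyzer.

```python
MAX_PEAKS = 4      # assumed maximum number of peaks fed to the net
THRESHOLD = 0.3    # assumed threshold on the normalised magnitude


def find_peaks(norm, threshold):
    """Indices of local maxima of the normalised spectrum above threshold."""
    return [k for k in range(1, len(norm) - 1)
            if norm[k] > threshold
            and norm[k] > norm[k - 1]
            and norm[k] >= norm[k + 1]]


def features(magnitudes, bin_hz):
    """Build a fixed-length input vector from one FFT magnitude frame."""
    top = max(magnitudes)
    if top == 0:
        return [0.0] * (2 * MAX_PEAKS + 2)       # silent frame
    norm = [m / top for m in magnitudes]          # highest element becomes 1
    centroid_hz = (sum(k * bin_hz * m for k, m in enumerate(magnitudes))
                   / sum(magnitudes))
    peaks = find_peaks(norm, THRESHOLD)[:MAX_PEAKS]
    vec = []
    for slot in range(MAX_PEAKS):
        if slot < len(peaks):
            k = peaks[slot]
            vec += [k * bin_hz / centroid_hz, norm[k]]   # frequency ratio, height
        else:
            vec += [0.0, 0.0]                             # zero-pad unused slots
    vec.append(centroid_hz / 4000.0)                      # centroid scaled by 4 kHz
    vec.append(len(peaks) / MAX_PEAKS)                    # scaled peak count
    return vec


# Toy frame with 100 Hz bins and peaks at 100 Hz and 400 Hz.
print(features([0, 2, 0, 0, 4, 0, 0, 0], bin_hz=100))
```

Dividing by the centroid, by 4 kHz, and by the maximum peak count keeps the inputs in a similar small range, which suits sigmoid units.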

The neural net is a feedforward type with two hidden layers of width equal to twice the number of inputs and five output nodes (one for each phoneme). Each neuron has a sigmoid activation function to simplify the backprop algorithm. The program just looks at the outputs and picks the largest one to indicate which phoneme is being recognized. Only FFT samples with maximum peak greater than a noise threshold are displayed, so the program just sits there and does nothing when nobody is talking.
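The network shape described above can be sketched in Python as a forward pass only. The weights here are random and untrained, so the chosen phoneme is meaningless until a backprop training step (omitted) has run. The input width of 10, the noise threshold value, and all names are illustrative assumptions rather than details of speech2text5.ck.

```python
import math
import random

PHONEMES = ["A", "E", "I", "O", "U"]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_layer(n_in, n_out, rng):
    """n_out neurons, each with n_in weights followed by one bias."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in + 1)]
            for _ in range(n_out)]


def make_net(n_inputs, rng):
    """Two hidden layers of width 2 * n_inputs, then five output nodes."""
    hidden = 2 * n_inputs
    return [make_layer(n_inputs, hidden, rng),
            make_layer(hidden, hidden, rng),
            make_layer(hidden, len(PHONEMES), rng)]


def forward_layer(layer, inputs):
    # zip stops after the weights; neuron[-1] is the bias term.
    return [sigmoid(sum(w * x for w, x in zip(neuron, inputs)) + neuron[-1])
            for neuron in layer]


def recognize(net, feats, max_peak, noise_threshold=0.05):
    """Return (phoneme, outputs), or (None, []) when the frame is too quiet."""
    if max_peak <= noise_threshold:
        return None, []
    act = feats
    for layer in net:
        act = forward_layer(layer, act)
    return PHONEMES[act.index(max(act))], act


rng = random.Random(0)                  # fixed seed so runs are repeatable
net = make_net(10, rng)
label, outputs = recognize(net, [0.5] * 10, max_peak=0.8)
print(label, [round(o, 3) for o in outputs])
```

Picking the largest output keeps the decision step trivial, and the noise gate is what lets the program stay silent when nobody is talking.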

If you look at the source code, please pardon the crude nature of my programming - I should have used OOP with classes and all that, but I just threw it all together with global variables in order to get something working quickly. Maybe I will re-write it later.

Interestingly, I found that one hidden layer was not enough to do the job but two hidden layers worked out well. As to recognition accuracy, it detects my voice correctly most of the time, although there is a tendency for it to detect extra sounds that are really there. For example when I say "A", I actually pronounce "AE" or "AI" sometimes, and the program will catch that on occasion.

I am hoping that a couple of people might run the program to see if it works on other people's voices. When you run it, the first thing it does is ask you to pronounce the five vowel sounds one at a time, then it gets some additional repeat samples from you to form the training set. Next it trains the network for a few minutes or so until it reaches an error threshold that is purposely large (1%) to prevent overtraining and to speed things up a bit. Finally it invites you to say any of the five phonemes, printing the five output neuron values plus a text character representing your phoneme.

Most of the control parameters are located at the beginning section of the file so they can be easily adjusted if you choose to mess around with them, and a little bit further down in the file are the neural network size parameters.

All in all I had a good time re-learning neural nets and coding up this phoneme recognizer, and I'm looking forward to incorporating the full set of 40 or so phonemes into the program.

Attachment:

speech2text5.ck (15.72 KB): 5 phoneme speech recognizer using neural net pattern recognition.

Inventor · Posted: Wed Dec 05, 2007 12:45 pm Post subject:

Just an update here: I entered all 44 phonemes and could never get it to train very well. I tried different net sizes and longer training intervals, but it was training for like half a day and still not working. Unfortunately at the moment, or rather fortunately actually, I have discovered a new sine wave oscillator opamp circuit. This will occupy most of my time for a little while. I'm going to try to publish it in the IEEE because to my knowledge it is the first of its kind, so wish me luck! I'll get back to ChucKing when things settle down a bit with the oscillator.

blue hell · Posted: Wed Dec 05, 2007 12:57 pm Post subject:

Inventor · Posted: Wed Dec 05, 2007 5:34 pm Post subject: