electro-music.com   Dedicated to experimental electro-acoustic
and electronic music
 
Speech Recognition with ChucK
Moderators: Kassen
Page 1 of 2 [31 Posts]
Inventor
Stream Operator


Joined: Oct 13, 2007
Posts: 6221
Location: near Austin, Tx, USA
Audio files: 267

Posted: Wed Nov 07, 2007 1:23 pm    Post subject: Speech Recognition with ChucK
Subject description: Baby steps toward a phoneme-based speech recognizer

I have been interested for some time in creating a simple, crude yet effective speech recognizer or more exactly a phoneme recognizer to help people who have difficulty hearing. My concept is to look at the FFT spectrum of the speech and have pattern recognizers fire off when they think they have heard a phoneme. The output would be something like this: "m-ah-r-ee h-d a l-eh-t-l l-a-m", for "Mary had a little lamb", where the computer does part of the work and the human does the rest of the work. That is, the computer converts the phonemes into text output and the human learns to interpret that crude text output.

The development path is to prototype the system using a software tool, and ChucK is just perfect for that, and then to translate the resulting algorithm to a little PIC chip so that simple portable devices could be constructed to aid the hearing impaired. Until I found and learned some of ChucK, I did not have the skill set to prototype this effectively, but I do now. Take a look at this little video that I made of me saying the phoneme "ee" several times. It is only ten seconds long at 30 FPS, but you can download it and set your player to loop if you wish to view it for a while.

http://www.freedomodds.com/music/movies/FFTScrollee.MPG

Notice that the sound has the same general characteristic shape each time. It seems to me that it will be quite reasonable to write a set of mathematical rules that look for peaks and valleys and match a pattern to detect the phoneme. I know that neural nets would be perfect for this, but I never really got the hang of the backprop algorithm, so instead I am planning to hand-code rules like: "if there are two or three peaks and the second peak is higher than the first and the second peak is approximately twice the frequency of the first, then print 'ee'".
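A minimal sketch of that kind of hand-coded rule, written in Python rather than ChucK so it can be run anywhere. The peak threshold, the ratio tolerance, and the function names are assumptions for illustration and are not taken from the attached files:

```python
def find_peaks(mags, threshold=0.3):
    """Return bin indices of local maxima above `threshold`.

    The spectrum is first normalized so its largest bin is 1.0.
    """
    top = max(mags)
    if top <= 0:
        return []
    norm = [m / top for m in mags]
    return [i for i in range(1, len(norm) - 1)
            if norm[i] > threshold
            and norm[i] > norm[i - 1]
            and norm[i] >= norm[i + 1]]


def looks_like_ee(mags, ratio_tolerance=0.15):
    """Apply the rule: 2 or 3 peaks, second peak higher, second at ~2x the first."""
    peaks = find_peaks(mags)
    if len(peaks) not in (2, 3):
        return False
    first, second = peaks[0], peaks[1]
    second_is_higher = mags[second] > mags[first]
    ratio = second / first  # bin ratio equals frequency ratio
    return second_is_higher and abs(ratio - 2.0) <= 2.0 * ratio_tolerance


# Toy spectrum: a smaller peak at bin 5 and a larger one at bin 10.
spectrum = [0.0] * 32
spectrum[5] = 0.6
spectrum[10] = 1.0
print(looks_like_ee(spectrum))
```

Because the rule compares bin indices only as a ratio, it does not depend on the absolute pitch of the speaker.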

At any rate, it is worth a shot to do something like this because of all the people it could help in so many ways. Products would include a hand-held device with a display, a super miniaturized one that fits on a pair of glasses, or a TV subtitle generator. Based on a PIC chip, the products could sell for $20 or less and still have a healthy profit margin.

Anyway, that's the concept and I'll be working on it. Oh, also here is a download of the ChucK files that create the scrolling FFT from microphone input.


rec_in.ck
 Description:
Just a simple microphone to dac listener so you can record audio and/or make an FFT graph with ChucK.

Download
 Filename:  rec_in.ck
 Filesize:  383 Bytes
 Downloaded:  456 Time(s)


FFTScroll.ck
 Description:
Listens to the dac and makes a POV-Ray file that renders into a scrolling FFT.

Download
 Filename:  FFTScroll.ck
 Filesize:  2.84 KB
 Downloaded:  436 Time(s)

blue hell
Site Admin


Joined: Apr 03, 2004
Posts: 24083
Location: The Netherlands, Enschede
Audio files: 278
G2 patch files: 320

Posted: Wed Nov 07, 2007 2:32 pm    Post subject: Re: Speech Recognition with ChucK

Inventor wrote:
Based on a PIC chip


You're thinking dsPIC?

_________________
Jan
also .. could someone please turn down the thermostat a bit.
Inventor
Stream Operator


Joined: Oct 13, 2007
Posts: 6221
Location: near Austin, Tx, USA
Audio files: 267

Posted: Wed Nov 07, 2007 10:24 pm    Post subject: Re: Speech Recognition with ChucK

Blue Hell wrote:
Inventor wrote:
Based on a PIC chip


You're thinking dsPIC?


Yes, although I believe the larger of the regular PICs can do an FFT, if I recall correctly. I know a company that got PICs without packages because they needed the miniaturization, which would work for the glasses-clip version, and surface mount or thru-hole would be fine for the larger products.

But it all depends on whether or not a reliable set of rules could be made that works for different voices. The rules should not be frequency-specific, i.e. only refer to frequencies as ratios. Accents are no problem as the brain is equipped to deal with those as long as the phonemes are correct. I don't know, we will see where this latest effort goes, and then take the next step. I have been kind of busy working on a game today though.
blue hell
Site Admin


Joined: Apr 03, 2004
Posts: 24083
Location: The Netherlands, Enschede
Audio files: 278
G2 patch files: 320

Posted: Thu Nov 08, 2007 11:09 am    Post subject: Re: Speech Recognition with ChucK

Inventor wrote:
Yes, although I believe the larger of the regular PICs can do an FFT, if I recall correctly.

Of course they can; the only question is how fast ;) But they do have a multiplier, so if you can squeeze it into 8 bits for multiplication and 16 bits for addition, that might be fast enough. And that's something that would be testable in ChucK as well.
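One way to try the 8-bit multiply / 16-bit add idea before touching hardware is to simulate the arithmetic. The following is a rough Python sketch (not ChucK, and not PIC code); the Q7 scaling and the saturating accumulator are assumptions chosen for illustration:

```python
def to_q7(x):
    """Quantize a float in [-1, 1) to a signed 8-bit (Q7) integer."""
    return max(-128, min(127, int(round(x * 128))))


def mac_q7(samples, coeffs):
    """Multiply-accumulate with 8-bit operands and a 16-bit accumulator.

    Each 8x8 signed product fits in 16 bits; the running sum is
    saturated to the signed 16-bit range instead of wrapping around.
    """
    acc = 0
    for s, c in zip(samples, coeffs):
        acc += s * c
        acc = max(-32768, min(32767, acc))
    return acc


samples = [to_q7(v) for v in (0.5, 0.5)]
coeffs = [to_q7(v) for v in (0.5, 0.5)]
print(mac_q7(samples, coeffs))
```

Comparing results like these against a floating-point reference would show how much precision the narrow word sizes cost.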

Quote:
But it all depends on whether or not a reliable set of rules could be made that works for different voices.


Yes, I think it's an interesting idea though !

_________________
Jan
also .. could someone please turn down the thermostat a bit.
Kassen
Janitor
Janitor


Joined: Jul 06, 2004
Posts: 7678
Location: The Hague, NL
G2 patch files: 3

Posted: Thu Nov 08, 2007 12:46 pm

Yeah, I like it a lot too

For in-depth questions you may want to cross-post to the list because Perry Cook is on the list. Perry is Ge's teacher/adviser/$academic_term and co-author of ChucK, BUT (and this is where it gets interesting) he also did research/experiments into vocal synthesis.

I know this is the other way around but if you're going to re-express the characteristics of phonemes in ChucK code he's likely your best bet for questions and he's on the list but not the forum (as far as I know).

_________________
Kassen
Inventor
Stream Operator


Joined: Oct 13, 2007
Posts: 6221
Location: near Austin, Tx, USA
Audio files: 267

Posted: Fri Nov 09, 2007 4:01 am

Kassen wrote:
Yeah, I like it a lot too

For in-depth questions you may want to cross-post to the list because Perry Cook is on the list. Perry is Ge's teacher/adviser/$academic_term and co-author of ChucK, BUT (and this is where it gets interesting) he also did research/experiments into vocal synthesis.

I know this is the other way around but if you're going to re-express the characteristics of phonemes in ChucK code he's likely your best bet for questions and he's on the list but not the forum (as far as I know).


Cool, I'm glad you and Blue Hell like the idea - that gives me more confidence to proceed with it. I posted to the chuck-users list and mentioned Perry Cook's name in the post. It should be interesting to hear the responses. Now I've got to get to work on the peak and valley detectors and the pattern recognizers. I would have something by now, but I had a vivid dream about a game and I've coded up four spy missions for the game and written a web page for it in the last couple of days. Oh well, all in good time...
Kassen
Janitor
Janitor


Joined: Jul 06, 2004
Posts: 7678
Location: The Hague, NL
G2 patch files: 3

Posted: Fri Nov 09, 2007 8:04 am

Coolness.

What kind of game is this?

_________________
Kassen
Inventor
Stream Operator


Joined: Oct 13, 2007
Posts: 6221
Location: near Austin, Tx, USA
Audio files: 267

Posted: Fri Nov 09, 2007 9:07 am

Kassen wrote:
Coolness.

What kind of game is this?


Hehe, a fun game! It is actually on-topic for ChucK too since Mission 3 is a simple ChucK mission. I'll spare you the details but in my dream, my sister and I broke into a government computer using a chip with an LED soldered to it. When I woke up, I realized that one could make a game by presenting the player with various 007 spy-like challenges which the player must write software and/or build hardware to solve.

So far the missions are all software but later I will create hardware missions as well (Enter the PIC chips once again). Three of the missions are Perl missions in which the user installs Perl and Perl modules, decodes a password, breaks a Rot13 guardian, and uses MD5 encryption to obtain passwords for a password file.

In Mission 3, the ChucK mission, the user downloads and installs ChucK, then listens to a voice recording I made using ChucK which tells them how to write the simplest of ChucK programs: a microphone to dac listener. Then the player loads that up with a slightly modified rec-auto.ck to form an imperfect but mostly functional voice recorder (there are some gaps in the recordings).

If you like you can read about the game on the web page that I wrote for it, which is here:

http://www.freedomodds.com/hstechspy/

Oh, the game is called HS Tech Spy where the HS stands for Hardware / Software. It is open source freeware with profit (if any) derived from selling the hardware and software solutions, though all of the hardware will be do-it-yourself stuff so that people on a budget can play with little or in some cases no cost (if they already have the parts).

Can you think of other ChucK missions that I could write? I'd like to do more with ChucK in this game if possible. Perhaps ChucK could encrypt a voice file as FFT data and play it back in a garbled but hearable form. Or maybe I'll do the sound effects for a safe-tumbler access using ChucK. Hmmm...
Kassen
Janitor
Janitor


Joined: Jul 06, 2004
Posts: 7678
Location: The Hague, NL
G2 patch files: 3

Posted: Fri Nov 09, 2007 9:42 am

Hmmmm, I think ChucK could even work as a real voice over IP thingy using OSC with encryption based on a stream-cypher set loose on FFT blobs?

Let me think about this; it would be cool to use an assignment that involves actual decoding. It would also be fun to implement actual strong crypto in ChucK.

_________________
Kassen
Inventor
Stream Operator


Joined: Oct 13, 2007
Posts: 6221
Location: near Austin, Tx, USA
Audio files: 267

Posted: Fri Nov 09, 2007 10:50 am

Here's an update on the speech recognizer: After a few hours I have gotten it to the point that it can almost recognize an "a" as in "hay" and an "ee" as in "teeth". However, sometimes it recognizes "a" as "eeaee" or some mix like that, and sometimes it does not recognize "a" at all. It does, however, often recognize "ee". This is a promising result for just a few hours of coding, but it looks like there is a long way to go. Well, I did say baby-steps, didn't I?
dewdrop_world



Joined: Aug 28, 2006
Posts: 858
Location: Guangzhou, China
Audio files: 4

Posted: Sat Nov 10, 2007 4:43 pm

Really interesting idea... I've thought about it for SuperCollider but never had time to look in depth.

There must be a huge amount of comp-sci research on phonetic recognition. Rather than reinvent the wheel, you might make progress a lot faster by looking up some algorithms that are known to work.

http://en.wikipedia.org/wiki/Speech_recognition

This one at first glance (found thru google) looks interesting:

http://www.owlnet.rice.edu/~elec431/projects97/Dynamic/main.html

James

_________________
ddw online: http://www.dewdrop-world.net
sc3 online: http://supercollider.sourceforge.net
Inventor
Stream Operator


Joined: Oct 13, 2007
Posts: 6221
Location: near Austin, Tx, USA
Audio files: 267

Posted: Sun Nov 11, 2007 10:41 am

dewdrop_world wrote:
There must be a huge amount of comp-sci research on phonetic recognition. Rather than reinvent the wheel, you might make progress a lot faster by looking up some algorithms that are known to work.


Thanks for the pointer, I do have a tendency to reinvent the wheel rather than look it up - sometimes that's good and sometimes not. In this case I've learned that something called "Hidden Markov Models" is just the ticket for doing the FFT pattern recognition. Unfortunately I cannot understand the math behind the web examples I am finding. I gather that HMMs are a statistical model based on the Bayesian concept that what has happened in the past is likely to happen in the future, but that's as far as I have gotten so far.

Does anyone have a good reference to "Hidden Markov Models for Dummies" or similar?
Kassen
Janitor
Janitor


Joined: Jul 06, 2004
Posts: 7678
Location: The Hague, NL
G2 patch files: 3

Posted: Tue Nov 13, 2007 1:39 am

Oh. Dear.

A simple Google search told me this is way over my head as well, as of yet. Some rather informal thinking over morning coffee is telling me that plain FFT alone might not cut it. Assuming we are talking English, I'd say the difference between "sick" and "tick" would be hard to tell by pure spectrum analysis and would also depend on transient detection. If we don't use "normal" transient detection there, we will end up with very short FFT frames, which might not be that bad from a spectral resolution point of view (at least not for speech), but it will result in a LOT of data, probably more than is useful for sorting the voiced sounds.

I'd say we might be best off first trying to build an algorithm to deal with music recognition; sorting a picked banjo E from a strummed G would be way easier than sorting "tick" from "sick", but the first would at least bring us closer to the second.

I do have a friend who's a mathematician who tries to model some types of economical behaviour, he may know about HMM, I could ask him about where to start but I fear there really won't be a "for dummies" version.

_________________
Kassen
Inventor
Stream Operator


Joined: Oct 13, 2007
Posts: 6221
Location: near Austin, Tx, USA
Audio files: 267

Posted: Tue Nov 13, 2007 4:00 am

Sigh, I think you're right, Kassen: there is no easy explanation that I can find on the web for HMMs, but the good news is that in the process I learned more about the problem. One common way to solve the problem (and it has been solved over and over, as dewdrop_world suggests) is feature extraction followed by pattern recognition followed by hidden Markov models. Also, I have learned that the HMMs are used for detecting words from phonemes, not for detecting phonemes. In this example we are offloading the HMM task to the human brain, so - whew - no need to learn about HMMs.

I have a good beginning of feature extraction by peak detection, but I have learned that I must modify my peak detection to recognize adjacent peaks (just a detail). Then the pattern recognition, though best done with a neural net, could be done with rules. I also learned that there are only about 40 phonemes to recognize, and we could probably get away with a subset of those, say 30 or so, because we only want to approximate the speech. Plus, one tidbit I picked up was that 310 ms is the ideal time duration for each FFT sample, so if I set up the FFT to capture roughly 1/3 of a second worth of samples (about 16k samples at a 44,100 Hz sample rate; 8k would cover less than 1/5 of a second), then the snapshot of the sound would be the right size.
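The frame-length arithmetic is easy to check. This is a small Python sketch (not ChucK); the helper names are made up for illustration, and the only inputs are the 310 ms figure and the 44,100 Hz sample rate mentioned above:

```python
SAMPLE_RATE = 44100  # Hz


def samples_for(duration_s, sample_rate=SAMPLE_RATE):
    """Number of samples that span `duration_s` seconds."""
    return round(duration_s * sample_rate)


def duration_ms(fft_size, sample_rate=SAMPLE_RATE):
    """Length in milliseconds of one FFT frame of `fft_size` samples."""
    return 1000.0 * fft_size / sample_rate


target = samples_for(0.310)
print("310 ms =", target, "samples")
for fft_size in (8192, 16384):
    print(fft_size, "samples =", round(duration_ms(fft_size)), "ms")
```

At this sample rate an 8192-point frame lasts about 186 ms and a 16384-point frame about 372 ms, while 310 ms corresponds to 13,671 samples, which is not a power of two.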

At this point I'm not overly concerned about the difference between similar words like "sick" and "tick" because the brain should usually be able to determine by context what the meaning is, or so I believe. Sorry I have not done any more coding since I posted last, but over the weekend I suddenly got a lot of traffic on my sports site, which is now averaging over 150 hits a day, so I've been updating the site and I even added a radio show. At least I got a little research done and I feel pretty good about the phoneme recognition project's possibilities. Oh well, they say if you're not moving forward, at least stay pointed in the right direction, haha!
Inventor
Stream Operator


Joined: Oct 13, 2007
Posts: 6221
Location: near Austin, Tx, USA
Audio files: 267

Posted: Tue Nov 13, 2007 6:33 am

OK, this morning I completed movies for the five vowel sounds: a, e, i, o, and u. I am pleased to report that although the a and e are similar, the others are all quite different. I have a peaks feature extraction that tells me where the normalized harmonics above a threshold are, so I can count the peaks and look for harmonics of the first peak as two of the pattern recognition cues. I can also look for which peak is the maximum peak as a cue.
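The three cues (peak count, harmonics of the first peak, and which peak is largest) could be collected along these lines. This is a Python sketch rather than the ChucK code; the tolerance value and the dictionary keys are assumptions for illustration:

```python
def peak_cues(peak_bins, peak_mags, tolerance=0.1):
    """Summarize a list of detected peaks as pattern-recognition cues.

    peak_bins: FFT bin index of each peak, in ascending order.
    peak_mags: normalized magnitude of each peak, same order.
    """
    if not peak_bins:
        return {"count": 0, "harmonics_of_first": 0, "loudest_peak": None}
    first = peak_bins[0]
    harmonics = 0
    for b in peak_bins[1:]:
        ratio = b / first
        nearest = round(ratio)
        # Count the peak if it sits close to an integer multiple (2x, 3x, ...).
        if nearest >= 2 and abs(ratio - nearest) <= tolerance:
            harmonics += 1
    loudest = peak_mags.index(max(peak_mags))
    return {"count": len(peak_bins),
            "harmonics_of_first": harmonics,
            "loudest_peak": loudest}


print(peak_cues([5, 10, 16, 20], [0.6, 1.0, 0.4, 0.5]))
```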

Now I think I need a centroid-like thingie that tells me where the average value of the spectral content is, then I can compare that as a ratio to the maximum peak to act as a pattern recognition cue. I wonder how to do that. For one thing I remember seeing in an example that I can get a centroid out of the FFT somehow but I don't know what that centroid means.

Thinking out loud, I think what I need to do is something like in sophomore statics mechanical engineering class, pretend that each frequency bin is a force acting on a lever hinged at f=0 and find the location of the normalized unit force that would equate to all those. So for example loop from f=1 to fmax and sum up the fft(f)*f/fmax components. Then maybe that sum is the unit force location. Not sure. How would you make the calculation?
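The lever picture matches the usual definition of a spectral centroid: the sum of fft(f)*f is the total moment, and dividing it by the sum of fft(f) (the total force) gives the balance point. A short Python sketch of that calculation (not ChucK; the function names are illustrative):

```python
def spectral_centroid_bin(mags):
    """Magnitude-weighted mean bin index of a spectrum."""
    total = sum(mags)
    if total == 0:
        return 0.0  # silent frame: no meaningful centroid
    moment = sum(i * m for i, m in enumerate(mags))
    return moment / total


def spectral_centroid_normalized(mags):
    """Same centroid, scaled so the top bin (fmax) maps to 1.0."""
    if len(mags) < 2:
        return 0.0
    return spectral_centroid_bin(mags) / (len(mags) - 1)


print(spectral_centroid_bin([0.0, 1.0, 0.0, 1.0]))
```

Without the division by the total magnitude the result would grow with loudness, so a louder "ee" would appear to have a different centroid from a quiet one.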
Kassen
Janitor
Janitor


Joined: Jul 06, 2004
Posts: 7678
Location: The Hague, NL
G2 patch files: 3

Posted: Tue Nov 13, 2007 12:40 pm

This is exactly the sort of problem that makes me []ick!

*stick tongue out, moves hands next to ears*

Nah, a while back I linked here to an algorithm for tempo-dependent shuffle timing; that came from research by a department that deals with exactly this sort of issue and that had decided to first look at music and then move on to speech.

Going to look into your last question in more detail tomorrow.

_________________
Kassen
Inventor
Stream Operator


Joined: Oct 13, 2007
Posts: 6221
Location: near Austin, Tx, USA
Audio files: 267

Posted: Tue Nov 13, 2007 1:45 pm

Kassen wrote:
This is exactly the sort of problem that makes me []ick!

*stick tongue out, moves hands next to ears*

Nah, a while back I linked here to an algorithm for tempo-dependent shuffle timing; that came from research by a department that deals with exactly this sort of issue and that had decided to first look at music and then move on to speech.

Going to look into your last question in more detail tomorrow.


HaHa, what a great sense of humor you have Kassen! I agree with you that I should detect musical notes first, it's just that I think I have that one licked now. (um, get it, stick tongue out, got the problem licked - doh!) What I mean is that for an instrument that has a maximum spectral peak at its fundamental frequency, my peak detector finds it every time. I could take a moment out to write a frequency detector that does that... hmm perhaps I will right now, post it soon.
Inventor
Stream Operator


Joined: Oct 13, 2007
Posts: 6221
Location: near Austin, Tx, USA
Audio files: 267

Posted: Tue Nov 13, 2007 2:03 pm

OK, here it is, I just had to add a print statement to make it work. The following "notes" are from my voice when I say "electro" into the microphone:

7 8 6 9 12 8 5 5 5 5 5 9 10 10 11 12 11 10

The numbers are frequency bins, AKA frequency indexes from the FFT, so to really polish off the application I would have to convert that into a frequency and then use the frequency to MIDI function in the libraries to report an actual "note". But anyway, you might give it a try (or someone else) to see if it does the job of recognizing notes properly.
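The bin-to-note conversion described here is short. Below is a Python sketch of it (ChucK has its own frequency-to-MIDI helper, `Std.ftom`, which would take the place of `freq_to_midi`). The FFT size of 1024 in the example is an assumption, since the size used by speech2text2.ck is not stated in the post:

```python
import math


def bin_to_freq(bin_index, fft_size, sample_rate=44100.0):
    """Center frequency in Hz of an FFT bin."""
    return bin_index * sample_rate / fft_size


def freq_to_midi(freq_hz):
    """Fractional MIDI note number, with A4 = 440 Hz = note 69."""
    if freq_hz <= 0:
        raise ValueError("frequency must be positive")
    return 69.0 + 12.0 * math.log2(freq_hz / 440.0)


# Hypothetical FFT size; bin 10 stands in for one of the printed bin numbers.
freq = bin_to_freq(10, fft_size=1024)
print(freq, round(freq_to_midi(freq)))
```

Note that neighbouring bins at low indexes are many semitones apart, so a larger FFT (or interpolation between bins) would be needed for accurate note names in the bass range.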

Note that you must ChucK-up both rec_in.ck and speech2text2.ck to make it work right. Alternatively you could play a chuck file and also add speech2text2.ck to the list of programs and the speech2text2.ck file will just grab your notes off of the dac. Hope it works right.

Oh, and thanks for taking a thoughtful look at my ramblings in the morning, Kassen. Cheers!


rec_in.ck
 Description:
microphone listener, just sends the mic to the dac

Download
 Filename:  rec_in.ck
 Filesize:  294 Bytes
 Downloaded:  256 Time(s)


speech2text2.ck
 Description:
simplified speech2text ChucK program, prints out the frequency bin from the FFT of the maximum peak in the FFT.

Download
 Filename:  speech2text2.ck
 Filesize:  3.31 KB
 Downloaded:  254 Time(s)

deknow



Joined: Sep 15, 2004
Posts: 1307
Location: Leominster, MA (USA)
G2 patch files: 15

Posted: Wed Nov 14, 2007 8:03 am

...if doing this as a diy programming project is the point, then i can't be of much help.

...that said, you could easily use naturallyspeaking as an intermediary, and make it a "proof of concept"....after that, there are a number of sr products designed to run in a cell phone, in voicemail, or a car that could eventually be appropriated.

remember that 'tick' and 'sick' are close together, but not as hard to deal with as "excuse me while i kiss the sky" vs "excuse me while i kiss this guy" or "ice cream" vs "i scream".

with naturallyspeaking, you could start with a completely empty vocabulary, and only add the phonemes you want to recognize, and have the output be whatever you want.
Inventor
Stream Operator


Joined: Oct 13, 2007
Posts: 6221
Location: near Austin, Tx, USA
Audio files: 267

Posted: Sat Nov 17, 2007 8:25 pm

Yes, deknow, I guess using naturallyspeaking would be an excellent and simple way to prototype this concept if I had a copy of it. That way the mechanics underlying the process of the phoneme recognition would be already dealt with - and very well also, allowing us to concentrate on evaluating whether or not a stream of phonemes really does make for an intelligible speech output.

Unfortunately I don't have a copy of a speech recognition tool and I don't know of any freeware ones. It's OK, though, I think I have enough with the rule-based stuff to do 10 or more phonemes and that should suffice for some simple test sentences. Then I'll be able to get a better feel for whether the method makes sense or not, I hope. Also I plan to test it on multiple speakers as soon as possible.

I'm not really trying to develop the next best speech recognizer, just trying to get something passable to act as a working beginning - playing around, really. As for the text accuracy, I'd be happy if "ice cream" and "I scream" were output as "aeyskreem", or even "ahahaheeeeysskeeeeeem" for the moment. It's just a beginning. Thanks for your ideas! :)
Inventor
Stream Operator


Joined: Oct 13, 2007
Posts: 6221
Location: near Austin, Tx, USA
Audio files: 267

Posted: Sun Nov 18, 2007 7:00 am

Here is a progress report to keep you up to date. I decided to use the Centroid feature extractor that is part of the chuck language and also to look at the number of peaks. I formed a probability for each phoneme based on the ratio of the maximum peak to the centroid, put into a gaussian function and also based on the number of peaks put into a gaussian function. Whichever phoneme has the highest probability gets selected. The results are as follows:

When I say "a", I get "a i u".
When I say "e", I get "u e".
When I say "i", I get "i ".
When I say "o", I get "a o a o".
When I say "u", I get "e u".

That is for a good run, they are usually somewhat worse than that. So it looks like it is properly recognizing the phonemes pretty well, but it is also adding extra phonemes in with the desired ones. I am concerned that this may be unavoidable using feature extraction alone because for example when I listen to myself say "a", it really does have an "i" sound at the end of it. Similarly, the "u" phoneme kind of has an "e" sound in front of it, as if you were saying "eu".

This may not be such a bad thing, for example once I get a "y" sound as in "you" in place, then perhaps the program will produce "yeu" in response to "you", which would be just fine. I think I need to play around with some more features and more phonemes for a while and it will become at least passable. I'm not posting the code yet because it isn't really reliable yet, but anyway there's a progress report for "yeu". :)
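Since the code for this step was not posted, here is only a guess at the shape of the Gaussian scoring described in this progress report, sketched in Python. The template numbers are invented, and multiplying the two Gaussian scores together is an assumption about how they are combined:

```python
import math


def gaussian(x, mean, sigma):
    """Unnormalized Gaussian: 1.0 at the mean, falling off with distance."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2)


# Hypothetical templates: (mean, sigma) for the max-peak/centroid ratio
# and for the number of peaks. Real values would come from measurements.
TEMPLATES = {
    "a": {"ratio": (0.8, 0.2), "peaks": (4.0, 1.0)},
    "i": {"ratio": (0.3, 0.1), "peaks": (2.0, 1.0)},
}


def classify(ratio, n_peaks, templates=TEMPLATES):
    """Score every phoneme and return (best_phoneme, all_scores)."""
    scores = {}
    for phoneme, t in templates.items():
        scores[phoneme] = (gaussian(ratio, *t["ratio"])
                           * gaussian(n_peaks, *t["peaks"]))
    best = max(scores, key=scores.get)
    return best, scores


print(classify(0.75, 4)[0])
```

With only two features, phonemes whose templates overlap will trade places from frame to frame, which may be part of why extra phonemes show up in the output.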
Inventor
Stream Operator


Joined: Oct 13, 2007
Posts: 6221
Location: near Austin, Tx, USA
Audio files: 267

Posted: Sun Nov 25, 2007 7:03 am

Here is a phoneme recognizer that works on the five vowel sounds: A, E, I, O, and U. After trying with the rules based stuff for a while, I realized that it would be much more effective to use a neural net to do the pattern recognition. Fortunately there were some simplified articles on the subject and I was able to code up the backprop algorithm that had eluded me in the past.

The algorithm does an FFT and gets the centroid of that, then it normalizes the FFT so that the highest array element is one. It then finds all the peaks above a certain threshold value. At the input of the neural net the program applies the peak values divided by the centroid, the normalized FFT magnitude at those peaks, the centroid divided by 4 kHz, and the number of peaks divided by a maximum number of peaks.
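A sketch of how such an input vector might be assembled, in Python rather than ChucK. It assumes that "peak values" means peak frequencies and that unused peak slots are zero-padded to keep the vector a fixed length; neither detail is confirmed by the post, and speech2text5.ck may do it differently:

```python
def build_features(peak_freqs_hz, peak_mags, centroid_hz, max_peaks=8):
    """Build a fixed-length neural-net input vector.

    Layout: [peak_freq / centroid] * max_peaks,
            [normalized peak magnitude] * max_peaks,
            centroid / 4000 Hz,
            n_peaks / max_peaks.
    """
    n = min(len(peak_freqs_hz), max_peaks)
    ratios = [f / centroid_hz for f in peak_freqs_hz[:n]]
    mags = list(peak_mags[:n])
    pad = [0.0] * (max_peaks - n)  # assumed zero-padding for missing peaks
    return ratios + pad + mags + pad + [centroid_hz / 4000.0, n / max_peaks]


features = build_features([500.0, 1000.0], [1.0, 0.5], centroid_hz=1000.0,
                          max_peaks=3)
print(len(features), features)
```

Dividing the peak frequencies by the centroid keeps the inputs relative, in line with the earlier goal of referring to frequencies only as ratios.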

The neural net is a feedforward type with two hidden layers of width equal to twice the number of inputs and five output nodes (one for each phoneme). Each neuron has a sigmoid activation function to simplify the backprop algorithm. The program just looks at the outputs and picks the largest one to indicate which phoneme is being recognized. Only FFT samples with maximum peak greater than a noise threshold are displayed, so the program just sits there and does nothing when nobody is talking.
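For readers who want the network shape without opening the attachment, here is a forward-pass-only sketch in Python (not the ChucK source, and with no training code). The random weight range, the seed, and the bias handling are assumptions for illustration:

```python
import math
import random

PHONEMES = ["A", "E", "I", "O", "U"]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_layer(n_in, n_out, rng):
    """Each neuron is a list of n_in weights followed by one bias."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in + 1)]
            for _ in range(n_out)]


def make_network(n_inputs, rng):
    """Two hidden layers of width 2 * n_inputs, then five outputs."""
    hidden = 2 * n_inputs
    sizes = [n_inputs, hidden, hidden, len(PHONEMES)]
    return [make_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def forward(network, inputs):
    activations = list(inputs)
    for layer in network:
        # zip stops at the last weight; the bias (neuron[-1]) is added after.
        activations = [
            sigmoid(sum(w * a for w, a in zip(neuron, activations))
                    + neuron[-1])
            for neuron in layer
        ]
    return activations


def recognize(network, inputs):
    """Pick the phoneme whose output neuron is largest."""
    outputs = forward(network, inputs)
    return PHONEMES[outputs.index(max(outputs))], outputs


rng = random.Random(0)
net = make_network(4, rng)
label, outputs = recognize(net, [0.1, 0.2, 0.3, 0.4])
print(label, [round(o, 3) for o in outputs])
```

With untrained random weights the chosen label is meaningless; the sketch only shows the layer sizes and the pick-the-largest-output step.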

If you look at the source code, please pardon the crude nature of my programming - I should have used OOP with classes and all that, but I just threw it all together with global variables in order to get something working quickly. Maybe I will re-write it later.

Interestingly, I found that one hidden layer was not enough to do the job but two hidden layers worked out well. As to recognition accuracy, it detects my voice correctly most of the time, although there is a tendency for it to detect extra sounds that are really there. For example when I say "A", I actually pronounce "AE" or "AI" sometimes, and the program will catch that on occasion.

I am hoping that a couple of people might run the program to see if it works on other people's voices. When you run it, the first thing it does is ask you to pronounce the five vowel sounds one at a time, then it gets some additional repeat samples from you to form the training set. Next it trains the network for a few minutes or so until it reaches an error threshold that is purposely large (1%) to prevent overtraining and to speed things up a bit. Finally it invites you to say any of the five phonemes, printing the five output neuron values plus a text character representing your phoneme.

Most of the control parameters are located at the beginning section of the file so they can be easily adjusted if you choose to mess around with them, and a little bit further down in the file are the neural network size parameters.

All in all I had a good time re-learning neural nets and coding up this phoneme recognizer, and I'm looking forward to incorporating the full set of 40 or so phonemes into the program.


speech2text5.ck
 Description:
5 phoneme speech recognizer using neural net pattern recognition.

Download
 Filename:  speech2text5.ck
 Filesize:  15.72 KB
 Downloaded:  234 Time(s)

Inventor
Stream Operator


Joined: Oct 13, 2007
Posts: 6221
Location: near Austin, Tx, USA
Audio files: 267

Posted: Wed Dec 05, 2007 12:45 pm

Just an update here: I entered all 44 phonemes and could never get it to train very well. I tried different net sizes and longer training intervals, but it was training for like half a day and still not working. Unfortunately at the moment, or rather fortunately actually, I have discovered a new sine wave oscillator opamp circuit. This will occupy most of my time for a little while. I'm going to try to publish it in the IEEE because to my knowledge it is the first of its kind, so wish me luck! I'll get back to ChucKing when things settle down a bit with the oscillator.
blue hell
Site Admin


Joined: Apr 03, 2004
Posts: 24083
Location: The Netherlands, Enschede
Audio files: 278
G2 patch files: 320

Posted: Wed Dec 05, 2007 12:57 pm

Inventor wrote:
so wish me luck!


I wish you luck! :D

Holy Mozes, a new sine wave generator :shock:

I was going to quickly find you a nice article about scalability of neural networks, but erm ... I got over 400.00 hits on that - seems to be a field for research :lol:

_________________
Jan
also .. could someone please turn down the thermostat a bit.
Inventor
Stream Operator


Joined: Oct 13, 2007
Posts: 6221
Location: near Austin, Tx, USA
Audio files: 267

Posted: Wed Dec 05, 2007 5:34 pm

Blue Hell wrote:

Holy Mozes, a new sine wave generator :shock:


Hehe, yes it is really simple, neat, compact, and elegant. Makes a nice pure sinusoid too. As near as I can tell, it's a new thingie that nobody else has thought up before. I know this because everyone uses junky kluges to make their sine waves be, well, sinusoidal. They wouldn't go through all that extra effort if they knew of this easier way to do it. I can't wait to tell you all about it, but first I'm going to write it up in a formal paper and perhaps submit it to the IEEE. I also sent Bob Pease, the analog guru at National Semiconductor, an email about it to see what he thinks. I am so joyful at having discovered such a little gem. Having said all that, I really hope it is actually original, because it would be a letdown if it wasn't. But I'm almost certain it is new. What fun!