today I want to share how to deal with the Speech Recognition with Neural Networks.
So, the speech recognition task has following stages:
- Pre-processing: convert the sound wave into a vector of acoustic coefficients. Extract a new vector about every 10 milliseconds
- Acoustic model: Use a few adjacent vectors of acoustic coefficients to place bets on which par of which phoeneme is being spoken.
- Decoding: Find the sequence of bets that does the best job of fitting the acoustic data and also fitting a model of the kinds of thinks people say.