Unsupervised learning has been shown by state-of-the-art NLP models (e.g. BERT, GPT-2) to be an effective way to learn features for downstream tasks. Researchers have demonstrated that data-driven learned features can provide better audio representations than traditional acoustic features such as Mel-frequency cepstral coefficients (MFCC).
This story discusses how you can use unsupervised learning to learn audio features and apply them to downstream tasks.
Lee et al. propose using a convolutional deep belief network (CDBN, i.e. what we would call a deep learning representation nowadays) to replace traditional audio features such as the spectrogram and Mel-frequency cepstral coefficients (MFCC). The original input is a spectrogram of each utterance, computed with a 20ms window and 10ms overlap. Small, overlapping windows are a common setting when handling audio input. Because computing resources were limited at the time (it was 2009), they leverage principal component analysis (PCA) to reduce the dimensionality before feeding the data into the neural network.
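The preprocessing pipeline (framing, spectrogram, PCA) can be sketched with numpy alone. This is a minimal sketch, not the authors' code; the 16 kHz sample rate, Hann window, and 80 retained components are assumptions for illustration.

```python
import numpy as np

def frame_signal(signal, sr=16000, win_ms=20, hop_ms=10):
    """Slice a waveform into 20ms frames advanced by 10ms (10ms overlap)."""
    win = int(sr * win_ms / 1000)   # 320 samples at 16 kHz
    hop = int(sr * hop_ms / 1000)   # 160 samples
    n_frames = 1 + (len(signal) - win) // hop
    return np.stack([signal[i * hop : i * hop + win] for i in range(n_frames)])

def spectrogram(frames):
    """Magnitude spectrum of each Hann-windowed frame."""
    windowed = frames * np.hanning(frames.shape[1])
    return np.abs(np.fft.rfft(windowed, axis=1))

def pca_reduce(spec, n_components=80):
    """Project spectral frames onto the top principal components."""
    centered = spec - spec.mean(axis=0)
    # SVD of the centered data gives the principal directions in vt
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

sig = np.random.randn(16000)           # 1 second of toy audio at 16 kHz
spec = spectrogram(frame_signal(sig))  # (99 frames, 161 frequency bins)
reduced = pca_reduce(spec)             # (99 frames, 80 PCA components)
```

The reduced frames, not the raw spectrogram, are what gets fed into the network.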
The network consists of 2 convolutional neural network (CNN) layers with 300 dimensions, a filter length of 6 and a max-pooling ratio of 3.
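To make those hyperparameters concrete, here is a forward-pass shape sketch of two such layers in plain numpy. Note this only illustrates the convolution/pooling geometry; the actual CDBN is trained layer-wise as restricted Boltzmann machines with probabilistic max-pooling, and the ReLU nonlinearity here is an assumption for the sketch.

```python
import numpy as np

def conv1d(x, filters):
    """Valid 1-D convolution: x is (time, in_ch), filters is (out_ch, flen, in_ch)."""
    out_ch, flen, _ = filters.shape
    t_out = x.shape[0] - flen + 1
    out = np.empty((t_out, out_ch))
    for t in range(t_out):
        out[t] = np.tensordot(filters, x[t : t + flen], axes=([1, 2], [0, 1]))
    return np.maximum(out, 0)  # nonlinearity (an assumption in this sketch)

def max_pool(x, ratio=3):
    """Non-overlapping max pooling along time with the paper's ratio of 3."""
    t = (x.shape[0] // ratio) * ratio
    return x[:t].reshape(-1, ratio, x.shape[1]).max(axis=1)

x = np.random.randn(99, 80)              # PCA-reduced spectrogram frames
w1 = np.random.randn(300, 6, 80) * 0.01  # layer 1: 300 filters, length 6
h1 = max_pool(conv1d(x, w1))             # (31, 300)
w2 = np.random.randn(300, 6, 300) * 0.01 # layer 2: same filter settings
h2 = max_pool(conv1d(h1, w2))            # (8, 300)
```

Each layer shortens the time axis (valid convolution, then pooling by 3) while keeping 300 feature channels.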
Lee et al. evaluated the data-driven feature (CDBN) against traditional features, namely the spectrogram (i.e. RAW in the following figure) and MFCC, on speaker identification, speaker gender classification, phone classification, music genre classification and music artist classification.
Xu et al. use the spectrogram as raw input to learn a vector representation. An asymmetric de-noising auto-encoder (aDAE) is presented in the research paper. The network architecture includes an encoder (the first three layers) and a decoder (the last three layers). The spectrogram is extracted and fed into the encoder, while the training objective is to predict the middle frame from the previous and next frames.
This is similar to Continuous Bag-of-Words (CBOW) in NLP, where the surrounding words are used to predict the target word.
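The CBOW-style data arrangement can be sketched as follows: each training example pairs the frames around position t (the "context") with the frame at t (the "target"). The context size of 2 frames on each side is an assumption for illustration, not the paper's exact setting.

```python
import numpy as np

def make_context_pairs(spec, context=2):
    """Build (context, target) pairs: surrounding frames predict the middle
    frame, analogous to CBOW predicting a word from its neighbors."""
    X, y = [], []
    for t in range(context, len(spec) - context):
        around = np.concatenate(
            [spec[t - context : t], spec[t + 1 : t + 1 + context]]
        )
        X.append(around.ravel())  # flatten the 4 context frames into one vector
        y.append(spec[t])         # the middle frame is the regression target
    return np.array(X), np.array(y)

spec = np.random.randn(100, 40)   # toy spectrogram: 100 frames, 40 bins
X, y = make_context_pairs(spec)   # X: (96, 160), y: (96, 40)
```

The encoder maps X to a low-dimensional code and the decoder regresses y from it, which is what makes the auto-encoder "asymmetric": input and output are different frames.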
The following model comparisons demonstrate that aDAE achieves a better result in general.
Meyer et al. also use the spectrogram as raw input to learn a vector representation. The training objective is to predict the next frame from the previous frames, which is similar to a language model in NLP. The audio frame predictor (AFP) is presented in this paper.
The network architecture includes encoder and decoder parts. The spectrogram is extracted with a 2.56s sliding window and 0.64s overlap and fed into the encoder, which consists of multiple ConvLSTM layers. The ConvLSTM layers use 3×3 filter kernels with ReLU activation and batch normalization.
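A sketch of how the input/target windows for this next-frame objective could be cut from a spectrogram is below. The 100 frames-per-second rate is an assumption, and I interpret the 0.64s figure as the window stride; the real model consumes these windows through its ConvLSTM encoder, which is not reproduced here.

```python
import numpy as np

def afp_windows(spec, fps=100, win_s=2.56, hop_s=0.64):
    """Cut a spectrogram (assumed ~100 frames/s) into 2.56s windows advanced
    by 0.64s, pairing each window with a shifted copy so that previous
    frames serve as input and later frames as the prediction target."""
    win = int(win_s * fps)   # 256 frames per window
    hop = int(hop_s * fps)   # 64 frames between window starts
    inputs, targets = [], []
    for start in range(0, len(spec) - win - hop, hop):
        inputs.append(spec[start : start + win])          # past frames
        targets.append(spec[start + hop : start + hop + win])  # future frames
    return np.array(inputs), np.array(targets)

spec = np.random.randn(1000, 64)   # toy spectrogram: 1000 frames, 64 mel bins
X, y = afp_windows(spec)           # 11 input/target window pairs
```

Training then minimizes the error between the decoder's prediction and the target window, exactly the "predict what comes next" idea of a language model.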
Meyer et al. use a two-step procedure to train the data-driven representation. The network is first trained by minimizing the mean squared error (MSE) (i.e. encoder to decoder) for the first 6 epochs. From the sixth to the ninth epoch, a pairwise loss objective is added so that the representation is adjusted simultaneously.
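The two-phase schedule can be sketched as a loss function that switches on the pairwise term after epoch 6. The contrastive form of the pairwise term and its margin of 1.0 are assumptions for illustration; the paper defines its own pairwise objective.

```python
import numpy as np

def mse_loss(pred, target):
    """Reconstruction error between decoder output and target frames."""
    return np.mean((pred - target) ** 2)

def pairwise_loss(emb_a, emb_b, same):
    """Toy contrastive pairwise term: pull embeddings of matching clips
    together, push non-matching ones apart (margin 1.0 is an assumption)."""
    d = np.linalg.norm(emb_a - emb_b)
    return d ** 2 if same else max(0.0, 1.0 - d) ** 2

def total_loss(epoch, pred, target, emb_a, emb_b, same):
    """Phase 1 (epochs < 6): reconstruction only.
    Phase 2 (epochs 6-9): pairwise term added on top of MSE."""
    loss = mse_loss(pred, target)
    if epoch >= 6:
        loss += pairwise_loss(emb_a, emb_b, same)
    return loss
```

Warming up with reconstruction alone gives the encoder a sensible representation before the pairwise term starts shaping the embedding space.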
I am a Data Scientist in the Bay Area, focusing on the state of the art in data science and artificial intelligence, especially NLP and platform-related topics. Feel free to connect with me on LinkedIn or follow me on Medium or GitHub.