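There is far less material available on audio than on image recognition, so I wrote this post (it is not especially rigorous) in the hope that it helps.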

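  We need to classify 10 kinds of sound: air conditioner, car horn, children playing, dog bark, drilling, engine idling, gun shot, jackhammer, siren, and street music.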

  Each recording is about 4 s long, and the recordings are stored in 10 fold folders.

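  We build the model with keras (roughly speaking, keras is the front end and tensorflow is the back end: tensorflow acts as the library, and we call it through the keras API) and process the audio with librosa (a Python package for audio and music analysis and processing).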

  • Import these libraries

    import keras
    from keras.layers import Activation, Dense, Dropout, Conv2D, Flatten, MaxPooling2D
    from keras.models import Sequential
    import librosa
    import librosa.display
    import numpy as np
    import pandas as pd
    import random
    
  • Read the csv file

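    data = pd.read_csv('metadata/UrbanSound8K.csv')
    # Keep only clips that are at least 3 s long; .copy() avoids a SettingWithCopyWarning below
    valid_data = data.loc[data['end'] - data['start'] >= 3, ['slice_file_name', 'fold', 'classID', 'class']].copy()
    # Relative path of each clip: fold<N>/<slice_file_name>
    valid_data['path'] = 'fold' + valid_data['fold'].astype('str') + '/' + valid_data['slice_file_name'].astype('str')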
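
The filtering step above can be tried on a small made-up table before running it on the real metadata. This is only a sketch: the three rows and the file names `a.wav`, `b.wav`, `c.wav` are invented for illustration, and only the column names are taken from the code above.

```python
import pandas as pd

# Toy stand-in for metadata/UrbanSound8K.csv; rows and file names are invented
toy = pd.DataFrame({
    'slice_file_name': ['a.wav', 'b.wav', 'c.wav'],
    'start': [0.0, 1.0, 2.5],
    'end': [4.0, 2.5, 5.5],
    'fold': [1, 2, 10],
    'classID': [3, 1, 8],
    'class': ['dog_bark', 'car_horn', 'siren'],
})

# Boolean mask: True for clips that last at least 3 s
long_enough = toy['end'] - toy['start'] >= 3

# Select the rows and the columns in one .loc call, then copy before adding a column
kept = toy.loc[long_enough, ['slice_file_name', 'fold', 'classID', 'class']].copy()
kept['path'] = 'fold' + kept['fold'].astype('str') + '/' + kept['slice_file_name']

paths = kept['path'].tolist()
print(paths)
```

The 1.5 s clip `b.wav` is dropped, and the two remaining rows get a path built from their fold number.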
    
  • Load the wav files

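    from tqdm import tqdm_notebook
    
    D = []  # list of (mel spectrogram, classID) pairs
    
    for row in tqdm_notebook(valid_data.itertuples(), total=len(valid_data)):
        print(row.path)
        print(row.classID)
        # Load only the first 2.97 s so the spectrogram has 128 frames at librosa's defaults
        y1, sr1 = librosa.load("audio/" + row.path, duration=2.97)
        ps = librosa.feature.melspectrogram(y=y1, sr=sr1)
        # Skip clips whose spectrogram is not exactly 128 x 128
        if ps.shape != (128, 128):
            continue
        D.append((ps, row.classID))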
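
The value `duration=2.97` looks arbitrary, but it is what makes the spectrogram come out 128 frames wide. The arithmetic can be sketched as follows, assuming librosa's defaults (`sr=22050` for `librosa.load`, `hop_length=512` and centered frames for `melspectrogram`); the helper `n_frames` is a hypothetical name used only for this illustration. The exact sample count after resampling may differ slightly per file, which is presumably why the loop above still checks the shape.

```python
SR = 22050        # default sampling rate of librosa.load
HOP_LENGTH = 512  # default hop length of librosa.feature.melspectrogram

def n_frames(duration_s, sr=SR, hop_length=HOP_LENGTH):
    """Number of spectrogram frames for a clip, assuming centered framing."""
    n_samples = int(duration_s * sr)
    return 1 + n_samples // hop_length

print(n_frames(2.97))  # frames for the truncated clip
print(n_frames(4.0))   # frames for a full 4 s clip
```

With the default 128 mel bands, a 2.97 s clip therefore gives a square 128 x 128 input, while a full 4 s clip would be wider.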
    
  • Split into training and test sets: the first 7000 samples form the training set, and everything after them forms the test set

    dataset = D
    random.shuffle(dataset)
    
    train = dataset[:7000]
    test = dataset[7000:]
    
    X_train, y_train = zip(*train)
    X_test, y_test = zip(*test)
    
    X_train = np.array([x.reshape( (128, 128, 1) ) for x in X_train])
    X_test = np.array([x.reshape( (128, 128, 1) ) for x in X_test])
    
    y_train = np.array(keras.utils.to_categorical(y_train, 10))
    y_test = np.array(keras.utils.to_categorical(y_test, 10))
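
To make the two transformations above concrete, here is a minimal sketch using only numpy. It assumes integer class labels in the range 0 to 9; `np.eye(10)[labels]` is used as a stand-in that, for such labels, produces the same one-hot rows as `keras.utils.to_categorical(labels, 10)`. The zero-filled spectrograms are placeholders, not real data.

```python
import numpy as np

labels = [3, 0, 9]  # example classIDs

# One-hot encoding: row i has a single 1 in column labels[i]
one_hot = np.eye(10, dtype='float32')[labels]

# Placeholder 128 x 128 spectrograms; a trailing channel axis is added for Conv2D
specs = [np.zeros((128, 128)) for _ in labels]
X = np.array([s.reshape((128, 128, 1)) for s in specs])

print(one_hot.shape)  # (number of samples, number of classes)
print(X.shape)        # (number of samples, height, width, channels)
```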
    
  • Build the model

    model = Sequential()
    input_shape=(128, 128, 1)
    
    model.add(Conv2D(24, (5, 5), strides=(1, 1), input_shape=input_shape))
    model.add(MaxPooling2D((4, 2), strides=(4, 2)))
    model.add(Activation('relu'))
    
    model.add(Conv2D(48, (5, 5), padding="valid"))
    model.add(MaxPooling2D((4, 2), strides=(4, 2)))
    model.add(Activation('relu'))
    
    model.add(Conv2D(48, (5, 5), padding="valid"))
    model.add(Activation('relu'))
    
    model.add(Flatten())
    model.add(Dropout(rate=0.5))
    
    model.add(Dense(64))
    model.add(Activation('relu'))
    model.add(Dropout(rate=0.5))
    
    model.add(Dense(10))
    model.add(Activation('softmax'))
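
It can help to trace how the 128 x 128 input shrinks through the layers above. The sketch below redoes that arithmetic in plain Python, assuming "valid" padding for every convolution (the Keras default) and the pooling windows and strides shown in the model; `conv_valid` and `pool` are hypothetical helper names for this illustration. `model.summary()` should report the same shapes.

```python
def conv_valid(size, kernel, stride=1):
    """Output length of a 'valid' convolution along one axis."""
    return (size - kernel) // stride + 1

def pool(size, window, stride):
    """Output length of max pooling along one axis."""
    return (size - window) // stride + 1

h, w = 128, 128
h, w = conv_valid(h, 5), conv_valid(w, 5)  # Conv2D(24, (5, 5))   -> 124 x 124
h, w = pool(h, 4, 4), pool(w, 2, 2)        # MaxPooling2D((4, 2)) -> 31 x 62
h, w = conv_valid(h, 5), conv_valid(w, 5)  # Conv2D(48, (5, 5))   -> 27 x 58
h, w = pool(h, 4, 4), pool(w, 2, 2)        # MaxPooling2D((4, 2)) -> 6 x 29
h, w = conv_valid(h, 5), conv_valid(w, 5)  # Conv2D(48, (5, 5))   -> 2 x 25

flat = h * w * 48  # Flatten() over the 48 channels of the last convolution
print((h, w), flat)
```

The flattened vector has 2400 values, so the first Dense(64) layer holds most of the model's weights.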
    
  • Feed in the data: compile, train, and evaluate

    model.compile(
        optimizer="Adam",
        loss="categorical_crossentropy",
        metrics=['accuracy'])
    
    model.fit(
        x=X_train,
        y=y_train,
        epochs=12,
        batch_size=128,
        validation_data=(X_test, y_test))
    
    score = model.evaluate(
        x=X_test,
        y=y_test)
    
    print('Test loss:', score[0])
    print('Test accuracy:', score[1])
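
For readers wondering what the reported accuracy means, here is a minimal numpy sketch of how it can be computed from softmax outputs and one-hot labels. The probabilities are made up and use 3 classes instead of 10 to stay short; in practice they would come from something like `model.predict(X_test)`.

```python
import numpy as np

# Made-up softmax outputs for 4 samples and 3 classes
probs = np.array([
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
    [0.5, 0.4, 0.1],
])
# Matching one-hot labels
y_true = np.array([
    [0, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 0],
])

pred = probs.argmax(axis=1)   # predicted class = most probable column
true = y_true.argmax(axis=1)  # true class = position of the 1
accuracy = float((pred == true).mean())
print(pred, true, accuracy)
```

Three of the four predictions match their labels, so the accuracy here is 0.75.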
    

Source: https://blog.csdn.net/c2c2c2aa/article/details/81543549