[iOS10 SpeechRecognition] Best practices for speech recognition that transcribes as the user speaks

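First, I want to stress what "speech recognition" literally asks for: the user speaks, and what they say is immediately turned into text on screen. That is the feature developers actually need.

Before building it I Googled and Baidu'd for a ready-made wheel to reuse, and the results were laughable. Post after post is titled as a deep dive into this framework, yet all they do is call the API once to recognize a local audio file fetched from a URL. And that's it? Even the most basic requirement cannot be met that way.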

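Today I'll write up two ways to implement this feature.

First, there are two recognition request APIs, SFSpeechAudioBufferRecognitionRequest and SFSpeechURLRecognitionRequest, and two ways to receive the results, block and delegate. I'll pair them up into two approaches so that all of this gets covered.

Before coding, you need to register the user privacy permissions in Info.plist. Everyone knows this already, but I'll mention it anyway for the sake of completeness.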

Privacy - Microphone Usage Description (raw key: NSMicrophoneUsageDescription)
Privacy - Speech Recognition Usage Description (raw key: NSSpeechRecognitionUsageDescription)

Then call requestAuthorization to ask for permission.

    [SFSpeechRecognizer requestAuthorization:^(SFSpeechRecognizerAuthorizationStatus status) {
        // Check the status enum here
    }];
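
As a minimal sketch of what "checking the status enum" could look like, the helper below collapses the four authorization states into a single BOOL and hands it back on the main queue, since the handler may be called off the main thread. The function name SXRequestSpeechAuthorization and its completion parameter are made up for this illustration and are not part of the original demo.

```objc
#import <Foundation/Foundation.h>
#import <Speech/Speech.h>

// Hypothetical helper (not from the original post): asks for speech
// recognition permission and reports YES only when it was granted.
static void SXRequestSpeechAuthorization(void (^completion)(BOOL authorized)) {
    [SFSpeechRecognizer requestAuthorization:^(SFSpeechRecognizerAuthorizationStatus status) {
        BOOL authorized = NO;
        switch (status) {
            case SFSpeechRecognizerAuthorizationStatusAuthorized:
                authorized = YES;
                break;
            case SFSpeechRecognizerAuthorizationStatusDenied:        // the user refused
            case SFSpeechRecognizerAuthorizationStatusRestricted:    // blocked on this device
            case SFSpeechRecognizerAuthorizationStatusNotDetermined: // not asked yet
                authorized = NO;
                break;
        }
        // The handler may run off the main thread, so hop back before touching UI.
        dispatch_async(dispatch_get_main_queue(), ^{
            if (completion) {
                completion(authorized);
            }
        });
    }];
}
```

A caller would typically enable or disable its record button inside the completion block.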

As for the microphone, the permission prompt also appears the first time recording starts.
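
If you would rather not wait for that implicit prompt, one option is to ask for microphone access up front. The sketch below assumes the iOS 10 era AVAudioSession API (newer SDKs may prefer a replacement) and also switches the shared session to a recording category, which the original code does not do. The helper name SXPrepareMicrophone is hypothetical.

```objc
#import <AVFoundation/AVFoundation.h>

// Hypothetical helper (not from the original post): requests microphone
// access and, if granted, activates the shared session for recording.
static void SXPrepareMicrophone(void (^completion)(BOOL granted)) {
    AVAudioSession *session = [AVAudioSession sharedInstance];
    [session requestRecordPermission:^(BOOL granted) {
        dispatch_async(dispatch_get_main_queue(), ^{
            if (granted) {
                NSError *error = nil;
                // Assumption: a record-only category is enough for this use case.
                if (![session setCategory:AVAudioSessionCategoryRecord error:&error] ||
                    ![session setActive:YES error:&error]) {
                    NSLog(@"Audio session setup failed: %@", error);
                }
            }
            if (completion) {
                completion(granted);
            }
        });
    }];
}
```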

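I. SFSpeechAudioBufferRecognitionRequest with a block

This approach breaks down into the following steps.

① Set up the audio engine

Add the following properties as members, so that starting, stopping, and releasing are easy to manage.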

@property(nonatomic,strong)SFSpeechRecognizer *bufferRec;
@property(nonatomic,strong)SFSpeechAudioBufferRecognitionRequest *bufferRequest;
@property(nonatomic,strong)SFSpeechRecognitionTask *bufferTask;
@property(nonatomic,strong)AVAudioEngine *bufferEngine;
@property(nonatomic,strong)AVAudioInputNode *buffeInputNode;

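I suggest doing the initialization inside the start method, which makes starting and stopping easier. If you plan to use global instances, you can also initialize them just once.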

    self.bufferRec = [[SFSpeechRecognizer alloc]initWithLocale:[NSLocale localeWithLocaleIdentifier:@"zh_CN"]];
    self.bufferEngine = [[AVAudioEngine alloc]init];
    self.buffeInputNode = [self.bufferEngine inputNode];

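② Create the speech recognition request

    self.bufferRequest = [[SFSpeechAudioBufferRecognitionRequest alloc]init];
    self.bufferRequest.shouldReportPartialResults = YES;

The shouldReportPartialResults property is a switch you set yourself. When it is off, the callback fires once after you have finished the whole sentence; when it is on, every partial fragment of speech triggers a callback.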

③ Create the task and run it

    // The code outside the blocks is also setup work: initial parameters and so on
    self.bufferRequest = [[SFSpeechAudioBufferRecognitionRequest alloc]init];
    self.bufferRequest.shouldReportPartialResults = YES;
    __weak ViewController *weakSelf = self;
    self.bufferTask = [self.bufferRec recognitionTaskWithRequest:self.bufferRequest resultHandler:^(SFSpeechRecognitionResult * _Nullable result, NSError * _Nullable error) {
            // Callback invoked when a result arrives
    }];
    
    // Install a tap on bus 0 and append each captured buffer to the request
    AVAudioFormat *format =[self.buffeInputNode outputFormatForBus:0];
    [self.buffeInputNode installTapOnBus:0 bufferSize:1024 format:format block:^(AVAudioPCMBuffer * _Nonnull buffer, AVAudioTime * _Nonnull when) {
        [weakSelf.bufferRequest appendAudioPCMBuffer:buffer];
    }];
    
    // Prepare and start the engine
    [self.bufferEngine prepare];
    NSError *error = nil;
    if (![self.bufferEngine startAndReturnError:&error]) {
        NSLog(@"%@",error.userInfo);
    }
    self.showBufferText.text = @"Waiting for a command.....";

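Anyone who knows a little about the run loop will recognize the order of execution here: the code outside the blocks runs first, and the blocks are invoked later as callbacks. The normal flow is to initialize the parameters and start the engine, after which the tap block that appends buffers is called over and over (possibly on a thread other than the main thread). Once enough audio has accumulated for the recognizer to produce a result, the resultHandler above is called. The tap block sometimes fires even when there is no sound, but the resultHandler does not. Presumably there is some tolerance built in (audio whose power is below a certain level seems to be ignored).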

④ Handle the result callback

The result callback is the block passed as resultHandler above. It receives result and error, and you can act on them as needed.

        if (result != nil) {
            // Use weakSelf: self owns the task, and the task keeps this block alive
            weakSelf.showBufferText.text = result.bestTranscription.formattedString;
        }
        if (error != nil) {
            NSLog(@"%@",error.userInfo);
        }

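Have a look at the properties of the result type SFSpeechRecognitionResult: it has the best transcription and also an array of alternative transcriptions. If you want exact matching, you will probably need to filter through the alternatives as well.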
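
A rough sketch of that filtering idea: check the best transcription first, then walk the transcriptions array and report whether any candidate contains a given keyword. The function name SXResultContainsKeyword and the keyword parameter are assumptions for illustration only.

```objc
#import <Foundation/Foundation.h>
#import <Speech/Speech.h>

// Hypothetical helper (not from the original post): returns YES if the best
// transcription or any alternative transcription contains the keyword.
static BOOL SXResultContainsKeyword(SFSpeechRecognitionResult *result, NSString *keyword) {
    if (result == nil || keyword.length == 0) {
        return NO;
    }
    if ([result.bestTranscription.formattedString containsString:keyword]) {
        return YES;
    }
    // Fall back to the other candidates the recognizer considered.
    for (SFTranscription *candidate in result.transcriptions) {
        if ([candidate.formattedString containsString:keyword]) {
            return YES;
        }
    }
    return NO;
}
```

Inside the resultHandler this could be combined with result.isFinal, so that a command is matched only once the utterance is complete.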

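⑤ Stop listening

    [self.bufferEngine stop];
    [self.buffeInputNode removeTapOnBus:0];
    // Tell the request that no more audio is coming so the task can wrap up
    [self.bufferRequest endAudio];
    self.showBufferText.text = @"";
    self.bufferRequest = nil;
    self.bufferTask = nil;

The bus here identifies a connection point on the node; you can think of it as roughly the same idea as a port.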

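II. SFSpeechURLRecognitionRequest with a delegate

The main difference between block and delegate is that the block approach is concise, while the delegate leaves more room for customization because it exposes more lifecycle callbacks for the result.

There is not much to say about these six methods; the names speak for themselves. One thing to note is that the second method is called many times, while the third is called once when an utterance is finished.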

// Called when the task first detects speech in the source audio
- (void)speechRecognitionDidDetectSpeech:(SFSpeechRecognitionTask *)task;

// Called for all recognitions, including non-final hypothesis
- (void)speechRecognitionTask:(SFSpeechRecognitionTask *)task didHypothesizeTranscription:(SFTranscription *)transcription;

// Called only for final recognitions of utterances. No more about the utterance will be reported
- (void)speechRecognitionTask:(SFSpeechRecognitionTask *)task didFinishRecognition:(SFSpeechRecognitionResult *)recognitionResult;

// Called when the task is no longer accepting new audio but may be finishing final processing
- (void)speechRecognitionTaskFinishedReadingAudio:(SFSpeechRecognitionTask *)task;

// Called when the task has been cancelled, either by client app, the user, or the system
- (void)speechRecognitionTaskWasCancelled:(SFSpeechRecognitionTask *)task;

// Called when recognition of all requested utterances is finished.
// If successfully is false, the error property of the task will contain error information
- (void)speechRecognitionTask:(SFSpeechRecognitionTask *)task didFinishSuccessfully:(BOOL)successfully;
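
To make the difference between the second and third methods concrete, here is a small sketch of a delegate object that keeps the latest text from both callbacks and logs a failure at the end. The class name SXRecognitionLogger and its latestText property are hypothetical; all the delegate methods are optional, so only the ones of interest are implemented.

```objc
#import <Foundation/Foundation.h>
#import <Speech/Speech.h>

// Hypothetical delegate object (not from the original demo).
@interface SXRecognitionLogger : NSObject <SFSpeechRecognitionTaskDelegate>
@property (nonatomic, copy) NSString *latestText;
@end

@implementation SXRecognitionLogger

// Called repeatedly while audio is processed; each call carries the
// current, possibly still changing, hypothesis.
- (void)speechRecognitionTask:(SFSpeechRecognitionTask *)task didHypothesizeTranscription:(SFTranscription *)transcription {
    self.latestText = transcription.formattedString;
}

// Called with the final result for the utterance.
- (void)speechRecognitionTask:(SFSpeechRecognitionTask *)task didFinishRecognition:(SFSpeechRecognitionResult *)recognitionResult {
    self.latestText = recognitionResult.bestTranscription.formattedString;
}

// Called when the task is done; on failure the task's error says why.
- (void)speechRecognitionTask:(SFSpeechRecognitionTask *)task didFinishSuccessfully:(BOOL)successfully {
    if (!successfully) {
        NSLog(@"Recognition failed: %@", task.error);
    }
}

@end
```

An instance would be passed as the delegate argument of recognitionTaskWithRequest:delegate:, and it is safest for the caller to keep its own strong reference to it.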

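The idea behind this approach: first implement a recorder (one you start and stop manually, or a synchronized recorder that starts and stops automatically based on volume, similar to Talking Tom Cat), save the recording to a local directory, and then read it back with the URL request for recognition. The steps are as follows.

① Build the synchronized recorder

The following properties are needed.

/** Recorder */
@property (nonatomic, strong) AVAudioRecorder *recorder;
/** Monitor */
@property (nonatomic, strong) AVAudioRecorder *monitor;
/** URL of the recording file */
@property (nonatomic, strong) NSURL *recordURL;
/** URL of the monitor file */
@property (nonatomic, strong) NSURL *monitorURL;
/** Timer */
@property (nonatomic, strong) NSTimer *timer;

Initializing the properties

    // Recording settings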
    NSDictionary *recordSettings = [[NSDictionary alloc] initWithObjectsAndKeys:
                                    [NSNumber numberWithFloat: 14400.0], AVSampleRateKey,
                                    [NSNumber numberWithInt: kAudioFormatAppleIMA4], AVFormatIDKey,
                                    [NSNumber numberWithInt: 2], AVNumberOfChannelsKey,
                                    [NSNumber numberWithInt: AVAudioQualityMax], AVEncoderAudioQualityKey,
                                    nil];
    
    NSString *recordPath = [NSTemporaryDirectory() stringByAppendingPathComponent:@"record.caf"];
    _recordURL = [NSURL fileURLWithPath:recordPath];
    
    _recorder = [[AVAudioRecorder alloc] initWithURL:_recordURL settings:recordSettings error:NULL];
    
    // Monitor
    NSString *monitorPath = [NSTemporaryDirectory() stringByAppendingPathComponent:@"monitor.caf"];
    _monitorURL = [NSURL fileURLWithPath:monitorPath];
    _monitor = [[AVAudioRecorder alloc] initWithURL:_monitorURL settings:recordSettings error:NULL];
    _monitor.meteringEnabled = YES;

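Don't sweat the constants in the settings dictionary too much. This is code I wrote earlier and lifted straight over; it sets the highest audio quality.

② Starting and stopping

To start and stop based on volume, you need an extra monitor alongside the recorder to check how loud the input is. The peakPowerForChannel method reports the current ambient level at the microphone, and a timer controls how often the level is checked. The code is roughly as follows.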

- (void)setupTimer {
    [self.monitor record];
    self.timer = [NSTimer scheduledTimerWithTimeInterval:0.1 target:self selector:@selector(updateTimer) userInfo:nil repeats:YES]; //董鉑然博客園
}

// Decides when recording starts and stops
- (void)updateTimer {

    // The readings are stale unless the meters are refreshed first
    [self.monitor updateMeters];
    
    // Peak level of channel 0 in dB: -160.0 is complete silence, 0 is the maximum
    float power = [self.monitor peakPowerForChannel:0];
    
    //        NSLog(@"%f", power);
    if (power > -20) {
        if (!self.recorder.isRecording) {
            NSLog(@"Start recording");
            [self.recorder record];
        }
    } else {
        if (self.recorder.isRecording) {
            NSLog(@"Stop recording");
            [self.recorder stop];
            [self recognition];
        }
    }
}
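
To get a feel for the -20 threshold used above, it can help to convert the decibel reading into a linear level between 0 and 1. The conversion below is the standard amplitude formula 10^(dB/20); the helper name SXLinearLevelFromDecibels is made up for this sketch.

```objc
#import <Foundation/Foundation.h>
#include <math.h>

// Hypothetical helper (not from the original post): converts a meter
// reading in dB (-160 ... 0) into a linear level (0 ... 1).
static float SXLinearLevelFromDecibels(float decibels) {
    if (decibels <= -160.0f) {
        return 0.0f;   // treated as complete silence
    }
    if (decibels >= 0.0f) {
        return 1.0f;   // full scale
    }
    return powf(10.0f, decibels / 20.0f);
}
```

With this conversion, the -20 dB threshold corresponds to a linear level of about 0.1, that is, roughly a tenth of full scale.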

③ The speech recognition task request

- (void)recognition {
    // Stop the timer
    [self.timer invalidate];
    // Stop the monitor as well
    [self.monitor stop];
    // Delete the monitor's recording file
    [self.monitor deleteRecording];
    
    // Create the speech recognizer
    SFSpeechRecognizer *rec = [[SFSpeechRecognizer alloc]initWithLocale:[NSLocale localeWithLocaleIdentifier:@"zh_CN"]];
    // For English, use a valid locale identifier such as en_US:
    //            SFSpeechRecognizer *rec = [[SFSpeechRecognizer alloc]initWithLocale:[NSLocale localeWithLocaleIdentifier:@"en_US"]];  //董鉑然博客園
    
    // Recognize from a local audio file
    SFSpeechRecognitionRequest * request = [[SFSpeechURLRecognitionRequest alloc]initWithURL:_recordURL];
    [rec recognitionTaskWithRequest:request delegate:self];
}
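
One possible hardening of the method above, sketched under my own assumptions: keep the recognizer and the task in properties rather than local variables, and check that a recognizer exists for the locale and is currently available before starting. The class SXFileRecognizer and its method name are hypothetical and do not appear in the original demo.

```objc
#import <Foundation/Foundation.h>
#import <Speech/Speech.h>

// Hypothetical wrapper (not from the original demo).
@interface SXFileRecognizer : NSObject
@property (nonatomic, strong) SFSpeechRecognizer *recognizer;
@property (nonatomic, strong) SFSpeechRecognitionTask *task;
- (BOOL)recognizeFileAtURL:(NSURL *)url delegate:(id<SFSpeechRecognitionTaskDelegate>)delegate;
@end

@implementation SXFileRecognizer

// Returns NO when recognition could not be started.
- (BOOL)recognizeFileAtURL:(NSURL *)url delegate:(id<SFSpeechRecognitionTaskDelegate>)delegate {
    if (self.recognizer == nil) {
        // initWithLocale: returns nil when the locale is not supported.
        NSLocale *locale = [NSLocale localeWithLocaleIdentifier:@"zh_CN"];
        self.recognizer = [[SFSpeechRecognizer alloc] initWithLocale:locale];
    }
    if (self.recognizer == nil || !self.recognizer.isAvailable) {
        return NO;
    }
    // Drop any task that is still running before starting a new one.
    [self.task cancel];
    SFSpeechURLRecognitionRequest *request = [[SFSpeechURLRecognitionRequest alloc] initWithURL:url];
    self.task = [self.recognizer recognitionTaskWithRequest:request delegate:delegate];
    return YES;
}

@end
```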

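This snippet, which recognizes a local file and turns it into Chinese text, is probably the one passed around most online, because you can write it without engaging your brain. On its own, though, it is pretty much useless. (Apart from WeChat, which now has a long-press feature that turns a voice message into text, I honestly cannot think of an app whose requirements call for recognizing a local audio file directly. Automatically generating lyrics for an mp3? Jay Chou's songs would be hard to recognize, and speech recognition also requires the audio to be no longer than one minute.)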

④ The delegate method for the result callback

- (void)speechRecognitionTask:(SFSpeechRecognitionTask *)task didFinishRecognition:(SFSpeechRecognitionResult *)recognitionResult
{
    NSLog(@"%s",__FUNCTION__);
    NSLog(@"%@",recognitionResult.bestTranscription.formattedString);
    [self setupTimer];
}

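This is the method used most often. Callbacks for other moments can be added as needed. What is shown here is only a simple demonstration; my demo project has more features.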

https://github.com/dsxNiubility/SXSpeechRecognitionTwoWays

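iOS 10 took a big leap in speech-related features, mainly in two areas. One is the speech recognition covered above; the other is SiriKit, which can pass external information through into an app for it to act on. For now its limitations are fairly obvious: it only supports the domains listed on the official site, such as ride booking and messaging, and it cannot even recognize something like "open Meituan and search for grilled fish restaurants". So there is not much to dig into yet; let's wait for Apple's later updates.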
