WinUI (WASDK): Making a Robot Smarter with ChatGPT, Camera Gesture Recognition, and TTS


Preface

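I previously wrote a blog post on classifying hand keypoints with ML.NET, which extracts a hand from an image and classifies its gesture. Building on that, I combined gesture classification with live camera data and integrated it into 電子腦殼, a desktop application I develop.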

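電子腦殼 is a desktop application that provides software features for ElectronBot, the desktop robot open-sourced by 稚暉君. It is developed by me, 綠蔭阿廣, and is built on Microsoft's WASDK (Windows App SDK) framework.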

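電子腦殼 is essentially my practice project for learning WinUI development. By studying several open-source projects I have combined their ideas into features of my own. For example, a recognized gesture triggers speech-to-text, the transcribed text is sent to ChatGPT, and the reply is read aloud with text-to-speech, so the robot can hold a conversation.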

This post is a hands-on record of that work, so that I hit the pitfalls before you do.

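The image below links to a demo video of the robot. In the conversation I asked ChatGPT to tell me the story of 駱駝祥子 (Rickshaw Boy). The result was a bit absurd: the first part was fine, but then it started making things up. For example, Xiangzi acquired a donkey and ended up as a merchant.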

If you enjoy the video, please give it a like.

Implementation Approach

1. Overview of the approach

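The overall flow is shown in the figure below. The diagram may not follow any formal notation, but it captures the general idea:
[Figure: recognition flowchart]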

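Handle the camera frame event and run gesture matching on each frame's image data.
The gesture result handler starts the speech-to-text logic.
The transcribed text is sent to the ChatGPT API to obtain an intelligent reply.
The reply text is played through the robot's speaker via TTS, completing one round of conversation.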
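
The four steps above can be sketched as a small orchestration class. This is only an illustration of the flow, not the project's actual code: `ConversationPipeline`, the trigger gesture label, and the three delegates are hypothetical names introduced here, and the real speech, chat, and TTS services are passed in as plain functions so the sketch stays self-contained.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical orchestration of the gesture -> speech-to-text -> chat -> TTS flow.
// None of these names come from the project; the stages are injected as delegates.
public sealed class ConversationPipeline
{
    private readonly string _triggerGesture;
    private readonly Func<Task<string>> _listenAsync;          // speech-to-text
    private readonly Func<string, Task<string>> _askAsync;     // chat API call
    private readonly Func<string, Task> _speakAsync;           // text-to-speech playback

    public ConversationPipeline(
        string triggerGesture,
        Func<Task<string>> listenAsync,
        Func<string, Task<string>> askAsync,
        Func<string, Task> speakAsync)
    {
        _triggerGesture = triggerGesture;
        _listenAsync = listenAsync;
        _askAsync = askAsync;
        _speakAsync = speakAsync;
    }

    // Called with the label the gesture classifier produced for a frame.
    // Returns the spoken reply, or null when no conversation was started.
    public async Task<string?> OnGestureAsync(string gestureLabel)
    {
        if (!string.Equals(gestureLabel, _triggerGesture, StringComparison.OrdinalIgnoreCase))
        {
            return null; // not the trigger gesture, ignore the frame
        }

        var question = await _listenAsync();
        if (string.IsNullOrWhiteSpace(question))
        {
            return null; // nothing was recognized
        }

        var reply = await _askAsync(question);
        await _speakAsync(reply);
        return reply;
    }
}
```

Passing the stages as delegates keeps each step replaceable, which matches the article's idea of swapping in different chat back ends.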

2. Technologies used

WASDK
MediaPipe offers open source cross-platform, customizable ML solutions for live and streaming media.
ML.NET: an open-source, cross-platform machine learning framework.

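I covered this technology stack in my previous article, so I will not go into detail here. If you are interested, follow the link to that article: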

WinUI (WASDK): Detecting Hand Keypoints with MediaPipe and Classifying Gestures with ML.NET

Code Walkthrough

1. Project overview

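電子腦殼 is a standard MVVM WinUI project. It uses Microsoft's lightweight DI container to manage object lifetimes, and its MVVM layer is the framework from the Community Toolkit, which supports source generation and keeps the view model code short.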

project
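
To make the combination of DI and source-generated MVVM concrete, here is a minimal sketch. It assumes the `CommunityToolkit.Mvvm` and `Microsoft.Extensions.DependencyInjection` NuGet packages are referenced; `IGreetingService`, `GreetingService`, `DemoViewModel`, and `Services` are hypothetical names made up for this example and do not exist in the project.

```csharp
using System;
using CommunityToolkit.Mvvm.ComponentModel;
using CommunityToolkit.Mvvm.Input;
using Microsoft.Extensions.DependencyInjection;

// Hypothetical service, used only for this illustration.
public interface IGreetingService
{
    string Greet(string name);
}

public sealed class GreetingService : IGreetingService
{
    public string Greet(string name) => $"Hello, {name}!";
}

// The class must be partial so the toolkit's source generators can add members.
public partial class DemoViewModel : ObservableObject
{
    private readonly IGreetingService _greetingService;

    // The DI container supplies the service through the constructor.
    public DemoViewModel(IGreetingService greetingService)
    {
        _greetingService = greetingService;
    }

    // Generates a public Message property that raises PropertyChanged.
    [ObservableProperty]
    private string _message = string.Empty;

    // Generates a public SayHelloCommand that invokes this method.
    [RelayCommand]
    private void SayHello() => Message = _greetingService.Greet("ElectronBot");
}

public static class Services
{
    // Registers the service as a singleton and the view model as transient.
    public static IServiceProvider Build() =>
        new ServiceCollection()
            .AddSingleton<IGreetingService, GreetingService>()
            .AddTransient<DemoViewModel>()
            .BuildServiceProvider();
}
```

A view model can then be resolved with `Services.Build().GetRequiredService<DemoViewModel>()`, which is presumably close to what the project's `App.GetService<T>()` helper does internally.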

2. Core code walkthrough

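Gestures are parsed from the live video stream. Using the MediaCapture class in the Windows.Media.Capture namespace and the MediaFrameReader class in the Windows.Media.Capture.Frames namespace, the application creates the capture objects and registers a frame handler. The handler processes each video frame and passes it to the gesture recognition service. The main code is as follows.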

// Handler subscribed to the processed-frame event
private void Current_SoftwareBitmapFrameCaptured(object? sender, SoftwareBitmapEventArgs e)
{
    if (e.SoftwareBitmap is not null)
    {
        // Normalize the frame to premultiplied BGRA8 before analysis
        if (e.SoftwareBitmap.BitmapPixelFormat != BitmapPixelFormat.Bgra8 ||
              e.SoftwareBitmap.BitmapAlphaMode == BitmapAlphaMode.Straight)
        {
            e.SoftwareBitmap = SoftwareBitmap.Convert(
                e.SoftwareBitmap, BitmapPixelFormat.Bgra8, BitmapAlphaMode.Premultiplied);
        }
        // Resolve the gesture classification service
        var service = App.GetService<GestureClassificationService>();
        // Run gesture analysis without awaiting, so the frame handler is not blocked
        _ = service.HandPredictResultUnUseQueueAsync(calculator, modelPath, e.SoftwareBitmap);
    }
}
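
The handler above receives frames that have already been captured. For readers unfamiliar with the capture side, the following is a rough sketch of how a MediaFrameReader is typically set up with the WinRT APIs named in the text. It is not the project's CameraFrameService: the class name `CameraFrameReaderSketch` and its `FrameCaptured` event are assumptions made for illustration, and it needs a Windows target with a camera to run.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using Windows.Graphics.Imaging;
using Windows.Media.Capture;
using Windows.Media.Capture.Frames;
using Windows.Media.MediaProperties;

// Hypothetical wrapper around MediaCapture + MediaFrameReader.
public sealed class CameraFrameReaderSketch
{
    private MediaCapture? _mediaCapture;
    private MediaFrameReader? _frameReader;

    // Raised with a copy of each frame; the subscriber owns the bitmap.
    public event EventHandler<SoftwareBitmap>? FrameCaptured;

    public async Task StartAsync()
    {
        // Pick the first source group that exposes a color stream.
        var groups = await MediaFrameSourceGroup.FindAllAsync();
        var group = groups.FirstOrDefault(g =>
                g.SourceInfos.Any(i => i.SourceKind == MediaFrameSourceKind.Color))
            ?? throw new InvalidOperationException("No color camera found.");

        _mediaCapture = new MediaCapture();
        await _mediaCapture.InitializeAsync(new MediaCaptureInitializationSettings
        {
            SourceGroup = group,
            SharingMode = MediaCaptureSharingMode.ExclusiveControl,
            StreamingCaptureMode = StreamingCaptureMode.Video,
            // Request CPU memory so frames are delivered as SoftwareBitmap.
            MemoryPreference = MediaCaptureMemoryPreference.Cpu
        });

        var source = _mediaCapture.FrameSources.Values
            .First(s => s.Info.SourceKind == MediaFrameSourceKind.Color);

        _frameReader = await _mediaCapture.CreateFrameReaderAsync(
            source, MediaEncodingSubtypes.Bgra8);
        _frameReader.FrameArrived += OnFrameArrived;

        var status = await _frameReader.StartAsync();
        if (status != MediaFrameReaderStartStatus.Success)
        {
            throw new InvalidOperationException($"Frame reader failed to start: {status}");
        }
    }

    private void OnFrameArrived(MediaFrameReader sender, MediaFrameArrivedEventArgs args)
    {
        using var frame = sender.TryAcquireLatestFrame();
        var bitmap = frame?.VideoMediaFrame?.SoftwareBitmap;
        if (bitmap is null)
        {
            return;
        }
        // Copy, because the frame's bitmap is released when the frame is disposed.
        FrameCaptured?.Invoke(this, SoftwareBitmap.Copy(bitmap));
    }

    public async Task StopAsync()
    {
        if (_frameReader is not null)
        {
            _frameReader.FrameArrived -= OnFrameArrived;
            await _frameReader.StopAsync();
            _frameReader.Dispose();
            _frameReader = null;
        }
        _mediaCapture?.Dispose();
        _mediaCapture = null;
    }
}
```

Copying the bitmap in the handler costs some memory per frame, but it lets the gesture service work on the image after the reader has moved on to the next frame.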

The relevant code is in:

MainViewModel

CameraFrameService

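Next is the speech-to-text implementation. WinUI (WASDK) inherits UWP's modern UI and can also call WinRT APIs without friction. The main object involved is SpeechRecognizer in the Windows.Media.SpeechRecognition namespace.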

Official documentation: Speech interactions; Define custom recognition constraints

Below is part of the speech-to-text code. The full code is available through the link in the original post.

// Create a topic constraint for the web-search scenario ("webSearch" is the topic hint, "sound" is the tag)
var webSearchGrammar = new SpeechRecognitionTopicConstraint(SpeechRecognitionScenario.WebSearch, "webSearch", "sound");
//webSearchGrammar.Probability = SpeechRecognitionConstraintProbability.Min;
speechRecognizer.Constraints.Add(webSearchGrammar);
SpeechRecognitionCompilationResult result = await speechRecognizer.CompileConstraintsAsync();

if (result.Status != SpeechRecognitionResultStatus.Success)
{
    // Disable the recognition buttons.
}
else
{
    // Handle continuous recognition events. Completed fires when various error states occur. ResultGenerated fires when
    // some recognized phrases occur, or the garbage rule is hit.
    // Register the event handlers
    speechRecognizer.ContinuousRecognitionSession.Completed += ContinuousRecognitionSession_Completed;
    speechRecognizer.ContinuousRecognitionSession.ResultGenerated += ContinuousRecognitionSession_ResultGenerated;
}
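
The snippet above only compiles the constraints and registers the handlers; the session still has to be started and stopped. The sketch below shows one way to do that with the same WinRT types. The class name `ContinuousListenerSketch` is hypothetical, and the "zh-CN" language tag is an assumption: it must match a speech language that is actually installed on the machine.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;
using Windows.Globalization;
using Windows.Media.SpeechRecognition;

// Hypothetical helper that owns a SpeechRecognizer and its continuous session.
public sealed class ContinuousListenerSketch : IDisposable
{
    private readonly SpeechRecognizer _recognizer;
    private bool _prepared;

    // "zh-CN" is an assumed default; use a language installed on the device.
    public ContinuousListenerSketch(string languageTag = "zh-CN")
    {
        _recognizer = new SpeechRecognizer(new Language(languageTag));
    }

    // Returns false when the constraints could not be compiled.
    public async Task<bool> StartAsync()
    {
        if (!_prepared)
        {
            _recognizer.Constraints.Add(new SpeechRecognitionTopicConstraint(
                SpeechRecognitionScenario.WebSearch, "webSearch"));

            var compilation = await _recognizer.CompileConstraintsAsync();
            if (compilation.Status != SpeechRecognitionResultStatus.Success)
            {
                _recognizer.Constraints.Clear(); // allow a clean retry
                return false;
            }

            _recognizer.ContinuousRecognitionSession.ResultGenerated +=
                (session, args) => Debug.WriteLine($"Heard: {args.Result.Text}");
            _prepared = true;
        }

        // Starting a session that is already running throws, so check the state first.
        if (_recognizer.State == SpeechRecognizerState.Idle)
        {
            await _recognizer.ContinuousRecognitionSession.StartAsync();
        }
        return true;
    }

    public async Task StopAsync()
    {
        if (_recognizer.State != SpeechRecognizerState.Idle)
        {
            await _recognizer.ContinuousRecognitionSession.StopAsync();
        }
    }

    public void Dispose() => _recognizer.Dispose();
}
```

In the robot scenario, `StartAsync` would be called from the gesture result handler and `StopAsync` once a reply is about to be spoken, so the robot does not transcribe its own voice.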

After the speech has been transcribed, the ChatGPT API is called to obtain a reply. This is implemented with the ChatGPTSharp wrapper library.

The code is as follows:

private async void ContinuousRecognitionSession_ResultGenerated(SpeechContinuousRecognitionSession sender, SpeechContinuousRecognitionResultGeneratedEventArgs args)
{
    // The garbage rule will not have a tag associated with it, the other rules will return a string matching the tag provided
    // when generating the grammar.
    var tag = "unknown";

    if (args.Result.Constraint != null && isListening)
    {
        tag = args.Result.Constraint.Tag;

        App.MainWindow.DispatcherQueue.TryEnqueue(() =>
        {
            ToastHelper.SendToast(tag, TimeSpan.FromSeconds(3));
        });


        Debug.WriteLine($"Recognized---{tag}");
    }

    // Developers may decide to use per-phrase confidence levels in order to tune the behavior of their 
    // grammar based on testing.
    if (args.Result.Confidence == SpeechRecognitionConfidence.Medium ||
        args.Result.Confidence == SpeechRecognitionConfidence.High)
    {
        var result = string.Format("Heard: '{0}', (Tag: '{1}', Confidence: {2})", args.Result.Text, tag, args.Result.Confidence.ToString());


        App.MainWindow.DispatcherQueue.TryEnqueue(() =>
        {
            ToastHelper.SendToast(result, TimeSpan.FromSeconds(3));
        });


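        // Built-in voice commands; compare case-insensitively without culture-dependent ToUpper()
        if (string.Equals(args.Result.Text, "打開B站", StringComparison.OrdinalIgnoreCase)) // "Open Bilibili"
        {
            await Launcher.LaunchUriAsync(new Uri(@"https://www.bilibili.com/"));
        }
        else if (string.Equals(args.Result.Text, "撒個嬌", StringComparison.OrdinalIgnoreCase)) // "Act cute"
        {
            ElectronBotHelper.Instance.ToPlayEmojisRandom();
        }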
        else
        {
            try
            {
                // Use the chatbot client factory to create the configured handler; this allows multiple chat APIs to be supported
                var chatBotClientFactory = App.GetService<IChatbotClientFactory>();

                var chatBotClientName = (await App.GetService<ILocalSettingsService>()
                     .ReadSettingAsync<ComboxItemModel>(Constants.DefaultChatBotNameKey))?.DataKey;

                if (string.IsNullOrEmpty(chatBotClientName))
                {
                    // Message text: "The voice provider's secret data has not been configured"
                    throw new Exception("未配置語音提供程式機密數據");
                }

                var chatBotClient = chatBotClientFactory.CreateChatbotClient(chatBotClientName);
                // Call the selected implementation to get the chat reply
                var resultText = await chatBotClient.AskQuestionResultAsync(args.Result.Text);

                //isListening = false;
                await ReleaseRecognizerAsync();
                // Convert the reply text to speech and play it
                await ElectronBotHelper.Instance.MediaPlayerPlaySoundByTTSAsync(resultText, false);
            }
            catch (Exception ex)
            {
                App.MainWindow.DispatcherQueue.TryEnqueue(() =>
                {
                    ToastHelper.SendToast(ex.Message, TimeSpan.FromSeconds(3));
                });

            }
        }
    }
}
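
The handler relies on a factory so that different chat APIs can be plugged in by name. The project's real implementation is not shown in this post, so the following is only a sketch of how such a factory could look. `IChatbotClientFactory`, `CreateChatbotClient`, and `AskQuestionResultAsync` are taken from the code above; the `IChatbotClient` interface shape, its `Name` property, `EchoChatbotClient`, and `ChatbotClientFactory` are assumptions introduced for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Assumed shape of a chat client; only AskQuestionResultAsync appears in the article.
public interface IChatbotClient
{
    string Name { get; }
    Task<string> AskQuestionResultAsync(string message);
}

public interface IChatbotClientFactory
{
    IChatbotClient CreateChatbotClient(string name);
}

// Hypothetical stand-in client that echoes the question instead of calling a real API.
public sealed class EchoChatbotClient : IChatbotClient
{
    public string Name => "Echo";

    public Task<string> AskQuestionResultAsync(string message) =>
        Task.FromResult($"Echo: {message}");
}

// Hypothetical factory that looks clients up by name, ignoring case.
public sealed class ChatbotClientFactory : IChatbotClientFactory
{
    private readonly Dictionary<string, IChatbotClient> _clients =
        new(StringComparer.OrdinalIgnoreCase);

    public ChatbotClientFactory(IEnumerable<IChatbotClient> clients)
    {
        foreach (var client in clients)
        {
            _clients[client.Name] = client;
        }
    }

    public IChatbotClient CreateChatbotClient(string name) =>
        _clients.TryGetValue(name, out var client)
            ? client
            : throw new ArgumentException($"No chatbot client registered under '{name}'.", nameof(name));
}
```

With a DI container, every `IChatbotClient` registration can be injected into the factory as an `IEnumerable<IChatbotClient>`, so adding a new chat provider only requires one more registration.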

The reply text is then converted to speech and played. With the SpeechSynthesizer class in the Windows.Media.SpeechSynthesis namespace, the following code converts text into a stream.

using SpeechSynthesizer synthesizer = new();
// Create a stream from the text. This will be played using a media element.
var synthesisStream = await synthesizer.SynthesizeTextToStreamAsync(text);

A MediaPlayer object then plays the synthesized speech.


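/// <summary>
/// Converts the given text to speech and plays it on the selected audio device.
/// </summary>
/// <param name="content">The text to speak.</param>
/// <param name="isOpenMediaEnded">Stored in _isOpenMediaEnded for use when playback ends.</param>
/// <returns>A task that completes once playback has been started.</returns>
public async Task MediaPlayerPlaySoundByTTSAsync(string content, bool isOpenMediaEnded = true)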
{
    _isOpenMediaEnded = isOpenMediaEnded;
    if (!string.IsNullOrWhiteSpace(content))
    {
        try
        {
            var localSettingsService = App.GetService<ILocalSettingsService>();

            var audioModel = await localSettingsService
                .ReadSettingAsync<ComboxItemModel>(Constants.DefaultAudioNameKey);

            var audioDevs = await EbHelper.FindAudioDeviceListAsync();

            if (audioModel != null)
            {
                var audioSelect = audioDevs.FirstOrDefault(c => c.DataValue == audioModel.DataValue) ?? new ComboxItemModel();

                var selectedDevice = (DeviceInformation)audioSelect.Tag!;

                if (selectedDevice != null)
                {
                    mediaPlayer.AudioDevice = selectedDevice;
                }
            }
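            // Resolve the TTS service
            var speechAndTTSService = App.GetService<ISpeechAndTTSService>();
            // Convert the text to an audio stream
            var stream = await speechAndTTSService.TextToSpeechAsync(content);
            // Play the stream
            mediaPlayer.SetStreamSource(stream);
            mediaPlayer.Play();
            isTTS = true;
        }
        catch (Exception ex)
        {
            // Playback failures are not fatal, but log them instead of swallowing them silently
            System.Diagnostics.Debug.WriteLine($"TTS playback failed: {ex}");
        }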
    }
}
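
As a complement, here is a self-contained sketch that combines synthesis and playback and waits until the audio has finished, which is useful when listening should only resume after the robot stops talking. It is an alternative illustration, not the project's method: `TtsPlayerSketch` is a hypothetical class, the "zh-CN" voice preference is an assumption, and it uses `MediaSource.CreateFromStream` with the `MediaPlayer.Source` property rather than `SetStreamSource`.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using Windows.Media.Core;
using Windows.Media.Playback;
using Windows.Media.SpeechSynthesis;

// Hypothetical helper: synthesize text and await the end of playback.
public sealed class TtsPlayerSketch : IDisposable
{
    private readonly MediaPlayer _player = new();

    // languageTag is an assumed preference; the default voice is used if none matches.
    public async Task SpeakAsync(string text, string languageTag = "zh-CN")
    {
        if (string.IsNullOrWhiteSpace(text))
        {
            return;
        }

        using var synthesizer = new SpeechSynthesizer();
        var voice = SpeechSynthesizer.AllVoices.FirstOrDefault(v =>
            v.Language.StartsWith(languageTag, StringComparison.OrdinalIgnoreCase));
        if (voice is not null)
        {
            synthesizer.Voice = voice;
        }

        using SpeechSynthesisStream stream = await synthesizer.SynthesizeTextToStreamAsync(text);

        // Complete the task when playback ends, or fault it when playback fails.
        var finished = new TaskCompletionSource<bool>();
        void OnEnded(MediaPlayer sender, object args) => finished.TrySetResult(true);
        void OnFailed(MediaPlayer sender, MediaPlayerFailedEventArgs args) =>
            finished.TrySetException(new InvalidOperationException(args.ErrorMessage));

        _player.MediaEnded += OnEnded;
        _player.MediaFailed += OnFailed;
        try
        {
            _player.Source = MediaSource.CreateFromStream(stream, stream.ContentType);
            _player.Play();
            await finished.Task;
        }
        finally
        {
            _player.MediaEnded -= OnEnded;
            _player.MediaFailed -= OnFailed;
        }
    }

    public void Dispose() => _player.Dispose();
}
```

Awaiting the end of playback trades responsiveness for simplicity: the caller knows exactly when it is safe to restart speech recognition.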

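That completes one full recognition and conversation cycle. The application's interface is shown in the figure below. If you are interested, click the image to visit the project's source code and explore its other features: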

Personal Reflections

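In my view the .NET ecosystem is still lacking in this area, and ML.NET in particular has too few ready-made components. Few people contribute, and transferring knowledge has a cost: people who know other machine learning frameworks may not know .NET.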

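So, as a member of the community, I think we need to go out and then come back. Going out means learning other machine learning frameworks first; coming back means applying that knowledge in .NET. As more components get built this way, the community will keep growing.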

And then I will have a lot more of everyone's code to copy and paste.