Exploring Azure OpenAI Service: GPT-4o-Realtime-Preview with Audio and Speech Capabilities

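Azure OpenAI Service recently introduced GPT-4o-Realtime-Preview, a model that brings audio and speech capabilities to the service. It lets developers build more natural, conversational AI experiences. Let's look at what this means and how you can use these capabilities in your own projects.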

What is GPT-4o-Realtime-Preview?

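GPT-4o-Realtime-Preview is a major update to Azure OpenAI Service that combines language generation with speech interaction. The model accepts audio input and produces audio output, which enables real-time, natural voice-based interaction. Imagine building a virtual assistant or a live customer support system that understands spoken language and responds to it in a natural, conversational way.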

Key features

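Multimodal capabilities: The GPT-4o model family supports text and vision, and GPT-4o-Realtime-Preview now adds audio input and output. This multimodal approach enables more dynamic and interactive AI applications.
Real-time interaction: The model processes and responds to audio input in real time, which suits applications that need immediate feedback, such as virtual assistants and customer service bots.
Advanced speech technology: Building on Azure's heritage in speech services, this update integrates speech-to-text, text-to-speech, neural voices, and real-time translation to improve the overall user experience.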

Practical applications

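One of the most exciting applications of GPT-4o-Realtime-Preview is building voice-based generative AI applications. For example, the VoiceRAG application pattern combines retrieval-augmented generation (RAG) with real-time audio. The AI listens to audio input, retrieves relevant information from a knowledge base, and responds with audio output, creating a fluid conversational experience.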
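To make the retrieval step of this pattern concrete, here is a minimal sketch that ranks documents by keyword overlap and assembles a grounded prompt. Everything in it is an assumption made for illustration: the in-memory KNOWLEDGE_BASE, the retrieve and build_grounded_prompt helpers, and the overlap scoring are hypothetical stand-ins for a real search index, and the audio side (speech in, speech out) is left to the model.

```python
import re

# Hypothetical in-memory knowledge base; a real app would query a search index.
KNOWLEDGE_BASE = [
    {"id": "doc1", "text": "Employees receive 20 days of paid vacation per year."},
    {"id": "doc2", "text": "The office is open from 9am to 6pm on weekdays."},
    {"id": "doc3", "text": "Health insurance covers dental and vision care."},
]


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, top_k=2):
    """Return up to top_k documents that share at least one word with the query."""
    query_words = tokenize(query)
    scored = []
    for doc in KNOWLEDGE_BASE:
        score = len(query_words & tokenize(doc["text"]))
        if score > 0:
            scored.append((score, doc))
    # Highest overlap first; the sort is stable, so ties keep insertion order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def build_grounded_prompt(question, docs):
    """Combine the question with the retrieved sources for the model to answer from."""
    if not docs:
        return f"Question: {question}\nSources: (none found)"
    sources = "\n".join(f"[{d['id']}] {d['text']}" for d in docs)
    return f"Question: {question}\nSources:\n{sources}"


if __name__ == "__main__":
    question = "How many vacation days do employees get?"
    print(build_grounded_prompt(question, retrieve(question)))
```

In a voice application the question would arrive as transcribed or directly understood speech, and the grounded answer would be spoken back rather than printed.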

Implementing VoiceRAG

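To implement a voice-based RAG application, you need to consider both client-side and server-side components. The client handles audio input and output, while the server manages model configuration and access to the knowledge base. Here is a simplified architecture: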

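Function calling: The GPT-4o-Realtime-Preview model supports function calling, so tools for search and grounding can be declared in the session configuration and invoked by the model.
Real-time middle tier: This component proxies audio traffic and handles model configuration and function calls on the backend, which keeps access to resources secure.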
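As a rough illustration of declaring a tool in the session configuration, the sketch below builds a session update message that registers a single search tool. The message shape (a "session.update" event with "instructions", "tools", and "tool_choice") reflects my understanding of the Realtime API and should be checked against the current API reference; the build_session_update helper and the tool description are hypothetical.

```python
import json

# Hypothetical tool declaration: a JSON Schema describing one "search" function.
SEARCH_TOOL = {
    "type": "function",
    "name": "search",
    "description": "Search the knowledge base for passages relevant to the query.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search query"},
        },
        "required": ["query"],
    },
}


def build_session_update(instructions, tools):
    """Build a session configuration message (assumed shape, not authoritative)."""
    return {
        "type": "session.update",
        "session": {
            "instructions": instructions,
            "tools": tools,
            "tool_choice": "auto",  # let the model decide when to call a tool
        },
    }


if __name__ == "__main__":
    message = build_session_update(
        "Answer briefly using the knowledge base.", [SEARCH_TOOL]
    )
    print(json.dumps(message, indent=2))
```

Building this message on the server rather than in the browser means the client never sees or controls the instructions and tool definitions, which is the role the middle tier plays in this architecture.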

Here is a sample code snippet from the VoiceRAG repository to help you get started:

import os
from dotenv import load_dotenv
from aiohttp import web
from ragtools import attach_rag_tools
from rtmt import RTMiddleTier
from azure.identity import DefaultAzureCredential
from azure.core.credentials import AzureKeyCredential

if __name__ == "__main__":
    # Read the endpoints and optional API keys from a local .env file.
    load_dotenv()
    llm_endpoint = os.environ.get("AZURE_OPENAI_ENDPOINT")
    llm_key = os.environ.get("AZURE_OPENAI_API_KEY")
    search_endpoint = os.environ.get("AZURE_SEARCH_ENDPOINT")
    search_index = os.environ.get("AZURE_SEARCH_INDEX")
    search_key = os.environ.get("AZURE_SEARCH_API_KEY")

    # If either API key is missing, fall back to DefaultAzureCredential (keyless auth).
    credentials = DefaultAzureCredential() if not llm_key or not search_key else None

    app = web.Application()

    # The middle tier proxies realtime traffic between the client and the model.
    rtmt = RTMiddleTier(llm_endpoint, AzureKeyCredential(llm_key) if llm_key else credentials)
    rtmt.system_message = "You are a helpful assistant. The user is listening to answers with audio, so it's *super* important that answers are as short as possible, a single sentence if at all possible. " + \
                          "Use the following step-by-step instructions to respond with short and concise answers using a knowledge base: " + \
                          "Step 1 - Always use the 'search' tool to check the knowledge base before answering a question. " + \
                          "Step 2 - Always use the 'report_grounding' tool to report the source of information from the knowledge base. " + \
                          "Step 3 - Produce an answer that's as short as possible. If the answer isn't in the knowledge base, say you don't know."
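    # Register the 'search' and 'report_grounding' tools backed by the search index.
    attach_rag_tools(rtmt, search_endpoint, search_index, AzureKeyCredential(search_key) if search_key else credentials)

    # Expose the realtime endpoint that the client connects to.
    rtmt.attach_to_app(app, "/realtime")

    # Serve the static front end and start the web server.
    app.add_routes([web.get('/', lambda _: web.FileResponse('./static/index.html'))])
    app.router.add_static('/', path='./static', name='static')
    web.run_app(app, host='localhost', port=8765)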

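This code sets up an aiohttp web server that hosts the real-time middle tier. It configures the system message, attaches the RAG tools (search and report_grounding) backed by the search index, and exposes a /realtime endpoint alongside the static front end. Audio capture and playback happen on the client, not in this script.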
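To show what handling a function call on the backend might involve, here is a simplified sketch of a dispatcher that takes a server event announcing a completed function call, runs the matching tool, and builds the messages to send back to the model. This is not the repository's RTMiddleTier implementation. The event and item names ("response.output_item.done", "function_call", "function_call_output", "response.create") follow my understanding of the Realtime API and are assumptions to verify; the search stub and handle_server_event are hypothetical.

```python
import json


def search(args):
    """Hypothetical stand-in for a real knowledge base search tool."""
    return f"No results for '{args['query']}' (stub search tool)"


# Map tool names declared in the session configuration to Python callables.
TOOLS = {"search": search}


def handle_server_event(event):
    """Return the messages to send to the model for a function call, else []."""
    if event.get("type") != "response.output_item.done":
        return []
    item = event.get("item", {})
    if item.get("type") != "function_call":
        return []

    tool = TOOLS.get(item.get("name"))
    if tool is None:
        output = f"Unknown tool: {item.get('name')}"
    else:
        # Arguments arrive as a JSON string; decode them before calling the tool.
        output = tool(json.loads(item.get("arguments") or "{}"))

    return [
        {
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": item["call_id"],  # ties the result to the original call
                "output": output,
            },
        },
        # Ask the model to continue now that the tool result is available.
        {"type": "response.create"},
    ]


if __name__ == "__main__":
    sample_event = {
        "type": "response.output_item.done",
        "item": {
            "type": "function_call",
            "name": "search",
            "call_id": "call_1",
            "arguments": json.dumps({"query": "vacation policy"}),
        },
    }
    for reply in handle_server_event(sample_event):
        print(json.dumps(reply))
```

Because the tool runs on the server, the client only ever exchanges audio and events with the middle tier and never needs credentials for the knowledge base.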

See this GitHub repository for the detailed implementation: navintkr/openai-rag-audio (github.com)

Conclusion

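The introduction of GPT-4o-Realtime-Preview, with its audio and speech capabilities, marks a significant step forward for Azure OpenAI Service. By integrating these capabilities, developers can build more engaging and interactive AI applications that combine natural language processing with real-time voice interaction. Whether you are building a virtual assistant, a customer support bot, or any other voice-driven application, these new features open up more possibilities.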

Happy coding!