How to Implement Image Recognition with TensorFlow in Real-Time Audio and Video

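Over the past two years, Python has consistently ranked among the top five programming languages by popularity. Python has an active community and a rich set of third-party libraries: web frameworks, crawler frameworks, data analysis frameworks, machine learning frameworks, and more. Developers do not need to reinvent the wheel; they can use Python for web programming, network programming, multimedia applications, data analysis, or image recognition. Image recognition is one of the most popular use cases, and also one of the best fits for real-time audio and video.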

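This article first shows how to implement image recognition with TensorFlow. We then combine TensorFlow with the Agora Python SDK to perform image recognition during a real-time audio/video call. The Agora Python SDK handles audio and video encoding and decoding, noise suppression, echo cancellation, and low-latency transport. The video frames are passed to TensorFlow in RGB format, TensorFlow runs image recognition on them, and the result is returned to the client.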

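Let's start with what the demo's recognition looks like. The image on the left is the remote peer's video, the lower-right corner shows our local video, and once the remote image has been recognized, the result appears in the text box on the right. Note that the 🙂 sticker was added in post-production; the demo does not yet support stickers. If you are interested, you can extend the demo with that feature.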

First, let's go over how image recognition works in TensorFlow and how to use it.

Image and Object Recognition with TensorFlow

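TensorFlow is Google's open-source deep learning library. With this framework and the Python programming language, you can build a wide range of machine-learning applications. Many people have also published applications and other frameworks built on TensorFlow as open source on GitHub. In this article we focus on object recognition with TensorFlow.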

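TensorFlow provides an API for detecting the objects contained in images or videos (the TensorFlow Object Detection API); see its documentation for details.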

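Object detection finds every object that appears in an image and marks each one with a rectangle (a bounding box). Objects can belong to many classes, such as people, vehicles, animals, and road signs. The example below shows how to use the TensorFlow Object Detection API with the pretrained ssd_mobilenet_v1_coco model (SSD stands for Single Shot MultiBox Detector). Other available detection models are listed in the TensorFlow detection model zoo.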

Load the libraries

# -*- coding:utf-8 -*-
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
from PIL import Image
# Helper modules shipped with the TensorFlow Object Detection API
from utils import label_map_util
from utils import visualization_utils as vis_util

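Define some constants

# Frozen inference graph of the pretrained model
PATH_TO_CKPT = 'ssd_mobilenet_v1_coco_2017_11_17/frozen_inference_graph.pb'
# Label map that translates class ids into names
PATH_TO_LABELS = 'ssd_mobilenet_v1_coco_2017_11_17/mscoco_label_map.pbtxt'
NUM_CLASSES = 90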

Load the pretrained model

# TensorFlow 1.x API: read the frozen graph and import it into a new tf.Graph
detection_graph = tf.Graph()
with detection_graph.as_default():
    od_graph_def = tf.GraphDef()
    with tf.gfile.GFile(PATH_TO_CKPT, 'rb') as fid:
        od_graph_def.ParseFromString(fid.read())
        tf.import_graph_def(od_graph_def, name='')

Load the class label data

label_map = label_map_util.load_labelmap(PATH_TO_LABELS)
categories = label_map_util.convert_label_map_to_categories(
    label_map, max_num_classes=NUM_CLASSES, use_display_name=True)
# category_index maps each class id to its category entry (including the name)
category_index = label_map_util.create_category_index(categories)
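As a rough illustration of what the label-loading step produces, the sketch below builds the same kind of lookup by hand: a dictionary keyed by class id whose values carry the display name. The `build_category_index` helper and the three sample entries are illustrative assumptions, not part of the Object Detection API; the real `category_index` is built from `mscoco_label_map.pbtxt`.

```python
# Hypothetical stand-in for label_map_util.create_category_index:
# maps each class id to a small dict holding the id and display name.
def build_category_index(categories):
    return {cat['id']: cat for cat in categories}

# A few sample entries in the style of the COCO label map (illustrative subset).
sample_categories = [
    {'id': 1, 'name': 'person'},
    {'id': 3, 'name': 'car'},
    {'id': 18, 'name': 'dog'},
]

category_index_demo = build_category_index(sample_categories)

# The detector returns numeric class ids; the index turns them into names.
detected_class_id = 18
print(category_index_demo[detected_class_id]['name'])
```

This is why the detection code later indexes `category_index[int(c)]['name']`: the model only ever outputs numeric ids.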

Convert the image to an array, and set the test image paths

def load_image_into_numpy_array(image):
    # PIL reports size as (width, height); the array is laid out as (height, width, 3)
    (im_width, im_height) = image.size
    return np.array(image.getdata()).reshape((im_height, im_width, 3)).astype(np.uint8)

TEST_IMAGE_PATHS = ['test_images/image1.jpg', 'test_images/image2.jpg']

Run object detection with the model

with detection_graph.as_default():
    with tf.Session(graph=detection_graph) as sess:
        # Input and output tensors of the detection graph
        image_tensor = detection_graph.get_tensor_by_name('image_tensor:0')
        detection_boxes = detection_graph.get_tensor_by_name('detection_boxes:0')
        detection_scores = detection_graph.get_tensor_by_name('detection_scores:0')
        detection_classes = detection_graph.get_tensor_by_name('detection_classes:0')
        num_detections = detection_graph.get_tensor_by_name('num_detections:0')
        for image_path in TEST_IMAGE_PATHS:
            image = Image.open(image_path)
            image_np = load_image_into_numpy_array(image)
            # The model expects a batch dimension: (1, height, width, 3)
            image_np_expanded = np.expand_dims(image_np, axis=0)
            (boxes, scores, classes, num) = sess.run(
                [detection_boxes, detection_scores, detection_classes, num_detections],
                feed_dict={image_tensor: image_np_expanded})
            # Draw the detected boxes and labels onto image_np in place
            vis_util.visualize_boxes_and_labels_on_image_array(
                image_np, np.squeeze(boxes), np.squeeze(classes).astype(np.int32),
                np.squeeze(scores), category_index,
                use_normalized_coordinates=True, line_thickness=8)
            plt.figure(figsize=[12, 8])
            plt.imshow(image_np)
            plt.show()

The detection results are as follows: two dogs were detected in the first image.
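The arrays returned by `sess.run` can also be read directly instead of being drawn. The sketch below, written in plain Python for clarity, assumes the usual Object Detection API conventions: boxes are `[ymin, xmin, ymax, xmax]` normalized to the range 0 to 1, and detections are sorted by descending score. The `decode_detections` helper, the 0.5 threshold, and the sample values are illustrative assumptions.

```python
def decode_detections(boxes, scores, classes, image_width, image_height,
                      min_score=0.5):
    """Convert normalized boxes to pixel coordinates and drop weak detections."""
    results = []
    for box, score, class_id in zip(boxes, scores, classes):
        if score < min_score:
            # Scores are assumed sorted in descending order, so stop early.
            break
        ymin, xmin, ymax, xmax = box
        results.append({
            'class_id': int(class_id),
            'score': score,
            # (left, top, right, bottom) in pixels
            'box_px': (int(xmin * image_width), int(ymin * image_height),
                       int(xmax * image_width), int(ymax * image_height)),
        })
    return results

# Made-up values shaped like one image's output after np.squeeze().
sample_boxes = [[0.125, 0.25, 0.5, 0.75],
                [0.25, 0.5, 0.75, 1.0],
                [0.0, 0.0, 0.125, 0.125]]
sample_scores = [0.9, 0.75, 0.2]
sample_classes = [18.0, 18.0, 1.0]

detections = decode_detections(sample_boxes, sample_scores, sample_classes,
                               image_width=640, image_height=480)
for det in detections:
    print(det)
```

With these sample values the third detection falls below the threshold and is dropped, leaving two boxes for class id 18.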

TensorFlow Object Recognition in Real-Time Audio and Video Scenarios

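Object recognition on still images with TensorFlow is fairly mature. But how do we do object recognition in the many real-world scenarios built on real-time audio/video interaction? Below we explain how, using Agora's real-time video SDK.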

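First, note that a video is simply a sequence of image frames, so detecting objects in a video means detecting objects in each frame; in that sense the two tasks are essentially the same. With that understanding, object recognition with TensorFlow in a real-time audio/video scenario becomes relatively straightforward.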

1) Read the Agora real-time audio/video stream and capture frames from the remote video stream

import ctypes
import cv2

def onRenderVideoFrame(uid, width, height, yStride,
                       uStride, vStride, yBuffer, uBuffer, vBuffer,
                       rotation, renderTimeMs, avsync_type):
    # isImageDetect is True once the previous frame has been recognized. Only then
    # do we convert a new frame; after the conversion the flag is set back to False.
    if EventHandlerData.isImageDetect:
        # Wrap the raw Y, U and V plane pointers (U and V are subsampled 2x2).
        # Note: this assumes each stride equals the plane width (no row padding).
        y_array = (ctypes.c_uint8 * (width * height)).from_address(yBuffer)
        u_array = (ctypes.c_uint8 * ((width // 2) * (height // 2))).from_address(uBuffer)
        v_array = (ctypes.c_uint8 * ((width // 2) * (height // 2))).from_address(vBuffer)

        Y = np.frombuffer(y_array, dtype=np.uint8).reshape(height, width)
        # Upsample U and V to full resolution by repeating each sample 2x2
        U = np.frombuffer(u_array, dtype=np.uint8).reshape((height // 2, width // 2)).repeat(2, axis=0).repeat(2, axis=1)
        V = np.frombuffer(v_array, dtype=np.uint8).reshape((height // 2, width // 2)).repeat(2, axis=0).repeat(2, axis=1)
        YUV = np.dstack((Y, U, V))[:height, :width, :]
        # Most AI models are trained on RGB images, while the Agora video
        # callback delivers YUV data, so we convert the format here
        RGB = cv2.cvtColor(YUV, cv2.COLOR_YUV2RGB, 3)
        EventHandlerData.image = Image.fromarray(RGB)
        EventHandlerData.isImageDetect = False
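The callback above assumes each row of a plane is exactly as wide as the image, and it delegates the color conversion to OpenCV. As a hedged sketch of what happens per pixel, the plain-Python code below reads one pixel from planar Y, U, V buffers while honoring the row strides, and converts it with the full-range BT.601 formulas. The helper names are hypothetical, and the choice of full-range BT.601 is an assumption; the actual color matrix and range depend on how the frames were encoded, and OpenCV's `COLOR_YUV2RGB` may use slightly different coefficients.

```python
def clamp_u8(value):
    """Round and clamp a float to the 0..255 range of an 8-bit channel."""
    return max(0, min(255, int(round(value))))

def yuv_to_rgb(y, u, v):
    """Full-range BT.601 YUV -> RGB for a single pixel (assumed color matrix)."""
    r = y + 1.402 * (v - 128)
    g = y - 0.344136 * (u - 128) - 0.714136 * (v - 128)
    b = y + 1.772 * (u - 128)
    return clamp_u8(r), clamp_u8(g), clamp_u8(b)

def i420_pixel(y_plane, u_plane, v_plane, y_stride, u_stride, v_stride, col, row):
    """Fetch the (Y, U, V) triple for pixel (col, row) from planar buffers.

    Each plane row occupies `stride` bytes, which may exceed the visible width.
    U and V are subsampled 2x2, so four neighboring pixels share one sample.
    """
    y = y_plane[row * y_stride + col]
    u = u_plane[(row // 2) * u_stride + col // 2]
    v = v_plane[(row // 2) * v_stride + col // 2]
    return y, u, v

# A 2x2 frame whose Y rows are padded to a stride of 4 bytes (made-up data).
y_plane = bytes([16, 128, 0, 0,
                 200, 255, 0, 0])
u_plane = bytes([128, 0])   # one chroma sample, stride 2
v_plane = bytes([128, 0])

for row in range(2):
    for col in range(2):
        yuv = i420_pixel(y_plane, u_plane, v_plane, 4, 2, 2, col, row)
        print((col, row), yuv_to_rgb(*yuv))
```

If the SDK ever delivers frames whose stride is larger than the width, indexing by stride as above (or slicing each row to `width` in NumPy) avoids the skewed image that a plain `reshape(height, width)` would produce.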

2) TensorFlow runs object recognition on the captured frame

# Assumes PyQt5; adjust the import to match the PyQt version you use
from PyQt5.QtCore import QThread, pyqtSignal

class objectDetectThread(QThread):
    objectSignal = pyqtSignal(str)

    def __init__(self):
        super().__init__()

    def run(self):
        detection_graph = EventHandlerData.detection_graph
        with detection_graph.as_default():
            with tf.Session(graph=detection_graph) as sess:
                (im_width, im_height) = EventHandlerData.image.size
                image_np = np.array(EventHandlerData.image.getdata()).reshape((im_height, im_width, 3)).astype(np.uint8)
                image_np_expanded = np.expand_dims(image_np, axis=0)
                image_tensor = detection_graph.get_tensor_by_name('image_tensor:0')
                boxes = detection_graph.get_tensor_by_name('detection_boxes:0')
                scores = detection_graph.get_tensor_by_name('detection_scores:0')
                classes = detection_graph.get_tensor_by_name('detection_classes:0')
                num_detections = detection_graph.get_tensor_by_name('num_detections:0')
                (boxes, scores, classes, num_detections) = sess.run(
                    [boxes, scores, classes, num_detections],
                    feed_dict={image_tensor: image_np_expanded})
                objectText = []
                # Show an object in the text box only if its score is above 40%
                for i, c in enumerate(classes[0]):
                    if scores[0][i] > 0.4:
                        name = EventHandlerData.category_index[int(c)]['name']
                        if name not in objectText:
                            objectText.append(name)
                    else:
                        # Stop at the first detection below the threshold
                        break
                self.objectSignal.emit(', '.join(objectText))
                EventHandlerData.detectReady = True
                # This frame is done: set isImageDetect back to True so the
                # callback reads and converts the next remote Agora frame
                EventHandlerData.isImageDetect = True
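The `isImageDetect` flag above lets the video callback skip frames while a detection is still running. The same idea can be sketched with a one-slot queue from the standard library. This is a swapped-in technique, not how the demo itself is written: the producer drops frames while the slot is occupied, and a frame is accepted again as soon as the worker has taken the previous one. `offer_frame`, `fake_detect`, and the integer "frames" are hypothetical stand-ins for the SDK callback, the TensorFlow call, and real images.

```python
import queue
import threading

frame_slot = queue.Queue(maxsize=1)   # holds at most one pending frame
results = []

def offer_frame(frame):
    """Called from the video callback: keep the frame only if the slot is free."""
    try:
        frame_slot.put_nowait(frame)
        return True
    except queue.Full:
        return False          # detector is busy, drop this frame

def fake_detect(frame):
    """Placeholder for the TensorFlow inference step."""
    return 'frame %d processed' % frame

def detection_worker():
    while True:
        frame = frame_slot.get()
        if frame is None:     # sentinel: stop the worker
            break
        results.append(fake_detect(frame))

# Before the worker starts, the slot fills after one frame and the rest are dropped.
accepted = [offer_frame(i) for i in range(3)]

worker = threading.Thread(target=detection_worker)
worker.start()
frame_slot.put(None)          # blocks until the worker has taken frame 0
worker.join()

print(accepted)
print(results)
```

Compared with a shared boolean, the queue makes the handoff between the callback thread and the detection thread explicit and avoids both sides writing the same flag.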

We have uploaded this demo and the Agora Python SDK to GitHub, where you can download and use them directly: Agora Python TensorFlow Demo

Agora Python TensorFlow Demo Build Guide

Download the Agora Python SDK
On Windows, copy the .pyd and .dll files into the root of this project folder; on macOS, copy the .so file into the root of this project folder
Download the TensorFlow models, then copy the object_detection folder into the root of this project folder
Install Protobuf, then run:

 protoc object_detection/protos/*.proto --python_out=.


Download a pretrained model
ssd_mobilenet_v1_coco and ssdlite_mobilenet_v2_coco are recommended because they run relatively fast
Extract the frozen graph by running the following on the command line:

python extractGraph.py --model_file='FILE_NAME_OF_YOUR_MODEL'


Finally, change the model name in callBack.py and the App ID in demo.py, then run the demo.

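Note that this demo is for demonstration purposes only. Going from receiving the remote real-time video frame, through TensorFlow recognition, to displaying the result takes 2 to 4 seconds. Recognition speed varies with network conditions, device performance, and the model used. Interested developers can try swapping in their own model to reduce the recognition latency.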
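To find out which stage dominates those 2 to 4 seconds, each step can be timed separately. The sketch below uses only the standard library; the stage names and the `time.sleep` calls are placeholders for the real conversion and inference code, and `stage_timer` is a hypothetical helper.

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def stage_timer(name):
    """Record the wall-clock duration of the enclosed block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start

with stage_timer('yuv_to_rgb'):
    time.sleep(0.01)      # placeholder for the frame conversion

with stage_timer('inference'):
    time.sleep(0.02)      # placeholder for sess.run(...)

for name, seconds in timings.items():
    print('%s: %.1f ms' % (name, seconds * 1000))
```

One candidate worth timing is the creation of `tf.Session` inside `run`, which the detection thread above repeats for every frame.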

If you run into problems while running the demo, please open an issue on GitHub.

About the Author

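Jin Yeqing (金叶清) has 6 years of R&D and operations experience and has worked in the developer community field for nearly five years. Previously responsible for developer community operations at Huawei; currently a developer evangelist at Agora.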
