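This article was originally published on the Towards Data Science blog. It was translated and shared by InfoQ China with permission from the original author, Basile Roth.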

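Introduction

During quarantine, I spent time on GitHub exploring TensorFlow's large collection of pre-trained models. While doing so, I stumbled upon a repository containing 25 pre-trained object detection models, along with their performance and speed metrics. Since I have some knowledge of computer vision, and given the current context, I thought it could be interesting to use one of these models to build a social distancing application.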

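What is more, last semester I was introduced to OpenCV in my Computer Vision class, and while working on a few small projects I realized how powerful it is. One of those projects performed a bird's-eye view transformation on a picture. A bird's-eye view is essentially a top-down representation of a scene. It is a task that is often performed when building applications for autonomous driving.

Implementation of a bird's-eye view system for a car-mounted camera.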

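This made me realize that applying such a technique to a scene where we want to monitor social distancing could improve the quality of the measurement. This article describes how I used a deep learning model, along with some knowledge of computer vision, to build a robust social distancing detector.

All the code for this article, along with installation instructions, can be found in my GitHub repository.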

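1. Model selection

All the models available in the TensorFlow object detection model zoo have been trained on the COCO (Common Objects in COntext) dataset. This dataset contains 120,000 images with a total of 880,000 labeled objects. The models are trained to detect the 90 different types of objects labeled in this dataset. The complete list of these objects can be found in the data section of the GitHub repo. It includes cars, toothbrushes, bananas and, of course, people.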

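List of available models (partial)

The models differ in speed, and their performance differs accordingly. I ran a few tests to work out how to trade prediction quality against prediction speed. Since the goal of this application was not to perform real-time analysis, I ended up choosing faster_rcnn_inception_v2_coco, which has an mAP (detector performance on a validation set) of 28, which is quite strong, and an execution time of 58 ms.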

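2. People detection

To use such a model to detect people, a few steps are required:

Load the file containing the model into a TensorFlow graph, and define the outputs you want to get from the model.
For each frame, pass the image through the graph to get the desired outputs.
Filter out weak predictions and objects that do not need to be detected.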

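Load and start the model

TensorFlow models are designed to work with graphs. The first step is therefore to load the model into a TensorFlow graph. This graph contains the different operations that will be run to get the desired detections. The next step is to create a session, which is the entity responsible for executing the operations defined in that graph. For a more detailed explanation of graphs and sessions, see the article "What is a TensorFlow Session?". I decided to implement a class that keeps all the data related to the TensorFlow graph together.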


import numpy as np
import tensorflow as tf


class Model:
    """
    Class that contains the model and all its functions
    """
    def __init__(self, model_path):
        """
        Initialization function
        @ model_path : path to the model
        """
        # Declare detection graph
        self.detection_graph = tf.Graph()
        # Load the model into the tensorflow graph
        with self.detection_graph.as_default():
            od_graph_def = tf.compat.v1.GraphDef()
            with tf.io.gfile.GFile(model_path, 'rb') as file:
                serialized_graph = file.read()
                od_graph_def.ParseFromString(serialized_graph)
                tf.import_graph_def(od_graph_def, name='')
        # Create a session from the detection graph
        self.sess = tf.compat.v1.Session(graph=self.detection_graph)

    def predict(self, img):
        """
        Get the prediction results on 1 frame
        @ img : our img vector
        """
        # Expand dimensions since the model expects images to have shape: [1, None, None, 3]
        img_exp = np.expand_dims(img, axis=0)
        # Look up the input tensor and the output tensors we want from the graph
        graph = self.detection_graph
        image_tensor = graph.get_tensor_by_name('image_tensor:0')
        outputs = [graph.get_tensor_by_name('detection_boxes:0'),
                   graph.get_tensor_by_name('detection_scores:0'),
                   graph.get_tensor_by_name('detection_classes:0')]
        # Pass the inputs and outputs to the session to get the results
        (boxes, scores, classes) = self.sess.run(outputs, feed_dict={image_tensor: img_exp})
        return (boxes, scores, classes)

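Pass every frame through the model

For every frame that needs processing, the session is run once by calling its run() function. Some parameters have to be specified when doing so: the type of input the model requires, and the outputs we want to get back from it. In our case, the outputs needed are the following:

The bounding box coordinates of each object.
The confidence of each prediction (between 0 and 1).
The class of the prediction (0 to 90).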

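Filter out weak predictions and non-relevant objects

One of the many classes detected by the model is a person. The class associated with a person is 1.

To exclude both weak predictions (threshold: 0.75) and every class of object other than a person, I used an if statement that combines the two conditions, so that any other object is excluded from further computation.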

if int(classes[i]) == 1 and scores[i] > 0.75:
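
As an illustration, the filtering step could be wrapped in a small helper like the one below. This is only a sketch under stated assumptions: the function name filter_person_detections and the sample detections are hypothetical, and the boxes are assumed to be normalized [ymin, xmin, ymax, xmax] rows that are scaled to pixels using the frame height and width.

```python
def filter_person_detections(boxes, scores, classes, height, width,
                             person_class=1, min_confidence=0.75):
    """Keep confident person detections and convert boxes to pixel coordinates.

    Assumption: each box is a normalized [ymin, xmin, ymax, xmax] row, and
    scores and classes are flat sequences of the same length as boxes.
    """
    kept = []
    for box, score, cls in zip(boxes, scores, classes):
        # Same two conditions as in the article: class 1 (person) and score > 0.75
        if int(cls) == person_class and score > min_confidence:
            ymin, xmin, ymax, xmax = box
            kept.append((int(ymin * height), int(xmin * width),
                         int(ymax * height), int(xmax * width)))
    return kept


# Made-up detections for a 480x640 frame:
# a confident person, a weak person, and a confident non-person object.
boxes = [[0.25, 0.5, 0.75, 0.625], [0.1, 0.1, 0.5, 0.2], [0.5, 0.5, 1.0, 1.0]]
scores = [0.9, 0.4, 0.95]
classes = [1.0, 1.0, 3.0]
people = filter_person_detections(boxes, scores, classes, height=480, width=640)
print(people)  # [(120, 320, 360, 400)]
```

Only the first detection survives: the second is a person but falls below the threshold, and the third is confident but belongs to another class.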

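3. Bird's-eye view transformation

As explained in the introduction, performing a bird's-eye view transformation gives us a top view of a scene. Thankfully, OpenCV has great built-in functions for applying this method to an image, transforming an image taken from a perspective point of view into a top view of that image. I used a tutorial by Adrian Rosebrock to understand how to do this.

The first step is to select 4 points on the original image. These will be the corner points of the ground plane that is going to be transformed. The points must form a rectangle with at least two opposite sides that are parallel. If they do not, proportions will not be preserved when the transformation is applied. I have implemented a script in my repository that uses OpenCV's setMouseCallback() function to get these coordinates. The function that computes the transformation matrix also requires the dimensions of the image, which are obtained from the image.shape property of the image.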

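height, width, _ = image.shape

This returns the height, the width, and the number of color channels, which is not relevant here. Note that the height comes first in image.shape. Let's see how these values are used to compute the transformation matrix: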

import cv2
import numpy as np


def compute_perspective_transform(corner_points, width, height, image):
    """ Compute the transformation matrix
    @ corner_points : 4 corner points selected from the image
    @ width, height : size of the image
    @ image : the image to transform
    return : transformation matrix and the transformed image
    """
    # Create an array out of the 4 corner points
    corner_points_array = np.float32(corner_points)
    # Create an array with the parameters (the dimensions) required to build the matrix.
    # The destination corners are top-left, top-right, bottom-left, bottom-right,
    # so corner_points must be given in that same order.
    img_params = np.float32([[0, 0], [width, 0], [0, height], [width, height]])
    # Compute and return the transformation matrix
    matrix = cv2.getPerspectiveTransform(corner_points_array, img_params)
    img_transformed = cv2.warpPerspective(image, matrix, (width, height))
    return matrix, img_transformed
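
Because cv2.getPerspectiveTransform() maps the i-th source point to the i-th destination point, the 4 selected corners need to follow the same order as img_params: top-left, top-right, bottom-left, bottom-right. The helper below is a minimal sketch of one way to sort clicked points into that order. The name order_corner_points is hypothetical, and the approach assumes a roughly upright quadrilateral whose two top corners have smaller y values than its two bottom corners.

```python
def order_corner_points(points):
    """Return 4 (x, y) points as top-left, top-right, bottom-left, bottom-right.

    Assumption: the quadrilateral is roughly upright, so its two top corners
    have smaller y values (image rows) than its two bottom corners.
    """
    if len(points) != 4:
        raise ValueError("exactly 4 corner points are required")
    # Split the points into the upper pair and the lower pair
    by_height = sorted(points, key=lambda p: p[1])
    # Within each pair, the left corner has the smaller x value
    top = sorted(by_height[:2], key=lambda p: p[0])
    bottom = sorted(by_height[2:], key=lambda p: p[0])
    return top + bottom


# Made-up points, clicked in arbitrary order
clicked = [(620, 400), (150, 90), (30, 410), (500, 100)]
ordered = order_corner_points(clicked)
print(ordered)  # [(150, 90), (500, 100), (30, 410), (620, 400)]
```

For strongly rotated or skewed selections this simple sort can pick the wrong pairs, in which case the points should be ordered by hand.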

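Note that I chose to also return the matrix, because it is used in the next step to compute the new coordinates of each person detected. The result is a set of "GPS" coordinates for every person in the frame. Using these is far more accurate than using the original ground points, because in a perspective view the same real-world distance appears different depending on how far the people are from the camera. Compared with working on points in the original frame, this can improve the accuracy of the social distance measurement considerably.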

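For each person detected, the 2 points needed to build a bounding box are returned: the top-left corner and the bottom-right corner of the box. From these, I computed the centroid of the box by taking the midpoint between them. Using this result, I then calculated the coordinates of the point located at the bottom center of the box. In my opinion, this point, which I refer to as the ground point, is the best representation of a person's position in the image.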
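
The ground point computation described above can be sketched in a few lines. This is an illustrative version rather than the code from the repository: the name compute_ground_point is hypothetical, and the box is assumed to be given as (ymin, xmin, ymax, xmax) in pixel coordinates.

```python
def compute_ground_point(box):
    """Return the centroid and the bottom-center (ground point) of a bounding box.

    Assumption: box is (ymin, xmin, ymax, xmax) in pixel coordinates, so
    (xmin, ymin) is the top-left corner and (xmax, ymax) the bottom-right one.
    """
    ymin, xmin, ymax, xmax = box
    # The centroid is the midpoint between the two corners
    x_center = (xmin + xmax) // 2
    y_center = (ymin + ymax) // 2
    # The ground point shares the centroid's x but sits on the bottom edge
    return (x_center, y_center), (x_center, ymax)


centroid, ground_point = compute_ground_point((120, 320, 360, 400))
print(centroid)      # (360, 240)
print(ground_point)  # (360, 360)
```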

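Then, I used the transformation matrix to compute the transformed coordinates of each detected ground point. This is done on every frame, after the people in it have been detected, using the cv2.perspectiveTransform() function. This is how I implemented this task: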

import cv2
import numpy as np


def compute_point_perspective_transformation(matrix, list_downoids):
    """ Apply the perspective transformation to every ground point detected on the main frame.
    @ matrix : the 3x3 matrix
    @ list_downoids : list that contains the points to transform
    return : list containing all the new points
    """
    # Nothing to transform when no ground point was detected on this frame
    if len(list_downoids) == 0:
        return []
    # cv2.perspectiveTransform expects a float array of shape (N, 1, 2)
    list_points_to_detect = np.float32(list_downoids).reshape(-1, 1, 2)
    # Compute the new coordinates of our points
    transformed_points = cv2.perspectiveTransform(list_points_to_detect, matrix)
    # Loop over the points and add them to the list that will be returned
    transformed_points_list = list()
    for i in range(0, transformed_points.shape[0]):
        transformed_points_list.append([transformed_points[i][0][0], transformed_points[i][0][1]])
    return transformed_points_list
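
To make the underlying arithmetic explicit, here is a pure-Python sketch of what applying a 3x3 perspective matrix to a single point amounts to: the point is multiplied by the matrix in homogeneous coordinates, then divided by the third component. The function name transform_point and the example matrix are made up for illustration and are not taken from the repository.

```python
def transform_point(matrix, point):
    """Apply a 3x3 perspective (homography) matrix to one (x, y) point."""
    x, y = point
    # Multiply the homogeneous point (x, y, 1) by the matrix
    xh = matrix[0][0] * x + matrix[0][1] * y + matrix[0][2]
    yh = matrix[1][0] * x + matrix[1][1] * y + matrix[1][2]
    w = matrix[2][0] * x + matrix[2][1] * y + matrix[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this matrix")
    # Divide by w to come back to 2D image coordinates
    return (xh / w, yh / w)


# Made-up matrix: the non-zero entry in the last row is what makes the
# scaling depend on the position of the point in the image
example_matrix = [[1.0, 0.5, 0.0],
                  [0.0, 2.0, 0.0],
                  [0.0, 0.005, 1.0]]
result = transform_point(example_matrix, (100, 200))
print(result)  # approximately (100.0, 200.0)
```

With an identity matrix the point is returned unchanged, which is a quick sanity check for this kind of helper.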

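4. Measuring social distance

After calling compute_point_perspective_transformation() on each frame, a list containing all the newly transformed points is returned. From this list, I had to compute the distance between each pair of points. I used the combinations() function from the itertools library, which yields every possible combination in a list without keeping duplicates. This is very well explained in a Stack Overflow question. The rest is simple math: the distance between two points is easy to compute in Python using the math.sqrt() function. The threshold chosen was 120 pixels, because it is approximately equal to 2 feet in our scene.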

import itertools
import math

# Check if 2 or more people have been detected (otherwise there is no distance to measure)
if len(transformed_downoids) >= 2:
    # Build the index pairs in the same order as the point pairs iterated below
    list_indexes = list(itertools.combinations(range(len(transformed_downoids)), 2))
    # Iterate over every possible pair of points
    for i, pair in enumerate(itertools.combinations(transformed_downoids, r=2)):
        # Check if the distance between the two points is less than the minimum distance chosen
        if math.sqrt((pair[0][0] - pair[1][0])**2 + (pair[0][1] - pair[1][1])**2) < int(distance_minimum):
            # Change the color of the points that are too close to each other to red
            change_color_topview(pair)
            # Get the equivalent indexes of these points in the original frame and change the color to red
            index_pt1 = list_indexes[i][0]
            index_pt2 = list_indexes[i][1]
            change_color_originalframe(index_pt1, index_pt2)
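Once two points are found to be too close to one another, the color of the circle marking each point changes from green to red, and the bounding boxes on the original frame change color in the same way.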
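
For readers who want to try the pairwise check in isolation, the sketch below reproduces it as a self-contained function that returns the index pairs instead of changing colors. The name find_close_pairs and the sample points are hypothetical, and math.hypot() is used here in place of the explicit math.sqrt() expression, which computes the same Euclidean distance.

```python
import itertools
import math


def find_close_pairs(points, distance_minimum=120):
    """Return the index pairs (i, j), with i < j, of points closer than distance_minimum."""
    close_pairs = []
    # combinations() yields each unordered pair exactly once
    for (i, p1), (j, p2) in itertools.combinations(enumerate(points), 2):
        if math.hypot(p1[0] - p2[0], p1[1] - p2[1]) < distance_minimum:
            close_pairs.append((i, j))
    return close_pairs


# Made-up transformed ground points, in pixels
points = [(0, 0), (60, 80), (400, 400), (460, 400), (1000, 1000)]
pairs = find_close_pairs(points)
too_close = sorted({index for pair in pairs for index in pair})
print(pairs)      # [(0, 1), (2, 3)]
print(too_close)  # [0, 1, 2, 3]
```

The indexes in too_close are the ones whose circles and bounding boxes would be drawn in red. The last point is far from all the others and stays green.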

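5. Results

Let me summarize how this project works:

First, get the 4 corner points of the ground plane, apply the perspective transformation to obtain a bird's-eye view of this plane, and save the transformation matrix.
Get the bounding box of each person detected in the original frame.
Compute the lowest point of this box, which is located midway between its two bottom corners.
Apply the transformation matrix to each of these points to get the real "GPS" coordinates of each person.
Use itertools.combinations() to measure the distance from every point to all the other points in the frame.
If a social distancing violation is detected, change the color of the bounding box to red.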

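I used a video from the PETS2009 dataset, which consists of multi-sensor sequences containing different crowd activities. It was originally built for tasks such as people counting and crowd density estimation. I decided to use the video from the first angle because it is the widest one, with the best view of the scene. This video presents the results obtained: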

https://v.qq.com/x/page/k3101orq51k.html

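6. Conclusion and improvements

Nowadays, social distancing and other basic sanitary measures are very important for slowing the spread of COVID-19 as much as possible. However, this project is only a proof of concept. Because of ethical and privacy concerns, it was not made to monitor social distancing in public or private areas.

I am well aware that this project is not perfect, so here are a few ideas on how the application could be improved:

Use a faster model in order to perform real-time social distancing analysis.
Use a model that is more robust to occlusions.
Automatic calibration is a well-known problem in computer vision, and it could greatly improve the bird's-eye view transformation across different scenes.

The complete code for this article is available on GitHub.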

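About the author:

Basile Roth is a master's student in Montreal, Quebec, Canada. His research interests include machine learning, computer vision, and natural language processing.

Original article: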

https://towardsdatascience.com/a-social-distancing-detector-using-a-tensorflow-object-detection-model-python-and-opencv-4450a431238
