Automatic multi-class object tracking with OpenCV


Image analysis and object detection have always interested me; my first exposure was writing a Matlab program to automatically detect fluorescent microarray spots and extract their pixel intensity values. A while back I took the Udemy course Deep Learning and Computer Vision A-Z by Hadelin de Ponteves and Kirill Eremenko. One of the projects was object detection using the deep learning model Single Shot MultiBox Detector (SSD). I wanted to expand on that project and come up with a straightforward way to track the movement of detected objects across multiple classes in video feeds. This is the basis for applications like traffic monitoring, velocity estimation, and security monitoring.

I came across some example tutorials on object tracking in the Pyimagesearch blog. However, those examples either track a single object class or do not record the object class at all. So I modified the code to support multi-class object tracking based on the output of the SSD detections. In this post, I will share the custom functions I came up with to integrate multi-class, multi-label tracking using OpenCV, as well as posture estimation.

June 10th, 2020 - 10 minute read
OpenCV, Object Detection, PyTorch

Disentangle the matrix

The SSD model achieves a significant improvement in detection speed by eliminating bounding box proposals and pixel resampling, while maintaining high accuracy[1]. I won't go into the details of how the SSD model works or how it is trained; if you're interested, check out the in-depth article written by Aman Dalmia. What is most relevant to this post is the output matrix of SSD. By default, when the model runs inference on an image, it returns 8,732 bounding boxes per class to accommodate a range of scales and aspect ratios. It then tidies up the predictions using non-maximum suppression, removing overlapping boxes that exceed the IoU threshold of 0.45 and keeping only the top 200 boxes in total. This means the immediate output array for an image has shape (N, i, j, b), where N is the batch axis (1 for one image/frame), i is the class, j is the detection index, and b is the axis containing the confidence score and bounding box coordinates.
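
In other words, indexing into that array looks roughly like the following sketch (the 21 classes assume a VOC-trained model, which is an assumption for illustration):

```python
import torch

# post-NMS SSD output: (batch, n_classes, top_k, 5), last axis = [score, x1, y1, x2, y2]
detections = torch.zeros(1, 21, 200, 5)   # dummy tensor standing in for a real forward pass

i, j = 15, 0                       # class index, detection index
score = detections[0, i, j, 0]     # confidence score for this detection
box = detections[0, i, j, 1:]      # bounding box corners in relative coordinates
```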

The approach used to track the objects can be broken down into a few steps.

  1. Within a frame, calculate the centroid coordinates for each bounding box, assign the proper ID and class label.
  2. Update the dictionary list of the centroids detected.
  3. Check the updated list to see if current centroids match centroids in the previous frame, corresponding to the same object. If yes, draw out the tracking.

Keeping that in mind, the following function returns the bounding box matrix "detections". It takes one frame, adds the first axis for batching, and runs a single forward pass through the pretrained network. The scale variable is needed later to transform the array back to the original frame size.
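
A minimal sketch of such a function, assuming an ssd.pytorch-style `net` and `transform` like the ones used in the course (names are illustrative, not the exact code from my repo):

```python
import torch

def detect(frame, net, transform):
    # transform is assumed to resize/normalize the frame to the network input (e.g. 300x300)
    height, width = frame.shape[:2]
    x = torch.from_numpy(transform(frame)[0]).permute(2, 0, 1)
    x = x.unsqueeze(0)                      # add the batch axis -> (1, 3, 300, 300)
    with torch.no_grad():
        detections = net(x).data            # (1, n_classes, top_k, 5)
    # scale maps the relative box coordinates back to the original frame size
    scale = torch.Tensor([width, height, width, height])
    return detections, scale
```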

Next, the array is looped over by class, and boxes with a confidence score of at least 0.6 are selected and drawn with cv2.rectangle(). Note that many of OpenCV's drawing utilities write directly onto the pixels of the image, so the original image is altered in place.
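
A rough sketch of that per-class loop, assuming `labelmap` maps the SSD class index (minus the background class) to a name:

```python
import cv2

def draw_boxes(frame, detections, scale, labelmap, conf_thresh=0.6):
    for i in range(1, detections.size(1)):        # class 0 is the background
        for j in range(detections.size(2)):
            if detections[0, i, j, 0] < conf_thresh:
                break                             # detections are sorted by score
            x1, y1, x2, y2 = (detections[0, i, j, 1:] * scale).int().tolist()
            # cv2.rectangle / cv2.putText draw directly onto `frame`, modifying it in place
            cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
            cv2.putText(frame, labelmap[i - 1], (x1, y1 - 5),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    return frame
```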

The bbox coordinates are saved in the object instance ct = CentroidTracker(), a class provided by Adrian Rosebrock @ Pyimagesearch that calculates the centroids and assigns ID labels[2]. The underlying assumption is that the distance between a given object's centroids in frames t and t-1 will be smaller than its distances to the centroids of other objects. Having this dictionary allows us to loop over the labeled objects within each class and draw the centroids as shown below.
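
A hedged sketch of how those per-class dictionaries might be used, assuming one CentroidTracker per class so IDs don't collide across classes (the import path follows the Pyimagesearch tutorial; this is not the exact code from my repo):

```python
import cv2
from pyimagesearch.centroidtracker import CentroidTracker

trackers = {}   # one tracker per class name

def draw_centroids(frame, rects_by_class):
    # rects_by_class: {class_name: [(x1, y1, x2, y2), ...]} from the SSD detections
    for cls, rects in rects_by_class.items():
        ct = trackers.setdefault(cls, CentroidTracker())
        objects = ct.update(rects)                 # {objectID: (cx, cy)}
        for object_id, centroid in objects.items():
            cx, cy = int(centroid[0]), int(centroid[1])
            cv2.circle(frame, (cx, cy), 4, (255, 0, 0), -1)
            cv2.putText(frame, "{} {}".format(cls, object_id), (cx - 10, cy - 10),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 0, 0), 2)
    return frame
```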

Follow the centroid

Centroid tracking is a convenient way to follow movement from frame to frame, like the path of a bicyclist. I modified the detect function above to allow for movement tracking by identifying the same label ID of the same class in the previous frame, and used the deque object from the collections package to update the list of centroids for each frame. Here's the complete function that loops through each class, updates the list of centroids, and then draws the movement as lines color-coded by class.

This function takes the additional arguments `f_idx` and `track_pts` because it needs to keep track of the frame sequence and requires a pre-initialized dictionary object. Here, the dictionary `track_pts` can be given a maximum length to cap the length of the drawn line. The critical part is the loop towards the end, which checks whether the current centroid (y) shares a matching label ID and class with a centroid from the previous frame (y_t). An output sample clip is shown below.
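
A minimal sketch of that matching loop, with `track_pts` keyed by (class, ID) and each entry storing a (frame index, centroid) pair; the color palette and names are illustrative assumptions:

```python
from collections import deque

import cv2

CLASS_COLORS = {"person": (0, 255, 0), "car": (0, 0, 255)}   # illustrative palette

def draw_tracks(frame, objects_by_class, track_pts, f_idx, max_len=30):
    # objects_by_class: {class_name: {objectID: (cx, cy)}} for the current frame f_idx
    for cls, objects in objects_by_class.items():
        for object_id, centroid in objects.items():
            key = (cls, object_id)
            pts = track_pts.setdefault(key, deque(maxlen=max_len))
            pts.appendleft((f_idx, (int(centroid[0]), int(centroid[1]))))
            # redraw the stored path segment by segment, only connecting consecutive frames
            for (t0, y), (t1, y_t) in zip(list(pts), list(pts)[1:]):
                if t0 - t1 == 1:
                    cv2.line(frame, y, y_t, CLASS_COLORS.get(cls, (255, 255, 255)), 2)
    return frame
```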

There are some misclassified objects in between frames, and only the foreground runners were detected. As a result, some of the movement tracking was off due to the mislabeling. Like many object detection models, SSD does not perform as well when substantial background noise blends in with the foreground scenery. Next, I tried out the algorithm on a second clip that captures street traffic at night.

This dimly lit video yielded worse detection performance, mislabeling some cars as people. Whenever other objects were positioned directly behind a pedestrian, the model struggled to make the right classification, and at times no classification was made for the pedestrian at all. It is well known that object detection models suffer a significant drop in performance when inferring on dark images. One way to mitigate this is to preprocess the video feed to recover detail, using pretrained models like See in the Dark (SID) to transform low-light images into well-lit RGB outputs[3].

For continuous tracking, we can try to reduce the "noise" by increasing the default value of maxDisappeared=50 in the CentroidTracker class, where maxDisappeared is the maximum number of consecutive frames a given object is allowed to be marked as "disappeared" before its tracking stops. We can also adjust `y` and `y_t` to skip a frame (e.g. i+2, frame_idx - 2) so the tracking is more robust to the occasional misclassification in between frames.
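
For example, loosening the tracker could look like this (the import path again follows the Pyimagesearch tutorial; 75 is just an illustrative value):

```python
from pyimagesearch.centroidtracker import CentroidTracker

# keep an object's ID alive for up to 75 consecutive missed frames instead of the default 50
ct = CentroidTracker(maxDisappeared=75)
```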

2D posture detection

There has been a wealth of research and open-source software for detecting human poses, with the most promising library being OpenPose, released by the Perceptual Computing Lab @ Carnegie Mellon University, which won the COCO keypoints challenge in 2016. It uses the first 10 layers of VGGNet as input to downstream CNNs that predict confidence maps of body keypoints (shoulders, hands, legs, etc.) and part affinity vectors between them, which are used to estimate the 2D keypoint coordinates of the human subjects[4].

I utilized the body estimator model from the pytorch-openpose repo. For rendering the skeleton, the model returns a total of 18 keypoints from an array of candidate points. After extracting those keypoints, I applied trigonometric functions to pick out the runners in the video frame based on the angles of their arms and legs.
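
A hedged sketch of the kind of trigonometric check involved: compute the angle at a joint from three keypoints, then flag a subject whose knee and elbow angles are tight enough to suggest running. The threshold values below are illustrative assumptions, not the tuned ones from my code.

```python
import numpy as np

def joint_angle(a, b, c):
    # angle in degrees at joint b, formed by keypoints a-b-c (each an (x, y) pair)
    v1, v2 = np.asarray(a) - np.asarray(b), np.asarray(c) - np.asarray(b)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-6)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def looks_like_runner(hip, knee, ankle, shoulder, elbow, wrist,
                      knee_thresh=120.0, elbow_thresh=110.0):
    # a bent knee plus a bent, swinging arm suggests running rather than walking
    return (joint_angle(hip, knee, ankle) < knee_thresh and
            joint_angle(shoulder, elbow, wrist) < elbow_thresh)
```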

The resulting performance was sensitive to "loud" backgrounds and to overlapping or closely spaced subjects in the frame. For some odd reason, when the pedestrians walked past the trash cans they were misclassified as "Runner" for a few frames. I managed to fine-tune the angle thresholds to reduce the false-positive "Runner" detections. Running a single frame without a GPU took about 59 seconds; with a Colab Tesla T4 instance it was cut down to 3 seconds.

With the widespread use of video surveillance comes high-stakes privacy concerns for certain applications. For detection tasks that need to preserve privacy, you can substitute the output video with a base frame (a frame showing only the plain background) and render just the bounding boxes and trackings. This way only the relevant outputs are kept, while the original raw data can be deleted after model inference. I have written another script to accommodate that, and an example output can be downloaded here.
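
A minimal sketch of that idea: draw the detections onto a copy of the static background instead of the live frame (function and argument names are illustrative).

```python
import cv2

def render_on_base(base_frame, boxes, labels):
    # draw only the detections onto a static background so the raw footage
    # never needs to be kept after inference
    out = base_frame.copy()
    for (x1, y1, x2, y2), label in zip(boxes, labels):
        cv2.rectangle(out, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(out, label, (x1, y1 - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    return out
```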

It is fairly straightforward to implement AI for real-time object detection and tracking. The biggest challenge for these algorithms seems to be sensitivity to crowding, where the classification of one object/subject is influenced by nearby objects and the background, leading to mislabeling. This is probably because many object detection models were trained on benchmark datasets such as COCO and ImageNet, which consist mostly of cleaner images than real-world samples. Another constraint is that these deep learning models need to be deployed on a GPU to achieve reasonable latency. Example solutions include cloud platforms like AWS Kinesis and SageMaker, or embedding the trained model on an edge device that runs it locally.

You can view the complete code for the material presented in this post in my GitHub repo. And feel free to leave comments here.



REFERENCES

[1] Liu, W., Anguelov, D., Erhan, D., et al. SSD: Single Shot MultiBox Detector, arXiv preprint arXiv:1512.02325, 2015.

[2] Rosebrock, A., Simple object tracking with OpenCV, https://www.pyimagesearch.com/2018/07/23/simple-object-tracking-with-opencv/

[3] Z, I., Safer YOLO, in the Dark (I), https://medium.com/@turboergouzhi/safer-yolo-in-the-dark-i-98ddaa7db3ad

[4] Cao, Z., Hidalgo, G., Simon, T., Wei, S.E., Sheikh, Y., OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields, arXiv preprint arXiv:1812.08008.