Automatic multi-class object tracking with OpenCV


Image analysis and object detection have always interested me; my first exposure was writing a Matlab program to automatically detect fluorescent microarray spots and extract their pixel intensity values. A while back I took the Udemy course Deep Learning and Computer Vision A-Z by Hadelin de Ponteves and Kirill Eremenko. One of the projects was object detection using the deep learning model Single Shot MultiBox Detector (SSD). I wanted to expand on that project and come up with a straightforward way to track the movement of detected objects across multiple classes in video feeds. This is the basis for applications like traffic monitoring, velocity estimation, and security monitoring.

I came across some example tutorials on object tracking in the Pyimagesearch blog. However, those examples either track a single object class or do not record the object class at all. So I modified the code to support multi-class object tracking based on the output of the SSD detections. In this post, I will share the custom functions I came up with to integrate multi-class, multi-label tracking using OpenCV, as well as posture estimation.

June 10th, 2020 - 10 minute read
OpenCV, Object Detection, PyTorch

Disentangle the matrix

The SSD model achieves a significant improvement in detection speed by eliminating bounding box proposals and pixel resampling, while maintaining high accuracy[1]. I won't go into the details of how the SSD model works or how it is trained; if you're interested, check out the in-depth article written by Aman Dalmia. What is most relevant to this post is the output matrix of SSD. By default, when the model runs inference on an image, it returns 8,732 bounding boxes per class to accommodate a range of scales and aspect ratios. It then tidies up the predictions using non-maximum suppression, removing overlapping boxes that exceed the IoU threshold of 0.45 and keeping only the top 200 boxes in total. This means the immediate output array for an image has shape (N, i, j, b), where N is the batch axis (1 for one image/frame), i is the class, j is the detection index, and b is the axis containing the confidence score and bounding box coordinates.
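
In other words, indexing into that array looks roughly like the following sketch (the 21 classes assume a VOC-trained model, which is an assumption for illustration):

```python
import torch

# post-NMS SSD output: (batch, n_classes, top_k, 5), last axis = [score, x1, y1, x2, y2]
detections = torch.zeros(1, 21, 200, 5)   # dummy tensor standing in for a real forward pass

i, j = 15, 0                       # class index, detection index
score = detections[0, i, j, 0]     # confidence score for this detection
box = detections[0, i, j, 1:]      # bounding box corners in relative coordinates
```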

The approach used to track the objects can be broken down into a few steps.

  1. Within a frame, calculate the centroid coordinates for each bounding box, assign the proper ID and class label.
  2. Update the dictionary list of the centroids detected.
  3. Check the updated list to see if current centroids match centroids in the previous frame, corresponding to the same object. If yes, draw out the tracking.

Keeping that in mind, the following function returns the bounding box matrix "detections". It takes one frame, adds the first axis for batching, and runs a single forward pass through the pretrained network. The scale variable is needed later to transform the array back to the original frame size.
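
A minimal sketch of such a function, assuming an ssd.pytorch-style `net` and `transform` like the ones used in the course (names are illustrative, not the exact code from my repo):

```python
import torch

def detect(frame, net, transform):
    # transform is assumed to resize/normalize the frame to the network input (e.g. 300x300)
    height, width = frame.shape[:2]
    x = torch.from_numpy(transform(frame)[0]).permute(2, 0, 1)
    x = x.unsqueeze(0)                      # add the batch axis -> (1, 3, 300, 300)
    with torch.no_grad():
        detections = net(x).data            # (1, n_classes, top_k, 5)
    # scale maps the relative box coordinates back to the original frame size
    scale = torch.Tensor([width, height, width, height])
    return detections, scale
```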

Next, the array is looped over by class, and boxes with a confidence score of at least 0.6 are selected and drawn with cv2.rectangle(). Note that many of OpenCV's drawing utilities write directly onto the pixels of the image, so the original image is altered in place.
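
A rough sketch of that per-class loop, assuming `labelmap` maps the SSD class index (minus the background class) to a name:

```python
import cv2

def draw_boxes(frame, detections, scale, labelmap, conf_thresh=0.6):
    for i in range(1, detections.size(1)):        # class 0 is the background
        for j in range(detections.size(2)):
            if detections[0, i, j, 0] < conf_thresh:
                break                             # detections are sorted by score
            x1, y1, x2, y2 = (detections[0, i, j, 1:] * scale).int().tolist()
            # cv2.rectangle / cv2.putText draw directly onto `frame`, modifying it in place
            cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
            cv2.putText(frame, labelmap[i - 1], (x1, y1 - 5),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    return frame
```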

The bbox coordinates are saved in the object instance ct = CentroidTracker(), a class provided by Adrian Rosebrock @ Pyimagesearch that calculates the centroids and assigns ID labels[2]. The underlying assumption is that the distance between a given object's centroids in frames t and t-1 will be smaller than its distances to the centroids of other objects. Having this dictionary allows us to loop over the labeled objects within each class and draw the centroids as shown below.
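
A hedged sketch of how those per-class dictionaries might be used, assuming one CentroidTracker per class so IDs don't collide across classes (the import path follows the Pyimagesearch tutorial; this is not the exact code from my repo):

```python
import cv2
from pyimagesearch.centroidtracker import CentroidTracker

trackers = {}   # one tracker per class name

def draw_centroids(frame, rects_by_class):
    # rects_by_class: {class_name: [(x1, y1, x2, y2), ...]} from the SSD detections
    for cls, rects in rects_by_class.items():
        ct = trackers.setdefault(cls, CentroidTracker())
        objects = ct.update(rects)                 # {objectID: (cx, cy)}
        for object_id, centroid in objects.items():
            cx, cy = int(centroid[0]), int(centroid[1])
            cv2.circle(frame, (cx, cy), 4, (255, 0, 0), -1)
            cv2.putText(frame, "{} {}".format(cls, object_id), (cx - 10, cy - 10),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 0, 0), 2)
    return frame
```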

Follow the centroid

Centroid tracking is a convenient way to follow movement from frame to frame, like the path of a bicyclist. I modified the detect function above to allow for movement tracking by identifying the same label ID of the same class in the previous frame, and used the deque object from the collections package to update the list of centroids for each frame. Here's the complete function that loops through each class, updates the list of centroids, and then draws the movement as lines color-coded by class.

This function takes the additional arguments `f_idx` and `track_pts` because it needs to keep track of the frame sequence and requires a pre-initialized dictionary object. Here, the dictionary `track_pts` can be given a maximum length to cap the length of the drawn line. The critical part is the loop towards the end, which checks whether the current centroid (y) shares a matching label ID and class with a centroid from the previous frame (y_t). An output sample clip is shown below.
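
A minimal sketch of that matching loop, with `track_pts` keyed by (class, ID) and each entry storing a (frame index, centroid) pair; the color palette and names are illustrative assumptions:

```python
from collections import deque

import cv2

CLASS_COLORS = {"person": (0, 255, 0), "car": (0, 0, 255)}   # illustrative palette

def draw_tracks(frame, objects_by_class, track_pts, f_idx, max_len=30):
    # objects_by_class: {class_name: {objectID: (cx, cy)}} for the current frame f_idx
    for cls, objects in objects_by_class.items():
        for object_id, centroid in objects.items():
            key = (cls, object_id)
            pts = track_pts.setdefault(key, deque(maxlen=max_len))
            pts.appendleft((f_idx, (int(centroid[0]), int(centroid[1]))))
            # redraw the stored path segment by segment, only connecting consecutive frames
            for (t0, y), (t1, y_t) in zip(list(pts), list(pts)[1:]):
                if t0 - t1 == 1:
                    cv2.line(frame, y, y_t, CLASS_COLORS.get(cls, (255, 255, 255)), 2)
    return frame
```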

There are some misclassified objects in between frames, and only the foreground runners were detected. As a result, some of the movement tracking was off due to the mislabeling. Like many object detection models, SSD does not perform as well when substantial background noise blends in with the foreground scenery. Next, I tried out the algorithm on a second clip that captures street traffic at night.

This dimly lit video yielded worse detection performance, mislabeling some cars as people. Whenever other objects were positioned directly behind a pedestrian, the model struggled to make the right classification, and at times no classification was made for the pedestrian at all. It is well known that object detection models suffer a significant drop in performance when inferring on dark images. One way to mitigate this is to preprocess the video feed to recover detail, using pretrained models like See in the Dark (SID) to transform low-light images into well-lit RGB outputs[3].

For continuous tracking, we can try to reduce the "noise" by increasing the default value of maxDisappeared=50 in the CentroidTracker class, where maxDisappeared is the maximum number of consecutive frames a given object is allowed to be marked as "disappeared" before its tracking stops. We can also adjust `y` and `y_t` to skip a frame (e.g. i+2, frame_idx - 2) so the tracking is more robust to the occasional misclassification in between frames.
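
For example, loosening the tracker could look like this (the import path again follows the Pyimagesearch tutorial; 75 is just an illustrative value):

```python
from pyimagesearch.centroidtracker import CentroidTracker

# keep an object's ID alive for up to 75 consecutive missed frames instead of the default 50
ct = CentroidTracker(maxDisappeared=75)
```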

2D posture detection

There has been a wealth of research and open-source software for detecting human poses, with the most promising library being OpenPose, released by the Perceptual Computing Lab @ Carnegie Mellon University, which won the COCO keypoints challenge in 2016. It uses the first 10 layers of VGGNet as input to downstream CNNs that predict confidence maps of body keypoints (shoulders, hands, legs, etc.) and part affinity vectors between them, which are used to estimate the 2D keypoint coordinates of the human subjects[4].

I utilized the body estimator model from the pytorch-openpose repo. For rendering the skeleton, the model returns a total of 18 keypoints from an array of candidate points. After extracting those keypoints, I applied trigonometric functions to pick out the runners in the video frame based on the angles of their arms and legs.
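
A hedged sketch of the kind of trigonometric check involved: compute the angle at a joint from three keypoints, then flag a subject whose knee and elbow angles are tight enough to suggest running. The threshold values below are illustrative assumptions, not the tuned ones from my code.

```python
import numpy as np

def joint_angle(a, b, c):
    # angle in degrees at joint b, formed by keypoints a-b-c (each an (x, y) pair)
    v1, v2 = np.asarray(a) - np.asarray(b), np.asarray(c) - np.asarray(b)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-6)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def looks_like_runner(hip, knee, ankle, shoulder, elbow, wrist,
                      knee_thresh=120.0, elbow_thresh=110.0):
    # a bent knee plus a bent, swinging arm suggests running rather than walking
    return (joint_angle(hip, knee, ankle) < knee_thresh and
            joint_angle(shoulder, elbow, wrist) < elbow_thresh)
```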

The resulting performance was sensitive to "loud" backgrounds and to overlapping or closely spaced subjects in the frame. For some odd reason, when the pedestrians walked past the trash cans they were misclassified as "Runner" for a few frames. I managed to fine-tune the angle thresholds to reduce the false-positive "Runner" detections. Running a single frame without a GPU took about 59 seconds; with a Colab Tesla T4 instance it was cut down to 3 seconds.

With the widespread use of video surveillance comes high-stakes privacy concerns for certain applications. For detection tasks that need to preserve privacy, you can substitute the output video with a base frame (a frame showing only the plain background) and render just the bounding boxes and trackings. This way only the relevant outputs are kept, while the original raw data can be deleted after model inference. I have written another script to accommodate that, and an example output can be downloaded here.
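
A minimal sketch of that idea: draw the detections onto a copy of the static background instead of the live frame (function and argument names are illustrative).

```python
import cv2

def render_on_base(base_frame, boxes, labels):
    # draw only the detections onto a static background so the raw footage
    # never needs to be kept after inference
    out = base_frame.copy()
    for (x1, y1, x2, y2), label in zip(boxes, labels):
        cv2.rectangle(out, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(out, label, (x1, y1 - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    return out
```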

It is fairly straightforward to implement AI for real-time object detection and tracking. The biggest challenge for these algorithms seems to be sensitivity to crowding, where the classification of one object/subject is influenced by nearby objects and the background, leading to mislabeling. This is probably because many object detection models were trained on benchmark datasets such as COCO and ImageNet, which consist mostly of cleaner images than real-world samples. Another constraint is that these deep learning models need to be deployed on a GPU to achieve reasonable latency. Example solutions include cloud platforms like AWS Kinesis and SageMaker, or embedding the trained model on an edge device that runs it locally.

You can view the complete code for the material presented in this post in my GitHub repo. And feel free to leave comments here.



REFERENCES

[1] Liu, W., Anguelov, D., Erhan, D., et al. SSD: Single Shot MultiBox Detector, arXiv preprint arXiv:1512.02325, 2015.

[2] Rosebrock, A., Simple object tracking with OpenCV, https://www.pyimagesearch.com/2018/07/23/simple-object-tracking-with-opencv/

[3] Z, I., Safer YOLO, in the Dark (I), https://medium.com/@turboergouzhi/safer-yolo-in-the-dark-i-98ddaa7db3ad

[4] Cao, Z., Hidalgo, G., Simon, T., Wei, S.E., Sheikh, Y., OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields, arXiv preprint arXiv:1812.08008.