# Comparing Object Detection Algorithms, Demystified

(NOTE: This article is an attempt to explain detection metrics to beginners. For a more detailed and advanced article, check out Jonathan Hui’s work here.)

Comparing object detection algorithms is a lot like comparing mattresses. At the time of writing, there are no fewer than five different metrics for comparing the performance of an object detection algorithm:

- IOU (Intersection-over-Union)
- mAP (AP)
- AP-50
- AP-75
- AP[.5:.95]

In this article, we’ll tackle these metrics one by one and help you choose a good object detection algorithm. For the purposes of this article, we’ll compare a few common detectors:

- YOLOv2
- YOLOv3
- RetinaNet
- SSD
- R-FCN

# IOU (Intersection-over-Union)

Intersection over Union is by far the simplest and easiest to understand of all the metrics. It measures the overlap between the **predicted bounding box** (the one generated by the detector) and the **ground truth** (the “correct” or “most accurate” box): the area where the two boxes intersect, divided by the area of their union. A diagram explains it best:

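The IOU is a good base metric for how “accurate” a detector is. The higher the IOU, the closer the predicted box is to the “correct” ground truth. An IOU of 1 means the two boxes match exactly, and an IOU of 0 means they don’t overlap at all.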
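If it helps to see the arithmetic, here is a minimal sketch of an IOU calculation. It assumes boxes are given as `(x_min, y_min, x_max, y_max)` tuples, which is just one common convention; the function name `iou` and the sample boxes are made up for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes.

    Assumed box format: (x_min, y_min, x_max, y_max).
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlapping region (0 if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    # The overlap is counted in both areas, so subtract it once.
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


# Hypothetical boxes, purely for illustration.
predicted = (50, 50, 150, 150)
ground_truth = (100, 100, 200, 200)
print(iou(predicted, ground_truth))
```

The two sample boxes share a 50×50 corner out of a combined area of 17,500, so the IOU comes out to roughly 0.14.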

# mAP (Mean Average Precision)

TL;DR: mAP is a combined measure of how many of a detector’s predictions are correct and how many of the objects in the image it actually found.

For the math behind it, read below.

mAP (sometimes just called AP in papers) is the most common measure of accuracy presented in research papers (it’s used as a metric in the YOLOv3 and RetinaNet papers, to name a few). It’s also one of the most confusing.

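To understand mAP, we must first understand **recall** and **precision**. They are defined by these mathematical formulas:

$$
\text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN}
$$

Here TP, FP, and FN are the numbers of true positives, false positives, and false negatives.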

Basically, **precision** is the number of true positives (correct detections) over the sum of the true positives and the **false positives** (incorrect detections). **Recall** is the number of true positives over the sum of the true positives and the **false negatives** (objects the algorithm failed to detect).

In layman’s terms, precision is how many of your detections are correct. Recall is how good your algorithm is at finding **all** the objects.
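As a quick sanity check, here is a small sketch of both definitions in code. The function names and the counts (8 correct detections, 2 incorrect ones, 4 missed objects) are hypothetical and only there to make the arithmetic concrete.

```python
def precision(true_positives, false_positives):
    """Fraction of detections that are correct."""
    detections = true_positives + false_positives
    return true_positives / detections if detections else 0.0


def recall(true_positives, false_negatives):
    """Fraction of real objects that were detected."""
    objects = true_positives + false_negatives
    return true_positives / objects if objects else 0.0


# Hypothetical counts: 8 correct detections, 2 incorrect, 4 objects missed.
tp, fp, fn = 8, 2, 4
print("precision:", precision(tp, fp))
print("recall:", recall(tp, fn))
```

With these counts, 8 of the 10 detections are correct (precision 0.8), but only 8 of the 12 objects were found (recall of about 0.67).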

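AP is measured as the average of the **maximum precision value** at a set of recall levels. With 11 evenly spaced recall levels, it can be represented mathematically as

$$
AP = \frac{1}{11}\left(AP_r(0) + AP_r(0.1) + \dots + AP_r(1.0)\right)
$$

where $AP_r(x)$ is the maximum precision found at a recall of $x$ or higher, for $x$ running from 0 to 1.0 in steps of 0.1.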

Confused? You’re not alone.

Here’s an example of how to calculate AP:

When we plot the precision and recall values, we get this:

B is the original data. C is the maximum precision value for each recall value. C is what we’re interested in. Using the graph, we get the equation

AP = ((1.0)*5 + (0.57)*4 + (0.5)*2)/11

and an AP of 0.75.

Looking at the dataset, we have a precision value of 1.0 for the five recall levels from 0 to 0.4 (1.0*5). For the four levels from 0.5 to 0.8, we have a max precision value of 0.57 (0.57*4). Lastly, for the two levels 0.9 and 1.0, we have a max precision of 0.5 (0.5*2). Averaging these by dividing by 11, we get an average precision of 0.75.
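The same calculation can be sketched in a few lines of Python. The ranked list below is a hypothetical set of ten detections (sorted by confidence, five real objects) chosen so that it produces the same plateaus of 1.0, 0.57, and 0.5; it is not necessarily the exact data behind the graph. The function names are made up for this example.

```python
def precision_recall_points(is_correct, num_ground_truth):
    """(recall, precision) after each detection, in descending confidence order."""
    points = []
    true_positives = 0
    for rank, correct in enumerate(is_correct, start=1):
        if correct:
            true_positives += 1
        points.append((true_positives / num_ground_truth, true_positives / rank))
    return points


def eleven_point_ap(points):
    """Average the max precision at recall >= 0, 0.1, ..., 1.0."""
    total = 0.0
    for i in range(11):
        level = i / 10
        # Small tolerance guards against floating-point noise in recall values.
        candidates = [p for r, p in points if r >= level - 1e-9]
        total += max(candidates, default=0.0)
    return total / 11


# Hypothetical ranking: True means the detection was correct.
ranked = [True, True, False, False, False, True, True, False, False, True]
points = precision_recall_points(ranked, num_ground_truth=5)
print(round(eleven_point_ap(points), 2))
```

Using the unrounded precision of 4/7 instead of 0.57 nudges the raw result slightly, but it still rounds to 0.75.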

In benchmarks such as the Pascal VOC dataset, mAP is just AP averaged over all the classes.
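In code, that averaging step is about as short as it sounds. This is only a sketch: the class names and per-class AP values are invented for illustration and don’t come from any benchmark.

```python
from statistics import mean


def mean_average_precision(ap_per_class):
    """mAP as the plain average of per-class AP values."""
    return mean(ap_per_class.values())


# Hypothetical per-class AP values, for illustration only.
ap_per_class = {"car": 0.75, "person": 0.60, "dog": 0.45}
print(mean_average_precision(ap_per_class))
```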

# AP50, AP75, AP[.5:.95]

Occasionally, there will be numbers next to AP. They’re pretty simple.

Benchmarks such as Pascal VOC use IOU cutoffs to decide whether a detection counts as “correct”. In the case of AP50 (the cutoff in the example above), any detection with an IOU **over** .5 is considered correct. In the case of AP75, the cutoff is .75 IOU.

AP[.5:.95] is the average of the AP values for IOU cutoffs ranging from .5 IOU to .95 IOU. Sometimes, there may be a value in between (AP[.5:.05:.95]), indicating the **step value**, or the difference between consecutive cutoffs (it’s usually .05).
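To make the cutoffs concrete, here is a rough sketch that scores the same detections at several IOU cutoffs and averages the results. It makes some loud simplifying assumptions: each detection has already been matched to its own ground-truth box (no duplicates), detections are sorted by confidence, and AP is computed with the 11-point method described earlier. Real benchmark toolkits handle matching and interpolation in more detail. All names and IOU values are hypothetical.

```python
def eleven_point_ap(is_correct, num_ground_truth):
    """11-point AP for detections sorted by descending confidence."""
    points = []
    true_positives = 0
    for rank, correct in enumerate(is_correct, start=1):
        true_positives += int(correct)
        points.append((true_positives / num_ground_truth, true_positives / rank))
    total = 0.0
    for i in range(11):
        level = i / 10
        total += max((p for r, p in points if r >= level - 1e-9), default=0.0)
    return total / 11


def ap_at_iou(ious, num_ground_truth, threshold):
    """AP when a detection only counts as correct if its IOU is over the cutoff."""
    return eleven_point_ap([iou > threshold for iou in ious], num_ground_truth)


def ap_over_iou_range(ious, num_ground_truth, start=0.5, stop=0.95, step=0.05):
    """Average AP over cutoffs start, start + step, ..., stop."""
    n_steps = round((stop - start) / step)
    thresholds = [start + k * step for k in range(n_steps + 1)]
    return sum(ap_at_iou(ious, num_ground_truth, t) for t in thresholds) / len(thresholds)


# Hypothetical IOUs of five detections (highest confidence first), five real objects.
ious = [0.92, 0.81, 0.43, 0.77, 0.58]
print("AP50:", ap_at_iou(ious, 5, 0.5))
print("AP75:", ap_at_iou(ious, 5, 0.75))
print("AP[.5:.95]:", ap_over_iou_range(ious, 5))
```

Raising the cutoff can only turn correct detections into incorrect ones, so AP75 never exceeds AP50, and AP[.5:.95] is usually the lowest of the three.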

# Comparison Time!

Now, the moment you’ve been waiting for — comparing algorithms!

Currently, of the detectors compared here, RetinaNet is the most accurate, with an AP-50 of 61.1. However, YOLOv3 is by far the fastest, outpacing RetinaNet, R-FCN, and SSD. It also isn’t far behind on the AP-50, with a score of 57.9.

Our recommendation is to use YOLO if you have any sort of real-time requirement or limited resources. If accuracy matters most, try RetinaNet.