Exploring the COCO Dataset

This article was originally published at 3LC’s website. It is reprinted here with the permission of 3LC.

The COCO dataset is a cornerstone of modern object detection, shaping models used in self-driving cars, robotics, and beyond. But what happens when we take a closer look? By examining annotations, embeddings, and dataset patterns at a granular level, we’re uncovering quirks, inconsistencies, and areas for potential refinement. Could this lead to a better YOLO model? No guarantees – just an exploration of one of computer vision’s most influential datasets.

The COCO dataset has been one of the foundational pillars of modern computer vision – alongside ImageNet and others. It has shaped how models like Ultralytics YOLO understand the world, powering everything from self-driving cars to smart surveillance, drones, and robots. COCO is often the dataset that object detection foundational models are trained on before you fine-tune them with your own data.

In this deep dive, I’m exploring the COCO dataset and its annotations to better understand what makes COCO so effective – with a hairy goal of refining the dataset and labels even further. Could this lead to a better foundational YOLO model and weights? Let’s find out!

If you’re working with object detection, training object detection models, or just curious about dataset quality at scale, this one’s for you. Also, there are nice pictures and videos!

(Disclaimer: I’m not a data scientist and this is more intuition-driven experimentation than rigorous science – but let’s see where it takes us!)

COCO 2017

First, we download the dataset from COCO’s official site before registering it in 3LC.

Registering only takes a few seconds – no data is duplicated or moved.

Next, we open the 3LC dashboard and see our datasets.

We can open both the train and validation sets simultaneously, but for now, let’s just open the train dataset.

Browsing the images and bounding boxes

Browsing the COCO Dataset

We can quickly browse the dataset using the filtering panel on the left. For example, let’s try finding an image containing a Person, Umbrella, and Cow at the same time – just for fun! Even with 800,000+ bounding boxes and ~118,000 images, this is instant.

By selecting the three categories in the left panel and using the AND operation, we find that 28 images (out of 118,287) contain all three labels simultaneously.
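Outside the dashboard, the same AND query can be reproduced with pycocotools – a minimal sketch, assuming the 2017 annotation file has been downloaded (the paths are my own, not from the article):

    # Rough equivalent of the dashboard's AND filter, using pycocotools.
    from pycocotools.coco import COCO

    coco = COCO("annotations/instances_train2017.json")

    # getImgIds() with several category ids returns only images containing ALL of them
    cat_ids = coco.getCatIds(catNms=["person", "umbrella", "cow"])
    img_ids = coco.getImgIds(catIds=cat_ids)

    print(f"{len(coco.imgs)} images, {len(coco.anns)} boxes in train2017")
    print(f"{len(img_ids)} images contain person AND umbrella AND cow")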

We don’t have many metrics (yet!), but we do have Width and Height, so let’s plot those and select the images with the lowest resolution.
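Width and height are stored directly in the COCO annotation file, so outside the dashboard a rough equivalent of this lasso selection is just a sort (reusing the same annotation file as above):

    # Sketch: rank COCO images by resolution using the width/height fields.
    from pycocotools.coco import COCO

    coco = COCO("annotations/instances_train2017.json")

    # Sort image records by pixel count and take the ten smallest
    smallest = sorted(coco.imgs.values(), key=lambda im: im["width"] * im["height"])[:10]
    for im in smallest:
        print(im["file_name"], im["width"], "x", im["height"])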

Lassoing in on the lowest width/height images

And we get these:

Low resolution images in the COCO dataset

While this is useful for quick dataset browsing, we need more meaningful metrics for deeper exploration!

Creating Per-Bounding Box Embeddings

The first step is to extend the dataset by generating per-bounding box embeddings (along with additional metrics). Instead of using an already trained model, we’re going to train a classifier on what’s inside each bounding box across the entire dataset. Then, we’ll extract those embeddings and create a new revision of the 3LC table containing them.

Why? Because generic pre-trained models for generating embeddings may not be relevant – especially if you’re working with domain-specific data.

The repository for this script is open-source and available here:

https://github.com/3lc-ai/3lc-examples/tree/main/src/tlc_tools/experimental

(This script works directly with the bounding boxes in the dataloader, so no duplicated data is created for training the classifier on the cropped bounding boxes.)
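The actual training code lives in the repository above; here is a simplified sketch of the idea – crops are taken lazily inside the dataloader, and the classifier’s penultimate features become the per-box embeddings. Class and field names below are my own, not the script’s:

    # Simplified sketch, not the actual 3LC script: lazy bounding-box crops feed a
    # classifier trained from scratch, and its penultimate features are the embeddings.
    import torch
    import torch.nn as nn
    from PIL import Image
    from torch.utils.data import Dataset
    from torchvision import models, transforms


    class BoxCropDataset(Dataset):
        """Yields (crop, label) pairs cropped in memory – nothing is written to disk."""

        def __init__(self, boxes, image_size=128):
            # boxes: list of dicts like {"image_path": ..., "bbox": [x, y, w, h], "category": int}
            self.boxes = boxes
            self.tf = transforms.Compose([
                transforms.Resize((image_size, image_size)),
                transforms.ToTensor(),
            ])

        def __len__(self):
            return len(self.boxes)

        def __getitem__(self, i):
            b = self.boxes[i]
            x, y, w, h = b["bbox"]
            crop = Image.open(b["image_path"]).convert("RGB").crop((x, y, x + w, y + h))
            return self.tf(crop), b["category"]


    # Classifier trained on the crops; its penultimate activations are the embeddings.
    model = models.resnet18(weights=None, num_classes=80)     # 80 COCO categories

    def embed(batch):
        """Return 512-d penultimate-layer features for a batch of crops."""
        trunk = nn.Sequential(*list(model.children())[:-1])   # drop the final fc layer
        with torch.no_grad():
            return trunk(batch).flatten(1)                     # shape: (N, 512)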

Training the classification model

The script automatically assigns weights per bounding box, which is good since COCO’s categories are highly imbalanced. The script uses PyTorch’s WeightedRandomSampler to fetch bounding boxes in a balanced manner, while augmenting and varying crops (bounding box position and scale). This ensures that each bounding box appears slightly different for each training epoch, compensating for imperfect labels.
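As a minimal sketch of the balancing step (the real script may compute the weights differently, and it also jitters the crop position and scale inside the dataset):

    # Class-balanced fetching of bounding boxes with PyTorch's WeightedRandomSampler.
    from collections import Counter

    import torch
    from torch.utils.data import DataLoader, WeightedRandomSampler

    dataset = BoxCropDataset(boxes)                        # from the sketch above
    labels = [b["category"] for b in dataset.boxes]        # one entry per bounding box
    counts = Counter(labels)
    weights = torch.tensor([1.0 / counts[c] for c in labels], dtype=torch.double)

    # Sampling with replacement so rare categories appear as often as common ones
    sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
    loader = DataLoader(dataset, batch_size=256, sampler=sampler, num_workers=8)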

Now that we have ~880,000 embeddings (one per bounding box in the train and val sets), we can start exploring!

Exploring Per-Bounding Box Embeddings

Let’s try looking at some embedding outliers!
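3LC surfaces these interactively in the embedding view; as a rough standalone way to score the same kind of outlier, assuming the per-box embeddings are already in a NumPy array:

    # Score each bounding box by its mean distance to its 10 nearest neighbours
    # in embedding space – isolated points are likely outliers.
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    # embeddings: (num_boxes, dim) array, one row per bounding box
    nn_index = NearestNeighbors(n_neighbors=11).fit(embeddings)
    dists, _ = nn_index.kneighbors(embeddings)       # column 0 is the point itself
    outlier_score = dists[:, 1:].mean(axis=1)
    worst = np.argsort(outlier_score)[::-1][:50]     # indices of the 50 most isolated boxes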

These “weird”, imprecise bounding boxes are tagged with is_crowd in the original COCO data, meaning each one encompasses multiple objects. But do models actually learn from them? Do they just ignore them? Or, even worse, is the foundational YOLO model not as good as it could be because of them?

Capturing Metrics with a YOLO Model

To dig deeper, let’s capture YOLO’s predictions on COCO. Since YOLO was trained on this dataset, we can just collect metrics and see what it has learned.
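The metrics collection itself runs through 3LC, but the underlying predictions come from the standard Ultralytics API – roughly like this (checkpoint and paths are assumptions):

    # Collect YOLO predictions on COCO images with the Ultralytics API.
    from ultralytics import YOLO

    model = YOLO("yolo11n.pt")                            # COCO-pretrained checkpoint
    results = model.predict("val2017/", conf=0.25, stream=True)

    for r in results:
        for box, conf, cls in zip(r.boxes.xyxy, r.boxes.conf, r.boxes.cls):
            # pixel coordinates, confidence, and predicted class id for each box
            print(r.path, box.tolist(), float(conf), int(cls))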

We create a Run to collect per-bounding box metrics from the model.

We are using the extended tables that have the per-bounding-box embeddings

First, let’s see if the model predicted any is_crowd bounding boxes on the validation set.

Left: Ground truth (is_crowd labels) | Right: Model’s predictions

Looking through examples, we see that the model never predicts these large bounding boxes on the validation set. What about in the training set?

No “large” predictions – only individual people

Even in the training set, the model never predicts these bounding boxes, so it’s penalized for missing them during training. That can’t be good. There are 9,000+ images in the train set with these bounding boxes.

To take a more scientific look at this, let’s create two plots: one where embeddings are colored by False Negatives (missed predictions) and another where they are colored by is_crowd tags.

Left: Green represents False Negative bounding boxes | Right: Purple highlights is_crowd bounding boxes

It’s pretty clear these labels are never predicted – time to start fixing the data!

Removing is_crowd Bounding Boxes

Let’s delete all is_crowd bounding boxes (from both the train and validation sets), commit to create a new 3LC table revision, and later retrain a new YOLO model from scratch.
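In 3LC this is an edit committed as a new table revision; on the raw annotation files, the same change looks roughly like this (paths are my own):

    # Drop every iscrowd annotation from a COCO annotation file.
    import json

    with open("annotations/instances_train2017.json") as f:
        data = json.load(f)

    before = len(data["annotations"])
    data["annotations"] = [a for a in data["annotations"] if a.get("iscrowd", 0) == 0]
    print(f"removed {before - len(data['annotations'])} iscrowd boxes")

    with open("annotations/instances_train2017_nocrowd.json", "w") as f:
        json.dump(data, f)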

Later, I’ll compare:

  • Original YOLO model on the revised validation set vs.
  • New YOLO model on the revised validation set

I predict the new model should perform better – especially since the model never predicted these labels anyway!

(However, maybe these labels add useful noise or help the model learn some other way?)

But before we start any retraining, let’s see if we can improve the data further.

Finding & Adding Missing Labels

I’ll check for missing bounding boxes – labels that should have been in the dataset but weren’t. Using the filter panel, we isolate high-confidence model predictions that have no overlap with the labels in COCO.

Left: Original labels | Right: Filtered predictions that are likely missing labels

At an IoU of 0 (no overlap) and a confidence of 0.5 – the active filters are highlighted with yellow boxes in the left filter panel – we find 4,000+ images with 40,000+ potential missing bounding boxes.

Not all of these predictions are correct, of course, but most should be. Intuitively, adding many missing but accurate bounding boxes – even if some incorrect ones slip in – should still improve performance! A more meticulous approach would be to add missing labels iteratively, starting with the most confident predictions and retraining in between (but training YOLO from scratch is costly!).
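For reference, the filter itself is simple to express outside the tool – a sketch that flags confident predictions with zero overlap against the ground truth (boxes assumed to be in [x1, y1, x2, y2] pixel coordinates):

    # Flag high-confidence predictions that overlap no ground-truth box at all.
    import numpy as np


    def iou_matrix(preds, gts):
        """Pairwise IoU between (N, 4) and (M, 4) arrays of [x1, y1, x2, y2] boxes."""
        px1, py1, px2, py2 = np.split(preds, 4, axis=1)
        gx1, gy1, gx2, gy2 = np.split(gts, 4, axis=1)
        inter_w = np.clip(np.minimum(px2, gx2.T) - np.maximum(px1, gx1.T), 0, None)
        inter_h = np.clip(np.minimum(py2, gy2.T) - np.maximum(py1, gy1.T), 0, None)
        inter = inter_w * inter_h
        p_area = (px2 - px1) * (py2 - py1)
        g_area = (gx2 - gx1) * (gy2 - gy1)
        return inter / (p_area + g_area.T - inter)


    def candidate_missing_labels(pred_boxes, pred_conf, gt_boxes, conf_thresh=0.5):
        """Indices of predictions with confidence >= conf_thresh and IoU == 0 vs. every label."""
        keep = pred_conf >= conf_thresh
        if gt_boxes.shape[0] == 0:
            return np.where(keep)[0]
        best_iou = iou_matrix(pred_boxes, gt_boxes).max(axis=1)
        return np.where(keep & (best_iou == 0))[0]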

Adding 40,000+ missing bounding boxes in one go!

Now we have two revisions of the dataset, for both the train and val sets:

  • Revision 1: all is_crowd bounding boxes are removed

  • Revision 2: 40,000+ missing bounding boxes have been added on top of revision 1

There’s much more we could explore, like inaccurate or wrong labels, but let’s start here! Training a YOLO model from scratch takes time – even the smallest model, YOLO11n (nano), takes 6 days on an NVIDIA A100 GPU. So, it’ll be a while before the next update…

(Unless someone wants to sponsor a GPU cluster)

Note: If we actually manage to improve the foundational Ultralytics YOLO model – which would be huge, considering it’s the starting point for millions of training runs each year and powers countless object detection applications across industries – we’ll of course share both the improved model weights on Hugging Face and the refined COCO dataset (licenses permitting). Even small improvements at this stage could ripple through the entire ecosystem, making models more accurate and efficient.

Again, I’m not a data scientist – just vibing and using intuition here. Let’s see where this takes us!

Coming in Part 2: Training a YOLO Model from Scratch with revised data

Paul Endresen
CEO and CTO, 3LC
