The weighty significance of data cleanliness — or as I like to call it, “cleanliness is next to model-ness” — cannot be overstated.

Alexander Lan
12 min read · Mar 9, 2023

Our team had the delightful experience of increasing our F1 score (why F1 is a topic for another blog post) by a whopping 8% simply by giving our badly annotated data a one-way ticket to the digital trash bin with the help of our trusty fastdup tool. It’s amazing what a little tidying up can do for your model’s performance!

In the field of computer vision, the development of machine learning models typically involves a repetitive cycle of model training, testing, and comparison, culminating in the selection of the best-performing model or ensemble of models for production. However, these models may exhibit significant drawbacks when deployed in real-world conditions. To address this issue, we present a data-first approach to model development that incorporates data visualization and cleaning using fastdup and our custom ImageLoader tools, which ease the visualization and analysis of computer vision data in Jupyter. We demonstrate how this approach can help to avoid production problems by identifying and mitigating potential issues. Our approach enables data scientists and machine learning engineers to achieve better results, save time, and improve the reliability of their computer vision models.

1. Introduction

Our ultimate aim is to give data cleaning a touch of magic, making it possible to weed out poorly annotated images or eliminate bad labels in a multi-label problem automatically, without the need for any manual intervention.

The full analysis Jupyter notebook can be found on GitHub, including instructions on how to download and use the dataset. The ImageLoader tool can be downloaded here.

In the realm of machine learning, the development of models in startups, small companies, and large corporations alike is often marked by a scarcity of time and resources. As a result, data scientists and machine learning engineers may be forced to make tradeoffs between model quality and practical considerations such as productivity and speed to market. This can have serious consequences when the models are deployed to production, as their performance may suffer despite having demonstrated good quality on validation and test sets. To help mitigate this issue, we present a clear example of how models can be improved prior to deployment.

Here we present our approach using a publicly available dataset (PETA) that showed the same improvements that we achieved with our own private one. Overall, we achieved an 8% increase in F1 score by using data cleaning.

We utilized the efficient and powerful fastdup tool to cluster a large number of images quickly and effectively. This enabled us to assign a cluster_id to each image, which we then used to enhance our approach. Our ultimate objective was to extract tabular features that could easily identify (a hypothetical example of such a table is sketched below the list):

  1. good / bad images
  2. good / bad labels
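To make that goal concrete, here is a hypothetical sketch of the kind of per-image table we aim for; every column name here is illustrative, not our production schema.

```python
import pandas as pd

# Hypothetical per-image summary table; all columns are illustrative.
features = pd.DataFrame({
    "filename":        ["a.jpg", "b.jpg", "c.jpg"],
    "cluster_id":      [7, 7, 12],            # assigned by fastdup
    "cluster_size":    [3, 3, 1],
    "label_count_std": [0.0, 2.1, 0.0],       # label agreement within the cluster
    "is_bad_image":    [False, False, True],  # e.g. blurry / outlier image
    "is_bad_label":    [False, True, False],  # e.g. label disagrees with its cluster
})
print(features)
```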

2. Experiment results

In our work we conducted four experiments and showed how smart data cleaning can improve model performance. Since we did no hyperparameter tuning, we used only a validation set and no test set.

Experiment A — prediction with no data cleaning.

Experiment B — prediction with cleaning on train and val.

Experiment C — prediction with data cleaning on train and no cleaning on val.

[Table: results of Experiments A, B and C]

After seeing the impressive results of our previous experiments, we couldn’t help but wonder if we were just getting lucky with our data. So we decided to take a closer look and double check our good fortune. We wanted to make sure that we weren’t artificially increasing our scores by accidentally removing the “hard” labels during the cleaning process. To accomplish this, we conducted another experiment to specifically test for the robustness of our cleaning methodology. By doing so, we can feel confident that our data is truly representative of the underlying reality and that our results are a genuine reflection of our model’s performance.

Experiment D — prediction with no data cleaning on train and manual curation of data on val.

During the course of our experimentation, we discovered that the issue with the model’s performance isn’t always with the model itself, but with the validation set. In fact, it’s quite possible that the model is actually performing well, but the label associated with the data is incorrect. To mitigate this possibility, we conducted a thorough manual review of the discrepant predictions among Experiments A, B, and C.

Here is an illustrated example of good components (i.e., images identified as belonging to the same cluster by fastdup). We can see that by cleaning the training data we helped the model learn better by removing confusing examples.

Good components, predicted incorrectly in A and correctly in B and C


We expected a better F1 score from Experiment C than from Experiment A, but that was not the case. Our hypothesis was that the model from Experiment C, trained with the cleaner data, would be better than the model from Experiment A, trained with uncleaned data; therefore, its predictions on the same validation subset should also be better.

That is why we added Experiment D, in which we took the model from Experiment A and ran predictions on a cleaned validation set. Doing so yielded better predictions, because the bad and difficult examples had been removed from the validation subset. The table below shows that Experiment D’s predictions were better than Experiment A’s for at least head_gender and head_age.

[Table: comparison of Experiments A and D]

Here I want to mention cnvrg, an excellent tool that helped us manage all of these experiments and keep track of them easily and with little effort.

Based on our experiments, we found that cleaning the entire dataset can lead to improved model performance, as seen in Experiment D. However, we initially observed worse performance in Experiment C compared to Experiment A, despite our hypothesis that Experiment C should perform better. To further investigate this, we manually validated the labels for all incorrect predictions by examining the images and labels for components and added new columns to our dataframe to track labeling errors and model mistakes. This enabled us to identify wrongly labeled components and correct their labels, resulting in improved F1 scores for both Experiment A and Experiment C. An example of this is shown in the image below, which includes samples of the corrected labels and bad components.

Here are some examples that were mislabeled and artificially lowered the score of Experiment C. After fixing their labels and recalculating, we get Experiment D.

Here we see a person labeled as Under 45 and as Under 60 (by two different annotators). This is a “bad” component and was flagged as such by our algorithm, yet it remained in the validation set. We can clearly see that the algorithm was right.


Here we see an image predicted as Under 45, while the ground truth was Under 60. We manually overrode this label, as it is a bit ambiguous in any case.


Samples of the badly labeled components

There were also examples of badly labeled components containing more than three images.

Note the labels column: for each label we count how many images in that cluster carry it. If the component is good, then all of its images should have the same labels.
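As a rough illustration of that counting (a minimal sketch, assuming a pandas dataframe with one row per image-label pair; the column names are ours, not the original code):

```python
import pandas as pd

# One row per (image, label) pair; cluster_id comes from the fastdup components.
# Column names are illustrative, not the original schema.
df = pd.DataFrame({
    "filename":   ["a.jpg", "b.jpg", "c.jpg", "a.jpg", "b.jpg", "c.jpg"],
    "cluster_id": [7, 7, 7, 7, 7, 7],
    "label":      ["Male", "Male", "Male", "Under 45", "Under 45", "Under 60"],
})

# For every cluster, count how many of its images carry each label.
label_counts = (
    df.groupby(["cluster_id", "label"])["filename"]
      .nunique()
      .rename("images_with_label")
      .reset_index()
)
print(label_counts)
# In a "good" component every count equals the cluster size; here
# "Under 45" (2) and "Under 60" (1) disagree, hinting at a bad label.
```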


It turns out that after the manual label validation and improvement, Experiment C outperformed Experiment A. We found that this was because the validation set had contained incorrectly labeled examples that had been left uncleaned until then.

To provide a more comprehensive comparison, we have updated our table to include the results of Experiments A and C, along with some examples from the uncleaned validation set.


Recalculated results for Experiments A and C after the label improvements


Training machine learning models on unrefined image data wastes both time and money, since expensive GPUs are utilized. However, by improving the quality of the input data through the creation of image components, we can solve both issues at once. This approach not only reduces the resources required for training, but also leads to improved model performance.

Components that were predicted correctly in Experiment A and incorrectly in Experiments B and C:


Components that were predicted correctly in Experiments B and C and incorrectly in Experiment A:


3. Method and details

For this task we used the PETA dataset (Pedestrian Attribute Recognition At Far Distance, https://dl.acm.org/doi/10.1145/2647868.2654966 or https://paperswithcode.com/dataset/peta). It is a good example of a real-world dataset: 19,000 images of 8,705 persons, with more than 100 labels such as gender and clothing style, all captured at a far distance and multi-label (each person can have multiple labels).

For the model training and experiment tracking we used cnvrg, which is a very nice MLOps tool (https://cnvrg.io/). We used PyTorch MobileNet_v2 with ImageNet weights as our baseline model, and our goal was to configure and fine-tune this model for our problem. We measured model quality with the F1 score because the classes in the dataset are imbalanced.
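A minimal sketch of that kind of baseline, assuming torchvision’s MobileNet_v2 with the ImageNet head swapped for a multi-label head; the number of labels and the training details are simplified assumptions, not our exact setup.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_LABELS = 105  # illustrative; PETA has more than 100 attribute labels

# MobileNet_v2 pretrained on ImageNet as the backbone.
model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)

# Replace the 1000-class ImageNet head with a multi-label head.
in_features = model.classifier[1].in_features  # 1280 for MobileNet_v2
model.classifier[1] = nn.Linear(in_features, NUM_LABELS)

# Multi-label problem: one sigmoid/BCE output per attribute.
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

images = torch.randn(8, 3, 300, 170)                     # PETA-like input size
targets = torch.randint(0, 2, (8, NUM_LABELS)).float()   # dummy attribute vectors

logits = model(images)
loss = criterion(logits, targets)
loss.backward()
optimizer.step()
```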

4. Grouping similar images into components

The PETA dataset comprises 19,000 images of 8,705 individuals, which presented a challenge to our team. To group similar images, we explored various approaches and found fastdup to be an effective out-of-the-box solution (https://github.com/visual-layer/fastdup). fastdup offers a simple API that enables the creation of a collection of top components in just a few lines of code. We found that using a custom embedding, in the form of a ReID network (included in our code), provided the best results out of the various models we tested. In general, it seems that a task-specific embedding makes it easier to cluster objects based on the desired criteria. The PETA images have a shape ratio similar to the 300x170 image size used for model training. With the default settings of fastdup, we were able to group the similar images into 966 components and visualize some of them, including good and bad ones. Examples of these components are presented below.
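Before the examples, here is a minimal sketch of the fastdup flow described above, following its v1-style documented API; exact method names and return values may differ between fastdup versions, and the custom ReID-embedding path is omitted here.

```python
import fastdup

# Point fastdup at the image folder and let it compute embeddings,
# pairwise similarities and connected components.
fd = fastdup.create(work_dir="fastdup_work", input_dir="peta_images/")
fd.run()

# Each image gets a component (cluster) id that we later join back
# onto our labels dataframe as cluster_id.
connected_components_df, _ = fd.connected_components()
print(connected_components_df[["filename", "component_id"]].head())

# HTML gallery of the largest components for visual inspection.
fd.vis.component_gallery()
```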

[Example images of good and bad components]

But looking at the components’ labels, we noticed that the labels of some visually good components were bad, which indicates annotation issues.

Notice how the same woman was annotated once as wearing shorts and once as not wearing shorts, even though the images are the same.

A sample of a bad component

This is a clear example of real-world data. If a machine learning model is trained on such input data, it is going to have very low quality because of the noise in the training input: images of the same person can carry different label sets. Sometimes machine learning engineers and data scientists do only very simple data cleaning and train models on the still-uncleaned data. That is a very big mistake, because it is just a waste of their time.

5. Down to the details: the data cleaning

In order to avoid unnecessary model training with low-quality data, it is better to look at the images and their labels and think about how to improve the labels and the components (groups) of similar images. For example, we decided to clean bad components according to different thresholds. To do this, we calculated the standard deviation of the label counts and added new flag columns to our dataframe, for example “is_bad_component_std_80”.


Sample of the callback code for adding new flag columns
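The original callback exists only as a screenshot; the following is a rough reconstruction of its idea, not the exact code (the flag name is_bad_component_std_80 follows the article, everything else is an assumption):

```python
import pandas as pd

def add_bad_component_flag(df: pd.DataFrame, percentile: int = 80) -> pd.DataFrame:
    """Flag components whose per-label image counts vary a lot.

    Expects one row per (image, label) pair with 'filename', 'cluster_id'
    and 'label' columns; a reconstruction, not the original code.
    """
    # For each component, how many images carry each label?
    counts = df.groupby(["cluster_id", "label"])["filename"].nunique()

    # Standard deviation of those counts per component: a perfectly
    # consistent component has std == 0, disagreement pushes it up.
    std_per_component = counts.groupby(level="cluster_id").std().fillna(0.0)

    # Components above the chosen percentile of std are flagged as "bad".
    threshold = std_per_component.quantile(percentile / 100)
    flag_name = f"is_bad_component_std_{percentile}"
    df[flag_name] = df["cluster_id"].map(std_per_component > threshold)
    return df
```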

Using our custom ImageLoader, we could visualize bad and good components and understand our data as well as possible.


Sample of bad components visualized with Seedoo ImageLoader


Sample of good components visualized with Seedoo ImageLoader

Using the component cleaning logic, we fine-tuned a MobileNet_v2 model with a customized input layer. For the input layer we set groups to 3 instead of 1 (i.e., we feed the model three images at once instead of one). Testing our model, we got very promising initial results, with an F1 score around 0.9 that appeared good enough.
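A hypothetical sketch of what such a grouped input layer could look like; because PyTorch requires the output channels of a grouped convolution to be divisible by the number of groups, the sketch adds a 1x1 projection back to 32 channels, and the exact channel counts are our assumption rather than the original implementation.

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)

# Stack 3 RGB images channel-wise (9 input channels) and use groups=3 so
# each image is initially processed by its own set of filters, then
# project back to the 32 channels the rest of MobileNet_v2 expects.
grouped_stem = nn.Sequential(
    nn.Conv2d(9, 36, kernel_size=3, stride=2, padding=1, groups=3, bias=False),
    nn.Conv2d(36, 32, kernel_size=1, bias=False),
)
model.features[0][0] = grouped_stem  # replace the original 3->32 stem conv

x = torch.randn(4, 9, 300, 170)  # a batch of stacked image triplets
print(model(x).shape)            # torch.Size([4, 1000]) with the default head
```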


First trained model’s results, with component pruning by variance at threshold 30

6. Adding new heads, training, and further model improvement

As we aimed to improve the performance of our computer vision model, we added additional types of data processing and refined our approach to model training. Our focus was on optimizing the parameters of data processing, including components, labels, and images. To accomplish this, we developed a strategy for manipulating these parameters that included component pruning, label pruning, label intersection, and image pruning. We used various pruning methods to clean the data, such as component pruning by variance, geometric variance, and cubic variance, as well as label pruning based on counts ratio and thresholds. We also conducted comparisons between different models, generating a table of results for components with a certain number of images. Notably, we excluded singletons from the analysis due to potential noise in the data. Our overall approach represents a significant step forward in improving the accuracy and reliability of computer vision models.

Engineers in the field of machine learning often encounter the task of repeatedly training various models. It is important to note that relying solely on numerical model scores may not provide a complete understanding of model performance, and visual examination of the confusion matrix may be necessary. Our custom ImageLoader tool, developed at SeeDoo, enabled us to streamline model evaluation by implementing modules for automatic prediction on test datasets: every N epochs, the best model so far produces predictions for the full validation dataset and the metrics are exported. Results were exported as pandas dataframes rendered as HTML files. This approach reduced analysis time from several hours to just a few minutes.
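A minimal sketch of that export step (the metric computation is simplified; the real modules track many more columns):

```python
import pandas as pd
from sklearn.metrics import f1_score

def export_metrics(y_true, y_pred, label_names, out_path="metrics_epoch.html"):
    """Compute per-label F1 on the validation set and render it as HTML.

    y_true / y_pred: arrays of shape (n_samples, n_labels) with 0/1 entries.
    A simplified stand-in for the ImageLoader export modules.
    """
    rows = [
        {"label": name, "f1": f1_score(y_true[:, i], y_pred[:, i], zero_division=0)}
        for i, name in enumerate(label_names)
    ]
    metrics_df = pd.DataFrame(rows).sort_values("f1")
    metrics_df.to_html(out_path, index=False)  # open in a browser to inspect
    return metrics_df
```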


Sample of the general metrics table exported every N training epochs


Samples of the different confusion matrices exported every N training epochs

Here is the detailed list of strategies we applied in our work (a sketch of the label-pruning logic, item 2.1, follows the list):

  1. Component pruning — manipulating components (groups of images of similar persons) and cleaning them according to their label count statistics.
    1.1. Component pruning by variance with some threshold.
    1.2. Component pruning by geometric variance with some threshold.
    1.3. Component pruning by cubic variance with some threshold.
  2. Label pruning — manipulating the components’ labels.
    2.1. Label pruning — deleting labels based on label count ratios and thresholds within a component.
  3. Label intersection — keeping only the intersection of the images’ labels within a component.
  4. Image pruning — manipulating images within the components.
    4.1. Image pruning based on label count ratios and thresholds for each image (deleting images that have rare labels within a component).
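As an illustration, here is a minimal sketch of strategy 2.1 under the same dataframe assumptions as before (one row per image-label pair; the threshold value is illustrative, not the one used in our experiments):

```python
import pandas as pd

def prune_rare_labels(df: pd.DataFrame, min_ratio: float = 0.5) -> pd.DataFrame:
    """Drop (image, label) rows whose label is rare within its component.

    Expects 'filename', 'cluster_id' and 'label' columns; min_ratio is an
    illustrative threshold.
    """
    # Number of distinct images in each component.
    component_size = df.groupby("cluster_id")["filename"].transform("nunique")

    # For each (component, label) pair, how many images carry that label?
    label_count = df.groupby(["cluster_id", "label"])["filename"].transform("nunique")

    # Keep a label only if enough of the component's images agree on it.
    keep = (label_count / component_size) >= min_ratio
    return df[keep].reset_index(drop=True)
```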

This work was done by our ML engineer @Yauheni Kavaliou, who did an excellent job analyzing and summarizing the results.
