An ORSAC method for data cleaning inspired by RANSAC

Thomas Jenkins; Autumn Goodwin; Sameerah Talafha

Authors

Thomas Jenkins, Computer Vision Research, Vectech, Baltimore. ORCID: https://orcid.org/0009-0000-0777-8358
Autumn Goodwin, Computer Vision Research, Vectech, Baltimore. ORCID: https://orcid.org/0000-0001-8086-9846
Sameerah Talafha, Computer Vision Research, Vectech, Baltimore. ORCID: https://orcid.org/0000-0002-5302-8539

Keywords:

Classification, Computer vision, Data cleaning, Label noise, RANSAC

Abstract

In classification problems, mislabeled data can have a dramatic effect on the capability of a trained model. The traditional method of dealing with mislabeled data is expert review. However, this is not always practical, due to the large volume of data in many classification datasets, such as image datasets supporting deep learning models, and the limited availability of human experts to review the data. Herein, we propose an ordered sample consensus (ORSAC) method to support data cleaning by flagging mislabeled data. This method is inspired by the random sample consensus (RANSAC) method for outlier detection. In short, the method involves iteratively training and testing a model on different splits of the dataset, recording misclassifications, and flagging data that are frequently misclassified as probably mislabeled. We evaluate the method by purposefully mislabeling subsets of data and assessing the method's capability to find such data. We demonstrate with three datasets (a mosquito image dataset, CIFAR-10, and CIFAR-100) that this method reliably finds mislabeled data with a high degree of accuracy. Our experimental results indicate a high proficiency of our methodology in identifying mislabeled data across these diverse datasets, with performance assessed at different mislabeling frequencies.
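The loop described in the abstract (train and test on different splits, record misclassifications, flag frequently misclassified samples) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how ORSAC orders its splits, so the sketch assumes plain shuffled folds, and it swaps in a nearest-centroid classifier for the deep model. The names `nearest_centroid_predict`, `flag_probable_mislabels`, `n_rounds`, `n_folds`, and `threshold` are all hypothetical.

```python
import numpy as np


def nearest_centroid_predict(X_train, y_train, X_test):
    """Stand-in classifier (assumption): assign each test point to the nearest class mean."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]


def flag_probable_mislabels(X, y, n_rounds=20, n_folds=5, threshold=0.5, seed=0):
    """Repeatedly split the data, train, test, and count misclassifications per sample.

    Samples whose misclassification rate reaches `threshold` are flagged.
    The shuffled-fold splitting is an assumption, not the paper's ordering scheme.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    miss = np.zeros(n)    # times each sample was misclassified while held out
    tested = np.zeros(n)  # times each sample was held out
    for _ in range(n_rounds):
        perm = rng.permutation(n)
        for test_idx in np.array_split(perm, n_folds):
            train_mask = np.ones(n, dtype=bool)
            train_mask[test_idx] = False
            pred = nearest_centroid_predict(X[train_mask], y[train_mask], X[test_idx])
            miss[test_idx] += pred != y[test_idx]
            tested[test_idx] += 1
    rate = miss / tested
    return np.flatnonzero(rate >= threshold), rate


# Synthetic check mirroring the paper's evaluation idea: mislabel a few
# samples on purpose, then see whether they are the ones that get flagged.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(10.0, 1.0, size=(50, 2))])
y_clean = np.array([0] * 50 + [1] * 50)
flipped = np.array([0, 1, 2, 50, 51])   # indices given a wrong label on purpose
y_noisy = y_clean.copy()
y_noisy[flipped] = 1 - y_noisy[flipped]

flagged, rate = flag_probable_mislabels(X, y_noisy)
print("flagged indices:", flagged.tolist())
```

On real image data the stand-in classifier would be replaced by the model actually being trained, and the flagged indices would be passed to a human expert for review rather than relabeled automatically.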
