As robotics improves, the potential for AI to move from the computer screen into the real world becomes a tantalizing possibility for humanity. Self-driving cars are just the beginning of the type of automation that is about to be unleashed on the world. However, while great leaps have been made with the development and training of language models, visual AI (also referred to as computer vision in some circles) still struggles with fundamental issues, like the lack of suitable annotated training and benchmark data. And unfortunately, most of that annotated data is never used.
While nearly all forms of AI need some annotated data, the need is arguably greater in visual AI than in language models. With ChatGPT and other language models, the initial training phase is self-supervised. Developers essentially unload truckloads of data, usually pulled from the Internet, and the model makes sense of it on its own during that initial training, although human input is critical during the later fine-tuning and reinforcement learning stages.
That approach doesn’t work quite the same way with visual AI, the other primary modality of modern neural networks and transformer models. Large amounts of data are needed to train most contemporary models, but not all domains are equal. Text has a natural one-dimensional structure that lends itself to next-token prediction, and although the semantic complexity of language is high, contemporary large language models are doing a solid job of learning from it. Visual AI has no such obvious next-token prediction objective, which isn’t surprising given the broad variability and complexity of the visual world.
Visual AI models therefore require much more annotated data, and that annotation is done by humans. Large teams of data annotators around the world spend millions of hours providing textual descriptions of the images they see, whether it’s identifying cats in videos or blocked arteries in radiological images.
Up to this point, data annotation has represented a huge but necessary expense for training visual AI models to identify and differentiate entities in the world, to “see” like a human. Unfortunately, the visual AI community is stuck in first gear when it comes to data annotation, largely because of this reliance on manual human labeling.
But with a few small tweaks in process and technology, organizations can achieve remarkable improvements in their computer vision AI workflows.
Wasted Annotations
The big problem is that organizations are spending too much time and money manually creating data annotations and labels that they will never use. By some estimates, organizations never use 95% of their data annotations, representing a huge waste of resources. In fact, one organization I’ve talked to in the last year told me they throw away 499 out of every 500 annotations. What a waste!
How we got here is a combination of old technology and outdated processes. Consider a company developing a self-driving vehicle. To train the computer vision algorithm to identify objects that the self-driving car will encounter in the real world, such as traffic lights and bicycles, the company needs to annotate many thousands of hours of video footage of vehicles driving in every imaginable situation.
The problem is that the lion’s share of that video footage is not useful. It’s repetitive, redundant, and often doesn’t include the all-important edge cases — such as a bicycle suddenly appearing out of the rain at dusk — that are crucial to developing a computer vision algorithm that will allow the car to drive safely.
Those edge cases do exist in the video footage, but finding them typically requires painstaking effort by humans. Visual AI companies pay annotation vendors large sums to have people pore through the videos and manually traverse the long tails of the distribution. Sometimes, they’re able to identify a subset of the data that is more likely to contain the edge cases. The trick, in any case, is to figure out ways of getting to those all-important edge cases quickly and efficiently, while ignoring all the repetitive stuff.
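To make that concrete, here is a minimal sketch of one common way teams narrow the search: embed every frame with an off-the-shelf vision model and rank frames by how far they sit from their neighbors, so the unusual scenes surface first. The model choice (CLIP), the frame directory, and the cutoff are illustrative assumptions for this example, not a description of any particular vendor’s pipeline.

```python
# Hypothetical sketch: surface "rare" frames by embedding them with CLIP and
# scoring each frame by its distance to its nearest neighbors. Frames far from
# everything else are more likely to be the unusual scenes worth annotating.
from pathlib import Path

import torch
from PIL import Image
from sklearn.neighbors import NearestNeighbors
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

frames = sorted(Path("frames/").glob("*.jpg"))  # assumed pre-extracted video frames
embeddings = []
with torch.no_grad():
    for path in frames:
        inputs = processor(images=Image.open(path), return_tensors="pt")
        features = model.get_image_features(**inputs)
        embeddings.append(torch.nn.functional.normalize(features, dim=-1).squeeze(0))
embeddings = torch.stack(embeddings).numpy()

# Rarity score = mean distance to the 10 nearest neighbors; higher = more unusual.
knn = NearestNeighbors(n_neighbors=11).fit(embeddings)  # +1 because each frame matches itself
distances, _ = knn.kneighbors(embeddings)
rarity = distances[:, 1:].mean(axis=1)

# Route the rarest 1% of frames to human annotators first.
for idx in rarity.argsort()[::-1][: max(1, len(frames) // 100)]:
    print(frames[idx], round(float(rarity[idx]), 4))
```

The thresholds and model are placeholders; the point is simply that ranking data before labeling it is cheap compared to annotating every frame.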
Enter Auto Labeling
The good news is that AI models are getting better at doing some of the annotation for us. One promising technique is auto labeling, a semi-supervised machine learning method also referred to as pseudo labeling.
Auto labeling uses a new class of pretrained vision-language models, such as YOLO-World and Grounding DINO, to automatically generate “pseudo labels” that can be used to train the main visual AI model. There are still good reasons to train a new model even when these pretrained models exist; for instance, they are often prohibitively large and cannot run on edge devices.
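As a rough illustration of what auto labeling looks like in practice, the sketch below prompts an open-vocabulary detector with the class names the downstream model needs and saves whatever it finds as pseudo labels. OWL-ViT stands in here only because it exposes a simple Hugging Face pipeline; YOLO-World and Grounding DINO have their own APIs, and the prompts, confidence threshold, and output format are assumptions made for the example.

```python
# Minimal pseudo-labeling sketch using an open-vocabulary detector.
# Prompts, threshold, and the JSON output format are illustrative assumptions.
import json
from pathlib import Path

from PIL import Image
from transformers import pipeline

detector = pipeline(task="zero-shot-object-detection", model="google/owlvit-base-patch32")
prompts = ["traffic light", "bicycle", "pedestrian"]  # classes the downstream model must learn

pseudo_labels = []
for path in sorted(Path("frames/").glob("*.jpg")):
    detections = detector(Image.open(path), candidate_labels=prompts)
    boxes = [
        {"label": d["label"], "box": d["box"], "score": round(d["score"], 3)}
        for d in detections
        if d["score"] >= 0.4  # keep only reasonably confident pseudo labels
    ]
    pseudo_labels.append({"image": str(path), "annotations": boxes})

# These machine-generated labels can then train a smaller, edge-friendly model.
Path("pseudo_labels.json").write_text(json.dumps(pseudo_labels, indent=2))
```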
Auto labeling isn’t a silver bullet. Organizations need to select the appropriate vision-language model for the job, and these models have different strengths and weaknesses, so matching their capabilities and resource requirements to the use case is critical to getting a good outcome. Auto labeling also tends to perform better on common cases than on edge cases, and for complex scenarios human expertise is still necessary. However, a strategic approach to human labeling can deliver better development productivity and significant cost savings. Recent ML research from my team indicates that zero-shot techniques can further reduce the burden on human annotation by giving people the insights to prioritize what to label manually for downstream model training.
In some cases, auto labeling will not provide sufficient quality for production use, such as a self-driving car that needs to reliably identify that bicycle coming out of the shadows. Here, human annotators are still required. But how humans are integrated into the process makes a big difference in the ultimate success of the project.
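One simple way to integrate humans, sketched below under assumed thresholds and building on the pseudo-label file from the previous example, is to auto-accept frames where the detector is confident and queue only the uncertain or empty frames for human review.

```python
# Sketch of a simple triage rule (an assumption, not a prescribed workflow):
# keep confident pseudo labels as-is and send only low-confidence or empty
# frames to human annotators, instead of sending every frame.
import json
from pathlib import Path

records = json.loads(Path("pseudo_labels.json").read_text())  # from the sketch above
AUTO_ACCEPT = 0.7  # illustrative confidence threshold

auto_accepted, human_queue = [], []
for record in records:
    scores = [ann["score"] for ann in record["annotations"]]
    if scores and min(scores) >= AUTO_ACCEPT:
        auto_accepted.append(record)   # trust the machine labels as training data
    else:
        human_queue.append(record)     # empty or uncertain: likely an edge case or a miss

print(f"auto-accepted: {len(auto_accepted)}, sent to annotators: {len(human_queue)}")
```

The exact threshold would be tuned per project; the point is that annotators see the frames most likely to hide edge cases rather than the whole firehose.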
A Better Labeling Process
To fulfill the promise of physical AI, a better way of integrating human intelligence into data annotation is needed. There is more good news on this front: the AI community is developing tools that provide a better and faster data annotation experience.
The first generation of mass data annotation was largely transactional in nature. A company would send a large number of images to a data labeling company and wait for the workers to manually annotate the images. This process was slow and expensive. But worse than that, it just didn’t work well, and led directly to companies discarding most of the annotations, such as the 499 out of 500 example I previously noted.
The emerging approach is much less transactional and much more interactive in nature. Instead of throwing the images over the wall and waiting for annotations to come back, organizations are able to empower their in-house AI developers to work with auto labeling models to get the data annotations they need. If the first batch of annotations is insufficient or doesn’t include the needed edge cases, they can identify that quickly, and work to get the exact annotations they need.
By curating their own auto labels directly in their AI development environment, AI developers are able to iterate much more quickly than they could before. By integrating data annotation directly into the AI development workflow, developers can more easily pick up on contextual clues about the annotated data that describe the edge cases. This leads to a more accurate model and a more satisfying AI development experience.
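As a small, hypothetical example of what that tighter loop enables, a developer can audit pseudo-label coverage per class before training and immediately see which classes, often the edge cases, are under-represented.

```python
# Illustrative audit of pseudo-label coverage inside the development loop:
# count labels per class so under-represented classes are visible at a glance.
import json
from collections import Counter
from pathlib import Path

records = json.loads(Path("pseudo_labels.json").read_text())  # from the earlier sketch
counts = Counter(ann["label"] for record in records for ann in record["annotations"])

for label, n in counts.most_common():
    print(f"{label:15s} {n:6d}")

# Classes with few (or zero) examples are the gaps to close in the next iteration,
# either with more targeted footage or with focused human annotation.
```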
Today’s AI development software is improving quickly, and it’s already leaps and bounds ahead of where it was five years ago. While visual AI is more dependent on high-quality annotated data than language models are, improvements in technology and process are giving AI developers the productivity boost they need to build the next generation of visual AI models and power the coming breakthroughs in physical AI.