This blog post was originally published at Tenyks’ website. It is reprinted here with the permission of Tenyks.
In the realm of computer vision, a shift is underway. This article explores the transformative power of foundation models, digging into their role in reshaping the entire computer vision pipeline.
It also demystifies the hype behind the idea that stitching together a few models will solve computer vision: data is and remains king! In the computer vision pipeline 2.0, training and labelling might slowly be replaced by foundation models, but that is a possibility, not the current reality.
From conceptual shifts to the integration of cutting-edge models like CLIP, SAM, and GPT-4V, we show how and why we believe computer vision is ready to venture into uncharted territory, where a data-centric approach remains the right way to set up production-grade AI systems.
What is different: foundation models
What is a foundation model?
“We define foundation models as models trained on broad data (generally using self-supervision at scale) that can be adapted to a wide range of downstream tasks.”
Stanford Institute for Human-Centered AI (HAI)
For instance, Grounding DINO [1], a zero-shot object detection model that extends DINO [2], can be employed to detect arbitrary objects in various contexts. This adaptability is attained by leveraging the knowledge and representations acquired during the foundational training.
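As a concrete illustration, here is a minimal sketch of querying such a model with free-form text labels. It assumes the Hugging Face transformers port of Grounding DINO and the IDEA-Research/grounding-dino-tiny checkpoint; the image path, the prompt, and the thresholds are placeholders.

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

# Assumed checkpoint: a small Grounding DINO variant hosted on the Hugging Face Hub
model_id = "IDEA-Research/grounding-dino-tiny"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

image = Image.open("construction_site.jpg")  # placeholder image path
# Free-form classes: lower-case phrases separated by periods
text = "a person. a hardhat. an excavator."

inputs = processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Turn raw outputs into boxes, scores, and matched phrases
results = processor.post_process_grounded_object_detection(
    outputs, inputs.input_ids,
    box_threshold=0.35, text_threshold=0.25,
    target_sizes=[image.size[::-1]],
)
print(results[0]["labels"], results[0]["boxes"])

The text prompt can be swapped for entirely different classes without retraining, which is what “adapted to a wide range of downstream tasks” means in practice.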
The rise of foundation models in computer vision
Figure 1. Evolution of Foundation Models in the field of Computer Vision. Source [3].
Figure 1 details the evolution of computer vision from a foundation model angle. We can see four main categories: traditional models (1998), textually prompted models (2021), visually prompted models (2021), and heterogeneous models (2023).
For every group, there is one particular model that accelerated the pace of evolution. For instance, traditional “input → model → output” models evolved from LeNet, progressed through AlexNet [5], VGG [6], InceptionNet [7], and ResNet [8], and accelerated with the Vision Transformer (ViT) [9].
Despite some controversy over whether or not computer vision is dead, the truth is that computer vision is evolving in parallel with other major breakthroughs in artificial intelligence (e.g., LLMs, diffusion models).
Enter GPT-4V!
The current pinnacle of vision in the generative AI (GenAI) domain is GPT-4V [4].
The following example shows how, given a query in the form of an image, this multi-modal foundation model can describe the provided image in rich detail: perhaps even better than the average college student!
- Input:
Figure 3. “Slam on the Brakes”, one of the winners of the 2023 Sony World Photography Awards
- Code:
import base64
import os

import requests

# OpenAI API key (read from an environment variable)
api_key = os.environ.get("MAGIC_API_KEY")

# Function to encode the image as base64
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

# Function to call the GPT-4V endpoint
def gpt4v(image_path, api_key):
    base64_image = encode_image(image_path)
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}"
    }
    payload = {
        "model": "gpt-4-vision-preview",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What's in this image?"
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{base64_image}"
                        }
                    }
                ]
            }
        ],
        "max_tokens": 300
    }
    response = requests.post("https://api.openai.com/v1/chat/completions",
                             headers=headers, json=payload)
    print(response.json())
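To run the example end to end, call the helper with the path to the image above (the file name is a placeholder):

gpt4v("slam_on_the_brakes.jpg", api_key)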
- Output:
{'id': 'chatcmpl-68K9JOoOEvB', 'object': 'chat.completion', 'created': 1437, 'model': 'gpt-4-1106-vision-preview', 'usage': {'prompt_tokens': 1122, 'completion_tokens': 254, 'total_tokens': 1376}, 'choices': [{'message': {'role': 'assistant', 'content': 'This image captures a high-energy moment of equestrian barrel racing. The focus is on a horse and rider duo executing a tight turn around a barrel. The horse is a bay with a shiny coat, and its muscular build is on full display. It has its ears pinned back, showing concentration or the strain of the tight turn, and it wears a bridle with a bit and reins. The rider is leaning into the turn, showing good balance and control. They are wearing traditional western attire, including a cowboy hat, a plaid shirt, denim jeans, and cowboy boots. The rider\'s left leg is extended, only the toe touching the stirrup, as they guide the horse around the barrel.\n\nThe barrel is labeled "ALBERTA PREMIUM". The force of the turn is kicking up dirt and sand from the ground, creating a dynamic cloud of particles that accentuates the action. The background is slightly blurred, emphasizing the speed of the movement, and we can see fences and some banners or signs, suggesting that this is taking place in an arena during a competitive event.\n\nThe action is frozen in time by the photographer, giving the scene a dramatic and intense feeling, almost as if we can feel the ground vibrating from the horse\'s powerful strides.'}, 'finish_details': {'type': 'stop', 'stop': '<|fim_suffix|>'}, 'index': 0}]}
- 🤖 GPT-4V: “This image captures a high-energy moment of equestrian barrel racing. The focus is on a horse and rider duo executing a tight turn around a barrel. The horse is a bay with a shiny coat, and its muscular build is on full display. It has its ears pinned back, showing concentration or the strain of the tight turn, and it wears a bridle with a bit and reins. The rider is leaning into the turn, showing good balance and control. …”
The hype: CLIP + SAM + GPT-4V + [whatever the next big thing is]
The hype: computer vision is solved!
So, given that foundation models have taken the world by storm, the obvious question now is: Is computer vision finally solved?
One interesting way to answer this question is to look at what is actually being built! In other words: What are people out there creating with these powerful new building blocks?
Some argue that using GPT-4V in combination with other foundation models is enough to solve object detection. As we explain in the next section, the day-to-day reality of production ML systems is less rosy.
Figure 4 shows an instance where GPT-4V fails to count the number of hardhats correctly in 3 out of 5 attempts, using images taken randomly from Google. The time spent crafting a reliable prompt that works for 90% or more of cases will likely far exceed the time spent training a well-tested computer vision model for this task.
Figure 4. GPT-4V used to count the number of hardhats in different images
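To make the prompt-reliability problem concrete, an illustrative counting prompt that could replace the "What's in this image?" text in the gpt4v helper above might look like the following; the wording and JSON schema are assumptions, and many such iterations are typically needed before the answers become dependable.

# Illustrative prompt (an assumption, not a validated recipe): constrain GPT-4V
# to a machine-readable answer so the count can be checked automatically
counting_prompt = (
    "Count the hardhats that are actually worn on a person's head. "
    "Ignore hardhats lying on the ground or held in hands. "
    'Respond only with JSON in the form {"hardhats_worn": <integer>}.'
)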
The reality: A tool is not a system
Despite the promising breakthroughs brought by foundation models, stacking together a multitude of models to solve a problem at scale is quite different from building a demo.
To begin with, two of the fundamental costs associated with transforming a demo into a production-grade system are:
- Development: A foundation model might work well in general, but what about your specific task or use case? In that scenario you might need to fine-tune the foundation model. For instance, what if you work with satellite images and need to fine-tune CLIP for this task (see the sketch after this list)? Assuming 4 months of development and 2 months of testing, a single ML engineer’s salary for this task might be around GBP 35,000. What if you need the job done in 3 months instead of 6? Would you be willing to pay double that amount?
- Engineering: Connecting the inputs and outputs of “chained” foundation models so they operate at high speed and high throughput requires your MLOps team to figure out a foundation model “connecting” pipeline. Foundation models are great at zero-shot tasks, but assembling a chain of models where the output of one stage can affect the input of the next is only one of the challenges to be addressed.
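As a rough sketch of the development effort described above, contrastive fine-tuning of CLIP on satellite imagery might look roughly like the following. It assumes the Hugging Face CLIPModel/CLIPProcessor API; the satellite_pairs data is a stand-in for a real image–caption dataset, and a production setup would add proper data loading, validation, and hyper-parameter tuning.

import torch
from torch.utils.data import DataLoader
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

# Stand-in data: replace with real satellite (image, caption) pairs
satellite_pairs = [
    (Image.new("RGB", (224, 224)), "an aerial view of a harbour"),
    (Image.new("RGB", (224, 224)), "a satellite image of farmland"),
]
loader = DataLoader(satellite_pairs, batch_size=2, shuffle=True, collate_fn=list)

model.train()
for batch in loader:
    images, captions = zip(*batch)
    inputs = processor(text=list(captions), images=list(images),
                       return_tensors="pt", padding=True)
    # return_loss=True makes CLIPModel compute the symmetric contrastive loss
    outputs = model(**inputs, return_loss=True)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()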
But foundation models are actually good enough!
This is absolutely true and exciting! Foundation models are reshaping computer vision!
In the next two sections we describe the traditional computer vision pipeline and the role we expect foundation models to play in a new generation of the computer vision pipeline.
The traditional computer vision pipeline
Figure 5. Traditional pipeline in the computer vision domain
Figure 5 shows the traditional computer vision pipeline from data collection to model monitoring. The illustration highlights the nature of each stage: model-centric, data-centric, or hardware.
In the current pipeline, the first three stages precede the main stage of a model-centric approach: model training. For many years, the model-centric approach has been crucial for developing new architectures (e.g., ResNet, transformers).
However, to build production-grade ML systems, a data-centric approach has proven to be more effective. At least 6 of the 8 stages in this pipeline are data-centric in nature, meaning that they involve data at their core. For instance, model evaluation requires ML teams to identify the root causes of why a model is failing, so that it can be fixed before reaching the next stages. Likewise, model selection, often considered a model-centric stage, is in reality more closely tied to data: a profound understanding of how a model behaves requires looking at how its performance varies across different slices of data.
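To make the idea of data slices concrete, here is a minimal sketch of per-slice evaluation; the toy table, its column names, and the use of plain accuracy are assumptions, and a detection pipeline would typically compute per-slice mAP instead.

import pandas as pd

# Toy per-image evaluation results; the columns are illustrative assumptions
df = pd.DataFrame({
    "image_id":   [1, 2, 3, 4, 5, 6],
    "lighting":   ["day", "day", "night", "night", "day", "night"],
    "weather":    ["clear", "rain", "clear", "rain", "clear", "rain"],
    "is_correct": [1, 1, 0, 1, 1, 0],
})

# A model can look fine on average yet fail badly on a specific slice
print("overall:", df["is_correct"].mean())
for column in ["lighting", "weather"]:
    print(df.groupby(column)["is_correct"].mean())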
We argue, though, that the rise of foundation models in computer vision might change the way this pipeline is built.
The new computer vision pipeline
Figure 6. Foundation models could give rise to a computer vision pipeline 2.0
Figure 6 shows the way we envision the computer vision pipeline 2.0. As this diagram reveals, foundation models are likely to eliminate two stages of the traditional pipeline:
- Data annotation: Annotating data at scale (i.e., 1M+ sample datasets) is perhaps the second most expensive process after model training. Controlling the quality of the labels is often the hardest challenge. Foundation models could improve auto-labelling (see the sketch after this list) and, over time, take this stage out of the CV pipeline.
- Model training: Once you have an annotated dataset, you can actually train a model. Nowadays, training a high-quality model for a specific industrial use case is expensive: large datasets often make distributed training the norm. Hence, foundation models are well positioned to take over this stage of the CV pipeline too.
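To make the auto-labelling idea concrete, here is a minimal sketch that uses the Segment Anything Model (SAM) to propose candidate masks. It assumes the segment-anything package and a locally downloaded ViT-H checkpoint, and the image path is a placeholder; the proposals would still need human review before being promoted to ground-truth labels.

import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Assumes the ViT-H SAM checkpoint has already been downloaded locally
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.imread("construction_site.jpg")          # placeholder image path
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)       # SAM expects RGB arrays

# Each proposal contains a binary mask, bounding box, area, and quality scores,
# which can be turned into candidate annotations for a human to accept or reject
masks = mask_generator.generate(image)
print(f"Generated {len(masks)} candidate masks")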
As a result of these changes, all stages except deployment are likely to lean towards a data-first approach. Hence, data remains the key to building AI systems in the real world.
Be careful of what you read online
The computer vision pipeline 2.0 presents a transformative shift led by foundation models, which hold the promise to streamline the cumbersome process of annotating large datasets and training high-quality models.
Be aware that data is still king in the production-grade world of machine learning. Even if foundation models replace annotation and training, you still need to acquire data, pre-process it, and select and define slices of data for evaluation and model selection. Monitoring data drift is also crucial.
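As one example of what monitoring could look like, the sketch below compares CLIP image embeddings of production data against a reference set; the dummy images, the mean-embedding comparison, and the 0.9 threshold are simplifying assumptions rather than a complete drift-detection method.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(images):
    # images: list of PIL images -> L2-normalised CLIP embeddings
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        features = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(features, dim=-1)

# Dummy stand-ins: replace with samples from your reference and production data
reference_images = [Image.new("RGB", (224, 224), "green") for _ in range(4)]
production_images = [Image.new("RGB", (224, 224), "gray") for _ in range(4)]

ref_mean = embed(reference_images).mean(dim=0)
prod_mean = embed(production_images).mean(dim=0)

similarity = torch.nn.functional.cosine_similarity(ref_mean, prod_mean, dim=0).item()
if similarity < 0.9:  # assumed threshold; calibrate on your own data
    print("Possible data drift: production images differ from the reference set")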
Not everything you read online is true! Rapid prototyping by combining a handful of models can help you brainstorm ideas, but a well-structured, data-centric approach remains the key to building robust AI systems for real-world applications.
References
[1] Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
[2] Emerging Properties in Self-Supervised Vision Transformers
[3] Foundational Models Defining a New Era in Vision: A Survey and Outlook
[4] GPT-4V(ision) System Card
[5] ImageNet Classification with Deep Convolutional Neural Networks
[6] Very Deep Convolutional Networks for Large-Scale Image Recognition
[7] Going deeper with convolutions
[8] Deep Residual Learning for Image Recognition
[9] An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale
Authors: Jose Gabriel Islas Montero, Dmitry Kazhdan