Image-to-Text AI Models: CLIP, BLIP, WD 1.4 (aka WD14), SigLIP 2, and ChatGPT with Vision


Automated tagging, labeling, or describing of images is a crucial task in many applications, particularly in the preparation of datasets for machine learning. This is where image-to-text models come to the rescue. Among the leading image-to-text models are CLIP, BLIP, WD 1.4 (also known as WD14 or Waifu Diffusion 1.4 Tagger), SigLIP 2, and ChatGPT with Vision.

CLIP: A Revolutionary Leap

OpenAI’s Contrastive Language–Image Pretraining (CLIP) model has been widely recognized for its revolutionary approach to understanding and describing images. CLIP leverages a large amount of internet text and image data to learn a multitude of visual concepts, and interrogator tools built on top of it use that knowledge to produce descriptive sentences for images.

However, according to user reviews, CLIP’s descriptive sentences can sometimes be redundant or overly verbose. A common criticism revolves around the model’s propensity to repeat similar descriptions for the same object or overemphasize certain attributes, such as the color of an object.
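Under the hood, CLIP itself scores how well candidate texts match an image; interrogator tools build full sentences on top of that scoring. Here’s a minimal sketch of the scoring step with the Hugging Face transformers library (the checkpoint is the standard OpenAI release; the file name and candidate texts are made-up examples):

```python
# A minimal sketch of CLIP's core image-text scoring via Hugging Face transformers.
# The checkpoint is the standard OpenAI release; the file name and candidate
# texts are made-up examples.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")
texts = ["a photo of a cat", "a photo of a dog", "a screenshot of a chat window"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# CLIP ranks the candidate texts; interrogators iterate on this to assemble sentences.
probs = outputs.logits_per_image.softmax(dim=-1)
for text, p in zip(texts, probs[0]):
    print(f"{text}: {p.item():.3f}")
```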

BLIP: Simplicity Meets Functionality

The BLIP model, while less detailed in its descriptions compared to CLIP, offers a simpler and more direct approach to image-to-text processing. As one reviewer noted, BLIP is “cool and all, but it’s pretty basic.” This model’s simplicity can be an advantage for applications that require straightforward, less verbose tags or descriptions.

Nevertheless, some users found that BLIP’s output often lacks the depth and granularity provided by models like WD14. While it can generate satisfactory results, BLIP may not be the best choice for applications that demand detailed, complex tags.
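For reference, this is roughly what that straightforward approach looks like in code: a short sketch of BLIP captioning via Hugging Face transformers, using the widely available Salesforce base checkpoint (the image file name is a placeholder):

```python
# A short sketch of BLIP image captioning with Hugging Face transformers.
# The checkpoint id is the widely used Salesforce base model; "photo.jpg"
# is a placeholder.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("photo.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

# BLIP generates a single, fairly terse caption for the image.
output_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```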

I’ve found WD14, despite being anime-focused, works great for actual photos of people too. I usually combine it with BLIP and most of the times it picks up much more details than BLIP.

Toni Corvera at YouTube comments

Blip is cool and all, but its pretty basic.

WD 1.4 (WD14) tagging is way better – more detail, juicier tags.

OrphBean at GitHub

WD 1.4 (aka WD14): Precision in Detail

The WD 1.4 model (also known as WD14 or Waifu Diffusion 1.4 Tagger), initially designed for anime images, has demonstrated surprising versatility, performing well even with photos. Users have lauded its advanced configuration options and batch processing capabilities, which make it a robust tool for image-to-text translation.

What sets WD14 apart is its ability to generate detailed, “juicier” tags, providing more in-depth descriptions of images compared to its counterparts. While this model is less likely to produce spurious tags, its focus on anime might be a limitation for certain types of images.
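If you want to run WD14 tagging outside of a UI extension, a rough sketch along the following lines works; the repo id, file names, preprocessing, and threshold are assumptions based on the commonly used SmilingWolf releases, so check the model card of the checkpoint you actually download:

```python
# A rough sketch (under assumptions) of running a WD 1.4 tagger checkpoint
# locally with onnxruntime. Repo id, file names, input layout, and threshold
# follow the commonly used SmilingWolf releases; verify against the model card.
import csv
import numpy as np
import onnxruntime as ort
from PIL import Image
from huggingface_hub import hf_hub_download

repo_id = "SmilingWolf/wd-v1-4-convnextv2-tagger-v2"   # assumed repo id
model_path = hf_hub_download(repo_id, "model.onnx")
tags_path = hf_hub_download(repo_id, "selected_tags.csv")

session = ort.InferenceSession(model_path)
input_meta = session.get_inputs()[0]
height, width = input_meta.shape[1], input_meta.shape[2]   # NHWC, typically 448x448

# Preprocess: resize and convert RGB -> BGR float32, the convention these taggers expect.
image = Image.open("photo.jpg").convert("RGB").resize((width, height))
x = np.asarray(image, dtype=np.float32)[:, :, ::-1]
x = np.ascontiguousarray(np.expand_dims(x, 0))

probs = session.run(None, {input_meta.name: x})[0][0]

with open(tags_path, newline="", encoding="utf-8") as f:
    tag_names = [row["name"] for row in csv.DictReader(f)]

threshold = 0.35   # typical starting point; tune per model and use case
tags = [name for name, p in zip(tag_names, probs) if p >= threshold]
print(", ".join(tags))
```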

Q: Is the WD14 tagger better than the BLIP or deepdanbooru built in Automatic1111?

A: The extension gives better options for configuration and batch processing, and I’ve found it less likely to produce completely spurious tags than deepdanbooru.

CLIP/BLIP is different since those produce descriptive sentences rather than lists of tags, but the latter is usually more in line with my needs. And the built-in CLIP interrogator is prone to busting out things like “a picture of (description) and a picture of (slightly different description of the same thing” or “(mostly complete description) and pink hair and pink hair and pink hair and (repeat many times)”

Despite being made for anime, the WD14 tagger works pretty well on photos.

MorganTheDual at Reddit

SigLIP 2: A Powerful Image-to-Text Engine

SigLIP 2, a free and open model from Google, isn’t just a vision-language model; it’s a powerful engine for turning images into meaningful text. While it excels at tasks like image-text retrieval and zero-shot classification, its architecture and training enhancements make it a strong contender for image-to-text generation and understanding. Here’s a breakdown of how SigLIP 2 works in this context:

The Foundation: Vision Transformer (ViT) and Sigmoid Loss

  • Vision Transformer (ViT): Unlike convolutional neural networks (CNNs), SigLIP 2 uses a Vision Transformer (ViT) architecture. The ViT treats an image like a sequence of patches, similar to how words are treated as tokens in natural language processing. Each patch is converted into a vector representation (an embedding). This allows the model to use the powerful Transformer architecture, known for its ability to capture long-range dependencies, to understand the relationships between different parts of the image.
  • Sigmoid Loss (Not Contrastive): A key differentiator of SigLIP (and SigLIP 2) is the use of a sigmoid loss function instead of the more common contrastive loss (used in models like CLIP). Contrastive learning requires comparing an image to multiple text options. Sigmoid loss, on the other hand, treats image-text matching as a binary classification problem for each image-text pair. This seemingly small change has a big impact: it allows for more stable training and better performance, especially with larger batch sizes. It focuses on the individual image-text pair matches.
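To make the difference concrete, here’s a toy comparison of the two objectives. This is an illustrative sketch, not either model’s actual training code; img and txt stand in for batches of L2-normalized embeddings, and the temperature and bias values are placeholders:

```python
# An illustrative comparison of CLIP-style contrastive loss and SigLIP-style
# sigmoid loss. Not real training code: img and txt are placeholder batches
# of L2-normalized embeddings, and t/b are placeholder temperature/bias values.
import torch
import torch.nn.functional as F

def contrastive_loss(img, txt, t=0.07):
    # CLIP-style: softmax over every pairing in the batch, in both directions,
    # so each image competes against all texts (and vice versa).
    logits = img @ txt.T / t
    targets = torch.arange(img.size(0))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

def sigmoid_loss(img, txt, t=10.0, b=-10.0):
    # SigLIP-style: every image-text pair is an independent binary decision,
    # +1 for matching pairs (the diagonal) and -1 for all others.
    logits = img @ txt.T * t + b
    labels = 2 * torch.eye(img.size(0)) - 1
    return -F.logsigmoid(labels * logits).mean()

img = F.normalize(torch.randn(8, 512), dim=-1)
txt = F.normalize(torch.randn(8, 512), dim=-1)
print(contrastive_loss(img, txt).item(), sigmoid_loss(img, txt).item())
```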

Enhanced Training for Text Generation

SigLIP 2 takes the foundation of SigLIP and adds several crucial enhancements that directly benefit its image-to-text capabilities:

  • Captioning-Based Pretraining: This is a huge step. SigLIP 2 incorporates captioning as part of its pretraining process. This means it’s explicitly trained to generate text descriptions of images. This is in contrast to models like the original CLIP, which were primarily trained on image-text matching, not generation.
  • Inspired by Self-Supervised Learning: SigLIP 2 leverages powerful techniques:
    • Self-Distillation: The model learns from its own predictions, refining its understanding over time.
    • Masked Prediction: Parts of the input (either image patches or text tokens) are hidden, and the model learns to predict the missing pieces. This forces it to develop a deeper understanding of both modalities.
  • LocCa Loss and Decoder: SigLIP 2 integrates the LocCa loss, which adds a Transformer decoder with cross-attention. This decoder is specifically trained on tasks like image captioning, referring expression prediction (identifying regions based on text), and grounded captioning. This enhances fine-grained localization and detailed feature extraction.
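As a rough illustration of that last point, here’s a toy decoder that cross-attends to image features and greedily picks the next caption token. This is an assumption-level sketch, not SigLIP 2’s actual implementation; dimensions, vocabulary size, and token ids are made up:

```python
# A toy version of the decoder idea above: a Transformer decoder that
# cross-attends to image features and autoregressively picks caption tokens.
# Assumption-level sketch only; dimensions, vocab size, and token ids are made up.
import torch
import torch.nn as nn

dim, vocab, num_patches = 256, 1000, 196
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True),
    num_layers=2,
)
token_emb = nn.Embedding(vocab, dim)
to_logits = nn.Linear(dim, vocab)

image_features = torch.randn(1, num_patches, dim)   # stand-in for the ViT encoder output
caption_so_far = torch.tensor([[1, 42, 7]])         # <bos> plus tokens generated so far

x = token_emb(caption_so_far)
seq_len = x.size(1)
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

# Cross-attention lets every caption position look at every image patch.
hidden = decoder(tgt=x, memory=image_features, tgt_mask=causal_mask)
next_token = to_logits(hidden[:, -1]).argmax(dim=-1)   # greedy choice of the next token
print(next_token)
```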

How it all comes together (Image -> Text)

  1. Image Input: An image is fed into the ViT encoder.
  2. Patch Embedding: The image is divided into patches, and each patch is transformed into an embedding vector.
  3. Transformer Encoding: The Transformer layers process these patch embeddings, capturing relationships between different parts of the image. Learned positional embeddings provide information about the location of each patch.
  4. Attention Pooling: An attention-based pooling mechanism (MAP head) aggregates the information from the patch embeddings into a single, comprehensive image representation.
  5. Text Decoder (Crucial for Image-to-Text): This is where SigLIP 2’s captioning capabilities come into play. The image representation from the encoder is fed into the Transformer decoder (added by the LocCa loss). The decoder uses cross-attention to focus on relevant parts of the image representation while generating text.
  6. Text Output: The decoder generates a sequence of text tokens, forming a caption or description of the image.
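In practice you won’t wire these steps up by hand. Here’s a hedged sketch of scoring image-text pairs with a SigLIP 2 checkpoint through Hugging Face transformers; the checkpoint id, file name, and candidate labels are assumptions, so consult the model card for the variant you use:

```python
# A sketch of SigLIP 2 image-text matching via Hugging Face transformers.
# Checkpoint id, file name, and labels are assumptions; consult the model card.
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_name = "google/siglip2-base-patch16-224"        # assumed checkpoint id
model = AutoModel.from_pretrained(model_name)
processor = AutoProcessor.from_pretrained(model_name)

image = Image.open("photo.jpg")
labels = ["a cartoon kitten", "a chat screenshot", "a portrait of a streamer"]
texts = [f"This is a photo of {label}." for label in labels]

inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Sigmoid, not softmax: each image-text pair gets an independent probability.
probs = torch.sigmoid(outputs.logits_per_image)
for label, p in zip(labels, probs[0]):
    print(f"{label}: {p.item():.3f}")
```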

Multilingual Power

SigLIP 2 isn’t limited to English. It’s trained on a multilingual dataset (WebLI), giving it the ability to understand and generate text in multiple languages, making it a truly global image-to-text tool.

In Summary

SigLIP 2’s combination of a strong ViT-based architecture, sigmoid loss, and, crucially, decoder-based pretraining with captioning and self-distillation, makes it a highly effective model for image-to-text tasks. It goes beyond simple image-text matching to actively generate descriptive text, opening up possibilities for applications like automatic image captioning, visual question answering, and detailed image analysis.


ChatGPT with Vision

If you need to describe a relatively small set of images, you can use the Image Describer tool in ChatGPT. It’s not suitable for large batches of images, but it’s perfect for smaller image sets because it provides the best natural-language descriptions of the group. Descriptions from this tool are very precise and detailed: not just tags, but coherent, informative text that also takes into account the relationships between objects in the image.

Example:

Test image set:

Image Describer output for these 3 images:

You’ve uploaded three images. Here’s a description of each:

  1. The first image features an adorable, stylized cartoon kitten with large, expressive eyes. It is predominantly orange with white accents on its face, chest, and paws. The kitten is holding a pen, which is almost as big as itself, and is drawing a happy face on a bright yellow ball. The art style is very polished and has a kind of glossy finish. The background suggests a whimsical setting with colorful, abstract shapes and a suggestion of a starry sky.
  2. The second image appears to be a screenshot of a chat or comment section from a social media platform. Various users have left comments, many of which include laughter or amusement indicated by acronyms like “LMAO” and emojis. The usernames and profile pictures are pixelated for privacy, with a variety of colors displayed in the pixelation.
  3. The third image is a highly detailed and realistic illustration of a young woman with a friendly expression. She has long black hair, fair skin, and wears a purple hoodie. The woman is also wearing a headset, suggesting she might be a video game streamer or involved in some form of online communication. The setting appears to be a room illuminated with a purple hue, possibly from LED lights, indicative of a gaming setup.

Each image has a very different context and visual style, ranging from whimsical illustration to a screenshot of a digital interaction, to a more realistic digital painting.
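If your image set is a bit too large for the chat interface but still modest, the same kind of description can be scripted against the OpenAI API. The sketch below uses the official Python SDK; the model name and message format reflect my understanding of the current vision-capable API, so double-check them against the documentation:

```python
# A sketch of requesting an image description through the OpenAI Python SDK.
# The model name and message format are assumptions about the current
# vision-capable API; verify against the official documentation.
import base64
from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment

with open("kitten.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",   # assumed model name; use whichever vision-capable model you have access to
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Describe this image in detail, including the relationships between objects."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```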

Wrapping Up

Choosing between CLIP, BLIP, WD 1.4, SigLIP 2, and Image Describer depends largely on the specific needs of a project. If precision and detail are paramount, WD 1.4 and SigLIP 2 are compelling options: WD 1.4 for its advanced configuration and detailed tagging, and SigLIP 2 for its captioning and retrieval strength. For simpler applications, BLIP’s straightforward approach might be more suitable. Meanwhile, CLIP provides a balance between detail and simplicity, albeit with a tendency toward verbosity.

Image Describer provides the best results but is not suitable for describing or tagging large sets of images.

As these models continue to evolve and improve, they hold promising potential for a broad spectrum of applications, from content creation to data analysis. Despite their differences, CLIP, BLIP, WD 1.4, SigLIP 2, and ChatGPT with Vision are a testament to the rapid advancements in image-to-text technology, each contributing unique strengths to this exciting field.
