LLaVA 1.5 vs. 1.6
LLaVA (or Large Language and Vision Assistant), an open-source large multi-modal model, just released version 1.6. It claims to have improvements over version 1.5, which was released a few months ago:
Increasing the input image resolution to 4x more pixels. This allows it to grasp more visual details. It supports three aspect ratios, up to 672x672, 336x1344, 1344x336 resolution.
Better visual reasoning and OCR capability with an improved visual instruction tuning data mixture.
Better visual conversation for more scenarios, covering different applications. Better world knowledge and logical reasoning.
Efficient deployment and inference with SGLang.
Both the old and new versions are available on Ollama, so I thought I’d give them a try on some of my images to see how they get on.
I’ve created a video showing how to do this on my YouTube channel, Learn Data with Mark, so if you prefer to consume content through that medium, I’ve embedded it below:
If you want to follow along, you’ll need to install Ollama, launch it (ollama serve), and then pull down both models:
ollama pull llava:7b-v1.5-q4_0
ollama pull llava:v1.6
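Once those downloads have finished, you can check that both models are available locally by listing them:
ollama list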
Now let’s try out the models.
A photo of me
We’ll start with the following image of me looking through a magnifying glass.

Let’s create a chat session with the 1.5 version of LLaVA:
ollama run llava:7b-v1.5-q4_0
We can then ask it to describe the image:
"Can you describe this image? ./images/image.png"
Added image './images/image1.png'
The image depicts a bald man wearing glasses, holding up a magnifying glass to his eye. He is standing in front of a black background, possibly to focus on the
subject matter being observed through the lens of the magnifying glass. His attentive gaze suggests that he might be examining an object or studying something in detail.
One neat recent feature of Ollama is that we can switch to another model from within the session by calling the /load command.
So to use the v1.6 model, we would call the following:
/load llava:v1.6
Loading model 'llava:v1.6'
And if we ask this model to describe the image, we’ll see the following output:
In the image, there is a man who appears to be in his middle years. He has a bald head and is wearing glasses with black frames and a thin black strap. The man is holding up what looks like a magnifying glass over his eye, as if examining something closely. He has a slight smile on his face and is looking directly at the camera, creating an engaging perspective.
The man is dressed in a casual manner, with a white t-shirt adorned with a small pattern and a pair of dark pants. His stance is relaxed, suggesting he might be in a lighthearted or informal setting.
Behind him, there's a plain black background that contrasts with the man and draws attention to him. The image does not contain any text or additional objects. The focus is solely on the man and his interaction with the magnifying glass.
1.6 gives a more detailed answer, but I think it’s more or less the same description as 1.5 gave.
Ollama Python
For the rest of the examples, we’re going to use the Ollama Python library, which we can install by running the following:
pip install ollama
Next, we’ll run the following in a Jupyter notebook or from a Python REPL:
import ollama

def generate(model, image, prompt):
    # Stream the response so we can print tokens as they arrive
    stream = ollama.generate(
        model=model,
        prompt=prompt,
        images=[image],
        stream=True
    )
    response = ""
    for chunk in stream:
        response += chunk['response']
        print(chunk['response'], end='', flush=True)
    print("")
    return response
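To make sure the helper works, we can call it directly with one of the models we pulled earlier. The image path here is just a placeholder, so swap in whichever image you want to describe:
generate('llava:v1.6', 'images/image1.png', 'Can you describe this image?')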
Captioning an image
Next, we’re going to ask models to come up with a caption for the following image:

We can do this by running the following code:
image = "images/image2.png"
prompt = "Create a caption for this image"
for model in ['llava:7b-v1.5-q4_0', 'llava:v1.6']:
    print(f"Model: {model}")
    generate(model, image, prompt)
Model: llava:7b-v1.5-q4_0
A blue brick wall with a white arrow pointing to the left.
Model: llava:v1.6
"A playful twist on direction, with a blue wall providing the backdrop."
If we update the prompt to include the word 'creative', like this:
image = "images/image2.png"
prompt = "Create a creative caption for this image"
for model in ['llava:7b-v1.5-q4_0', 'llava:v1.6']:
    print(f"Model: {model}")
    generate(model, image, prompt)
We get a slightly better output:
Model: llava:7b-v1.5-q4_0
"A blue arrow pointing the way, guiding us on our journey."
Model: llava:v1.6
"Guiding the way to endless possibilities: a sign pointing towards an unknown adventure."
There’s not much in it, but it seems to me that version 1.6 has 'understood' the prompt a bit better.
Reading a banner image
Next up, let’s see if the models can read the text from a banner image of one of my blog posts.

We’ll use the following prompt:
image = "images/image3.png"
prompt = "What text is written on this image?"
for model in ['llava:7b-v1.5-q4_0', 'llava:v1.6']:
    print(f"Model: {model}")
    generate(model, image, prompt)
I ran this a few times because it was giving different answers each time. Here’s the output for version 1.5:
Model: llava:7b-v1.5-q4_0
5 Ways to Use AI in Your Business
Running a large language model locally on my laptop.
"Hugging Face"
And 1.6:
Model: llava:v1.6
The text on the image reads: "HUGGING FACE - RUNNING A LARGE LANGUAGE MODEL LOCALLY ON MY LAPTOP"
The text on the image reads: "Hugging Face" followed by a description of how to use Hugging Face for running a large language model locally on one's laptop.
The text on the image reads: "Hugging Face Running a Large Language Model Locally On My Laptop"
It’s by no means perfect, but I think version 1.6 wins this one.
Extracting code from an image
But can it extract code from an image? These days, lots of people create LinkedIn posts with code embedded in images, and I’d quite like to extract that code and run it locally. Perhaps LLaVA can help out here. We’re going to use the following image:

And we’ll use the following prompt:
image = "images/image4.png"
prompt = "Extract the code from this image"
for model in ['llava:7b-v1.5-q4_0', 'llava:v1.6']:
    print(f"Model: {model}")
    generate(model, image, prompt)
And here’s how the models fared, starting with 1.5:
Model: llava:7b-v1.5-q4_0
```sql
a = A + 1
b = B + 1
c = C + 1
d = D + 1
if (a > b and a > c) then begin
RP_RPT("A is greater than B but less than C")
end if
if (b > a and b > c) then begin
RP_RPT("B is greater than A but less than C")
end if
if (c > a and c > b) then begin
RP_RPT("C is greater than A and B")
end if
if (a = b and b = c) then begin
RP_RPT("A,B,C are equal")
end if
if (a < b and a < c) then begin
RP_RPT("A is less than B but greater than C")
end if
if (b < a and b < c) then begin
RP_RPT("B is less than A but greater than C")
end if
if (c < a and c < b) then begin
RP_RPT("C is less than A but greater than B")
end if
if (a > b or not(b = c)) then begin
RP_RPT("A,B are not equal")
end if
if (a < b and a <= C) then begin
RP_RPT("A is less than B but greater than or equal to C")
end if
if (b >= A and b <= C) then begin
RP_RPT("B is within the range of A & C")
end if
if (c > A and c <= B) then begin
RP_RPT("C is within the range of A & B")
end if
end if
if (A = -1024 or B = -1024 or C = -1024) then begin
RP_RPT("A,B,C are all equal to the value of -1024")
end if
if (D >= A + B + C and D <= 2**31 - 1) then begin
RP_RPT("D is within the range of A+B+C")
```
And now 1.6:
Model: llava:v1.6
The image contains a screenshot of a computer screen with a code snippet displayed. Here is the extracted code:
```python
import statistics as stats
data = [10, 20, 30, 40, 50]
mean = stats.mode(data)
median = stats.median(data)
stdev = stats.stdev(data)
print('Mean:', mean)
print('Median:', median)
print('Standard Deviation (STDEV):', stdev)
```
This code is written in Python and uses the `statistics` module to calculate the mean, median, and standard deviation of a list of numbers. The list `data` contains five elements, which are then processed using the `mode`, `median`, and `stdev` functions from the `statistics` library.
Neither model does a good job when it comes to extracting code.
Understanding a diagram
And we’ll finish off by seeing whether the models can explain this diagram, which shows a relational vs graph approach to modelling a problem.

We’ll use this prompt:
image = "images/image5.png"
prompt = "Can you describe this diagram?"
for model in ['llava:7b-v1.5-q4_0', 'llava:v1.6']:
    print(f"Model: {model}")
    generate(model, image, prompt)
Now, let’s see how they get on. First up, 1.5:
Model: llava:7b-v1.5-q4_0
This diagram displays a network of interconnected files, with multiple layers and different shapes. There is a mix of both square and rectangular shapes, which can be seen in various parts of the image. The relationships between these elements are represented by arrows connecting them to one another, indicating that some files are dependent or related to others.
This network diagram appears to represent an intricate system where different files have specific connections with other files within the structure. It provides a visual representation of how these files interact and engage with each other in this complex environment.
This doesn’t make a whole lot of sense. Let’s see if 1.6 does any better:
Model: llava:v1.6
The image you've shared is a flowchart that describes a process or workflow, likely from an algorithmic or computational context. Here's a description of the elements in the flowchart:
1. **Start Point**: There are two arrows pointing to a starting point labeled "from this..." suggesting the beginning of a process or algorithm.
2. **Sequence of Operations**: The sequence of operations is depicted by a series of boxes, each representing an action or function within the context of the flowchart. These actions seem to be related to data processing or transformation:
- "Node1" appears to represent some sort of input or initial stage.
- "Node2" and "Node3" seem to be intermediate steps or stages in the process.
- "Node4" represents another significant step or decision point.
- "Node5" is the final node, indicating the completion of the workflow or the output of the process.
3. **Arrows**: The arrows show the flow or direction of the operations. They indicate that the starting point leads to Node1 and then moves through the subsequent nodes until it reaches Node5.
4. **Labels on Nodes**: Each node has labels that are likely to represent actions, functions, or data elements involved in the process. However, without more context, these labels are not clearly defined.
5. **Final Arrow**: There's an arrow from Node5 pointing back to "from this..." which may indicate that the output of Node5 is being fed back into another part of the system.
6. **Connections Between Nodes**: The lines connecting the nodes represent the flow or relationship between them, showing that one node leads to the next in a logical sequence.
7. **Annotations**: There are annotations within the flowchart, but they're not fully legible due to the quality of the image. One can make out phrases like "Node1" and "Node5", indicating the names or identities of these nodes in the workflow.
This type of diagram is commonly used in computer science and software engineering to visualize the steps involved in a process, making it easier to understand and communicate the logic behind a particular algorithm or system.
It’s slightly better, but still doesn’t do a good job of explaining the difference between the approaches.
About the author
I'm currently working on short-form content at ClickHouse. I publish short 5-minute videos showing how to solve data problems on YouTube at @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms book with Amy Hodler.