ChatOllama

Ollama allows you to run open-source large language models, such as Llama 2, locally.

Ollama bundles model weights, configuration, and data into a single package, defined by a Modelfile.

It optimizes setup and configuration details, including GPU usage.

For a complete list of supported models and model variants, see the Ollama model library.

Setup

First, follow these instructions to set up and run a local Ollama instance:

  • Download and install Ollama onto the available supported platforms (including Windows Subsystem for Linux)
  • Fetch a model via ollama pull <name-of-model>
    • View a list of available models via the model library
    • e.g., ollama pull llama3
  • This will download the default tagged version of the model. Typically, the default tag points to the latest version with the smallest parameter count.

On Mac, the models will be downloaded to ~/.ollama/models

On Linux (or WSL), the models will be stored at /usr/share/ollama/.ollama/models

  • To pull an exact version of a model, include its tag, e.g., ollama pull vicuna:13b-v1.5-16k-q4_0 (view the various tags for the Vicuna model in this instance)
  • To view all pulled models, use ollama list
  • To chat directly with a model from the command line, use ollama run <name-of-model>
  • View the Ollama documentation for more commands, or run ollama help in the terminal to list the available commands.

Usage

You can see a full list of supported parameters on the API reference page.

If you are using a LLaMA chat model (e.g., ollama pull llama3) then you can use the ChatOllama interface.

The ChatOllama interface includes the special tokens for system messages and user input.

Interacting with Models

Here are a few ways to interact with pulled local models:

In the terminal:

  • All of your local models are automatically served on localhost:11434
  • Run ollama run <name-of-model> to start interacting via the command line directly

Via an API

Send an application/json request to Ollama's API endpoint to interact with a model.

curl http://localhost:11434/api/generate -d '{
"model": "llama3",
"prompt":"Why is the sky blue?"
}'

See the Ollama API documentation for all endpoints.
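
The same request can be sent from Python. This is a minimal sketch using only the standard library; build_generate_request and generate are hypothetical helper names, and the code assumes the default behavior of /api/generate, which streams one JSON object per line with a "response" text fragment and a "done" flag.

```python
import json
import urllib.request


def build_generate_request(model, prompt, base_url="http://localhost:11434"):
    """Build the same POST request as the curl example above."""
    payload = json.dumps({"model": model, "prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt):
    """Send the request and join the streamed fragments into one string.

    Assumes a local Ollama server is running and the model has been pulled.
    """
    request = build_generate_request(model, prompt)
    parts = []
    with urllib.request.urlopen(request) as response:
        for line in response:  # one JSON object per line
            if not line.strip():
                continue
            chunk = json.loads(line)
            parts.append(chunk.get("response", ""))
            if chunk.get("done"):
                break
    return "".join(parts)
```

With a server running, generate("llama3", "Why is the sky blue?") should return the full answer as a single string.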

Via LangChain

Below is a basic example of using Ollama via the ChatOllama chat model in a LangChain application.

View the API Reference for ChatOllama for more.

# LangChain supports many other chat models. Here, we're using Ollama
from langchain_community.chat_models import ChatOllama
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

# ChatOllama supports many more optional parameters. Hover on your
# `ChatOllama(...)` class to view the latest supported parameters
llm = ChatOllama(model="llama3")
prompt = ChatPromptTemplate.from_template("Tell me a short joke about {topic}")

# using LangChain Expression Language (LCEL) chain syntax
# learn more about LCEL at
# /docs/concepts/#langchain-expression-language-lcel
chain = prompt | llm | StrOutputParser()

# for brevity, response is printed in terminal
# You can use LangServe to deploy your application for
# production
print(chain.invoke({"topic": "Space travel"}))
Why did the astronaut break up with his girlfriend?

Because he needed space!

Out of the box, LCEL chains provide extra functionality, such as response streaming and async support.

topic = {"topic": "Space travel"}

for chunks in chain.stream(topic):
    print(chunks)
Why
did
the
astronaut
break
up
with
his
girlfriend
before
going
to
Mars
?


Because
he
needed
space
!
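
The output above shows one chunk per line because print() appends a newline after every chunk. A minimal sketch of printing the chunks as continuous text while also keeping the full response; print_stream is a hypothetical helper, and it assumes the chain ends in StrOutputParser so that each chunk is a plain string.

```python
def print_stream(chunks):
    """Print string chunks as they arrive and return the joined text."""
    parts = []
    for chunk in chunks:
        # end="" keeps the chunks on one line; flush shows them immediately
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream is exhausted
    return "".join(parts)
```

With the chain defined earlier, this would be called as print_stream(chain.stream(topic)).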

The same chain also supports async streaming. Here's an example:

topic = {"topic": "Space travel"}

# top-level `async for` works in a notebook; in a script, run it inside an async function
async for chunks in chain.astream(topic):
    print(chunks)
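
Outside a notebook, the async loop has to run inside a coroutine driven by asyncio.run. A minimal sketch follows; collect_astream is a hypothetical helper, and fake_astream is a stand-in for chain.astream(topic) so the example runs without a model.

```python
import asyncio


async def collect_astream(chunks):
    """Consume an async stream of string chunks and return the joined text."""
    parts = []
    async for chunk in chunks:
        parts.append(chunk)
    return "".join(parts)


async def fake_astream():
    # Hypothetical stand-in for chain.astream(topic)
    for piece in ["Because", " he", " needed", " space", "!"]:
        await asyncio.sleep(0)  # yield control, as a real stream would
        yield piece


text = asyncio.run(collect_astream(fake_astream()))
print(text)
```

With the real chain, the call would be asyncio.run(collect_astream(chain.astream(topic))).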

Take a look at the LangChain Expression Language (LCEL) Interface for the other interfaces available once a chain is created.

Building from source

For up-to-date instructions on building from source, check the Ollama documentation on Building from Source.

Extraction

Use the latest version of Ollama and supply the format flag. The format flag will force the model to produce the response in JSON.

Note: You can also try out the experimental OllamaFunctions wrapper for convenience.

from langchain_community.chat_models import ChatOllama

llm = ChatOllama(model="llama3", format="json", temperature=0)
from langchain_core.messages import HumanMessage

messages = [
    HumanMessage(
        content="What color is the sky at different times of the day? Respond using JSON"
    )
]

chat_model_response = llm.invoke(messages)
print(chat_model_response)
content='{ "morning": "blue", "noon": "clear blue", "afternoon": "hazy yellow", "evening": "orange-red" }\n\n  \n\n\n\n\n\n  \n\n\n\n\n\n  \n\n\n\n\n\n  \n\n\n\n\n\n  \n\n\n\n\n\n  \n\n\n\n\n\n  \n\n\n\n\n\n  \n\n\n\n\n\n  \n\n\n\n\n\n  \n\n\n\n\n\n ' id='run-e893700f-e2d0-4df8-ad86-17525dcee318-0'
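
The content above is a JSON string followed by a run of trailing whitespace, so it still has to be decoded before use. A minimal sketch with the standard json module; parse_json_content is a hypothetical helper, and the sample string is a shortened copy of the response shown above.

```python
import json

# Shortened copy of the response content above, including trailing whitespace
content = (
    '{ "morning": "blue", "noon": "clear blue", '
    '"afternoon": "hazy yellow", "evening": "orange-red" }\n\n  \n\n'
)


def parse_json_content(text):
    """Return the decoded object, or None if the text is not valid JSON."""
    try:
        # json.loads ignores leading and trailing whitespace
        return json.loads(text)
    except json.JSONDecodeError:
        return None


colors = parse_json_content(content)
print(colors["evening"])
```

With the response object from the example, the call would be parse_json_content(chat_model_response.content).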
import json

from langchain_community.chat_models import ChatOllama
from langchain_core.messages import HumanMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

json_schema = {
    "title": "Person",
    "description": "Identifying information about a person.",
    "type": "object",
    "properties": {
        "name": {"title": "Name", "description": "The person's name", "type": "string"},
        "age": {"title": "Age", "description": "The person's age", "type": "integer"},
        "fav_food": {
            "title": "Fav Food",
            "description": "The person's favorite food",
            "type": "string",
        },
    },
    "required": ["name", "age"],
}

llm = ChatOllama(model="llama2")

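messages = [
    HumanMessage(
        content="Please tell me about a person using the following JSON schema:"
    ),
    # a (role, template) tuple is formatted as a template, so {dumps} is filled in;
    # a HumanMessage instance would be passed to the model literally
    ("human", "{dumps}"),
    HumanMessage(
        content="Now, considering the schema, tell me about a person named John who is 35 years old and loves pizza."
    ),
]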

prompt = ChatPromptTemplate.from_messages(messages)
dumps = json.dumps(json_schema, indent=2)

chain = prompt | llm | StrOutputParser()

print(chain.invoke({"dumps": dumps}))

Name: John
Age: 35
Likes: Pizza
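
The output above is free text rather than JSON, so it cannot be checked against the schema directly. When the model is asked for JSON (for example with format="json", as in the earlier example), the decoded result can be checked against the schema's required fields and basic types. A minimal sketch using only the standard library; check_person is a hypothetical helper, not part of LangChain or Ollama, and it handles only the "string" and "integer" types used in this schema.

```python
# Reduced copy of the Person schema from the example above
person_schema = {
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "fav_food": {"type": "string"},
    },
    "required": ["name", "age"],
}

# Only the JSON Schema types that appear in this example
PYTHON_TYPES = {"string": str, "integer": int}


def check_person(data, schema):
    """Return a list of problems; an empty list means the data passed."""
    errors = []
    for key in schema.get("required", []):
        if key not in data:
            errors.append(f"missing required field: {key}")
    for key, spec in schema.get("properties", {}).items():
        expected = PYTHON_TYPES.get(spec.get("type"))
        if key not in data or expected is None:
            continue
        value = data[key]
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(f"wrong type for field: {key}")
    return errors


print(check_person({"name": "John", "age": 35, "fav_food": "pizza"}, person_schema))
```

For full JSON Schema support, a dedicated validation library would be the better choice; this sketch only illustrates the idea.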

Multi-modal

Ollama has support for multi-modal LLMs, such as bakllava and llava.

Browse the full set of versions for models with tags, such as Llava.

Download the desired LLM via ollama pull bakllava

Be sure to update Ollama so that you have the most recent version to support multi-modal.

Check out the typical example of how to use ChatOllama multi-modal support below:

!pip install --upgrade --quiet  pillow
Note: you may need to restart the kernel to use updated packages.
import base64
from io import BytesIO

from IPython.display import HTML, display
from PIL import Image


def convert_to_base64(pil_image):
    """
    Convert a PIL image to a Base64-encoded string

    :param pil_image: PIL image
    :return: Base64-encoded JPEG string
    """

    buffered = BytesIO()
    pil_image.save(buffered, format="JPEG")  # You can change the format if needed
    img_str = base64.b64encode(buffered.getvalue()).decode("utf-8")
    return img_str


def plt_img_base64(img_base64):
    """
    Display a Base64-encoded string as an image

    :param img_base64: Base64 string
    """
    # Create an HTML img tag with the base64 string as the source
    image_html = f'<img src="data:image/jpeg;base64,{img_base64}" />'
    # Display the image by rendering the HTML
    display(HTML(image_html))


file_path = "../../../static/img/ollama_example_img.jpg"
pil_image = Image.open(file_path)

image_b64 = convert_to_base64(pil_image)
plt_img_base64(image_b64)
from langchain_community.chat_models import ChatOllama
from langchain_core.messages import HumanMessage

llm = ChatOllama(model="bakllava", temperature=0)


def prompt_func(data):
    text = data["text"]
    image = data["image"]

    image_part = {
        "type": "image_url",
        "image_url": f"data:image/jpeg;base64,{image}",
    }

    content_parts = []

    text_part = {"type": "text", "text": text}

    content_parts.append(image_part)
    content_parts.append(text_part)

    return [HumanMessage(content=content_parts)]


from langchain_core.output_parsers import StrOutputParser

chain = prompt_func | llm | StrOutputParser()

query_chain = chain.invoke(
    {"text": "What is the Dollar-based gross retention rate?", "image": image_b64}
)

print(query_chain)
90%
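
If the image is already available as JPEG bytes, the Base64 payload can be built without going through PIL. A minimal sketch follows; build_image_parts is a hypothetical helper, and the content-part layout mirrors prompt_func above.

```python
import base64


def build_image_parts(image_bytes, text):
    """Build image and text content parts from raw JPEG bytes."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return [
        {"type": "image_url", "image_url": f"data:image/jpeg;base64,{encoded}"},
        {"type": "text", "text": text},
    ]


# Placeholder bytes for illustration; real use would read them from a JPEG file
parts = build_image_parts(b"abc", "What is in this image?")
print(parts[0]["image_url"])
```

The returned list can be passed as the content of a HumanMessage, as prompt_func does.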

Concurrency Features

Ollama (version 0.1.33 or later) supports concurrent inference for a single model and loading multiple models simultaneously.

Start the Ollama server with the following environment variables set:

  • OLLAMA_NUM_PARALLEL: Handle multiple requests simultaneously for a single model
  • OLLAMA_MAX_LOADED_MODELS: Load multiple models simultaneously

Example: OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=4 ollama serve
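
The same launch can be scripted from Python. A minimal sketch, assuming the ollama binary is on PATH; ollama_serve_env and start_ollama_server are hypothetical helper names, and only the two environment variables listed above are set.

```python
import os
import subprocess


def ollama_serve_env(num_parallel, max_loaded_models, base_env=None):
    """Copy the environment and add the two concurrency settings."""
    env = dict(os.environ if base_env is None else base_env)
    env["OLLAMA_NUM_PARALLEL"] = str(num_parallel)
    env["OLLAMA_MAX_LOADED_MODELS"] = str(max_loaded_models)
    return env


def start_ollama_server(num_parallel=4, max_loaded_models=4):
    """Start `ollama serve` in the background and return the process handle."""
    env = ollama_serve_env(num_parallel, max_loaded_models)
    return subprocess.Popen(["ollama", "serve"], env=env)
```

Calling start_ollama_server() corresponds to the shell example above; the caller is responsible for terminating the returned process.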

Learn more about configuring Ollama server in the official guide.

