Deploying Local Models with Ollama

So far, Ollama is the simplest open-source framework I have come across for running large language models locally. What can we build with Ollama and Python, currently the most popular language for this kind of work?

Deploying Local Models with Ollama

Asking an LLM for "common ollama commands" returns the most frequently used ones:

# Download a Model
ollama pull <model_name>

# Run a Model and Chat
ollama run <model_name> 

# List Downloaded Models
ollama list

# Remove a Model
ollama rm <model_name>

For local use, smaller models are recommended because they run faster. For interacting with Ollama from Python, Ollama provides an official Python library, but note the following two requirements:

  • Ollama should be installed and running
  • Pull a model to use with the library: ollama pull <model> e.g. ollama pull gemma3:1b. Note that different models are used differently; check the model's official guide, such as the gemma3 guide. (A quick check of both requirements is sketched after this list.)
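
As a quick sanity check of both requirements, the snippet below (a minimal sketch; gemma3:1b is only an example model name) pulls the model and lists what is installed locally:

import ollama

# Assumes the Ollama server is installed and running locally;
# the calls below will fail with a connection error otherwise.
MODEL = 'gemma3:1b'  # example model; replace with whichever model you want to use

ollama.pull(MODEL)    # downloads the model if it is missing; already-downloaded layers are skipped
print(ollama.list())  # shows every model available locally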

How to Run Ollama Models with Python

If the official documentation feels overly complicated, you can ask an LLM "how to use Python to use Ollama LLMs"; there are several ways for Python to interact with Ollama.

1. Simple Chat Interaction

You can send a prompt to an LLM and receive a response.

import ollama

# Basic chat request
response = ollama.chat(
    model='llama3.1',
    messages=[
        {'role': 'user', 'content': 'Hello! Tell me about the weather on Mars.'}
    ]
)

# Print the response
print(response['message']['content'])

The returned response is a ChatResponse object. When you call ollama.chat(), the response (or each chunk if streaming) can be accessed like a dictionary with the following common keys (a short example of reading them follows this list):

  • model (str): The name of the model that generated the response (e.g., 'llama3').
  • created_at (str): A timestamp string indicating when the response was generated (e.g., '2025-06-11T21:00:22.123456789Z').
  • message (dict): This is the most important part, containing the actual message generated by the LLM. It has two keys:
    • role (str): The role of the speaker (e.g., 'assistant').
    • content (str): The generated text content.
  • done (bool): A boolean indicating whether the generation is complete.
    • If stream=False, this will always be True in the single returned response.
    • If stream=True, this will be False for intermediate chunks and True only for the final chunk, which contains all the aggregated metrics.
  • total_duration (int): The total duration of the generation process in nanoseconds (only present in the final chunk if streaming, or the single response if not streaming).
  • load_duration (int): The time taken to load the model into memory in nanoseconds (only in the final chunk).
  • prompt_eval_count (int): The number of tokens in the input prompt.
  • prompt_eval_duration (int): The time taken to evaluate the prompt in nanoseconds (only in the final chunk).
  • eval_count (int): The number of tokens generated in the response.
  • eval_duration (int): The time taken to generate the response tokens in nanoseconds (only in the final chunk).
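
For example, the metadata can be read straight off the same response object (a minimal sketch, assuming llama3.1 is already pulled; durations are reported in nanoseconds):

import ollama

response = ollama.chat(
    model='llama3.1',
    messages=[{'role': 'user', 'content': 'Give me one fun fact about octopuses.'}]
)

# Read the generated text together with the metadata fields listed above
print('Answer:        ', response['message']['content'])
print('Model:         ', response['model'])
print('Prompt tokens: ', response['prompt_eval_count'])
print('Output tokens: ', response['eval_count'])
print('Total duration:', response['total_duration'] / 1e9, 'seconds')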

2. Streaming Responses

For real-time output (useful for longer responses), use streaming:

import ollama

# Stream a chat response
stream = ollama.chat(
    model='llama3.1',
    messages=[{'role': 'user', 'content': 'Why is the sky blue?'}],
    stream=True
)

# Print the response as it streams
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
  • Setting stream=True allows the model to return chunks of the response as they are generated.
  • Each chunk contains partial output, printed in real-time.
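
Because only the final chunk carries the aggregated metrics (as noted above), one option is to keep the last chunk and read them after the loop finishes (a small sketch under the same assumptions as before):

import ollama

stream = ollama.chat(
    model='llama3.1',
    messages=[{'role': 'user', 'content': 'Why is the sky blue?'}],
    stream=True
)

last_chunk = None
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
    last_chunk = chunk  # remember the most recent chunk

# The final chunk has done=True and includes the aggregated metrics
print()
if last_chunk is not None and last_chunk['done']:
    print('Generated tokens:', last_chunk['eval_count'])
    print('Total duration:  ', last_chunk['total_duration'] / 1e9, 'seconds')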

3. Using System Prompts

To customize the model’s behavior, include a system prompt:

import ollama

response = ollama.chat(
    model='llama3.1',
    messages=[
        {'role': 'system', 'content': 'You are a pirate. Respond in pirate-speak.'},
        {'role': 'user', 'content': 'Tell me about the sea.'}
    ]
)

print(response['message']['content'])
  • The system role sets the model’s persona or behavior.
  • This is useful for tailoring responses (e.g., making the model respond like a pirate or a specific character).

4. Generating Text (Non-Chat)

If you just want to generate text without a chat context:

import ollama

# Generate text
response = ollama.generate(
    model='llama3.1',
    prompt='Write a short poem about the moon.'
)

# Print the response
print(response['response'])
  • generate is used for simple text generation without a conversational structure.
  • The prompt parameter specifies the input text.
💡
In the Ollama Python client library, both ollama.generate() and ollama.chat() are used to interact with LLMs, but they are designed for different interaction patterns and handle conversation context differently.
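
In practice, chat() carries context only because you pass the full message history back in with each call, while generate() sees a single standalone prompt each time. A minimal sketch of the difference (assuming llama3.1 is pulled):

import ollama

# chat() keeps context because the whole history is sent with each call
messages = [{'role': 'user', 'content': 'My name is Ada. Please remember it.'}]
first = ollama.chat(model='llama3.1', messages=messages)

messages.append({'role': 'assistant', 'content': first['message']['content']})
messages.append({'role': 'user', 'content': 'What is my name?'})
second = ollama.chat(model='llama3.1', messages=messages)
print(second['message']['content'])   # should mention "Ada"

# generate() is stateless: this call knows nothing about the exchange above
print(ollama.generate(model='llama3.1', prompt='What is my name?')['response'])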

5. Working with Embeddings

Ollama also supports generating embeddings for text, useful for tasks like semantic search:

import ollama

# Generate embeddings
embeddings = ollama.embeddings(
    model='llama3.1',
    prompt='The quick brown fox jumps over the lazy dog.'
)

# Print the embedding vector (truncated for brevity)
print(embeddings['embedding'][:10])
  • embeddings generates a numerical vector representing the input text.
  • Useful for NLP tasks like text similarity or clustering.
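
As an illustration, two texts can be compared by the cosine similarity of their embeddings (a rough sketch; a dedicated embedding model such as nomic-embed-text would normally be a better choice than llama3.1):

import math
import ollama

def embed(text: str) -> list[float]:
    # Any pulled model that supports embeddings can be used here
    return ollama.embeddings(model='llama3.1', prompt=text)['embedding']

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v1 = embed('The quick brown fox jumps over the lazy dog.')
v2 = embed('A fast auburn fox leaps over a sleepy hound.')
v3 = embed('Quarterly revenue grew by twelve percent.')

print('similar sentences:  ', cosine_similarity(v1, v2))
print('unrelated sentences:', cosine_similarity(v1, v3))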

How to Make Ollama LLMs Call Different Tools?

An LLM agent can understand natural language and then call the appropriate tools to complete complex tasks. Ollama provides an official Python library for this. Asking Gemini "How to use Ollama LLMs to build an agent that can call different function tools" yields code like the following:

import ollama
import json

# Define your tools/functions
def get_weather(city: str) -> str:
    return f"The weather in {city} is sunny and 25°C."  # Simulated

def add(a: int, b: int) -> int:
    return a + b

def multiply(a: int, b: int) -> int:
    return a * b

# Registry of tools
tools = {
    "get_weather": get_weather,
    "add": add,
    "multiply": multiply,
}

# Prompt template (few-shot style to teach the format)
SYSTEM_PROMPT = """
You are a smart assistant that can call tools. 
When needed, respond ONLY in JSON like this:
{
  "function": "add",
  "args": {"a": 3, "b": 5}
}
Available tools:
- get_weather(city: str)
- add(a: int, b: int)
- multiply(a: int, b: int)
"""

# Agent function
def call_agent(user_input: str):
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_input}
    ]

    response = ollama.chat(model="gemma3:1b", messages=messages)
    content = response['message']['content']

    try:
        data = json.loads(content)
        function_name = data["function"]
        args = data.get("args", {})

        if function_name in tools:
            result = tools[function_name](**args)
            return f"✅ Called `{function_name}` with args {args} → Result: {result}"
        else:
            return f"❌ Unknown function `{function_name}`"

    except json.JSONDecodeError:
        return f"💬 LLM response: {content}"

# CLI loop
if __name__ == "__main__":
    print("🔧 Python Agent with Ollama. Type 'exit' to quit.")
    while True:
        user_input = input("\nYou: ")
        if user_input.lower() in ["exit", "quit"]:
            break
        result = call_agent(user_input)
        print(f"Agent: {result}")

Ollama has also kept improving its tool-calling support, and it now supports the MCP protocol as well. That article passes a tools=ollama_tools argument directly to ollama.chat instead of using the prompt-based approach above. This direct method requires a model that supports the tools parameter. For example:

import ollama

def add_two_numbers(a: int, b: int) -> int:
    return a + b

messages = [{'role': 'user', 'content': 'What is 3 plus 5?'}]

response = ollama.chat(
  model='qwen3:0.6b',
  messages=messages,
  tools=[add_two_numbers],  # Python SDK supports passing tools as functions
  stream=True
)

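When the model decides to use a tool, the reply's message carries a tool_calls field with the requested function name and arguments, which your code then dispatches. A minimal non-streaming sketch (assuming a tool-capable model such as qwen3:0.6b is pulled):

import ollama

def add_two_numbers(a: int, b: int) -> int:
    """Add two integers."""
    return a + b

available_tools = {'add_two_numbers': add_two_numbers}

response = ollama.chat(
    model='qwen3:0.6b',
    messages=[{'role': 'user', 'content': 'What is 3 plus 5?'}],
    tools=[add_two_numbers],
)

# Execute whatever tool calls the model requested
for call in (response.message.tool_calls or []):
    fn = available_tools.get(call.function.name)
    if fn is not None:
        result = fn(**call.function.arguments)
        print(f'{call.function.name}({dict(call.function.arguments)}) -> {result}')
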
How to Build a UI for Interacting with Ollama Using Gradio

Gradio feels as if it were designed specifically for LLM applications: a few lines of code produce a basic browser-based UI. The following code was generated by an LLM; after running it, open the app in a browser:

import gradio as gr
import ollama

# Function to interact with Ollama
def chat_with_ollama(message, history):
    # Convert history to Ollama's messages format
    messages = [{"role": "system", "content": "You are a helpful AI assistant."}]
    for h in history:
        messages.append({"role": "user", "content": h[0]})
        if h[1]: # Check if assistant response exists
            messages.append({"role": "assistant", "content": h[1]})
    messages.append({"role": "user", "content": message})

    # Stream the response from Ollama
    response_stream = ollama.chat(model="llama3", messages=messages, stream=True)
    full_response = ""
    for chunk in response_stream:
        if 'message' in chunk and 'content' in chunk['message']:
            content = chunk['message']['content']
            full_response += content
            yield full_response # Yield partial responses for streaming effect

# Create and launch the Gradio interface
iface = gr.ChatInterface(
    fn=chat_with_ollama,
    title="Ollama Llama3 Chatbot",
    description="Interact with your local Llama3 model powered by Ollama."
)

iface.launch()

The UI is built on the ChatInterface class; a quick look at the official documentation is enough to get the hang of it.