How to Use Ollama Local Models

So far, Ollama is the simplest open-source framework I have come across for running large language models locally. What can be built by pairing it with Python, currently the most popular language for this kind of work?

How to Use Ollama Local Models

Searching an LLM for "common ollama command" returns the most frequently used Ollama commands:

# Download a Model
ollama pull <model_name>

# Run a Model and Chat
ollama run <model_name> 

# List Downloaded Models
ollama list

# Remove a Model
ollama rm <model_name>

Smaller local models are recommended, since they run noticeably faster. For interacting with Ollama from Python, Ollama provides an official Python library, but note the following two prerequisites (a quick Python-side sanity check is sketched after this list):

  • Ollama should be installed and running
  • Pull a model to use with the library: ollama pull <model>, e.g. ollama pull llama3.2.
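
Before trying the examples below, it is worth checking from Python that the library can actually reach the Ollama server and that a model has been pulled. A minimal sanity-check sketch (the model name llama3.1 is only an example):

import ollama

try:
    models = ollama.list()  # raises if the Ollama server is not running
    print('Ollama is reachable. Locally available models:')
    print(models)
except Exception as exc:
    print(f'Could not reach Ollama: {exc}')
    print('Make sure Ollama is running, then pull a model, e.g. `ollama pull llama3.1`.')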

How to Run Ollama Models with Python

If the official documentation feels overly complicated, you can ask an LLM "how to use python to use Ollama LLMs"; it turns out there are several ways for Python to interact with Ollama.

1. Simple Chat Interaction

You can send a prompt to an LLM and receive a response.

import ollama

# Basic chat request
response = ollama.chat(
    model='llama3.1',
    messages=[
        {'role': 'user', 'content': 'Hello! Tell me about the weather on Mars.'}
    ]
)

# Print the response
print(response['message']['content'])

The returned response is a ChatResponse object; it can also be indexed like a dictionary, which is how the examples here access it (a short sketch after the following list shows how to read the generation metrics).

When you call ollama.chat(), the response (or each chunk if streaming) is a dictionary with the following common keys:

  • model (str): The name of the model that generated the response (e.g., 'llama3').
  • created_at (str): A timestamp string indicating when the response was generated (e.g., '2025-06-11T21:00:22.123456789Z').
  • message (dict): This is the most important part, containing the actual message generated by the LLM. It has two keys:
    • role (str): The role of the speaker (e.g., 'assistant').
    • content (str): The generated text content.
  • done (bool): A boolean indicating whether the generation is complete.
    • If stream=False, this will always be True in the single returned response.
    • If stream=True, this will be False for intermediate chunks and True only for the final chunk, which contains all the aggregated metrics.
  • total_duration (int): The total duration of the generation process in nanoseconds (only present in the final chunk if streaming, or the single response if not streaming).
  • load_duration (int): The time taken to load the model into memory in nanoseconds (only in the final chunk).
  • prompt_eval_count (int): The number of tokens in the input prompt.
  • prompt_eval_duration (int): The time taken to evaluate the prompt in nanoseconds (only in the final chunk).
  • eval_count (int): The number of tokens generated in the response.
  • eval_duration (int): The time taken to generate the response tokens in nanoseconds (only in the final chunk).
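
As a quick illustration of these fields, the following sketch prints the generation metadata after a non-streaming call (dictionary-style access as in the examples above; durations are reported in nanoseconds, so they are converted to seconds here):

import ollama

response = ollama.chat(
    model='llama3.1',
    messages=[{'role': 'user', 'content': 'Give me one fun fact about octopuses.'}]
)

# Metadata fields described above
print('model:        ', response['model'])
print('done:         ', response['done'])
print('prompt tokens:', response['prompt_eval_count'])
print('output tokens:', response['eval_count'])
print('total seconds:', response['total_duration'] / 1e9)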

2. Streaming Responses

For real-time output (useful for longer responses), use streaming:

import ollama

# Stream a chat response
stream = ollama.chat(
    model='llama3.1',
    messages=[{'role': 'user', 'content': 'Why is the sky blue?'}],
    stream=True
)

# Print the response as it streams
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
  • Setting stream=True allows the model to return chunks of the response as they are generated.
  • Each chunk contains partial output, printed in real-time.
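
Because only the final chunk (the one with done set to True) carries the aggregated metrics, a common pattern is to accumulate the text while streaming and read the metrics from that last chunk. A minimal sketch:

import ollama

stream = ollama.chat(
    model='llama3.1',
    messages=[{'role': 'user', 'content': 'Why is the sky blue?'}],
    stream=True
)

full_text = ''
for chunk in stream:
    piece = chunk['message']['content']
    full_text += piece
    print(piece, end='', flush=True)
    if chunk['done']:
        # Only the final chunk carries eval_count / eval_duration (nanoseconds)
        print(f"\n[{chunk['eval_count']} tokens in {chunk['eval_duration'] / 1e9:.2f} s]")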

3. Using System Prompts

To customize the model’s behavior, include a system prompt:

import ollama

response = ollama.chat(
    model='llama3.1',
    messages=[
        {'role': 'system', 'content': 'You are a pirate. Respond in pirate-speak.'},
        {'role': 'user', 'content': 'Tell me about the sea.'}
    ]
)

print(response['message']['content'])
  • The system role sets the model’s persona or behavior.
  • This is useful for tailoring responses (e.g., making the model respond like a pirate or a specific character).

4. Generating Text (Non-Chat)

If you just want to generate text without a chat context:

import ollama

# Generate text
response = ollama.generate(
    model='llama3.1',
    prompt='Write a short poem about the moon.'
)

# Print the response
print(response['response'])
  • generate is used for simple text generation without a conversational structure.
  • The prompt parameter specifies the input text.
💡
In the Ollama Python client library, both ollama.generate() and ollama.chat() are used to interact with LLMs, but they are designed for different interaction patterns and handle conversation context differently.
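
A minimal sketch of that difference, using the same example model: generate() is stateless (every call stands alone), while chat() only "remembers" what you keep appending to the messages list yourself:

import ollama

# generate(): each call is independent, so the second call has no memory of the first
ollama.generate(model='llama3.1', prompt='Remember the number 42.')
r = ollama.generate(model='llama3.1', prompt='What number did I ask you to remember?')
print(r['response'])  # the model can only guess

# chat(): context is carried explicitly in the messages list you maintain
history = [{'role': 'user', 'content': 'Remember the number 42.'}]
first = ollama.chat(model='llama3.1', messages=history)
history.append({'role': 'assistant', 'content': first['message']['content']})
history.append({'role': 'user', 'content': 'What number did I ask you to remember?'})
second = ollama.chat(model='llama3.1', messages=history)
print(second['message']['content'])  # now the history contains the answer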

5. Working with Embeddings

Ollama also supports generating embeddings for text, useful for tasks like semantic search:

import ollama

# Generate embeddings
embeddings = ollama.embeddings(
    model='llama3.1',
    prompt='The quick brown fox jumps over the lazy dog.'
)

# Print the embedding vector (truncated for brevity)
print(embeddings['embedding'][:10])
  • embeddings generates a numerical vector representing the input text.
  • Useful for NLP tasks like text similarity or clustering.
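
As a small illustration of the similarity use case, the sketch below embeds two sentences and compares them with cosine similarity (pure Python, no NumPy; the model name is just the example used above):

import math
import ollama

def embed(text: str) -> list[float]:
    return ollama.embeddings(model='llama3.1', prompt=text)['embedding']

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

v1 = embed('The quick brown fox jumps over the lazy dog.')
v2 = embed('A fast auburn fox leaps over a sleepy dog.')
print(f'cosine similarity: {cosine(v1, v2):.3f}')  # closer to 1.0 means more similar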

How to Build a UI for an Ollama Local Model with Gradio

Gradio feels as if it were designed specifically for LLM applications: a few lines of code produce a basic browser-based UI. The following code was generated by an LLM; once it is running, the app opens in the browser:

import gradio as gr
import ollama

# Function to interact with Ollama
def chat_with_ollama(message, history):
    # Convert history to Ollama's messages format
    messages = [{"role": "system", "content": "You are a helpful AI assistant."}]
    for h in history:
        messages.append({"role": "user", "content": h[0]})
        if h[1]: # Check if assistant response exists
            messages.append({"role": "assistant", "content": h[1]})
    messages.append({"role": "user", "content": message})

    # Stream the response from Ollama
    response_stream = ollama.chat(model="llama3", messages=messages, stream=True)
    full_response = ""
    for chunk in response_stream:
        if 'message' in chunk and 'content' in chunk['message']:
            content = chunk['message']['content']
            full_response += content
            yield full_response # Yield partial responses for streaming effect

# Create and launch the Gradio interface
iface = gr.ChatInterface(
    fn=chat_with_ollama,
    title="Ollama Llama3 Chatbot",
    description="Interact with your local Llama3 model powered by Ollama."
)

iface.launch()

The UI is built on the ChatInterface class; a quick look at the official Gradio guide is enough to get started.
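
If you want the app reachable from other machines or on a fixed port, launch() accepts the usual Gradio options; a small sketch (the values are only examples):

# Instead of the plain iface.launch() above:
iface.launch(
    server_name='0.0.0.0',  # listen on all interfaces instead of localhost only
    server_port=7860,       # example fixed port
    share=False             # set True to get a temporary public gradio.live link
)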

How to Let Ollama LLMs Call Different Tools

An LLM agent can understand natural language and then call tools to complete complex tasks; in essence, an agent is a set of functions that the model decides to invoke after interpreting the user's instructions. Searching Gemini for "How to use Ollama LLMs to build an agent that can call different function tools" yields the following code:

import ollama
import json
import datetime
import requests # For a mock web search tool


# --- 1. Define Tool Functions ---
# These are regular Python functions that our LLM agent can "call".
# It's crucial to provide clear docstrings and type hints, as the LLM
# uses this information to understand when and how to use the tool.

def get_current_time(timezone: str = "UTC") -> str:
    """
    Returns the current time in a specified timezone.
    Args:
        timezone (str): The timezone to get the current time for (e.g., "America/New_York", "Europe/London", "Asia/Shanghai").
                        Defaults to "UTC" if not specified.
    Returns:
        str: The current time in the specified timezone.
    """
    try:
        from zoneinfo import ZoneInfo # Python 3.9+
    except ImportError:
        # Fallback for older Python versions or systems without zoneinfo
        # Note: This might not be perfectly accurate for all timezones
        # without a proper timezone library like `pytz`.
        print("Warning: 'zoneinfo' not found. Using simple datetime for time.")
        now = datetime.datetime.now()
        return f"Current time (approx) in {timezone}: {now.strftime('%H:%M:%S')}"

    try:
        tz = ZoneInfo(timezone)
        current_time = datetime.datetime.now(tz).strftime("%Y-%m-%d %H:%M:%S %Z%z")
        return f"Current time in {timezone}: {current_time}"
    except Exception as e:
        return f"Error getting time for timezone {timezone}: {e}. Please provide a valid IANA timezone name (e.g., 'America/New_York')."


def add_numbers(num1: float, num2: float) -> float:
    """
    Adds two numbers and returns their sum.
    Args:
        num1 (float): The first number.
        num2 (float): The second number.
    Returns:
        float: The sum of the two numbers.
    """
    return num1 + num2

def mock_web_search(query: str) -> str:
    """
    Performs a mock web search and returns a predefined result.
    This is a placeholder for actual web search functionality.
    Args:
        query (str): The search query.
    Returns:
        str: A mock search result.
    """
    print(f"DEBUG: Performing mock web search for: '{query}'")
    # In a real application, you would use a library like `requests`
    # to call a search API (e.g., Google Search API, DuckDuckGo Search API).
    # For demonstration, we return a fixed string.
    if "python" in query.lower() and "agent" in query.lower():
        return "Mock Search Result: Python agents are software entities that can perceive their environment, make decisions, and take actions. They often use LLMs for reasoning."
    elif "capital of france" in query.lower():
        return "Mock Search Result: The capital of France is Paris."
    else:
        return f"Mock Search Result for '{query}': Information not found in mock database."


# A dictionary to map tool names (as strings) to their actual Python function objects.
# This is essential for executing the tool once the LLM decides to call it.
available_tools = {
    "get_current_time": get_current_time,
    "add_numbers": add_numbers,
    "mock_web_search": mock_web_search,
}


# --- 2. Initialize the LLM and Conversation History ---
# We'll use a list to store the conversation messages, which helps the LLM
# maintain context and reason about previous turns.

model_name = "llama3.2" # Make sure this model is pulled in Ollama
# You can change to 'qwen2.5', 'mistral', etc., if you have them pulled.

# System message to guide the LLM on its capabilities and when to use tools.
# This prompt is crucial for effective tool calling.
messages = [
    {
        "role": "system",
        "content": (
            "You are a helpful AI assistant. You have access to the following tools:\n"
            "- `get_current_time(timezone: str)`: Returns the current time in a specified timezone (e.g., 'America/New_York', 'Europe/London', 'Asia/Shanghai').\n"
            "- `add_numbers(num1: float, num2: float)`: Adds two numbers and returns their sum.\n"
            "- `mock_web_search(query: str)`: Performs a mock web search and returns a predefined result. Use this for general knowledge questions or if you need to 'look up' information.\n\n"
            "If the user asks a question that can be answered by one of these tools, call the tool. "
            "If you need to use multiple tools, call them sequentially. "
            "Always respond in a helpful and clear manner. If a tool call doesn't directly answer the question, "
            "explain what you found and how it relates to the user's query."
        ),
    }
]

print(f"Starting conversation with Ollama model: {model_name}")
print("Type 'exit' to end the chat.")
print("-" * 50)


# --- 3. Agent Loop ---
# This loop continuously takes user input, sends it to the LLM,
# processes any tool calls, and responds to the user.

while True:
    user_input = input("You: ")
    if user_input.lower() == 'exit':
        print("Exiting chat. Goodbye!")
        break

    # Add user's message to the conversation history
    messages.append({"role": "user", "content": user_input})

    try:
        # Call Ollama chat API with the current conversation history and available tools
        response = ollama.chat(
            model=model_name,
            messages=messages,
            tools=[
                # Ollama automatically converts Python functions to JSON schemas
                # if you pass the function objects directly in the 'tools' list.
                get_current_time,
                add_numbers,
                mock_web_search,
            ]
        )

        # Process the LLM's response
        tool_calls = response['message'].get('tool_calls')

        if tool_calls:
            # The LLM wants to call a tool. Keep the assistant message that contains
            # the tool calls in the history so the follow-up request has full context.
            messages.append(response['message'])
            print("AI (Tool Call): The model requested to call tools.")
            for tool_call in tool_calls:
                tool_name = tool_call['function']['name']
                tool_args = tool_call['function']['arguments']

                print(f"  Calling tool: {tool_name} with arguments: {tool_args}")

                if tool_name in available_tools:
                    tool_function = available_tools[tool_name]
                    try:
                        # Execute the tool function with the arguments provided by the LLM
                        tool_output = tool_function(**tool_args)
                        print(f"  Tool output: {tool_output}")

                        # Add the tool output back to the conversation history
                        messages.append({
                            "role": "tool",
                            "content": json.dumps(tool_output), # Ollama expects string content for tool messages
                            "tool_call_id": tool_call.get('id', 'default_id') # Include tool_call_id if available
                        })

                        # Now, re-call the LLM with the tool output to get the final answer
                        # This is a critical step for the LLM to incorporate the tool's result.
                        final_response_after_tool = ollama.chat(
                            model=model_name,
                            messages=messages,
                        )
                        ai_content = final_response_after_tool['message']['content']
                        print(f"AI: {ai_content}")
                        messages.append({"role": "assistant", "content": ai_content})

                    except Exception as e:
                        error_message = f"Error executing tool '{tool_name}': {e}"
                        print(f"ERROR: {error_message}")
                        messages.append({
                            "role": "tool",
                            "content": json.dumps({"error": error_message}),
                            "tool_call_id": tool_call.get('id', 'default_id')
                        })
                        # Re-call LLM to inform about tool error
                        error_response = ollama.chat(
                            model=model_name,
                            messages=messages,
                        )
                        ai_error_content = error_response['message']['content']
                        print(f"AI: {ai_error_content}")
                        messages.append({"role": "assistant", "content": ai_error_content})

                else:
                    # This case should ideally not happen if the LLM is well-behaved
                    # and only calls tools it knows about from `tools` parameter.
                    print(f"ERROR: LLM tried to call an unknown tool: {tool_name}")
                    unknown_tool_message = f"I tried to call a tool named '{tool_name}' but it's not available."
                    messages.append({"role": "assistant", "content": unknown_tool_message})
                    print(f"AI: {unknown_tool_message}")

        else:
            # The LLM did not request a tool call, it just responded with text
            ai_content = response['message']['content']
            print(f"AI: {ai_content}")
            messages.append({"role": "assistant", "content": ai_content})

    except ollama.ResponseError as e:
        print(f"Ollama API Error: {e}")
        print("Please ensure Ollama server is running and the model is available.")
        messages.pop() # Remove the last user message to avoid re-sending
    except requests.exceptions.ConnectionError:
        print("Connection Error: Could not connect to Ollama server.")
        print("Please ensure Ollama is running (`ollama serve`) and accessible.")
        messages.pop() # Remove the last user message
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        messages.pop() # Remove the last user message for general errors
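
The code above passes the Python function objects directly in the tools list and lets Ollama derive the JSON schemas from the signatures and docstrings. If you prefer to control the schema yourself, a tool can also be described explicitly as a JSON-schema dictionary; a hand-written sketch for get_current_time (this schema is written by hand for illustration, not generated from the function):

# Explicit tool definition equivalent to passing get_current_time directly
get_current_time_tool = {
    'type': 'function',
    'function': {
        'name': 'get_current_time',
        'description': 'Returns the current time in a specified IANA timezone.',
        'parameters': {
            'type': 'object',
            'properties': {
                'timezone': {
                    'type': 'string',
                    'description': "IANA timezone name, e.g. 'Asia/Shanghai'. Defaults to 'UTC'.",
                },
            },
            'required': [],
        },
    },
}

# It can then be passed to ollama.chat() in place of the function object:
# response = ollama.chat(model=model_name, messages=messages, tools=[get_current_time_tool])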