Multimodal Messages: Text, Image, and Audio Input

Every message Aria has handled so far has been plain text. But Julie's real life isn't all text — sometimes she'll forward a screenshot of an email instead of retyping it, or leave a quick voice note dictating a reply while she's walking between meetings. Right now, Aria has no way to handle either.

In this article, we'll teach Aria to accept images and audio, not just text. Along the way, you'll learn about base64 encoding — a concept that shows up constantly once you start working with files and AI APIs, and one that's genuinely simple once it's explained properly.

🟡 Skill level: Intermediate.

Quick Reference

When to use this: Whenever an agent needs to process something other than plain text — a screenshot, a photo, a voice recording.

Basic syntax:

from langchain.messages import HumanMessage

message = HumanMessage(content=[
    {"type": "text", "text": "What does this say?"},
    {"type": "image", "base64": img_b64, "mime_type": "image/png"},
])

Common patterns:

content can be a plain string (what we've used so far) or a list of typed parts — text, image, or audio
Files need to be base64-encoded before they can be included in a message
Not every model supports every modality — you have to pick one that does

Gotchas:

⚠️ Using a model that doesn't support image or audio input will fail, sometimes with a confusing error — always check the model's capabilities first.
⚠️ Base64 is an encoding, not compression or encryption: encoded files are about 33% larger than the originals, and anyone can decode them.

What You Need to Know First

Everything from Articles 1–4
Basic familiarity with reading files in Python (open(...))

What We'll Cover in This Article

How a message's content can hold more than just plain text
What base64 encoding is and why AI APIs need it
How to send an image to an agent
How to send audio to an agent, and why that requires choosing a different model

What We'll Explain Along the Way

What a MIME type is
Why not all models support all input types

From Plain Text to Multi-Part Messages

Every HumanMessage we've created so far has looked like this:

HumanMessage(content="What's in my inbox?")

Here, content is just a plain string. But content can also be a list of parts, where each part is a dictionary describing one piece of the message. A part with "type": "text" behaves exactly like a plain string would:

# Purpose: Show that the list form behaves identically to a plain string
# Context: A structural change with no behavior change — sets up image/audio next
# Input: The same kind of question as before, just structured differently
# Output: An identical response to what a plain string would produce

from dotenv import load_dotenv
load_dotenv()

from langchain.agents import create_agent
from langchain.messages import HumanMessage

agent = create_agent(model="gpt-5-nano")

question = HumanMessage(content=[
    {"type": "text", "text": "What's a polite way to decline a coffee invite?"}
])

response = agent.invoke({"messages": [question]})
print(response["messages"][-1].content)

Nothing about the behavior changed — this responds exactly like the plain-string version would. What changed is the shape. And once content is a list, we can add other kinds of parts alongside the text — which is exactly what image and audio input are.
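To make the two shapes concrete, here is a small sketch in plain Python with no LangChain calls. The name as_parts is a hypothetical helper invented for this illustration, not a library function; it only shows that a plain string and a one-element list of text parts carry the same information.

```python
def as_parts(content):
    """Return message content in list-of-parts form.

    Hypothetical helper for illustration only; not a LangChain function.
    """
    if isinstance(content, str):
        # A plain string corresponds to a single text part.
        return [{"type": "text", "text": content}]
    # Already a list of parts: return a shallow copy unchanged.
    return list(content)


plain = as_parts("What's in my inbox?")
listed = as_parts([{"type": "text", "text": "What's in my inbox?"}])
print(plain == listed)  # True
```

Once everything is in list form, adding an image or audio part is just appending another dictionary.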

A Quick Detour: What Is Base64?

Before sending an image or audio file, we need to understand one concept: base64 encoding.

AI APIs communicate using JSON, a text-based format. But an image file is binary data: a sequence of bytes that isn't text at all, and can't be safely dropped into a JSON message as-is. Base64 is a way of representing binary data using only 64 plain text characters (the letters A-Z and a-z, the digits 0-9, "+" and "/", with "=" used for padding), so that binary content can be safely embedded inside a text-based format like JSON.

Think of it like translating a photograph into Morse code so it can be sent over a telegraph wire that only understands dots and dashes — the wire can't carry a photo directly, but it can carry a text-based encoding of one, which gets decoded back into the original on the other end. Base64 is doing the same job: converting binary data into a text-safe encoding, with no loss of information, so it can travel through a text-only channel.

One important clarification: base64 is not compression, and it's not encryption. A base64-encoded file is about 33% larger than the original, because every 3 bytes of input become 4 characters of output. Anyone can decode it back to the original, so it provides no security at all. It's purely a format conversion.

Python's standard library includes everything needed to do this — no extra installation required:

import base64
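As a quick illustration of why the encoding step is needed, the sketch below tries to put raw bytes into JSON directly, then does the same through base64. The byte values and the "data" key are arbitrary choices for the demo, not anything an API requires.

```python
import base64
import json

raw = bytes([0x89, 0x50, 0x4E, 0x47, 0x00, 0xFF])  # arbitrary binary bytes

# Raw bytes cannot be placed in JSON directly.
try:
    json.dumps({"data": raw})
    direct_ok = True
except TypeError:
    direct_ok = False

# Base64 turns the bytes into an ASCII string, which JSON accepts.
encoded = base64.b64encode(raw).decode("ascii")
payload = json.dumps({"data": encoded})

# The receiver reverses both steps and gets the identical bytes back.
restored = base64.b64decode(json.loads(payload)["data"])

print(direct_ok)        # False
print(restored == raw)  # True
```

The round trip is lossless: the decoded bytes are identical to the originals.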

Giving Aria Eyes: Image Input

Let's say a colleague emailed Julie a photo of a handwritten note, and Julie wants Aria to read it. First, we read the image file and base64-encode it:

# Purpose: Read an image file from disk and encode it for sending to the agent
# Context: Prepares an image to be included in a multimodal message
# Input: A path to an image file on disk
# Output: A base64-encoded string representing the image's binary data

import base64

# Step 1: Open the image file in binary mode ("rb" = read bytes)
with open("handwritten_note.png", "rb") as image_file:
    image_bytes = image_file.read()

# Step 2: Encode the raw bytes as base64, then decode to a plain Python string
# (base64.b64encode returns bytes; .decode("utf-8") turns that into a str)
img_b64 = base64.b64encode(image_bytes).decode("utf-8")

Now we build a multimodal message — text and image together — and send it to an agent. We're using "gpt-5-nano" here as a placeholder model identifier; in practice, swap in whichever model your provider offers that explicitly supports image input.

# Purpose: Send an image alongside text to an agent that supports image input
# Context: Aria reads a photo Julie received, instead of typed text
# Input: The base64-encoded image from the previous step
# Output: A response describing or transcribing what's in the image

from langchain.agents import create_agent
from langchain.messages import HumanMessage

agent = create_agent(model="gpt-5-nano")

multimodal_question = HumanMessage(content=[
    {"type": "text", "text": "Can you transcribe what this handwritten note says?"},
    {"type": "image", "base64": img_b64, "mime_type": "image/png"},
])

response = agent.invoke({"messages": [multimodal_question]})
print(response["messages"][-1].content)

Notice the structure of the image part: "type": "image" tells the agent what kind of content this is, "base64" carries the actual encoded data, and "mime_type" tells the model exactly what format the original file was in — "image/png" here, since we encoded a PNG file. Get the mime_type wrong (say, "image/png" for an actual JPEG file) and the model may fail to interpret the data correctly.
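One way to catch a mismatched mime_type before it reaches the model is to look at the first few bytes of the file, since PNG, JPEG, and WAV files each start with a recognizable signature. The function below is a hypothetical helper sketched for this article, covering only those three formats; it is not part of LangChain.

```python
from typing import Optional


def sniff_mime_type(data: bytes) -> Optional[str]:
    """Guess a MIME type from a file's leading bytes (hypothetical helper)."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    # WAV files are RIFF containers with "WAVE" at byte offset 8.
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "audio/wav"
    return None  # unknown format


# A JPEG header followed by filler bytes, standing in for a real file.
print(sniff_mime_type(b"\xff\xd8\xff\xe0" + b"\x00" * 8))  # image/jpeg
```

In practice you would pass in the bytes you already read from disk (image_bytes in the example above) and compare the result with the mime_type you plan to send.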

Giving Aria Ears: Audio Input

Audio works exactly the same way structurally — base64-encode the file, add it as a part in the content list — but with one important difference: not every model that handles text and images can also handle audio. You need to pick a model that explicitly supports audio input.

# Purpose: Read an audio file and encode it, same pattern as the image example
# Context: Julie left a voice note instead of typing a message
# Input: A path to a .wav audio file
# Output: A base64-encoded string representing the audio's binary data

import base64

with open("julies_voice_note.wav", "rb") as audio_file:
    audio_bytes = audio_file.read()

aud_b64 = base64.b64encode(audio_bytes).decode("utf-8")

# Purpose: Send audio to an agent built on a model that supports audio input
# Context: Demonstrates that the modality you need determines which model you choose
# Input: The base64-encoded audio from the previous step
# Output: A response based on what was said in the audio

from langchain.agents import create_agent
from langchain.messages import HumanMessage

# Note the different model here — this one explicitly supports audio input.
# Always check your provider's documentation for which models support which
# modalities; not every model that handles text and images also handles audio.
audio_agent = create_agent(model="gpt-4o-audio-preview")

multimodal_question = HumanMessage(content=[
    {"type": "text", "text": "What is Julie asking for in this voice note?"},
    {"type": "audio", "base64": aud_b64, "mime_type": "audio/wav"},
])

response = audio_agent.invoke({"messages": [multimodal_question]})
print(response["messages"][-1].content)

This is the same pattern as image input — "type": "audio", "base64", "mime_type" — applied to a different kind of file, with a model chosen specifically because it supports that modality.
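Before encoding a voice note, it can help to confirm the file really is readable WAV audio and see how long it is. The sketch below uses Python's standard wave module, which handles uncompressed PCM WAV files; wav_duration_seconds is a hypothetical helper name, and the one second of silence is generated in memory only so the example runs without a file on disk.

```python
import io
import wave


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Return the duration of PCM WAV audio in seconds.

    Invalid input makes wave.open raise an error (such as wave.Error).
    """
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


# Build one second of silence in memory so the example needs no file.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)     # mono
    out.setsampwidth(2)     # 16-bit samples
    out.setframerate(8000)  # 8 kHz
    out.writeframes(b"\x00\x00" * 8000)

print(wav_duration_seconds(buffer.getvalue()))  # 1.0
```

With a real recording, you would pass audio_bytes from the earlier step instead of the generated buffer.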

Common Misconceptions

❌ Misconception: Any model that handles text can also handle images and audio

Reality: Modality support varies model by model. A model built for text and images may not support audio at all, and vice versa.

Why this matters: Using the wrong model for a given input type will fail — sometimes with a clear error, sometimes with confusing or degraded behavior. Always confirm a model's supported modalities before building around it.

Example:

# ❌ Wrong assumption: "if it handles text, it handles everything"
agent = create_agent(model="some-text-only-model")
# sending audio to this agent will fail

# ✅ Correct: pick a model explicitly documented as supporting audio
audio_agent = create_agent(model="gpt-4o-audio-preview")
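If you want this check in code rather than in your head, one option is a small hand-maintained table that you fill in from your provider's documentation. Everything in the sketch below is hypothetical: the model names are placeholders, the table entries are not real capability data, and check_modalities is not a LangChain function.

```python
# Hypothetical, hand-maintained table. The model names are placeholders and
# the entries are NOT real capability data; fill them in from provider docs.
SUPPORTED_INPUTS = {
    "some-text-only-model": {"text"},
    "some-image-model": {"text", "image"},
    "some-audio-model": {"text", "audio"},
}


def check_modalities(model: str, content: list) -> None:
    """Raise ValueError if any part's type is not listed for the model."""
    allowed = SUPPORTED_INPUTS.get(model, set())
    needed = {part["type"] for part in content}
    missing = needed - allowed
    if missing:
        raise ValueError(f"{model} is not listed as supporting: {sorted(missing)}")


content = [
    {"type": "text", "text": "What is Julie asking for?"},
    {"type": "audio", "base64": "AAAA", "mime_type": "audio/wav"},
]

try:
    check_modalities("some-text-only-model", content)
except ValueError as error:
    print(error)  # some-text-only-model is not listed as supporting: ['audio']
```

Failing early with a clear message is easier to debug than a provider error that arrives after the request is sent.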

❌ Misconception: Base64 encoding compresses or secures a file

Reality: Base64 is purely a format conversion — turning binary data into text-safe characters. It doesn't make files smaller (it actually makes them about 33% larger) and provides zero security, since it's trivially reversible by anyone.

Why this matters: If you're thinking about file size or security, base64 isn't the tool for either — it solves a completely different problem (getting binary data through a text-only channel like JSON).
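Both points are easy to verify with the standard library. The sketch below computes the size of padded base64 output from the input size (4 characters for every 3 bytes, rounded up) and shows that decoding needs no key or password. The function name base64_length is chosen for this example.

```python
import base64
import math


def base64_length(n_bytes: int) -> int:
    """Length of standard padded base64 output for n_bytes of input."""
    return 4 * math.ceil(n_bytes / 3)


# The formula matches what b64encode actually produces.
for n in (0, 1, 2, 3, 1000):
    assert len(base64.b64encode(bytes(n))) == base64_length(n)

print(base64_length(3_000_000))  # 4000000

# No secret is needed to reverse the encoding.
print(base64.b64decode("aGVsbG8="))  # b'hello'
```

A 3 MB image therefore becomes roughly 4 MB of text in the request, which is worth keeping in mind if your provider limits payload size.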

Troubleshooting Common Issues

Problem: The model fails or returns nonsense for an image or audio file

Symptoms: An error, or a response that doesn't seem to acknowledge the file content at all.

Common Causes:

The model being used doesn't actually support that modality (most common)
The mime_type doesn't match the actual file format (e.g., labeling a JPEG as "image/png")
The file path was wrong, so an empty or corrupted file got encoded

Diagnostic Steps:

# Step 1: Confirm the file actually has content before encoding
import os
print(os.path.getsize("handwritten_note.png"), "bytes")

# Step 2: Double check the mime_type matches the real file format
# .png -> "image/png", .jpg/.jpeg -> "image/jpeg", .wav -> "audio/wav"

# Step 3: Confirm your model's documentation lists support for this modality

Solution: Match the mime_type exactly to the real file format, and confirm via your model provider's documentation that the chosen model supports the modality you're sending.

Prevention: Keep a quick mental note (or a small lookup table in your code) mapping file extensions to their correct MIME types, so this isn't something you have to re-derive each time.

Problem: FileNotFoundError when reading the file

Symptoms: An error before the agent is even called, while opening the file.

Common Causes:

A relative file path that doesn't match where the script is actually being run from

Solution: Use an absolute path, or confirm your current working directory matches where the file actually lives.
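A minimal sketch of that solution using pathlib follows. The helper resolve_input_file is a hypothetical name; it builds an absolute path from a base directory you choose and reports both the resolved path and the current working directory when the file is missing. The temporary directory and placeholder file exist only to make the example runnable.

```python
import tempfile
from pathlib import Path


def resolve_input_file(name: str, base_dir: Path) -> Path:
    """Return an absolute path to name inside base_dir, or raise a clear error."""
    path = (base_dir / name).resolve()
    if not path.is_file():
        raise FileNotFoundError(
            f"Expected {name!r} at {path} (current directory: {Path.cwd()})"
        )
    return path


# Demo with a temporary directory standing in for your project folder.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "handwritten_note.png").write_bytes(b"placeholder")
    found = resolve_input_file("handwritten_note.png", base)
    print(found.name)  # handwritten_note.png
```

In a script, a common choice for base_dir is Path(__file__).resolve().parent, which makes the lookup independent of where the script is launched from.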

Check Your Understanding

Quick Quiz

What problem does base64 encoding actually solve?

Answer: It converts binary data (like an image or audio file) into plain text characters, so it can be safely included in a text-based format like JSON, which is what AI APIs communicate with. It doesn't compress or secure the data; it's purely a format conversion.

Why might sending audio to create_agent(model="gpt-5-nano") fail, even though sending images to that same model worked fine?

Answer: Because modality support is model-specific. A model that supports text and image input doesn't automatically support audio input too. You need to confirm support for each modality individually and choose a model accordingly.

What's wrong with this code?

```
message = HumanMessage(content=[
    {"type": "image", "base64": jpeg_b64, "mime_type": "image/png"}
])
```

Answer: The mime_type says "image/png" but the variable name suggests the actual file is a JPEG. The mime_type should match the real format of the encoded file; here it should be "image/jpeg".

Hands-On Exercise

Challenge: Write a small helper function that takes a file path and automatically picks the correct mime_type based on the file extension, for .png, .jpg/.jpeg, and .wav files.

Solution:

import base64
from pathlib import Path

def encode_file_with_mime_type(file_path: str) -> dict:
    """Read a file, base64-encode it, and determine its mime_type from
    the file extension."""
    mime_types = {
        ".png": "image/png",
        ".jpg": "image/jpeg",
        ".jpeg": "image/jpeg",
        ".wav": "audio/wav",
    }

    # Path.suffix returns "" when there is no extension, so a file like
    # "notes" is reported as having no extension rather than ".notes"
    extension = Path(file_path).suffix.lower()
    mime_type = mime_types.get(extension)

    if mime_type is None:
        raise ValueError(f"Unsupported file extension: {extension!r}")

    with open(file_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")

    return {"base64": encoded, "mime_type": mime_type}

# Example usage:
image_part = encode_file_with_mime_type("handwritten_note.png")
print(image_part["mime_type"])  # "image/png"

Explanation: Centralizing the extension-to-mime_type mapping in one helper avoids repeating (and potentially mismatching) it every time you build a multimodal message.
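The helper returns "base64" and "mime_type" but not the "type" key that a content part also needs. One possible follow-up, sketched below, derives it from the first half of the MIME type. The function to_content_part is a hypothetical addition for this exercise, and the "AAAA" string is a stand-in for real encoded data.

```python
def to_content_part(encoded: dict) -> dict:
    """Add the "type" key using the top-level MIME type ("image/png" -> "image")."""
    kind = encoded["mime_type"].split("/", 1)[0]
    if kind not in {"image", "audio"}:
        raise ValueError(f"Unsupported MIME type: {encoded['mime_type']}")
    return {"type": kind, **encoded}


part = to_content_part({"base64": "AAAA", "mime_type": "audio/wav"})
print(part)  # {'type': 'audio', 'base64': 'AAAA', 'mime_type': 'audio/wav'}
```

The resulting dictionary has the same three keys as the image and audio parts built by hand earlier in the article, so it can go straight into a content list next to a text part.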

Summary: Key Takeaways

A message's content can be a plain string, or a list of typed parts — text, image, and audio can all coexist
Base64 encoding converts binary file data into text-safe characters so it can travel through JSON-based APIs — it's not compression or encryption
Image parts need "type": "image", "base64", and an accurate "mime_type"
Audio works the same structural way, but requires choosing a model that explicitly supports audio input
Not all models support all modalities — always confirm before building around a specific input type
Aria can now process screenshots and voice notes, not just typed messages

Version Information

Tested with:

Python: >=3.10, <4.0
langchain: >=1.1.3 (latest stable as of writing: 1.3.4)
base64 — part of the Python standard library, no installation needed

Known issues:

⚠️ Audio input support varies significantly by provider and model — always confirm current support in your model provider's documentation, as this is an area that changes frequently.

What's Next?

You now understand how to send images and audio to an agent, and why model choice matters for each modality.

The natural next step is MCP: Connecting Agents to External Servers. So far, every ability Aria has comes from a tool you wrote yourself; that article covers connecting her to tools and services built entirely by other people.

References

LangChain Academy: Introduction to LangChain (Python) — this section is inspired by and adapted from this course
LangChain Docs: Multimodality — official guide to message content and multimodal input
MDN: Base64 encoding — general reference on what base64 is and how it works
langchain on PyPI — latest version and release history
