Multimodal Messages: Text, Image, and Audio Input
Every message Aria has handled so far has been plain text. But Julie's real life isn't all text — sometimes she'll forward a screenshot of an email instead of retyping it, or leave a quick voice note dictating a reply while she's walking between meetings. Right now, Aria has no way to handle either.
In this article, we'll teach Aria to accept images and audio, not just text. Along the way, you'll learn about base64 encoding — a concept that shows up constantly once you start working with files and AI APIs, and one that's genuinely simple once it's explained properly.
🟡 Skill level: Intermediate.
Quick Reference
When to use this: Whenever an agent needs to process something other than plain text — a screenshot, a photo, a voice recording.
Basic syntax:
from langchain.messages import HumanMessage
message = HumanMessage(content=[
    {"type": "text", "text": "What does this say?"},
    {"type": "image", "base64": img_b64, "mime_type": "image/png"},
])
Common patterns:
- content can be a plain string (what we've used so far) or a list of typed parts: text, image, or audio
- Files need to be base64-encoded before they can be included in a message
- Not every model supports every modality — you have to pick one that does
Gotchas:
- ⚠️ Using a model that doesn't support image or audio input will fail, sometimes with a confusing error — always check the model's capabilities first.
- ⚠️ Base64 is an encoding, not compression or encryption: encoded data is about 33% larger than the original, and anyone can decode it back.
See also: Memory and Threads: Agents That Remember
What You Need to Know First
- Everything from Articles 1–4
- Basic familiarity with reading files in Python (open(...))
What We'll Cover in This Article
- How a message's content can hold more than just plain text
- What base64 encoding is and why AI APIs need it
- How to send an image to an agent
- How to send audio to an agent, and why that requires choosing a different model
What We'll Explain Along the Way
- What a MIME type is
- Why not all models support all input types
From Plain Text to Multi-Part Messages
Every HumanMessage we've created so far has looked like this:
HumanMessage(content="What's in my inbox?")
Here, content is just a plain string. But content can also be a list of parts, where each part is a dictionary describing one piece of the message — and one of those parts can be "type": "text", behaving exactly like a plain string would:
# Purpose: Show that the list form behaves identically to a plain string
# Context: A structural change with no behavior change — sets up image/audio next
# Input: The same kind of question as before, just structured differently
# Output: An identical response to what a plain string would produce
from dotenv import load_dotenv
load_dotenv()
from langchain.agents import create_agent
from langchain.messages import HumanMessage
agent = create_agent(model="gpt-5-nano")
question = HumanMessage(content=[
    {"type": "text", "text": "What's a polite way to decline a coffee invite?"}
])
response = agent.invoke({"messages": [question]})
print(response["messages"][-1].content)
Nothing about the behavior changed — this responds exactly like the plain-string version would. What changed is the shape. And once content is a list, we can add other kinds of parts alongside the text — which is exactly what image and audio input are.
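To make the relationship between the two shapes concrete, here is a minimal sketch in plain Python. The helper name as_parts is hypothetical and is not part of LangChain; it only illustrates the idea that a plain string corresponds to a one-item list containing a text part.

```python
from typing import Union


def as_parts(content: Union[str, list]) -> list:
    """Hypothetical helper (not a LangChain API): normalize message content
    into the list-of-parts shape described above."""
    if isinstance(content, str):
        # A plain string corresponds to a single text part
        return [{"type": "text", "text": content}]
    # Already a list of parts: return a shallow copy so callers can append safely
    return list(content)


parts = as_parts("What's in my inbox?")
print(parts)
```

A helper like this can be handy if some of your code builds plain strings and other code wants to append an image or audio part afterwards.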
A Quick Detour: What Is Base64?
Before sending an image or audio file, we need to understand one concept: base64 encoding.
AI APIs communicate using JSON — a text-based format. But an image file is binary data: a sequence of bytes that isn't text at all, and can't be safely dropped into a JSON message as-is. Base64 is a way of representing binary data using only plain text characters (letters, numbers, a few symbols), so that binary content can be safely embedded inside a text-based format like JSON.
Think of it like translating a photograph into Morse code so it can be sent over a telegraph wire that only understands dots and dashes — the wire can't carry a photo directly, but it can carry a text-based encoding of one, which gets decoded back into the original on the other end. Base64 is doing the same job: converting binary data into a text-safe encoding, with no loss of information, so it can travel through a text-only channel.
One important clarification: base64 is not compression, and it's not encryption. A base64-encoded file is actually larger than the original (about 33% bigger, because every 3 bytes of input become 4 characters of output), and anyone can decode it back to the original, so it provides no security at all. It's purely a format conversion.
Python's standard library includes everything needed to do this — no extra installation required:
import base64
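As a small illustration of why the encoding step is needed, the sketch below uses only the standard library. The byte values are arbitrary stand-ins for real file contents. It shows that raw bytes can't be placed in a JSON document directly, that the base64 string can, and that decoding restores the original bytes exactly.

```python
import base64
import json

# Arbitrary binary bytes standing in for the contents of a real file
raw = bytes([0x89, 0x50, 0x4E, 0x47, 0x00, 0xFF, 0x10, 0x80])

# JSON is text-based, so the json module refuses raw bytes
try:
    json.dumps({"data": raw})
    json_accepts_bytes = True
except TypeError:
    json_accepts_bytes = False

# Base64 turns the bytes into a plain string that JSON can carry
encoded = base64.b64encode(raw).decode("ascii")
payload = json.dumps({"data": encoded})

# Decoding on the other side recovers the original bytes with no loss
restored = base64.b64decode(json.loads(payload)["data"])

print("JSON accepts raw bytes:", json_accepts_bytes)
print("Round trip is lossless:", restored == raw)
```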
Giving Aria Eyes: Image Input
Let's say a colleague emailed Julie a photo of a handwritten note, and Julie wants Aria to read it. First, we read the image file and base64-encode it:
# Purpose: Read an image file from disk and encode it for sending to the agent
# Context: Prepares an image to be included in a multimodal message
# Input: A path to an image file on disk
# Output: A base64-encoded string representing the image's binary data
import base64
# Step 1: Open the image file in binary mode ("rb" = read bytes)
with open("handwritten_note.png", "rb") as image_file:
    image_bytes = image_file.read()
# Step 2: Encode the raw bytes as base64, then decode to a plain Python string
# (base64.b64encode returns bytes; .decode("utf-8") turns that into a str)
img_b64 = base64.b64encode(image_bytes).decode("utf-8")
Now we build a multimodal message — text and image together — and send it to an agent. We're using "gpt-5-nano" here as a placeholder model identifier; in practice, swap in whichever model your provider offers that explicitly supports image input.
# Purpose: Send an image alongside text to an agent that supports image input
# Context: Aria reads a photo Julie received, instead of typed text
# Input: The base64-encoded image from the previous step
# Output: A response describing or transcribing what's in the image
from langchain.agents import create_agent
from langchain.messages import HumanMessage
agent = create_agent(model="gpt-5-nano")
multimodal_question = HumanMessage(content=[
    {"type": "text", "text": "Can you transcribe what this handwritten note says?"},
    {"type": "image", "base64": img_b64, "mime_type": "image/png"},
])
response = agent.invoke({"messages": [multimodal_question]})
print(response["messages"][-1].content)
Notice the structure of the image part: "type": "image" tells the agent what kind of content this is, "base64" carries the actual encoded data, and "mime_type" tells the model exactly what format the original file was in — "image/png" here, since we encoded a PNG file. Get the mime_type wrong (say, "image/png" for an actual JPEG file) and the model may fail to interpret the data correctly.
Giving Aria Ears: Audio Input
Audio works exactly the same way structurally — base64-encode the file, add it as a part in the content list — but with one important difference: not every model that handles text and images can also handle audio. You need to pick a model that explicitly supports audio input.
# Purpose: Read an audio file and encode it, same pattern as the image example
# Context: Julie left a voice note instead of typing a message
# Input: A path to a .wav audio file
# Output: A base64-encoded string representing the audio's binary data
import base64
with open("julies_voice_note.wav", "rb") as audio_file:
    audio_bytes = audio_file.read()
aud_b64 = base64.b64encode(audio_bytes).decode("utf-8")
# Purpose: Send audio to an agent built on a model that supports audio input
# Context: Demonstrates that the modality you need determines which model you choose
# Input: The base64-encoded audio from the previous step
# Output: A response based on what was said in the audio
from langchain.agents import create_agent
from langchain.messages import HumanMessage
# Note the different model here — this one explicitly supports audio input.
# Always check your provider's documentation for which models support which
# modalities; not every model that handles text and images also handles audio.
audio_agent = create_agent(model="gpt-4o-audio-preview")
multimodal_question = HumanMessage(content=[
    {"type": "text", "text": "What is Julie asking for in this voice note?"},
    {"type": "audio", "base64": aud_b64, "mime_type": "audio/wav"},
])
response = audio_agent.invoke({"messages": [multimodal_question]})
print(response["messages"][-1].content)
This is the same pattern as image input — "type": "audio", "base64", "mime_type" — applied to a different kind of file, with a model chosen specifically because it supports that modality.
Common Misconceptions
❌ Misconception: Any model that handles text can also handle images and audio
Reality: Modality support varies model by model. A model built for text and images may not support audio at all, and vice versa.
Why this matters: Using the wrong model for a given input type will fail — sometimes with a clear error, sometimes with confusing or degraded behavior. Always confirm a model's supported modalities before building around it.
Example:
# ❌ Wrong assumption: "if it handles text, it handles everything"
agent = create_agent(model="some-text-only-model")
# sending audio to this agent will fail
# ✅ Correct: pick a model explicitly documented as supporting audio
audio_agent = create_agent(model="gpt-4o-audio-preview")
❌ Misconception: Base64 encoding compresses or secures a file
Reality: Base64 is purely a format conversion — turning binary data into text-safe characters. It doesn't make files smaller (it actually makes them about 33% larger) and provides zero security, since it's trivially reversible by anyone.
Why this matters: If you're thinking about file size or security, base64 isn't the tool for either — it solves a completely different problem (getting binary data through a text-only channel like JSON).
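If you want to see both points in numbers, the following standard-library sketch may help. The helper name predicted_b64_length is made up for this example. It computes the encoded size from the 3-bytes-to-4-characters rule and confirms that decoding needs no key or password.

```python
import base64


def predicted_b64_length(n_bytes: int) -> int:
    """Hypothetical helper: padded base64 emits 4 characters for every
    group of 3 input bytes, rounding the last group up."""
    return 4 * ((n_bytes + 2) // 3)


# One million zero bytes standing in for a roughly 1 MB file
original = bytes(1_000_000)
encoded = base64.b64encode(original)

overhead = len(encoded) / len(original) - 1
print(f"Encoded size: {len(encoded)} characters ({overhead:.1%} larger)")

# No secret is involved: anyone holding the string can reverse it
decoded = base64.b64decode(encoded)
print("Decoded without any key:", decoded == original)
```

This is worth keeping in mind for large files, since the request you send will be about a third bigger than the file on disk.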
Troubleshooting Common Issues
Problem: The model fails or returns nonsense for an image or audio file
Symptoms: An error, or a response that doesn't seem to acknowledge the file content at all.
Common Causes:
- The model being used doesn't actually support that modality (most common)
- The mime_type doesn't match the actual file format (e.g., labeling a JPEG as "image/png")
- The file path was wrong, so an empty or corrupted file got encoded
Diagnostic Steps:
# Step 1: Confirm the file actually has content before encoding
import os
print(os.path.getsize("handwritten_note.png"), "bytes")
# Step 2: Double check the mime_type matches the real file format
# .png -> "image/png", .jpg/.jpeg -> "image/jpeg", .wav -> "audio/wav"
# Step 3: Confirm your model's documentation lists support for this modality
Solution: Match the mime_type exactly to the real file format, and confirm via your model provider's documentation that the chosen model supports the modality you're sending.
Prevention: Keep a quick mental note (or a small lookup table in your code) mapping file extensions to their correct MIME types, so this isn't something you have to re-derive each time.
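File extensions can be wrong (a JPEG saved with a .png name, for instance), so one optional extra check is to look at the first few bytes of the file itself. The sketch below is an illustration only: sniff_mime_type is a hypothetical helper, and it recognizes just the three formats used in this article by their well-known file signatures.

```python
from typing import Optional


def sniff_mime_type(data: bytes) -> Optional[str]:
    """Hypothetical helper: guess a MIME type from a file's leading bytes.
    Covers only PNG, JPEG, and WAV; returns None for anything else."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    # WAV files are RIFF containers whose form type (bytes 8-11) is "WAVE"
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "audio/wav"
    return None


# Synthetic headers standing in for real file contents
print(sniff_mime_type(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8))
print(sniff_mime_type(b"\xff\xd8\xff\xe0" + b"\x00" * 8))
print(sniff_mime_type(b"RIFF\x24\x00\x00\x00WAVEfmt "))
```

Comparing the sniffed type with the mime_type you are about to send is a cheap way to catch a mislabeled file before the model sees it.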
Problem: FileNotFoundError when reading the file
Symptoms: An error before the agent is even called, while opening the file.
Common Causes:
- A relative file path that doesn't match where the script is actually being run from
Solution: Use an absolute path, or confirm your current working directory matches where the file actually lives.
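One way to make path problems easier to spot is to resolve the path up front and fail with a message that shows exactly where Python looked. This is a sketch using pathlib from the standard library; the function name resolve_input_file and its base_dir parameter are assumptions made for this example.

```python
from pathlib import Path
from typing import Optional


def resolve_input_file(name: str, base_dir: Optional[str] = None) -> Path:
    """Hypothetical helper: turn a file name into an absolute path and
    fail early, with a descriptive message, if the file isn't there."""
    # Default to the current working directory, which is where relative
    # paths passed to open() are resolved from
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    path = (base / name).resolve()
    if not path.is_file():
        raise FileNotFoundError(
            f"No file at {path} (working directory: {Path.cwd()})"
        )
    return path
```

Calling resolve_input_file("handwritten_note.png") before opening the file means a wrong working directory shows up in the error text instead of leaving you to guess.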
Check Your Understanding
Quick Quiz
- What problem does base64 encoding actually solve?
Answer: It converts binary data (like an image or audio file) into plain text characters, so it can be safely included in a text-based format like JSON, which is what AI APIs communicate with. It doesn't compress or secure the data; it's purely a format conversion.
- Why might sending audio to create_agent(model="gpt-5-nano") fail, even though sending images to that same model worked fine?
Answer: Because modality support is model-specific. A model that supports text and image input doesn't automatically support audio input too. You need to confirm support for each modality individually and choose a model accordingly.
- What's wrong with this code?
message = HumanMessage(content=[
    {"type": "image", "base64": jpeg_b64, "mime_type": "image/png"}
])
Answer: The mime_type says "image/png" but the variable name suggests the actual file is a JPEG. The mime_type should match the real format of the encoded file; here it should be "image/jpeg".
Hands-On Exercise
Challenge: Write a small helper function that takes a file path and automatically picks the correct mime_type based on the file extension, for .png, .jpg/.jpeg, and .wav files.
Solution:
import base64
import os

def encode_file_with_mime_type(file_path: str) -> dict:
    """Read a file, base64-encode it, and determine its mime_type from
    the file extension."""
    mime_types = {
        ".png": "image/png",
        ".jpg": "image/jpeg",
        ".jpeg": "image/jpeg",
        ".wav": "audio/wav",
    }
    # os.path.splitext returns "" for files with no extension, so those are
    # rejected below instead of being mistaken for an extension
    extension = os.path.splitext(file_path)[1].lower()
    mime_type = mime_types.get(extension)
    if mime_type is None:
        raise ValueError(f"Unsupported file extension: {extension!r}")
    with open(file_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return {"base64": encoded, "mime_type": mime_type}
# Example usage:
image_part = encode_file_with_mime_type("handwritten_note.png")
print(image_part["mime_type"]) # "image/png"
Explanation: Centralizing the extension-to-mime_type mapping in one helper avoids repeating (and potentially mismatching) it every time you build a multimodal message.
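The dictionary returned by the solution still lacks the "type" key that a message part needs. One possible way to finish the job, sketched below, is to derive it from the first half of the MIME type. The helper name to_content_part is hypothetical, and the approach assumes the part shapes shown earlier in this article ("image" and "audio" parts with "base64" and "mime_type" keys).

```python
def to_content_part(encoded: dict) -> dict:
    """Hypothetical helper: add the "type" key to a
    {"base64": ..., "mime_type": ...} dict, based on the MIME type's
    top-level category ("image/png" -> "image")."""
    kind = encoded["mime_type"].split("/", 1)[0]
    if kind not in ("image", "audio"):
        raise ValueError(f"Unsupported MIME type: {encoded['mime_type']}")
    return {"type": kind, **encoded}


# "AAAA" is a placeholder for real base64 data
part = to_content_part({"base64": "AAAA", "mime_type": "audio/wav"})
print(part["type"])
```

The resulting dictionary can then be placed in a HumanMessage content list next to a text part.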
Summary: Key Takeaways
- A message's content can be a plain string or a list of typed parts; text, image, and audio can all coexist
- Base64 encoding converts binary file data into text-safe characters so it can travel through JSON-based APIs; it's not compression or encryption
- Image parts need "type": "image", "base64", and an accurate "mime_type"
- Audio works the same structural way, but requires choosing a model that explicitly supports audio input
- Not all models support all modalities — always confirm before building around a specific input type
- Aria can now process screenshots and voice notes, not just typed messages
Version Information
Tested with:
- Python: >=3.10, <4.0
- langchain: >=1.1.3 (latest stable as of writing: 1.3.4)
- base64: part of the Python standard library, no installation needed
Known issues:
- ⚠️ Audio input support varies significantly by provider and model — always confirm current support in your model provider's documentation, as this is an area that changes frequently.
What's Next?
You now understand how to send images and audio to an agent, and why model choice matters for each modality.
The natural next step is MCP: Connecting Agents to External Servers — so far, every ability Aria has came from a tool you wrote yourself. That article covers connecting her to tools and services built by other people entirely.
References
- LangChain Academy: Introduction to LangChain (Python) — this section is inspired by and adapted from this course
- LangChain Docs: Multimodality — official guide to message content and multimodal input
- MDN: Base64 encoding — general reference on what base64 is and how it works
- langchain on PyPI: latest version and release history