Step-by-Step Implementation of a Video-Text Reasoning Chain for Enterprise
Learn how to build a functional, hierarchical multi-agent system using LangGraph and Python. This guide provides code and logic for orchestrating vision, OCR, and reasoning agents for complex business tasks.
From Diagram to Deployment
In our previous discussion, we outlined the strategic advantage of hierarchical multi-agent systems. We established that a "Manager" agent, supervising specialized "Worker" agents, maximizes accuracy and handles multimodal complexity better than a single generalist model.
But how do you actually implement this architecture? In 2026, LangGraph (integrated with LangChain) is the industry standard for building stateful, cyclic graphs required for true agentic collaboration. This post provides a concrete, technical blueprint for creating a swarm that analyzes visual evidence (like a warehouse camera snapshot) and correlates it with structured data.
1. Setting Up the Environment and Defining State
The backbone of any LangGraph system is the defined State. This state object flows through the nodes of your graph, allowing agents to append information and the Manager to make informed decisions based on previous steps.
Prerequisites (2026 Stack)
We will use the latest compatible models (e.g., an LMM like gpt-5-vision-preview or gemini-2-flash) via LangChain.
Code Block: State & Graph Initialization
# Libraries required: langgraph, langchain, langchain_openai, requests
import operator
from typing import Annotated, TypedDict, List

from langchain_core.messages import HumanMessage, BaseMessage
from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, END

# Define the shared state passed between agents.
# total=False: keys are filled in incrementally, so the initial input
# only needs to supply 'task' and 'video_path'.
class AgentState(TypedDict, total=False):
    task: str                # The original request
    video_path: str          # Path or URL to visual evidence
    ocr_data: str            # Worker output: extracted text from the frame
    vision_description: str  # Worker output: general visual scene analysis
    api_data: str            # Worker output: ERP/SQL query results
    final_analysis: str      # Manager output: the final response
    messages: Annotated[List[BaseMessage], operator.add]  # Accumulated agent interactions
    next_step: str           # Routing instruction from the Manager

# Initialize the primary LMM shared by the nodes
llm = ChatOpenAI(model="gpt-5-vision-preview", temperature=0)
print("Environment initialized. State definition complete.")
2. Defining the Specialized 'Worker' Nodes
Each worker agent is defined as a function or a nested graph. The key requirement is that a worker node writes only to its designated field in the AgentState.
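To see why these partial returns are safe, here is a minimal plain-Python simulation of the merge semantics (this is an illustration, not LangGraph's actual internals): ordinary fields are overwritten by the last writer, while fields annotated with a reducer such as operator.add are accumulated.

```python
import operator

# Simulated reducer table: mirrors Annotated[List, operator.add] on 'messages'.
# Fields not listed here are simply overwritten (the default behavior).
REDUCERS = {"messages": operator.add}

def merge_state(state: dict, partial: dict) -> dict:
    """Merge a worker node's partial return into the shared state."""
    merged = dict(state)
    for key, value in partial.items():
        if key in REDUCERS and key in merged:
            merged[key] = REDUCERS[key](merged[key], value)  # accumulate
        else:
            merged[key] = value  # overwrite
    return merged

state = {"task": "inspect dock", "messages": []}
# Each worker returns only its designated field plus a message entry
state = merge_state(state, {"ocr_data": "SN-4412", "messages": ["ocr ran"]})
state = merge_state(state, {"vision_description": "forklift at bay 3",
                            "messages": ["vision ran"]})

print(state["messages"])  # both entries accumulated, not overwritten
print(state["ocr_data"])  # single-writer field, set exactly once
```

Because each worker touches only its own key, two workers can even run in parallel branches without clobbering each other's output.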
A. The OCR Agent Node
This agent is specialized to look at the provided video_path (interpreting a specific keyframe) and extract textual data, such as a serial number on a box.
Code Block: OCR Node Implementation
def ocr_worker_node(state: AgentState):
    print("--- [Worker] Extracting OCR data ---")
    image_url = state['video_path']
    # Multimodal prompt optimized for text extraction
    msg = llm.invoke([
        HumanMessage(content=[
            {"type": "text", "text": "Extract all text or serial numbers visible in this frame:"},
            {"type": "image_url", "image_url": {"url": image_url}}
        ])
    ])
    # Write only to ocr_data (plus the shared message history)
    return {"ocr_data": msg.content, "messages": [msg]}
B. The Scene Description Agent Node
This agent analyzes the overall context—lighting, people present, vehicle movement—without focusing heavily on text.
Code Block: Vision Node Implementation
def vision_worker_node(state: AgentState):
    print("--- [Worker] Analyzing scene context ---")
    image_url = state['video_path']
    msg = llm.invoke([
        HumanMessage(content=[
            {"type": "text", "text": "Describe the general scenario, actions, and objects in this scene. Ignore fine text:"},
            {"type": "image_url", "image_url": {"url": image_url}}
        ])
    ])
    # Write only to vision_description (plus the shared message history)
    return {"vision_description": msg.content, "messages": [msg]}
3. The 'Manager' Node and Routing Logic
The Manager Agent is responsible for reading the accumulated outputs and determining if the task is complete or if further action is needed. This includes conditional routing.
Code Block: Manager Node & Router
def manager_router(state: AgentState):
    print("--- [Manager] Reviewing collected data ---")
    # Use .get() because worker fields are absent until their node has run
    ocr = state.get('ocr_data')
    vision = state.get('vision_description')
    # If we have vision data but no OCR yet, collect OCR next
    if vision and not ocr:
        print("[Router] Redirecting to OCR Worker...")
        return "ocr"
    # Both inputs collected: proceed to synthesis
    print("[Router] Proceeding to Final Analysis...")
    return "finalize"

def final_analyzer_node(state: AgentState):
    print("--- [Manager] Synthesizing final report ---")
    # Consolidate all collected data for the final synthesis
    prompt = f"""
Based on the task: {state['task']}
We analyzed the visual evidence:
- Text Extracted (OCR): {state['ocr_data']}
- Visual Description: {state['vision_description']}
Provide the definitive answer to the user's task. Be precise.
"""
    msg = llm.invoke(prompt)
    return {"final_analysis": msg.content, "messages": [msg]}
4. Compiling the Graph
Finally, we connect the nodes using the StateGraph builder and compile the result into an executable app. Note that the graph needs an explicit entry point, and that the Manager's routing function is attached as a conditional edge rather than as a node of its own.
Code Block: Graph Compilation
# Initialize the Workflow Builder
workflow = StateGraph(AgentState)
# Add all nodes
workflow.add_node("ocr", ocr_worker_node)
workflow.add_node("vision", vision_worker_node)
workflow.add_node("finalize", final_analyzer_node)
# Add conditional edges from the 'Manager Router'
workflow.add_conditional_edges(
# Source node: where the 'routing logic' is conceptually housed
"vision",
# Function that determines the next step
manager_router,
# Map of function output string to actual node name
{
"ocr": "ocr",
"finalize": "finalize"
}
)
# Set the flow for subsequent nodes
workflow.add_edge("ocr", "finalize")
workflow.add_edge("finalize", END)
# Compile the workflow
app = workflow.compile()
print("Graph compiled successfully.")
Figure 2: Flow diagram of the compiled graph.
5. Execution and Enterprise Application
We can now execute our workflow by passing an initial state. We will simulate a complex loading dock scenario where an object must be identified against an entry log.
Code Block: Runtime Execution
# Simulating input (e.g., a real-time frame URL from the loading dock)
initial_input = {
"task": "Identify the primary item being loaded and cross-reference with entry logs.",
"video_path": "https://raw.githubusercontent.com/langchain-ai/langgraph/main/examples/multimodal/data/sample_loading_dock.jpg"
}
# Run the app
print("\nStarting execution...")
result = app.invoke(initial_input)
print("\n--- Final Output ---")
print(result['final_analysis'])
Analysis of Execution
The trace would show the flow entering at the vision node. Because a generic scene description alone cannot satisfy the "identify" part of the task, the router then directed the flow to the ocr node to obtain exact label data. Once both fields were populated, the flow moved to finalize.
FAQ: Implementation Deep-Dive
Q1: What happens if a worker agent fails (e.g., API timeout)?
A: Implement retry logic at the node-function level, either with a library like tenacity or with the retry-policy options that recent LangGraph versions allow you to attach when adding a node.
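To make the retry pattern concrete, here is a stdlib-only sketch of a decorator you could wrap around any worker node; tenacity offers the same pattern with more options, and flaky_ocr_node is an invented stand-in that fails twice before succeeding.

```python
import functools
import time

def with_retries(max_attempts=3, base_delay=0.01):
    """Retry a node function on exception, with exponential backoff."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(state):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(state)
                except Exception:
                    if attempt == max_attempts:
                        raise  # exhausted: surface the error to the graph
                    time.sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator

calls = {"n": 0}

@with_retries(max_attempts=3)
def flaky_ocr_node(state):
    # Hypothetical worker that times out twice before succeeding
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated API timeout")
    return {"ocr_data": "SN-4412"}

result = flaky_ocr_node({"video_path": "frame.jpg"})
print(result)  # third attempt succeeds
```

Because the decorator preserves the `(state) -> partial dict` signature, the wrapped function can be registered with `add_node` exactly like an undecorated worker.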
Q2: How do you handle true video, not just keyframes?
A: In 2026, many LMMs accept video_url directly. Your worker nodes would pass the timestamp relevant to the task to the model. Alternatively, pre-process the video to extract salient frames based on motion detection.
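The pre-processing route can be sketched without any video library: treat each frame as an array of pixel intensities and keep only frames that differ enough from the last kept frame. A real pipeline would read frames with OpenCV or ffmpeg; the synthetic 4-pixel frames below exist purely to illustrate the selection logic.

```python
def frame_diff(a, b):
    """Mean absolute pixel difference between two frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def select_keyframes(frames, threshold=10.0):
    """Keep frame 0, then every frame that changes enough vs. the last kept one."""
    if not frames:
        return []
    kept = [0]
    for i in range(1, len(frames)):
        if frame_diff(frames[i], frames[kept[-1]]) > threshold:
            kept.append(i)
    return kept

# Synthetic 4-pixel frames: static, static, motion event, static again
frames = [
    [100, 100, 100, 100],
    [101, 100,  99, 100],  # negligible change -> dropped
    [180,  40, 200,  10],  # large change -> kept
    [181,  41, 199,  11],  # scene settles -> dropped
]
print(select_keyframes(frames))  # -> [0, 2]
```

Only the kept frame indices would then be handed to the worker nodes, keeping LMM calls (and cost) proportional to motion events rather than raw frame count.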
Q3: Can a worker agent trigger another tool, like a SQL query?
A: Absolutely. A worker node is just a Python function. Instead of invoking an LMM, it can invoke an ERP API or a database connector and write that result to state['api_data'].
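As a sketch of such a non-LLM worker, the example below queries an in-memory SQLite database with the stdlib sqlite3 module; the entry_logs table and its schema are invented for illustration. Note that it follows the same contract as the vision workers, writing only to its designated field.

```python
import sqlite3

# Hypothetical entry-log database (in-memory for the example)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry_logs (serial TEXT, item TEXT, dock TEXT)")
conn.execute("INSERT INTO entry_logs VALUES ('SN-4412', 'pallet of monitors', 'bay 3')")

def erp_worker_node(state):
    """Worker node: query the entry-log DB instead of invoking an LMM."""
    print("--- [Worker] Querying entry logs ---")
    serial = state.get("ocr_data", "")
    row = conn.execute(
        "SELECT item, dock FROM entry_logs WHERE serial = ?", (serial,)
    ).fetchone()
    result = f"match: {row[0]} at {row[1]}" if row else "no matching log entry"
    # Same contract as the LMM workers: write only to the designated field
    return {"api_data": result}

print(erp_worker_node({"ocr_data": "SN-4412"}))
```

In the full graph, this node would be wired after the OCR worker, since it consumes the serial number that ocr_data provides.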
Q4: How does this state scale with large history?
A: For long-running tasks, you must implement memory checkpointing. LangGraph allows you to use checkpointers (like RedisSaver or SqliteSaver) to pause and resume agent swarms based on a thread_id.
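Conceptually, a checkpointer persists the full state keyed by thread_id so a paused run can be resumed later. The toy class below mimics that contract with stdlib sqlite3 and JSON; it is not the real SqliteSaver API, only an illustration of the save/load-by-thread idea.

```python
import json
import sqlite3

class MiniCheckpointer:
    """Toy stand-in for a graph checkpointer: persist state per thread_id."""
    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints (thread_id TEXT PRIMARY KEY, state TEXT)"
        )

    def save(self, thread_id, state):
        # Latest checkpoint wins for a given thread
        self.conn.execute(
            "INSERT OR REPLACE INTO checkpoints VALUES (?, ?)",
            (thread_id, json.dumps(state)),
        )
        self.conn.commit()

    def load(self, thread_id):
        row = self.conn.execute(
            "SELECT state FROM checkpoints WHERE thread_id = ?", (thread_id,)
        ).fetchone()
        return json.loads(row[0]) if row else None

cp = MiniCheckpointer()
cp.save("dock-42", {"task": "inspect", "ocr_data": "SN-4412"})
resumed = cp.load("dock-42")  # resume the paused swarm's state
print(resumed["ocr_data"])
```

With the real library, the equivalent happens by passing a checkpointer to compile() and a thread_id in the invocation config; the principle of keying durable state by thread is the same.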
Q5: Is LangGraph compatible with other model providers besides OpenAI?
A: Yes, LangChain/LangGraph is designed to be model-agnostic. You can swap OpenAI for Anthropic, Google Gemini 2 Enterprise, or locally hosted Llama models via the provided integrations.
Orchestration as the Final Mile
Building advanced AI systems in 2026 is less about designing models and more about designing Agentic Workflows. This LangGraph guide demonstrates that by enforcing a strict state schema and specialized worker nodes, you can achieve Level 3 and Level 4 automation: reliable, multi-step autonomous reasoning.
Are you implementing hierarchical swarms? Share your architectural bottlenecks or questions in the comments below, or subscribe for our upcoming workshop on memory management in complex swarms!
References and Disclaimer
LangChain/LangGraph Documentation: Advanced State Management (March 2026 Update).
NVIDIA DevBlog: Efficient Inference for Multimodal Agent Swarms.
OpenAI Cookbook: Video Analysis Patterns with gpt-5.
Disclaimer: This code guide provides a functional skeleton. Production deployment requires robustness: authentication for APIs, rigorous error handling, rate-limit management, and cost-monitoring tools. The specific LMM versions mentioned reflect current 2026 estimates and may require adjustments based on actual release schedules.