State Management and Callbacks in Complex Asynchronous ADK Workflows

Introduction: The Problem of "Fire and Forget"

In a sophisticated agentic system, many valuable tasks are not instantaneous. An agent given a simple, synchronous goal like "summarize this text" can return a response in seconds. But an agent asked to "run a full financial audit on our Q3 sales data, cross-reference it with marketing campaign expenditures, and generate a detailed forecast report" might need minutes, or even hours, to complete its work.

This introduces the critical engineering problem of asynchronicity. The client application that initiates the task cannot simply block and wait for a response. This would lead to frozen UIs, timed-out connections, and an unusable system. The core challenge is twofold: how does the system reliably track the state of this long-running task, and how does it notify the original caller when the work is finally done?

The Engineering Solution: Durable State and A2A Callbacks

The google.adk.runtime module addresses this by adopting patterns from durable workflow orchestration engines. The architecture separates task initiation from result delivery, combining a persistent state backend with an asynchronous callback mechanism.

The Task Object & State Backend: When a client makes an asynchronous A2A /run call, the ADK runtime does not execute the task directly. Instead, it immediately creates a Task object, persists it to a highly available State Backend (such as a managed Redis or Spanner instance), and returns a unique task_id to the client. This task_id is the handle for all future interactions. The State Backend tracks the lifecycle of the task (PENDING, RUNNING, FAILED, SUCCESS) and can store intermediate results, making the workflow durable and resumable.
The A2A Callback Service: The client could poll a /status endpoint using the task_id, but polling is wasteful at scale because most status checks return no new information. The ADK framework instead promotes a push-based model using A2A callbacks. In the initial /run request, the client can provide a callback_url. When the long-running agent task completes, the runtime's internal Callback Service makes a secure, signed A2A /run call to this URL, delivering the final result as its payload.

+--------+  1. POST /run (async)   +-------------------+
| Client |------------------------>|    ADK Runtime    |
|        |<------------------------| (returns task_id) |
+--------+  2. { "task_id": ... }  +---------+---------+
    ^                                        | 3. Persists Task State
    |                                        v
    | 6. POST /run (callback)      +-------------------+
    +------------------------------|   Callback Svc.   |
                                   +-------------------+
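
To make the task lifecycle concrete, the following is a minimal sketch of steps 1 to 3 using an in-memory stand-in for the State Backend. The names TaskState, TaskRecord, InMemoryStateBackend, and handle_async_run, as well as the table of allowed transitions, are assumptions made for illustration and are not part of any documented ADK interface; a real backend would persist records to a store such as Redis or Spanner.

```python
import uuid
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, Optional


class TaskState(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    FAILED = "FAILED"
    SUCCESS = "SUCCESS"


# Assumed lifecycle rules: terminal states allow no further transitions.
ALLOWED = {
    TaskState.PENDING: {TaskState.RUNNING, TaskState.FAILED},
    TaskState.RUNNING: {TaskState.SUCCESS, TaskState.FAILED},
    TaskState.SUCCESS: set(),
    TaskState.FAILED: set(),
}


@dataclass
class TaskRecord:
    task_id: str
    method: str
    params: Dict[str, Any]
    callback_url: Optional[str] = None
    state: TaskState = TaskState.PENDING
    result: Optional[Dict[str, Any]] = None


class InMemoryStateBackend:
    """Hypothetical stand-in for a durable store; data is lost on restart."""

    def __init__(self) -> None:
        self._tasks: Dict[str, TaskRecord] = {}

    def create(self, method: str, params: Dict[str, Any],
               callback_url: Optional[str] = None) -> TaskRecord:
        record = TaskRecord(str(uuid.uuid4()), method, params, callback_url)
        self._tasks[record.task_id] = record
        return record

    def get(self, task_id: str) -> TaskRecord:
        return self._tasks[task_id]

    def transition(self, task_id: str, new_state: TaskState,
                   result: Optional[Dict[str, Any]] = None) -> TaskRecord:
        record = self._tasks[task_id]
        if new_state not in ALLOWED[record.state]:
            raise ValueError(
                f"illegal transition {record.state.name} -> {new_state.name}")
        record.state = new_state
        if result is not None:
            record.result = result
        return record


def handle_async_run(backend: InMemoryStateBackend, method: str,
                     params: Dict[str, Any], callback_url: str) -> Dict[str, str]:
    """Persist the task as PENDING and hand back only its ID."""
    record = backend.create(method, params, callback_url)
    return {"task_id": record.task_id}
```

Rejecting illegal transitions in the backend, rather than in each caller, is one way to keep a late or duplicated worker update from overwriting a task that has already reached SUCCESS or FAILED.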

Implementation Details

The developer experience is designed to abstract away the complexity of state persistence and callback delivery.

Snippet 1: Client Initiating an Asynchronous Task

The client specifies is_async=True and provides its own callback endpoint.

# client_app.py
from google.adk import client

# Connect to the remote agent
audit_agent = client.connect("https://long-running-auditor.gcp.com")

# Initiate the task asynchronously
task = audit_agent.run(
    method="start_q3_audit",
    params={"company_id": "acme-corp-123"},
    is_async=True,
    callback_url="https://my-app.internal/a2a/callbacks/run"
)

# The client receives a task ID immediately and can disconnect.
# It stores this ID to correlate with the future callback.
print(f"Asynchronous task initiated with ID: {task.id}")


Snippet 2: The Agent's Asynchronous Task Implementation

The agent developer uses the @runtime.asynchronous_task decorator. The provided TaskContext object allows the agent to report progress back to the central State Backend.

# audit_agent.py
from google.adk import agents, runtime
import time

class FinancialAuditAgent(agents.SpecialistAgent):

    @runtime.asynchronous_task
    def start_q3_audit(self, context: runtime.TaskContext, company_id: str) -> dict:
        """
        A long-running task to perform a financial audit.
        The ADK runtime manages the state and callback invocation.
        """
        print(f"Starting audit for {company_id} (Task ID: {context.task_id})")

        # Simulate long-running steps, updating progress along the way.
        time.sleep(60) # Step 1: Data extraction
        context.update_progress(percent=33, message="Source data extracted from ledgers.")

        time.sleep(60) # Step 2: Analysis
        context.update_progress(percent=66, message="Cross-referencing analysis complete.")

        # Final step: generate the report
        final_report = {"report_url": f"https://storage.cloud.google.com/{context.task_id}/audit.pdf"}

        # Upon returning, the ADK runtime marks the task as SUCCESS
        # and triggers the callback to the client's URL.
        return final_report
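
As a rough illustration of what a decorator of this kind could do behind the scenes, the sketch below wraps a method so that state is recorded before it runs, after it returns, and when it raises. Everything here is an assumption for illustration: this asynchronous_task takes an explicit store and a deliver_callback function (the decorator in the snippet above takes no arguments), the store is a plain dict, and the TaskContext class is a simplified stand-in rather than the runtime's own.

```python
import functools
from typing import Any, Callable, Dict


class TaskContext:
    """Simplified, hypothetical stand-in for runtime.TaskContext."""

    def __init__(self, task_id: str, store: Dict[str, Dict[str, Any]]) -> None:
        self.task_id = task_id
        self._store = store

    def update_progress(self, percent: int, message: str) -> None:
        self._store[self.task_id].update(percent=percent, message=message)


def asynchronous_task(store: Dict[str, Dict[str, Any]],
                      deliver_callback: Callable[[str, Dict[str, Any]], None]):
    """Build a decorator that records lifecycle state around the wrapped method."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(agent, task_id: str, **params):
            store[task_id] = {"state": "RUNNING", "percent": 0, "message": ""}
            context = TaskContext(task_id, store)
            try:
                result = func(agent, context, **params)
            except Exception as exc:
                # Record the failure before re-raising so it is not lost.
                store[task_id].update(state="FAILED", error=str(exc))
                raise
            store[task_id].update(state="SUCCESS", result=result, percent=100)
            deliver_callback(task_id, result)
            return result

        return wrapper

    return decorator


# Usage: a demo agent whose callback "delivery" just appends to a list.
STORE: Dict[str, Dict[str, Any]] = {}
DELIVERED = []


class DemoAuditAgent:
    @asynchronous_task(STORE, lambda tid, res: DELIVERED.append((tid, res)))
    def start_q3_audit(self, context: TaskContext, company_id: str) -> dict:
        context.update_progress(percent=33, message="Source data extracted.")
        return {"report_url": f"https://example.invalid/{context.task_id}/audit.pdf"}
```

The point of the sketch is that the agent method contains only business logic; marking the task SUCCESS or FAILED and triggering the callback live entirely in the wrapper.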

Snippet 3: The Client's Callback Handler

The client exposes its own A2A /run endpoint to receive the results.

# client_app.py (the callback receiver)
from google.adk import a2a
from flask import request

@a2a.endpoint(path="/callbacks/run")
def handle_agent_callbacks(method: str, params: dict):
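    # Per the security notes, the @a2a.endpoint decorator validates the callback
    # signature; the explicit check below is kept as defense in depth.
    a2a.validate_request_signature(request)

    if method == "on_audit_complete":
        task_id = params.get("original_task_id")
        # Guard against a missing result so .get() is never called on None.
        result = params.get("result") or {}
        report_url = result.get("report_url")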

        print(f"Callback received: Task {task_id} completed successfully!")
        print(f"Final report is available at: {report_url}")
        # Logic to notify the end-user or trigger the next step.
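
Because the client stores the task ID to correlate it with the later callback, and because a callback sender that retries may deliver the same result more than once, the handler logic benefits from being idempotent. The following is a minimal sketch of that correlation step, independent of any framework. The PendingTasks class, the handle_callback function, and the status strings they return are hypothetical names chosen for this example; the dictionaries stand in for whatever durable storage the client actually uses.

```python
from typing import Any, Dict, Optional


class PendingTasks:
    """Hypothetical client-side registry mapping task IDs to request context."""

    def __init__(self) -> None:
        self._pending: Dict[str, Dict[str, Any]] = {}
        self._results: Dict[str, Dict[str, Any]] = {}

    def register(self, task_id: str, context: Dict[str, Any]) -> None:
        """Remember a task right after the asynchronous /run call returns."""
        self._pending[task_id] = context

    def complete(self, task_id: str, result: Dict[str, Any]) -> str:
        """Record a result once; report duplicates and unknown IDs."""
        if task_id in self._results:
            return "duplicate"
        if task_id not in self._pending:
            return "unknown"
        self._pending.pop(task_id)
        self._results[task_id] = result
        return "completed"

    def result_for(self, task_id: str) -> Optional[Dict[str, Any]]:
        return self._results.get(task_id)


def handle_callback(registry: PendingTasks, method: str,
                    params: Dict[str, Any]) -> str:
    """Validate the callback payload and correlate it with a pending task."""
    if method != "on_audit_complete":
        return "ignored"
    task_id = params.get("original_task_id")
    result = params.get("result")
    if not task_id or not isinstance(result, dict):
        return "invalid"
    return registry.complete(task_id, result)
```

Returning a status instead of raising lets the endpoint acknowledge a duplicate delivery without re-running downstream logic such as user notifications.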

Performance & Security Considerations

Performance: The choice of the State Backend is the primary performance consideration. For high-frequency, low-latency state updates (e.g., tracking progress percentage), a managed in-memory store like Redis is well suited. For workflows requiring high durability and transactional consistency (e.g., ensuring a task's state transitions are recorded exactly once), a strongly consistent database like Spanner or Cloud SQL is the better fit. The ADK runtime can be configured to use either, allowing architects to make the right trade-off.

Security: A callback URL is an endpoint that accepts requests from outside the client application, so it must be secured like any other exposed API.

Callback Authentication: An attacker could try to call your callback_url with fake data. To prevent this, the ADK Callback Service cryptographically signs every outgoing request (e.g., using a JSON Web Token (JWT) in the Authorization header). The client-side @a2a.endpoint decorator is responsible for automatically validating this signature, ensuring the callback is authentic and originated from the trusted ADK runtime.
State Encryption: Any sensitive data written to the State Backend, such as intermediate financial calculations, must be encrypted at rest. This protects the data from unauthorized access even if the underlying storage is compromised.

Conclusion: The ROI of Durable Orchestration

Implementing a durable state and callback system is what elevates an AI agent from a simple request-response tool to a reliable component in a mission-critical business process.

The return on this architectural investment is clear:

Decoupling & Resilience: The client no longer needs to stay connected while the agent works. A mobile app can initiate a task and immediately close, relying on the callback to deliver the result later.
Fault Tolerance: By persisting state, the ADK runtime can retry or resume failed workflow steps, making the entire system vastly more robust.
Efficiency and Scale: A push-based callback model sends one notification per completed task instead of a stream of status checks, which reduces network chatter and server load and lets the system handle far more concurrent asynchronous tasks than polling would.
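
The efficiency argument can be checked with simple arithmetic. The sketch below compares the number of requests generated by fixed-interval polling with one callback per task. The function names and the workload figures in the usage note are illustrative assumptions, and the model deliberately ignores callback retries and the initial /run request, which are the same in both approaches.

```python
import math
from typing import Dict


def polling_requests(task_seconds: float, poll_interval_seconds: float) -> int:
    """Status requests one client sends until a poll first sees completion.

    Assumes polls are sent at fixed intervals starting one interval after
    the task begins.
    """
    return math.ceil(task_seconds / poll_interval_seconds)


def total_requests(num_tasks: int, task_seconds: float,
                   poll_interval_seconds: float) -> Dict[str, int]:
    """Compare request volume for polling versus one callback per task."""
    polling = num_tasks * polling_requests(task_seconds, poll_interval_seconds)
    push = num_tasks  # one callback delivery per task, ignoring retries
    return {"polling": polling, "push": push}
```

For example, total_requests(10_000, 120, 5) models 10,000 tasks of 120 seconds each polled every 5 seconds: each client sends 24 status requests, for 240,000 requests in total, compared with 10,000 callbacks.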

Built-in orchestration primitives like these are essential for moving beyond toy agents and into the domain of complex, long-running, and dependable autonomous systems that can be trusted with meaningful work.