Async Batching for Large Datasets

Async batching is the flow-control stage that sits between an unbounded source cursor and the deterministic hashing loop inside the data extraction and hashing workflow. Its job is narrow and load-bearing: pull rows off a streaming reader as fast as the source will yield them, group them into fixed-size work units, and hand those units to CPU-bound digest computation without ever letting the number of in-flight rows grow without bound. For data engineers, migration specialists, Python pipeline builders, and platform operations teams reconciling billion-row tables, throughput is almost never limited by network bandwidth — it is limited by how well I/O-bound extraction and CPU-bound cryptographic hashing are coordinated. This stage owns that coordination.

The reason it exists as a distinct workload is that the two halves of extraction-and-hashing have incompatible performance profiles. Reading rows is I/O-bound and wants concurrency; computing a SHA-256 or BLAKE2b digest over each row is CPU-bound and blocks the event loop if run inline. Async batching resolves the tension by making the event loop responsible only for moving bounded batches through a queue, while the actual hashing is offloaded to a worker pool. Get this wrong and the pipeline fails in one of two ways: an unbounded reader exhausts heap and triggers the OOM killer, or an inline hash starves the event loop and collapses extraction throughput to a crawl.

Architectural Boundaries

This stage begins the moment a streaming, key-sorted reader is available — the mechanics of lock-light, partitioned reads live upstream in the parallel row extraction techniques reference — and it ends the moment a hashed, verifiable batch is committed to the reconciliation sink. It consumes an async iterator of typed rows keyed on a stable reconciliation key, plus a batch configuration profile. It produces ordered ReconciliationBatch objects, each carrying its rows and their per-row digests, ready for the comparison engine.

Three concerns are isolated inside the boundary and must never leak into one another. Ingestion is strictly asynchronous and does nothing but read and enqueue. Digesting is strictly synchronous CPU work, offloaded to a thread or process pool so it never runs on the event-loop thread. Commit is per-batch and idempotent, so a restart resumes from the last durably recorded batch rather than replaying the whole cursor. Keeping these three phases decoupled is what lets the same engine back both a one-shot snapshot job and a long-running incremental validator.

The canonical byte representation each row is reduced to before hashing is defined upstream by data equivalence modeling; this stage treats that serialization as a black box and concerns itself only with flow control. The digests it emits flow downstream into the structural diffing and sync engines that walk the two digest streams and route divergence into a discrepancy manifest.

Prerequisites

Confirm each of the following before wiring async batching into a reconciliation run. Every item removes a common source of memory blow-ups or event-loop stalls.

Streaming reader in place. The source exposes an async iterator (asyncpg cursor, aiomysql server-side cursor, or an async wrapper over a warehouse fetch) that yields rows incrementally — never a buffered fetchall.
Schema gate passed. Both engines’ column sets and types are reconciled by the schema validation pre-checks stage, so no structural drift reaches the hashing loop mid-run.
Memory budget declared. A hard RSS ceiling per worker is known, so max_queue_size × batch_rows × avg_row_bytes can be sized to stay inside it with headroom.
Read-only credentials. The reconciliation identity holds SELECT / SCAN only, scoped to the tables under validation — the stage never mutates source or target.
Dependency versions pinned. The Python interpreter (which fixes asyncio and hashlib behavior, since both ship in the standard library) and the async driver are version-pinned so queue and executor semantics are reproducible across hosts.
Idempotent commit target. A lightweight metadata store (Redis or a Postgres table) exists to record the last committed batch_id for resume-on-restart.

Step-by-Step Implementation

The steps below build a memory-bounded async batching engine. Each ends with an assertion or observable output so it can be verified in isolation before the next layer is added.

Step 1 — Declare the batching configuration

Centralize every flow-control and hashing parameter in one typed object. Nothing that affects memory footprint or digest output may live outside it.

python

import asyncio
import hashlib
import logging
import time
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import AsyncIterator, List

logger = logging.getLogger("reconcile.async_batching")


@dataclass(frozen=True)
class BatchConfig:
    max_queue_size: int = 1000      # bounded queue depth ⇒ hard backpressure
    batch_rows: int = 500           # rows per work unit
    worker_threads: int = 4         # digest workers
    retry_attempts: int = 3
    backoff_base: float = 0.5       # seconds; doubled per attempt
    hash_algorithm: str = "sha256"

Verify the memory envelope is what you expect; because the dataclass is frozen, any later attribute assignment raises dataclasses.FrozenInstanceError:

python

cfg = BatchConfig()
assert cfg.max_queue_size * cfg.batch_rows == 500_000   # max in-flight rows

The product max_queue_size × batch_rows is the single most important number on the page: it is the ceiling on how many rows can ever be resident at once. Multiply it by the average serialized row width to get the worst-case buffer memory, and size both factors to stay inside the declared RSS budget.
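
As a rough worked example of that arithmetic, the sketch below turns the two factors into a byte estimate. The helper name worst_case_buffer_bytes and the 2 KiB average row width are assumptions chosen for illustration; they are not part of the engine.

```python
def worst_case_buffer_bytes(max_queue_size: int, batch_rows: int, avg_row_bytes: int) -> int:
    """Upper bound on serialized bytes held by queued batches.

    Ignores Python object overhead, so measured RSS will be higher.
    """
    return max_queue_size * batch_rows * avg_row_bytes


# BatchConfig defaults (1000 x 500) with an assumed 2 KiB average serialized row.
queued_bytes = worst_case_buffer_bytes(max_queue_size=1000, batch_rows=500, avg_row_bytes=2048)
print(f"{queued_bytes / 2**30:.2f} GiB")   # 0.95 GiB
```

In-memory dict rows usually occupy more than their serialized width, so it is worth confirming the estimate against measured RSS before relying on it.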

Step 2 — Model the batch as an atomic reconciliation scope

Each batch is the unit of retry, commit, and failure isolation. Model it explicitly so its lifecycle status is observable at every stage.

python

@dataclass
class ReconciliationBatch:
    batch_id: str
    rows: List[dict]
    checksums: List[str] = field(default_factory=list)
    status: str = "pending"     # pending → hashed → committed; failed (retryable) → exhausted
    created_at: float = field(default_factory=time.time)

Verify a batch starts in the expected state:

python

b = ReconciliationBatch(batch_id="batch_000000", rows=[{"id": 1}])
assert b.status == "pending" and b.checksums == []

Step 3 — Ingest rows and enqueue with backpressure

The producer reads the async cursor, accumulates rows up to batch_rows, and puts each full batch on a bounded queue. When the queue is full, await queue.put(...) suspends the producer coroutine until a consumer frees a slot; that suspension is the backpressure that stops extraction from outrunning hashing.

python

async def extract_and_enqueue(
    source_cursor: AsyncIterator[dict],
    queue: "asyncio.Queue[ReconciliationBatch]",
    cfg: BatchConfig,
    shutdown: asyncio.Event,
) -> None:
    """I/O-bound ingestion. Reads rows, batches them, applies backpressure."""
    buffer: List[dict] = []
    counter = 0
    async for row in source_cursor:
        if shutdown.is_set():
            break
        buffer.append(row)
        if len(buffer) >= cfg.batch_rows:
            batch = ReconciliationBatch(f"batch_{counter:06d}", buffer.copy())
            await queue.put(batch)          # blocks when full ⇒ backpressure
            buffer.clear()
            counter += 1
    if buffer:                              # flush the final partial batch
        await queue.put(ReconciliationBatch(f"batch_{counter:06d}", buffer))
    logger.info("Ingestion complete: %d full batches enqueued", counter)

Verify batching and backpressure against a small fixture with a queue of depth 1:

python

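async def _drain(q: asyncio.Queue, expected: int) -> List[int]:
    """Consume `expected` batches and report how many rows each carried."""
    return [len((await q.get()).rows) for _ in range(expected)]

async def _demo_ingest():
    async def rows():
        for i in range(3):
            yield {"id": i}
    q: asyncio.Queue = asyncio.Queue(maxsize=1)
    cfg = BatchConfig(batch_rows=2)
    _, sizes = await asyncio.gather(
        extract_and_enqueue(rows(), q, cfg, asyncio.Event()),
        _drain(q, expected=2),
    )
    # 3 rows at batch_rows=2 ⇒ one full batch of 2 + one partial batch of 1.
    assert sizes == [2, 1]

asyncio.run(_demo_ingest())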
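
To see the backpressure itself rather than its effect on batching, a minimal sketch can fill a depth-1 queue and show that a second put does not complete while nothing is draining. The function name put_blocks_when_full and the 50 ms timeout are illustrative assumptions.

```python
import asyncio


async def put_blocks_when_full() -> bool:
    """Return True if put() on a full bounded queue fails to complete in time."""
    q: asyncio.Queue = asyncio.Queue(maxsize=1)
    await q.put("batch_000000")                 # fills the queue
    try:
        # No consumer is draining, so this put() can only wait.
        await asyncio.wait_for(q.put("batch_000001"), timeout=0.05)
    except asyncio.TimeoutError:
        return True
    return False


blocked = asyncio.run(put_blocks_when_full())
print(blocked)   # True
```

In the engine the same wait ends as soon as a consumer calls get(), which is what keeps the producer paced to the hashing side.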

Step 4 — Offload digesting to a worker pool

Digest computation is CPU-bound and must never run on the event-loop thread. Compute per-row digests in a plain synchronous function, then invoke it through run_in_executor so the GIL-releasing hashlib work happens on a pool thread.

python

def compute_digests(batch: ReconciliationBatch, cfg: BatchConfig) -> ReconciliationBatch:
    """CPU-bound hashing, executed on a pool thread — never the event loop."""
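    try:
        digests: List[str] = []             # local, so a failed attempt leaves no partial state
        for row in batch.rows:
            row_bytes = str(sorted(row.items())).encode("utf-8")
            digests.append(hashlib.new(cfg.hash_algorithm, row_bytes).hexdigest())
        batch.checksums = digests           # publish only a complete set, safe to retry
        batch.status = "hashed"
    except Exception:
        logger.exception("Digest computation failed for %s", batch.batch_id)
        batch.status = "failed"             # retryable; the caller decides when to give up
    return batch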

Verify determinism — the same rows must always produce the same digests:

python

cfg = BatchConfig()
b1 = compute_digests(ReconciliationBatch("b", [{"a": 1, "b": 2}]), cfg)
b2 = compute_digests(ReconciliationBatch("b", [{"b": 2, "a": 1}]), cfg)
assert b1.status == "hashed" and b1.checksums == b2.checksums

Sorting row.items() normalizes field order, so a document store returning {"b":2,"a":1} and a relational row returning (a=1, b=2) hash identically. The production-grade serialization contract that guarantees this across types is defined by the column-level checksum generation reference.
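
The simplified canonicalization normalizes field order only. The sketch below reuses the same str(sorted(...)) approach under a hypothetical helper name, row_digest, to show where it stops: value types still leak into the digest.

```python
import hashlib


def row_digest(row: dict, algorithm: str = "sha256") -> str:
    """Same simplified canonicalization as compute_digests: sort items, stringify, hash."""
    row_bytes = str(sorted(row.items())).encode("utf-8")
    return hashlib.new(algorithm, row_bytes).hexdigest()


# Field order is normalized away...
same_order = row_digest({"a": 1, "b": 2}) == row_digest({"b": 2, "a": 1})
# ...but value types are not: int 1 and float 1.0 stringify differently.
same_types = row_digest({"a": 1}) == row_digest({"a": 1.0})
print(same_order, same_types)   # True False
```

This is why type coercion has to be settled by the serialization contract before rows reach this stage, rather than inside the hashing loop.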

Step 5 — Wrap each batch in retry with exponential backoff

Transient failures (a pool thread interruption, a downstream commit hiccup) should not fail the whole job. Retry each batch a bounded number of times with exponential backoff, then mark it exhausted for the dead-letter path.

python

async def process_batch(
    batch: ReconciliationBatch,
    executor: ThreadPoolExecutor,
    cfg: BatchConfig,
) -> ReconciliationBatch:
    """Orchestrates offloaded hashing with bounded retry/backoff."""
    loop = asyncio.get_running_loop()
    for attempt in range(cfg.retry_attempts):
        result = await loop.run_in_executor(executor, compute_digests, batch, cfg)
        if result.status == "hashed":
            return result
        if attempt + 1 == cfg.retry_attempts:
            break                               # last attempt failed: do not sleep before giving up
        wait = cfg.backoff_base * (2 ** attempt)
        logger.warning(
            "Attempt %d/%d failed for %s; retrying in %.2fs",
            attempt + 1, cfg.retry_attempts, batch.batch_id, wait,
        )
        await asyncio.sleep(wait)
    batch.status = "exhausted"
    return batch
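
To sanity-check the backoff parameters before a run, the delays can be tabulated on their own. This is a small sketch; backoff_schedule is a hypothetical helper that mirrors the backoff_base * 2 ** attempt expression used in process_batch.

```python
from typing import List


def backoff_schedule(retry_attempts: int, backoff_base: float) -> List[float]:
    """Delay in seconds associated with each attempt index: base * 2**attempt."""
    return [backoff_base * (2 ** attempt) for attempt in range(retry_attempts)]


print(backoff_schedule(retry_attempts=3, backoff_base=0.5))   # [0.5, 1.0, 2.0]
```

The sum of the schedule bounds how long one failing batch can hold a worker slot, which is useful when choosing retry_attempts. A delay after the final attempt only postpones the exhausted verdict, so a retry loop may reasonably skip it.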

Step 6 — Drive the pipeline with graceful shutdown

The main loop starts the producer, drains the queue as consumers, and stops cleanly once the producer is done and the queue is empty. The finally block guarantees the producer is awaited, in-flight consumers are gathered, and the executor is shut down.

python

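class AsyncBatchingEngine:
    def __init__(self, cfg: BatchConfig):
        self.cfg = cfg
        self.queue: asyncio.Queue = asyncio.Queue(maxsize=cfg.max_queue_size)
        self.executor = ThreadPoolExecutor(max_workers=cfg.worker_threads)
        self.shutdown = asyncio.Event()

    async def _consume(self, batch: ReconciliationBatch, slots: asyncio.Semaphore) -> None:
        try:
            await process_batch(batch, self.executor, self.cfg)
            # Commit, or dead-letter routing for "exhausted", belongs here.
        finally:
            slots.release()                     # free the worker slot

    async def run(self, source_cursor: AsyncIterator[dict]) -> None:
        producer = asyncio.create_task(
            extract_and_enqueue(source_cursor, self.queue, self.cfg, self.shutdown)
        )
        # Cap batches handed to the pool. Without the cap this loop would drain
        # the queue into an unbounded pile of tasks and defeat the backpressure.
        slots = asyncio.Semaphore(self.cfg.worker_threads)
        in_flight: set = set()
        try:
            while True:
                try:
                    batch = await asyncio.wait_for(self.queue.get(), timeout=2.0)
                except asyncio.TimeoutError:
                    if (producer.done() and self.queue.empty()) or self.shutdown.is_set():
                        break
                    continue
                await slots.acquire()           # wait for a free worker slot
                task = asyncio.create_task(self._consume(batch, slots))
                in_flight.add(task)
                task.add_done_callback(in_flight.discard)   # do not retain finished batches
                await asyncio.sleep(0)          # yield to the event loop
        finally:
            try:
                await producer                  # surfaces a reader failure, if any
            finally:
                await asyncio.gather(*in_flight, return_exceptions=True)
                self.executor.shutdown(wait=True)
                logger.info("Async batching engine stopped; executor shut down.")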

Verify the engine drains a finite cursor without leaking tasks:

python

async def _smoke():
    async def rows():
        for i in range(1_050):
            yield {"id": i, "v": i * 2}
    await AsyncBatchingEngine(BatchConfig(batch_rows=500)).run(rows())
    assert not asyncio.all_tasks() - {asyncio.current_task()}

asyncio.run(_smoke())

The full engine — with connection-pool lifecycle hooks, circuit breakers, distributed tracing, and dead-letter routing — is developed in the implementing async batching for high-throughput pipelines reference.
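
The engine above detects end-of-stream by polling with a two-second timeout. A common alternative, shown here only as a sketch of a different technique and not as the engine's own method, is to have the producer enqueue a sentinel after the last batch so the consumer can stop immediately. The names produce, consume, and _DONE are hypothetical.

```python
import asyncio
from typing import List, Optional

_DONE = None  # sentinel: the producer's "no more batches" marker


async def produce(queue: "asyncio.Queue[Optional[List[int]]]", batches: List[List[int]]) -> None:
    for batch in batches:
        await queue.put(batch)      # still waits when the queue is full
    await queue.put(_DONE)          # always last, so the consumer can stop promptly


async def consume(queue: "asyncio.Queue[Optional[List[int]]]") -> List[int]:
    sizes: List[int] = []
    while True:
        batch = await queue.get()
        if batch is _DONE:
            return sizes
        sizes.append(len(batch))


async def main() -> List[int]:
    queue: asyncio.Queue = asyncio.Queue(maxsize=1)
    _, sizes = await asyncio.gather(produce(queue, [[1, 2], [3, 4], [5]]), consume(queue))
    return sizes


sizes = asyncio.run(main())
print(sizes)   # [2, 2, 1]
```

The trade-off is that a sentinel must be delivered even when the producer fails, typically from a finally block, whereas the timeout poll tolerates a producer that dies without signalling.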

Concurrency Strategy Trade-offs

Three hashing strategies are worth comparing for production async batching: inline on the event loop, offloaded to a thread pool, or offloaded to a process pool. The choice hinges on how CPU-heavy the per-row digest is relative to the extraction side. The compliance row matters because reconciliation frequently runs over regulated columns, and the digest that crosses a trust boundary must itself be defensible.

Criterion	Inline (event-loop) hashing	ThreadPoolExecutor offload	ProcessPoolExecutor offload
Event-loop safety	Blocks the loop — extraction stalls	Loop stays free; work on pool threads	Loop stays free; work in child processes
Parallelism ceiling	Single-threaded	Near-linear while `hashlib` releases the GIL	True multi-core, GIL-independent
Per-batch overhead	None	Minimal (thread handoff)	Higher (pickling rows across the process boundary)
Memory profile	Lowest	Shared address space, bounded by queue depth	Per-process copy of each batch
Best when	Digest is trivial / rows tiny	Default — `hashlib` digests over moderate rows	CPU-saturated hashing of wide rows on many cores
Compliance / regulatory	Simplest audit surface but poor throughput evidence	Threads share memory — mask/tokenize PII before enqueue	Process isolation eases per-worker credential and data-locality controls
Scale ceiling	Poor	Good for I/O-dominated reconciliation	Excellent for CPU-dominated hashing

For most heterogeneous migrations the ThreadPoolExecutor is the default: hashlib releases the GIL while hashing sufficiently large inputs, so threads can deliver near-linear speedup without paying the row-pickling cost a process pool incurs. Reserve the ProcessPoolExecutor for cases where the per-row canonicalization work (not just the hash) is heavy and CPU-bound, or where rows are too small for the GIL release to take effect, so the GIL becomes the ceiling. Regardless of pool type, prefer SHA-256 or BLAKE2b over MD5; MD5’s collision weakness disqualifies it from a defensible audit trail.
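
Because loop.run_in_executor accepts any concurrent.futures executor, the pool choice can be isolated behind one factory. The sketch below assumes a hypothetical configuration value, kind, that is not part of BatchConfig as defined earlier.

```python
from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor


def make_executor(kind: str, workers: int) -> Executor:
    """Build the digest pool. `kind` is a hypothetical setting: 'thread' or 'process'."""
    if kind == "thread":
        return ThreadPoolExecutor(max_workers=workers)
    if kind == "process":
        # Callables and arguments sent to a process pool must be picklable.
        return ProcessPoolExecutor(max_workers=workers)
    raise ValueError(f"unknown executor kind: {kind!r}")


pool = make_executor("thread", workers=4)
pool.shutdown(wait=True)
```

With a process pool the worker operates on a pickled copy of the batch, so the caller should use the batch returned by the worker rather than the object it passed in.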

Scaling and Performance

Async batching is memory-bounded on queue depth and CPU-bound on hashing throughput, so both dimensions must be engineered together.

Partitioning strategy. Shard the reconciliation space on contiguous ranges of the reconciliation key so each engine instance owns a disjoint, independently checkpointable partition. Run one AsyncBatchingEngine per partition and aggregate the digest streams centrally; keyset (seek) pagination in the upstream reader keeps extraction O(batch) rather than degrading as OFFSET grows.
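
A minimal sketch of the range split, assuming an integer reconciliation key with known minimum and maximum values. The function name partition_key_ranges is hypothetical; non-integer or sparse keys would need boundaries sampled from the source instead.

```python
from typing import List, Tuple


def partition_key_ranges(min_key: int, max_key: int, partitions: int) -> List[Tuple[int, int]]:
    """Split [min_key, max_key] into contiguous, disjoint half-open ranges [lo, hi)."""
    if partitions < 1 or max_key < min_key:
        raise ValueError("need partitions >= 1 and max_key >= min_key")
    total = max_key - min_key + 1
    step, extra = divmod(total, partitions)
    ranges: List[Tuple[int, int]] = []
    lo = min_key
    for i in range(partitions):
        hi = lo + step + (1 if i < extra else 0)   # spread the remainder over the first ranges
        if hi > lo:                                # skip empty ranges when keys < partitions
            ranges.append((lo, hi))
        lo = hi
    return ranges


print(partition_key_ranges(1, 10, 3))   # [(1, 5), (5, 8), (8, 11)]
```

Each range can then bound a keyset query of the general form WHERE key >= lo AND key < hi AND key > last_seen ORDER BY key LIMIT batch_rows, with last_seen doubling as the checkpoint for that partition.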

Batch sizing. Smaller batches lower peak memory but raise per-batch and queue-handoff overhead; larger batches improve hashing throughput but raise the worst-case buffer. Start from batch_rows ≈ memory_budget / (max_queue_size × avg_row_bytes × safety_factor) and tune against observed RSS. For tables with wide blob columns, size by byte budget rather than row count and exclude non-participating columns from the digest set.
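
The starting-point formula can be evaluated directly. In this sketch the function name suggest_batch_rows, the 2 GiB budget, the 1 KiB row width, and the safety factor of 2 are all assumptions for illustration.

```python
def suggest_batch_rows(memory_budget_bytes: int, max_queue_size: int,
                       avg_row_bytes: int, safety_factor: float = 2.0) -> int:
    """batch_rows ≈ memory_budget / (max_queue_size × avg_row_bytes × safety_factor)."""
    rows = memory_budget_bytes / (max_queue_size * avg_row_bytes * safety_factor)
    return max(1, int(rows))       # never suggest an empty batch


# Assumed inputs: 2 GiB budget, queue depth 1000, 1 KiB rows, 2x safety margin.
print(suggest_batch_rows(2 * 2**30, 1000, 1024))   # 1048
```

The result is only a first guess; the paragraph's advice to tune against observed RSS still applies, particularly when row width varies widely.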

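Memory bounding. The bounded asyncio.Queue is the memory ceiling on the producer side: because put suspends the producer when the queue is full, queued rows can never exceed max_queue_size × batch_rows. Batches already taken off the queue add to that figure, so the consumer side must cap its in-flight work as well (for example at worker_threads batches); otherwise draining the queue simply moves the backlog into pending tasks. Never replace the queue with an unbounded one “for throughput”; that removes the only backpressure mechanism and reintroduces the OOM failure mode.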

GIL and parallelism. CPython's hashlib releases the GIL only while hashing inputs larger than 2047 bytes, so a thread pool approaches linear scaling across cores when each digest covers a wide row or a multi-kilobyte buffer. The Python-level canonicalization loop is GIL-bound, as are digests over small rows, so if serialization dominates, move to a process pool sharded by key range. Tolerance-sensitive columns that feed this scaling decision are tuned via the threshold tuning for tolerance reference.

Failure Modes and Diagnostic Runbook

Each named failure mode below lists its cause, the signal that detects it, and the remediation.

OOM from an unbounded reader. Cause: a buffered fetchall or an unbounded queue accumulates the whole table in memory. Signal: worker RSS climbs linearly until the OOM killer fires, independent of batch size. Remediation: use a server-side async cursor and keep max_queue_size finite so put blocks — verify the ingestion path never calls fetchall.
Event-loop starvation. Cause: hashing (or a synchronous driver call) runs inline on the loop thread. Signal: with asyncio debug mode enabled, slow-callback warnings for callbacks that exceed loop.slow_callback_duration, plus queue wait times above ~50 ms; extraction throughput collapses. Remediation: route all CPU work through run_in_executor; audit for synchronous driver calls made directly on the loop thread.
Connection-pool exhaustion. Cause: cursors held open across batch processing instead of released after consumption. Signal: new extraction tasks block acquiring a connection; pool wait metrics spike. Remediation: release the cursor immediately after a batch is enqueued and set an explicit idle-connection lifetime on the pool (max_inactive_connection_lifetime on asyncpg; pool_recycle is the nearest aiomysql setting).
Duplicate work after restart. Cause: the pipeline resumes from the start of the cursor because completion state was never persisted. Signal: re-processing overlaps already-committed batch_ids; downstream sees duplicate digests. Remediation: record the last committed batch_id in the metadata store and resume the reader from that key boundary.
Silent batch loss on exhausted retries. Cause: a batch reaches exhausted status and is dropped without routing. Signal: the committed row count is below the source count with no mismatch reported. Remediation: route every exhausted batch to a dead-letter queue and assert committed + dead_lettered == source_rows at job end.
Backpressure deadlock on shutdown. Cause: the producer blocks on a full queue while consumers have already stopped draining. Signal: the job hangs at shutdown with a non-empty queue and idle consumers. Remediation: signal the shutdown Event the producer checks each iteration, and drain in-flight batches in the finally block before shutting the executor.
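
The end-of-job accounting check from the silent-batch-loss entry can be sketched as follows. It assumes each finished batch is summarized as a (status, row_count) pair; the helper name account_for_rows is hypothetical.

```python
from typing import Iterable, Tuple


def account_for_rows(batches: Iterable[Tuple[str, int]], source_rows: int) -> Tuple[int, int]:
    """Return (committed, dead_lettered) row counts, or raise if they miss source_rows."""
    batches = list(batches)
    committed = sum(n for status, n in batches if status == "committed")
    dead_lettered = sum(n for status, n in batches if status == "exhausted")
    if committed + dead_lettered != source_rows:
        raise RuntimeError(
            f"row accounting mismatch: committed={committed}, "
            f"dead_lettered={dead_lettered}, source={source_rows}"
        )
    return committed, dead_lettered


print(account_for_rows([("committed", 500), ("committed", 500), ("exhausted", 50)], 1050))
# (1000, 50)
```

A total above source_rows is just as informative as a shortfall: it points at duplicate work after a restart rather than at batch loss.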

In This Reference

This flow-control stage is developed further in a dedicated companion reference:

The implementing async batching for high-throughput pipelines guide extends the engine with connection-pool lifecycle hooks, circuit breakers around downstream validation APIs, distributed-tracing instrumentation, dead-letter routing, and dynamic batch resizing under memory pressure.

Frequently Asked Questions

Why a bounded queue instead of just reading everything and hashing in parallel?

Reading everything first defeats the purpose: a billion-row table will not fit in memory, and a buffered read triggers the OOM killer long before hashing starts. The bounded asyncio.Queue is the backpressure mechanism — when it fills, the producer blocks, which caps resident rows at max_queue_size × batch_rows regardless of table size. That constant memory ceiling is the whole point of the stage.

Should I use a thread pool or a process pool for hashing?

Default to a ThreadPoolExecutor: hashlib releases the GIL while hashing sufficiently large inputs, so threads can scale nearly linearly without the cost of pickling rows across a process boundary. Switch to a ProcessPoolExecutor when the Python-level serialization work per row is itself CPU-heavy, or when rows are small enough that the hash never releases the GIL, because that work is GIL-bound and threads stop helping once it dominates.

How do I pick batch_rows and max_queue_size?

Their product is the maximum number of rows ever resident, so multiply it by the average serialized row width to get worst-case buffer memory and size both to fit inside the worker’s RSS budget with headroom. Within that envelope, larger batches favour hashing throughput and smaller batches favour lower peak memory and faster shutdown drains. Tune against observed RSS on a representative partition rather than guessing.

What happens to a batch that fails every retry?

It is marked exhausted and must be routed to a dead-letter queue rather than silently dropped. Assert at job end that committed plus dead-lettered batches account for every source row; a shortfall with no reported mismatch is the signature of silent batch loss and means the dead-letter path is missing.

Related

Data Extraction & Hashing Workflows — the parent stage this batching engine feeds digests into.
Parallel row extraction techniques — the partitioned, lock-light reads that supply this stage’s source cursor.
Column-level checksum generation — the field-level digesting contract each batch’s hashing loop applies.
Schema validation pre-checks — the gate that stops structural drift from reaching the hashing loop mid-run.
Data equivalence modeling — the canonicalization rules that define what each row’s bytes are before hashing.
