Parallel Row Extraction Techniques for Cross-Engine Reconciliation

Parallel row extraction is the workload that reads records concurrently out of a source and a target engine fast enough to validate terabyte-scale datasets before the reconciliation window closes — without holding long transactions, starving downstream queues, or blocking production OLTP traffic. It is the first executable step of the data extraction and hashing workflows stage: everything the digesting and comparison logic does afterwards depends on this layer delivering every row exactly once, in a deterministic order, at a rate the rest of the pipeline can absorb. This reference is written for data engineers, migration specialists, Python pipeline builders, and platform operations teams who must design an extraction tier that balances throughput, memory footprint, and cross-engine compatibility.

The problem is deceptively simple to state and hard to get right: split a table into disjoint slices, read those slices in parallel, and hand the rows to the digesting stage without ever reading a row twice or dropping one on the floor. Get the partition boundaries wrong and workers overlap or leave gaps; forget an ORDER BY and the downstream merge diff silently misaligns; let a result set grow unbounded and a worker is OOM-killed mid-run. This page defines where the extraction workload begins and ends, walks a production-grade implementation step by step, weighs the partitioning strategies against each other, and gives a diagnostic runbook for the failure modes that recur across heterogeneous migrations.

Architectural Boundaries: What This Stage Consumes and Produces

The extraction tier begins the moment a coordinator resolves partition boundaries over a validated schema and ends the moment ordered rows land on a bounded queue for the digesting stage to drain. It consumes a read-only connection to each engine, a stable reconciliation key, and a slice plan describing non-overlapping key ranges. It produces a stream of typed rows — sorted by the reconciliation key within each slice — plus per-slice checkpoint records that make the whole job idempotent and restartable.

Three concerns are isolated inside this boundary and must never leak into each other: boundary resolution (how the key space is cut into slices), row retrieval (how a single slice is streamed with bounded memory), and flow control (how producers are paused when the consumer falls behind). Keeping them separate is what lets the same extractor feed both a batch snapshot job and a streaming validator. Crucially, the extraction layer must remain stateless with respect to row content, idempotent under retry, and strictly decoupled from hashing or comparison logic — a re-run over the same key range must emit an identical row stream.

Rows arrive here already schema-checked. The type contracts that decide which columns participate and how they align across engines are defined upstream in the cross-platform schema mapping reference, and the gate that halts the job on structural drift lives in the schema validation pre-checks stage. Once rows are extracted, they feed either the column-level checksum generation digesting path or the canonicalization logic described in data equivalence modeling. For network-bound targets where I/O latency dominates over CPU, the complementary async batching for large datasets pattern replaces the process pool shown here with an event loop over the same bounded-queue contract.

Prerequisites

Before wiring parallel extraction into a reconciliation run, confirm the following are in place. Each item removes a common source of duplicate reads, gaps, or worker death.

Schema contract resolved. Both engines’ column names, types, and nullability are captured and reconciled by the schema validation pre-checks gate — extraction must not start against an unverified schema.
Indexed reconciliation key. A stable key (single or composite) exists and is indexed on both sides, so range predicates seek rather than scan. Unindexed boundary columns turn every slice into a full table scan.
Read-only, non-blocking access. The reconciliation identity holds SELECT/SCAN only, and each session runs read-only under READ COMMITTED (in PostgreSQL, a fresh snapshot per statement) or REPEATABLE READ (one snapshot for the whole transaction, which is what Step 2 sets), never a read-write transaction that can block production writers.
Deterministic collation agreed. Source and target sort the key in the same collation (LC_ALL=C.UTF-8 for client-side sorts, or an explicit byte ordering such as COLLATE "C" on a PostgreSQL ORDER BY) so slice boundaries align across engines.
Dependency libraries pinned. psycopg2 (or asyncpg/aiomysql) is pinned to an exact version and the Python version is fixed, since concurrent.futures ships with the standard library, so streaming behaviour and server-side-cursor support are reproducible across hosts.
Backpressure budget set. A queue depth and per-worker fetch size are chosen against a memory budget, so producers block instead of buffering an unbounded result set.

Step-by-Step Implementation

The steps below build a chunked, process-pooled extraction pipeline with explicit backpressure, retry logic, and structured logging. It uses psycopg2 server-side cursors and concurrent.futures for cross-platform compatibility. Each step ends with an assertion or observable output so it can be verified before the next is layered on.

Step 1 — Resolve non-overlapping slice boundaries

A coordinator computes disjoint, half-open key ranges [lower, upper) from the key’s MIN and MAX and a target slice width (chunk_size, which equals rows per slice only when the key is dense). Half-open intervals are what guarantee no key is read twice at a boundary.

python

import logging
from dataclasses import dataclass
from typing import List, Tuple

logging.basicConfig(level=logging.INFO, format="%(asctime)s | %(levelname)s | %(message)s")
logger = logging.getLogger("reconcile.extract")


@dataclass(frozen=True)
class SlicePlan:
    table: str
    key_column: str
    chunk_size: int = 500_000


def compute_boundaries(min_key: int, max_key: int, plan: SlicePlan) -> List[Tuple[int, int]]:
    """Cut [min_key, max_key] into disjoint half-open [lower, upper) slices."""
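    if plan.chunk_size <= 0:
        raise ValueError("chunk_size must be positive")  # otherwise the loop never advances
    if min_key > max_key:
        return []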
    bounds: List[Tuple[int, int]] = []
    lower = min_key
    while lower <= max_key:
        upper = lower + plan.chunk_size
        bounds.append((lower, upper))          # half-open: upper is exclusive
        lower = upper
    logger.info("Planned %d slices over key range [%d, %d]", len(bounds), min_key, max_key)
    return bounds

Verify the slices are disjoint and gap-free:

python

plan = SlicePlan(table="orders", key_column="id", chunk_size=1000)
bounds = compute_boundaries(1, 2500, plan)
assert bounds[0] == (1, 1001)
assert all(bounds[i][1] == bounds[i + 1][0] for i in range(len(bounds) - 1))  # no gaps/overlaps
assert bounds[-1][1] > 2500                                                    # the max key is covered

Step 2 — Stream one slice with a server-side cursor

Never fetchall() a slice — a wide slice will exhaust heap and trigger an OOM kill. A named (server-side) cursor keeps resident memory bounded to itersize rows regardless of slice width, and the mandatory ORDER BY on the reconciliation key is what lets the downstream merge diff align source against target.

python

from contextlib import contextmanager
from typing import Any, Dict, Generator
import psycopg2
from psycopg2.extras import RealDictCursor


@contextmanager
def read_only_connection(dsn: str):
    conn = psycopg2.connect(dsn, cursor_factory=RealDictCursor)
    conn.set_session(readonly=True, isolation_level="REPEATABLE READ")
    try:
        yield conn
    finally:
        conn.close()


def stream_slice(
    dsn: str, plan: SlicePlan, lower: int, upper: int, itersize: int = 5_000
) -> Generator[Dict[str, Any], None, None]:
    """Stream one half-open slice via a server-side cursor to bound memory."""
    with read_only_connection(dsn) as conn:
        with conn.cursor(name=f"extract_{lower}_{upper}") as cursor:
            cursor.itersize = itersize          # server-side batch size
            cursor.execute(
                f"SELECT * FROM {plan.table} "
                f"WHERE {plan.key_column} >= %s AND {plan.key_column} < %s "
                f"ORDER BY {plan.key_column}",   # deterministic order is non-negotiable
                (lower, upper),
            )
            for row in cursor:
                yield dict(row)

The table name and key_column are interpolated from a typed plan, not from user input; the boundary values themselves are always passed as bound parameters to keep the query injection-safe and plan-cacheable.
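Because identifiers cannot be sent as bound parameters, a plan loaded from configuration is worth checking before it reaches the f-string. The following is a minimal sketch under that assumption; `checked_identifier` and `slice_query` are hypothetical helper names, not part of the pipeline above, and the allowlist pattern is deliberately stricter than what most engines accept.

```python
import re

# Plain, unquoted identifiers only: letters, digits, underscores.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def checked_identifier(name: str) -> str:
    """Return name unchanged if it is a plain SQL identifier, else raise."""
    if not _IDENT.fullmatch(name):
        raise ValueError(f"unsafe SQL identifier: {name!r}")
    return name


def slice_query(table: str, key_column: str) -> str:
    """Build the slice query text; boundary values stay as %s placeholders."""
    table = checked_identifier(table)
    key = checked_identifier(key_column)
    return f"SELECT * FROM {table} WHERE {key} >= %s AND {key} < %s ORDER BY {key}"


query = slice_query("orders", "id")
```

Where the driver offers its own identifier quoting (psycopg2 ships a `psycopg2.sql` module for this), that is likely the better tool; the allowlist here only illustrates the split between interpolated identifiers and bound values.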

Step 3 — Wrap each slice in a retrying worker with backpressure

Each worker streams its slice onto a shared bounded queue. queue.put(row, timeout=...) blocks when the queue is full, which is the backpressure signal that pauses extraction whenever the digesting consumer falls behind. A completion sentinel (None) lets the consumer count finished workers deterministically.

python

import time
from multiprocessing import Queue


def extract_worker(
    dsn: str, plan: SlicePlan, bounds: Tuple[int, int], out_queue: Queue, max_retries: int = 3
) -> None:
    lower, upper = bounds
    resume_from = lower                                 # advances as rows are emitted
    logger.info("Worker starting slice [%d, %d)", lower, upper)
    for attempt in range(1, max_retries + 1):
        try:
            for row in stream_slice(dsn, plan, resume_from, upper):
                out_queue.put(row, timeout=30)          # blocks under backpressure; raises queue.Full on timeout
                # Assumes a unique integer key: a retry resumes just past the last
                # emitted row instead of replaying rows the consumer already holds.
                resume_from = row[plan.key_column] + 1
            out_queue.put(None, timeout=30)             # completion sentinel
            logger.info("Worker completed slice [%d, %d)", lower, upper)
            return
        except Exception:
            logger.warning(
                "Attempt %d/%d failed for slice [%d, %d)", attempt, max_retries, lower, upper, exc_info=True
            )
            if attempt == max_retries:
                logger.error("Slice [%d, %d) exhausted retries; emitting sentinel", lower, upper)
                out_queue.put(None, timeout=30)
                return
            time.sleep(min(2 ** attempt, 10))           # exponential backoff, capped at 10 s

Step 4 — Fan out over a process pool and drain the queue

The coordinator submits one worker per slice to a ProcessPoolExecutor and drains rows in the main process, forwarding each to the digesting stage and counting sentinels until every worker has reported done.

python

from concurrent.futures import ProcessPoolExecutor
from multiprocessing import Manager
from typing import Callable


def run_parallel_extraction(
    dsn: str,
    plan: SlicePlan,
    boundaries: List[Tuple[int, int]],
    forward: Callable[[Dict[str, Any]], None],
    max_workers: int = 8,
    queue_size: int = 50_000,
) -> int:
    rows_seen = 0
    # A plain multiprocessing.Queue cannot be pickled into pool workers through
    # submit(); a Manager-backed queue proxy can, at the cost of an extra IPC hop.
    with Manager() as manager, ProcessPoolExecutor(max_workers=max_workers) as executor:
        out_queue = manager.Queue(maxsize=queue_size)
        for bounds in boundaries:
            executor.submit(extract_worker, dsn, plan, bounds, out_queue)
        completed = 0
        while completed < len(boundaries):
            row = out_queue.get(timeout=60)     # raises queue.Empty if every worker stalls
            if row is None:                     # a worker finished
                completed += 1
                continue
            forward(row)                        # hand to hashing / diff stage
            rows_seen += 1
    logger.info("Extraction drained %d rows across %d slices", rows_seen, len(boundaries))
    return rows_seen

Verify the fan-out end to end against an in-memory fake so the contract is exercised without a live database:

python

import multiprocessing

collected: List[Dict[str, Any]] = []
fake_bounds = [(1, 3), (3, 5)]

# test double for stream_slice that yields deterministic rows
def _fake_stream(dsn, plan, lower, upper, itersize=5000):
    for k in range(lower, upper):
        yield {"id": k}

if __name__ == "__main__":
    # Workers see the replaced stream_slice only when they are forked from this
    # process (POSIX); under spawn or forkserver they re-import the real function.
    multiprocessing.set_start_method("fork")
    stream_slice = _fake_stream  # noqa: F811  (test double)
    total = run_parallel_extraction("dsn", SlicePlan("t", "id"), fake_bounds, collected.append, max_workers=2)
    assert total == 4 and sorted(r["id"] for r in collected) == [1, 2, 3, 4]  # every row once, none dropped

For deeper tuning of process allocation, shared-memory overhead, and CPU affinity, the optimizing parallel extraction with Python multiprocessing companion carries the profiling workflow that finds the worker count where throughput plateaus.

Partitioning Strategy Trade-offs

The partitioning strategy dictates whether extraction is balanced or skewed, and whether boundaries can be computed cheaply or demand a pre-scan. Three approaches dominate production pipelines. The compliance row matters because the boundary predicate frequently touches regulated columns, and any key material that leaves a trust zone must itself be defensible.

Criterion	Range partitioning (indexed key)	Hash-bucket partitioning (mod on key)	Warehouse partition pruning
Boundary cost	Cheap — one `MIN`/`MAX` query	Requires pre-scan or statistical sampling	Free — reuse native partition metadata
Load balance	Skews on hot-spot / gappy keys	Even distribution across workers	Depends on partition design
Duplicate-read safety	Guaranteed by half-open intervals	Guaranteed by disjoint bucket ids	Guaranteed by partition isolation
Best engines	PostgreSQL, MySQL, indexed OLTP	Any keyed store, incl. document stores	Snowflake, BigQuery, Redshift
Compliance / regulatory	Predicate may expose key ranges in logs — log bucket ids, not raw PII keys	Hash buckets pseudonymize the key before it reaches worker logs	Pruning keeps raw rows inside the warehouse trust boundary
Scale ceiling	Excellent when key is monotonic and dense	Excellent, but pre-scan cost grows with cardinality	Excellent for columnar, partition-native tables

For most relational-to-relational migrations, range partitioning on a dense, monotonically increasing key is the default: boundaries are a single query and half-open intervals make duplicate reads impossible. Switch to hash-bucket partitioning when the key is sparse, gappy, or subject to hot-spot writes that would starve one worker while others idle — and when pseudonymizing the key in logs matters for the security posture the reconciliation architecture mandates. For cloud warehouses, lean on native partition pruning and query pushdown rather than emulating cursors the engine was never built to hold open.
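To make the hash-bucket option concrete, here is a small sketch of a stable bucket assignment over a gappy key set. The helper name `bucket_of`, the choice of SHA-256, and the string canonicalisation of the key are all assumptions for illustration; in a real run the same bucket expression would have to be evaluated identically on both engines, whose built-in hash functions generally differ.

```python
import hashlib
from collections import Counter


def bucket_of(key: object, buckets: int) -> int:
    """Map a key to a stable bucket id in [0, buckets)."""
    # hashlib is used instead of hash() because hash() is salted per process.
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


# A dense block plus a sparse tail: the shape that skews range slices.
keys = list(range(1, 1001)) + list(range(1_000_000, 1_010_000, 100))

# Each key lands in exactly one bucket, so buckets are disjoint by construction.
load = Counter(bucket_of(k, 4) for k in keys)
```

Printing `load` shows how the rows spread over the four workers. Logging the bucket id rather than the raw key is what gives the pseudonymisation mentioned in the table.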

Scaling and Performance

Parallel extraction is I/O-bound on the wire and connection-bound on the database, so both dimensions must be engineered rather than guessed.

Partitioning strategy. Shard on contiguous ranges of the indexed reconciliation key so each worker owns a disjoint, independently checkpointable slice. Keyset (seek) pagination via the WHERE key >= lower AND key < upper predicate keeps each fetch O(batch); never paginate with OFFSET, which re-scans skipped rows and drifts under concurrent writes.
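As an illustration of the seek pattern, the sketch below pages through one half-open slice with a key predicate and LIMIT instead of OFFSET. It uses an in-memory SQLite table purely so it runs without a server; the `orders` table, its columns, and the `keyset_batches` helper are hypothetical stand-ins.

```python
import sqlite3
from typing import Iterator, List, Tuple


def keyset_batches(
    conn: sqlite3.Connection, lower: int, upper: int, batch: int
) -> Iterator[List[Tuple[int, str]]]:
    """Yield batches of one [lower, upper) slice, seeking past the last key read."""
    op, cursor_key = ">=", lower                # first batch includes the lower bound
    while True:
        rows = conn.execute(
            f"SELECT id, payload FROM orders WHERE id {op} ? AND id < ? ORDER BY id LIMIT ?",
            (cursor_key, upper, batch),
        ).fetchall()
        if not rows:
            return
        yield rows
        op, cursor_key = ">", rows[-1][0]       # later batches start strictly after the last key


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, payload TEXT)")
ids = [1, 2, 3, 5, 8, 13, 21, 34]               # deliberately gappy keys
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(i, f"row-{i}") for i in ids])

batches = [[row[0] for row in b] for b in keyset_batches(conn, 2, 21, 3)]
```

Each query seeks through the primary-key index to its starting point, so the cost of a batch does not grow with how far into the slice it sits.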

Batch sizing. Size the server-side cursor itersize and queue depth against a fixed memory budget: itersize ≈ memory_budget / (avg_row_bytes × workers × safety_factor). Because the consumer holds at most queue_size rows, the dominant resident cost is the queue plus each worker’s fetch buffer — tune those two numbers, not the slice width.
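The sizing rule can be worked through numerically. This sketch applies the formula above as written; the function names, the default safety factor of 2, the floor of 100 rows, and the resident-memory estimate are assumptions, and the estimate ignores Python object overhead, which can be several times the raw row size.

```python
def plan_itersize(
    memory_budget_bytes: int,
    avg_row_bytes: int,
    workers: int,
    safety_factor: float = 2.0,
    floor: int = 100,
) -> int:
    """itersize ~= memory_budget / (avg_row_bytes * workers * safety_factor)."""
    raw = memory_budget_bytes / (avg_row_bytes * workers * safety_factor)
    return max(floor, int(raw))


def resident_estimate_bytes(itersize: int, queue_size: int, avg_row_bytes: int, workers: int) -> int:
    """Rough upper bound: a full queue plus one fetch buffer per worker."""
    return avg_row_bytes * (queue_size + itersize * workers)


# 512 MiB budget, 1 KiB rows, 8 workers.
itersize = plan_itersize(512 * 1024**2, 1024, 8)
estimate = resident_estimate_bytes(itersize, queue_size=50_000, avg_row_bytes=1024, workers=8)
```

With these inputs the formula gives an itersize of 32,768 rows, and the estimate of queue plus fetch buffers comes to roughly 305 MiB, inside the 512 MiB budget.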

Memory bounding. Server-side cursors mean a slice of arbitrary width costs constant client memory; the risk moves to the database, where every open cursor holds its own session state and, if the plan needs a sort, can claim up to work_mem before spilling to temporary files. Cap concurrent cursors so the combined footprint of many simultaneous slices stays within what the engine can hold.

GIL and parallelism. Socket reads release the GIL, but decoding rows into Python objects does not, so ProcessPoolExecutor keeps that CPU-adjacent work off a single interpreter lock and scales near-linearly until the database’s connection or I/O ceiling is hit. Threads are enough only for the lightest slices. Past a handful of workers the bottleneck is the engine, not Python, which is why over-provisioning workers on a low-bandwidth link wastes connection slots and inflates tail latency instead of adding throughput.

Failure Modes and Diagnostic Runbook

Each named failure mode below lists its cause, the signal that detects it, and the remediation.

Overlapping or gapped slices. Cause: closed intervals or off-by-one boundary math read a key twice or skip one. Signal: the summed per-slice row counts disagree with a COUNT(*) over the same key range, or the downstream diff reports duplicate keys. Remediation: use strictly half-open [lower, upper) intervals and assert bounds[i][1] == bounds[i + 1][0] in CI (Step 1).
OOM on wide slices. Cause: a client-side fetchall() or an unnamed cursor materializes the whole slice. Signal: worker RSS climbs linearly until the OOM killer fires. Remediation: always use a named server-side cursor with a bounded itersize; size batches by byte budget, not row count.
Idle-in-transaction pileup. Cause: a worker holds a long read transaction, which pins old row versions so autovacuum cannot reclaim them and, on lock-based engines, can block production writers. Signal: pg_stat_activity shows idle in transaction sessions with growing age. Remediation: run read-only under snapshot isolation, cap statement/transaction timeouts, and close cursors immediately after a slice completes.
Connection-pool saturation. Cause: more workers than the engine’s connection ceiling. Signal: new workers block on connect; the database rejects sessions. Remediation: bound max_workers below the pool limit and share a pooler (PgBouncer / RDS Proxy) so slots are recycled.
Silent misalignment from missing sort. Cause: a slice query without ORDER BY returns rows in engine-defined order, so the merge diff drifts. Signal: long runs of alternating missing-in-source / missing-in-target that vanish when re-sorted. Remediation: enforce ORDER BY on the reconciliation key in every slice and assert monotonic keys at the queue boundary; tune tolerance separately per the threshold tuning for tolerance reference.
Lost slices on worker preemption. Cause: a worker dies after emitting rows but before checkpointing, so a restart reprocesses or skips it. Signal: re-runs produce a different row count for the same key range. Remediation: persist per-slice completion to a lightweight state store (Redis, DynamoDB, or a pipeline_state table) and have the coordinator skip completed slices on restart.
Phantom discrepancies from queue backlog. Cause: the consumer stalls, the queue fills, and a put times out, aborting a slice mid-stream. Signal: sustained queue-full metrics with slice retries. Remediation: raise the put timeout or queue depth, and treat sustained fullness as a downstream hashing bottleneck — not an extraction limit — routing it to the structural mismatch detection triage flow only after extraction is ruled out.
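The per-slice checkpointing named in the preemption remediation can be sketched with a small state table. SQLite stands in for the state store here so the example is self-contained; the column layout of `pipeline_state`, the `job_id` value, and the helper names are assumptions rather than a prescribed schema.

```python
import sqlite3
from typing import List, Tuple

Slice = Tuple[int, int]


def open_state(path: str = ":memory:") -> sqlite3.Connection:
    """Open the state store and create the checkpoint table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pipeline_state ("
        " job_id TEXT NOT NULL, lower_key INTEGER NOT NULL, upper_key INTEGER NOT NULL,"
        " row_count INTEGER NOT NULL, PRIMARY KEY (job_id, lower_key, upper_key))"
    )
    return conn


def mark_done(conn: sqlite3.Connection, job_id: str, bounds: Slice, row_count: int) -> None:
    """Record a finished slice; re-recording the same slice is harmless."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO pipeline_state VALUES (?, ?, ?, ?)",
            (job_id, bounds[0], bounds[1], row_count),
        )


def pending(conn: sqlite3.Connection, job_id: str, plan: List[Slice]) -> List[Slice]:
    """Return the slices of the plan that have no completion record yet."""
    done = {
        (lo, up)
        for lo, up in conn.execute(
            "SELECT lower_key, upper_key FROM pipeline_state WHERE job_id = ?", (job_id,)
        )
    }
    return [b for b in plan if b not in done]


state = open_state()
plan = [(1, 1001), (1001, 2001), (2001, 3001)]
mark_done(state, "job-1", (1001, 2001), 1000)
todo = pending(state, "job-1", plan)
```

On restart the coordinator would submit only `todo`. A slice that died mid-stream has no record, so it is re-read in full, and any rows it had already emitted need to be discarded or deduplicated downstream.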

In This Reference

This extraction model is developed further in a dedicated companion reference:

The optimizing parallel extraction with Python multiprocessing guide profiles process allocation, shared-memory and pickling overhead, worker-count tuning, and CPU affinity to find the point where adding workers stops adding throughput.

Frequently Asked Questions

Why half-open intervals instead of BETWEEN for slice boundaries?

BETWEEN is inclusive on both ends, so two adjacent slices that share a boundary value both read the boundary key and the downstream diff sees a duplicate. Half-open [lower, upper) intervals — key >= lower AND key < upper — make each key belong to exactly one slice, which is what guarantees every row is read once with no gaps.
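The boundary effect is easy to reproduce without a database. This sketch counts how many slices would read each key under inclusive and half-open predicates; `read_counts` is a hypothetical helper and the keys are plain integers.

```python
from collections import Counter
from typing import List, Tuple


def read_counts(keys: List[int], slices: List[Tuple[int, int]], inclusive_upper: bool) -> Counter:
    """Count how many slices would read each key."""
    counts: Counter = Counter()
    for lower, upper in slices:
        for k in keys:
            if inclusive_upper:
                hit = lower <= k <= upper       # BETWEEN lower AND upper
            else:
                hit = lower <= k < upper        # key >= lower AND key < upper
            if hit:
                counts[k] += 1
    return counts


keys = list(range(1, 11))
slices = [(1, 5), (5, 11)]                      # adjacent slices sharing boundary value 5

between = read_counts(keys, slices, inclusive_upper=True)
half_open = read_counts(keys, slices, inclusive_upper=False)

duplicated = [k for k, n in between.items() if n > 1]
exactly_once = all(half_open[k] == 1 for k in keys)
```

Under the inclusive predicate the shared boundary key 5 is read by both slices; under the half-open predicate every key is read exactly once.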

Multiprocessing or asyncio for parallel extraction?

Choose by where the time goes. Process-based extraction sidesteps the GIL and suits CPU-adjacent work or drivers without async support, but pays fork and pickling overhead. When the workload is almost entirely network I/O against many targets, the event-loop approach in the async batching reference scales more cheaply over the same bounded-queue contract. Both must enforce the same backpressure and ordering guarantees.
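For comparison, the same bounded-queue contract can be sketched on an event loop. This is not the implementation from the async batching reference; `fake_fetch` is a hypothetical stand-in for an async driver, and the tiny queue size is chosen only to force producers to wait on the consumer.

```python
import asyncio
from typing import AsyncIterator, Dict, List, Optional, Tuple


async def fake_fetch(lower: int, upper: int) -> AsyncIterator[Dict[str, int]]:
    """Stand-in for an async driver streaming one [lower, upper) slice."""
    for k in range(lower, upper):
        await asyncio.sleep(0)                  # yield control, as a network read would
        yield {"id": k}


async def producer(bounds: Tuple[int, int], queue: "asyncio.Queue[Optional[Dict[str, int]]]") -> None:
    async for row in fake_fetch(*bounds):
        await queue.put(row)                    # suspends while the queue is full
    await queue.put(None)                       # completion sentinel, as in the process version


async def extract_all(boundaries: List[Tuple[int, int]], queue_size: int = 2) -> List[Dict[str, int]]:
    queue: "asyncio.Queue[Optional[Dict[str, int]]]" = asyncio.Queue(maxsize=queue_size)
    tasks = [asyncio.create_task(producer(b, queue)) for b in boundaries]
    rows: List[Dict[str, int]] = []
    finished = 0
    while finished < len(boundaries):
        item = await queue.get()
        if item is None:
            finished += 1
        else:
            rows.append(item)
    await asyncio.gather(*tasks)                # surface any producer exception
    return rows


rows = asyncio.run(extract_all([(1, 3), (3, 5)]))
```

The consumer loop and sentinel counting mirror Step 4, which suggests the two approaches can share the downstream digesting code unchanged.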

How do I keep source and target slice boundaries aligned across engines?

Compute boundaries once from a single authority and apply the identical [lower, upper) predicates to both engines, and force the same collation on both ORDER BY clauses (LC_ALL=C.UTF-8 for client-side sorts, or an explicit byte ordering such as COLLATE "C" in PostgreSQL). If one side sorts by database collation and the other by byte order, otherwise-aligned slices misalign and manufacture phantom discrepancies.
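A small sketch shows how the same boundaries select different keys under two orderings. Code-point order stands in for a byte-order collation, and `str.casefold` is used as a rough stand-in for a case-insensitive database collation; real collations are more involved, and `slice_members` is a hypothetical helper.

```python
from typing import Callable, List


def slice_members(
    keys: List[str], lower: str, upper: str, sort_key: Callable[[str], str]
) -> List[str]:
    """Keys an engine would place in [lower, upper) under its own ordering."""
    lo, up = sort_key(lower), sort_key(upper)
    return sorted((k for k in keys if lo <= sort_key(k) < up), key=sort_key)


keys = ["apple", "Banana", "cherry", "Date"]

# Same boundaries, two orderings.
byte_order = slice_members(keys, "a", "d", sort_key=lambda s: s)
case_folded = slice_members(keys, "a", "d", sort_key=str.casefold)
```

In code-point order the uppercase keys sort before "a" and fall outside the slice, while the case-folded ordering pulls "Banana" in. A diff over that slice would report "Banana" as missing on one side even though both engines hold it.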

Does parallel extraction ever write to the source engine?

No. Extraction is strictly read-only and stateless with respect to row content. The reconciliation identity holds SELECT/SCAN grants only, sessions run read-only under snapshot isolation, and the only state the stage persists is per-slice checkpoint progress so a restart is idempotent.

Related

Data extraction & hashing workflows — the stage overview this extraction tier is the first step of.
Async batching for large datasets — the event-loop alternative for I/O-bound, network-heavy extraction.
Schema validation pre-checks — the gate that must pass before extraction opens a single cursor.
Column-level checksum generation — the digesting stage that consumes the row stream this tier produces.
SQL to NoSQL sync validation — applying partitioned extraction across consistency-model boundaries during live cutovers.

For cursor lifecycle semantics, consult the psycopg2 documentation on server-side cursors; for the process-pool API used above, consult Python’s concurrent.futures reference.
