Optimizing Parallel Extraction with Python Multiprocessing

This page answers a single, sharp question: when the parallel row extraction tier is CPU-bound — because each row must be canonicalized and digested before it leaves the worker — how do you size and drive a Python multiprocessing pool so it saturates cores without exhausting RAM or the database connection budget? It assumes the schema has already cleared the schema validation pre-checks gate and that slice boundaries have been resolved upstream; the job here is turning those disjoint key ranges into an ordered, once-only row stream at maximum throughput. If your bottleneck is network I/O against many remote targets rather than local CPU, the event-loop approach in async batching for large datasets is the better fit — process-based extraction earns its overhead only when real per-row compute is on the critical path.

Problem Framing: Where Multiprocessing Wins and Where It Silently Corrupts

Consider a concrete migration: 8 billion rows moving from a PostgreSQL source to a document-store target, where every row must be hashed with a deterministic digest so a later column-level checksum generation comparison can prove bitwise parity. Single-process extraction pins one core at 100% while 31 others idle, and the reconciliation window closes before the scan finishes. The obvious fix — a ProcessPoolExecutor with max_workers set to the core count — introduces two failure modes that never appear in a small test run:

Memory blowout. Each worker holds its own connection buffer, cursor prefetch window, and Python object graph. Thirty-two workers each buffering a 5,000-row slice of a wide table can request more than the box has, and the kernel responds with OOM kills mid-slice — losing exactly the rows a naive job never re-reads.
Connection exhaustion. Database drivers do not support sharing a connection or cursor across a fork() boundary, so every worker needs its own. A pool sized to cores rather than to the engine’s max_connections triggers "too many connections" errors under load and leaves "idle in transaction" sessions holding locks.

The process boundary compounds both problems: multiprocessing.Pool and ProcessPoolExecutor alike serialize every argument and result with pickle, so wide rows or nested JSON pay a serialization tax that grows with row width and nesting depth on the way out of each worker. The pattern below sidesteps that by keeping row bytes inside the worker (only compact digest records cross the process boundary) and by deriving worker count from a memory budget rather than from os.cpu_count().
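To make the serialization cost concrete, the sketch below compares the pickled size of one slice of raw rows with the pickled size of a compact digest record. The row shape (20 columns, one of them nested) and the record layout are illustrative assumptions, not the pipeline's real schema.

```python
import hashlib
import pickle


def make_rows(n: int) -> list[tuple]:
    """Build n synthetic wide rows with one nested column (assumed shape)."""
    return [
        (i, f"customer-{i}", {"tags": ["a", "b"], "attrs": {"k": i}}) + ("x" * 32,) * 17
        for i in range(n)
    ]


rows = make_rows(5000)

# Compact record: what would cross the process boundary instead of the rows.
record = {
    "partition_id": 0,
    "row_count": len(rows),
    "slice_digest": hashlib.blake2b(repr(rows).encode("utf-8"), digest_size=16).hexdigest(),
}

raw_bytes = len(pickle.dumps(rows))
record_bytes = len(pickle.dumps(record))
print(f"raw rows: {raw_bytes} bytes, digest record: {record_bytes} bytes")
```

The record's size stays fixed however many rows the slice holds, whereas the raw payload grows with every row and every nested value.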

Implementation: A Memory-Budgeted, Connection-Isolated Extraction Pool

The module below is the production skeleton. It (a) derives a safe worker count from a memory budget and the engine connection ceiling, (b) gives every worker its own read-only connection via a factory, (c) streams each slice with a bounded server-side cursor and a mandatory ORDER BY, (d) computes a deterministic per-row digest inside the worker, and (e) emits one compact SliceResult per slice with structured logging and explicit error capture, with no toy placeholders.

python

from __future__ import annotations

import hashlib
import logging
import os
import resource
from concurrent.futures import ProcessPoolExecutor, as_completed
from dataclasses import dataclass, field

import psycopg2

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(processName)s %(levelname)s %(message)s",
)
logger = logging.getLogger("parallel_extraction")


@dataclass(frozen=True)
class Slice:
    """A half-open [lower, upper) key range assigned to one worker."""
    partition_id: int
    lower: int
    upper: int


@dataclass
class SliceResult:
    """Compact result that crosses the process boundary — never raw rows."""
    partition_id: int
    row_count: int
    slice_digest: str
    last_key: int | None = None
    error: str | None = None
    failed_keys: list[int] = field(default_factory=list)


def plan_worker_count(
    per_worker_bytes: int,
    reserved_bytes: int = 2 * 1024**3,
    engine_max_connections: int = 100,
    connection_headroom: float = 0.5,
) -> int:
    """Derive a safe worker count from RAM budget AND the connection ceiling.

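    The pool is the minimum of three ceilings: logical CPUs (os.cpu_count()
    counts hardware threads, not physical cores), the memory budget, and
    the share of engine connections we are allowed to consume.
    """
    # SC_AVAIL_PHYS_PAGES is available on Linux; other platforms may not define it.
    available = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_AVAIL_PHYS_PAGES")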
    mem_ceiling = max(1, (available - reserved_bytes) // per_worker_bytes)
    conn_ceiling = max(1, int(engine_max_connections * connection_headroom))
    core_ceiling = os.cpu_count() or 1
    workers = min(core_ceiling, mem_ceiling, conn_ceiling)
    logger.info(
        "worker plan: cores=%d mem_ceiling=%d conn_ceiling=%d -> workers=%d",
        core_ceiling, mem_ceiling, conn_ceiling, workers,
    )
    return workers


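def worker_connection_factory(dsn: str):
    """Open a dedicated read-only session for THIS worker process.

    Connections are never shared across the fork boundary; each worker
    builds its own. The session runs a read-only REPEATABLE READ
    transaction rather than autocommit, because psycopg2 named
    (server-side) cursors require an open transaction and each slice
    should be read from one stable snapshot.
    """
    conn = psycopg2.connect(dsn)
    conn.set_session(isolation_level="REPEATABLE READ", readonly=True,
                     autocommit=False)
    return conn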


def _row_digest(row: tuple) -> bytes:
    """Deterministic per-row digest over a canonical byte encoding.

    Uses BLAKE2b (specified in RFC 7693) for speed with a
    cryptographic guarantee. NULLs are encoded distinctly from empty
    strings so silent truncation cannot collide with a real value.
    """
    h = hashlib.blake2b(digest_size=16)
    for value in row:
        token = b"\x00NULL\x00" if value is None else repr(value).encode("utf-8")
        h.update(len(token).to_bytes(8, "big"))
        h.update(token)
    return h.digest()


def extract_slice(dsn: str, table: str, key_col: str, sl: Slice,
                  fetch_size: int = 5000, statement_timeout_ms: int = 30_000) -> SliceResult:
    """Stream one slice, digest each row in-process, return a compact result.

    Runs in a worker process. A hard address-space ceiling turns a runaway
    buffer into a catchable MemoryError instead of a silent kernel OOM kill.
    """
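    _, hard = resource.getrlimit(resource.RLIMIT_AS)
    ceiling = 4 * 1024**3
    if hard != resource.RLIM_INFINITY:
        ceiling = min(ceiling, hard)  # the soft limit may never exceed the hard limit
    resource.setrlimit(resource.RLIMIT_AS, (ceiling, hard))

    running = hashlib.blake2b(digest_size=16)
    count = 0
    last_key: int | None = None
    conn = None
    try:
        conn = worker_connection_factory(dsn)
        # A psycopg2 named cursor accepts a single execute(), so session
        # settings are applied through a plain cursor first.
        with conn.cursor() as setup:
            setup.execute("SET statement_timeout TO %s", (statement_timeout_ms,))
        with conn.cursor(name=f"extract_{sl.partition_id}") as cur:
            cur.itersize = fetch_size
            # table and key_col are interpolated as identifiers, so they must
            # come from trusted configuration, never from user input.
            cur.execute(
                f"SELECT * FROM {table} "
                f"WHERE {key_col} >= %s AND {key_col} < %s ORDER BY {key_col}",
                (sl.lower, sl.upper),
            )
            key_index: int | None = None
            for row in cur:
                if key_index is None:
                    # On a server-side cursor, description may be None until
                    # the first fetch, so resolve the key column lazily.
                    key_index = [d.name for d in cur.description].index(key_col)
                running.update(_row_digest(row))
                last_key = row[key_index]
                count += 1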
        logger.info("slice %d ok: rows=%d", sl.partition_id, count)
        return SliceResult(sl.partition_id, count, running.hexdigest(), last_key)
    except (psycopg2.Error, MemoryError) as exc:
        logger.error("slice %d failed after %d rows: %s", sl.partition_id, count, exc)
        return SliceResult(sl.partition_id, count, running.hexdigest(),
                           last_key, error=repr(exc))
    finally:
        if conn is not None:
            conn.close()


def run_extraction(dsn: str, table: str, key_col: str, slices: list[Slice],
                   per_worker_bytes: int = 512 * 1024**2) -> dict[int, SliceResult]:
    """Fan slices across a memory-budgeted process pool; collect results."""
    workers = plan_worker_count(per_worker_bytes)
    results: dict[int, SliceResult] = {}
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = {
            pool.submit(extract_slice, dsn, table, key_col, sl): sl
            for sl in slices
        }
        for fut in as_completed(futures):
            sl = futures[fut]
            try:
                results[sl.partition_id] = fut.result()
            except Exception as exc:  # BrokenProcessPool, unpickleable errors
                logger.critical("slice %d crashed the worker: %s", sl.partition_id, exc)
                results[sl.partition_id] = SliceResult(
                    sl.partition_id, 0, "", error=repr(exc))
    return results

Key Implementation Notes

Worker count is a minimum of three ceilings, never just cores. plan_worker_count takes the smallest of the logical CPU count reported by os.cpu_count(), the RAM budget ((available − reserved) / per_worker_bytes), and a bounded share of the engine’s max_connections. This is the single change that stops the pool from OOM-killing itself or tripping too many connections on the source.
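Digest choice: BLAKE2b, not MD5. BLAKE2b (RFC 7693) is faster than MD5 on modern 64-bit CPUs while remaining collision-resistant, so it is safe when the digest is the integrity signal itself. The 16-byte digest_size used here truncates the output to 128 bits, which gives roughly 64 bits of collision resistance; raise digest_size if deliberately crafted collisions are part of the threat model. Reserve MD5 strictly for non-security change-detection; the trade-off is weighed in full under column-level checksum generation.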
Length-prefixed, NULL-tagged encoding prevents silent truncation. Each field is framed by its byte length and NULLs get a distinct sentinel, so ("ab", "c") and ("a", "bc") never hash equal and a truncated string cannot collide with the original. This is the edge case that makes cross-engine parity trustworthy rather than merely plausible.
Only SliceResult crosses the pickle boundary. Raw rows stay inside the worker; the parent receives a compact digest and count. This is what removes serialization from the scaling curve for wide or deeply nested rows.
RLIMIT_AS converts a kernel OOM kill into a catchable MemoryError. A per-process address-space ceiling means a runaway prefetch degrades to a captured error on that slice instead of a silent kill that loses rows a naive re-run never re-reads.
Compliance stays inside the worker. Because row bytes never leave the process in plaintext, PII masking or tokenization can be applied in _row_digest before any inter-process communication — sensitive fields never traverse the IPC channel. Data-residency routing belongs here too, aligned with the cross-platform schema mapping reference so region tags are honoured at the point of extraction.
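The framing note above can be checked directly. The sketch below reuses the same length-prefix and NULL-sentinel scheme and sets it against a deliberately naive concatenation; naive_digest is a hypothetical counter-example written only for this comparison.

```python
import hashlib


def framed_digest(row: tuple) -> str:
    """Length-prefixed, NULL-tagged digest using the framing described above."""
    h = hashlib.blake2b(digest_size=16)
    for value in row:
        token = b"\x00NULL\x00" if value is None else repr(value).encode("utf-8")
        h.update(len(token).to_bytes(8, "big"))  # frame each field by its byte length
        h.update(token)
    return h.hexdigest()


def naive_digest(row: tuple) -> str:
    """Counter-example: concatenates field text with no framing and no NULL tag."""
    h = hashlib.blake2b(digest_size=16)
    for value in row:
        h.update(b"" if value is None else str(value).encode("utf-8"))
    return h.hexdigest()


# Shifted field boundary: the naive scheme collides, the framed scheme does not.
print(naive_digest(("ab", "c")) == naive_digest(("a", "bc")))
print(framed_digest(("ab", "c")) == framed_digest(("a", "bc")))

# NULL versus empty string: same outcome.
print(naive_digest((None,)) == naive_digest(("",)))
print(framed_digest((None,)) == framed_digest(("",)))
```

One caveat worth stating: repr() is a Python-specific encoding, so parity with a non-Python consumer holds only if that side reproduces the same byte layout.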

Verification: Prove the Parallel Run Equals the Serial Baseline

Determinism is the whole point, so assert it. Run a single-process baseline over a bounded sample and confirm the parallel pool produces an identical aggregate digest. Independence from completion order matters: fold the per-slice digests in partition_id order so the combined result does not depend on which worker finishes first.

python

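def combine(results: dict[int, SliceResult]) -> str:
    """Fold per-slice digests in partition_id order into one job digest.

    Sorting by partition_id makes the result independent of the order in
    which workers finish.
    """
    h = hashlib.blake2b(digest_size=16)
    for pid in sorted(results):
        r = results[pid]
        if r.error is not None:  # explicit raise: asserts vanish under python -O
            raise RuntimeError(f"slice {pid} failed: {r.error}")
        h.update(r.slice_digest.encode())
    return h.hexdigest()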


def serial_baseline(dsn: str, table: str, key_col: str, slices: list[Slice]) -> str:
    single = {sl.partition_id: extract_slice(dsn, table, key_col, sl) for sl in slices}
    return combine(single)


# Regression assertion: parallel MUST equal serial over the same key space.
# dsn and slices are assumed to be defined by the surrounding job setup.
if __name__ == "__main__":  # guard needed when workers start via spawn or forkserver
    parallel = combine(run_extraction(dsn, "public.orders", "id", slices))
    serial = serial_baseline(dsn, "public.orders", "id", slices)
    assert parallel == serial, "non-determinism: parallel digest diverged from baseline"
    print("verified: parallel digest matches serial baseline", parallel)

A mismatch here is a bug you want to catch before production: it points to floating-point precision drift, implicit type coercion across the driver, timezone normalization differing between runs, or writes landing on the source between the two scans. You can reproduce the same check from the shell with python -c "from extract import *; ..." in CI against a seeded fixture table.
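Why the fold must follow partition_id order rather than completion order can be shown without a database. In the sketch below the per-slice digests and the two finish orders are made-up stand-ins for two runs of the same job.

```python
import hashlib


def fold(digests_in_order: list[str]) -> str:
    """Fold hex digests into one job digest, in exactly the order given."""
    h = hashlib.blake2b(digest_size=16)
    for d in digests_in_order:
        h.update(d.encode("ascii"))
    return h.hexdigest()


# Hypothetical per-slice digests keyed by partition_id.
slice_digests = {
    pid: hashlib.blake2b(f"slice-{pid}".encode("ascii"), digest_size=16).hexdigest()
    for pid in range(8)
}

# Two runs of the same job whose workers happened to finish in different orders.
finish_order_a = [3, 0, 7, 1, 5, 2, 6, 4]
finish_order_b = list(reversed(finish_order_a))

by_completion_a = fold([slice_digests[p] for p in finish_order_a])
by_completion_b = fold([slice_digests[p] for p in finish_order_b])
by_partition_a = fold([slice_digests[p] for p in sorted(finish_order_a)])
by_partition_b = fold([slice_digests[p] for p in sorted(finish_order_b)])

print("completion-order fold stable:", by_completion_a == by_completion_b)
print("partition-order fold stable: ", by_partition_a == by_partition_b)
```

Folding in completion order makes the job digest a function of scheduling, which is exactly the non-determinism the regression assertion is meant to exclude.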

Operational Considerations

Explicit fallback tiers for OOM and broken pools. When run_extraction reports BrokenProcessPool or repeated MemoryErrors, degrade deterministically rather than retrying blindly:

Tier 1: throttle. Halve max_workers and keep one slice per task (chunksize=1 if you dispatch with map rather than submit) to isolate the leaking slice; re-run with PYTHONFAULTHANDLER=1 to capture a native traceback.
Tier 2: decouple I/O from CPU. Move slice reads to a ThreadPoolExecutor and keep only the digest step on the ProcessPoolExecutor, avoiding driver thread-safety issues while preserving parallel compute.
Tier 3: single-process async. Fall back to the async batching path, trading 30–40% throughput for deterministic memory and a simpler connection lifecycle.
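The tier ladder above can be driven by a small dispatcher. This is a minimal sketch, assuming each tier is wrapped in a zero-argument callable; run_with_fallback and the stand-in tier functions are hypothetical names, and the failures are simulated rather than produced by a real pool.

```python
from concurrent.futures.process import BrokenProcessPool
from typing import Callable


def run_with_fallback(tiers: list[tuple[str, Callable[[], dict]]]) -> tuple[str, dict]:
    """Try each tier in order; move down only on pool breakage or memory errors."""
    last_exc = None
    for name, attempt in tiers:
        try:
            return name, attempt()
        except (BrokenProcessPool, MemoryError) as exc:
            last_exc = exc
            print(f"tier {name!r} failed with {exc!r}; degrading")
    raise RuntimeError("all fallback tiers exhausted") from last_exc


def full_pool() -> dict:
    raise BrokenProcessPool("a worker died")  # simulated pool failure


def throttled_pool() -> dict:
    raise MemoryError("slice still too wide")  # simulated Tier 1 failure


def threads_plus_digest_pool() -> dict:
    return {0: "digest-from-tier-2"}  # placeholder result


tier_used, results = run_with_fallback([
    ("full pool", full_pool),
    ("tier 1: throttle", throttled_pool),
    ("tier 2: decouple I/O", threads_plus_digest_pool),
    ("tier 3: single-process async", lambda: {0: "unused"}),
])
print(tier_used)
```

Any other exception type propagates immediately, so a genuine bug is not masked by a silent downgrade.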

Checkpointing for idempotent restarts. Persist each SliceResult’s partition_id, last_key, and slice_digest to a state table. On restart, skip slices whose digest already matches and resume partial slices from last_key — no full-table rescan, which directly cuts cloud compute and network egress spend.
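A restart plan built from such a state table might look like the sketch below. It uses an in-memory SQLite database as a stand-in for the real state store, assumes integer keys, and adds a complete flag that the text does not mention; the table and function names are hypothetical.

```python
import sqlite3


def open_state(path: str = ":memory:") -> sqlite3.Connection:
    """Create the (hypothetical) slice_state table if it does not exist."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS slice_state ("
        " partition_id INTEGER PRIMARY KEY,"
        " last_key INTEGER,"
        " slice_digest TEXT,"
        " complete INTEGER NOT NULL)"
    )
    return conn


def checkpoint(conn, partition_id, last_key, slice_digest, complete):
    """Upsert one slice's progress so a re-run sees the latest state."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO slice_state VALUES (?, ?, ?, ?)",
            (partition_id, last_key, slice_digest, int(complete)),
        )


def plan_resume(conn, slices):
    """Return the (partition_id, lower, upper) ranges that still need reading.

    Completed slices are skipped; partial slices restart just past last_key.
    """
    cursor = conn.execute("SELECT partition_id, last_key, complete FROM slice_state")
    state = {pid: (last_key, complete) for pid, last_key, complete in cursor}
    todo = []
    for pid, lower, upper in slices:
        last_key, complete = state.get(pid, (None, 0))
        if complete:
            continue
        start = lower if last_key is None else last_key + 1  # assumes integer keys
        if start < upper:
            todo.append((pid, start, upper))
    return todo


conn = open_state()
checkpoint(conn, 0, 99, "aa", complete=True)    # slice 0 finished
checkpoint(conn, 1, 149, "bb", complete=False)  # slice 1 stopped at key 149
slices = [(0, 0, 100), (1, 100, 200), (2, 200, 300)]
remaining = plan_resume(conn, slices)
print(remaining)
```

Because hashlib objects cannot be serialized, a resumed partial slice produces a digest over the remaining range only; one option is to record the two ranges as separate sub-slices so the fold stays well defined.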

Metrics to expose. Emit per-slice row_count and wall-time, pool queue depth, and worker RSS. Alert on python_gc_objects_collected_total spikes and any accumulation of idle in transaction sessions in pg_stat_activity; both precede the failure modes above. Right-size per_worker_bytes from observed RSS rather than guessing — it is the input that governs the entire worker-count budget.
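One way to turn observed RSS into a per_worker_bytes value is sketched here. The helper names, the 25% headroom, and the sample figures are all assumptions for illustration; in practice the samples would come from each worker reporting its own peak.

```python
import resource
import sys


def peak_rss_bytes() -> int:
    """Peak resident set size of the current process, in bytes.

    ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return peak if sys.platform == "darwin" else peak * 1024


def suggest_per_worker_bytes(rss_samples: list[int], headroom: float = 1.25) -> int:
    """Size the budget from the largest observed worker RSS plus headroom."""
    if not rss_samples:
        raise ValueError("need at least one RSS sample")
    return int(max(rss_samples) * headroom)


MiB = 1024**2
# Illustrative samples, as if three workers had each reported peak_rss_bytes().
samples = [310 * MiB, 402 * MiB, 388 * MiB]
budget = suggest_per_worker_bytes(samples)
print(f"suggested per_worker_bytes: {budget / MiB:.1f} MiB")
```

Sizing from the maximum rather than the mean is a conservative choice: the memory ceiling has to hold for the widest slice, not the typical one.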

Related

Parallel Row Extraction Techniques: the parent reference defining slice boundaries, ordering, and the bounded-queue contract this pool plugs into.
Implementing Async Batching for High-Throughput Pipelines: the event-loop alternative when extraction is network-I/O-bound rather than CPU-bound.
Column-Level Checksum Generation: the digesting stage that consumes these per-row digests and the MD5-vs-SHA-256 trade-off in depth.
Schema Validation Pre-Checks: the gate that must pass before this pool opens a single cursor.
Cross-Platform Schema Mapping: the type and residency contracts that decide which columns each worker digests.

For process-pool and shared-memory lifecycle semantics, consult Python’s multiprocessing documentation and the concurrent.futures reference.

Frequently Asked Questions

Why derive max_workers from a memory budget instead of os.cpu_count()?

Because a worker’s cost is RAM and a connection, not just a core. Sizing to cores works until each worker’s cursor buffer and object graph multiply past available memory, at which point the kernel OOM-kills workers mid-slice and loses rows. Taking the minimum of the core count, the memory budget, and a bounded share of the engine’s max_connections keeps the pool inside every ceiling at once.
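The arithmetic behind that minimum is easy to work through with fixed inputs. The function below restates the three-ceiling rule as a pure function so it can be evaluated without probing the host; worker_ceiling and the example figures are illustrative assumptions.

```python
def worker_ceiling(
    cores: int,
    available_bytes: int,
    reserved_bytes: int,
    per_worker_bytes: int,
    max_connections: int,
    connection_share: float,
) -> int:
    """Smallest of the core, memory, and connection ceilings (never below 1)."""
    mem_ceiling = max(1, (available_bytes - reserved_bytes) // per_worker_bytes)
    conn_ceiling = max(1, int(max_connections * connection_share))
    return min(cores, mem_ceiling, conn_ceiling)


GiB = 1024**3
MiB = 1024**2

# 32 cores, 10 GiB free, 2 GiB reserved, 512 MiB per worker:
# memory allows (10 - 2) GiB / 512 MiB = 16 workers, so memory is the binding limit.
memory_bound = worker_ceiling(32, 10 * GiB, 2 * GiB, 512 * MiB, 100, 0.5)

# Same box, but the engine allows only 20 connections and we may use half of them.
connection_bound = worker_ceiling(32, 10 * GiB, 2 * GiB, 512 * MiB, 20, 0.5)

# Plenty of RAM and connections: the core count finally becomes the limit.
core_bound = worker_ceiling(32, 64 * GiB, 2 * GiB, 512 * MiB, 100, 0.5)

print(memory_bound, connection_bound, core_bound)
```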

Why keep raw rows inside the worker and only return a digest?

The default multiprocessing result path pickles everything that crosses the process boundary, and for wide or nested rows that serialization cost grows with row width and nesting depth until it dominates the run. Emitting one compact SliceResult per slice keeps row bytes local, removes pickling from the scaling curve, and has the side benefit that PII never traverses the IPC channel in plaintext.

How do I guarantee the parallel digest is deterministic?

Two rules. First, every slice carries a mandatory ORDER BY on the reconciliation key so row order within a slice is stable. Second, combine per-slice digests in sorted partition_id order, not in completion order, so which worker finishes first cannot change the result. The verification step asserts the parallel job digest equals a single-process baseline over the same key space.

What should happen when a worker hits BrokenProcessPool or OOM?

Degrade through explicit tiers rather than blind retries: halve the workers with chunksize=1 to isolate the offending slice, then decouple reads onto a thread pool while keeping digest work on the process pool, and finally fall back to single-process async batching that trades throughput for deterministic memory. Each SliceResult is checkpointed so any restart resumes from last_key instead of rescanning the table.
