How to Validate SQL vs NoSQL Data Parity

This page answers one precise question: given a relational source and a document, key-value, or wide-column target that are supposed to hold the same logical data, how do you prove parity when the bytes, the types, the ordering, and the consistency model all differ? It is the concrete diagnostic runbook that sits under the SQL to NoSQL sync validation reference, and it assumes you already hold an equivalence contract from that stage — a key mapping, a field-participation list, and a null policy — plus the physical type translations from the cross-platform schema mapping reference. What follows turns that contract into a partition-aware comparison worker you can run against production, written for data engineers, migration specialists, Python pipeline builders, and platform operations teams.

Parity validation earns its own runbook because the two naive checks both lie. A row count that agrees tells you the cardinality matches and nothing about the payloads; a byte-for-byte comparison across heterogeneous stores fails on every record because a NUMERIC(38,9) serializes differently from a BSON Decimal128 even when the values are identical. The only defensible verdict comes from canonicalizing both sides into an engine-agnostic byte image, hashing it, and comparing digests partition by partition so a divergence points at a bounded set of keys instead of “somewhere in ten billion rows.”
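To make the representation gap concrete, here is a minimal standard-library sketch. The sample values are illustrative assumptions, not data from any real system; the point is only that equal values can carry unequal text forms until both sides are normalized.

```python
from datetime import datetime, timedelta, timezone
from decimal import Decimal

# Same price at two scales: the values are equal, the text forms are not.
sql_price = Decimal("20.00")
doc_price = Decimal("20.0000")
assert sql_price == doc_price
assert str(sql_price) != str(doc_price)      # a naive byte compare fails here

# Quantizing both sides to one agreed scale yields one canonical string.
scale = Decimal("0.0001")
assert str(sql_price.quantize(scale)) == str(doc_price.quantize(scale))

# Same instant at two UTC offsets: isoformat() differs until both are moved to UTC.
utc_ts = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
local_ts = utc_ts.astimezone(timezone(timedelta(hours=2)))
assert utc_ts == local_ts
assert utc_ts.isoformat() != local_ts.isoformat()
assert local_ts.astimezone(timezone.utc).isoformat() == utc_ts.isoformat()
```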

Problem Framing

You are migrating 10 billion rows from a PostgreSQL commerce schema into DynamoDB, dual-writing during a live cutover, and you need to detect silent truncation before you promote the target as primary. A DECIMAL(18,4) price landing in a double, a microsecond TIMESTAMPTZ rounded to whole seconds, a target write dropped under backpressure, and a SQL NULL that became an absent NoSQL key are all invisible to row counts and all fatal to a financial ledger. You cannot materialize both datasets — neither fits in memory and a full scan on each side saturates network I/O and trips read-throttling on the managed cluster.

The rule that makes this tractable: compare digests, not payloads, and align partitions by a shared sort key so a chunk hash localizes every divergence. Hash the whole partition first; only when a partition digest disagrees do you drill into row-level digests to isolate the exact mismatched keys. That two-tier structure turns a 10-billion-row comparison into a cheap forest of partition hashes plus a handful of targeted deep dives.

Implementation

The worker paginates both engines by a shared sort key, canonicalizes every record into identical bytes, folds row digests into one partition digest, and compares. Row-level detail is computed only for partitions that disagree. The canonicalization and digest contract is deliberately the same one used by column-level checksum generation, so a digest computed during extraction matches one recomputed here.

Step 1 — Canonicalize each record into an engine-agnostic byte image

Equivalence is not identity: two records may differ in physical representation yet be logically identical after coercion. Quantize decimals to the source scale through Python’s decimal module, force timestamps to timezone-aware UTC ISO-8601 via datetime, collapse SQL NULL and an absent NoSQL key to one sentinel, strip engine-only metadata, and sort keys before serializing.

python

import hashlib
import json
import logging
from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal, ROUND_HALF_EVEN, InvalidOperation
from typing import Any, Dict, Iterable, Mapping

logging.basicConfig(
    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s", level=logging.INFO
)
logger = logging.getLogger("recon.parity")

_MONEY = Decimal("0.0001")                      # DECIMAL(18,4) scale — must round-trip
_NULL = "__NULL__"                       # one sentinel: SQL NULL == absent NoSQL key
_EXCLUDE = frozenset({"_id", "row_version", "_etag", "created_at", "updated_by"})


def _coerce(value: Any) -> Any:
    """Deterministic, precision-preserving coercion — the root of every digest."""
    if value is None:
        return _NULL
    if isinstance(value, Decimal):
        try:
            return str(value.quantize(_MONEY, rounding=ROUND_HALF_EVEN))
        except InvalidOperation:
            logger.error("decimal quantize failed for %r", value)
            raise
    if isinstance(value, datetime):
        if value.tzinfo is None:
            raise ValueError(f"naive datetime {value!r}: timezone lost before hashing")
        return value.astimezone(timezone.utc).isoformat()
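    if isinstance(value, bytes):
        # Hex is lossless; decoding with errors="replace" would map distinct
        # invalid byte sequences to the same text and hide a real difference.
        return value.hex()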
    return value


def canonical_bytes(record: Mapping[str, Any]) -> bytes:
    """Stable byte image of one record, identical across engines for equal data."""
    cleaned = {k: _coerce(v) for k, v in record.items() if k not in _EXCLUDE}
    return json.dumps(
        cleaned, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def row_digest(record: Mapping[str, Any]) -> str:
    return hashlib.sha256(canonical_bytes(record)).hexdigest()
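
One caveat on the NULL policy is worth a sketch. As written, canonical_bytes only iterates the keys a record actually carries, so a SQL NULL serializes as the sentinel while an absent NoSQL key serializes as nothing at all. A minimal way to close that gap, assuming the contract's field-participation list is available as a tuple, is to project each record onto that list before hashing. The names project_fields, PARTICIPATING_FIELDS, and demo_canonical are hypothetical; demo_canonical is a simplified stand-in for canonical_bytes so the block runs on its own.

```python
import json
from typing import Any, Dict, Iterable, Mapping

NULL_SENTINEL = "__NULL__"                      # same value as _NULL in Step 1
PARTICIPATING_FIELDS = ("order_id", "coupon")   # hypothetical participation list


def project_fields(record: Mapping[str, Any], fields: Iterable[str]) -> Dict[str, Any]:
    """Restrict a record to the contract fields, filling absent keys with None."""
    return {name: record.get(name) for name in fields}


def demo_canonical(record: Mapping[str, Any]) -> bytes:
    """Simplified stand-in for canonical_bytes: sentinel for None, sorted keys."""
    cleaned = {k: NULL_SENTINEL if v is None else v for k, v in record.items()}
    return json.dumps(cleaned, sort_keys=True, separators=(",", ":")).encode("utf-8")


sql_row = {"order_id": 7, "coupon": None}       # relational NULL
document = {"order_id": 7}                      # key absent in the target

# Without projection the two byte images differ.
assert demo_canonical(sql_row) != demo_canonical(document)

# After projection both sides carry the sentinel and reconcile.
assert demo_canonical(project_fields(sql_row, PARTICIPATING_FIELDS)) == demo_canonical(
    project_fields(document, PARTICIPATING_FIELDS)
)
```

Projecting onto an explicit list also drops any field outside the contract, which is a design choice to confirm against the equivalence contract rather than a given.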

Step 2 — Fold a partition into one order-independent digest

A partition digest must not depend on read order, because the two engines paginate differently. XOR-folding the per-row digests makes the partition hash commutative, and a per-key index of row digests kept alongside the fold keeps mismatches locatable.

python

@dataclass(frozen=True)
class PartitionResult:
    partition_id: str
    row_count: int
    digest: str                         # order-independent fold of row digests
    keyed: Dict[str, str]               # reconciliation_key -> row_digest


def fold_partition(
    partition_id: str, rows: Iterable[Mapping[str, Any]], key_field: str
) -> PartitionResult:
    """Reduce a partition to a commutative digest plus a per-key digest index."""
    acc = 0
    keyed: Dict[str, str] = {}
    count = 0
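    for row in rows:
        rk = row.get(key_field)
        if rk is None:
            raise ValueError(f"row in {partition_id} missing reconciliation key {key_field}")
        if str(rk) in keyed:
            # A repeated key would overwrite the index entry and blur the drill-down.
            raise ValueError(f"duplicate reconciliation key {rk!r} in {partition_id}")
        d = row_digest(row)
        keyed[str(rk)] = d
        acc ^= int(d, 16)                # XOR fold: order-independent; keyed index locates rows
        count += 1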
    return PartitionResult(partition_id, count, f"{acc:064x}", keyed)
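
The two properties of the XOR fold that matter here can be checked in isolation. The sketch below is a stripped-down illustration, not the worker itself: digest and xor_fold are hypothetical helpers, and the byte payloads are made-up canonical images.

```python
import hashlib
from typing import Iterable


def digest(payload: bytes) -> int:
    """SHA-256 of one canonical byte image, as an integer for XOR folding."""
    return int(hashlib.sha256(payload).hexdigest(), 16)


def xor_fold(payloads: Iterable[bytes]) -> int:
    """Fold row digests with XOR; the result does not depend on iteration order."""
    acc = 0
    for payload in payloads:
        acc ^= digest(payload)
    return acc


rows = [b'{"order_id":1}', b'{"order_id":2}', b'{"order_id":3}']

# Property 1: read order does not change the partition digest.
assert xor_fold(rows) == xor_fold(reversed(rows))

# Property 2: a pair of identical rows cancels out of the fold entirely,
# so the digest alone cannot see an even number of duplicated rows.
assert xor_fold(rows + [rows[0], rows[0]]) == xor_fold(rows)
```

The second property is the reason a partition verdict should never rest on the folded digest alone; the row count has to agree as well.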

Step 3 — Compare partitions and localize divergence

Compare the cheap partition digests first; only descend into the keyed index when they disagree, and emit a structured discrepancy manifest naming the exact keys and how they differ. The manifest is the artifact — this worker writes to neither engine, exactly as the security boundaries for reconciliation contract requires.

python

@dataclass(frozen=True)
class Discrepancy:
    partition_id: str
    missing_in_target: list       # keys present in SQL, absent in NoSQL
    missing_in_source: list       # keys present in NoSQL, absent in SQL
    value_mismatch: list          # keys present both sides, digests differ


def compare_partition(src: PartitionResult, tgt: PartitionResult) -> Discrepancy | None:
    """Return None when digest and row count both agree; else a localized manifest."""
    if src.digest == tgt.digest and src.row_count == tgt.row_count:
        return None                     # fast path: whole partition verified by one compare
    s, t = set(src.keyed), set(tgt.keyed)
    mismatch = [k for k in (s & t) if src.keyed[k] != tgt.keyed[k]]
    disc = Discrepancy(
        partition_id=src.partition_id,
        missing_in_target=sorted(s - t),
        missing_in_source=sorted(t - s),
        value_mismatch=sorted(mismatch),
    )
    logger.warning(
        "parity break in %s: %d missing-in-target, %d missing-in-source, %d value-mismatch",
        disc.partition_id, len(disc.missing_in_target),
        len(disc.missing_in_source), len(disc.value_mismatch),
    )
    return disc
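
For the manifest to be reproducible, it helps to serialize it together with the hash algorithm that produced the verdict. The following is a minimal sketch of one way to do that, assuming a JSON-lines manifest; ManifestEntry is a hypothetical stand-in that mirrors the Discrepancy fields, and the hash_algorithm key is an assumed name, not a fixed schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass(frozen=True)
class ManifestEntry:
    """Hypothetical stand-in mirroring the Discrepancy fields."""
    partition_id: str
    missing_in_target: List[str] = field(default_factory=list)
    missing_in_source: List[str] = field(default_factory=list)
    value_mismatch: List[str] = field(default_factory=list)


def manifest_line(entry: ManifestEntry, algorithm: str = "sha256") -> str:
    """Serialize one discrepancy as a single JSON line, recording the algorithm."""
    payload = asdict(entry)
    payload["hash_algorithm"] = algorithm    # assumed key name
    return json.dumps(payload, sort_keys=True)


entry = ManifestEntry("P1", missing_in_target=["17"], value_mismatch=["2"])
line = manifest_line(entry)
```

Sorting keys keeps the serialized line stable across runs, which makes manifests easy to diff between two passes over the same partition.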

Key Implementation Notes

Digest choice is a compliance decision, not a speed decision. SHA-256 is the default because MD5 collisions are exploitable and disallowed under most audit regimes; when throughput dominates and the threat model is accidental corruption rather than an adversary, BLAKE3 is defensible. Record the algorithm in the manifest so a reviewer can reproduce any verdict. The full trade-off lives in column-level checksum generation.
The XOR fold is what makes partitions comparable at all. Because the two engines return rows in different physical orders, an order-sensitive hash would report every partition as divergent. XOR is commutative, so a partition verifies with a single compare, and the per-key digest index kept alongside the fold lets a failing partition be drilled to exact keys. XOR also cancels any pair of identical row digests, which is why compare_partition checks row_count together with the digest.
One sentinel collapses the NULL-vs-absent trap. A missing NoSQL field and a relational NULL are semantically different but must reconcile as equal unless your contract says otherwise; _coerce maps both to _NULL so a benign representation gap does not manufacture a mismatch. Whether they should collapse is decided once in data equivalence modeling, not per worker.
Temporal and decimal precision must fail loudly or not at all. _coerce rejects naive datetimes outright and quantizes decimals before any driver truncates them, because a dropped timezone or a widened DECIMAL is the classic “migration looked fine, reconciliation failed” defect, and a reportable integrity fault when the column is monetary.
Exact equality is rarely the right verdict for live streams. Under dual-write, a partition can differ purely because the target has not yet caught up. Do not treat every mismatch as corruption; bound benign drift with a tolerance profile from threshold tuning for tolerance before you raise an alert.
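
The tolerance idea in the last note can be sketched as a small gate. This is an illustration under stated assumptions: ToleranceProfile, mismatch_rate, and within_tolerance are hypothetical names, and the real profile comes from the threshold tuning reference rather than from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToleranceProfile:
    """Hypothetical profile: the largest mismatch fraction treated as benign drift."""
    max_mismatch_rate: float     # a fraction of rows, so 0.00001 means 0.001%


def mismatch_rate(mismatched_rows: int, rows_compared: int) -> float:
    """Fraction of compared rows whose digests disagree."""
    if rows_compared <= 0:
        raise ValueError("rows_compared must be positive")
    return mismatched_rows / rows_compared


def within_tolerance(mismatched_rows: int, rows_compared: int,
                     profile: ToleranceProfile) -> bool:
    """True when the observed rate does not exceed the profile's bound."""
    return mismatch_rate(mismatched_rows, rows_compared) <= profile.max_mismatch_rate


profile = ToleranceProfile(max_mismatch_rate=0.00001)    # 0.001%
benign = within_tolerance(5, 1_000_000, profile)         # 0.0005% of rows
alerting = within_tolerance(50, 1_000_000, profile)      # 0.005% of rows
```

A rate gate like this only separates drift from breakage in aggregate; any key that keeps mismatching across windows still deserves a look regardless of the rate.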

CDC window alignment for live cutovers

During a live cutover the target trails the source, so a raw snapshot compare will always disagree. Align Change Data Capture offsets by logical timestamp or transaction sequence ID and compare within a tumbling window (five minutes is a common starting point) so out-of-order delivery does not read as divergence. Consult the source engine’s replication-slot lag before trusting any verdict; a partition that “fails” while the consumer is 20 minutes behind is lag, not corruption. When extraction outruns the comparison workers, the backpressure from async batching for large datasets keeps in-flight rows bounded.
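
A tumbling window and a lag check can be expressed in a few lines. The sketch below assumes each change event carries a timezone-aware logical timestamp and that replication lag is available as a duration; window_start and is_settled are hypothetical helper names, and the five-minute width is the starting point mentioned above.

```python
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=5)
_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_start(ts: datetime, width: timedelta = WINDOW) -> datetime:
    """Floor an aware timestamp to the start of its tumbling window."""
    if ts.tzinfo is None:
        raise ValueError("naive timestamp: cannot align windows across engines")
    buckets = (ts - _EPOCH) // width         # whole windows elapsed since the epoch
    return _EPOCH + buckets * width


def is_settled(start: datetime, now: datetime, lag: timedelta,
               width: timedelta = WINDOW) -> bool:
    """A window is safe to compare once the consumer has read past its end."""
    return start + width <= now - lag


event_ts = datetime(2024, 1, 1, 12, 7, 30, tzinfo=timezone.utc)
start = window_start(event_ts)
now = datetime(2024, 1, 1, 12, 30, tzinfo=timezone.utc)
```

Windows that are not yet settled are deferred rather than failed, which keeps catch-up traffic out of the mismatch rate.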

Fallback chains when parity breaks

A verdict is only useful if a failing partition triggers containment before divergence propagates into cutover sign-off. The tiered halt-quarantine-resync response is owned by the fallback chain implementation reference; this worker only feeds it the per-partition verdict and the discrepancy manifest.

A structural break that spans an entire column — every row changed type at once — is not a per-partition concern and belongs to structural mismatch detection rather than being triaged key by key here.

Verification

Assert the two properties the worker exists to guarantee: logically equal partitions verify with a single compare regardless of row order or numeric representation, and an injected divergence is localized to the exact key rather than smeared across the partition.

python

key = "order_id"
src_rows = [
    {"order_id": 1, "total": Decimal("20.00"), "_id": "a", "created_at": "x"},
    {"order_id": 2, "total": Decimal("5.5000"), "_id": "b", "created_at": "y"},
]
# same data, reversed order, different decimal scale, different engine metadata
tgt_rows = [
    {"order_id": 2, "total": Decimal("5.50"), "_etag": "22"},
    {"order_id": 1, "total": Decimal("20.0000"), "_etag": "11"},
]

src = fold_partition("P1", src_rows, key)
tgt = fold_partition("P1", tgt_rows, key)
assert compare_partition(src, tgt) is None          # order + scale + metadata reconcile

tgt_bad = fold_partition("P1", [dict(tgt_rows[0], total=Decimal("9.99")), tgt_rows[1]], key)
disc = compare_partition(src, tgt_bad)
assert disc is not None and disc.value_mismatch == ["2"]   # divergence pinned to one key
logger.info("parity harness passed")

Run it as a pre-cutover gate. With the assertions wrapped in a test function so pytest collects them, python -m pytest test_parity.py -q must pass before dual-write promotion, and the live pass should hold a mismatch rate below your agreed tolerance (0.001% sustained for 72 hours is a common promotion bar) before the NoSQL target is declared primary.

Operational Considerations

At scale the worker is CPU-bound on canonicalization and hashing and I/O-bound on paginated reads, so tune both together. Stream rows with a bounded fetch size (fetchsize=5000 on the SQL cursor, a matching Limit on the NoSQL scan) and process each partition through a generator so memory stays O(rows-per-partition), never O(dataset) — full materialization is the OOM trap that ends most naive comparison jobs. Push down soft-delete and archive predicates (WHERE is_active = true and the target equivalent) before hashing so excluded records never enter a digest. Size partitions so a single mismatched partition re-fetch is cheap; oversized partitions make the drill-down as expensive as the scan you were trying to avoid.
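
The bounded-fetch pattern can be sketched against any DB-API cursor. The example below uses the standard-library sqlite3 module purely as a runnable stand-in for the SQL source; stream_rows is a hypothetical helper, the orders table is made up, and whether the driver truly streams from the server depends on the engine and cursor type, which this sketch does not cover.

```python
import sqlite3
from typing import Any, Dict, Iterator


def stream_rows(cursor: sqlite3.Cursor, batch_size: int = 5000) -> Iterator[Dict[str, Any]]:
    """Yield rows as dicts, pulling at most batch_size rows from the cursor at a time."""
    columns = [col[0] for col in cursor.description]
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            return
        for row in batch:
            yield dict(zip(columns, row))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, is_active INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(i, i % 2) for i in range(1, 11)])

# The soft-delete predicate is pushed into the query, so inactive rows never reach a digest.
cur = conn.execute(
    "SELECT order_id FROM orders WHERE is_active = 1 ORDER BY order_id"
)
active_ids = [row["order_id"] for row in stream_rows(cur, batch_size=3)]
conn.close()
```

Because stream_rows is a generator, it can be handed straight to a partition fold without building an intermediate list.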

Expose partitions_verified_total, partition_mismatch_rate, rows_hashed_total, cdc_lag_seconds, and compare_latency_ms so platform ops can alert on P95/P99 drift and CDC lag rather than absolute counts. Write every verdict, the hash algorithm, and the manifest to append-only, signed audit storage, and keep validation traffic on private endpoints inside the data-residency boundary so comparison payloads never cross a region unauthorized. Before the first pass runs, confirm the contracts are in place with the schema validation pre-checks stage, since an unmapped column produces a partition-wide false mismatch that looks exactly like corruption.

Frequently Asked Questions

Why not just compare row counts or a single table checksum?

A row count proves cardinality and nothing about payloads — every price could be truncated and the count still agrees. A single table-wide checksum tells you the tables differ but not where, which is useless at ten billion rows. Partition-level digests give you the best of both: a whole partition is verified by one cheap compare, and only a failing partition is drilled to the exact mismatched keys, so a verdict is both fast and actionable.

How do I compare partitions when the two engines return rows in different orders?

Make the partition digest order-independent. Compute a per-row SHA-256 over the canonicalized bytes, then fold the row digests together with XOR, which is commutative — so the same set of rows produces the same partition digest regardless of read order. Keep each row digest indexed by its reconciliation key as well, so when a partition digest disagrees you can set-difference the key indexes and name the exact rows that diverged.

My live cutover reports mismatches every run — is the data corrupt?

Usually not; it is replication lag. Under dual-write the NoSQL target trails the SQL source, so a raw snapshot compare will always disagree on the freshest rows. Align CDC offsets by logical timestamp or transaction sequence ID, compare within a tumbling window, and check replication-slot lag before trusting a verdict. Only treat a mismatch as corruption once it survives the window and exceeds your tolerance threshold — otherwise you are alerting on catch-up, not drift.

Related

SQL to NoSQL sync validation — the parent reference that defines the equivalence contract, verdict, and discrepancy manifest this runbook produces.
Mapping relational schemas to document stores — how the target documents are shaped so these digests can compare at all.
Data equivalence modeling — decides whether two structurally distinct records are the same logical entity across engines.
Threshold tuning for tolerance — the epsilon that separates benign drift from a real parity break.
Fallback chain implementation — the halt, quarantine, and resync response triggered when a partition fails.
