DeepDiff or jsondiff for cross-engine JSON reconciliation?

DeepDiff for reporting, jsondiff only when you need an RFC 6902 patch you can apply. DeepDiff's ignore_order and significant_digits flags express the two most common forms of benign variance directly and return classified, path-addressable discrepancies. jsondiff emits patch operations meant to transform one document into another, which is the wrong shape for a reconciliation report.

Why do numeric-type changes show up as phantom discrepancies?

Because JSON writes 100.0 while Parquet stores Decimal("100.00"), and an unconfigured differ treats them as distinct values. Route every scalar through a canonical formatter that collapses int, float, and Decimal representations of the same number, and enable ignore_numeric_type_changes so type widening never registers as divergence.
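A minimal sketch of why the collapse matters, using only the standard library. The helper name `canonical_number` is illustrative and not part of any library; it only mirrors the idea of routing every scalar through one formatter.

```python
from decimal import Decimal


def canonical_number(value):
    """Collapse int, float, and Decimal spellings of one number (illustrative helper)."""
    if isinstance(value, bool):  # bool is a subclass of int; leave it untouched
        return value
    if isinstance(value, (int, float, Decimal)):
        # str() yields the shortest repr of a float, so 0.1 becomes Decimal("0.1")
        # instead of the long binary expansion that Decimal(0.1) would produce.
        return Decimal(str(value)).normalize()
    return value


json_side = 0.1                  # parsed from JSON as a float
parquet_side = Decimal("0.10")   # read from a decimal column

raw_equal = json_side == parquet_side
canonical_equal = canonical_number(json_side) == canonical_number(parquet_side)
print(raw_equal, canonical_equal)
```

The raw comparison fails because the float nearest to 0.1 is not exactly one tenth, while both canonical forms reduce to the same Decimal.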

How do I diff a JSON document too large to fit in memory?

Stream it. Use ijson to parse both files incrementally into flat path-to-value maps, then compare the maps. The parse itself is bounded by a single leaf plus the nesting depth; the flat maps still hold every leaf, but without the container overhead of the nested tree. Trigger this path when a payload exceeds a configured byte threshold or when the in-memory differ raises MemoryError.
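When both sides emit leaves in the same sorted path order, the comparison can run in lockstep and never build a map at all. That ordering is an assumption here: it holds only if canonicalization sorts keys. The sketch below takes plain iterables of (path, value) pairs so it stays self-contained; in practice they would come from a streaming parser. `lockstep_diff` is a hypothetical name.

```python
from typing import Any, Iterable, Iterator

_END = object()  # sentinel marking an exhausted stream


def lockstep_diff(
    left: Iterable[tuple[str, Any]], right: Iterable[tuple[str, Any]]
) -> Iterator[str]:
    """Yield differing paths while holding one pair from each side at a time.

    Precondition (assumed): both streams are sorted by path.
    """
    li, ri = iter(left), iter(right)
    l, r = next(li, _END), next(ri, _END)
    while l is not _END or r is not _END:
        if r is _END or (l is not _END and l[0] < r[0]):
            yield l[0]                      # path exists only on the left
            l = next(li, _END)
        elif l is _END or r[0] < l[0]:
            yield r[0]                      # path exists only on the right
            r = next(ri, _END)
        else:
            if l[1] != r[1]:
                yield l[0]                  # same path, different value
            l, r = next(li, _END), next(ri, _END)


expected = [("$.amount", 100), ("$.id", 7), ("$.status", "paid")]
actual = [("$.amount", 100), ("$.id", 8), ("$.note", "x"), ("$.status", "paid")]
changed = list(lockstep_diff(expected, actual))
print(changed)
```

This is an ordinary sorted merge, so memory stays constant regardless of document size. If the sort precondition cannot be guaranteed, the map-based comparison is the safer choice.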

Should array order count as a difference?

It depends on the field, so decide per path. Fields like tags or roles are set-like and must be compared order-insensitively; an ordered event log must not. Drive the choice from an explicit order-insensitive path set that lives with the schema mapping instead of applying one global ignore_order flag to every array.
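One way to express a per-field order policy with the standard library alone is to compare set-like arrays as multisets and everything else as sequences. The policy set and the function name below are hypothetical placeholders for whatever lives beside the schema mapping.

```python
import json
from collections import Counter

ORDER_INSENSITIVE = frozenset({"tags", "roles"})  # hypothetical policy set


def _element_key(item) -> str:
    # Serialize each element so unhashable items (dicts, lists) can be counted.
    return json.dumps(item, sort_keys=True)


def arrays_equal(field: str, left: list, right: list) -> bool:
    """Compare as a multiset for set-like fields, as a sequence otherwise."""
    if field in ORDER_INSENSITIVE:
        return Counter(map(_element_key, left)) == Counter(map(_element_key, right))
    return left == right


print(arrays_equal("tags", ["a", "b"], ["b", "a"]))
print(arrays_equal("events", ["created", "paid"], ["paid", "created"]))
```

A multiset comparison is used instead of a plain set so that a duplicated or dropped element still counts as a difference.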


Comparing JSON Structures with Python Diff Libraries

This page answers one narrow, high-stakes question: when the fast digest comparison in the JSON and Parquet diffing algorithms stage flags a chunk as divergent, which Python diff library do you reach for to explain that divergence, and how do you operate it so it neither drowns operators in phantom discrepancies nor exhausts a worker’s heap? It assumes you have already canonicalized both sides into a shared intermediate representation — the deterministic normalization defined by the parent diffing stage — and that a hash-based equality filter has narrowed the payload to the small fraction of rows that actually disagree. The library you pick here is the semantic slow path: it runs rarely, but when it runs it must produce a deterministic, path-addressable explanation that a data engineer, migration specialist, or platform operator can act on.

Problem framing: a nightly reconciliation that started lying

Concretely: you run a nightly job that reconciles a 40-million-row orders export. The source system emits newline-delimited JSON; a downstream lakehouse writes the same logical rows as columnar Parquet. Row counts match, the checksum fast path clears 99.7% of chunks, but every night a few hundred rows fall through to the semantic differ — and the report is useless. Half the “differences” are "amount": 100.0 in JSON versus Decimal("100.00") in Parquet. Another slice is tags: ["a", "b"] versus tags: ["b", "a"] on a field whose order carries no meaning. Occasionally a single 900 MB nested-JSON document lands in one chunk and the differ OOM-kills the worker mid-run, leaving the partition in an ambiguous state. The task is to replace a naive DeepDiff(a, b) call with a differ that is order-aware where it should be, numerically tolerant where the schema permits, memory-bounded by construction, and deterministic enough to feed a regulated audit trail.

Implementation: a library-agnostic semantic differ with a fallback chain

The implementation below wraps three concrete strategies behind one interface. DeepDiff handles the common case with configurable order- and type-tolerance; a streaming path built on ijson handles oversized documents without materializing them; and a custom depth-first comparator with an epsilon numeric rule handles precision drift that off-the-shelf flags cannot express. Selection is driven by payload characteristics and by the same per-column epsilons managed under threshold tuning for tolerance.

python

from __future__ import annotations

import logging
import sys
from dataclasses import dataclass, field
from decimal import Decimal
from typing import Any, Iterator

import ijson  # streaming JSON parser
from deepdiff import DeepDiff

logger = logging.getLogger("json_semantic_diff")

# Fields whose array order is not semantically meaningful.
ORDER_INSENSITIVE_PATHS: frozenset[str] = frozenset({"tags", "roles", "categories"})


@dataclass(frozen=True)
class DiffConfig:
    """Per-run comparison policy, sourced from the tolerance profile."""
    significant_digits: int = 9          # numeric equality precision
    epsilon: Decimal = Decimal("1e-9")   # relative slack for the custom path
    max_diff_depth: int = 32             # cap recursion on pathological nesting
    stream_threshold_bytes: int = 150_000_000
    ignore_order: bool = True


@dataclass
class DiffOutcome:
    """Deterministic, path-addressable result the audit trail consumes."""
    strategy: str
    changed_paths: list[str] = field(default_factory=list)
    is_equal: bool = True
    error: str | None = None


def canonical(value: Any) -> Any:
    """Collapse equivalent scalars to one representation before comparison.

    Left uncollapsed, 1, 1.0 and Decimal("1.00") read as three distinct
    values, which is the single largest source of phantom discrepancies.
    """
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float, Decimal)):
        return Decimal(str(value)).normalize()
    return value


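def stream_json_paths(filepath: str) -> Iterator[tuple[str, Any]]:
    """Yield (path, scalar) tuples without materializing the tree.

    The incremental parse holds only the current event, so this generator is
    bounded by a single leaf plus the nesting depth. Array elements get explicit
    indices because ijson's own prefix labels every element "item", which makes
    sibling elements collide once the pairs are keyed by path.
    """
    # "integer" and "double" are listed defensively: depending on the backend,
    # numbers may arrive under those event names instead of "number".
    scalar_events = {"string", "number", "integer", "double", "boolean", "null"}
    segments: list[Any] = []  # per open container: current key (str) or index (int)

    def bump() -> None:
        # A value just ended; advance the index if its parent is an array.
        if segments and isinstance(segments[-1], int):
            segments[-1] += 1

    with open(filepath, "rb") as handle:
        for _prefix, event, value in ijson.parse(handle):
            if event == "start_map":
                segments.append("")
            elif event == "start_array":
                segments.append(0)
            elif event == "map_key":
                segments[-1] = value
            elif event in ("end_map", "end_array"):
                segments.pop()
                bump()
            elif event in scalar_events:
                path = "$" + "".join(
                    f"[{s}]" if isinstance(s, int) else f".{s}" for s in segments
                )
                yield path, canonical(value)
                bump()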


def run_deepdiff(expected: Any, actual: Any, cfg: DiffConfig) -> DiffOutcome:
    """Primary path: exhaustive tree comparison with tolerance flags."""
    diff = DeepDiff(
        expected,
        actual,
        ignore_order=cfg.ignore_order,
        significant_digits=cfg.significant_digits,
        ignore_numeric_type_changes=True,
        max_diffs=10_000,
        verbose_level=2,
    )
    changed = sorted(str(p) for group in diff.values() for p in group)
    return DiffOutcome("deepdiff", changed, is_equal=not diff)


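def run_streaming_diff(expected_path: str, actual_path: str) -> DiffOutcome:
    """Secondary path: compare path→value maps built by streaming both files.

    The parse is incremental, but both flat maps are held in memory, so the
    footprint grows with the number of leaves, not with nested-tree overhead.
    """
    missing = object()  # distinguishes an absent path from an explicit null
    left = dict(stream_json_paths(expected_path))
    right = dict(stream_json_paths(actual_path))
    changed = sorted(
        p for p in left.keys() | right.keys()
        if left.get(p, missing) != right.get(p, missing)
    )
    return DiffOutcome("streaming", changed, is_equal=not changed)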


def run_epsilon_dfs(expected: Any, actual: Any, cfg: DiffConfig) -> DiffOutcome:
    """Tertiary path: custom DFS with relative-epsilon numeric equality."""
    changed: list[str] = []
    seen: set[int] = set()

    def within_tolerance(a: Decimal, b: Decimal) -> bool:
        scale = max(abs(a), abs(b), Decimal(1))
        return abs(a - b) <= cfg.epsilon * scale

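    missing = object()  # absent key, distinct from an explicit null

    def walk(a: Any, b: Any, path: str, depth: int) -> None:
        if depth > cfg.max_diff_depth:
            # An unverified subtree must not read as equal in an audit trail.
            changed.append(f"{path}[depth-limit]")
            return
        both_dicts = isinstance(a, dict) and isinstance(b, dict)
        both_lists = isinstance(a, list) and isinstance(b, list)
        if both_dicts or both_lists:
            # Track only containers on the current branch. Ids of interned
            # scalars (small ints, short strings) repeat across paths and
            # would silently skip real comparisons.
            if id(a) in seen:
                return  # circular reference: `a` is its own ancestor
            seen.add(id(a))
        if both_dicts:
            for key in a.keys() | b.keys():
                walk(a.get(key, missing), b.get(key, missing),
                     f"{path}.{key}", depth + 1)
        elif both_lists:
            ordered = path.split(".")[-1] not in ORDER_INSENSITIVE_PATHS
            if ordered:
                la, lb = a, b
            else:
                # Sort by canonical string but keep the elements themselves,
                # so numeric tolerance still applies to each pair.
                la, lb = (
                    sorted(side, key=lambda v: str(canonical(v))) for side in (a, b)
                )
            for i, (x, y) in enumerate(zip(la, lb)):
                walk(x, y, f"{path}[{i}]", depth + 1)
            if len(la) != len(lb):
                changed.append(f"{path}[len]")
        else:
            ca, cb = canonical(a), canonical(b)
            both_numeric = isinstance(ca, Decimal) and isinstance(cb, Decimal)
            if both_numeric and not within_tolerance(ca, cb):
                changed.append(path)
            elif not both_numeric and ca != cb:
                changed.append(path)
        if both_dicts or both_lists:
            seen.discard(id(a))  # leaving this branch; shared references stay comparable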

    walk(expected, actual, "$", 0)
    return DiffOutcome("epsilon_dfs", sorted(changed), is_equal=not changed)


def semantic_diff(
    expected: Any, actual: Any, cfg: DiffConfig, *, paths: tuple[str, str] | None = None
) -> DiffOutcome:
    """Route to the cheapest strategy that can handle the payload."""
    from os.path import getsize

    # Only ever raise the limit; lowering it could break the caller.
    sys.setrecursionlimit(max(sys.getrecursionlimit(), cfg.max_diff_depth * 40))
    # sys.getsizeof is shallow (it ignores nested children), so it cannot drive
    # the threshold. Use the on-disk size of the source files when available.
    payload_bytes = sum(getsize(p) for p in paths) if paths else 0
    try:
        if paths and payload_bytes > cfg.stream_threshold_bytes:
            logger.info("payload=%d bytes over threshold; streaming", payload_bytes)
            return run_streaming_diff(*paths)
        return run_deepdiff(expected, actual, cfg)
    except MemoryError:
        logger.warning("DeepDiff OOM; degrading to streaming path")
        if paths:
            return run_streaming_diff(*paths)
        return DiffOutcome("streaming", is_equal=False, error="no file paths for stream")
    except (TypeError, ValueError) as exc:
        logger.warning("DeepDiff type/precision failure (%s); epsilon DFS", exc)
        return run_epsilon_dfs(expected, actual, cfg)

When even the tertiary path cannot render a verdict — a circular reference or an unparseable payload — the run degrades to a structural hash comparison and routes the row to a manual queue rather than stalling. That final tier belongs to the shared fallback chain implementation; the diagram below shows how a single diff request flows through the tiers.

The strategy trade-offs — including the regulatory posture each tier must satisfy — are summarized below.

Tier	Strategy	Trigger condition	Latency profile
Primary	`DeepDiff`, `ignore_order=True`	Standard payloads under the size threshold, aligned schema	Fastest; path-level report at `verbose_level=2`
Secondary	`ijson` streaming path→value diff	`MemoryError` or payload over `stream_threshold_bytes`	+15–20%; bounded memory, tolerates key-order shifts
Tertiary	Custom DFS + epsilon comparator	Precision drift or type-coercion failure	Moderate; explicit relative-epsilon numeric rule
Quaternary	Structural SHA-256 hashing	Circular reference, unparseable payload, timeout	Cheap verdict, no explanation; routes to manual queue
Compliance / regulatory	Any tier, with immutable outcome logging	Rows on regulated tables (financial, PII, audit scope)	Emit a deterministic `DiffOutcome` with strategy and epsilon to a WORM audit sink; SHA-256 is a FIPS 180-4 approved hash, MD5 is not
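The quaternary tier in the table is defined elsewhere, so the following is only a sketch of what a hash-only verdict could look like. It hashes raw bytes in chunks, which assumes both sides were serialized in the same canonical form; the function names, the `structural_hash` strategy label, and the `manual_queue` route value are all hypothetical.

```python
import hashlib
import io
from typing import BinaryIO


def sha256_of(handle: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Hash a binary stream in fixed-size chunks so memory stays constant."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: handle.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def quaternary_verdict(expected: BinaryIO, actual: BinaryIO) -> dict:
    """Equal-or-not verdict with no explanation; always routed to manual review."""
    left, right = sha256_of(expected), sha256_of(actual)
    return {
        "strategy": "structural_hash",
        "is_equal": left == right,
        "expected_sha256": left,
        "actual_sha256": right,
        "route": "manual_queue",
    }


same = quaternary_verdict(io.BytesIO(b'{"id": 1}'), io.BytesIO(b'{"id": 1}'))
different = quaternary_verdict(io.BytesIO(b'{"id": 1}'), io.BytesIO(b'{"id": 2}'))
print(same["is_equal"], different["is_equal"])
```

Because the function never parses the payload, it still returns a verdict for documents the other tiers cannot load.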

Key implementation notes

Library selection is a payload decision, not a taste decision. DeepDiff wins the common case because its ignore_order and significant_digits flags express the two most frequent forms of benign variance directly. jsondiff earns its place only when you need an RFC 6902 JSON-Patch delta to apply rather than merely report; it is a poor primary differ for reconciliation because it emits patch operations, not classified discrepancies. A hand-rolled DFS is justified solely when your equality predicate cannot be expressed as a flag — a per-column relative epsilon, for instance.
Canonicalize before you compare, always. Routing every scalar through canonical() collapses 1, 1.0, and Decimal("1.00") to one value. Skipping this step is the single most common cause of phantom discrepancies, and it is exactly the artifact the columnar-versus-row-oriented split in structural mismatch detection is designed to keep out of the row differ.
Array order is a per-field property. Treating all arrays as ordered produces false positives on set-like fields; treating all as unordered hides real reordering bugs. Drive the decision from an explicit ORDER_INSENSITIVE_PATHS set that lives beside the cross-platform schema mapping, not from a global flag.
Numeric slack must be relative and schema-scoped. A FLOAT32 Parquet column against double-precision JSON legitimately disagrees beyond roughly the seventh significant digit, because single precision carries only a 24-bit significand, so the default epsilon of 1e-9 is too tight for such a column; a monetary column must not disagree at all. The epsilon is a relative fraction of magnitude, and the per-column values are governed by the same tolerance discipline as the rest of the data equivalence modeling layer.
Compliance implication: the differ is an audit witness. On regulated tables the DiffOutcome (strategy used, epsilon applied, changed paths) is evidence. Persist it to an append-only sink, and when the verdict crosses a trust boundary, make sure the quaternary digest is SHA-256, a FIPS 180-4 approved hash, rather than a fast non-cryptographic hash such as xxHash. The reasoning for that promotion is worked through in the MD5 vs SHA-256 checksum comparison, part of the broader column-level checksum generation workflow.
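To make the schema-scoped epsilon concrete, the sketch below round-trips a double through single precision and checks it against two tolerances. The column names and the `COLUMN_EPSILON` mapping are invented for illustration; the tolerance rule follows the same relative form as `within_tolerance` in the implementation above, including the floor of 1 on the scale.

```python
import struct
from decimal import Decimal

# Hypothetical per-column policy: float32 sensor data gets slack, money gets none.
COLUMN_EPSILON = {"sensor_reading": Decimal("1e-6"), "amount": Decimal("0")}


def as_float32(value: float) -> float:
    """Round-trip a double through IEEE 754 single precision."""
    return struct.unpack("f", struct.pack("f", value))[0]


def within_relative(a: float, b: float, epsilon: Decimal) -> bool:
    """Relative tolerance with the scale floored at 1."""
    da, db = Decimal(str(a)), Decimal(str(b))
    scale = max(abs(da), abs(db), Decimal(1))
    return abs(da - db) <= epsilon * scale


json_value = 0.1                    # double precision on the JSON side
parquet_value = as_float32(0.1)     # what a FLOAT32 column hands back

too_tight = within_relative(json_value, parquet_value, Decimal("1e-9"))
column_ok = within_relative(json_value, parquet_value, COLUMN_EPSILON["sensor_reading"])
money_ok = within_relative(100.0, 100.0, COLUMN_EPSILON["amount"])
print(too_tight, column_ok, money_ok)
```

The single-precision value differs from 0.1 by about 1.5e-9, so a blanket 1e-9 epsilon reports it as drift while the column-specific 1e-6 absorbs it. An epsilon of zero on the monetary column still accepts exact matches and rejects everything else.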

Verification step

Assert the differ’s behavior on the exact artifacts it exists to tame: numeric-type equivalence, order-insensitive arrays, and genuine divergence. The following pytest cases fail loudly if a future refactor reintroduces phantom discrepancies.

python

from decimal import Decimal

from json_semantic_diff import DiffConfig, semantic_diff

CFG = DiffConfig()


def test_numeric_type_equivalence_is_not_a_diff():
    left = {"amount": 100.0}
    right = {"amount": Decimal("100.00")}
    assert semantic_diff(left, right, CFG).is_equal


def test_order_insensitive_field_matches():
    left = {"tags": ["a", "b"]}
    right = {"tags": ["b", "a"]}
    assert semantic_diff(left, right, CFG).is_equal


def test_real_divergence_is_reported_with_path():
    left = {"orders": [{"total": Decimal("10.00")}]}
    right = {"orders": [{"total": Decimal("11.50")}]}
    outcome = semantic_diff(left, right, CFG)
    assert not outcome.is_equal
    assert any("total" in path for path in outcome.changed_paths)

Run it directly and gate the pipeline on it: python -m pytest test_json_semantic_diff.py -q. A green run confirms that type widening and set-like reordering are absorbed while a real value change surfaces with an addressable path — the exact contract the reconciliation report depends on.

Operational considerations

The semantic differ runs on the tail of the distribution, so tune it for the worst case, not the average. Cap each task’s payload with stream_threshold_bytes and size worker memory to roughly three times the largest single canonicalized document, since DeepDiff holds both trees plus its diff structure simultaneously. Because CPython’s GIL serializes the pure-Python DFS path, distribute divergent chunks across a ProcessPoolExecutor or Spark executors rather than threads, and keep each chunk small enough that a single OOM loses one task, not a partition.

Expose four telemetry signals: the fallback-tier counter (diff.fallback.{secondary,tertiary,quaternary}) so a rising secondary rate warns you that documents are outgrowing the threshold; diff execution latency as a histogram to catch pathological nesting; the phantom-suppression ratio (chunks that canonicalize to equal after being hash-flagged) to detect canonicalization gaps; and a per-column changed-path frequency to feed epsilon retuning. On storage, never spool full diff trees to durable storage — persist only the compact DiffOutcome and reconstruct detail on demand, which keeps the audit footprint proportional to divergence, not to volume. Cost follows the same logic: every row the fast path clears is a row the differ never loads, so the cheapest optimization is a tighter canonicalization upstream, not a faster library downstream.
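As a rough sketch of three of those four signals, the counters below are kept in process; the latency histogram is omitted because it depends on the metrics backend. The class name, the tier mapping, and the `structural_hash` strategy label are assumptions, and paths stand in for columns in the frequency counter.

```python
from collections import Counter
from dataclasses import dataclass, field

# Strategy name -> tier label. "structural_hash" is an assumed name for the
# quaternary tier; the primary strategy is not a fallback and is not counted.
TIER_BY_STRATEGY = {
    "streaming": "secondary",
    "epsilon_dfs": "tertiary",
    "structural_hash": "quaternary",
}


@dataclass
class DiffTelemetry:
    """In-process counters; a real deployment would export them to a metrics sink."""
    fallback: Counter = field(default_factory=Counter)
    changed_path_frequency: Counter = field(default_factory=Counter)
    hash_flagged: int = 0
    phantoms: int = 0

    def record(self, strategy: str, is_equal: bool, changed_paths: list) -> None:
        self.hash_flagged += 1
        tier = TIER_BY_STRATEGY.get(strategy)
        if tier:
            self.fallback[f"diff.fallback.{tier}"] += 1
        if is_equal:
            self.phantoms += 1  # hash-flagged, yet semantically equal
        self.changed_path_frequency.update(changed_paths)

    @property
    def phantom_suppression_ratio(self) -> float:
        return self.phantoms / self.hash_flagged if self.hash_flagged else 0.0


telemetry = DiffTelemetry()
telemetry.record("deepdiff", True, [])
telemetry.record("streaming", False, ["$.amount"])
telemetry.record("deepdiff", False, ["$.amount", "$.status"])
print(dict(telemetry.fallback), telemetry.phantom_suppression_ratio)
```

A rising `phantom_suppression_ratio` points at a canonicalization gap upstream, while the most frequent entries in `changed_path_frequency` are the first candidates for epsilon retuning.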

Related

JSON and Parquet Diffing Algorithms — the parent stage that hashes canonicalized chunks and invokes this semantic differ only on the ones that disagree.
Threshold tuning for tolerance — how the per-column epsilons and precision limits this differ reads are chosen and version-controlled.
Fallback chain implementation — the tiered degradation strategy the quaternary hash-and-quarantine tier plugs into.
Detecting structural mismatches in Parquet files — schema-level drift detection that runs before row diffing so the semantic pass never sees table-wide layout changes.
Generating MD5 vs SHA-256 checksums for data rows — the hash-selection reasoning behind promoting the quaternary tier for regulated audit trails.
