Structural Mismatch Detection

Structural mismatch detection is the earliest gate in the comparison stage: the workload that proves two datasets share the same shape — column set, declared types, nesting, nullability, and partition layout — before any row values are read. It runs at the schema, footer, and manifest layers, never on payloads, so it costs kilobytes of I/O to reject a migration that would otherwise burn hours of compute producing wrong answers. For data engineers, migration specialists, Python pipeline builders, and platform operations teams moving data between Spark, Trino, Snowflake, DuckDB, and native Python runtimes, this is the difference between a fast, deterministic “these two tables are not even the same shape” and a slow, silent corruption that only surfaces three joins downstream. This reference sits inside the Structural Diffing & Sync Engines stage and specializes it for the shape-comparison problem.

Shape deserves its own workload, separate from value comparison, because engines disagree on representation for reasons that carry no business meaning, and those disagreements are cheap to detect but expensive to hit. Timestamps are the standard example: Spark distinguishes TIMESTAMP_NTZ from TIMESTAMP_LTZ, Trino distinguishes TIMESTAMP from TIMESTAMP WITH TIME ZONE, and Snowflake offers TIMESTAMP_NTZ, TIMESTAMP_LTZ, and TIMESTAMP_TZ, with each engine resolving a bare TIMESTAMP by its own default. A column reordered by a SELECT * rewrite, a field promoted from int32 to int64, a nullable flag flipped, or a partition key dropped are all structural events that a byte-level or even a row-level diff will either miss entirely or drown in noise. Catching them here, from metadata alone, keeps the expensive value-comparison paths (the JSON and Parquet diffing engine and its downstream routing) from ever running against a mismatched shape.

Three strictly isolated layers turn kilobytes of metadata into one of four structural verdicts, without ever reading row data.

Architectural Boundaries

This workload begins the moment two dataset handles are available — a source and a target file path, table reference, or manifest — and it ends when it has emitted a single structural verdict per pair: exact_match, tolerable_drift, coercible_mismatch, or irreconcilable_drift. It consumes metadata only: Parquet file footers, JSON schema documents or a sampled document, and partition manifests, plus the canonical type map that a stable cross-platform schema mapping supplies and the equivalence rules fixed upstream by data equivalence modeling. It produces a structural directive that either clears a pair for value comparison or short-circuits it into a discrepancy manifest — it never reads or mutates row data.
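As a rough illustration of that contract, the four verdicts and the rule for which of them clear a pair could be modeled as a small enum. The names StructuralVerdict and clears_value_comparison are assumptions made for this sketch, not part of any library; the clearing rule follows the tolerance decision chain described later in this reference, where exact and tolerable results proceed and the other two do not.

```python
from enum import Enum


class StructuralVerdict(Enum):
    """The four outcomes this workload can emit for a dataset pair (hypothetical type)."""
    EXACT_MATCH = "exact_match"
    TOLERABLE_DRIFT = "tolerable_drift"
    COERCIBLE_MISMATCH = "coercible_mismatch"
    IRRECONCILABLE_DRIFT = "irreconcilable_drift"


# Verdicts that let the pair continue to value comparison.
_CLEARING = frozenset({StructuralVerdict.EXACT_MATCH, StructuralVerdict.TOLERABLE_DRIFT})


def clears_value_comparison(verdict: StructuralVerdict) -> bool:
    """True when the pair may proceed; otherwise it is short-circuited."""
    return verdict in _CLEARING
```

Keeping the verdict a closed enum means downstream routing can switch on it exhaustively instead of parsing free-form strings.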

Three concerns stay strictly isolated inside the boundary, and keeping them decoupled is what makes the workload both cheap and horizontally scalable. The metadata extraction layer reads schema manifests, column statistics, and file footers without materializing payloads. The canonical normalization layer transforms each engine-specific representation into one deterministic intermediate representation, stripping annotations that carry no logical meaning. The diff and decision layer hashes the normalized trees, compares digests, and applies tolerance policy. Value-level differences — a truncated string, a rounding error, an out-of-range number — are explicitly out of scope here; they belong to the value-comparison engines, and this stage exists precisely so those engines never run against a shape they cannot reconcile.

The hard rule at the extraction layer is that it must never deserialize a full file. Metadata-only readers are mandatory over full-file reads: pyarrow.parquet.read_metadata for the Parquet footer, and a JSON Schema document or a single sampled document for JSON. Loading a multi-gigabyte dataset to compare its shape violates cluster resource quotas and makes cost scale with data volume rather than with schema size. Working sets stay under 256 MB per worker by operating on schema trees, columnar statistics, and partition manifests alone; for Parquet, that means reading the footer and touching zero row groups.

Prerequisites

Confirm each item before wiring structural mismatch detection into a reconciliation run. Every one removes a source of phantom divergence, unbounded memory growth, or non-deterministic hashing.

Canonical type map resolved. The engine-type → canonical-type mapping is fixed by the cross-platform schema mapping reference, so bigint, int64, and long collapse to one canonical integer and cannot appear as false drift.
Nullability policy declared. A single rule decides whether a nullable-vs-required flag difference is benign or breaking — set it once, before hashing, not per column.
Ordering policy chosen. Decide up front whether column and nested-field order is semantically meaningful; if not, the normalizer sorts fields so a reordered SELECT * does not read as a mismatch.
Metadata access granted. Credentials allow reading Parquet footers and catalog schemas on both source and target without read access to row data — the workload needs footers, not payloads.
Tolerance profile selected. The severity thresholds that separate benign drift from breaking change come from the threshold tuning for tolerance reference, so the decision chain is data-criticality aware rather than all-or-nothing.
Dependency libraries pinned. pyarrow and the JSON schema toolchain are version-pinned so footer parsing and canonical byte output are reproducible across worker hosts; hashlib ships with the interpreter.
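The nullability and ordering policies are easier to reason about when they are a single declared object rather than scattered flags. The sketch below is one possible shape for that object; Field, NormalizationPolicy, and canonical_fields are hypothetical names introduced for illustration, and the inputs are assumed to carry canonical type tokens already.

```python
from dataclasses import dataclass
from typing import NamedTuple


class Field(NamedTuple):
    name: str
    canonical_type: str
    nullable: bool


@dataclass(frozen=True)
class NormalizationPolicy:
    """Hypothetical policy object: declared once, applied to every column."""
    order_significant: bool = False
    nullability_significant: bool = True


def canonical_fields(fields: list[Field], policy: NormalizationPolicy) -> list[tuple]:
    """Project fields into the form that will be hashed under the given policy."""
    projected = [
        (f.name, f.canonical_type, f.nullable if policy.nullability_significant else None)
        for f in fields
    ]
    # Sorting erases column order, so a reordered SELECT * compares equal.
    return projected if policy.order_significant else sorted(projected, key=lambda p: p[0])


source = [Field("id", "integer", False), Field("ts", "timestamp", True)]
target = [Field("ts", "timestamp", False), Field("id", "integer", False)]

lenient = NormalizationPolicy(order_significant=False, nullability_significant=False)
strict = NormalizationPolicy(order_significant=True, nullability_significant=True)

same_when_lenient = canonical_fields(source, lenient) == canonical_fields(target, lenient)  # True
same_when_strict = canonical_fields(source, strict) == canonical_fields(target, strict)    # False
```

Because the policy is frozen and applied before hashing, the same pair cannot hash one way on one worker and another way elsewhere.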

Structural detection is normally preceded by lightweight schema validation pre-checks at extraction time; this stage is the authoritative, hash-backed version that gates the comparison run itself.

Step-by-Step Implementation

The steps below build a footer-only structural validator: normalize, hash, compare, decide. Each ends with an assertion or observable output so it can be verified in isolation before the next layer is added.

1. Map engine types to a canonical enum

The root cause of false structural drift is engine-specific type spelling. Reduce every declared type to a small canonical enum so logically identical schemas normalize to identical tokens. This mirrors the column-level checksum generation philosophy — normalize first, hash second — applied to types instead of values.

python

import logging
from enum import Enum

logger = logging.getLogger(__name__)


class CanonicalType(Enum):
    STRING = "string"
    INTEGER = "integer"
    FLOAT = "float"
    BOOLEAN = "boolean"
    TIMESTAMP = "timestamp"
    DATE = "date"
    BINARY = "binary"
    ARRAY = "array"
    OBJECT = "object"
    NULL = "null"


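# Engine-specific type spellings collapse to one canonical member.
TYPE_COERCION_MAP = {
    "timestamp_ntz": CanonicalType.TIMESTAMP,
    "timestamp_tz": CanonicalType.TIMESTAMP,
    "timestamp_with_time_zone": CanonicalType.TIMESTAMP,
    "datetime": CanonicalType.TIMESTAMP,
    "bigint": CanonicalType.INTEGER,
    "int64": CanonicalType.INTEGER,
    "long": CanonicalType.INTEGER,
    "float64": CanonicalType.FLOAT,
    "double": CanonicalType.FLOAT,
    "bool": CanonicalType.BOOLEAN,
    "boolean": CanonicalType.BOOLEAN,
    "varchar": CanonicalType.STRING,
    "text": CanonicalType.STRING,
    "utf8": CanonicalType.STRING,
    "string": CanonicalType.STRING,
    "binary": CanonicalType.BINARY,
    "varbinary": CanonicalType.BINARY,
    # Python runtime names, as produced by type(value).__name__ on parsed JSON.
    "str": CanonicalType.STRING,
    "list": CanonicalType.ARRAY,
    "dict": CanonicalType.OBJECT,
    "nonetype": CanonicalType.NULL,
}


def normalize_type(raw_type: str) -> CanonicalType:
    """Map an engine-specific type string to a canonical member."""
    normalized = raw_type.strip().lower()
    if normalized in TYPE_COERCION_MAP:
        return TYPE_COERCION_MAP[normalized]
    # Conservative fallback so an unmapped spelling degrades predictably.
    # Containers are tested first so "list<item: int64>" is not read as an integer.
    if normalized.startswith(("list", "large_list", "array")):
        return CanonicalType.ARRAY
    if normalized.startswith(("struct", "map", "row")):
        return CanonicalType.OBJECT
    # Width is discarded here: int32 and int64 both become INTEGER. Keep the
    # raw type alongside the canonical one if width changes must be detected.
    if "int" in normalized:
        return CanonicalType.INTEGER
    if "float" in normalized or "double" in normalized:
        return CanonicalType.FLOAT
    # Check timestamp spellings before "date" so "datetime" is not read as DATE.
    if "timestamp" in normalized or "datetime" in normalized:
        return CanonicalType.TIMESTAMP
    if "date" in normalized:
        return CanonicalType.DATE
    logger.debug("Unmapped type %r fell back to STRING", raw_type)
    return CanonicalType.STRING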

Verify the map is doing its job: engine variants must converge.

python

assert normalize_type("TIMESTAMP_NTZ") is normalize_type("timestamp with time zone")
assert normalize_type("bigint") is normalize_type("int64")

2. Normalize the schema tree into a deterministic form

A schema is a nested structure. To hash it reproducibly, recursively sort keys, order array members by a stable serialization, and replace every leaf type token with its canonical value. Two schemas that differ only in key order or type spelling must serialize to identical bytes.

python

import json
from typing import Any


def normalize_schema_node(node: Any) -> Any:
    """Recursively normalize a schema node for deterministic hashing."""
    if isinstance(node, dict):
        return {k: normalize_schema_node(v) for k, v in sorted(node.items())}
    if isinstance(node, list):
        # Normalize first, then sort. Sorting the raw spellings would order
        # ["long", "bool"] and ["bigint", "boolean"] differently even though
        # both normalize to the same two canonical tokens.
        normalized_items = [normalize_schema_node(item) for item in node]
        return sorted(normalized_items, key=lambda x: json.dumps(x, sort_keys=True))
    if isinstance(node, str):
        # Emit the canonical token so the tree stays JSON-serializable.
        return normalize_type(node).value
    return node

Verify order-independence directly:

python

a = {"id": "bigint", "ts": "timestamp_ntz"}
b = {"ts": "timestamp_with_time_zone", "id": "int64"}
assert normalize_schema_node(a) == normalize_schema_node(b)

3. Compute a deterministic structural hash

Serialize the normalized tree with sorted keys and no incidental whitespace, then hash it. A cryptographic digest gives collision resistance strong enough that a hash match is a safe proxy for structural equality. The algorithm parameter accepts any name that hashlib.new supports, so SHA-256, MD5, and BLAKE2 work as written; xxHash and BLAKE3 ship as third-party packages and would need their own call path (see the trade-off table below). The Python hashlib documentation covers the available algorithms.

python

import hashlib


def compute_structural_hash(schema: dict, algorithm: str = "sha256") -> str:
    """Deterministic structural hash of a normalized schema tree."""
    try:
        normalized = normalize_schema_node(schema)
        canonical_bytes = json.dumps(
            normalized, sort_keys=True, separators=(",", ":")
        ).encode("utf-8")
        return hashlib.new(algorithm, canonical_bytes).hexdigest()
    except (TypeError, ValueError) as exc:
        logger.error("Schema normalization failed during hashing: %s", exc, exc_info=True)
        raise RuntimeError("Schema normalization failed during hashing") from exc

Verify determinism and order-independence at the hash level:

python

assert compute_structural_hash(a) == compute_structural_hash(b)
assert compute_structural_hash(a) == compute_structural_hash(a)  # repeated calls agree within one process

4. Extract shape from footers and JSON without loading payloads

Both extractors return a plain {name: type_string} map, the input the normalizer expects, and both read metadata only. The Parquet path reads the footer and scans zero row groups. The JSON path shown here infers top-level field types from a single sampled record; validating against a JSON Schema document is the alternative when one exists.

python

import pyarrow.parquet as pq


def extract_parquet_schema(file_path: str) -> dict:
    """Extract a Parquet schema from the footer; no row groups are scanned."""
    try:
        metadata = pq.read_metadata(file_path)          # footer only
        schema = metadata.schema.to_arrow_schema()
        return {field.name: str(field.type) for field in schema}
    except FileNotFoundError:
        # Re-raise instead of returning {}: two missing files would otherwise
        # both hash the empty schema and be reported as a clean match.
        logger.error("Parquet file not found: %s", file_path)
        raise
    except Exception as exc:
        logger.error("Failed to extract Parquet metadata from %s: %s", file_path, exc, exc_info=True)
        raise


def extract_json_schema(sample_path: str) -> dict:
    """Infer top-level shape from a single sampled JSON record, never the whole corpus."""
    try:
        with open(sample_path, "r", encoding="utf-8") as handle:
            record = json.load(handle)
        if not isinstance(record, dict):
            raise ValueError(f"Expected a JSON object, got {type(record).__name__}")
        return {key: type(value).__name__ for key, value in record.items()}
    except (OSError, ValueError) as exc:
        logger.error("Failed to extract JSON schema from %s: %s", sample_path, exc, exc_info=True)
        raise

5. Compare and emit a structural directive

The decision layer ties the pieces together. It hashes both sides, compares digests, and returns a typed result carrying enough context for downstream routing to act without re-reading either file.

python

from dataclasses import dataclass, field
from typing import Any


@dataclass
class StructuralDiffResult:
    source_hash: str
    target_hash: str
    is_match: bool
    delta_details: dict[str, Any] = field(default_factory=dict)
    severity: str = "INFO"  # INFO | WARNING | CRITICAL


class StructuralValidator:
    # Explicit registry so an unknown hint fails loudly instead of silently taking the JSON path.
    EXTRACTORS = {"parquet": extract_parquet_schema, "json": extract_json_schema}

    def __init__(self, max_memory_mb: int = 256, hash_algorithm: str = "sha256") -> None:
        # Recorded for the worker runtime to enforce; this class does not police memory itself.
        self.max_memory_mb = max_memory_mb
        self.hash_algorithm = hash_algorithm

    def validate_pair(self, source_path: str, target_path: str, engine_hint: str = "parquet") -> StructuralDiffResult:
        if engine_hint not in self.EXTRACTORS:
            raise ValueError(f"Unsupported engine_hint: {engine_hint!r}")
        extract = self.EXTRACTORS[engine_hint]
        try:
            src_hash = compute_structural_hash(extract(source_path), self.hash_algorithm)
            tgt_hash = compute_structural_hash(extract(target_path), self.hash_algorithm)
        except Exception as exc:
            logger.error("Structural validation failed for (%s, %s): %s", source_path, target_path, exc)
            return StructuralDiffResult("", "", False, {"error": str(exc)}, severity="CRITICAL")

        is_match = src_hash == tgt_hash
        return StructuralDiffResult(
            source_hash=src_hash,
            target_hash=tgt_hash,
            is_match=is_match,
            delta_details={"source_file": source_path, "target_file": target_path},
            # A mismatch is provisional: the tolerance chain assigns the final severity.
            severity="INFO" if is_match else "WARNING",
        )

A hash match means the two shapes are identical after canonical normalization, which is the sense of equality this gate enforces, and the pair is cleared for value comparison; a mismatch escalates into the tolerance decision chain below rather than failing the run outright.
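A digest comparison says only that two shapes differ, not where. For downstream routing to act without re-reading either file, the result would also need a column-level delta. The helper below is a minimal sketch of one; diff_fields is a hypothetical name, and it assumes both inputs are flat {name: canonical_type} maps that have already been normalized.

```python
def diff_fields(source: dict[str, str], target: dict[str, str]) -> dict[str, object]:
    """Column-level delta between two {name: canonical_type} maps."""
    shared = source.keys() & target.keys()
    return {
        "added": sorted(target.keys() - source.keys()),
        "removed": sorted(source.keys() - target.keys()),
        "retyped": {
            name: {"source": source[name], "target": target[name]}
            for name in sorted(shared)
            if source[name] != target[name]
        },
    }


source = {"id": "integer", "amount": "float", "region": "string"}
target = {"id": "integer", "amount": "string", "loaded_at": "timestamp"}
delta = diff_fields(source, target)
# delta == {
#     "added": ["loaded_at"],
#     "removed": ["region"],
#     "retyped": {"amount": {"source": "float", "target": "string"}},
# }
```

A delta like this could be merged into delta_details on a mismatch, so the tolerance chain sees which columns moved rather than two opaque digests.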

Tolerance Decision Chain

Not every structural mismatch warrants halting a pipeline. A column reordered by a query rewrite or a nullable flag relaxed on an append-only table is usually benign; a type downcast that truncates precision or a dropped partition key is not. The decision layer therefore evaluates a fallback chain of increasingly permissive comparisons, with each rung’s threshold sourced from the threshold tuning for tolerance profile. When the chain routes to remediation rather than a clean stop, it hands off to the fallback chain implementation for tiered retry and deferred reconciliation.

The fallback chain widens tolerance one rung at a time: exact match proceeds, cosmetic drift warns, coercible drift evolves, and only irreconcilable drift halts.

Exact hash match → proceed immediately; shapes are identical.
Normalized match ignoring order and nullable → log a warning and continue; the divergence is cosmetic under the declared policy.
Type-coercible mismatch → route to the schema-evolution handler or a quarantine queue; the change is safe only if the target type is a widening of the source.
Irreconcilable drift → halt the pipeline, raise an alert, and preserve both raw manifests for audit.
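The four rungs above can be sketched as a single function that returns the first verdict that applies. This is an illustration under stated assumptions: Field, WIDENINGS, and classify are hypothetical names, and the widening pairs listed are placeholders for whatever the tolerance profile actually permits.

```python
from typing import NamedTuple


class Field(NamedTuple):
    name: str
    dtype: str
    nullable: bool


# Assumed widening pairs (source type, target type). A real table would come
# from the tolerance profile rather than being hard-coded like this.
WIDENINGS = {("int32", "int64"), ("float32", "float64"), ("int32", "float64")}


def classify(source: list[Field], target: list[Field]) -> str:
    """Walk the rungs in order and return the first verdict that applies."""
    # Rung 1: identical, including column order and nullability.
    if source == target:
        return "exact_match"

    # Rung 2: identical once column order and the nullable flag are ignored.
    def loose(fields: list[Field]) -> list[tuple[str, str]]:
        return sorted((f.name, f.dtype) for f in fields)

    if loose(source) == loose(target):
        return "tolerable_drift"

    # Rung 3: same column names, and every type change is a known widening.
    src_types = {f.name: f.dtype for f in source}
    tgt_types = {f.name: f.dtype for f in target}
    if src_types.keys() == tgt_types.keys() and all(
        src_types[n] == tgt_types[n] or (src_types[n], tgt_types[n]) in WIDENINGS
        for n in src_types
    ):
        return "coercible_mismatch"

    # Rung 4: anything else, such as a dropped column or a narrowing.
    return "irreconcilable_drift"
```

Note that the widening check is directional: int32 to int64 is coercible, while the reverse falls through to the last rung.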

Hash Algorithm Trade-Off

The structural hash is pluggable because the two jobs it does have different constraints. Fast in-run screening across thousands of partitions favours throughput; audit-grade comparison that will be recorded in a compliance trail favours collision resistance and a documented standard. Choose per stage, and record the choice in job metadata.

Axis	MD5	SHA-256	xxHash / BLAKE3
Digest strength	Broken; collisions are constructible	Strong, standardized	BLAKE3 cryptographic; xxHash non-cryptographic
Throughput	Fast	Moderate	Fastest (xxHash), fast + parallel (BLAKE3)
Best fit	Legacy compatibility only	Audit-grade structural fingerprints	High-volume in-run screening
Determinism across hosts	Yes	Yes	Yes (pin the library and version)
Compliance / regulatory	Not acceptable where integrity is auditable	Preferred; a NIST-approved hash function backs the audit trail	xxHash unsuitable for audit; BLAKE3 acceptable but less widely mandated
Cluster cost at scale	Low CPU	Higher CPU; amortized by footer-only I/O	Lowest CPU per digest

The default is SHA-256 for any hash that will be persisted or cited in an audit, with a non-cryptographic screen acceptable only for the transient in-run pass that never leaves the worker.
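One way to keep that rule from eroding is to make the purpose an explicit argument and to store the algorithm name with the digest. The sketch below assumes a hypothetical ALLOWED policy table and fingerprint helper. It uses blake2b as the screening option only because it ships with hashlib; it is not BLAKE3, and xxHash and BLAKE3 would need their third-party packages, which this sketch does not cover.

```python
import hashlib

# Assumed policy: which hashlib algorithm names each purpose may use.
ALLOWED = {
    "audit": {"sha256"},
    "screening": {"sha256", "blake2b"},
}


def fingerprint(canonical_bytes: bytes, purpose: str = "audit", algorithm: str = "sha256") -> str:
    """Return '<algorithm>:<hexdigest>' so the algorithm travels with the digest."""
    if algorithm not in ALLOWED[purpose]:
        raise ValueError(f"{algorithm!r} is not permitted for purpose {purpose!r}")
    return f"{algorithm}:{hashlib.new(algorithm, canonical_bytes).hexdigest()}"


audit_digest = fingerprint(b'{"id":"integer"}')
screen_digest = fingerprint(b'{"id":"integer"}', purpose="screening", algorithm="blake2b")
```

Prefixing the digest with its algorithm is a simple way to record the choice in job metadata and to stop a screening digest from being compared against an audit one by accident.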

Scaling and Performance

Structural detection is I/O-bound on metadata and stateless in its hashing, which makes it embarrassingly parallel: the constraint is fan-out, not compute. Partition the work by dataset pair and distribute across a worker pool so footer reads parallelize across storage nodes; because each compute_structural_hash call is pure and shares no state, workers need no coordination or locking. The CPython global interpreter lock is rarely the bottleneck, since threads release it while blocked on footer I/O. If normalizing and hashing thousands of schemas ever comes to dominate wall-clock time, that part is CPU-bound Python and belongs in a process pool rather than in threads.
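A minimal sketch of that fan-out, using a thread pool from the standard library, is shown below. It assumes the schemas have already been extracted and normalized into plain dicts; in a real run each task would begin with the footer read, which is where threads pay off. The names structural_hash, compare_pair, and compare_all are illustrative only.

```python
import hashlib
import json
from concurrent.futures import ThreadPoolExecutor


def structural_hash(schema: dict) -> str:
    """Pure function: canonical JSON bytes in, SHA-256 hex digest out."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def compare_pair(pair: tuple[str, dict, dict]) -> tuple[str, bool]:
    """Hash both sides of one dataset pair; shares no state with other calls."""
    name, source_schema, target_schema = pair
    return name, structural_hash(source_schema) == structural_hash(target_schema)


def compare_all(pairs: list[tuple[str, dict, dict]], max_workers: int = 8) -> dict[str, bool]:
    """Fan pairs out across a thread pool; map() yields results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(compare_pair, pairs))


pairs = [
    ("orders", {"id": "integer", "ts": "timestamp"}, {"ts": "timestamp", "id": "integer"}),
    ("users", {"id": "integer"}, {"id": "string"}),
]
results = compare_all(pairs)
# results == {"orders": True, "users": False}
```

Because compare_pair takes everything it needs as an argument and returns a value, swapping ThreadPoolExecutor for ProcessPoolExecutor is a small change if hashing ever becomes the dominant cost.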

Keep three memory-bounding rules. First, never widen the extractor to read row groups: a footer is kilobytes; a row group is not. Second, cap the per-worker working set at the configured max_memory_mb and shard partition lists so no single worker holds every manifest at once. Third, pre-compute structural hashes during off-peak windows and cache them in a low-latency key-value store keyed by dataset version; a migration window then reads a cached digest instead of re-hashing, turning structural validation of an incremental load into a single point lookup. A cached digest proves only that the shape is unchanged since the last clean run, so it lets the structural gate be skipped for that version; whether the value-comparison engines, notably JSON and Parquet diffing, can also be skipped depends on the dataset version itself being unchanged, not just its shape.
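The caching rule can be sketched with an in-memory dictionary standing in for the key-value store. DigestCache and its get_or_compute method are hypothetical names; a production version would sit in front of whatever store the platform already runs, and load_schema would perform the footer read.

```python
import hashlib
import json
from typing import Callable


class DigestCache:
    """In-memory stand-in for a key-value store, keyed by (dataset, version)."""

    def __init__(self) -> None:
        self._store: dict[tuple[str, str], str] = {}
        self.misses = 0

    def get_or_compute(self, dataset: str, version: str, load_schema: Callable[[], dict]) -> str:
        key = (dataset, version)
        if key not in self._store:
            # Only a miss pays for metadata extraction and hashing.
            self.misses += 1
            canonical = json.dumps(load_schema(), sort_keys=True, separators=(",", ":"))
            self._store[key] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        return self._store[key]


cache = DigestCache()
first = cache.get_or_compute("orders", "v41", lambda: {"id": "integer"})
second = cache.get_or_compute("orders", "v41", lambda: {"id": "integer"})
# The second call is a point lookup: cache.misses is still 1.
```

Passing the loader as a callable keeps the cache from touching storage at all on a hit, which is the point of keying by dataset version.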

Failure Modes and Diagnostic Runbook

Silent schema drift. Cause: an upstream SELECT * rewrite or an engine version bump changes column order or type spelling. Detection signal: a structural hash mismatch on a pair whose row counts are unchanged. Remediation: run the normalized-match rung; if it clears, update the accepted baseline hash, otherwise escalate to schema evolution.
Phantom mismatch from type spelling. Cause: a new engine emits an unmapped type token (e.g. Snowflake's NUMBER(38,0) or VARIANT) that matches none of the fallback substrings and falls through to the STRING default. Detection signal: mismatches concentrated on one column across many otherwise-identical pairs. Remediation: add the spelling to TYPE_COERCION_MAP, re-derive the type map from cross-platform schema mapping, and re-hash.
Non-deterministic hash across hosts. Cause: an unpinned pyarrow produces a different footer type string, or JSON key order leaks into serialization. Detection signal: the same file hashes differently on two workers. Remediation: pin dependency versions and confirm sort_keys=True is set on every serialization path.
OOM during extraction. Cause: the JSON extractor was pointed at a multi-gigabyte array instead of a single sampled record, or the Parquet path was widened to read data. Detection signal: worker memory climbs with file size rather than staying flat. Remediation: restore footer-only / single-record extraction and enforce the max_memory_mb ceiling.
False clean on nested drift. Cause: a change buried in a struct or array element is masked because the normalizer sorted it away under an order-insensitive policy. Detection signal: downstream value comparison flags rows the structural gate cleared. Remediation: tighten the ordering policy for nested types, or treat nested-field order as significant in the tolerance profile.
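The detection signal for a phantom mismatch, one column recurring across many failing pairs, is straightforward to quantify. The sketch below assumes each failing pair has already been reduced to the set of column names that differ; mismatch_concentration is a hypothetical helper, not part of any library.

```python
from collections import Counter


def mismatch_concentration(deltas: list[set[str]]) -> list[tuple[str, float]]:
    """Share of mismatched pairs in which each column appears, highest first.

    `deltas` holds one set of mismatched column names per failing pair.
    """
    if not deltas:
        return []
    counts = Counter(column for delta in deltas for column in delta)
    return [(column, count / len(deltas)) for column, count in counts.most_common()]


deltas = [{"created_at"}, {"created_at"}, {"created_at", "amount"}, {"created_at"}]
ranking = mismatch_concentration(deltas)
# ranking == [("created_at", 1.0), ("amount", 0.25)]
```

A share near 1.0 for a single column across otherwise-identical pairs points at an unmapped type spelling rather than genuine drift, which is worth checking before escalating to schema evolution.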

Deep Dives

For the format-specific mechanics of this workload, detecting structural mismatches in Parquet files walks through Parquet footer root causes, per-engine drift between Spark, Trino, and DuckDB, and an operational runbook for reconciling columnar schema headers at scale.

Related

Structural Diffing & Sync Engines — the parent stage this shape-comparison gate opens.
JSON and Parquet Diffing Algorithms — the value-comparison engine that runs only once shapes match.
Threshold Tuning for Tolerance — sources the severity thresholds the decision chain applies.
Fallback Chain Implementation — tiered retry and deferred reconciliation when drift is not a clean stop.
Schema Validation Pre-Checks — the lightweight extraction-time gate that precedes this authoritative one.
