Mapping Relational Schemas to Document Stores

This page answers one precise question: how do you translate a normalized relational schema (foreign keys, junction tables, ACID boundaries) into document-store shapes so that a reconciliation run can still prove the two sides hold the same logical data? It is the concrete, code-level task that sits under the cross-platform schema mapping reference, and it assumes you already have a mapping contract from that stage: a declared source-to-target type map, a null policy, and a chosen reconciliation key. Written for data engineers, migration specialists, Python pipeline builders, and platform operations teams, what follows turns that contract into a deterministic transformer and a parity check you can run in production.

The reason relational-to-document translation earns its own runbook is that the two models disagree about where an entity ends. A relational schema scatters one logical order across orders, order_items, and a customers reference; a document store wants that order as a single self-contained record with the line items embedded. Get the document boundary wrong and every downstream comparison drowns in phantom divergence — orphaned references, fragmented nested payloads, and checkpoint drift — long before the data equivalence modeling stage can decide whether two records are the same entity.

Problem Framing

You are migrating a PostgreSQL commerce schema into a document store. customers has many orders; each order has many order_items; orders also reference products through a many-to-many order_product_tags junction; and employees.manager_id is a self-referencing foreign key. The target must be one document per order, with line items embedded, the customer denormalized to a stable subset, product tags stored as a reference array, and no structure that recurses forever. If a DECIMAL(18,4) price silently truncates or a TIMESTAMPTZ loses its offset during serialization, the migration will look successful and be wrong — and you will only discover it when SQL to NoSQL sync validation starts flagging drift you cannot explain.

The rule that makes this tractable: embed what is owned and read together; reference what is shared or unbounded. A one-to-many where the children have no independent lifecycle (line items) embeds. A many-to-many, or a relationship whose fan-out is large, becomes a reference array so a single root document cannot grow without bound.
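
The rule above can be written down as a small decision function. This is only a sketch: `choose_relation`, its parameters, and the default `fanout_limit` of 50 are hypothetical names and values chosen for illustration (the limit echoes the rough 1:50 guardrail mentioned later on this page), and `Relation` mirrors the enum declared in Step 1.

```python
from enum import Enum


class Relation(Enum):
    EMBED = "embed"
    REFERENCE = "reference"


def choose_relation(
    owned: bool, shared: bool, max_fanout: int, fanout_limit: int = 50
) -> Relation:
    """Hypothetical helper: apply the embed/reference rule to one relationship."""
    if shared or not owned:
        # Shared children, or children with their own lifecycle, stay external.
        return Relation.REFERENCE
    if max_fanout >= fanout_limit:
        # Owned but potentially huge: avoid a root document that grows unbounded.
        return Relation.REFERENCE
    return Relation.EMBED


line_items = choose_relation(owned=True, shared=False, max_fanout=12)
product_tags = choose_relation(owned=False, shared=True, max_fanout=4)
print(line_items, product_tags)
```

Recording the inputs (ownership, sharing, observed fan-out) next to the decision keeps the boundary plan reviewable instead of leaving it as an undocumented judgment call.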

Implementation

The transformer is built in two layers so each is independently verifiable: a declarative boundary plan, then a deterministic mapper that applies it, coerces types, breaks cycles, and emits a canonical fingerprint the reconciliation layer can compare.

Step 1 — Declare the boundary plan and coerce types deterministically

The plan names, per relationship, whether it embeds or references, and the type map fixes numeric and temporal precision before anything is serialized. Decimals stay exact through Python’s decimal.Decimal; timestamps normalize to timezone-aware UTC ISO-8601 via datetime so no offset is ever dropped.

python

import logging
from dataclasses import dataclass, field
from datetime import datetime, timezone
from decimal import Decimal, ROUND_HALF_EVEN, InvalidOperation
from enum import Enum
from typing import Any, Callable, Dict, List, Optional

logging.basicConfig(
    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s", level=logging.INFO
)
logger = logging.getLogger("recon.rel2doc")

_MONEY = Decimal("0.0001")  # DECIMAL(18,4) scale — must survive the round-trip


class Relation(Enum):
    EMBED = "embed"          # owned, read together, bounded fan-out
    REFERENCE = "reference"  # shared or unbounded — store ids only


@dataclass(frozen=True)
class ChildRule:
    """One relational relationship and how it projects into the document."""
    source_table: str
    join_key: str
    relation: Relation
    embed_as: str                       # target field name in the document
    projection: Optional[List[str]] = None  # None => all columns


@dataclass(frozen=True)
class BoundaryPlan:
    """The document boundary for one root entity."""
    root_table: str
    root_key: str
    children: List[ChildRule] = field(default_factory=list)


def coerce(value: Any) -> Any:
    """Deterministic, precision-preserving coercion — the root of parity.

    Decimals are quantized to a fixed scale, datetimes forced to UTC ISO-8601,
    and bytes decoded, so logically equal rows serialize to identical bytes on
    both engines. A NULL stays None; the null policy collapses it downstream.
    """
    if value is None:
        return None
    if isinstance(value, Decimal):
        try:
            return str(value.quantize(_MONEY, rounding=ROUND_HALF_EVEN))
        except InvalidOperation:
            logger.error("decimal quantize failed for %r", value)
            raise
    if isinstance(value, datetime):
        if value.tzinfo is None:
            raise ValueError(f"naive datetime {value!r}: timezone lost before mapping")
        return value.astimezone(timezone.utc).isoformat()
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="replace")
    return value
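
To see why these two rules matter, the standalone sketch below exercises only the standard-library behavior that coerce relies on. The sample values and variable names are made up for illustration; it does not call coerce itself.

```python
from datetime import datetime, timedelta, timezone
from decimal import Decimal, ROUND_HALF_EVEN

SCALE = Decimal("0.0001")  # same scale as DECIMAL(18,4)

# One monetary value written at two scales quantizes to one string.
price_a = str(Decimal("20.00").quantize(SCALE, rounding=ROUND_HALF_EVEN))
price_b = str(Decimal("20.0000").quantize(SCALE, rounding=ROUND_HALF_EVEN))
assert price_a == price_b == "20.0000"

# One instant expressed with two offsets normalizes to one ISO-8601 string.
utc_noon = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
minus_five = datetime(2024, 1, 15, 7, 0, tzinfo=timezone(timedelta(hours=-5)))
iso_a = utc_noon.astimezone(timezone.utc).isoformat()
iso_b = minus_five.astimezone(timezone.utc).isoformat()
assert iso_a == iso_b == "2024-01-15T12:00:00+00:00"

# Widening to float is where drift starts; Decimal arithmetic stays exact.
float_sum = 0.1 + 0.2
decimal_sum = Decimal("0.1") + Decimal("0.2")
assert float_sum != 0.3
assert decimal_sum == Decimal("0.3")
```

Because both normalizations produce plain strings, nothing downstream (driver, JSON encoder, target store) gets a chance to reinterpret the value.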

Step 2 — Assemble the root document, embedding, referencing, and breaking cycles

The mapper embeds owned children, reduces many-to-many rows to a reference array, and replaces a self-referencing chain with a materialized path so document assembly can never recurse without a bound. Every value passes through coerce, and a discriminator is injected for polymorphic children so the target store stays queryable.

python

class MappingError(Exception):
    """Raised when a row cannot be projected under the boundary plan."""


def _project(row: Dict[str, Any], cols: Optional[List[str]]) -> Dict[str, Any]:
    items = row.items() if cols is None else ((c, row.get(c)) for c in cols)
    return {k: coerce(v) for k, v in items}


def materialize_path(
    node_id: Any, parent_of: Dict[Any, Any], max_depth: int = 128
) -> str:
    """Flatten a self-referencing hierarchy (e.g. employees.manager_id) to a path.

    The seen-set turns a corrupt cycle into a raised error instead of a hang;
    the depth cap bounds traversal of a legitimate but pathologically deep chain.
    """
    path, cursor, seen = [], node_id, set()
    while cursor is not None:
        if cursor in seen:
            raise MappingError(f"circular reference at node {cursor!r}")
        if len(path) >= max_depth:
            raise MappingError(
                f"hierarchy exceeds max_depth={max_depth} at node {cursor!r}"
            )
        seen.add(cursor)
        path.append(str(cursor))
        cursor = parent_of.get(cursor)
    return "/" + "/".join(reversed(path))


def build_document(
    root: Dict[str, Any],
    plan: BoundaryPlan,
    child_rows: Dict[str, List[Dict[str, Any]]],
    parent_of: Optional[Dict[Any, Any]] = None,
    discriminator: Optional[Callable[[Dict[str, Any]], str]] = None,
) -> Dict[str, Any]:
    """Project one relational root plus its related rows into a document.

    child_rows maps source_table -> the already-fetched, join-filtered rows for
    this root. Embedding vs referencing is decided by the plan, not inline.
    """
    root_id = root.get(plan.root_key)
    if root_id is None:
        raise MappingError(f"root {plan.root_table} missing key {plan.root_key}")

    doc: Dict[str, Any] = {"_id": coerce(root_id), **_project(root, None)}

    for rule in plan.children:
        rows = child_rows.get(rule.source_table, [])
        if rule.relation is Relation.REFERENCE:
            doc[rule.embed_as] = sorted(coerce(r[rule.join_key]) for r in rows)
        else:  # EMBED
            embedded = []
            for r in rows:
                projected = _project(r, rule.projection)
                if discriminator is not None:
                    projected["_type"] = discriminator(r)  # keep polymorphic rows queryable
                embedded.append(projected)
            doc[rule.embed_as] = embedded

    if parent_of is not None and "manager_id" in root:
        doc["manager_path"] = materialize_path(root_id, parent_of)

    return doc
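
build_document expects child_rows to arrive already filtered for one root. One way to produce that shape for a small batch of roots is sketched below. `group_children` and the `fk` parameter are hypothetical, and the sketch assumes every child row carries the root's foreign key under the same column name.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def group_children(
    rows_by_table: Dict[str, Iterable[Row]], fk: str
) -> Dict[Any, Dict[str, List[Row]]]:
    """Hypothetical helper: bucket child rows by root key, then by source table."""
    grouped: Dict[Any, Dict[str, List[Row]]] = defaultdict(dict)
    for table, rows in rows_by_table.items():
        for row in rows:
            grouped[row[fk]].setdefault(table, []).append(row)
    return dict(grouped)


batch = {
    "order_items": [
        {"order_id": 1, "sku": "A1"},
        {"order_id": 2, "sku": "B2"},
        {"order_id": 1, "sku": "C3"},
    ],
    "order_product_tags": [{"order_id": 1, "tag_id": 7}],
}
per_root = group_children(batch, fk="order_id")
# per_root[1] is the child_rows argument for order 1; order 2 has no tag rows,
# and a table missing from the mapping is treated as an empty child list.
```

Keeping the batch small is the point: grouping happens per batch of roots, never across the whole table.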

Step 3 — Fingerprint the document for cross-engine comparison

A document is only reconcilable if the same logical order produces the same digest on the source-derived side and the target-fetched side. Canonical serialization — sorted keys, coerced scalars, one null representation — feeds a SHA-256 hash from Python’s hashlib. This is the same serialization contract used by column-level checksum generation, reused here so a fingerprint computed at migration time matches one recomputed during validation.

python

import hashlib
import json


def canonical_bytes(doc: Dict[str, Any]) -> bytes:
    """Deterministic byte image of a document, stable across engines."""
    return json.dumps(
        doc, sort_keys=True, separators=(",", ":"), ensure_ascii=False, default=str
    ).encode("utf-8")


def fingerprint(doc: Dict[str, Any]) -> str:
    return hashlib.sha256(canonical_bytes(doc)).hexdigest()
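
The standalone sketch below shows what canonical serialization does and does not normalize. The `digest` helper and the sample documents are illustrative only; it uses the same json.dumps options as canonical_bytes, minus the default=str fallback.

```python
import hashlib
import json
from typing import Any, Dict


def digest(doc: Dict[str, Any]) -> str:
    """SHA-256 over a sorted-key, compact JSON image of the document."""
    image = json.dumps(doc, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(image.encode("utf-8")).hexdigest()


same_a = {"_id": 1, "total": "20.0000", "tag_ids": [3, 7]}
same_b = {"tag_ids": [3, 7], "total": "20.0000", "_id": 1}
reordered_list = {"_id": 1, "total": "20.0000", "tag_ids": [7, 3]}

# Key order is normalized away by sort_keys.
assert digest(same_a) == digest(same_b)
# List order is not: arrays serialize in the order given.
assert digest(same_a) != digest(reordered_list)
```

This is why build_document sorts reference arrays before they reach the fingerprint: sort_keys orders dictionary keys but leaves list elements untouched.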

Key Implementation Notes

Embed-vs-reference is a fan-out decision, not a taste decision. Embed only relationships with bounded, owned children; route many-to-many and high-cardinality one-to-many to a reference array. A quick guardrail during design: run EXPLAIN ANALYZE on the source join and, if row multiplication is large (roughly 1:50 or worse), force a reference so no single document grows without limit.
Temporal precision is lost silently or not at all. coerce refuses a naive datetime outright rather than guessing an offset, and forces UTC ISO-8601, because a dropped timezone is the classic “migration looked fine, reconciliation failed” defect. TIMESTAMPTZ and microsecond precision survive because the string is produced before any driver-level truncation.
Polymorphic rows need a discriminator. A table such as audit_logs mixing user, system, and api events loses queryability once flattened; injecting _type keeps target-store filters selective instead of forcing full-collection scans.
Circular references become paths, not stack overflows. materialize_path flattens employees.manager_id into /ceo/vp/lead, and its visited-node check and depth cap convert a corrupt cycle into a raised MappingError, a bounded failure the pipeline can quarantine, rather than a hung worker.
Null semantics are a compliance surface. A missing document field is not the same as a relational NULL; decide once whether they collapse, and keep the decision in the mapping contract so audit reviewers see one explicit rule. Preserving DECIMAL scale exactly matters for the same reason — silent rounding of a monetary column is a reportable integrity defect in regulated environments.
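
As an illustration of the null-semantics note, here is a minimal sketch of one possible policy: collapse a relational NULL and a missing field into the same shape before fingerprinting. `collapse_nulls` is a hypothetical helper, and whether collapsing is the right policy is a decision for your mapping contract, not something this page prescribes.

```python
from typing import Any


def collapse_nulls(value: Any) -> Any:
    """Hypothetical null policy: drop None-valued keys so NULL equals 'absent'."""
    if isinstance(value, dict):
        return {k: collapse_nulls(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        # List positions are kept as-is; only mapping keys are dropped.
        return [collapse_nulls(v) for v in value]
    return value


from_sql = {"_id": 1, "coupon": None, "items": [{"sku": "A1", "note": None}]}
from_target = {"_id": 1, "items": [{"sku": "A1"}]}
assert collapse_nulls(from_sql) == collapse_nulls(from_target)
```

If the contract instead says NULL and absent must stay distinct, skip this step entirely and let the fingerprints differ.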

Verification

Assert the two properties the mapper exists to guarantee: logically equal roots fingerprint identically regardless of column order or numeric representation, and a cycle fails loudly instead of hanging.

python

plan = BoundaryPlan(
    root_table="orders", root_key="order_id",
    children=[
        ChildRule("order_items", "sku", Relation.EMBED, "items"),
        ChildRule("order_product_tags", "tag_id", Relation.REFERENCE, "tag_ids"),
    ],
)
children = {
    "order_items": [{"sku": "A1", "price": Decimal("10.0000"), "qty": 2}],
    "order_product_tags": [{"tag_id": 7}, {"tag_id": 3}],
}

a = build_document({"order_id": 1, "total": Decimal("20.00")}, plan, children)
b = build_document({"total": Decimal("20.0000"), "order_id": 1}, plan, children)
assert fingerprint(a) == fingerprint(b)          # key order + decimal scale reconcile
assert a["tag_ids"] == [3, 7]                     # many-to-many became a sorted ref array

try:
    materialize_path("e1", {"e1": "e2", "e2": "e1"})  # deliberate cycle
except MappingError:
    logger.info("cycle correctly rejected")
else:
    raise AssertionError("circular reference was not detected")

Run it as a pre-deployment gate. pytest only collects functions named test_*, so wrap the assertions above in test functions inside test_rel2doc.py; python -m pytest test_rel2doc.py -q should then pass before any bulk load starts, and the same fingerprint values should be recomputed from the target store during a live parity check. A mismatch rate above your agreed tolerance (0.1% is a common starting threshold) is the signal to halt and triage rather than promote the target.
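
The tolerance check itself can be sketched as follows. `parity_report` is a hypothetical helper; it assumes fingerprints have already been collected into two mappings keyed by document id, and it counts an id missing from the target as a mismatch.

```python
from typing import Dict, List, Tuple


def parity_report(
    source_fp: Dict[str, str], target_fp: Dict[str, str], tolerance: float = 0.001
) -> Tuple[float, List[str], bool]:
    """Return (mismatch_rate, mismatched_ids, within_tolerance)."""
    mismatched = sorted(
        key for key, fp in source_fp.items() if target_fp.get(key) != fp
    )
    rate = len(mismatched) / len(source_fp) if source_fp else 0.0
    return rate, mismatched, rate <= tolerance


source = {"1": "aa", "2": "bb", "3": "cc", "4": "dd"}
target = {"1": "aa", "2": "bb", "3": "XX"}  # id 3 differs, id 4 is missing
rate, bad_ids, ok = parity_report(source, target)
assert rate == 0.5 and bad_ids == ["3", "4"] and not ok
```

Returning the mismatched ids alongside the rate gives triage something concrete to refetch instead of just a percentage.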

Operational Considerations

At scale the transformer is CPU-bound on coercion and hashing and memory-bound on child fan-out, so both must be engineered together. Never materialize a full Cartesian product of a root and its children before serializing — that is the OOM trap that ends most naive pandas or pyarrow migrations. Fetch children join-filtered per root and assemble one document at a time; when embedding fan-out is large, the byte ceiling per document, not the row count, is what you tune against. Disable secondary indexes on the target during bulk load and rebuild them in the background to avoid write stalls, and align the target write pool to source throughput so bursty syncs do not starve connections.
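
A minimal sketch of the one-document-at-a-time, byte-ceiling idea follows. `bounded_documents`, `ceiling_bytes`, and the `quarantine` list are hypothetical, and compact JSON length is only an approximation of what a given target store counts toward its own document limit.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List


def bounded_documents(
    docs: Iterable[Dict[str, Any]], ceiling_bytes: int, quarantine: List[Any]
) -> Iterator[Dict[str, Any]]:
    """Lazily yield documents under the ceiling; record the ids of the rest."""
    for doc in docs:
        size = len(json.dumps(doc, separators=(",", ":"), default=str).encode("utf-8"))
        if size > ceiling_bytes:
            quarantine.append(doc.get("_id"))  # too large to embed safely
            continue
        yield doc


def generate():
    # Stand-in for a per-root mapper; only one document exists at a time.
    yield {"_id": 1, "items": []}
    yield {"_id": 2, "items": ["x" * 200]}


oversized: List[Any] = []
kept = list(bounded_documents(generate(), ceiling_bytes=100, quarantine=oversized))
assert [d["_id"] for d in kept] == [1] and oversized == [2]
```

A document that trips the ceiling is usually a sign that the relationship should have been a reference array rather than an embed.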

Because structural translation alone does not prove semantic parity, run validation asynchronously alongside ingestion: sample _id values, refetch, recompute fingerprint, and log divergence patterns (missing arrays, truncated decimals, timezone shifts) to the reconciliation ledger. When extraction outruns the mapper, the backpressure supplied by async batching for large datasets keeps in-flight rows bounded. Expose documents_mapped_total, quarantine_rate, map_latency_ms, and parity_mismatch_rate so platform ops can alert on drift early. Three concerns are deliberately left to other references. A table-wide structural break, such as a column that changed type across every row, is not a per-document concern and belongs to structural mismatch detection. The epsilon that decides whether a numeric delta is real or rounding comes from threshold tuning for tolerance. When mismatch rates cross tolerance, the tiered halt-quarantine-resync response is owned by the fallback chain implementation reference rather than being hard-coded here.
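
To make the backpressure point concrete, here is a generic sketch using a bounded asyncio.Queue. It is not the implementation from the async batching reference; `extract`, `map_rows`, `run`, and `max_in_flight` are hypothetical names, and the mapping step is a placeholder.

```python
import asyncio
from typing import List, Tuple


async def extract(queue: asyncio.Queue, n: int) -> None:
    for i in range(n):
        await queue.put(i)  # suspends while the queue is full: backpressure
    await queue.put(None)   # sentinel: no more rows


async def map_rows(queue: asyncio.Queue, out: List[int], peak: List[int]) -> None:
    while True:
        peak[0] = max(peak[0], queue.qsize())  # observe in-flight depth
        row = await queue.get()
        if row is None:
            break
        await asyncio.sleep(0)  # placeholder for coercion and hashing
        out.append(row)


async def run(n: int, max_in_flight: int) -> Tuple[List[int], int]:
    queue: asyncio.Queue = asyncio.Queue(maxsize=max_in_flight)
    out: List[int] = []
    peak = [0]
    await asyncio.gather(extract(queue, n), map_rows(queue, out, peak))
    return out, peak[0]


mapped, peak_depth = asyncio.run(run(100, 8))
assert mapped == list(range(100)) and peak_depth <= 8
```

The extractor never gets more than max_in_flight rows ahead of the mapper, which is what keeps memory flat when the source is faster than the transform.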

Frequently Asked Questions

When should a one-to-many relationship be referenced instead of embedded?

Embed a one-to-many only when the children are owned by the root, read together with it, and bounded in number — order line items are the canonical example. Reference it when the children are shared across roots, have an independent lifecycle, or fan out large enough that a single document could grow without limit. A practical trigger is source join fan-out: if EXPLAIN ANALYZE shows row multiplication around 1:50 or worse, force a reference array so the document stays a bounded, reconcilable unit.

How do I stop timestamps and decimals from silently drifting during migration?

Coerce them before any database driver serializes them. Force every datetime to timezone-aware UTC ISO-8601 and reject naive datetimes outright rather than guessing an offset, and carry monetary values through decimal.Decimal quantized to the source scale instead of letting them widen to float. Both rules live in one coerce function so the same normalization runs at migration time and at validation time — which is exactly why a fingerprint computed on each side matches for logically equal rows.

What breaks when a self-referencing foreign key is mapped naively?

Document assembly recurses forever. A column like employees.manager_id describes a chain that, if a data-quality defect makes it circular, will hang the worker or blow the stack. Flatten the hierarchy into a materialized path string, store only the immediate parent reference, and cap traversal depth so a corrupt cycle raises a bounded MappingError you can quarantine rather than an unrecoverable hang.
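
One reason a materialized path is convenient is that hierarchy questions reduce to string prefix checks, with no traversal at all. The sketch below assumes paths shaped like the /ceo/vp/lead example above; `is_descendant` is a hypothetical helper, not part of the mapper.

```python
def is_descendant(path: str, ancestor_path: str) -> bool:
    """True when `path` lies strictly below `ancestor_path` in the hierarchy."""
    # The trailing slash stops /ceo/vpx from matching the ancestor /ceo/vp.
    return path.startswith(ancestor_path.rstrip("/") + "/")


assert is_descendant("/ceo/vp/lead", "/ceo/vp")
assert not is_descendant("/ceo/vpx", "/ceo/vp")
assert not is_descendant("/ceo/vp", "/ceo/vp")  # a node is not its own descendant
```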

Related

Cross-Platform Schema Mapping: the parent reference that defines the mapping contract, type map, and null policy this task consumes.
SQL to NoSQL sync validation: the parity-checking stage that recomputes these fingerprints against the live target during and after cutover.
Data equivalence modeling: decides whether two structurally distinct records represent the same logical entity across engines.
Column-level checksum generation: the canonical serialization and digest contract reused here for cross-engine fingerprinting.
Structural mismatch detection: catches table-wide type and layout drift that per-document mapping should not handle.
