Schemas

Understanding schemas as quality gates for data ingestion

This tutorial explains what schemas are, why they exist, and how they act as quality gates for data ingestion.


Introduction

The primary purpose of schemas in the Nexus platform is to act as quality gates for data ingestion. They ensure that only properly structured, consistent data enters your collections.

Why Quality Gates Matter

Without quality gates:

  • Inconsistent data structures
  • Missing required files
  • Unexpected files mixed in
  • Problems discovered too late
  • Manual validation required

With quality gates (schemas):

  • Consistent data structure
  • Required files enforced
  • Unexpected files rejected
  • Problems caught early (during prevalidation)
  • Automated validation

You can also use permissive schemas (e.g. [{"pattern": ".*", "min_occurrence": 0}]) when you want to accept any structure—see Permissive vs strict schemas below for when that is appropriate.


What is a Schema?

A schema is a list of schema nodes that define the expected structure of data in a collection. Each node uses either name (exact match) or pattern (regex match), plus optional min_occurrence, max_occurrence, and children for nested structure. It specifies:

  • What files and folders are expected: Exact names or patterns
  • Which items are required: min_occurrence constraints
  • How many of each item: max_occurrence constraints
  • Patterns for matching: Regex patterns for flexible matching
python
schema = [
    {
        "pattern": "report_\\d+\\.pdf",  # Matches report_1.pdf, report_2.pdf, etc.
        "min_occurrence": 1,  # At least one report required
        "max_occurrence": None  # No maximum limit
    },
    {
        "name": "parameters.dat",  # Exact file name
        "min_occurrence": 1,  # Required
        "max_occurrence": 1  # Exactly one
    },
    {
        "pattern": "run_\\d+/",  # Folder pattern (trailing slash for folders)
        "min_occurrence": 1,  # At least one run folder required
        "children": [  # Nested structure
            {
                "pattern": "output_\\d+\\.vtk",
                "min_occurrence": 1
            }
        ]
    }
]
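
To build intuition for how the two node types match, here is a small pure-Python sketch using `re.fullmatch`. It is only an illustration; the platform's exact matching semantics may differ.

```python
import re

def node_matches(node, filename):
    """Return True if a schema node matches a single file name.

    A "name" node requires an exact match; a "pattern" node is
    treated as a regular expression over the whole name.
    """
    if "name" in node:
        return filename == node["name"]
    return re.fullmatch(node["pattern"], filename) is not None

report_node = {"pattern": r"report_\d+\.pdf", "min_occurrence": 1}
params_node = {"name": "parameters.dat", "min_occurrence": 1}

print(node_matches(report_node, "report_7.pdf"))    # True
print(node_matches(report_node, "report.pdf"))      # False - no digits
print(node_matches(params_node, "parameters.dat"))  # True
```

Using `re.fullmatch` rather than `re.match` forces the pattern to cover the whole name, so a stray `report_1.pdf.bak` would not slip through.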

Schemas as Quality Gates

The Quality Gate Process

When you upload data to a collection, the schema acts as a quality gate:

  1. Prevalidation Phase: Before any files are uploaded, the system validates your data against the schema
  2. Validation Checks:
    • Are all required files present?
    • Are there any unexpected files?
    • Do file names match expected patterns?
    • Are occurrence constraints satisfied?
  3. Gate Decision:
    • Pass: Data matches schema → Upload proceeds
    • Fail: Data doesn't match schema → Upload rejected with detailed errors
python
from miura import Nexus
from miura.sdk import LocalDataSource
from miura.logging import get_logger

logger = get_logger(__name__)

with Nexus() as nexus:
    # Create collection with schema (quality gate)
    project = nexus.create_project("quality-demo")
    
    # Schema defines: "We expect report files and a parameters file"
    schema = [
        {
            "pattern": "report_\\d+\\.pdf",
            "min_occurrence": 1,  # At least one report required
            "max_occurrence": None
        },
        {
            "name": "parameters.dat",
            "min_occurrence": 1,  # Required
            "max_occurrence": 1
        }
    ]
    
    collection = project.create_collection(
        name="reports",
        schema=schema,
        metadata={"description": "Collection with quality gate"}
    )
    
    # Upload data - quality gate validates
    datasource = LocalDataSource("data/reports")
    upload_result = collection.upload(datasource=datasource)
    
    # Check quality gate results
    if upload_result.files_failed == 0:
        logger.info("Quality gate passed - all files validated successfully!")
        logger.info(f"Uploaded {upload_result.files_uploaded} files")
    else:
        logger.warning("Quality gate failed - some files didn't pass validation")
        for error in upload_result.errors:
            logger.error(f"  Validation error: {error}")
        logger.info("Fix: Ensure your data matches the schema requirements")
    # Nexus automatically closes when exiting the with block

Schema Structure

Basic Schema Node

A schema is a list of schema nodes. Each node can be:

Exact Name Match:

python
{
    "name": "parameters.dat",  # Exact file name
    "min_occurrence": 1,
    "max_occurrence": 1
}

Pattern Match:

python
{
    "pattern": "report_\\d+\\.pdf",  # Regex pattern
    "min_occurrence": 1,
    "max_occurrence": None
}

Folder with Children:

python
{
    "pattern": "run_\\d+/",  # Folder pattern (trailing slash)
    "min_occurrence": 1,
    "max_occurrence": None,
    "children": [  # Nested items
        {
            "name": "output.vtk",
            "min_occurrence": 1
        }
    ]
}
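
One way to picture how a folder node applies to nested paths: match the first path segment against the folder pattern (with the trailing slash stripped), then validate the remainder against the node's children. This is a hypothetical sketch, not the platform's actual routing logic:

```python
import re

folder_node = {
    "pattern": r"run_\d+/",
    "min_occurrence": 1,
    "children": [{"name": "output.vtk", "min_occurrence": 1}],
}

def route(path, node):
    """Return the remainder of `path` if its first segment matches
    the folder node, otherwise None."""
    first, _, rest = path.partition("/")
    # Strip the trailing slash so the regex matches the directory name.
    if rest and re.fullmatch(node["pattern"].rstrip("/"), first):
        return rest  # this remainder would be checked against node["children"]
    return None

print(route("run_01/output.vtk", folder_node))  # "output.vtk"
print(route("misc/output.vtk", folder_node))    # None - folder name doesn't match
```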

Occurrence Constraints

  • min_occurrence: Minimum number of times this item must appear
    • 1 = Required
    • 0 = Optional
  • max_occurrence: Maximum number of times this item can appear
    • 1 = Exactly one
    • None = Unlimited
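
Both constraints amount to a range check on the number of matches a node collected, with `None` meaning unbounded. A minimal sketch:

```python
def occurrence_ok(count, min_occurrence=0, max_occurrence=None):
    """Check a node's match count against its occurrence constraints."""
    if count < min_occurrence:
        return False  # required item is missing
    if max_occurrence is not None and count > max_occurrence:
        return False  # too many matches
    return True

print(occurrence_ok(0, min_occurrence=1))                      # False - required
print(occurrence_ok(2, min_occurrence=1, max_occurrence=1))    # False - exactly one allowed
print(occurrence_ok(5, min_occurrence=1, max_occurrence=None)) # True - unlimited
```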

Permissive vs strict schemas

You can choose how much the schema enforces structure:

Permissive schema — Accept any files and folders; no required structure. Use a single node with a regex that matches everything and no required items:

python
# Allow any paths; nothing is required (min_occurrence: 0)
permissive_schema = [{"pattern": ".*", "min_occurrence": 0}]
collection = project.create_collection(
    name="my-collection",
    schema=permissive_schema,
)

Pros:

  • Quick to get started; no validation failures due to structure
  • Flexible: any layout and file types are accepted
  • Useful for ad-hoc uploads, sandboxes, or when structure is not yet defined

Cons:

  • No quality gate: inconsistent or incomplete data can be uploaded
  • No early detection of missing or unexpected files
  • Downstream consumers cannot rely on a guaranteed layout

Strict schema — Require specific patterns and/or names with min_occurrence >= 1 and optional max_occurrence and children. Use when you need a reliable, documented layout (e.g. pipelines, shared datasets).

When to use which:

  • Use permissive for experimentation, one-off uploads, or when the collection is a staging area and validation happens elsewhere.
  • Use strict for production collections, pipelines, and whenever others depend on a consistent structure.
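
You can verify with plain Python that the permissive pattern really does accept any path (here each whole path is matched with `re.fullmatch` for illustration; the platform may match per path segment):

```python
import re

permissive = {"pattern": ".*", "min_occurrence": 0}
paths = ["notes.txt", "run_01/output.vtk", "no_extension", "weird name (1).bin"]

# Every path matches, and with min_occurrence 0 nothing is required,
# so even an empty upload would pass the gate.
print(all(re.fullmatch(permissive["pattern"], p) for p in paths))  # True
```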

The Quality Gate in Detail

Step 1: Prevalidation

Before upload, the system scans your data and validates it against the schema:

Prevalidation checks:

  1. Are all required files present? (min_occurrence)
  2. Are there too many files? (max_occurrence)
  3. Are there unexpected files? (not in schema)
  4. Do file names match patterns? (pattern matching)
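
For a flat (non-nested) schema, these four checks can be sketched in a few lines of pure Python. This is illustrative only; the error wording mimics the examples in this tutorial, not the platform's actual output:

```python
import re

def prevalidate(paths, schema):
    """Return a list of validation errors for a flat (non-nested) schema."""
    counts = [0] * len(schema)
    errors = []
    for path in paths:
        for i, node in enumerate(schema):
            if "name" in node:
                matched = path == node["name"]
            else:
                matched = re.fullmatch(node["pattern"], path) is not None
            if matched:
                counts[i] += 1
                break
        else:  # no node matched this path
            errors.append(f"Unexpected item: /{path}")
    for node, count in zip(schema, counts):
        label = node.get("name") or node.get("pattern")
        if count < node.get("min_occurrence", 0):
            errors.append(f"Missing required item: /{label}")
        maximum = node.get("max_occurrence")
        if maximum is not None and count > maximum:
            errors.append(f"Too many items: /{label}")
    return errors

flat_schema = [
    {"pattern": r"report_\d+\.pdf", "min_occurrence": 1, "max_occurrence": None},
    {"name": "parameters.dat", "min_occurrence": 1, "max_occurrence": 1},
]

print(prevalidate(["report_1.pdf", "parameters.dat"], flat_schema))  # []
print(prevalidate(["report_1.pdf", "extra.txt"], flat_schema))
# ['Unexpected item: /extra.txt', 'Missing required item: /parameters.dat']
```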

Step 2: Validation Results

The upload result includes validation information:

python
# Illustrative upload result (fields are accessed as attributes)
upload_result.status          # "completed", or "failed" if validation fails
upload_result.files_uploaded  # 10
upload_result.files_failed    # 2  (files that failed validation)
upload_result.errors          # detailed validation errors, e.g.:
                              #   "Missing required item: /parameters.dat"
                              #   "Unexpected item: /unexpected_file.txt"

Step 3: Quality Gate Decision

  • All files pass: Upload proceeds, files are uploaded
  • Some files fail: Upload may proceed with valid files, or fail entirely (depending on configuration)
  • Critical failures: Upload is rejected (e.g., missing required files)

Benefits of Quality Gates

1. Data Integrity

Problem: Without schemas, inconsistent data can enter collections.

Good: The quality gate ensures consistency:

python
schema = [{"pattern": ".*\\.vtk$", "min_occurrence": 1}]
collection = project.create_collection(name="data", schema=schema)

Only .vtk files are accepted, so the structure stays consistent.

2. Early Problem Detection

Problem: Without schemas, problems are discovered after upload.

Good: Problems are caught before upload:

python
upload_result = collection.upload(datasource=datasource)
if upload_result.files_failed > 0:
    # Errors caught during prevalidation, before upload
    logger.error("Validation failed - fix data before upload")

3. Automated Validation

Problem: Without schemas, manual validation is required.

Good: Validation is automated:

python
schema = [{"pattern": ".*\\.vtk$", "min_occurrence": 1}]
collection = project.create_collection(name="data", schema=schema)

Validation happens automatically during upload.

4. Clear Documentation

Problem: Without schemas, the expected structure is unclear.

Good: The schema documents the expected structure:

python
schema = [
    {"pattern": "report_\\d+\\.pdf", "min_occurrence": 1},
    {"name": "parameters.dat", "min_occurrence": 1},
]

The schema clearly documents: "We expect report PDFs and a parameters file."


Example 1: Quality Gate Enforcing Structure

python
import asyncio
from miura.api import AsyncNexus
from miura.sdk import LocalDataSource
from miura.logging import get_logger

logger = get_logger(__name__)

async def quality_gate_example():
    """Demonstrate schema as quality gate."""
    async with AsyncNexus() as nexus:
        # Create collection with strict schema (quality gate)
        project = await nexus.create_project("quality-demo")
        
        # Schema: "We expect simulation folders with specific structure"
        schema = [
            {
                "pattern": "simulation_\\d{3}/",  # simulation_001/, simulation_002/, etc.
                "min_occurrence": 1,  # At least one simulation required
                "children": [
                    {
                        "name": "parameters.dat",  # Required in each simulation
                        "min_occurrence": 1,
                        "max_occurrence": 1
                    },
                    {
                        "pattern": "output_\\d+\\.vtk",  # One or more output files
                        "min_occurrence": 1
                    }
                ]
            }
        ]
        
        collection = await project.create_collection(
            name="simulations",
            schema=schema,
            metadata={
                "description": "Simulation data with quality gate",
                "quality_gate": "enabled"
            }
        )
        
        # Upload data - quality gate validates
        datasource = LocalDataSource("data/simulations")
        upload_result = await collection.upload(datasource=datasource)
        
        # Check quality gate results
        logger.info("=== Quality Gate Results ===")
        if upload_result.files_failed == 0:
            logger.info("PASS: All files passed validation")
            logger.info(f"Uploaded {upload_result.files_uploaded} files")
        else:
            logger.warning("FAIL: Some files failed validation")
            logger.warning(f"Failed: {upload_result.files_failed} files")
            for error in upload_result.errors:
                logger.error(f"  {error}")
            logger.info("Fix: Ensure data matches schema structure")

asyncio.run(quality_gate_example())

Best Practices

1. Always Define Schemas

Schemas are quality gates, so always define them.

Avoid: Creating collections without a schema. Doing so defeats the purpose of quality gates.

2. Make Requirements Clear

Use min_occurrence to make requirements explicit:

python
# Avoid: is parameters.dat required? Unclear
schema = [{"name": "parameters.dat"}]

# Better: the requirement is explicit
schema = [{"name": "parameters.dat", "min_occurrence": 1, "max_occurrence": 1}]

3. Handle Validation Errors

Always check for validation errors:

python
upload_result = await collection.upload(datasource=datasource)

if upload_result.files_failed > 0:
    logger.warning("Quality gate detected issues:")
    for error in upload_result.errors:
        logger.error(f"  {error}")
    # Fix data or adjust schema
