Schemas and Schema Generation

Understanding schemas as quality gates and automatically generating them

This tutorial explains what schemas are, why they exist, and how they act as quality gates for data ingestion.


Introduction

The primary purpose of schemas in the Nexus platform is to act as quality gates for data ingestion. They ensure that only properly structured, consistent data enters your collections.

Why Quality Gates Matter

Without quality gates:

  • Inconsistent data structures
  • Missing required files
  • Unexpected files mixed in
  • Problems discovered too late
  • Manual validation required

With quality gates (schemas):

  • Consistent data structure
  • Required files enforced
  • Unexpected files rejected
  • Problems caught early (during prevalidation)
  • Automated validation

What is a Schema?

A schema is a definition of the expected structure of data in a collection. It specifies:

  • What files and folders are expected: Exact names or patterns
  • Which items are required: min_occurrence constraints
  • How many of each item: max_occurrence constraints
  • Patterns for matching: Regex patterns for flexible matching

python
schema = [
    {
        "pattern": "report_\\d+\\.pdf",  # Matches report_1.pdf, report_2.pdf, etc.
        "min_occurrence": 1,  # At least one report required
        "max_occurrence": None  # No maximum limit
    },
    {
        "name": "parameters.dat",  # Exact file name
        "min_occurrence": 1,  # Required
        "max_occurrence": 1  # Exactly one
    },
    {
        "pattern": "run_\\d+/",  # Folder pattern
        "min_occurrence": 1,  # At least one run folder required
        "children": [  # Nested structure
            {
                "pattern": "output_\\d+\\.vtk",
                "min_occurrence": 1
            }
        ]
    }
]
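
To make the difference between an exact name and a pattern concrete, here is a minimal sketch of how a single entry name could be tested against one schema node. The helper node_matches is hypothetical and not part of miura, and it assumes that a pattern must match the whole name (as re.fullmatch does), which this tutorial does not confirm.

```python
import re


def node_matches(node: dict, entry_name: str) -> bool:
    """Return True if entry_name satisfies the node's name or pattern.

    Hypothetical helper for illustration only; assumes full-string matching.
    """
    if "name" in node:
        return entry_name == node["name"]
    if "pattern" in node:
        return re.fullmatch(node["pattern"], entry_name) is not None
    return False


report_node = {"pattern": r"report_\d+\.pdf", "min_occurrence": 1}
params_node = {"name": "parameters.dat", "min_occurrence": 1, "max_occurrence": 1}

print(node_matches(report_node, "report_12.pdf"))     # True
print(node_matches(report_node, "report_final.pdf"))  # False
print(node_matches(params_node, "parameters.dat"))    # True
```

Raw strings (r"...") are used here so that each backslash is written once; the doubled form "report_\\d+\\.pdf" used elsewhere in this tutorial denotes the same regular expression.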


Schemas as Quality Gates

The Quality Gate Process

When you upload data to a collection, the schema acts as a quality gate:

  1. Prevalidation Phase: Before any files are uploaded, the system validates your data against the schema
  2. Validation Checks:
    • Are all required files present?
    • Are there any unexpected files?
    • Do file names match expected patterns?
    • Are occurrence constraints satisfied?
  3. Gate Decision:
    • Pass: Data matches schema → Upload proceeds
    • Fail: Data doesn't match schema → Upload rejected with detailed errors

python
from miura import Nexus
from miura.api.datasources import LocalDataSource
from miura.logging import get_logger

logger = get_logger(__name__)

with Nexus() as nexus:
    # Create collection with schema (quality gate)
    project = nexus.create_project("quality-demo")

    # Schema defines: "We expect report files and a parameters file"
    schema = [
        {
            "pattern": "report_\\d+\\.pdf",
            "min_occurrence": 1,  # At least one report required
            "max_occurrence": None
        },
        {
            "name": "parameters.dat",
            "min_occurrence": 1,  # Required
            "max_occurrence": 1
        }
    ]

    collection = project.create_collection(
        name="reports",
        schema=schema,
        metadata={"description": "Collection with quality gate"}
    )

    # Upload data - quality gate validates
    datasource = LocalDataSource("data/reports")
    upload_result = collection.upload(datasource=datasource)

    # Check quality gate results
    if upload_result.get("files_failed", 0) == 0:
        logger.info("Quality gate passed - all files validated successfully!")
        logger.info(f"Uploaded {upload_result.get('files_uploaded', 0)} files")
    else:
        logger.warning("Quality gate failed - some files didn't pass validation")
        for error in upload_result.get("errors", []):
            logger.error(f"  Validation error: {error}")
        logger.info("Fix: Ensure your data matches the schema requirements")
# Nexus automatically closes when exiting the with block

Schema Structure

Basic Schema Node

A schema is a list of schema nodes. Each node can be:

Exact Name Match:

python
{
    "name": "parameters.dat",  # Exact file name
    "min_occurrence": 1,
    "max_occurrence": 1
}

Pattern Match:

python
{
    "pattern": "report_\\d+\\.pdf",  # Regex pattern
    "min_occurrence": 1,
    "max_occurrence": None
}

Folder with Children:

python
{
    "pattern": "run_\\d+/",  # Folder pattern (trailing slash)
    "min_occurrence": 1,
    "max_occurrence": None,
    "children": [  # Nested items
        {
            "name": "output.vtk",
            "min_occurrence": 1
        }
    ]
}
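
The way children apply inside each matching folder can be illustrated with a small recursive check. This is a sketch under stated assumptions, not the platform's validator: validate_tree is a hypothetical helper, the directory is modeled as a nested dict (folders end with a slash and map to a dict, files map to None), a missing min_occurrence is treated as 0, and the error wording simply imitates the sample messages in this tutorial.

```python
import re


def _matches(node, name):
    """Exact comparison for "name" nodes, full regex match for "pattern" nodes."""
    if "name" in node:
        return name == node["name"]
    return re.fullmatch(node["pattern"], name) is not None


def validate_tree(schema, tree, prefix="/"):
    """Recursively check a nested-dict directory tree against schema nodes."""
    errors = []
    counts = [0] * len(schema)
    for entry, subtree in tree.items():
        for i, node in enumerate(schema):
            if _matches(node, entry):
                counts[i] += 1
                # Descend into folders whose node declares children
                if "children" in node and isinstance(subtree, dict):
                    errors += validate_tree(node["children"], subtree, prefix + entry)
                break
        else:
            errors.append(f"Unexpected item: {prefix}{entry}")
    for node, count in zip(schema, counts):
        label = node.get("name") or node.get("pattern")
        if count < node.get("min_occurrence", 0):
            errors.append(f"Missing required item: {prefix}{label}")
        max_occurrence = node.get("max_occurrence")
        if max_occurrence is not None and count > max_occurrence:
            errors.append(f"Too many items: {prefix}{label}")
    return errors


schema = [
    {
        "pattern": r"run_\d+/",
        "min_occurrence": 1,
        "max_occurrence": None,
        "children": [{"name": "output.vtk", "min_occurrence": 1}],
    }
]
tree = {"run_1/": {"output.vtk": None}, "run_2/": {}, "notes.txt": None}

for error in validate_tree(schema, tree):
    print(error)
```

With this input the sketch reports that run_2/ lacks its required output.vtk and that notes.txt is not covered by any node.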

Occurrence Constraints

  • min_occurrence: Minimum number of times this item must appear
    • 1 = Required
    • 0 = Optional
  • max_occurrence: Maximum number of times this item can appear
    • 1 = Exactly one
    • None = Unlimited
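
The two constraints combine into a simple range check on the number of matching items. The function below is a minimal sketch with a hypothetical name (occurrence_ok); it assumes that a missing min_occurrence means 0 and a missing max_occurrence means None, which this tutorial does not state explicitly.

```python
from typing import Optional


def occurrence_ok(count: int,
                  min_occurrence: int = 0,
                  max_occurrence: Optional[int] = None) -> bool:
    """Check that count lies within [min_occurrence, max_occurrence]."""
    if count < min_occurrence:
        return False
    # None means there is no upper bound
    if max_occurrence is not None and count > max_occurrence:
        return False
    return True


# Required, exactly one (min=1, max=1)
print(occurrence_ok(1, 1, 1))     # True
print(occurrence_ok(2, 1, 1))     # False
# Optional, unlimited (min=0, max=None)
print(occurrence_ok(0, 0, None))  # True
# Required, unlimited (min=1, max=None)
print(occurrence_ok(0, 1, None))  # False
```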

Step 1: Prevalidation

Before upload, the system scans your data and validates it against the schema:

Prevalidation checks:

  1. Are all required files present? (min_occurrence)
  2. Are there too many files? (max_occurrence)
  3. Are there unexpected files? (not in schema)
  4. Do file names match patterns? (pattern matching)
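
Before any of these checks can run, the data has to be enumerated. The sketch below shows one way such a scan could look using only the standard library. It is an illustration, not the platform's scanner: scan_directory is a hypothetical helper, and marking folders with a trailing slash is an assumption borrowed from the folder patterns used in this tutorial.

```python
import tempfile
from pathlib import Path
from typing import List


def scan_directory(root: Path) -> List[str]:
    """List every entry under root as a POSIX-style relative path.

    Folders get a trailing slash so they can be told apart from files.
    """
    entries = []
    for path in root.rglob("*"):
        relative = path.relative_to(root).as_posix()
        entries.append(relative + "/" if path.is_dir() else relative)
    return sorted(entries)


# Build a small throwaway directory to scan
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "run_1").mkdir()
    (root / "run_1" / "output.vtk").write_text("")
    (root / "parameters.dat").write_text("")
    scanned = scan_directory(root)

print(scanned)  # ['parameters.dat', 'run_1/', 'run_1/output.vtk']
```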

Step 2: Validation Results

The upload result includes validation information:

python
upload_result = {
    "status": "completed",  # or "failed" if validation fails
    "files_uploaded": 10,
    "files_failed": 2,  # Files that failed validation
    "errors": [  # Detailed validation errors
        "Missing required item: /parameters.dat",
        "Unexpected item: /unexpected_file.txt"
    ]
}
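
A result of this shape is easy to post-process. The helpers below are a sketch that relies only on the keys shown in the sample above; gate_passed and group_errors are hypothetical names, and splitting each message at the first colon assumes that real errors follow the "kind: path" form of the two sample strings.

```python
def gate_passed(upload_result: dict) -> bool:
    """True when no files failed and no errors were reported."""
    no_failures = upload_result.get("files_failed", 0) == 0
    return no_failures and not upload_result.get("errors")


def group_errors(upload_result: dict) -> dict:
    """Group error strings by the text before the first colon."""
    grouped = {}
    for error in upload_result.get("errors", []):
        kind, _, detail = error.partition(":")
        grouped.setdefault(kind.strip(), []).append(detail.strip())
    return grouped


sample_result = {
    "status": "completed",
    "files_uploaded": 10,
    "files_failed": 2,
    "errors": [
        "Missing required item: /parameters.dat",
        "Unexpected item: /unexpected_file.txt",
    ],
}

print(gate_passed(sample_result))  # False
for kind, paths in group_errors(sample_result).items():
    print(f"{kind}: {paths}")
```

Grouping by kind makes it easier to see whether a failed upload calls for fixing the data (missing items) or revisiting the schema (unexpected items).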

Step 3: Quality Gate Decision

  • All files pass: Upload proceeds, files are uploaded
  • Some files fail: Upload may proceed with valid files, or fail entirely (depending on configuration)
  • Critical failures: Upload is rejected (e.g., missing required files)

1. Data Integrity

Problem: Without schemas, inconsistent data can enter collections

Good: Quality gate ensures consistency

python
# Only .vtk files are accepted, so the structure stays consistent
schema = [{"pattern": ".*\\.vtk$", "min_occurrence": 1}]
collection = project.create_collection(name="data", schema=schema)

2. Early Problem Detection

Problem: Without schemas, problems are discovered after upload

Good: Problems caught before upload

python
upload_result = collection.upload(datasource=datasource)
if upload_result.get("files_failed", 0) > 0:
    # Errors caught during prevalidation, before upload
    logger.error("Validation failed - fix data before upload")

3. Automated Validation

Problem: Without schemas, manual validation required

Good: Automated validation

python
# Validation happens automatically during upload
schema = [{"pattern": ".*\\.vtk$", "min_occurrence": 1}]
collection = project.create_collection(name="data", schema=schema)

4. Clear Documentation

Problem: Without schemas, expected structure is unclear

Good: Schema documents expected structure

python
# Schema clearly documents: "We expect report PDFs and a parameters file"
schema = [
    {"pattern": "report_\\d+\\.pdf", "min_occurrence": 1},
    {"name": "parameters.dat", "min_occurrence": 1}
]


Example 1: Quality Gate Enforcing Structure

python
import asyncio
from miura.aio import AsyncNexus
from miura.api.datasources import LocalDataSource
from miura.logging import get_logger

logger = get_logger(__name__)

async def quality_gate_example():
    """Demonstrate schema as quality gate."""
    async with AsyncNexus() as nexus:
        # Create collection with strict schema (quality gate)
        project = await nexus.create_project("quality-demo")

        # Schema: "We expect simulation folders with specific structure"
        schema = [
            {
                "pattern": "simulation_\\d{3}/",  # simulation_001/, simulation_002/, etc.
                "min_occurrence": 1,  # At least one simulation required
                "children": [
                    {
                        "name": "parameters.dat",  # Required in each simulation
                        "min_occurrence": 1,
                        "max_occurrence": 1
                    },
                    {
                        "pattern": "output_\\d+\\.vtk",  # One or more output files
                        "min_occurrence": 1
                    }
                ]
            }
        ]

        collection = await project.create_collection(
            name="simulations",
            schema=schema,
            metadata={
                "description": "Simulation data with quality gate",
                "quality_gate": "enabled"
            }
        )

        # Upload data - quality gate validates
        datasource = LocalDataSource("data/simulations")
        upload_result = await collection.upload(datasource=datasource)

        # Check quality gate results
        logger.info("=== Quality Gate Results ===")
        if upload_result.get("files_failed", 0) == 0:
            logger.info("PASS: All files passed validation")
            logger.info(f"Uploaded {upload_result.get('files_uploaded', 0)} files")
        else:
            logger.warning("FAIL: Some files failed validation")
            logger.warning(f"Failed: {upload_result.get('files_failed', 0)} files")
            for error in upload_result.get("errors", []):
                logger.error(f"  {error}")
            logger.info("Fix: Ensure data matches schema structure")

asyncio.run(quality_gate_example())

Example 2: Quality Gate with Auto-Generated Schema

python
import asyncio
from miura.aio import AsyncNexus
from miura.api.datasources import LocalDataSource
from miura.api import generate_schema_from_path, SchemaGenOptions
from miura.logging import get_logger

logger = get_logger(__name__)

async def auto_quality_gate_example():
    """Generate schema from existing data and use as quality gate."""
    async with AsyncNexus() as nexus:
        # Step 1: Generate schema from existing data structure
        logger.info("=== Step 1: Generating Quality Gate ===")
        data_path = "data/my-simulation"
        options = SchemaGenOptions(
            min_files_for_pattern=2,
            default_required=True,  # Make items required by default
            schema_name="auto-quality-gate"
        )
        schema = generate_schema_from_path(data_path, options=options)
        logger.info(f"Generated quality gate with {len(schema)} rules")

        # Step 2: Create collection with generated quality gate
        logger.info("=== Step 2: Creating Collection with Quality Gate ===")
        project = await nexus.create_project("auto-quality-demo")
        collection = await project.create_collection(
            name="simulation-data",
            schema=schema,
            metadata={
                "description": "Collection with auto-generated quality gate",
                "quality_gate_type": "auto-generated"
            }
        )

        # Step 3: Upload data - quality gate validates
        logger.info("=== Step 3: Uploading with Quality Gate Validation ===")
        datasource = LocalDataSource(data_path)
        upload_result = await collection.upload(datasource=datasource)

        # Step 4: Check results
        logger.info("=== Step 4: Quality Gate Results ===")
        if upload_result.get("files_failed", 0) == 0:
            logger.info("Quality gate passed")
        else:
            logger.warning("Quality gate failed")
            for error in upload_result.get("errors", []):
                logger.error(f"  {error}")

asyncio.run(auto_quality_gate_example())


Best Practices

1. Always Define Schemas

Schemas are quality gates, so define one for every collection.

Avoid: Creating collections without schemas. This defeats the purpose of quality gates.

2. Make Requirements Clear

Use min_occurrence to clearly indicate what's required:

Avoid: Unclear requirements

python
schema = [{"name": "parameters.dat"}]  # Is this required? Unclear
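
One way to catch unclear requirements before a schema is used is to scan it for nodes that leave min_occurrence unset. The helper below is a hypothetical sketch (find_implicit_nodes is not a miura function); it walks nested children as well and returns a label for every node without an explicit min_occurrence.

```python
def find_implicit_nodes(schema, prefix=""):
    """Return labels of nodes that do not set min_occurrence explicitly."""
    implicit = []
    for node in schema:
        label = prefix + (node.get("name") or node.get("pattern") or "<unnamed>")
        if "min_occurrence" not in node:
            implicit.append(label)
        # Recurse into nested children, using the parent label as a path prefix
        implicit += find_implicit_nodes(node.get("children", []), label)
    return implicit


schema = [
    {"name": "parameters.dat"},  # requirement left implicit
    {
        "pattern": r"run_\d+/",
        "min_occurrence": 1,
        "children": [{"name": "output.vtk"}],  # child also implicit
    },
]

for label in find_implicit_nodes(schema):
    print(f"min_occurrence not set for: {label}")
```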

3. Review Generated Schemas

When auto-generating schemas, review them:

python
import json

schema = generate_schema_from_path("data/my-simulation")
logger.info("Generated quality gate:")
logger.info(json.dumps(schema, indent=2))

Review and adjust if needed before creating collection

4. Handle Validation Errors

Always check for validation errors:

python
upload_result = await collection.upload(datasource=datasource)

if upload_result.get("files_failed", 0) > 0:
    logger.warning("Quality gate detected issues:")
    for error in upload_result.get("errors", []):
        logger.error(f"  {error}")
# Fix data or adjust schema

Schema Generation

Manually writing collection schemas can be tedious and error-prone, especially when you already have data organized in a filesystem. The schema generation feature automatically infers schemas from your existing directory structure, detecting patterns in filenames and folder names.
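
To give a feel for what pattern inference involves, here is a deliberately simplified toy version. It is not the algorithm used by generate_schema_from_path, whose internals are not described in this tutorial: infer_patterns is a hypothetical function that only generalizes runs of digits, and it borrows the parameter name min_files_for_pattern purely for illustration.

```python
import re
from collections import defaultdict


def infer_patterns(names, min_files_for_pattern=2):
    """Toy inference: group names that differ only in their digit runs."""
    groups = defaultdict(list)
    for name in names:
        # Escape regex metacharacters, then generalize each digit run to \d+
        template = re.sub(r"\d+", lambda match: r"\d+", re.escape(name))
        groups[template].append(name)

    nodes = []
    for template, members in groups.items():
        if len(members) >= min_files_for_pattern:
            # Enough similar names: emit one pattern node for the group
            nodes.append({"pattern": template, "min_occurrence": 1, "max_occurrence": None})
        else:
            # Too few names: keep each one as an exact match
            nodes.extend(
                {"name": member, "min_occurrence": 1, "max_occurrence": 1}
                for member in members
            )
    return nodes


names = ["report_1.pdf", "report_2.pdf", "parameters.dat"]
for node in infer_patterns(names):
    print(node)
```

Here the two report files collapse into a single pattern node, while parameters.dat stays an exact name because only one file has that shape.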

Basic Schema Generation

Simple Example

Generate a schema from a directory:

python
from miura.api import generate_schema_from_path
from miura.logging import get_logger

logger = get_logger(__name__)

# Generate schema from a directory
schema = generate_schema_from_path("data/my-simulation")
logger.info(f"Generated schema with {len(schema)} root-level nodes")

# The schema is a list of schema node dictionaries
for node in schema:
    logger.info(f"Node: {node.get('name') or node.get('pattern')}")

Using the Generated Schema

Use the generated schema to create a collection:

python
from miura import Nexus
from miura.api import generate_schema_from_path
from miura.logging import get_logger

logger = get_logger(__name__)

with Nexus() as nexus:
    # Generate schema
    schema = generate_schema_from_path("data/my-simulation")
    
    # Create project and collection
    project = nexus.create_project("my-project")
    collection = project.create_collection(
        name="my-collection",
        schema=schema,
        metadata={"description": "Auto-generated schema"}
    )
    logger.info(f"Created collection with generated schema: {collection.name}")
    # Nexus automatically closes when exiting the with block

Customizing Pattern Detection

Schema Generation Options

Customize how patterns are detected:

python
from miura.api import generate_schema_from_path, SchemaGenOptions


# Configure schema generation
options = SchemaGenOptions(
    min_files_for_pattern=3,  # Minimum files needed to detect a pattern
    default_required=False,    # Whether items are required by default
    schema_name="my-schema",   # Name for the generated schema
    similarity_threshold=0.7,  # Similarity threshold for pattern detection
    confidence_threshold=0.75  # Confidence threshold for accepting patterns
)

schema = generate_schema_from_path("data/my-simulation", options=options)

Option Reference

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| min_files_for_pattern | int | 2 | Minimum number of similar files needed to detect a pattern |
| default_required | bool | False | Whether items are required by default |
| schema_name | str | None | Name for the generated schema (auto-generated if None) |
| similarity_threshold | float | 0.7 | Similarity threshold for pattern detection (0.0-1.0) |
| confidence_threshold | float | 0.75 | Confidence threshold for accepting patterns (0.0-1.0) |

Adjusting Thresholds

Lower thresholds detect more patterns but may be less accurate; higher thresholds yield fewer, more reliable patterns:

python
# More conservative pattern detection
options = SchemaGenOptions(
    min_files_for_pattern=5,
    similarity_threshold=0.8,   # Higher = fewer but more reliable patterns
    confidence_threshold=0.85   # Higher = only high-confidence patterns
)

schema = generate_schema_from_path("data/my-simulation", options=options)
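
How a similarity threshold separates file names can be illustrated with the standard library. The similarity measure that miura actually uses is not documented here, so difflib.SequenceMatcher is only a stand-in, and similar is a hypothetical helper; the point is simply that raising the threshold makes fewer pairs of names count as alike.

```python
from difflib import SequenceMatcher


def similar(a: str, b: str, similarity_threshold: float = 0.7) -> bool:
    """Stand-in similarity test: compare a 0.0-1.0 ratio against the threshold."""
    return SequenceMatcher(None, a, b).ratio() >= similarity_threshold


# Names that differ by a single digit are similar at the default threshold
print(similar("output_001.vtk", "output_002.vtk"))        # True
# Names that share only an extension are not
print(similar("output_001.vtk", "mesh.vtk"))              # False
# A very strict threshold rejects even near-identical names
print(similar("output_001.vtk", "output_002.vtk", 0.95))  # False
```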


Complete Workflow

Generate a schema and use it to create and populate a collection:

python
import asyncio
from pathlib import Path
from miura.aio import AsyncNexus
from miura.api.datasources import LocalDataSource
from miura.api import generate_schema_from_path, SchemaGenOptions
from miura.logging import get_logger

logger = get_logger(__name__)

async def main():
    async with AsyncNexus() as nexus:
        # Step 1: Generate schema from filesystem
        logger.info("=== Generating Schema ===")
        options = SchemaGenOptions(
            min_files_for_pattern=2,
            default_required=False,
            schema_name="simulation-schema"
        )
        schema = generate_schema_from_path("data/my-simulation", options=options)
        logger.info(f"Generated schema with {len(schema)} root-level nodes")
        

        # Step 2: Create project
        logger.info("=== Creating Project ===")
        project = await nexus.create_project("simulation-project")
        logger.info(f"Created project: {project.name}")
        

        # Step 3: Create collection with generated schema
        logger.info("=== Creating Collection ===")
        collection = await project.create_collection(
            name="simulation-collection",
            schema=schema,
            metadata={
                "description": "Collection with auto-generated schema",
                "source_path": "data/my-simulation"
            }
        )
        logger.info(f"Created collection: {collection.name}")

        # Step 4: Upload data
        logger.info("=== Uploading Data ===")
        data_path = Path("data/my-simulation")
        if data_path.exists():
            datasource = LocalDataSource(str(data_path))
            upload_result = await collection.upload(
                datasource=datasource,
                create_new_version=False
            )
            logger.info(f"Upload completed: {upload_result.get('files_uploaded', 0)} files")
        else:
            logger.warning(f"Data path not found: {data_path}")

asyncio.run(main())

Real-World Example

Simulation Data Structure

Consider a filesystem structure like this:

text
data/
├── simulation_001/
│   ├── parameters.dat
│   ├── results.vtk
│   └── metadata.json
├── simulation_002/
│   ├── parameters.dat
│   ├── results.vtk
│   └── metadata.json
└── simulation_003/
    ├── parameters.dat
    ├── results.vtk
    └── metadata.json

Generating the Schema

python
from miura.api import generate_schema_from_path, SchemaGenOptions
from miura.logging import get_logger
import json

logger = get_logger(__name__)

# Generate schema
options = SchemaGenOptions(
    min_files_for_pattern=2,
    default_required=True,  # Make items required
    schema_name="simulation-schema"
)
schema = generate_schema_from_path("data", options=options)

Using the Schema

python
from miura import Nexus
from miura.api import generate_schema_from_path
from miura.api.datasources import LocalDataSource
from miura.logging import get_logger

logger = get_logger(__name__)

with Nexus() as nexus:
    # Generate schema
    schema = generate_schema_from_path("data")

    # Create collection
    project = nexus.create_project("simulation-project")
    collection = project.create_collection(
        name="simulation-collection",
        schema=schema
    )

    # Upload data (will be validated against the generated schema)
    datasource = LocalDataSource("data")
    upload_result = collection.upload(datasource=datasource)
    logger.info(f"Uploaded {upload_result.get('files_uploaded', 0)} files")
# Nexus automatically closes when exiting the with block

Best Practices

1. Review Generated Schemas

Always review the generated schema before using it:

python
import json

schema = generate_schema_from_path("data/my-simulation")
logger.info("Generated schema:")
logger.info(json.dumps(schema, indent=2))

# Review and adjust if needed before creating collection
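
Since the schema is a plain list of dictionaries, one convenient way to review it is to write it to a JSON file, edit it by hand, and load it back. The sketch below uses a hand-written schema rather than real generator output, and the file name schema.json is arbitrary. Note that None is stored as null in JSON and comes back as None.

```python
import json
import tempfile
from pathlib import Path

# Hand-written stand-in for a generated schema
schema = [
    {"name": "parameters.dat", "min_occurrence": 1, "max_occurrence": 1},
    {"pattern": r"report_\d+\.pdf", "min_occurrence": 0, "max_occurrence": None},
]

with tempfile.TemporaryDirectory() as tmp:
    schema_file = Path(tmp) / "schema.json"
    schema_file.write_text(json.dumps(schema, indent=2))
    # The file could be reviewed or hand-edited at this point
    reloaded = json.loads(schema_file.read_text())

# Example adjustment after review: make the reports required
reloaded[1]["min_occurrence"] = 1
print(json.dumps(reloaded, indent=2))
```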

2. Adjust Thresholds for Your Data

Different data structures may need different thresholds:

python
# For data with many similar files (e.g., timesteps)
options = SchemaGenOptions(
    min_files_for_pattern=3,
    similarity_threshold=0.6,  # Lower for many similar files
    confidence_threshold=0.7
)

3. Use Descriptive Schema Names

Provide meaningful schema names:

python
options = SchemaGenOptions(
    schema_name="cfd-simulation-schema-2024",
    default_required=True
)

4. Validate Before Upload

Generate the schema and validate it matches your expectations:

python
schema = generate_schema_from_path("data/my-simulation")

# Check that patterns were detected
has_patterns = any("pattern" in node for node in schema)
if has_patterns:
    logger.info("Patterns detected in generated schema")
else:
    logger.warning("No patterns detected - all items will be exact matches")

# Review the schema structure
for node in schema:
    if "pattern" in node:
        logger.info(f"Pattern: {node['pattern']}")
    else:
        logger.info(f"Exact: {node.get('name', 'unknown')}")
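
The loop above only looks at root-level nodes. For nested schemas, a small recursive walker gives a fuller picture. This is a sketch with a hypothetical helper name (describe_schema); it treats a missing min_occurrence as optional, which is an assumption, and it runs here on a hand-written schema rather than on real generator output.

```python
def describe_schema(schema, depth=0):
    """Return one indented summary line per node, including nested children."""
    lines = []
    for node in schema:
        kind = "Pattern" if "pattern" in node else "Exact"
        label = node.get("pattern") or node.get("name", "unknown")
        required = "required" if node.get("min_occurrence", 0) >= 1 else "optional"
        lines.append(f"{'  ' * depth}{kind}: {label} ({required})")
        lines.extend(describe_schema(node.get("children", []), depth + 1))
    return lines


sample_schema = [
    {"name": "exact-file.txt", "min_occurrence": 1, "max_occurrence": 1},
    {
        "pattern": r"simulation_\d+/",
        "min_occurrence": 1,
        "max_occurrence": None,
        "children": [{"name": "parameters.dat", "min_occurrence": 0}],
    },
]

for line in describe_schema(sample_schema):
    print(line)
```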

Schema Structure

Generated schemas are lists of schema node dictionaries:

python
[
    {
        "name": "exact-file.txt",  # Exact name (no pattern)
        "min_occurrence": 1,
        "max_occurrence": 1
    },
    {
        "pattern": "report_\\d+\\.pdf",  # Regex pattern
        "min_occurrence": 0,
        "max_occurrence": None
    },
    {
        "pattern": "simulation_\\d+/",  # Folder pattern
        "min_occurrence": 1,
        "max_occurrence": None,
        "children": [  # Nested children
            {
                "name": "parameters.dat",
                "min_occurrence": 1
            }
        ]
    }
]

Pattern Detection Examples

| Filesystem | Generated Pattern |
| --- | --- |
| file1.txt, file2.txt, file3.txt | file\d+\.txt |
| report_001.pdf, report_002.pdf | report_\d{3}\.pdf |
| data_2024-01-01.csv, data_2024-01-02.csv | data_\d{4}-\d{2}-\d{2}\.csv |
| model.h5, model_bis.h5 | `model(?: |
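
The first three rows can be checked directly with Python's re module. The snippet below assumes full-string matching (re.fullmatch), which may differ from how the platform applies patterns; the last row is left out because its pattern is cut off in the table.

```python
import re

# Pattern -> file names it is expected to cover (taken from the table above)
examples = {
    r"file\d+\.txt": ["file1.txt", "file2.txt", "file3.txt"],
    r"report_\d{3}\.pdf": ["report_001.pdf", "report_002.pdf"],
    r"data_\d{4}-\d{2}-\d{2}\.csv": ["data_2024-01-01.csv", "data_2024-01-02.csv"],
}

for pattern, file_names in examples.items():
    covered = all(re.fullmatch(pattern, name) for name in file_names)
    print(f"{pattern}: covers all listed files = {covered}")

# A fixed-width pattern is stricter than \d+: one digit is not enough for \d{3}
print(re.fullmatch(r"report_\d{3}\.pdf", "report_1.pdf") is None)  # True
```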
