Schemas
This tutorial explains what schemas are, why they exist, and how they act as quality gates for data ingestion.
Introduction
The primary purpose of schemas in the Nexus platform is to act as quality gates for data ingestion. They ensure that only properly structured, consistent data enters your collections.
Why Quality Gates Matter
Without quality gates:
- Inconsistent data structures
- Missing required files
- Unexpected files mixed in
- Problems discovered too late
- Manual validation required
With quality gates (schemas):
- Consistent data structure
- Required files enforced
- Unexpected files rejected
- Problems caught early (during prevalidation)
- Automated validation
You can also use permissive schemas (e.g. [{"pattern": ".*", "min_occurrence": 0}]) when you want to accept any structure—see Permissive vs strict schemas below for when that is appropriate.
What is a Schema?
A schema is a list of schema nodes that define the expected structure of data in a collection. Each node uses either name (exact match) or pattern (regex match), plus optional min_occurrence, max_occurrence, and children for nested structure. It specifies:
- What files and folders are expected: Exact names or patterns
- Which items are required: min_occurrence constraints
- How many of each item: max_occurrence constraints
- Patterns for matching: Regex patterns for flexible matching
schema = [
    {
        "pattern": "report_\\d+\\.pdf",  # Matches report_1.pdf, report_2.pdf, etc.
        "min_occurrence": 1,             # At least one report required
        "max_occurrence": None           # No maximum limit
    },
    {
        "name": "parameters.dat",        # Exact file name
        "min_occurrence": 1,             # Required
        "max_occurrence": 1              # Exactly one
    },
    {
        "pattern": "run_\\d+/",          # Folder pattern (trailing slash for folders)
        "min_occurrence": 1,             # At least one run folder required
        "children": [                    # Nested structure
            {
                "pattern": "output_\\d+\\.vtk",
                "min_occurrence": 1
            }
        ]
    }
]
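To make the difference between name and pattern concrete, here is a minimal sketch of how a single node might be tested against an item name. The helper node_matches is hypothetical and not part of the Nexus SDK, and it assumes that a pattern must match the whole item name (re.fullmatch); the platform's actual matching rules may differ.

```python
import re


def node_matches(node: dict, item_name: str) -> bool:
    """Return True if item_name satisfies the node's name or pattern.

    Assumption: a regex pattern has to match the entire item name.
    """
    if "name" in node:
        return item_name == node["name"]
    return re.fullmatch(node["pattern"], item_name) is not None


report_node = {"pattern": "report_\\d+\\.pdf", "min_occurrence": 1}
params_node = {"name": "parameters.dat", "min_occurrence": 1, "max_occurrence": 1}
run_node = {"pattern": "run_\\d+/", "min_occurrence": 1}

print(node_matches(report_node, "report_12.pdf"))     # True
print(node_matches(report_node, "report_final.pdf"))  # False
print(node_matches(params_node, "parameters.dat"))    # True
print(node_matches(run_node, "run_3/"))               # True
print(node_matches(run_node, "run_3"))                # False (no trailing slash)
```

Trying patterns against a few sample names like this, before creating a collection, is a cheap way to catch escaping mistakes such as a single backslash in a non-raw string.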
Schemas as Quality Gates
The Quality Gate Process
When you upload data to a collection, the schema acts as a quality gate:
- Prevalidation Phase: Before any files are uploaded, the system validates your data against the schema
- Validation Checks:
- Are all required files present?
- Are there any unexpected files?
- Do file names match expected patterns?
- Are occurrence constraints satisfied?
- Gate Decision:
- Pass: Data matches schema → Upload proceeds
- Fail: Data doesn't match schema → Upload rejected with detailed errors
from miura import Nexus
from miura.sdk import LocalDataSource
from miura.logging import get_logger

logger = get_logger(__name__)

with Nexus() as nexus:
    # Create collection with schema (quality gate)
    project = nexus.create_project("quality-demo")

    # Schema defines: "We expect report files and a parameters file"
    schema = [
        {
            "pattern": "report_\\d+\\.pdf",
            "min_occurrence": 1,  # At least one report required
            "max_occurrence": None
        },
        {
            "name": "parameters.dat",
            "min_occurrence": 1,  # Required
            "max_occurrence": 1
        }
    ]
    collection = project.create_collection(
        name="reports",
        schema=schema,
        metadata={"description": "Collection with quality gate"}
    )

    # Upload data - quality gate validates
    datasource = LocalDataSource("data/reports")
    upload_result = collection.upload(datasource=datasource)

    # Check quality gate results
    if upload_result.files_failed == 0:
        logger.info("Quality gate passed - all files validated successfully!")
        logger.info(f"Uploaded {upload_result.files_uploaded} files")
    else:
        logger.warning("Quality gate failed - some files didn't pass validation")
        for error in upload_result.errors:
            logger.error(f" Validation error: {error}")
        logger.info("Fix: Ensure your data matches the schema requirements")

# Nexus automatically closes when exiting the with block
Schema Structure
Basic Schema Node
A schema is a list of schema nodes. Each node can be:
Exact Name Match:
{
    "name": "parameters.dat",  # Exact file name
    "min_occurrence": 1,
    "max_occurrence": 1
}
Pattern Match:
{
    "pattern": "report_\\d+\\.pdf",  # Regex pattern
    "min_occurrence": 1,
    "max_occurrence": None
}
Folder with Children:
{
    "pattern": "run_\\d+/",  # Folder pattern (trailing slash)
    "min_occurrence": 1,
    "max_occurrence": None,
    "children": [  # Nested items
        {
            "name": "output.vtk",
            "min_occurrence": 1
        }
    ]
}
Occurrence Constraints
- min_occurrence: Minimum number of times this item must appear
  - 1 = Required
  - 0 = Optional
- max_occurrence: Maximum number of times this item can appear
  - 1 = At most one (exactly one when min_occurrence is also 1)
  - None = Unlimited
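The two constraints amount to a range check on how many items matched a node. The sketch below illustrates that check; occurrence_errors is a hypothetical helper, and treating a missing min_occurrence as 0 and a missing max_occurrence as unlimited is an assumption, not documented platform behavior.

```python
from typing import List, Optional


def occurrence_errors(
    label: str,
    count: int,
    min_occurrence: int = 0,               # assumed default: optional
    max_occurrence: Optional[int] = None,  # assumed default: unlimited
) -> List[str]:
    """Compare a match count against the node's occurrence bounds."""
    errors = []
    if count < min_occurrence:
        errors.append(f"{label}: found {count}, need at least {min_occurrence}")
    if max_occurrence is not None and count > max_occurrence:
        errors.append(f"{label}: found {count}, allowed at most {max_occurrence}")
    return errors


print(occurrence_errors("parameters.dat", 1, 1, 1))  # []
print(occurrence_errors("parameters.dat", 0, 1, 1))  # one "need at least" error
print(occurrence_errors("parameters.dat", 2, 1, 1))  # one "allowed at most" error
print(occurrence_errors("report", 50, 1, None))      # []
```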
Permissive vs strict schemas
You can choose how much the schema enforces structure:
Permissive schema — Accept any files and folders; no required structure. Use a single node with a regex that matches everything and no required items:
# Allow any paths; nothing is required (min_occurrence: 0)
permissive_schema = [{"pattern": ".*", "min_occurrence": 0}]
# Run inside an async function, with `project` created as in the other examples
collection = await project.create_collection(
    name="my-collection",
    schema=permissive_schema,
)
Pros:
- Quick to get started; no validation failures due to structure
- Flexible: any layout and file types are accepted
- Useful for ad-hoc uploads, sandboxes, or when structure is not yet defined
Cons:
- No quality gate: inconsistent or incomplete data can be uploaded
- No early detection of missing or unexpected files
- Downstream consumers cannot rely on a guaranteed layout
Strict schema — Require specific patterns and/or names with min_occurrence >= 1 and optional max_occurrence and children. Use when you need a reliable, documented layout (e.g. pipelines, shared datasets).
When to use which:
- Use permissive for experimentation, one-off uploads, or when the collection is a staging area and validation happens elsewhere.
- Use strict for production collections, pipelines, and whenever others depend on a consistent structure.
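The practical difference shows up when the same set of item names is checked against both kinds of schema. This is a minimal local sketch: unexpected_items is a hypothetical helper rather than an SDK function, it only looks at top-level nodes, and it assumes whole-name regex matching.

```python
import re

permissive_schema = [{"pattern": ".*", "min_occurrence": 0}]
strict_schema = [
    {"pattern": "report_\\d+\\.pdf", "min_occurrence": 1},
    {"name": "parameters.dat", "min_occurrence": 1, "max_occurrence": 1},
]


def unexpected_items(schema, item_names):
    """Return the names that no top-level node in the schema accepts."""
    def accepted(name):
        return any(
            name == node["name"] if "name" in node
            else re.fullmatch(node["pattern"], name) is not None
            for node in schema
        )
    return [name for name in item_names if not accepted(name)]


upload = ["report_1.pdf", "parameters.dat", "scratch.tmp"]
print(unexpected_items(permissive_schema, upload))  # []
print(unexpected_items(strict_schema, upload))      # ['scratch.tmp']
```

The permissive schema lets scratch.tmp through silently, which is convenient in a sandbox and exactly what a production collection usually wants to prevent.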
Step 1: Prevalidation
Before upload, the system scans your data and validates it against the schema:
Prevalidation checks:
1. Are all required files present? (min_occurrence)
2. Are there too many files? (max_occurrence)
3. Are there unexpected files? (not in schema)
4. Do file names match patterns? (pattern matching)
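The four checks above can be sketched as a small recursive function that runs locally, without any upload. Everything here is illustrative: prevalidate, the Tree stand-in for a directory listing, and the "Too many items" message are assumptions, and the real prevalidation logic in Nexus may differ. The "Missing required item" and "Unexpected item" messages follow the format used in this tutorial's sample output.

```python
import re
from typing import Dict, List, Optional

# Hypothetical stand-in for a directory listing:
# files map to None, folders (names ending in "/") map to a nested dict.
Tree = Dict[str, Optional[dict]]


def _matches(node: dict, item: str) -> bool:
    # Check 4. Assumption: a pattern must match the whole item name.
    if "name" in node:
        return item == node["name"]
    return re.fullmatch(node["pattern"], item) is not None


def prevalidate(schema: List[dict], tree: Tree, prefix: str = "/") -> List[str]:
    """Return a list of error strings; an empty list means the gate passes."""
    errors: List[str] = []
    matched = set()
    for node in schema:
        label = node["name"] if "name" in node else node["pattern"]
        hits = [item for item in tree if _matches(node, item)]
        matched.update(hits)
        # Checks 1 and 2: occurrence bounds (missing keys treated as 0 / unlimited).
        if len(hits) < node.get("min_occurrence", 0):
            errors.append(f"Missing required item: {prefix}{label}")
        max_occ = node.get("max_occurrence")
        if max_occ is not None and len(hits) > max_occ:
            errors.append(f"Too many items: {prefix}{label}")
        # Recurse into matched folders when the node declares children.
        for item in hits:
            if "children" in node and tree[item] is not None:
                errors.extend(prevalidate(node["children"], tree[item], prefix + item))
    # Check 3: anything that no node matched is unexpected.
    errors.extend(f"Unexpected item: {prefix}{item}" for item in tree if item not in matched)
    return errors


schema = [
    {"pattern": "report_\\d+\\.pdf", "min_occurrence": 1, "max_occurrence": None},
    {"name": "parameters.dat", "min_occurrence": 1, "max_occurrence": 1},
    {"pattern": "run_\\d+/", "min_occurrence": 1,
     "children": [{"pattern": "output_\\d+\\.vtk", "min_occurrence": 1}]},
]
good = {"report_1.pdf": None, "parameters.dat": None, "run_1/": {"output_1.vtk": None}}
bad = {"report_1.pdf": None, "notes.txt": None, "run_1/": {}}

print(prevalidate(schema, good))  # []
for line in prevalidate(schema, bad):
    print(line)
```

For the bad tree, the sketch reports the missing parameters.dat, the missing output file inside run_1/, and the unexpected notes.txt.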
Step 2: Validation Results
The upload result includes validation information:
upload_result = {
    "status": "completed",  # or "failed" if validation fails
    "files_uploaded": 10,
    "files_failed": 2,      # Files that failed validation
    "errors": [             # Detailed validation errors
        "Missing required item: /parameters.dat",
        "Unexpected item: /unexpected_file.txt"
    ]
}
Step 3: Quality Gate Decision
- All files pass: Upload proceeds, files are uploaded
- Some files fail: Upload may proceed with valid files, or fail entirely (depending on configuration)
- Critical failures: Upload is rejected (e.g., missing required files)
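One way to read the three outcomes above is as a function of the upload result. The sketch below is only an interpretation: UploadResultSketch is a hypothetical stand-in that mirrors the fields shown in Step 2 (it is not the SDK's result class), and treating "Missing required item" errors as the critical case is an assumption drawn from the example in the list, not a documented rule.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class UploadResultSketch:
    """Hypothetical stand-in mirroring the result fields used in this tutorial."""
    status: str
    files_uploaded: int
    files_failed: int
    errors: List[str] = field(default_factory=list)


def gate_decision(result: UploadResultSketch) -> str:
    """Classify a result as 'pass', 'partial', or 'rejected'."""
    if result.files_failed == 0 and not result.errors:
        return "pass"
    # Assumption: a missing required item counts as a critical failure.
    if any(e.startswith("Missing required item") for e in result.errors):
        return "rejected"
    return "partial"


ok = UploadResultSketch("completed", 10, 0)
partial = UploadResultSketch("completed", 9, 1, ["Unexpected item: /unexpected_file.txt"])
critical = UploadResultSketch("failed", 0, 1, ["Missing required item: /parameters.dat"])

print(gate_decision(ok))        # pass
print(gate_decision(partial))   # partial
print(gate_decision(critical))  # rejected
```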
1. Data Integrity
Problem: Without schemas, inconsistent data can enter collections
Good: Quality gate ensures consistency
schema = [{"pattern": ".*\\.vtk$", "min_occurrence": 1}]
collection = project.create_collection(name="data", schema=schema)
Only .vtk files are accepted, structure is consistent
2. Early Problem Detection
Problem: Without schemas, problems are discovered after upload
Good: Problems caught before upload
upload_result = collection.upload(datasource=datasource)
if upload_result.files_failed > 0:
    # Errors caught during prevalidation, before upload
    logger.error("Validation failed - fix data before upload")
3. Automated Validation
Problem: Without schemas, manual validation required
Good: Automated validation
schema = [{"pattern": ".*\\.vtk$", "min_occurrence": 1}]
collection = project.create_collection(name="data", schema=schema)
Validation happens automatically during upload
4. Clear Documentation
Problem: Without schemas, expected structure is unclear
Good: Schema documents expected structure
schema = [
    {"pattern": "report_\\d+\\.pdf", "min_occurrence": 1},
    {"name": "parameters.dat", "min_occurrence": 1},
]
Schema clearly documents: "We expect report PDFs and a parameters file"
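Because a schema is plain data, it can also be rendered as a human-readable summary for a README or a review. The describe function below is a hypothetical helper, not part of the SDK, and it assumes that a missing min_occurrence means optional and a missing max_occurrence means unlimited.

```python
def describe(schema, indent=0):
    """Return one summary line per schema node, indenting children."""
    lines = []
    for node in schema:
        kind = "name" if "name" in node else "pattern"
        min_occ = node.get("min_occurrence", 0)  # assumed default
        max_occ = node.get("max_occurrence")     # None is read as unlimited
        need = "required" if min_occ >= 1 else "optional"
        limit = "no upper limit" if max_occ is None else f"at most {max_occ}"
        lines.append(f"{' ' * indent}{kind} {node[kind]}: {need} (min {min_occ}), {limit}")
        lines.extend(describe(node.get("children", []), indent + 2))
    return lines


doc_schema = [
    {"pattern": "report_\\d+\\.pdf", "min_occurrence": 1},
    {"name": "parameters.dat", "min_occurrence": 1, "max_occurrence": 1},
]
print("\n".join(describe(doc_schema)))
# pattern report_\d+\.pdf: required (min 1), no upper limit
# name parameters.dat: required (min 1), at most 1
```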
Example 1: Quality Gate Enforcing Structure
import asyncio

from miura.api import AsyncNexus
from miura.sdk import LocalDataSource
from miura.logging import get_logger

logger = get_logger(__name__)


async def quality_gate_example():
    """Demonstrate schema as quality gate."""
    async with AsyncNexus() as nexus:
        # Create collection with strict schema (quality gate)
        project = await nexus.create_project("quality-demo")

        # Schema: "We expect simulation folders with specific structure"
        schema = [
            {
                "pattern": "simulation_\\d{3}/",  # simulation_001/, simulation_002/, etc.
                "min_occurrence": 1,  # At least one simulation required
                "children": [
                    {
                        "name": "parameters.dat",  # Required in each simulation
                        "min_occurrence": 1,
                        "max_occurrence": 1
                    },
                    {
                        "pattern": "output_\\d+\\.vtk",  # One or more output files
                        "min_occurrence": 1
                    }
                ]
            }
        ]
        collection = await project.create_collection(
            name="simulations",
            schema=schema,
            metadata={
                "description": "Simulation data with quality gate",
                "quality_gate": "enabled"
            }
        )

        # Upload data - quality gate validates
        datasource = LocalDataSource("data/simulations")
        upload_result = await collection.upload(datasource=datasource)

        # Check quality gate results
        logger.info("=== Quality Gate Results ===")
        if upload_result.files_failed == 0:
            logger.info("PASS: All files passed validation")
            logger.info(f"Uploaded {upload_result.files_uploaded} files")
        else:
            logger.warning("FAIL: Some files failed validation")
            logger.warning(f"Failed: {upload_result.files_failed} files")
            for error in upload_result.errors:
                logger.error(f" {error}")
            logger.info("Fix: Ensure data matches schema structure")


asyncio.run(quality_gate_example())
Best Practices
1. Always Define Schemas
Schemas are quality gates - always define them:
Avoid: Creating collections without schemas
This defeats the purpose of quality gates
2. Make Requirements Clear
Use min_occurrence to clearly indicate what's required:
Avoid: Unclear requirements
schema = [{"name": "parameters.dat"}]  # Is this required? Unclear
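A small lint pass can flag nodes that leave min_occurrence implicit before the schema is ever used. This is a sketch only: nodes_missing_min is a hypothetical helper, not an SDK function, and it makes no claim about what default the platform applies when the key is absent.

```python
def nodes_missing_min(schema, prefix=""):
    """Return labels of nodes (including nested children) without min_occurrence."""
    missing = []
    for node in schema:
        label = prefix + (node["name"] if "name" in node else node["pattern"])
        if "min_occurrence" not in node:
            missing.append(label)
        # Folder labels end with "/", so child labels read like paths.
        missing.extend(nodes_missing_min(node.get("children", []), label))
    return missing


unclear_schema = [
    {"name": "parameters.dat"},  # requirement not stated
    {
        "pattern": "run_\\d+/",
        "min_occurrence": 1,
        "children": [{"name": "output.vtk"}],  # requirement not stated
    },
]
print(nodes_missing_min(unclear_schema))
```

Running this in a test or pre-commit hook keeps every requirement explicit, so readers of the schema never have to guess.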
3. Handle Validation Errors
Always check for validation errors:
upload_result = await collection.upload(datasource=datasource)
if upload_result.files_failed > 0:
    logger.warning("Quality gate detected issues:")
    for error in upload_result.errors:
        logger.error(f" {error}")
    # Fix data or adjust schema
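When an upload produces many errors, grouping them by kind makes it easier to decide whether to fix the data or adjust the schema. The sketch below assumes error strings follow the "Kind: /path" shape seen in this tutorial's sample output; real error messages may be formatted differently, and group_errors is a hypothetical helper.

```python
from collections import defaultdict


def group_errors(errors):
    """Group error strings by the text before the first ': ' separator."""
    grouped = defaultdict(list)
    for error in errors:
        kind, sep, detail = error.partition(": ")
        if sep:
            grouped[kind].append(detail)
        else:
            grouped["Other"].append(error)  # no recognizable prefix
    return dict(grouped)


sample_errors = [
    "Missing required item: /parameters.dat",
    "Unexpected item: /unexpected_file.txt",
    "Unexpected item: /notes.txt",
]
for kind, paths in group_errors(sample_errors).items():
    print(f"{kind}: {len(paths)} -> {paths}")
```

Many "Unexpected item" entries often suggest the schema is too narrow, while "Missing required item" entries usually point at incomplete data.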
Next Steps
- Uploading Data - Upload data with quality gate validation
- Fetching Data - List, retrieve, and download items
- End-to-End Workflows - Complete workflow with quality gates
Related Documentation
- API Reference - Complete API documentation
- Quick Start Guide - Get started with the Nexus API