Schemas and Schema Generation
This tutorial explains what schemas are, why they exist, and how they act as quality gates for data ingestion.
Introduction
The primary purpose of schemas in the Nexus platform is to act as quality gates for data ingestion. They ensure that only properly structured, consistent data enters your collections.
Why Quality Gates Matter
Without quality gates:
- Inconsistent data structures
- Missing required files
- Unexpected files mixed in
- Problems discovered too late
- Manual validation required
With quality gates (schemas):
- Consistent data structure
- Required files enforced
- Unexpected files rejected
- Problems caught early (during prevalidation)
- Automated validation
What is a Schema?
A schema is a definition of the expected structure of data in a collection. It specifies:
- What files and folders are expected: Exact names or patterns
- Which items are required: `min_occurrence` constraints
- How many of each item: `max_occurrence` constraints
- Patterns for matching: Regex patterns for flexible matching
schema = [
    {
        "pattern": "report_\\d+\\.pdf",   # Matches report_1.pdf, report_2.pdf, etc.
        "min_occurrence": 1,              # At least one report required
        "max_occurrence": None            # No maximum limit
    },
    {
        "name": "parameters.dat",         # Exact file name
        "min_occurrence": 1,              # Required
        "max_occurrence": 1               # Exactly one
    },
    {
        "pattern": "run_\\d+/",           # Folder pattern
        "min_occurrence": 1,              # At least one run folder required
        "children": [                     # Nested structure
            {
                "pattern": "output_\\d+\\.vtk",
                "min_occurrence": 1
            }
        ]
    }
]
Schemas as Quality Gates
The Quality Gate Process
When you upload data to a collection, the schema acts as a quality gate:
- Prevalidation Phase: Before any files are uploaded, the system validates your data against the schema
- Validation Checks:
- Are all required files present?
- Are there any unexpected files?
- Do file names match expected patterns?
- Are occurrence constraints satisfied?
- Gate Decision:
- Pass: Data matches schema → Upload proceeds
- Fail: Data doesn't match schema → Upload rejected with detailed errors
from miura import Nexus
from miura.api.datasources import LocalDataSource
from miura.logging import get_logger

logger = get_logger(__name__)

with Nexus() as nexus:
    # Create collection with schema (quality gate)
    project = nexus.create_project("quality-demo")
# Schema defines: "We expect report files and a parameters file"
schema = [
{
"pattern": "report_\\d+\\.pdf",
"min_occurrence": 1, # At least one report required
"max_occurrence": None
},
{
"name": "parameters.dat",
"min_occurrence": 1, # Required
"max_occurrence": 1
}
]
collection = project.create_collection(
name="reports",
schema=schema,
metadata={"description": "Collection with quality gate"}
)
# Upload data - quality gate validates
datasource = LocalDataSource("data/reports")
upload_result = collection.upload(datasource=datasource)
# Check quality gate results
if upload_result.get("files_failed", 0) == 0:
logger.info("Quality gate passed - all files validated successfully!")
logger.info(f"Uploaded {upload_result.get('files_uploaded', 0)} files")
else:
logger.warning("Quality gate failed - some files didn't pass validation")
for error in upload_result.get("errors", []):
logger.error(f" Validation error: {error}")
logger.info("Fix: Ensure your data matches the schema requirements")
# Nexus automatically closes when exiting the with block
Schema Structure
Basic Schema Node
A schema is a list of schema nodes. Each node can be:
Exact Name Match:
{
"name": "parameters.dat", # Exact file name
"min_occurrence": 1,
"max_occurrence": 1
}
Pattern Match:
{
"pattern": "report_\\d+\\.pdf", # Regex pattern
"min_occurrence": 1,
"max_occurrence": None
}
Folder with Children:
{
"pattern": "run_\\d+/", # Folder pattern (trailing slash)
"min_occurrence": 1,
"max_occurrence": None,
"children": [ # Nested items
{
"name": "output.vtk",
"min_occurrence": 1
}
]
}
Occurrence Constraints
- `min_occurrence`: Minimum number of times this item must appear
  - `1` = Required
  - `0` = Optional
- `max_occurrence`: Maximum number of times this item can appear
  - `1` = Exactly one
  - `None` = Unlimited
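Taken together, the two fields cover the usual cases. A minimal sketch (the file names here are illustrative, not part of any real collection):

```python
# Common occurrence-constraint combinations for schema nodes.
required_once = {"name": "parameters.dat", "min_occurrence": 1, "max_occurrence": 1}
optional_once = {"name": "README.md", "min_occurrence": 0, "max_occurrence": 1}
one_or_more = {"pattern": "report_\\d+\\.pdf", "min_occurrence": 1, "max_occurrence": None}
zero_or_more = {"pattern": "log_\\d+\\.txt", "min_occurrence": 0, "max_occurrence": None}

schema = [required_once, optional_once, one_or_more, zero_or_more]
```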
Step 1: Prevalidation
Before upload, the system scans your data and validates it against the schema:
Prevalidation checks:
1. Are all required files present? (min_occurrence)
2. Are there too many files? (max_occurrence)
3. Are there unexpected files? (not in schema)
4. Do file names match patterns? (pattern matching)
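These four checks can be sketched as a standalone validator. This is a hypothetical helper for illustration, not part of the miura API, and it ignores folders and nested `children` for brevity:

```python
import re

def prevalidate(file_names, schema):
    """Simplified, flat version of the four prevalidation checks."""
    errors = []
    matched = set()
    for node in schema:
        # Check 4: exact-name or pattern matching
        if "name" in node:
            hits = [f for f in file_names if f == node["name"]]
        else:
            hits = [f for f in file_names if re.fullmatch(node["pattern"], f)]
        matched.update(hits)
        # Check 1: required files present (min_occurrence)
        if len(hits) < node.get("min_occurrence", 0):
            errors.append(f"Missing required item: {node.get('name') or node.get('pattern')}")
        # Check 2: not too many files (max_occurrence)
        max_occ = node.get("max_occurrence")
        if max_occ is not None and len(hits) > max_occ:
            errors.append(f"Too many occurrences: {node.get('name') or node.get('pattern')}")
    # Check 3: no unexpected files
    for f in file_names:
        if f not in matched:
            errors.append(f"Unexpected item: {f}")
    return errors

schema = [
    {"name": "parameters.dat", "min_occurrence": 1, "max_occurrence": 1},
    {"pattern": r"report_\d+\.pdf", "min_occurrence": 1, "max_occurrence": None},
]
print(prevalidate(["parameters.dat", "report_1.pdf", "notes.txt"], schema))
# → ['Unexpected item: notes.txt']
```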
Step 2: Validation Results
The upload result includes validation information:
upload_result = {
"status": "completed", # or "failed" if validation fails
"files_uploaded": 10,
"files_failed": 2, # Files that failed validation
"errors": [ # Detailed validation errors
"Missing required item: /parameters.dat",
"Unexpected item: /unexpected_file.txt"
]
}
Step 3: Quality Gate Decision
- All files pass: Upload proceeds, files are uploaded
- Some files fail: Upload may proceed with valid files, or fail entirely (depending on configuration)
- Critical failures: Upload is rejected (e.g., missing required files)
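If you want a single boolean out of the result dictionary, a tiny helper works (a sketch assuming the `files_failed` and `errors` keys shown in Step 2):

```python
def gate_passed(upload_result):
    """Return True when the quality gate passed: no failed files and no errors."""
    return upload_result.get("files_failed", 0) == 0 and not upload_result.get("errors")

print(gate_passed({"status": "completed", "files_uploaded": 10, "files_failed": 0, "errors": []}))  # True
print(gate_passed({"status": "failed", "files_failed": 2, "errors": ["Missing required item: /parameters.dat"]}))  # False
```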
Benefits of Schemas as Quality Gates
1. Data Integrity
Problem: Without schemas, inconsistent data can enter collections
Good: Quality gate ensures consistency
schema = [{"pattern": ".*\\.vtk$", "min_occurrence": 1}]
collection = project.create_collection(name="data", schema=schema)
Only .vtk files are accepted, structure is consistent
2. Early Problem Detection
Problem: Without schemas, problems are discovered after upload
Good: Problems caught before upload
upload_result = collection.upload(datasource=datasource)
if upload_result.get("files_failed", 0) > 0:
    # Errors caught during prevalidation, before upload
    logger.error("Validation failed - fix data before upload")
3. Automated Validation
Problem: Without schemas, manual validation required
Good: Automated validation
schema = [{"pattern": ".*\\.vtk$", "min_occurrence": 1}]
collection = project.create_collection(name="data", schema=schema)
Validation happens automatically during upload
4. Clear Documentation
Problem: Without schemas, expected structure is unclear
Good: Schema documents expected structure
schema = [
    {"pattern": "report_\\d+\\.pdf", "min_occurrence": 1},
    {"name": "parameters.dat", "min_occurrence": 1}
]
Schema clearly documents: "We expect report PDFs and a parameters file"
Example 1: Quality Gate Enforcing Structure
import asyncio
from miura.aio import AsyncNexus
from miura.api.datasources import LocalDataSource
from miura.logging import get_logger
logger = get_logger(__name__)
async def quality_gate_example():
"""Demonstrate schema as quality gate."""
async with AsyncNexus() as nexus:
# Create collection with strict schema (quality gate)
project = await nexus.create_project("quality-demo")
# Schema: "We expect simulation folders with specific structure"
schema = [
{
"pattern": "simulation_\\d{3}/", # simulation_001/, simulation_002/, etc.
"min_occurrence": 1, # At least one simulation required
"children": [
{
"name": "parameters.dat", # Required in each simulation
"min_occurrence": 1,
"max_occurrence": 1
},
{
"pattern": "output_\\d+\\.vtk", # One or more output files
"min_occurrence": 1
}
]
}
]
collection = await project.create_collection(
name="simulations",
schema=schema,
metadata={
"description": "Simulation data with quality gate",
"quality_gate": "enabled"
}
)
# Upload data - quality gate validates
datasource = LocalDataSource("data/simulations")
upload_result = await collection.upload(datasource=datasource)
# Check quality gate results
logger.info("=== Quality Gate Results ===")
if upload_result.get("files_failed", 0) == 0:
logger.info("PASS: All files passed validation")
logger.info(f"Uploaded {upload_result.get('files_uploaded', 0)} files")
else:
logger.warning("FAIL: Some files failed validation")
logger.warning(f"Failed: {upload_result.get('files_failed', 0)} files")
for error in upload_result.get("errors", []):
logger.error(f" {error}")
logger.info("Fix: Ensure data matches schema structure")
asyncio.run(quality_gate_example())
Example 2: Quality Gate with Auto-Generated Schema
import asyncio
from miura.aio import AsyncNexus
from miura.api.datasources import LocalDataSource
from miura.api import generate_schema_from_path, SchemaGenOptions
from miura.logging import get_logger
logger = get_logger(__name__)
async def auto_quality_gate_example():
"""Generate schema from existing data and use as quality gate."""
async with AsyncNexus() as nexus:
# Step 1: Generate schema from existing data structure
logger.info("=== Step 1: Generating Quality Gate ===")
data_path = "data/my-simulation"
options = SchemaGenOptions(
min_files_for_pattern=2,
default_required=True, # Make items required by default
schema_name="auto-quality-gate"
)
schema = generate_schema_from_path(data_path, options=options)
logger.info(f"Generated quality gate with {len(schema)} rules")
# Step 2: Create collection with generated quality gate
logger.info("=== Step 2: Creating Collection with Quality Gate ===")
project = await nexus.create_project("auto-quality-demo")
collection = await project.create_collection(
name="simulation-data",
schema=schema,
metadata={
"description": "Collection with auto-generated quality gate",
"quality_gate_type": "auto-generated"
}
)
# Step 3: Upload data - quality gate validates
logger.info("=== Step 3: Uploading with Quality Gate Validation ===")
datasource = LocalDataSource(data_path)
upload_result = await collection.upload(datasource=datasource)
# Step 4: Check results
logger.info("=== Step 4: Quality Gate Results ===")
if upload_result.get("files_failed", 0) == 0:
logger.info("Quality gate passed")
else:
logger.warning("Quality gate failed")
for error in upload_result.get("errors", []):
logger.error(f" {error}")
asyncio.run(auto_quality_gate_example())
Best Practices
1. Always Define Schemas
Schemas are quality gates - always define them:
Avoid: Creating collections without schemas
This defeats the purpose of quality gates
2. Make Requirements Clear
Use min_occurrence to clearly indicate what's required:
Avoid: Unclear requirements
schema = [{"name": "parameters.dat"}]  # Is this required? Unclear
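The explicit version leaves no doubt about what the gate enforces:

```python
# Prefer: requirements spelled out
schema = [
    {
        "name": "parameters.dat",
        "min_occurrence": 1,  # Required
        "max_occurrence": 1,  # Exactly one
    }
]
```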
3. Review Generated Schemas
When auto-generating schemas, review them:
import json

from miura.api import generate_schema_from_path
from miura.logging import get_logger

logger = get_logger(__name__)

schema = generate_schema_from_path("data/my-simulation")
logger.info("Generated quality gate:")
logger.info(json.dumps(schema, indent=2))
# Review and adjust if needed before creating collection
4. Handle Validation Errors
Always check for validation errors:
upload_result = await collection.upload(datasource=datasource)
if upload_result.get("files_failed", 0) > 0:
logger.warning("Quality gate detected issues:")
for error in upload_result.get("errors", []):
logger.error(f" {error}")
# Fix data or adjust schema
Schema Generation
Manually writing collection schemas can be tedious and error-prone, especially when you already have data organized in a filesystem. The schema generation feature automatically infers schemas from your existing directory structure, detecting patterns in filenames and folder names.
Basic Schema Generation
Simple Example
Generate a schema from a directory:
from pathlib import Path
from miura.api import generate_schema_from_path
from miura.logging import get_logger
logger = get_logger(__name__)
# Generate schema from a directory
schema = generate_schema_from_path("data/my-simulation")
logger.info(f"Generated schema with {len(schema)} root-level nodes")

# The schema is a list of schema node dictionaries
for node in schema:
    logger.info(f"Node: {node.get('name')} or {node.get('pattern')}")
Using the Generated Schema
Use the generated schema to create a collection:
from miura import Nexus
from miura.api import generate_schema_from_path
from miura.logging import get_logger
logger = get_logger(__name__)
with Nexus() as nexus:
# Generate schema
schema = generate_schema_from_path("data/my-simulation")
# Create project and collection
project = nexus.create_project("my-project")
collection = project.create_collection(
name="my-collection",
schema=schema,
metadata={"description": "Auto-generated schema"}
)
logger.info(f"Created collection with generated schema: {collection.name}")
# Nexus automatically closes when exiting the with block
Customizing Pattern Detection
Schema Generation Options
Customize how patterns are detected:
from miura.api import generate_schema_from_path, SchemaGenOptions
# Configure schema generation
options = SchemaGenOptions(
min_files_for_pattern=3, # Minimum files needed to detect a pattern
default_required=False, # Whether items are required by default
schema_name="my-schema", # Name for the generated schema
similarity_threshold=0.7, # Similarity threshold for pattern detection
confidence_threshold=0.75 # Confidence threshold for accepting patterns
)
schema = generate_schema_from_path("data/my-simulation", options=options)
Option Reference
| Option | Type | Default | Description |
|---|---|---|---|
| `min_files_for_pattern` | int | 2 | Minimum number of similar files needed to detect a pattern |
| `default_required` | bool | False | Whether items are required by default |
| `schema_name` | str | None | Name for the generated schema (auto-generated if None) |
| `similarity_threshold` | float | 0.7 | Similarity threshold for pattern detection (0.0-1.0) |
| `confidence_threshold` | float | 0.75 | Confidence threshold for accepting patterns (0.0-1.0) |
Adjusting Thresholds
Lower thresholds detect more patterns but may be less accurate; higher thresholds are more conservative:
# More conservative pattern detection
options = SchemaGenOptions(
    min_files_for_pattern=5,
    similarity_threshold=0.8,   # Higher = fewer but more reliable patterns
    confidence_threshold=0.85   # Higher = only high-confidence patterns
)
schema = generate_schema_from_path("data/my-simulation", options=options)
Complete Workflow
Generate a schema and use it to create and populate a collection:
import asyncio
from pathlib import Path
from miura.aio import AsyncNexus
from miura.api.datasources import LocalDataSource
from miura.api import generate_schema_from_path, SchemaGenOptions
from miura.logging import get_logger
logger = get_logger(__name__)
async def main():
async with AsyncNexus() as nexus:
# Step 1: Generate schema from filesystem
logger.info("=== Generating Schema ===")
options = SchemaGenOptions(
min_files_for_pattern=2,
default_required=False,
schema_name="simulation-schema"
)
schema = generate_schema_from_path("data/my-simulation", options=options)
logger.info(f"Generated schema with {len(schema)} root-level nodes")
# Step 2: Create project
logger.info("=== Creating Project ===")
project = await nexus.create_project("simulation-project")
logger.info(f"Created project: {project.name}")
# Step 3: Create collection with generated schema
logger.info("=== Creating Collection ===")
collection = await project.create_collection(
name="simulation-collection",
schema=schema,
metadata={
"description": "Collection with auto-generated schema",
"source_path": "data/my-simulation"
}
)
logger.info(f"Created collection: {collection.name}")
# Step 4: Upload data
logger.info("=== Uploading Data ===")
data_path = Path("data/my-simulation")
if data_path.exists():
datasource = LocalDataSource(str(data_path))
upload_result = await collection.upload(
datasource=datasource,
create_new_version=False
)
logger.info(f"Upload completed: {upload_result.get('files_uploaded', 0)} files")
else:
logger.warning(f"Data path not found: {data_path}")
asyncio.run(main())
Real-World Example
Simulation Data Structure
Consider a filesystem structure like this:
data/
├── simulation_001/
│ ├── parameters.dat
│ ├── results.vtk
│ └── metadata.json
├── simulation_002/
│ ├── parameters.dat
│ ├── results.vtk
│ └── metadata.json
└── simulation_003/
├── parameters.dat
├── results.vtk
└── metadata.json
Generating the Schema
from miura.api import generate_schema_from_path, SchemaGenOptions
from miura.logging import get_logger
import json
logger = get_logger(__name__)
# Generate schema
options = SchemaGenOptions(
min_files_for_pattern=2,
default_required=True, # Make items required
schema_name="simulation-schema"
)
schema = generate_schema_from_path("data", options=options)
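For the tree above, a plausible generated schema looks like the following. This is illustrative output only; the exact patterns and constraints depend on the generator version and the options you pass:

```python
schema = [
    {
        "pattern": "simulation_\\d{3}/",  # Detected from simulation_001/ .. simulation_003/
        "min_occurrence": 1,
        "max_occurrence": None,
        "children": [
            {"name": "parameters.dat", "min_occurrence": 1, "max_occurrence": 1},
            {"name": "results.vtk", "min_occurrence": 1, "max_occurrence": 1},
            {"name": "metadata.json", "min_occurrence": 1, "max_occurrence": 1},
        ],
    }
]
```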
Using the Schema
from miura import Nexus
from miura.api import generate_schema_from_path
from miura.api.datasources import LocalDataSource
from miura.logging import get_logger

logger = get_logger(__name__)

with Nexus() as nexus:
    # Generate schema
    schema = generate_schema_from_path("data")
# Create collection
project = nexus.create_project("simulation-project")
collection = project.create_collection(
name="simulation-collection",
schema=schema
)
# Upload data (will be validated against the generated schema)
datasource = LocalDataSource("data")
upload_result = collection.upload(datasource=datasource)
logger.info(f"Uploaded {upload_result.get('files_uploaded', 0)} files")
# Nexus automatically closes when exiting the with block
Best Practices
1. Review Generated Schemas
Always review the generated schema before using it:
import json

from miura.api import generate_schema_from_path
from miura.logging import get_logger

logger = get_logger(__name__)

schema = generate_schema_from_path("data/my-simulation")
logger.info("Generated schema:")
logger.info(json.dumps(schema, indent=2))
# Review and adjust if needed before creating collection
2. Adjust Thresholds for Your Data
Different data structures may need different thresholds:
# For data with many similar files (e.g., timesteps)
options = SchemaGenOptions(
    min_files_for_pattern=3,
    similarity_threshold=0.6,   # Lower for many similar files
    confidence_threshold=0.7
)
3. Use Descriptive Schema Names
Provide meaningful schema names:
options = SchemaGenOptions(
schema_name="cfd-simulation-schema-2024",
default_required=True
)
4. Validate Before Upload
Generate the schema and validate it matches your expectations:
schema = generate_schema_from_path("data/my-simulation")
# Check that patterns were detected
has_patterns = any("pattern" in node for node in schema)
if has_patterns:
logger.info("Patterns detected in generated schema")
else:
logger.warning("No patterns detected - all items will be exact matches")
# Review the schema structure
for node in schema:
if "pattern" in node:
logger.info(f"Pattern: {node['pattern']}")
else:
logger.info(f"Exact: {node.get('name', 'unknown')}")
Schema Structure
Generated schemas are lists of schema node dictionaries:
[
{
"name": "exact-file.txt", # Exact name (no pattern)
"min_occurrence": 1,
"max_occurrence": 1
},
{
"pattern": "report_\\d+\\.pdf", # Regex pattern
"min_occurrence": 0,
"max_occurrence": None
},
{
"pattern": "simulation_\\d+/", # Folder pattern
"min_occurrence": 1,
"max_occurrence": None,
"children": [ # Nested children
{
"name": "parameters.dat",
"min_occurrence": 1
}
]
}
]
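Because nodes can nest via `children`, walking a generated schema is a small recursion. A sketch that counts required nodes (hypothetical helper, not part of the miura API):

```python
def count_required(schema):
    """Recursively count schema nodes with min_occurrence >= 1."""
    total = 0
    for node in schema:
        if node.get("min_occurrence", 0) >= 1:
            total += 1
        total += count_required(node.get("children", []))
    return total

schema = [
    {"name": "exact-file.txt", "min_occurrence": 1, "max_occurrence": 1},
    {"pattern": "report_\\d+\\.pdf", "min_occurrence": 0, "max_occurrence": None},
    {
        "pattern": "simulation_\\d+/",
        "min_occurrence": 1,
        "children": [{"name": "parameters.dat", "min_occurrence": 1}],
    },
]
print(count_required(schema))  # → 3
```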
Pattern Detection Examples
| Filesystem | Generated Pattern |
|---|---|
| `file1.txt`, `file2.txt`, `file3.txt` | `file\d+\.txt` |
| `report_001.pdf`, `report_002.pdf` | `report_\d{3}\.pdf` |
| `data_2024-01-01.csv`, `data_2024-01-02.csv` | `data_\d{4}-\d{2}-\d{2}\.csv` |
| `model.h5`, `model_bis.h5` | `model(?:_bis)?\.h5` |
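Generated patterns are ordinary regular expressions, so you can sanity-check candidate file names against them with Python's `re.fullmatch`:

```python
import re

pattern = r"report_\d{3}\.pdf"  # A pattern like those in the table above
print(bool(re.fullmatch(pattern, "report_001.pdf")))  # True
print(bool(re.fullmatch(pattern, "report_1.pdf")))    # False: needs exactly three digits
```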
Next Steps
- Uploading Data - Upload data with quality gate validation
- Fetching Data - List, retrieve, and download items
- End-to-End Workflows - Complete workflow with quality gates
Related Documentation
- API Reference - Complete API documentation
- Quick Start Guide - Get started with the Nexus API