Uploading Data

Upload data to collections with automatic schema validation

This tutorial covers uploading data to collections, including schema validation, data sources, and upload workflows.


Introduction

Uploading data to a collection involves:

  1. Setting a data source: Tell the collection where to find your data
  2. Prevalidation: The collection validates your data against its schema
  3. Upload: Validated files are uploaded to the collection

The schema acts as a quality gate - it ensures only properly structured data enters your collection.
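To make those steps concrete, here is the workflow in miniature; every call in this sketch reappears later in this tutorial:

python
from miura import Nexus
from miura.api.datasources import LocalDataSource

with Nexus() as nexus:
    project = nexus.create_project("my-project")
    collection = project.create_collection(
        name="reports",
        schema=[{"pattern": "report_\\d+\\.pdf", "min_occurrence": 1, "max_occurrence": None}]
    )

    # Step 1: data source - where the files live
    datasource = LocalDataSource("data/reports")

    # Steps 2 and 3: prevalidation against the schema happens automatically,
    # then the files that pass are uploaded
    upload_result = collection.upload(datasource=datasource)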


What is a Schema?

A schema defines the expected structure of data in a collection. It specifies:

  • What files and folders are expected
  • Which items are required vs optional
  • Patterns for matching files (e.g., report_\d+\.pdf)
  • Occurrence constraints (min/max occurrences)
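For example, a minimal schema requiring at least one report PDF and exactly one parameters file could look like this (the pattern/min_occurrence/max_occurrence keys are the ones used throughout this tutorial):

python
# Each entry describes one expected item in the collection
schema = [
    {
        "pattern": "report_\\d+\\.pdf",  # regex matched against file names
        "min_occurrence": 1,             # required: at least one must be present
        "max_occurrence": None           # no upper limit
    },
    {
        "pattern": "parameters\\.dat",   # one fixed file
        "min_occurrence": 1,
        "max_occurrence": 1              # at most one allowed
    }
]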

Schemas as Quality Gates

The primary purpose of schemas is to act as quality gates for data ingestion. They ensure:

  1. Data Integrity: Only data matching the expected structure is accepted
  2. Consistency: All data in a collection follows the same organizational pattern
  3. Early Detection: Problems are caught before upload, not after
  4. Documentation: The schema documents what data structure is expected

python
from miura import Nexus
from miura.api.datasources import LocalDataSource
from miura.logging import get_logger

logger = get_logger(__name__)

with Nexus() as nexus:
    # Create collection with schema (quality gate)
    project = nexus.create_project("my-project")

    # Schema defines: "We expect files matching report_\d+\.pdf"
    schema = [
        {
            "pattern": "report_\\d+\\.pdf",
            "min_occurrence": 1,    # At least one report file required
            "max_occurrence": None  # No maximum limit
        }
    ]

    collection = project.create_collection(
        name="reports",
        schema=schema
    )

    # Upload data - schema validates before upload
    datasource = LocalDataSource("data/reports")
    upload_result = collection.upload(datasource=datasource)

    # If data doesn't match schema, validation fails
    if upload_result.get("files_failed", 0) > 0:
        logger.warning("Some files failed validation against schema")
        for error in upload_result.get("errors", []):
            logger.error(f"Validation error: {error}")

# Nexus automatically closes when exiting the with block

Benefits of Schema-Based Quality Gates

  1. Prevents Bad Data: Invalid files are rejected before upload
  2. Enforces Standards: Ensures all data follows the same structure
  3. Clear Expectations: Schema documents what data is acceptable
  4. Automated Validation: No manual checking needed

Simple Upload Workflow

python
import asyncio
from pathlib import Path
from miura.aio import AsyncNexus
from miura.api import UploadMode
from miura.api.datasources import LocalDataSource
from miura.logging import get_logger

logger = get_logger(__name__)

async def basic_upload():
    async with AsyncNexus() as nexus:
        # Get project and collection
        project = await nexus.get_project("my-project")
        collection = await project.get_collection("my-collection")

        # Create data source
        data_path = Path("data/my-simulation")
        datasource = LocalDataSource(str(data_path))

        # Upload data
        logger.info("Uploading data...")
        upload_result = await collection.upload(
            datasource=datasource,
            mode=UploadMode.APPEND
        )

        logger.info("Upload completed:")
        logger.info(f"  Files uploaded: {upload_result.get('files_uploaded', 0)}")
        logger.info(f"  Files failed: {upload_result.get('files_failed', 0)}")
        logger.info(f"  Total size: {upload_result.get('total_size', 0):,} bytes")

asyncio.run(basic_upload())

Upload with Versioning

Create a new version for each upload:

python
from miura.api import UploadMode

upload_result = await collection.upload(
    datasource=datasource,
    mode=UploadMode.REPLACE  # Replaces all existing items
)

logger.info(f"Upload ID: {upload_result.get('upload_id')}")


Local Data Source

Upload from a local directory:

python
from pathlib import Path
from miura.api.datasources import LocalDataSource

# Create local data source
data_path = Path("data/my-simulation")
datasource = LocalDataSource(str(data_path))

# Check data source
logger.info(f"Data source: {datasource.path}")
logger.info(f"Files: {datasource.get_file_count()}")
logger.info(f"Total size: {datasource.get_size_bytes():,} bytes")

Data Source Information

python
datasource = LocalDataSource("data/my-simulation")

# Get file count
file_count = datasource.get_file_count()
logger.info(f"Files to upload: {file_count}")

# Get file list
files = datasource.get_file_list()
for file in files[:10]:  # Show first 10
    logger.info(f"  - {file}")

Upload Workflow

Complete Upload Process

The upload process consists of several steps:

  1. Set Data Source: Tell the collection where to find data
  2. Prevalidation: Validate data against schema
  3. Upload: Upload validated files
python
import asyncio
from pathlib import Path
from miura.aio import AsyncNexus
from miura.api import UploadMode
from miura.api.datasources import LocalDataSource
from miura.logging import get_logger

logger = get_logger(__name__)

async def complete_upload_process():
    async with AsyncNexus() as nexus:
        project = await nexus.get_project("my-project")
        collection = await project.get_collection("my-collection")

        # Step 1: Create data source
        logger.info("=== Step 1: Creating Data Source ===")
        data_path = Path("data/my-simulation")
        if not data_path.exists():
            logger.error(f"Data path not found: {data_path}")
            return

        datasource = LocalDataSource(str(data_path))
        logger.info(f"Data source: {datasource.path}")
        logger.info(f"Files: {datasource.get_file_count()}")

        # Step 2: Upload (includes prevalidation)
        logger.info("=== Step 2: Uploading Data ===")
        logger.info("Prevalidation will occur automatically...")

        upload_result = await collection.upload(
            datasource=datasource,
            mode=UploadMode.APPEND
        )

        # Step 3: Check results
        logger.info("=== Step 3: Upload Results ===")
        logger.info(f"Status: {upload_result.get('status', 'unknown')}")
        logger.info(f"Files uploaded: {upload_result.get('files_uploaded', 0)}")
        logger.info(f"Files failed: {upload_result.get('files_failed', 0)}")
        logger.info(f"Total size: {upload_result.get('total_size', 0):,} bytes")

        # Check for validation errors
        if upload_result.get("files_failed", 0) > 0:
            logger.warning("Some files failed validation:")
            for error in upload_result.get("errors", []):
                logger.error(f"  - {error}")

asyncio.run(complete_upload_process())


Handling Validation Errors

Understanding Validation Errors

When data doesn't match the schema, validation errors occur:

python
upload_result = await collection.upload(datasource=datasource)

if upload_result.get("files_failed", 0) > 0:
    logger.warning("Validation errors detected:")

    # Check validation errors
    errors = upload_result.get("errors", [])
    for error in errors:
        logger.error(f"Error: {error}")

# Common error types:
# - "Unexpected item: /path/to/file"
# - "Missing required item: /path/to/file"
# - "Too many occurrences: /path/to/file"

Fixing Validation Errors

  1. Review the schema: Understand what's expected
  2. Check your data: Ensure it matches the schema
  3. Adjust schema or data: Either fix your data or update the schema

Example: Handling missing required files

python
upload_result = await collection.upload(datasource=datasource)

if upload_result.get("files_failed", 0) > 0:
    errors = upload_result.get("errors", [])

    # Check for missing required files
    missing_files = [e for e in errors if "Missing required" in e]
    if missing_files:
        logger.warning("Missing required files:")
        for error in missing_files:
            logger.warning(f"  - {error}")
        logger.info("Fix: Add the missing files or update the schema")

    # Check for unexpected files
    unexpected_files = [e for e in errors if "Unexpected" in e]
    if unexpected_files:
        logger.warning("Unexpected files (not in schema):")
        for error in unexpected_files:
            logger.warning(f"  - {error}")
        logger.info("Fix: Remove unexpected files or update schema to include them")

Example 1: Upload with Schema Validation

python
import asyncio
from pathlib import Path
from miura.aio import AsyncNexus
from miura.api import UploadMode
from miura.api.datasources import LocalDataSource
from miura.logging import get_logger

logger = get_logger(__name__)

async def upload_with_validation():
    """Upload data with schema validation."""
    async with AsyncNexus() as nexus:
        # Create project
        project = await nexus.create_project("upload-demo")

        # Create collection with schema (quality gate)
        schema = [
            {
                "pattern": ".*\\.vtk$",
                "min_occurrence": 1,
                "max_occurrence": None
            },
            {
                "pattern": "parameters\\.dat",
                "min_occurrence": 1,
                "max_occurrence": 1
            }
        ]

        collection = await project.create_collection(
            name="simulation-data",
            schema=schema,
            metadata={"description": "Simulation data with quality gate"}
        )

        # Upload data (schema validates automatically)
        data_path = Path("data/simulation")
        if data_path.exists():
            datasource = LocalDataSource(str(data_path))
            upload_result = await collection.upload(
                datasource=datasource,
                mode=UploadMode.APPEND
            )

            # Check results
            if upload_result.get("files_failed", 0) == 0:
                logger.info("All files passed schema validation!")
            else:
                logger.warning("Some files failed validation")
                for error in upload_result.get("errors", []):
                    logger.error(f"  {error}")

asyncio.run(upload_with_validation())

Example 2: Upload with Auto-Generated Schema

python
import asyncio
from pathlib import Path
from miura.aio import AsyncNexus
from miura.api.datasources import LocalDataSource
from miura.api import UploadMode, generate_schema_from_path, SchemaGenOptions
from miura.logging import get_logger

logger = get_logger(__name__)

async def upload_with_generated_schema():
    """Generate schema from filesystem and upload."""
    async with AsyncNexus() as nexus:
        # Step 1: Generate schema from existing data
        logger.info("=== Generating Schema ===")
        data_path = Path("data/my-simulation")
        options = SchemaGenOptions(
            min_files_for_pattern=2,
            default_required=False,
            schema_name="auto-generated-schema"
        )
        schema = generate_schema_from_path(str(data_path), options=options)
        logger.info(f"Generated schema with {len(schema)} root-level nodes")

        # Step 2: Create collection with generated schema
        logger.info("=== Creating Collection ===")
        project = await nexus.create_project("upload-demo")
        collection = await project.create_collection(
            name="simulation-data",
            schema=schema,
            metadata={
                "description": "Collection with auto-generated schema",
                "schema_type": "auto-generated"
            }
        )

        # Step 3: Upload data (validated against generated schema)
        logger.info("=== Uploading Data ===")
        datasource = LocalDataSource(str(data_path))
        upload_result = await collection.upload(
            datasource=datasource,
            mode=UploadMode.APPEND
        )

        logger.info(f"Upload completed: {upload_result.get('files_uploaded', 0)} files")

asyncio.run(upload_with_generated_schema())


Best Practices

1. Always Define Schemas

Schemas are quality gates - always define them:

Avoid: Creating collections without schemas - this defeats the purpose of quality gates.
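A minimal sketch of the difference (whether create_collection allows omitting the schema argument is an assumption here; every call to it in this tutorial passes one):

python
# Avoid: no schema means no quality gate - any structure would be accepted
# (assumes create_collection permits omitting the schema argument)
collection = await project.create_collection(name="raw-data")

# Prefer: the schema rejects non-matching data before upload
collection = await project.create_collection(
    name="reports",
    schema=[{
        "pattern": "report_\\d+\\.pdf",
        "min_occurrence": 1,
        "max_occurrence": None
    }]
)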

2. Test Schema Before Upload

Generate schema and review it before creating the collection:

python
import json
from miura.api import generate_schema_from_path

# Generate and review schema
schema = generate_schema_from_path("data/my-simulation")
logger.info("Generated schema:")
logger.info(json.dumps(schema, indent=2))

# Review and adjust if needed, then create the collection

3. Handle Validation Errors Gracefully

Always check for validation errors:

python
upload_result = await collection.upload(datasource=datasource)

if upload_result.get("files_failed", 0) > 0:
    logger.warning("Validation errors detected")

    # Handle errors appropriately
    for error in upload_result.get("errors", []):
        logger.error(f"  {error}")
