Uploading Data

Upload data to collections with automatic schema validation

This tutorial covers uploading data to collections, including schema validation, data sources, and upload workflows.


Introduction

Uploading data to a collection involves:

  1. Setting a data source: Tell the collection where to find your data
  2. Prevalidation: The collection validates your data against its schema
  3. Upload: Validated files are uploaded to the collection

The schema acts as a quality gate - it ensures only properly structured data enters your collection.


What is a Schema?

A schema defines the expected structure of data in a collection. It specifies:

  • What files and folders are expected
  • Which items are required vs optional
  • Patterns for matching files (e.g., report_\d+\.pdf)
  • Occurrence constraints (min/max occurrences)
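Patterns like the one above are regular expressions matched against file names. As a quick illustration (using only the standard library, independent of the collection API), here is how such a pattern separates matching from non-matching files:

```python
import re

# Schema-style pattern from the text above, applied as a full-name match
pattern = re.compile(r"report_\d+\.pdf")

candidates = ["report_001.pdf", "report_final.pdf", "notes.txt"]
matches = [name for name in candidates if pattern.fullmatch(name)]
print(matches)  # ['report_001.pdf']
```

`fullmatch` is used so the pattern must describe the entire file name, not just a prefix.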

Schemas as Quality Gates

The primary purpose of schemas is to act as quality gates for data ingestion. They ensure:

  1. Data Integrity: Only data matching the expected structure is accepted
  2. Consistency: All data in a collection follows the same organizational pattern
  3. Early Detection: Problems are caught before upload, not after
  4. Documentation: The schema documents what data structure is expected

python
from miura import Nexus
from miura.logging import get_logger

logger = get_logger(__name__)

with Nexus() as nexus:
    # Create collection with schema (quality gate)
    project = nexus.create_project("my-project")
    
    # Schema defines: "We expect files matching report_\\d+\\.pdf"
    schema = [
        {
            "pattern": "report_\\d+\\.pdf",
            "min_occurrence": 1,  # At least one report file required
            "max_occurrence": None  # No maximum limit
        }
    ]
    
    collection = project.create_collection(
        name="reports",
        schema=schema
    )
    
    # Upload data - schema validates before upload
    from miura.sdk import LocalDataSource
    datasource = LocalDataSource("data/reports")
    upload_result = collection.upload(datasource=datasource)
    
    # If data doesn't match schema, validation fails
    if upload_result.files_failed > 0:
        logger.warning("Some files failed validation against schema")
        for error in upload_result.errors:
            logger.error(f"Validation error: {error}")
    # Nexus automatically closes when exiting the with block

Benefits of Schema-Based Quality Gates

  1. Prevents Bad Data: Invalid files are rejected before upload
  2. Enforces Standards: Ensures all data follows the same structure
  3. Clear Expectations: Schema documents what data is acceptable
  4. Automated Validation: No manual checking needed
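To make the quality-gate idea concrete, here is a minimal, illustrative re-implementation of schema validation using only the standard library. This is a sketch of the concept, not the library's actual validator; it assumes schema entries shaped like the `pattern`/`min_occurrence`/`max_occurrence` dictionaries shown in this tutorial, and produces error strings in the style described later:

```python
import re

def validate_files(schema, filenames):
    """Illustrative quality gate: check file names against schema entries."""
    errors = []
    matched = set()
    for entry in schema:
        pattern = re.compile(entry["pattern"])
        hits = [f for f in filenames if pattern.fullmatch(f)]
        matched.update(hits)
        if len(hits) < entry.get("min_occurrence", 0):
            errors.append(f"Missing required item: {entry['pattern']}")
        max_occ = entry.get("max_occurrence")
        if max_occ is not None and len(hits) > max_occ:
            errors.append(f"Too many occurrences: {entry['pattern']}")
    # Files matching no schema entry are unexpected
    errors.extend(f"Unexpected item: {f}" for f in filenames if f not in matched)
    return errors

schema = [{"pattern": r"report_\d+\.pdf", "min_occurrence": 1, "max_occurrence": None}]
print(validate_files(schema, ["report_1.pdf", "notes.txt"]))
# ['Unexpected item: notes.txt']
```

Running the gate before upload is what catches problems early: an empty file list here would report the missing required report instead of failing after transfer.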

Simple Upload Workflow

python
import asyncio
from pathlib import Path
from miura.api import AsyncNexus
from miura.sdk import LocalDataSource
from miura.logging import get_logger

logger = get_logger(__name__)

async def basic_upload():
    async with AsyncNexus() as nexus:
        # Get project and collection
        project = await nexus.get_project("my-project")
        collection = await project.get_collection("my-collection")
        
        # Create data source
        data_path = Path("data/my-simulation")
        datasource = LocalDataSource(str(data_path))
        
        # Upload data
        logger.info("Uploading data...")
        from miura.sdk import UploadMode
        
        upload_result = await collection.upload(
            datasource=datasource,
            mode=UploadMode.APPEND
        )
        
        logger.info("Upload completed:")
        logger.info(f"  Files uploaded: {upload_result.files_uploaded}")
        logger.info(f"  Files failed: {upload_result.files_failed}")
        logger.info(f"  Total size: {upload_result.total_size:,} bytes")

asyncio.run(basic_upload())

Upload with Versioning

Uploading with UploadMode.REPLACE creates a fresh version of the collection's contents, replacing all existing items:

python
from miura.sdk import UploadMode

upload_result = await collection.upload(
    datasource=datasource,
    mode=UploadMode.REPLACE  # Replaces all existing items
)

logger.info(f"Upload ID: {upload_result.upload_id}")
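The difference between the two upload modes can be sketched with a toy in-memory model (an illustration of the APPEND/REPLACE semantics described above, not the library's implementation):

```python
# Toy model of the two upload modes (illustration only).
def apply_upload(existing, incoming, mode):
    if mode == "REPLACE":
        return list(incoming)           # new version: old items are replaced
    return existing + list(incoming)    # APPEND: new items join the old ones

collection_items = ["report_1.pdf"]
appended = apply_upload(collection_items, ["report_2.pdf"], "APPEND")
replaced = apply_upload(collection_items, ["report_2.pdf"], "REPLACE")
print(appended)  # ['report_1.pdf', 'report_2.pdf']
print(replaced)  # ['report_2.pdf']
```

Choose APPEND to grow a collection incrementally and REPLACE when each upload should supersede what came before.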


Local Data Source

Upload from a local directory:

python
from miura.sdk import LocalDataSource

# Create local data source
data_path = Path("data/my-simulation")
datasource = LocalDataSource(str(data_path))

# Check data source
logger.info(f"Data source: {datasource.path}")
logger.info(f"Files: {datasource.get_file_count()}")
logger.info(f"Total size: {datasource.get_size_bytes():,} bytes")

Data Source Information

python
datasource = LocalDataSource("data/my-simulation")

# Get file count
file_count = datasource.get_file_count()
logger.info(f"Files to upload: {file_count}")

# Get file list
files = datasource.get_file_list()
for file in files[:10]:  # Show first 10
    logger.info(f"  - {file}")
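If you want to preview a directory before constructing a LocalDataSource, the same numbers can be computed with nothing but the standard library. This sketch builds a small throwaway directory as a stand-in for data/my-simulation:

```python
import tempfile
from pathlib import Path

# Throwaway directory standing in for data/my-simulation
tmp = Path(tempfile.mkdtemp())
(tmp / "report_1.pdf").write_bytes(b"x" * 10)
(tmp / "report_2.pdf").write_bytes(b"x" * 20)

# Count files and sum their sizes, recursively
files = [p for p in tmp.rglob("*") if p.is_file()]
total_size = sum(p.stat().st_size for p in files)
print(len(files), total_size)  # 2 30
```

This is handy for a quick sanity check (roughly the right number of files, plausible total size) before kicking off an upload.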

Upload Workflow

Complete Upload Process

The upload process consists of several steps:

  1. Set Data Source: Tell the collection where to find data
  2. Prevalidation: Validate data against schema
  3. Upload: Upload validated files

python
async def complete_upload_process():
    async with AsyncNexus() as nexus:
        project = await nexus.get_project("my-project")
        collection = await project.get_collection("my-collection")
        
        # Step 1: Create data source
        logger.info("=== Step 1: Creating Data Source ===")
        data_path = Path("data/my-simulation")
        if not data_path.exists():
            logger.error(f"Data path not found: {data_path}")
            return
        
        datasource = LocalDataSource(str(data_path))
        logger.info(f"Data source: {datasource.path}")
        logger.info(f"Files: {datasource.get_file_count()}")
        
        # Step 2: Upload (includes prevalidation)
        logger.info("=== Step 2: Uploading Data ===")
        logger.info("Prevalidation will occur automatically...")
        
        upload_result = await collection.upload(
            datasource=datasource,
            mode=UploadMode.APPEND
        )
        
        # Step 3: Check results
        logger.info("=== Step 3: Upload Results ===")
        logger.info(f"Status: {upload_result.status}")
        logger.info(f"Files uploaded: {upload_result.files_uploaded}")
        logger.info(f"Files failed: {upload_result.files_failed}")
        logger.info(f"Total size: {upload_result.total_size:,} bytes")
        
        # Check for validation errors
        if upload_result.files_failed > 0:
            logger.warning("Some files failed validation:")
            for error in upload_result.errors:
                logger.error(f"  - {error}")

asyncio.run(complete_upload_process())

Handling Validation Errors

Understanding Validation Errors

When data doesn't match the schema, validation errors occur:

python
upload_result = await collection.upload(datasource=datasource)

if upload_result.files_failed > 0:
    logger.warning("Validation errors detected:")
    # Check validation errors
    errors = upload_result.errors
    for error in errors:
        logger.error(f"Error: {error}")
    
    # Common error types:
    # - "Unexpected item: /path/to/file"
    # - "Missing required item: /path/to/file"
    # - "Too many occurrences: /path/to/file"

Fixing Validation Errors

  1. Review the schema: Understand what's expected
  2. Check your data: Ensure it matches the schema
  3. Adjust schema or data: Either fix your data or update the schema

Example: Handling missing required files

python
upload_result = await collection.upload(datasource=datasource)

if upload_result.files_failed > 0:
    errors = upload_result.errors

    # Check for missing required files
    missing_files = [e for e in errors if "Missing required" in e]
    if missing_files:
        logger.warning("Missing required files:")
        for error in missing_files:
            logger.warning(f"  - {error}")
        logger.info("Fix: Add the missing files or update the schema")

    # Check for unexpected files
    unexpected_files = [e for e in errors if "Unexpected" in e]
    if unexpected_files:
        logger.warning("Unexpected files (not in schema):")
        for error in unexpected_files:
            logger.warning(f"  - {error}")
        logger.info("Fix: Remove unexpected files or update schema to include them")

Example: Upload with Schema Validation

python
import asyncio
from pathlib import Path
from miura.api import AsyncNexus
from miura.sdk import LocalDataSource, UploadMode
from miura.logging import get_logger

logger = get_logger(__name__)

async def upload_with_validation():
    """Upload data with schema validation."""
    async with AsyncNexus() as nexus:
        # Create project
        project = await nexus.create_project("upload-demo")
        

        # Create collection with schema (quality gate)
        schema = [
            {
                "pattern": ".*\\.vtk$",
                "min_occurrence": 1,
                "max_occurrence": None
            },
            {
                "pattern": "parameters\\.dat",
                "min_occurrence": 1,
                "max_occurrence": 1
            }
        ]
        
        collection = await project.create_collection(
            name="simulation-data",
            schema=schema,
            metadata={"description": "Simulation data with quality gate"}
        )
        
        # Upload data (schema validates automatically)
        data_path = Path("data/simulation")
        if data_path.exists():
            datasource = LocalDataSource(str(data_path))
            upload_result = await collection.upload(
                datasource=datasource,
                mode=UploadMode.APPEND
            )
            
            # Check results
            if upload_result.files_failed == 0:
                logger.info("All files passed schema validation!")
            else:
                logger.warning("Some files failed validation")
                for error in upload_result.errors:
                    logger.error(f"  {error}")

asyncio.run(upload_with_validation())

Best Practices

1. Always Define Schemas

Schemas are quality gates - always define them:

Avoid: Creating collections without schemas

(This defeats the purpose of quality gates)

2. Review Schema Before Upload

Review your schema (and validate it against sample data if needed) before creating the collection so the quality gate matches your expectations.

3. Handle Validation Errors Gracefully

Always check for validation errors:

python
upload_result = await collection.upload(datasource=datasource)

if upload_result.files_failed > 0:
    logger.warning("Validation errors detected")
    # Handle errors appropriately
    for error in upload_result.errors:
        logger.error(f"  {error}")

© 2026