Uploading Data

Upload data to collections with automatic schema validation

This tutorial covers uploading data to collections, including schema validation, data sources, and upload workflows.


Introduction

Uploading data to a collection involves:

  1. Setting a data source: Tell the collection where to find your data
  2. Prevalidation: The collection validates your data against its schema
  3. Upload: Validated files are uploaded to the collection

The schema acts as a quality gate - it ensures only properly structured data enters your collection.
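To make those steps concrete, here is the workflow in miniature; every call in this sketch reappears later in this tutorial:

python
from miura import Nexus
from miura.api.datasources import LocalDataSource

with Nexus() as nexus:
    project = nexus.create_project("my-project")
    collection = project.create_collection(
        name="reports",
        schema=[{"pattern": "report_\\d+\\.pdf", "min_occurrence": 1, "max_occurrence": None}]
    )

    # Step 1: data source - where the files live
    datasource = LocalDataSource("data/reports")

    # Steps 2 and 3: prevalidation against the schema happens automatically,
    # then the files that pass are uploaded
    upload_result = collection.upload(datasource=datasource)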


What is a Schema?

A schema defines the expected structure of data in a collection. It specifies:

  • What files and folders are expected
  • Which items are required vs optional
  • Patterns for matching files (e.g., report_\d+\.pdf)
  • Occurrence constraints (min/max occurrences)
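For example, a minimal schema requiring at least one report PDF and exactly one parameters file could look like this (the pattern/min_occurrence/max_occurrence keys are the ones used throughout this tutorial):

python
# Each entry describes one expected item in the collection
schema = [
    {
        "pattern": "report_\\d+\\.pdf",  # regex matched against file names
        "min_occurrence": 1,             # required: at least one must be present
        "max_occurrence": None           # no upper limit
    },
    {
        "pattern": "parameters\\.dat",   # one fixed file
        "min_occurrence": 1,
        "max_occurrence": 1              # at most one allowed
    }
]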

Schemas as Quality Gates

The primary purpose of schemas is to act as quality gates for data ingestion. They ensure:

  1. Data Integrity: Only data matching the expected structure is accepted
  2. Consistency: All data in a collection follows the same organizational pattern
  3. Early Detection: Problems are caught before upload, not after
  4. Documentation: The schema documents what data structure is expected

python
from miura import Nexus
from miura.api.datasources import LocalDataSource
from miura.logging import get_logger

logger = get_logger(__name__)

with Nexus() as nexus:
    # Create collection with schema (quality gate)
    project = nexus.create_project("my-project")

    # Schema defines: "We expect files matching report_\d+\.pdf"
    schema = [
        {
            "pattern": "report_\\d+\\.pdf",
            "min_occurrence": 1,    # At least one report file required
            "max_occurrence": None  # No maximum limit
        }
    ]

    collection = project.create_collection(
        name="reports",
        schema=schema
    )

    # Upload data - schema validates before upload
    datasource = LocalDataSource("data/reports")
    upload_result = collection.upload(datasource=datasource)

    # If data doesn't match schema, validation fails
    if upload_result.get("files_failed", 0) > 0:
        logger.warning("Some files failed validation against schema")
        for error in upload_result.get("errors", []):
            logger.error(f"Validation error: {error}")

# Nexus automatically closes when exiting the with block

Benefits of Schema-Based Quality Gates

  1. Prevents Bad Data: Invalid files are rejected before upload
  2. Enforces Standards: Ensures all data follows the same structure
  3. Clear Expectations: Schema documents what data is acceptable
  4. Automated Validation: No manual checking needed

Simple Upload Workflow

python
import asyncio
from pathlib import Path
from miura.aio import AsyncNexus
from miura.api import UploadMode
from miura.api.datasources import LocalDataSource
from miura.logging import get_logger

logger = get_logger(__name__)

async def basic_upload():
    async with AsyncNexus() as nexus:
        # Get project and collection
        project = await nexus.get_project("my-project")
        collection = await project.get_collection("my-collection")

        # Create data source
        data_path = Path("data/my-simulation")
        datasource = LocalDataSource(str(data_path))

        # Upload data
        logger.info("Uploading data...")
        upload_result = await collection.upload(
            datasource=datasource,
            mode=UploadMode.APPEND
        )

        logger.info("Upload completed:")
        logger.info(f"  Files uploaded: {upload_result.get('files_uploaded', 0)}")
        logger.info(f"  Files failed: {upload_result.get('files_failed', 0)}")
        logger.info(f"  Total size: {upload_result.get('total_size', 0):,} bytes")

asyncio.run(basic_upload())

Upload with Versioning

Create a new version for each upload:

python
from miura.api import UploadMode

upload_result = await collection.upload(
    datasource=datasource,
    mode=UploadMode.REPLACE  # Replaces all existing items
)

logger.info(f"Upload ID: {upload_result.get('upload_id')}")


Local Data Source

Upload from a local directory:

python
from pathlib import Path
from miura.api.datasources import LocalDataSource

# Create local data source
data_path = Path("data/my-simulation")
datasource = LocalDataSource(str(data_path))

# Check data source
logger.info(f"Data source: {datasource.path}")
logger.info(f"Files: {datasource.get_file_count()}")
logger.info(f"Total size: {datasource.get_size_bytes():,} bytes")

Data Source Information

python
datasource = LocalDataSource("data/my-simulation")

# Get file count
file_count = datasource.get_file_count()
logger.info(f"Files to upload: {file_count}")

# Get file list
files = datasource.get_file_list()
for file in files[:10]:  # Show first 10
    logger.info(f"  - {file}")

Upload Workflow

Complete Upload Process

The upload process consists of several steps:

  1. Set Data Source: Tell the collection where to find data
  2. Prevalidation: Validate data against schema
  3. Upload: Upload validated files
python
import asyncio
from pathlib import Path
from miura.aio import AsyncNexus
from miura.api import UploadMode
from miura.api.datasources import LocalDataSource
from miura.logging import get_logger

logger = get_logger(__name__)

async def complete_upload_process():
    async with AsyncNexus() as nexus:
        project = await nexus.get_project("my-project")
        collection = await project.get_collection("my-collection")

        # Step 1: Create data source
        logger.info("=== Step 1: Creating Data Source ===")
        data_path = Path("data/my-simulation")
        if not data_path.exists():
            logger.error(f"Data path not found: {data_path}")
            return

        datasource = LocalDataSource(str(data_path))
        logger.info(f"Data source: {datasource.path}")
        logger.info(f"Files: {datasource.get_file_count()}")

        # Step 2: Upload (includes prevalidation)
        logger.info("=== Step 2: Uploading Data ===")
        logger.info("Prevalidation will occur automatically...")

        upload_result = await collection.upload(
            datasource=datasource,
            mode=UploadMode.APPEND
        )

        # Step 3: Check results
        logger.info("=== Step 3: Upload Results ===")
        logger.info(f"Status: {upload_result.get('status', 'unknown')}")
        logger.info(f"Files uploaded: {upload_result.get('files_uploaded', 0)}")
        logger.info(f"Files failed: {upload_result.get('files_failed', 0)}")
        logger.info(f"Total size: {upload_result.get('total_size', 0):,} bytes")

        # Check for validation errors
        if upload_result.get("files_failed", 0) > 0:
            logger.warning("Some files failed validation:")
            for error in upload_result.get("errors", []):
                logger.error(f"  - {error}")

asyncio.run(complete_upload_process())


Handling Validation Errors

Understanding Validation Errors

When data doesn't match the schema, validation errors occur:

python
upload_result = await collection.upload(datasource=datasource)

if upload_result.get("files_failed", 0) > 0:
    logger.warning("Validation errors detected:")

    # Check validation errors
    errors = upload_result.get("errors", [])
    for error in errors:
        logger.error(f"Error: {error}")

# Common error types:
# - "Unexpected item: /path/to/file"
# - "Missing required item: /path/to/file"
# - "Too many occurrences: /path/to/file"

Fixing Validation Errors

  1. Review the schema: Understand what's expected
  2. Check your data: Ensure it matches the schema
  3. Adjust schema or data: Either fix your data or update the schema

Example: Handling missing required files

python
upload_result = await collection.upload(datasource=datasource)

if upload_result.get("files_failed", 0) > 0:
    errors = upload_result.get("errors", [])

    # Check for missing required files
    missing_files = [e for e in errors if "Missing required" in e]
    if missing_files:
        logger.warning("Missing required files:")
        for error in missing_files:
            logger.warning(f"  - {error}")
        logger.info("Fix: Add the missing files or update the schema")

    # Check for unexpected files
    unexpected_files = [e for e in errors if "Unexpected" in e]
    if unexpected_files:
        logger.warning("Unexpected files (not in schema):")
        for error in unexpected_files:
            logger.warning(f"  - {error}")
        logger.info("Fix: Remove unexpected files or update schema to include them")

Example 1: Upload with Schema Validation

python
import asyncio
from pathlib import Path
from miura.aio import AsyncNexus
from miura.api import UploadMode
from miura.api.datasources import LocalDataSource
from miura.logging import get_logger

logger = get_logger(__name__)

async def upload_with_validation():
    """Upload data with schema validation."""
    async with AsyncNexus() as nexus:
        # Create project
        project = await nexus.create_project("upload-demo")

        # Create collection with schema (quality gate)
        schema = [
            {
                "pattern": ".*\\.vtk$",
                "min_occurrence": 1,
                "max_occurrence": None
            },
            {
                "pattern": "parameters\\.dat",
                "min_occurrence": 1,
                "max_occurrence": 1
            }
        ]

        collection = await project.create_collection(
            name="simulation-data",
            schema=schema,
            metadata={"description": "Simulation data with quality gate"}
        )

        # Upload data (schema validates automatically)
        data_path = Path("data/simulation")
        if data_path.exists():
            datasource = LocalDataSource(str(data_path))
            upload_result = await collection.upload(
                datasource=datasource,
                mode=UploadMode.APPEND
            )

            # Check results
            if upload_result.get("files_failed", 0) == 0:
                logger.info("All files passed schema validation!")
            else:
                logger.warning("Some files failed validation")
                for error in upload_result.get("errors", []):
                    logger.error(f"  {error}")

asyncio.run(upload_with_validation())

Example 2: Upload with Auto-Generated Schema

python
import asyncio
from pathlib import Path
from miura.aio import AsyncNexus
from miura.api.datasources import LocalDataSource
from miura.api import UploadMode, generate_schema_from_path, SchemaGenOptions
from miura.logging import get_logger

logger = get_logger(__name__)

async def upload_with_generated_schema():
    """Generate schema from filesystem and upload."""
    async with AsyncNexus() as nexus:
        # Step 1: Generate schema from existing data
        logger.info("=== Generating Schema ===")
        data_path = Path("data/my-simulation")
        options = SchemaGenOptions(
            min_files_for_pattern=2,
            default_required=False,
            schema_name="auto-generated-schema"
        )
        schema = generate_schema_from_path(str(data_path), options=options)
        logger.info(f"Generated schema with {len(schema)} root-level nodes")

        # Step 2: Create collection with generated schema
        logger.info("=== Creating Collection ===")
        project = await nexus.create_project("upload-demo")
        collection = await project.create_collection(
            name="simulation-data",
            schema=schema,
            metadata={
                "description": "Collection with auto-generated schema",
                "schema_type": "auto-generated"
            }
        )

        # Step 3: Upload data (validated against generated schema)
        logger.info("=== Uploading Data ===")
        datasource = LocalDataSource(str(data_path))
        upload_result = await collection.upload(
            datasource=datasource,
            mode=UploadMode.APPEND
        )

        logger.info(f"Upload completed: {upload_result.get('files_uploaded', 0)} files")

asyncio.run(upload_with_generated_schema())


Best Practices

1. Always Define Schemas

Schemas are quality gates - always define them:

Avoid: Creating collections without schemas - this defeats the purpose of quality gates.
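A minimal sketch of the difference (whether create_collection allows omitting the schema argument is an assumption here; every call to it in this tutorial passes one):

python
# Avoid: no schema means no quality gate - any structure would be accepted
# (assumes create_collection permits omitting the schema argument)
collection = await project.create_collection(name="raw-data")

# Prefer: the schema rejects non-matching data before upload
collection = await project.create_collection(
    name="reports",
    schema=[{
        "pattern": "report_\\d+\\.pdf",
        "min_occurrence": 1,
        "max_occurrence": None
    }]
)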

2. Test Schema Before Upload

Generate schema and review it before creating the collection:

python
import json
from miura.api import generate_schema_from_path

# Generate and review schema
schema = generate_schema_from_path("data/my-simulation")
logger.info("Generated schema:")
logger.info(json.dumps(schema, indent=2))

# Review and adjust if needed, then create the collection

3. Handle Validation Errors Gracefully

Always check for validation errors:

python
upload_result = await collection.upload(datasource=datasource)

if upload_result.get("files_failed", 0) > 0:
    logger.warning("Validation errors detected")

    # Handle errors appropriately
    for error in upload_result.get("errors", []):
        logger.error(f"  {error}")
