Uploading Data

Upload data to collections with automatic schema validation

This tutorial covers uploading data to collections, including schema validation, data sources, and upload workflows.


Introduction

Uploading data to a collection involves:

  1. Setting a data source: Tell the collection where to find your data
  2. Prevalidation: The collection validates your data against its schema
  3. Upload: Validated files are uploaded to the collection

The schema acts as a quality gate - it ensures only properly structured data enters your collection.


What is a Schema?

A schema defines the expected structure of data in a collection. It specifies:

  • What files and folders are expected
  • Which items are required vs optional
  • Patterns for matching files (e.g., report_\d+\.pdf)
  • Occurrence constraints (min/max occurrences)
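Patterns like the one above are regular expressions matched against file names. As a quick illustration (using only the standard library, independent of the collection API), here is how such a pattern separates matching from non-matching files:

```python
import re

# Schema-style pattern from the text above, applied as a full-name match
pattern = re.compile(r"report_\d+\.pdf")

candidates = ["report_001.pdf", "report_final.pdf", "notes.txt"]
matches = [name for name in candidates if pattern.fullmatch(name)]
print(matches)  # ['report_001.pdf']
```

`fullmatch` is used so the pattern must describe the entire file name, not just a prefix.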

Schemas as Quality Gates

The primary purpose of schemas is to act as quality gates for data ingestion. They ensure:

  1. Data Integrity: Only data matching the expected structure is accepted
  2. Consistency: All data in a collection follows the same organizational pattern
  3. Early Detection: Problems are caught before upload, not after
  4. Documentation: The schema documents what data structure is expected

python
from miura import Nexus
from miura.logging import get_logger

logger = get_logger(__name__)

with Nexus() as nexus:
    # Create collection with schema (quality gate)
    project = nexus.create_project("my-project")
    
    # Schema defines: "We expect files matching report_\\d+\\.pdf"
    schema = [
        {
            "pattern": "report_\\d+\\.pdf",
            "min_occurrence": 1,  # At least one report file required
            "max_occurrence": None  # No maximum limit
        }
    ]
    
    collection = project.create_collection(
        name="reports",
        schema=schema
    )
    
    # Upload data - schema validates before upload
    from miura.sdk import LocalDataSource
    datasource = LocalDataSource("data/reports")
    upload_result = collection.upload(datasource=datasource)
    
    # If data doesn't match schema, validation fails
    if upload_result.files_failed > 0:
        logger.warning("Some files failed validation against schema")
        for error in upload_result.errors:
            logger.error(f"Validation error: {error}")
    # Nexus automatically closes when exiting the with block

Benefits of Schema-Based Quality Gates

  1. Prevents Bad Data: Invalid files are rejected before upload
  2. Enforces Standards: Ensures all data follows the same structure
  3. Clear Expectations: Schema documents what data is acceptable
  4. Automated Validation: No manual checking needed
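To make the quality-gate idea concrete, here is a minimal, illustrative re-implementation of schema validation using only the standard library. This is a sketch of the concept, not the library's actual validator; it assumes schema entries shaped like the `pattern`/`min_occurrence`/`max_occurrence` dictionaries shown in this tutorial, and produces error strings in the style described later:

```python
import re

def validate_files(schema, filenames):
    """Illustrative quality gate: check file names against schema entries."""
    errors = []
    matched = set()
    for entry in schema:
        pattern = re.compile(entry["pattern"])
        hits = [f for f in filenames if pattern.fullmatch(f)]
        matched.update(hits)
        if len(hits) < entry.get("min_occurrence", 0):
            errors.append(f"Missing required item: {entry['pattern']}")
        max_occ = entry.get("max_occurrence")
        if max_occ is not None and len(hits) > max_occ:
            errors.append(f"Too many occurrences: {entry['pattern']}")
    # Files matching no schema entry are unexpected
    errors.extend(f"Unexpected item: {f}" for f in filenames if f not in matched)
    return errors

schema = [{"pattern": r"report_\d+\.pdf", "min_occurrence": 1, "max_occurrence": None}]
print(validate_files(schema, ["report_1.pdf", "notes.txt"]))
# ['Unexpected item: notes.txt']
```

Running the gate before upload is what catches problems early: an empty file list here would report the missing required report instead of failing after transfer.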

Simple Upload Workflow

python
import asyncio
from pathlib import Path
from miura.api import AsyncNexus
from miura.sdk import LocalDataSource
from miura.logging import get_logger

logger = get_logger(__name__)

async def basic_upload():
    async with AsyncNexus() as nexus:
        # Get project and collection
        project = await nexus.get_project("my-project")
        collection = await project.get_collection("my-collection")
        
        # Create data source
        data_path = Path("data/my-simulation")
        datasource = LocalDataSource(str(data_path))
        
        # Upload data
        logger.info("Uploading data...")
        from miura.sdk import UploadMode
        
        upload_result = await collection.upload(
            datasource=datasource,
            mode=UploadMode.APPEND
        )
        
        logger.info("Upload completed:")
        logger.info(f"  Files uploaded: {upload_result.files_uploaded}")
        logger.info(f"  Files failed: {upload_result.files_failed}")
        logger.info(f"  Total size: {upload_result.total_size:,} bytes")

asyncio.run(basic_upload())

Upload with Versioning

Uploading with UploadMode.REPLACE creates a fresh version of the collection's contents, replacing all existing items:

python
from miura.sdk import UploadMode

upload_result = await collection.upload(
    datasource=datasource,
    mode=UploadMode.REPLACE  # Replaces all existing items
)

logger.info(f"Upload ID: {upload_result.upload_id}")
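The difference between the two upload modes can be sketched with a toy in-memory model (an illustration of the APPEND/REPLACE semantics described above, not the library's implementation):

```python
# Toy model of the two upload modes (illustration only).
def apply_upload(existing, incoming, mode):
    if mode == "REPLACE":
        return list(incoming)           # new version: old items are replaced
    return existing + list(incoming)    # APPEND: new items join the old ones

collection_items = ["report_1.pdf"]
appended = apply_upload(collection_items, ["report_2.pdf"], "APPEND")
replaced = apply_upload(collection_items, ["report_2.pdf"], "REPLACE")
print(appended)  # ['report_1.pdf', 'report_2.pdf']
print(replaced)  # ['report_2.pdf']
```

Choose APPEND to grow a collection incrementally and REPLACE when each upload should supersede what came before.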


Local Data Source

Upload from a local directory:

python
from miura.sdk import LocalDataSource

# Create local data source
data_path = Path("data/my-simulation")
datasource = LocalDataSource(str(data_path))

# Check data source
logger.info(f"Data source: {datasource.path}")
logger.info(f"Files: {datasource.get_file_count()}")
logger.info(f"Total size: {datasource.get_size_bytes():,} bytes")

Data Source Information

python
datasource = LocalDataSource("data/my-simulation")

# Get file count
file_count = datasource.get_file_count()
logger.info(f"Files to upload: {file_count}")

# Get file list
files = datasource.get_file_list()
for file in files[:10]:  # Show first 10
    logger.info(f"  - {file}")
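If you want to preview a directory before constructing a LocalDataSource, the same numbers can be computed with nothing but the standard library. This sketch builds a small throwaway directory as a stand-in for data/my-simulation:

```python
import tempfile
from pathlib import Path

# Throwaway directory standing in for data/my-simulation
tmp = Path(tempfile.mkdtemp())
(tmp / "report_1.pdf").write_bytes(b"x" * 10)
(tmp / "report_2.pdf").write_bytes(b"x" * 20)

# Count files and sum their sizes, recursively
files = [p for p in tmp.rglob("*") if p.is_file()]
total_size = sum(p.stat().st_size for p in files)
print(len(files), total_size)  # 2 30
```

This is handy for a quick sanity check (roughly the right number of files, plausible total size) before kicking off an upload.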

Upload Workflow

Complete Upload Process

The upload process consists of several steps:

  1. Set Data Source: Tell the collection where to find data
  2. Prevalidation: Validate data against schema
  3. Upload: Upload validated files

python
async def complete_upload_process():
    async with AsyncNexus() as nexus:
        project = await nexus.get_project("my-project")
        collection = await project.get_collection("my-collection")
        
        # Step 1: Create data source
        logger.info("=== Step 1: Creating Data Source ===")
        data_path = Path("data/my-simulation")
        if not data_path.exists():
            logger.error(f"Data path not found: {data_path}")
            return
        
        datasource = LocalDataSource(str(data_path))
        logger.info(f"Data source: {datasource.path}")
        logger.info(f"Files: {datasource.get_file_count()}")
        
        # Step 2: Upload (includes prevalidation)
        logger.info("=== Step 2: Uploading Data ===")
        logger.info("Prevalidation will occur automatically...")
        
        upload_result = await collection.upload(
            datasource=datasource,
            mode=UploadMode.APPEND
        )
        
        # Step 3: Check results
        logger.info("=== Step 3: Upload Results ===")
        logger.info(f"Status: {upload_result.status}")
        logger.info(f"Files uploaded: {upload_result.files_uploaded}")
        logger.info(f"Files failed: {upload_result.files_failed}")
        logger.info(f"Total size: {upload_result.total_size:,} bytes")
        
        # Check for validation errors
        if upload_result.files_failed > 0:
            logger.warning("Some files failed validation:")
            for error in upload_result.errors:
                logger.error(f"  - {error}")

asyncio.run(complete_upload_process())

Handling Validation Errors

Understanding Validation Errors

When data doesn't match the schema, validation errors occur:

python
upload_result = await collection.upload(datasource=datasource)

if upload_result.files_failed > 0:
    logger.warning("Validation errors detected:")
    # Check validation errors
    errors = upload_result.errors
    for error in errors:
        logger.error(f"Error: {error}")
    
    # Common error types:
    # - "Unexpected item: /path/to/file"
    # - "Missing required item: /path/to/file"
    # - "Too many occurrences: /path/to/file"

Fixing Validation Errors

  1. Review the schema: Understand what's expected
  2. Check your data: Ensure it matches the schema
  3. Adjust schema or data: Either fix your data or update the schema

Example: Handling missing required files

python
upload_result = await collection.upload(datasource=datasource)

if upload_result.files_failed > 0:
    errors = upload_result.errors

    # Check for missing required files
    missing_files = [e for e in errors if "Missing required" in e]
    if missing_files:
        logger.warning("Missing required files:")
        for error in missing_files:
            logger.warning(f"  - {error}")
        logger.info("Fix: Add the missing files or update the schema")

    # Check for unexpected files
    unexpected_files = [e for e in errors if "Unexpected" in e]
    if unexpected_files:
        logger.warning("Unexpected files (not in schema):")
        for error in unexpected_files:
            logger.warning(f"  - {error}")
        logger.info("Fix: Remove unexpected files or update schema to include them")

Example: Upload with Schema Validation

python
import asyncio
from pathlib import Path
from miura.api import AsyncNexus
from miura.sdk import LocalDataSource, UploadMode
from miura.logging import get_logger

logger = get_logger(__name__)

async def upload_with_validation():
    """Upload data with schema validation."""
    async with AsyncNexus() as nexus:
        # Create project
        project = await nexus.create_project("upload-demo")
        

        # Create collection with schema (quality gate)
        schema = [
            {
                "pattern": ".*\\.vtk$",
                "min_occurrence": 1,
                "max_occurrence": None
            },
            {
                "pattern": "parameters\\.dat",
                "min_occurrence": 1,
                "max_occurrence": 1
            }
        ]
        
        collection = await project.create_collection(
            name="simulation-data",
            schema=schema,
            metadata={"description": "Simulation data with quality gate"}
        )
        
        # Upload data (schema validates automatically)
        data_path = Path("data/simulation")
        if data_path.exists():
            datasource = LocalDataSource(str(data_path))
            upload_result = await collection.upload(
                datasource=datasource,
                mode=UploadMode.APPEND
            )
            
            # Check results
            if upload_result.files_failed == 0:
                logger.info("All files passed schema validation!")
            else:
                logger.warning("Some files failed validation")
                for error in upload_result.errors:
                    logger.error(f"  {error}")

asyncio.run(upload_with_validation())

Best Practices

1. Always Define Schemas

Schemas are quality gates - always define them:

Avoid: Creating collections without schemas

(This defeats the purpose of quality gates)

2. Review Schema Before Upload

Review your schema (and validate it against sample data if needed) before creating the collection so the quality gate matches your expectations.

3. Handle Validation Errors Gracefully

Always check for validation errors:

python
upload_result = await collection.upload(datasource=datasource)

if upload_result.files_failed > 0:
    logger.warning("Validation errors detected")
    # Handle errors appropriately
    for error in upload_result.errors:
        logger.error(f"  {error}")

© 2026