Uploading Data
This tutorial covers uploading data to collections, including schema validation, data sources, and upload workflows.
Introduction
Uploading data to a collection involves:
- Setting a data source: Tell the collection where to find your data
- Prevalidation: The collection validates your data against its schema
- Upload: Validated files are uploaded to the collection
The schema acts as a quality gate - it ensures only properly structured data enters your collection.
What is a Schema?
A schema defines the expected structure of data in a collection. It specifies:
- What files and folders are expected
- Which items are required vs optional
- Patterns for matching files (e.g., report_\d+\.pdf)
- Occurrence constraints (min/max occurrences)
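For example, the schema entry below accepts one or more numbered PDF reports (the same entry is used in the next section's example):

schema = [
    {
        "pattern": "report_\\d+\\.pdf",  # regex matched against file names
        "min_occurrence": 1,             # required: at least one report
        "max_occurrence": None           # no upper bound
    }
]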
Schemas as Quality Gates
The primary purpose of schemas is to act as quality gates for data ingestion. They ensure:
- Data Integrity: Only data matching the expected structure is accepted
- Consistency: All data in a collection follows the same organizational pattern
- Early Detection: Problems are caught before upload, not after
- Documentation: The schema documents what data structure is expected
from miura import Nexus
from miura.api.datasources import LocalDataSource
from miura.logging import get_logger

logger = get_logger(__name__)

with Nexus() as nexus:
    # Create collection with schema (quality gate)
    project = nexus.create_project("my-project")

    # Schema defines: "We expect files matching report_\d+\.pdf"
    schema = [
        {
            "pattern": "report_\\d+\\.pdf",
            "min_occurrence": 1,    # At least one report file required
            "max_occurrence": None  # No maximum limit
        }
    ]

    collection = project.create_collection(
        name="reports",
        schema=schema
    )

    # Upload data - schema validates before upload
    datasource = LocalDataSource("data/reports")
    upload_result = collection.upload(datasource=datasource)

    # If data doesn't match schema, validation fails
    if upload_result.get("files_failed", 0) > 0:
        logger.warning("Some files failed validation against schema")
        for error in upload_result.get("errors", []):
            logger.error(f"Validation error: {error}")

# Nexus automatically closes when exiting the with block
Benefits of Schema-Based Quality Gates
- Prevents Bad Data: Invalid files are rejected before upload
- Enforces Standards: Ensures all data follows the same structure
- Clear Expectations: Schema documents what data is acceptable
- Automated Validation: No manual checking needed
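To see the gate in action, try uploading a directory that contains a file the schema does not allow. This sketch reuses the reports collection created above; the stray notes.txt file and the error wording are illustrative, based on the error formats listed later in this tutorial:

# Suppose data/reports contains report_1.pdf plus an unexpected notes.txt.
# The schema only allows report_\d+\.pdf, so prevalidation should flag notes.txt.
datasource = LocalDataSource("data/reports")
upload_result = collection.upload(datasource=datasource)

if upload_result.get("files_failed", 0) > 0:
    # Expect an "Unexpected item" error for notes.txt
    for error in upload_result.get("errors", []):
        logger.error(f"Rejected by quality gate: {error}")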
Simple Upload Workflow
import asyncio
from pathlib import Path

from miura.aio import AsyncNexus
from miura.api import UploadMode
from miura.api.datasources import LocalDataSource
from miura.logging import get_logger

logger = get_logger(__name__)

async def basic_upload():
    async with AsyncNexus() as nexus:
        # Get project and collection
        project = await nexus.get_project("my-project")
        collection = await project.get_collection("my-collection")

        # Create data source
        data_path = Path("data/my-simulation")
        datasource = LocalDataSource(str(data_path))

        # Upload data
        logger.info("Uploading data...")
        upload_result = await collection.upload(
            datasource=datasource,
            mode=UploadMode.APPEND
        )

        logger.info("Upload completed:")
        logger.info(f"  Files uploaded: {upload_result.get('files_uploaded', 0)}")
        logger.info(f"  Files failed: {upload_result.get('files_failed', 0)}")
        logger.info(f"  Total size: {upload_result.get('total_size', 0):,} bytes")

asyncio.run(basic_upload())
Upload with Versioning
Create a new version for each upload:

from miura.api import UploadMode

upload_result = await collection.upload(
    datasource=datasource,
    mode=UploadMode.REPLACE  # Replaces all existing items
)

logger.info(f"Upload ID: {upload_result.get('upload_id')}")
Local Data Source
Upload from a local directory:
from miura.api.datasources import LocalDataSource

# Create local data source
data_path = Path("data/my-simulation")
datasource = LocalDataSource(str(data_path))

# Check data source
logger.info(f"Data source: {datasource.path}")
logger.info(f"Files: {datasource.get_file_count()}")
logger.info(f"Total size: {datasource.get_size_bytes():,} bytes")
Data Source Information
datasource = LocalDataSource("data/my-simulation")

# Get file count
file_count = datasource.get_file_count()
logger.info(f"Files to upload: {file_count}")

# Get file list
files = datasource.get_file_list()
for file in files[:10]:  # Show first 10
    logger.info(f"  - {file}")
Upload Workflow
Complete Upload Process
The upload process consists of several steps:
- Set Data Source: Tell the collection where to find data
- Prevalidation: Validate data against schema
- Upload: Upload validated files
async def complete_upload_process():
    async with AsyncNexus() as nexus:
        project = await nexus.get_project("my-project")
        collection = await project.get_collection("my-collection")

        # Step 1: Create data source
        logger.info("=== Step 1: Creating Data Source ===")
        data_path = Path("data/my-simulation")
        if not data_path.exists():
            logger.error(f"Data path not found: {data_path}")
            return

        datasource = LocalDataSource(str(data_path))
        logger.info(f"Data source: {datasource.path}")
        logger.info(f"Files: {datasource.get_file_count()}")

        # Step 2: Upload (includes prevalidation)
        logger.info("=== Step 2: Uploading Data ===")
        logger.info("Prevalidation will occur automatically...")
        upload_result = await collection.upload(
            datasource=datasource,
            mode=UploadMode.APPEND
        )

        # Step 3: Check results
        logger.info("=== Step 3: Upload Results ===")
        logger.info(f"Status: {upload_result.get('status', 'unknown')}")
        logger.info(f"Files uploaded: {upload_result.get('files_uploaded', 0)}")
        logger.info(f"Files failed: {upload_result.get('files_failed', 0)}")
        logger.info(f"Total size: {upload_result.get('total_size', 0):,} bytes")

        # Check for validation errors
        if upload_result.get("files_failed", 0) > 0:
            logger.warning("Some files failed validation:")
            for error in upload_result.get("errors", []):
                logger.error(f"  - {error}")

asyncio.run(complete_upload_process())
Handling Validation Errors
Understanding Validation Errors
When data doesn't match the schema, validation errors occur:
upload_result = await collection.upload(datasource=datasource)

if upload_result.get("files_failed", 0) > 0:
    logger.warning("Validation errors detected:")

    # Check validation errors
    errors = upload_result.get("errors", [])
    for error in errors:
        logger.error(f"Error: {error}")

    # Common error types:
    # - "Unexpected item: /path/to/file"
    # - "Missing required item: /path/to/file"
    # - "Too many occurrences: /path/to/file"
Fixing Validation Errors
- Review the schema: Understand what's expected
- Check your data: Ensure it matches the schema
- Adjust schema or data: Either fix your data or update the schema
Example: Handling missing required files
upload_result = await collection.upload(datasource=datasource)

if upload_result.get("files_failed", 0) > 0:
    errors = upload_result.get("errors", [])

    # Check for missing required files
    missing_files = [e for e in errors if "Missing required" in e]
    if missing_files:
        logger.warning("Missing required files:")
        for error in missing_files:
            logger.warning(f"  - {error}")
        logger.info("Fix: Add the missing files or update the schema")

    # Check for unexpected files
    unexpected_files = [e for e in errors if "Unexpected" in e]
    if unexpected_files:
        logger.warning("Unexpected files (not in schema):")
        for error in unexpected_files:
            logger.warning(f"  - {error}")
        logger.info("Fix: Remove unexpected files or update the schema to include them")
Example 1: Upload with Schema Validation
import asyncio
from pathlib import Path

from miura.aio import AsyncNexus
from miura.api import UploadMode
from miura.api.datasources import LocalDataSource
from miura.logging import get_logger

logger = get_logger(__name__)

async def upload_with_validation():
    """Upload data with schema validation."""
    async with AsyncNexus() as nexus:
        # Create project
        project = await nexus.create_project("upload-demo")

        # Create collection with schema (quality gate)
        schema = [
            {
                "pattern": ".*\\.vtk$",
                "min_occurrence": 1,
                "max_occurrence": None
            },
            {
                "pattern": "parameters\\.dat",
                "min_occurrence": 1,
                "max_occurrence": 1
            }
        ]
        collection = await project.create_collection(
            name="simulation-data",
            schema=schema,
            metadata={"description": "Simulation data with quality gate"}
        )

        # Upload data (schema validates automatically)
        data_path = Path("data/simulation")
        if data_path.exists():
            datasource = LocalDataSource(str(data_path))
            upload_result = await collection.upload(
                datasource=datasource,
                mode=UploadMode.APPEND
            )

            # Check results
            if upload_result.get("files_failed", 0) == 0:
                logger.info("All files passed schema validation!")
            else:
                logger.warning("Some files failed validation")
                for error in upload_result.get("errors", []):
                    logger.error(f"  {error}")

asyncio.run(upload_with_validation())
Example 2: Upload with Auto-Generated Schema
import asyncio
from pathlib import Path

from miura.aio import AsyncNexus
from miura.api import UploadMode, generate_schema_from_path, SchemaGenOptions
from miura.api.datasources import LocalDataSource
from miura.logging import get_logger

logger = get_logger(__name__)

async def upload_with_generated_schema():
    """Generate schema from filesystem and upload."""
    async with AsyncNexus() as nexus:
        # Step 1: Generate schema from existing data
        logger.info("=== Generating Schema ===")
        data_path = Path("data/my-simulation")
        options = SchemaGenOptions(
            min_files_for_pattern=2,
            default_required=False,
            schema_name="auto-generated-schema"
        )
        schema = generate_schema_from_path(str(data_path), options=options)
        logger.info(f"Generated schema with {len(schema)} root-level nodes")

        # Step 2: Create collection with generated schema
        logger.info("=== Creating Collection ===")
        project = await nexus.create_project("upload-demo")
        collection = await project.create_collection(
            name="simulation-data",
            schema=schema,
            metadata={
                "description": "Collection with auto-generated schema",
                "schema_type": "auto-generated"
            }
        )

        # Step 3: Upload data (validated against generated schema)
        logger.info("=== Uploading Data ===")
        datasource = LocalDataSource(str(data_path))
        upload_result = await collection.upload(
            datasource=datasource,
            mode=UploadMode.APPEND
        )

        logger.info(f"Upload completed: {upload_result.get('files_uploaded', 0)} files")

asyncio.run(upload_with_generated_schema())
Best Practices
1. Always Define Schemas
Schemas are quality gates - always define them. Avoid creating collections without schemas: doing so defeats the purpose of quality gates, as the contrast below shows.
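A minimal sketch of the contrast; it assumes create_collection treats schema as optional, which is exactly the loophole this practice warns against:

# With a schema: the quality gate validates every upload
gated = await project.create_collection(
    name="with-gate",
    schema=[{"pattern": ".*\\.vtk$", "min_occurrence": 1, "max_occurrence": None}]
)

# Without a schema: nothing guards the collection's structure - avoid this
ungated = await project.create_collection(name="no-gate")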
2. Test Schema Before Upload
Generate schema and review it before creating the collection:
import json

# Generate and review schema
schema = generate_schema_from_path("data/my-simulation")
logger.info("Generated schema:")
logger.info(json.dumps(schema, indent=2))

# Review and adjust if needed, then create the collection
3. Handle Validation Errors Gracefully
Always check for validation errors:
upload_result = await collection.upload(datasource=datasource)

if upload_result.get("files_failed", 0) > 0:
    logger.warning("Validation errors detected")
    # Handle errors appropriately
    for error in upload_result.get("errors", []):
        logger.error(f"  {error}")
Next Steps
- Schema Generation Tutorial - Generate schemas automatically
- Getting Items Tutorial - Retrieve uploaded items
- Downloading Items Tutorial - Download items from collections
- End-to-End Example - Complete workflow
Related Documentation
- API Reference - Complete API documentation
- Quick Start Guide - Get started with the Nexus API