Uploading Data
This tutorial covers uploading data to collections, including schema validation, data sources, and upload workflows.
Introduction
Uploading data to a collection involves:
- Setting a data source: Tell the collection where to find your data
- Prevalidation: The collection validates your data against its schema
- Upload: Validated files are uploaded to the collection
The schema acts as a quality gate - it ensures only properly structured data enters your collection.
What is a Schema?
A schema defines the expected structure of data in a collection. It specifies:
- What files and folders are expected
- Which items are required vs optional
- Patterns for matching files (e.g., report_\d+\.pdf)
- Occurrence constraints (min/max occurrences)
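A single schema entry can be sketched as a plain dictionary; the field names below follow the examples later in this tutorial, and the `re.fullmatch` check is only an illustration of how a pattern constrains filenames, not the library's internal matcher:

```python
import re

# One schema entry: files named report_<digits>.pdf,
# at least one required, no upper bound.
entry = {
    "pattern": r"report_\d+\.pdf",
    "min_occurrence": 1,    # required
    "max_occurrence": None  # unlimited
}

def matches(entry, filename):
    """Return True if the filename satisfies the entry's pattern."""
    return re.fullmatch(entry["pattern"], filename) is not None

print(matches(entry, "report_001.pdf"))  # True
print(matches(entry, "summary.pdf"))     # False
```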
Schemas as Quality Gates
The primary purpose of schemas is to act as quality gates for data ingestion. They ensure:
- Data Integrity: Only data matching the expected structure is accepted
- Consistency: All data in a collection follows the same organizational pattern
- Early Detection: Problems are caught before upload, not after
- Documentation: The schema documents what data structure is expected
from miura import Nexus
from miura.logging import get_logger
from miura.sdk import LocalDataSource

logger = get_logger(__name__)

with Nexus() as nexus:
    # Create collection with schema (quality gate)
    project = nexus.create_project("my-project")

    # Schema defines: "We expect files matching report_\\d+\\.pdf"
    schema = [
        {
            "pattern": "report_\\d+\\.pdf",
            "min_occurrence": 1,    # At least one report file required
            "max_occurrence": None  # No maximum limit
        }
    ]
    collection = project.create_collection(
        name="reports",
        schema=schema
    )

    # Upload data - schema validates before upload
    datasource = LocalDataSource("data/reports")
    upload_result = collection.upload(datasource=datasource)

    # If data doesn't match schema, validation fails
    if upload_result.files_failed > 0:
        logger.warning("Some files failed validation against schema")
        for error in upload_result.errors:
            logger.error(f"Validation error: {error}")

# Nexus automatically closes when exiting the with block
Benefits of Schema-Based Quality Gates
- Prevents Bad Data: Invalid files are rejected before upload
- Enforces Standards: Ensures all data follows the same structure
- Clear Expectations: Schema documents what data is acceptable
- Automated Validation: No manual checking needed
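As a mental model, the quality gate itself can be sketched in a few lines of plain Python. This sketch is purely illustrative (it is not the miura implementation), but it produces the same three kinds of errors shown later under Handling Validation Errors:

```python
import re

def prevalidate(schema, filenames):
    """Illustrative quality gate: check filenames against schema entries
    and return a list of human-readable validation errors."""
    errors = []
    matched = set()
    for entry in schema:
        hits = [f for f in filenames if re.fullmatch(entry["pattern"], f)]
        matched.update(hits)
        if len(hits) < entry.get("min_occurrence", 0):
            errors.append(f"Missing required item: {entry['pattern']}")
        maximum = entry.get("max_occurrence")
        if maximum is not None and len(hits) > maximum:
            errors.append(f"Too many occurrences: {entry['pattern']}")
    for f in filenames:
        if f not in matched:
            errors.append(f"Unexpected item: {f}")
    return errors

schema = [{"pattern": r"report_\d+\.pdf", "min_occurrence": 1, "max_occurrence": None}]
print(prevalidate(schema, ["report_001.pdf"]))  # -> [] (passes the gate)
print(prevalidate(schema, ["notes.txt"]))       # -> missing + unexpected errors
```

Real prevalidation runs automatically during upload; the point is that failures are reported before any file is transferred.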
Simple Upload Workflow
import asyncio
from pathlib import Path

from miura.api import AsyncNexus
from miura.logging import get_logger
from miura.sdk import LocalDataSource, UploadMode

logger = get_logger(__name__)

async def basic_upload():
    async with AsyncNexus() as nexus:
        # Get project and collection
        project = await nexus.get_project("my-project")
        collection = await project.get_collection("my-collection")

        # Create data source
        data_path = Path("data/my-simulation")
        datasource = LocalDataSource(str(data_path))

        # Upload data
        logger.info("Uploading data...")
        upload_result = await collection.upload(
            datasource=datasource,
            mode=UploadMode.APPEND
        )

        logger.info("Upload completed:")
        logger.info(f"  Files uploaded: {upload_result.files_uploaded}")
        logger.info(f"  Files failed: {upload_result.files_failed}")
        logger.info(f"  Total size: {upload_result.total_size:,} bytes")

asyncio.run(basic_upload())
Upload with Versioning
Create a new version for each upload by replacing the collection's existing items:
from miura.sdk import UploadMode

upload_result = await collection.upload(
    datasource=datasource,
    mode=UploadMode.REPLACE  # Replaces all existing items
)
logger.info(f"Upload ID: {upload_result.upload_id}")
Local Data Source
Upload from a local directory:
from pathlib import Path
from miura.sdk import LocalDataSource

# Create local data source
data_path = Path("data/my-simulation")
datasource = LocalDataSource(str(data_path))
# Check data source
logger.info(f"Data source: {datasource.path}")
logger.info(f"Files: {datasource.get_file_count()}")
logger.info(f"Total size: {datasource.get_size_bytes():,} bytes")
Data Source Information
datasource = LocalDataSource("data/my-simulation")
# Get file count
file_count = datasource.get_file_count()
logger.info(f"Files to upload: {file_count}")
# Get file list
files = datasource.get_file_list()
for file in files[:10]:  # Show first 10
    logger.info(f"  - {file}")
Upload Workflow
Complete Upload Process
The upload process consists of several steps:
- Set Data Source: Tell the collection where to find data
- Prevalidation: Validate data against schema
- Upload: Upload validated files
async def complete_upload_process():
    async with AsyncNexus() as nexus:
        project = await nexus.get_project("my-project")
        collection = await project.get_collection("my-collection")

        # Step 1: Create data source
        logger.info("=== Step 1: Creating Data Source ===")
        data_path = Path("data/my-simulation")
        if not data_path.exists():
            logger.error(f"Data path not found: {data_path}")
            return

        datasource = LocalDataSource(str(data_path))
        logger.info(f"Data source: {datasource.path}")
        logger.info(f"Files: {datasource.get_file_count()}")

        # Step 2: Upload (includes prevalidation)
        logger.info("=== Step 2: Uploading Data ===")
        logger.info("Prevalidation will occur automatically...")
        upload_result = await collection.upload(
            datasource=datasource,
            mode=UploadMode.APPEND
        )

        # Step 3: Check results
        logger.info("=== Step 3: Upload Results ===")
        logger.info(f"Status: {upload_result.status}")
        logger.info(f"Files uploaded: {upload_result.files_uploaded}")
        logger.info(f"Files failed: {upload_result.files_failed}")
        logger.info(f"Total size: {upload_result.total_size:,} bytes")

        # Check for validation errors
        if upload_result.files_failed > 0:
            logger.warning("Some files failed validation:")
            for error in upload_result.errors:
                logger.error(f"  - {error}")

asyncio.run(complete_upload_process())
Handling Validation Errors
Understanding Validation Errors
When data doesn't match the schema, validation errors occur:
upload_result = await collection.upload(datasource=datasource)

if upload_result.files_failed > 0:
    logger.warning("Validation errors detected:")

    # Check validation errors
    errors = upload_result.errors
    for error in errors:
        logger.error(f"Error: {error}")

# Common error types:
# - "Unexpected item: /path/to/file"
# - "Missing required item: /path/to/file"
# - "Too many occurrences: /path/to/file"
Fixing Validation Errors
- Review the schema: Understand what's expected
- Check your data: Ensure it matches the schema
- Adjust schema or data: Either fix your data or update the schema
Example: Handling missing required files
upload_result = await collection.upload(datasource=datasource)

if upload_result.files_failed > 0:
    errors = upload_result.errors

    # Check for missing required files
    missing_files = [e for e in errors if "Missing required" in e]
    if missing_files:
        logger.warning("Missing required files:")
        for error in missing_files:
            logger.warning(f"  - {error}")
        logger.info("Fix: Add the missing files or update the schema")

    # Check for unexpected files
    unexpected_files = [e for e in errors if "Unexpected" in e]
    if unexpected_files:
        logger.warning("Unexpected files (not in schema):")
        for error in unexpected_files:
            logger.warning(f"  - {error}")
        logger.info("Fix: Remove unexpected files or update schema to include them")
Example 1: Upload with Schema Validation
import asyncio
from pathlib import Path

from miura.api import AsyncNexus
from miura.logging import get_logger
from miura.sdk import LocalDataSource, UploadMode

logger = get_logger(__name__)

async def upload_with_validation():
    """Upload data with schema validation."""
    async with AsyncNexus() as nexus:
        # Create project
        project = await nexus.create_project("upload-demo")

        # Create collection with schema (quality gate)
        schema = [
            {
                "pattern": ".*\\.vtk$",
                "min_occurrence": 1,
                "max_occurrence": None
            },
            {
                "pattern": "parameters\\.dat",
                "min_occurrence": 1,
                "max_occurrence": 1
            }
        ]
        collection = await project.create_collection(
            name="simulation-data",
            schema=schema,
            metadata={"description": "Simulation data with quality gate"}
        )

        # Upload data (schema validates automatically)
        data_path = Path("data/simulation")
        if data_path.exists():
            datasource = LocalDataSource(str(data_path))
            upload_result = await collection.upload(
                datasource=datasource,
                mode=UploadMode.APPEND
            )

            # Check results
            if upload_result.files_failed == 0:
                logger.info("All files passed schema validation!")
            else:
                logger.warning("Some files failed validation")
                for error in upload_result.errors:
                    logger.error(f"  {error}")

asyncio.run(upload_with_validation())
Best Practices
1. Always Define Schemas
Schemas are quality gates, so always define one. Avoid creating collections without schemas; doing so defeats the purpose of quality gates.
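If the final data layout is not settled yet, even a permissive catch-all schema is better than none: it still documents intent and rejects empty uploads. A minimal sketch, using the same entry fields as the earlier examples:

```python
# Catch-all schema: accept any file path, but require at least one file.
permissive_schema = [
    {
        "pattern": ".*",        # matches any file
        "min_occurrence": 1,    # rejects an empty upload
        "max_occurrence": None  # no upper limit
    }
]
```

Tighten the patterns later, once the expected structure is known.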
2. Review Schema Before Upload
Review your schema (and validate it against sample data if needed) before creating the collection so the quality gate matches your expectations.
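One lightweight way to do this review is a dry run against a few sample filenames before calling create_collection. The helper below is a local illustration using Python's `re` module, not part of the miura SDK:

```python
import re

# Schema under review (same shape as the tutorial's examples).
schema = [
    {"pattern": r".*\.vtk$", "min_occurrence": 1, "max_occurrence": None},
    {"pattern": r"parameters\.dat", "min_occurrence": 1, "max_occurrence": 1},
]

# Sample filenames you expect to upload.
samples = ["run_01.vtk", "parameters.dat", "README.md"]

def covered(schema, name):
    """Return True if any schema pattern matches the filename."""
    return any(re.fullmatch(e["pattern"], name) for e in schema)

for name in samples:
    status = "matches schema" if covered(schema, name) else "NOT covered by schema"
    print(f"{name}: {status}")
```

If a file you intend to upload reports "NOT covered by schema", fix either the pattern or the data before creating the collection.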
3. Handle Validation Errors Gracefully
Always check for validation errors:
upload_result = await collection.upload(datasource=datasource)

if upload_result.files_failed > 0:
    logger.warning("Validation errors detected")
    # Handle errors appropriately
    for error in upload_result.errors:
        logger.error(f"  {error}")
Next Steps
- Schemas - Learn about schemas and quality gates
- Getting Items Tutorial - Retrieve uploaded items
- Downloading Items Tutorial - Download items from collections
- End-to-End Example - Complete workflow
Related Documentation
- API Reference - Complete API documentation
- Quick Start Guide - Get started with the Nexus API