Audit System and Batch Processing Guide

This guide covers Inferno's enterprise-grade audit logging system and its batch processing capabilities, including cron scheduling, retry logic, and monitoring.

Audit System Overview

The audit system provides comprehensive logging of all operations, security events, and system changes with encryption, compression, and multi-channel alerting capabilities.

Key Features

  • Comprehensive Logging: All operations, access attempts, and configuration changes
  • Encryption: AES-256 encryption for sensitive audit data
  • Compression: Efficient storage with configurable algorithms
  • Multi-channel Alerting: Real-time notifications via email, Slack, webhooks
  • Compliance Reports: Automated generation of audit reports
  • Tamper Detection: Cryptographic integrity validation (a minimal sketch follows this list)
  • Retention Policies: Configurable data retention and archival
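
Tamper detection is easiest to picture as a hash chain: every entry commits to the hash of the entry before it, so editing or deleting any record invalidates the chain from that point on. A minimal illustrative sketch in Python (not Inferno's actual on-disk format):

import hashlib
import json

def chain_entries(entries):
    """Attach a prev_hash/entry_hash pair to each audit entry."""
    prev_hash = "0" * 64  # genesis value for the first entry
    for entry in entries:
        entry["prev_hash"] = prev_hash
        # Hash the entry before entry_hash is attached to it.
        payload = json.dumps(entry, sort_keys=True).encode()
        entry["entry_hash"] = hashlib.sha256(payload).hexdigest()
        prev_hash = entry["entry_hash"]
    return entries

def verify_chain(entries):
    """Return True only if no entry was altered, reordered, or removed."""
    prev_hash = "0" * 64
    for entry in entries:
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        if body["prev_hash"] != prev_hash:
            return False  # broken link: an earlier entry changed
        payload = json.dumps(body, sort_keys=True).encode()
        if hashlib.sha256(payload).hexdigest() != entry["entry_hash"]:
            return False  # this entry's contents changed
        prev_hash = entry["entry_hash"]
    return True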

Audit System Configuration

Enable Audit Logging

# Enable audit system with encryption
inferno audit enable --encryption

# Enable with specific channels
inferno audit enable --channels email,slack,webhook

# Configure audit settings
inferno audit configure \
  --encryption-key-file /secure/audit.key \
  --compression gzip \
  --retention-days 365 \
  --max-size-gb 50

Configuration File

# .inferno.toml
[audit]
enabled = true
log_level = "info"  # trace, debug, info, warn, error
storage_path = "/var/log/inferno/audit"

[audit.encryption]
enabled = true
algorithm = "aes256"
key_file = "/secure/audit.key"
rotate_keys_days = 90

[audit.compression]
enabled = true
algorithm = "gzip"  # gzip, zstd, lz4
level = 6
threshold_kb = 64

[audit.retention]
max_age_days = 365
max_size_gb = 100
archive_to_s3 = true
cleanup_interval_hours = 24

[audit.alerts]
enabled = true
channels = ["email", "slack", "webhook"]
severity_threshold = "warn"

[audit.alerts.email]
smtp_server = "smtp.company.com"
recipients = ["security@company.com"]
template = "security_alert"

[audit.alerts.slack]
webhook_url = "https://hooks.slack.com/services/..."
channel = "#security-alerts"

[audit.alerts.webhook]
url = "https://api.company.com/security/alerts"
headers = { "Authorization" = "Bearer token" }
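
The [audit.compression] and [audit.encryption] tables describe a compress-then-encrypt pipeline. A minimal sketch of one record flowing through it, assuming AES-256 in GCM mode (the config only names "aes256") and the third-party cryptography package:

import gzip
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

COMPRESS_THRESHOLD = 64 * 1024  # threshold_kb = 64
GZIP_LEVEL = 6                  # level = 6

def seal_record(record: bytes, key: bytes) -> bytes:
    """Compress large records, then encrypt with AES-256-GCM."""
    compressed = len(record) >= COMPRESS_THRESHOLD
    if compressed:
        record = gzip.compress(record, compresslevel=GZIP_LEVEL)
    nonce = os.urandom(12)  # must be unique per record
    ciphertext = AESGCM(key).encrypt(nonce, record, None)
    # Prepend a flag byte and the nonce so the record is self-describing.
    return bytes([compressed]) + nonce + ciphertext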

Audit Operations

View Audit Logs

# View recent audit entries
inferno audit logs

# Filter by user and time
inferno audit logs --user admin --since 24h

# Filter by action type
inferno audit logs --action model_load,inference_request

# Export to file
inferno audit export --format json --output audit_report.json

# Search logs
inferno audit search --query "failed login" --since 7d
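
The JSON export is convenient for ad-hoc analysis with standard tooling. A quick filtering sketch, assuming the export is a JSON array of event objects shaped like the examples later in this guide (the login_failure action name is an assumption):

import json
from datetime import datetime, timedelta, timezone

cutoff = datetime.now(timezone.utc) - timedelta(days=7)

with open("audit_report.json") as f:
    entries = json.load(f)

failed = [
    e for e in entries
    if e.get("action") == "login_failure"
    and datetime.fromisoformat(e["timestamp"].replace("Z", "+00:00")) >= cutoff
]
print(f"{len(failed)} failed logins in the last 7 days")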

Audit Log Management

# Check audit system status
inferno audit status

# Rotate audit logs
inferno audit rotate

# Verify log integrity
inferno audit verify --check-encryption

# Generate compliance report
inferno audit report --period monthly --format pdf

Alert Configuration

# Test alert channels
inferno audit test-alerts

# Configure alert rules
inferno audit alert-rule create \
  --name "multiple_failed_logins" \
  --condition "failed_logins > 5 in 10m" \
  --severity high \
  --channels email,slack

# List alert rules
inferno audit alert-rules list

# Disable specific alerts
inferno audit alert-rule disable multiple_failed_logins
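
A condition such as "failed_logins > 5 in 10m" amounts to a count over a sliding time window. A sketch of how that evaluation works (illustrative, not Inferno's rule engine):

import time
from collections import deque

class SlidingWindowRule:
    def __init__(self, threshold, window_seconds):
        self.threshold = threshold
        self.window = window_seconds
        self.events = deque()  # timestamps of matching events

    def record(self, ts=None):
        """Record one matching event; return True if the rule should fire."""
        now = ts if ts is not None else time.monotonic()
        self.events.append(now)
        # Drop events that have aged out of the window.
        while self.events and now - self.events[0] > self.window:
            self.events.popleft()
        return len(self.events) > self.threshold

rule = SlidingWindowRule(threshold=5, window_seconds=600)  # "> 5 in 10m"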

Audit Event Types

Authentication Events

{
  "timestamp": "2024-03-15T10:30:00Z",
  "event_type": "authentication",
  "action": "login_success",
  "user": "admin",
  "source_ip": "192.168.1.100",
  "user_agent": "InfernoClient/1.0",
  "session_id": "sess_123456",
  "details": {
    "auth_method": "api_key",
    "permissions": ["admin:read", "admin:write"]
  }
}

Model Operations

{
  "timestamp": "2024-03-15T10:30:00Z",
  "event_type": "model_operation",
  "action": "model_load",
  "user": "service_account",
  "resource": "llama-7b-q4.gguf",
  "details": {
    "model_size_mb": 3800,
    "backend_type": "gguf",
    "load_time_ms": 2500,
    "gpu_enabled": true
  },
  "result": "success"
}

Security Events

{
  "timestamp": "2024-03-15T10:30:00Z",
  "event_type": "security",
  "action": "access_denied",
  "user": "guest",
  "source_ip": "192.168.1.200",
  "resource": "/admin/config",
  "details": {
    "required_permission": "admin:read",
    "user_permissions": ["inference:read"],
    "reason": "insufficient_permissions"
  },
  "severity": "medium"
}

System Events

{
  "timestamp": "2024-03-15T10:30:00Z",
  "event_type": "system",
  "action": "config_change",
  "user": "admin",
  "resource": "server_config",
  "details": {
    "changes": {
      "max_concurrent_requests": {"old": 50, "new": 100},
      "cache_enabled": {"old": false, "new": true}
    }
  },
  "result": "success"
}
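
The changes map above stores an old/new pair per modified key; the diff itself is straightforward to compute from the two config versions. A sketch:

def diff_config(old, new):
    """Return {key: {"old": ..., "new": ...}} for every changed key."""
    return {
        k: {"old": old.get(k), "new": new.get(k)}
        for k in set(old) | set(new)
        if old.get(k) != new.get(k)
    }

diff_config({"max_concurrent_requests": 50, "cache_enabled": False},
            {"max_concurrent_requests": 100, "cache_enabled": True})
# -> {"max_concurrent_requests": {"old": 50, "new": 100},
#     "cache_enabled": {"old": False, "new": True}}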

Batch Processing System

The batch processing system provides production-ready job management with cron scheduling, retry logic, and comprehensive monitoring.

Key Features

  • Cron Scheduling: Full cron expression support for complex schedules
  • Job Queue: Persistent job queue with priority handling
  • Retry Logic: Configurable retry policies with exponential backoff
  • Resource Management: CPU and memory limits per job
  • Monitoring: Real-time job status and progress tracking
  • Dependencies: Support for job chains and dependencies

Batch Queue Configuration

Enable Batch Processing

# Enable batch queue system
inferno batch-queue enable

# Configure queue settings
inferno batch-queue configure \
  --max-concurrent 5 \
  --retry-attempts 3 \
  --default-timeout 3600

Configuration File

[batch_queue]
enabled = true
storage_path = "/var/lib/inferno/queue"
max_concurrent_jobs = 5
default_timeout_seconds = 3600

[batch_queue.retry]
max_attempts = 3
base_delay_seconds = 30
max_delay_seconds = 3600
backoff_multiplier = 2.0

[batch_queue.resources]
default_cpu_limit = 2.0
default_memory_limit_gb = 4.0
monitor_interval_seconds = 30

[batch_queue.scheduling]
enable_cron = true
timezone = "UTC"
max_scheduled_jobs = 100

[batch_queue.notifications]
enabled = true
channels = ["email", "webhook"]
notify_on = ["job_failed", "job_completed"]
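
With the [batch_queue.retry] values above, the delay before each retry grows geometrically from base_delay_seconds and is capped at max_delay_seconds. A sketch of the resulting schedule:

def retry_delay(attempt,
                base_delay=30.0,    # base_delay_seconds
                multiplier=2.0,     # backoff_multiplier
                max_delay=3600.0):  # max_delay_seconds
    """Seconds to wait before retry `attempt` (1-based)."""
    return min(base_delay * multiplier ** (attempt - 1), max_delay)

# With max_attempts = 3, a failing job waits 30s after the first attempt
# and 60s after the second before giving up.
print([retry_delay(n) for n in range(1, 4)])  # [30.0, 60.0, 120.0]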

Batch Job Operations

Create and Manage Jobs

# Create a simple batch job
inferno batch-queue create \
  --name "model_validation" \
  --command "validate /models/*.gguf" \
  --timeout 1800

# Create scheduled job with cron expression
inferno batch-queue create \
  --name "nightly_conversion" \
  --command "convert model /staging/model.pt /production/model.gguf" \
  --schedule "0 2 * * *" \
  --enabled

# Create job with dependencies
inferno batch-queue create \
  --name "post_processing" \
  --command "optimize /production/model.gguf" \
  --depends-on "nightly_conversion"

# Create job with resource limits
inferno batch-queue create \
  --name "heavy_processing" \
  --command "benchmark --model large-model.gguf" \
  --cpu-limit 4.0 \
  --memory-limit 8GB \
  --priority high
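
Jobs created with --depends-on form a directed acyclic graph, and the queue must not start a job until every job it depends on has finished. A minimal sketch of deriving a safe run order with Kahn's algorithm (illustrative, not Inferno's dispatcher; every dependency must itself be a declared job):

from collections import deque

def execution_order(deps):
    """deps maps each job name to the list of jobs it depends on."""
    indegree = {job: len(parents) for job, parents in deps.items()}
    children = {job: [] for job in deps}
    for job, parents in deps.items():
        for parent in parents:
            children[parent].append(job)
    ready = deque(job for job, n in indegree.items() if n == 0)
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        for child in children[job]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    if len(order) != len(deps):
        raise ValueError("dependency cycle detected")
    return order

execution_order({"nightly_conversion": [],
                 "post_processing": ["nightly_conversion"]})
# -> ["nightly_conversion", "post_processing"]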

Job Management

# List all jobs
inferno batch-queue list

# List running jobs
inferno batch-queue list --status running

# View job details
inferno batch-queue show job_123456

# Cancel a job
inferno batch-queue cancel job_123456

# Retry a failed job
inferno batch-queue retry job_123456

# Pause/resume queue
inferno batch-queue pause
inferno batch-queue resume

Job Monitoring

# Monitor job progress
inferno batch-queue monitor job_123456

# View job logs
inferno batch-queue logs job_123456

# Get queue statistics
inferno batch-queue stats

# Export job history
inferno batch-queue export --format csv --output jobs.csv

Cron Scheduling

Cron Expression Format

┌───────────── minute (0 - 59)
│ ┌─────────── hour (0 - 23)
│ │ ┌───────── day of month (1 - 31)
│ │ │ ┌─────── month (1 - 12)
│ │ │ │ ┌───── day of week (0 - 6) (Sunday to Saturday)
│ │ │ │ │
* * * * *

Common Cron Examples

# Every minute
inferno batch-queue create --schedule "* * * * *"

# Every hour at minute 0
inferno batch-queue create --schedule "0 * * * *"

# Daily at 2 AM
inferno batch-queue create --schedule "0 2 * * *"

# Weekly on Sunday at 3 AM
inferno batch-queue create --schedule "0 3 * * 0"

# Monthly on the 1st at midnight
inferno batch-queue create --schedule "0 0 1 * *"

# Every 15 minutes
inferno batch-queue create --schedule "*/15 * * * *"

# Weekdays at 9 AM
inferno batch-queue create --schedule "0 9 * * 1-5"

# Custom: Every 2 hours between 9 AM and 5 PM on weekdays
inferno batch-queue create --schedule "0 9-17/2 * * 1-5"

Advanced Scheduling

# Multiple schedules for the same job
inferno batch-queue create \
  --name "multi_schedule_job" \
  --command "maintenance" \
  --schedule "0 2 * * *" \
  --additional-schedule "0 14 * * 6,0"

# Conditional scheduling
inferno batch-queue create \
  --name "conditional_job" \
  --command "cleanup if disk_usage > 80%" \
  --schedule "0 */4 * * *" \
  --condition "disk_usage_percent > 80"

Job Types and Templates

Model Processing Jobs

# Model conversion job
inferno batch-queue create \
  --name "convert_models" \
  --template model_conversion \
  --input-dir /staging/models \
  --output-dir /production/models \
  --format gguf \
  --schedule "0 1 * * *"

# Model validation job
inferno batch-queue create \
  --name "validate_models" \
  --template model_validation \
  --model-dir /production/models \
  --schedule "0 3 * * *"

Inference Batch Jobs

# Batch inference job
inferno batch-queue create \
  --name "batch_inference" \
  --template batch_inference \
  --model "llama-7b-q4" \
  --input-file prompts.txt \
  --output-file results.json \
  --batch-size 10

Maintenance Jobs

# Cache cleanup job
inferno batch-queue create \
  --name "cache_cleanup" \
  --template cache_maintenance \
  --max-age 7d \
  --max-size 10GB \
  --schedule "0 4 * * *"

# Log rotation job
inferno batch-queue create \
  --name "log_rotation" \
  --template log_maintenance \
  --max-files 10 \
  --compress true \
  --schedule "0 5 * * 0"

Job Configuration Templates

Model Conversion Template

# templates/model_conversion.yaml
name: "model_conversion"
description: "Convert models between formats"
parameters:
  - name: "input_dir"
    type: "string"
    required: true
  - name: "output_dir"
    type: "string"
    required: true
  - name: "format"
    type: "enum"
    values: ["gguf", "onnx", "pytorch"]
    default: "gguf"
command: |
  for model in ${input_dir}/*; do
    inferno convert model "$model" "${output_dir}/$(basename "$model")" --format ${format}
  done
resources:
  cpu_limit: 2.0
  memory_limit_gb: 4.0
timeout_seconds: 7200
retry:
  max_attempts: 2
  backoff_multiplier: 1.5
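
The ${param} placeholders in the template's command block follow shell-style syntax that Python's string.Template also understands, which makes the substitution step easy to prototype. In this sketch, safe_substitute fills the declared parameters while leaving shell constructs such as $model and $(basename ...) untouched:

from string import Template

command = Template("""for model in ${input_dir}/*; do
  inferno convert model "$model" "${output_dir}/$(basename "$model")" --format ${format}
done""")

print(command.safe_substitute(
    input_dir="/staging/models",
    output_dir="/production/models",
    format="gguf",
))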

Batch Inference Template

# templates/batch_inference.yaml
name: "batch_inference"
description: "Process multiple inference requests"
parameters:
  - name: "model"
    type: "string"
    required: true
  - name: "input_file"
    type: "string"
    required: true
  - name: "output_file"
    type: "string"
    required: true
  - name: "batch_size"
    type: "integer"
    default: 10
command: |
  inferno batch \
    --model ${model} \
    --input ${input_file} \
    --output ${output_file} \
    --batch-size ${batch_size}
resources:
  cpu_limit: 4.0
  memory_limit_gb: 8.0
  gpu_required: true
timeout_seconds: 3600

Monitoring and Alerting

Job Monitoring

# Real-time job monitoring
inferno batch-queue monitor --real-time

# Performance dashboard
inferno batch-queue dashboard

# Resource usage monitoring
inferno batch-queue resources --job job_123456

Alert Configuration

# Configure job alerts
inferno batch-queue alert-rule create \
  --name "job_failure_rate" \
  --condition "failure_rate > 0.10 in 1h" \
  --action "notify_admin"

# Configure resource alerts
inferno batch-queue alert-rule create \
  --name "high_resource_usage" \
  --condition "cpu_usage > 90% for 5m" \
  --action "scale_down"

Integration Examples

Python Batch Client

import time
from datetime import datetime, timedelta

from inferno_client import BatchQueue

# Initialize batch queue client
queue = BatchQueue("http://localhost:8080")

# Create a scheduled job
job = queue.create_job(
    name="daily_model_sync",
    command="rsync -av /staging/models/ /production/models/",
    schedule="0 1 * * *",
    timeout=3600,
    retry_attempts=3
)

# Monitor job progress
def monitor_job(job_id):
    while True:
        status = queue.get_job_status(job_id)
        print(f"Job {job_id}: {status['status']} - {status['progress']}%")

        if status['status'] in ['completed', 'failed', 'cancelled']:
            break

        time.sleep(10)

# Schedule a one-time job
future_time = datetime.now() + timedelta(hours=2)
queue.schedule_job(
    name="future_processing",
    command="process_data --large-dataset",
    run_at=future_time,
    resources={
        "cpu_limit": 8.0,
        "memory_limit_gb": 16.0
    }
)

Kubernetes CronJob Integration

apiVersion: batch/v1
kind: CronJob
metadata:
  name: inferno-model-sync
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: model-sync
            image: inferno:latest
            command:
            - inferno
            - batch-queue
            - create
            - --name
            - k8s-model-sync
            - --command
            - sync_models --source s3://models --dest /models
            resources:
              requests:
                cpu: 1
                memory: 2Gi
              limits:
                cpu: 2
                memory: 4Gi
          restartPolicy: OnFailure

Best Practices

Audit System

  • Enable encryption for sensitive audit data
  • Configure appropriate retention policies
  • Set up multiple alert channels for redundancy
  • Regularly verify audit log integrity
  • Archive old logs to long-term storage

Batch Processing

  • Use appropriate resource limits for jobs
  • Implement proper error handling and logging
  • Test cron expressions before deployment
  • Monitor job performance and adjust as needed
  • Use job dependencies for complex workflows

Security

  • Encrypt audit logs containing sensitive data
  • Use secure channels for alert notifications
  • Implement proper access controls for job management
  • Regularly rotate encryption keys
  • Monitor for unusual job patterns or failures

Performance

  • Optimize job resource allocation
  • Use compression for large audit logs
  • Schedule resource-intensive jobs during off-peak hours
  • Monitor queue performance and adjust concurrency limits
  • Archive completed job data periodically