115 changes: 115 additions & 0 deletions README.md
@@ -42,6 +42,7 @@ Extract only what you need while maintaining referential integrity.
- ✅ **Multiple records**: Extract multiple records in one operation
- ✅ **Timeframe filtering**: Filter specific tables by date ranges
- ✅ **PK remapping**: Auto-remaps auto-generated primary keys for clean imports
- ✅ **Natural key support**: Idempotent SQL generation for tables without unique constraints
- ✅ **DDL generation**: Optionally include CREATE DATABASE/SCHEMA/TABLE statements for self-contained dumps
- ✅ **Progress bar**: Visual progress indicator for dump operations
- ✅ **Schema caching**: SQLite-based caching for improved performance
@@ -181,6 +182,8 @@ pgslice --host localhost --database mydb --dump users --pks 42 \
--log-level DEBUG 2>debug.log
```

**Transaction Safety**: All generated SQL dumps are wrapped in `BEGIN`/`COMMIT` transactions by default. If any part of the import fails, everything automatically rolls back, leaving your database unchanged.
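The all-or-nothing behaviour can be illustrated with a stand-in: the sketch below uses Python's built-in `sqlite3` instead of PostgreSQL, and the `roles` table and malformed dump are invented purely for the demo.

```python
import sqlite3

# sqlite3 stands in for PostgreSQL here; the point is the BEGIN/COMMIT wrapper.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE roles (id INTEGER PRIMARY KEY, name TEXT)")

dump_sql = """
BEGIN;
INSERT INTO roles (name) VALUES ('Admin');
INSERT INTO roles (name) VALUES (NULL, 'oops');  -- column-count mismatch aborts here
COMMIT;
"""
try:
    conn.executescript(dump_sql)
except sqlite3.Error:
    conn.rollback()  # COMMIT never ran, so everything in the dump is undone

count = conn.execute("SELECT COUNT(*) FROM roles").fetchone()[0]
print(count)  # 0 -- the first, valid INSERT was rolled back too
```

Because the failing statement sits inside the same transaction as the valid one, the database ends up exactly as it started.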

### Schema Exploration

```bash
@@ -202,6 +205,118 @@ pgslice> tables
pgslice> describe film
```

## Idempotent Imports with Natural Keys

### What Are Natural Keys?

**Natural keys** are columns (or combinations of columns) that uniquely identify a record by its business meaning, even without explicit database constraints. They represent the "real-world" identifier for your data.

Examples:
- `roles.name` - Role names like "Admin", "User", "Guest" are naturally unique
- `statuses.code` - Status codes like "ACTIVE", "INACTIVE", "PENDING"
- `(tenant_id, setting_key)` - Configuration settings in multi-tenant systems
- `countries.iso_code` - ISO country codes like "US", "CA", "GB"

### Why Use Natural Keys?

By default, pgslice remaps auto-generated primary keys (SERIAL, IDENTITY) to avoid conflicts when importing. However, this can create duplicate records if you reimport the same dump multiple times:

```sql
-- First import: Creates record with new id=1
INSERT INTO roles (name) VALUES ('Admin');

-- Second import: Creates duplicate with new id=2 (no UNIQUE constraint to prevent it!)
INSERT INTO roles (name) VALUES ('Admin');
```

The `--natural-keys` flag solves this by generating **idempotent SQL** - scripts that check "does a record with this natural key already exist?" before inserting. Run the same dump multiple times safely with no duplicates.

### When to Use `--natural-keys`

Use this flag when:
- ✅ Tables have auto-generated PKs (SERIAL, IDENTITY columns)
- ✅ You need to reimport dumps multiple times (development, testing, CI/CD)
- ✅ Tables lack explicit UNIQUE constraints on natural key columns
- ✅ You need composite natural keys (multiple columns for uniqueness)
- ✅ Auto-detection fails or you want explicit control

### Usage Examples

```bash
# Single-column natural key (common for reference/lookup tables)
pgslice --host localhost --database mydb --dump users --pks 42 \
--natural-keys "roles=name"

# With schema prefix (explicit schema)
pgslice --host localhost --database mydb --dump users --pks 42 \
--natural-keys "public.roles=name"

# Composite natural key (multiple columns define uniqueness)
pgslice --host localhost --database mydb --dump customers --pks 1 \
--natural-keys "tenant_settings=tenant_id,setting_key"

# Multiple tables (semicolon-separated)
pgslice --host localhost --database mydb --dump orders --pks 123 \
--natural-keys "roles=name;statuses=code;countries=iso_code"

# Complex example with mixed single and composite keys
pgslice --host localhost --database mydb --dump products --pks 456 \
--natural-keys "roles=name;tenant_configs=tenant_id,config_key;categories=slug"
```

**Format**: `--natural-keys "schema.table=col1,col2;other_table=col1;..."`
- Tables separated by `;`
- Columns separated by `,`
- Schema prefix optional (defaults to `public`)
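The grammar above reduces to a few string splits. This condensed sketch mirrors the `parse_natural_keys` helper this PR adds to `src/pgslice/cli.py`, with the validation and error reporting omitted:

```python
def parse_natural_keys(spec: str) -> dict[str, list[str]]:
    """Parse 'schema.table=col1,col2;other=col' into {table: [cols]}."""
    result: dict[str, list[str]] = {}
    # Tables are ';'-separated; each entry is 'table=comma,separated,cols'.
    for part in filter(None, (p.strip() for p in spec.split(";"))):
        table, _, cols = part.partition("=")
        result[table.strip()] = [c.strip() for c in cols.split(",") if c.strip()]
    return result

print(parse_natural_keys("roles=name;tenant_settings=tenant_id,setting_key"))
# {'roles': ['name'], 'tenant_settings': ['tenant_id', 'setting_key']}
```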

### Auto-Detection

pgslice automatically detects natural keys in this priority order:

1. **Manual specification** (highest priority) - Your `--natural-keys` flag
2. **Common column names** - Recognizes patterns like:
- Exact matches: `name`, `code`, `slug`, `email`, `username`, `key`, `identifier`, `handle`
- Suffix patterns: `*_code`, `*_key`, `*_identifier`, `*_slug`
3. **Reference table heuristic** - Small tables (2-3 columns) with one non-PK text column
4. **Error if none found** - Suggests using `--natural-keys` manually

For most reference tables (roles, statuses, categories), auto-detection works out of the box. Use manual specification for:
- Tables with unconventional column names
- Composite natural keys
- When you want explicit control

### How It Works

When natural keys are specified, pgslice generates CTE-based SQL that:

1. Checks if records with matching natural keys already exist
2. Only inserts records that don't exist yet
3. Maps old primary keys to new (or existing) primary keys for foreign key resolution
4. Ensures idempotency - running multiple times produces the same result

Example generated SQL structure:
```sql
WITH to_insert AS (
-- Values to potentially insert
SELECT * FROM (VALUES (...)) AS v(...)
),
existing AS (
-- Find records that already exist by natural key
SELECT t.id, ti.old_id
FROM roles t
INNER JOIN to_insert ti ON t.name IS NOT DISTINCT FROM ti.name
),
inserted AS (
-- Insert only new records (skip existing)
INSERT INTO roles (name, permissions)
SELECT name, permissions FROM to_insert
WHERE old_id NOT IN (SELECT old_id FROM existing)
RETURNING id, name
)
-- Map old IDs to new IDs for FK resolution
...
```
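The net effect of this pattern can be reproduced end to end with Python's `sqlite3` as a stand-in for PostgreSQL. SQLite's `IS` operator plays the role of `IS NOT DISTINCT FROM`, and the `roles` table here is invented for the demo:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Note: no UNIQUE constraint on name -- exactly the case --natural-keys targets.
conn.execute("CREATE TABLE roles (id INTEGER PRIMARY KEY, name TEXT)")

idempotent_insert = """
WITH to_insert(name) AS (VALUES ('Admin'), ('User'))
INSERT INTO roles (name)
SELECT name FROM to_insert ti
WHERE NOT EXISTS (SELECT 1 FROM roles r WHERE r.name IS ti.name)
"""

# Running the same dump twice must not create duplicates.
for _ in range(2):
    conn.execute(idempotent_insert)

count = conn.execute("SELECT COUNT(*) FROM roles").fetchone()[0]
print(count)  # 2, not 4: the second run inserts nothing
```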

## Configuration

Key environment variables (see `.env.example` for full reference):
85 changes: 85 additions & 0 deletions src/pgslice/cli.py
@@ -81,6 +81,74 @@ def parse_main_timeframe(spec: str) -> MainTableTimeframe:
)


def parse_natural_keys(spec: str) -> dict[str, list[str]]:
"""
Parse natural keys specification.

Format: schema.table=col1,col2;other_table=col1
Example: public.roles=name;public.statuses=code

Args:
spec: Natural keys specification string

Returns:
Dict mapping "schema.table" to list of column names

Raises:
InvalidTimeframeError: If specification is invalid

Examples:
>>> parse_natural_keys("public.roles=name")
{'public.roles': ['name']}

>>> parse_natural_keys("public.roles=name;statuses=code")
{'public.roles': ['name'], 'statuses': ['code']}

>>> parse_natural_keys("roles=name,code")
{'roles': ['name', 'code']}
"""
result: dict[str, list[str]] = {}

# Split by semicolon to get individual table specifications
table_specs = spec.split(";")

for table_spec in table_specs:
table_spec = table_spec.strip()
if not table_spec:
continue

# Split by = to get table and columns
if "=" not in table_spec:
raise InvalidTimeframeError(
f"Invalid natural key format: {table_spec}. "
"Expected: table=col1,col2 or schema.table=col1"
)

table_part, columns_part = table_spec.split("=", 1)
table_part = table_part.strip()
columns_part = columns_part.strip()

if not table_part or not columns_part:
raise InvalidTimeframeError(
f"Invalid natural key format: {table_spec}. "
"Both table and columns must be specified"
)

# Split columns by comma
columns = [col.strip() for col in columns_part.split(",")]
columns = [col for col in columns if col] # Remove empty strings

if not columns:
raise InvalidTimeframeError(
f"Invalid natural key format: {table_spec}. "
"At least one column must be specified"
)

result[table_part] = columns

return result


def fetch_pks_by_timeframe(
conn_manager: ConnectionManager,
table: str,
@@ -407,6 +475,15 @@ def main() -> int:
"-o",
help="Output file path (default: stdout)",
)
dump_group.add_argument(
"--natural-keys",
help=(
"Manually specify natural keys for tables without unique constraints. "
"Format: 'schema.table=col1,col2;other_table=col1'. "
"Enables idempotent INSERTs for tables with auto-generated PKs. "
"Example: 'public.roles=name;public.statuses=code'"
),
)

# Other arguments
parser.add_argument(
@@ -456,6 +533,14 @@ def main() -> int:
if args.log_level:
config.log_level = args.log_level

# Parse natural keys if provided
if args.natural_keys:
try:
config.natural_keys = parse_natural_keys(args.natural_keys)
except InvalidTimeframeError as e:
sys.stderr.write(f"Error: {e}\n")
return 1

# Validate CLI dump mode arguments
if args.dump and not args.pks and not args.timeframe:
sys.stderr.write(
1 change: 1 addition & 0 deletions src/pgslice/config.py
@@ -46,6 +46,7 @@ class AppConfig:
sql_batch_size: int = 100
output_dir: Path = Path.home() / ".pgslice" / "dumps"
create_schema: bool = False
natural_keys: dict[str, list[str]] | None = None


def load_config() -> AppConfig:
4 changes: 3 additions & 1 deletion src/pgslice/dumper/dump_service.py
@@ -128,7 +128,9 @@ def dump(
# Step 3: Generate SQL (using animated spinner)
with animated_spinner(spinner, pbar.set_description, "Generating SQL"):
generator = SQLGenerator(
introspector, batch_size=self.config.sql_batch_size
introspector,
batch_size=self.config.sql_batch_size,
natural_keys=self.config.natural_keys,
)
sql = generator.generate_batch(
sorted_records,