False AlreadyExists 409 Errors During Parallel Batch Inserts (3.60.0+)
Summary
Versions 3.60.0 and 3.61.0 of the google-cloud-spanner Python client library exhibit a critical bug: during parallel batch insert operations, unique primary keys are incorrectly reported as already existing, causing spurious AlreadyExists: 409 errors despite rigorous pre-insert validation confirming that all keys are unique.
Environment
- google-cloud-spanner versions tested:
- 3.59.0 ✅ (working)
- 3.60.0 ❌ (bug introduced)
- 3.61.0 ❌ (bug persists)
- Python version: 3.10.14
- Operating System: Docker container (Debian-based, Python 3.10.14)
- Workload characteristics:
- 8 parallel workers
- 33,378 total rows across all workers (exact count from validation logs)
- ~4,172 rows per worker
- Batch insert operations using database.batch().insert()
- UUID column is the PRIMARY KEY
Bug Description
Observed Behavior
During parallel batch insert operations, the Spanner client randomly throws false AlreadyExists: 409 errors claiming that primary keys already exist in the table:
google.api_core.exceptions.AlreadyExists: 409 Row [5b45e22f-2f3c-4e69-9dfb-8cae1e9aedc8] in table TABLE_NAME already exists
Critical Evidence
Our validation PASSES before Spanner insert:
✓ Chunk validation passed: 8 workers, 79 total chunks, 33378 unique UUIDs
This proves:
- All 33,378 UUIDs are unique when sent to Spanner
- No duplicates exist in the data we're inserting
- The AlreadyExists errors are false - the bug is in the Spanner client library
Regression Test Results
We performed controlled regression testing by building identical Docker images with only the google-cloud-spanner version changed:
| Version | AlreadyExists Errors | Total Errors | Result |
|---|---|---|---|
| 3.59.0 | 0 | 2 | ✅ PASSED |
| 3.60.0 | 4 | 13 | ❌ FAILED |
| 3.61.0 | 2 | 11 | ❌ FAILED |
Conclusion: The bug was introduced between versions 3.59.0 and 3.60.0.
Reproduction Steps
1. Environment Setup
from google.cloud import spanner
from concurrent.futures import ThreadPoolExecutor
import numpy as np  # needed for np.array_split in step 3
import pandas as pd
import uuid

client = spanner.Client(project='your-project')
instance = client.instance('your-instance')
database = instance.database('your-database')
2. UUID Generation with Validation
def generate_dataframe_with_uuids(num_rows):
    """Generate DataFrame with unique UUIDs."""
    df = pd.DataFrame({
        'UUID': [str(uuid.uuid4()) for _ in range(num_rows)],
        'DATA': [f'row_{i}' for i in range(num_rows)],
        # ... other columns
    })
    # Validate uniqueness (THIS PASSES)
    uuids = df['UUID'].tolist()
    unique_uuids = set(uuids)
    assert len(uuids) == len(unique_uuids), "Duplicates detected in generation!"
    return df
3. Parallel Batch Insert
def insert_chunk(chunk_df, table_name):
    """Insert a chunk using batch insert."""
    with database.batch() as batch:
        batch.insert(
            table_name,
            columns=chunk_df.columns.tolist(),
            values=chunk_df.values.tolist()
        )

def parallel_insert(df, table_name, num_workers=8):
    """Perform parallel batch inserts."""
    chunks = np.array_split(df, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        futures = [
            executor.submit(insert_chunk, chunk, table_name)
            for chunk in chunks
        ]
        for future in futures:
            future.result()  # Will raise AlreadyExists with 3.60.0+
4. Trigger the Bug
# Using google-cloud-spanner==3.60.0 or 3.61.0
python parallel_insert_script.py
# Result: Random AlreadyExists: 409 errors
# Using google-cloud-spanner==3.59.0
python parallel_insert_script.py
# Result: Success, no errorsExpected Behavior
All unique UUIDs should insert successfully without AlreadyExists errors. Our validation confirms that:
- All UUIDs are generated uniquely
- No duplicates exist in the dataset
- Each UUID should be inserted exactly once
Actual Behavior (3.60.0+)
- UUID generation produces unique values (validation passes)
- Parallel batch insert operations randomly fail with false AlreadyExists: 409 errors
- The error claims the UUID already exists in the table, but validation proves it doesn't
- Hypothesis: the Spanner client is doing one of the following (see the diagnostic sketch after this list):
  - Incorrectly sending duplicate insert mutations for the same row
  - Mishandling transaction/batch boundaries in parallel operations
  - Retrying a commit whose first attempt actually succeeded server-side, so the replayed mutations collide with their own rows
  - Double-processing mutations due to issues with mutation buffering in parallel contexts
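To help localize the fault, here is a minimal instrumentation sketch (the wrapper name insert_chunk_instrumented and the shared key set are ours, not library API): it records every primary key handed to the client across all workers, so a duplicate submission would be caught at the call boundary instead of being attributed to the server.
import threading

_sent_keys = set()
_sent_lock = threading.Lock()

def insert_chunk_instrumented(chunk_df, table_name):
    """Fail fast if any primary key is ever handed to the client twice."""
    keys = chunk_df['UUID'].tolist()
    with _sent_lock:
        resent = [k for k in keys if k in _sent_keys]
        if resent:
            raise RuntimeError(f'Key(s) submitted to the client twice: {resent[:5]}')
        _sent_keys.update(keys)
    # Identical to insert_chunk from the reproduction; only the bookkeeping above is new.
    with database.batch() as batch:
        batch.insert(
            table_name,
            columns=chunk_df.columns.tolist(),
            values=chunk_df.values.tolist()
        )
If this guard never fires while AlreadyExists errors still occur, the duplication has to be happening inside the client (for example in a commit retry path), not in the submission code.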
Code Excerpts
Our UUID Validation (Always Passes)
From app/src/app_common/gcp/spanner.py:
def df_batch_insert(self, name, df):
    """Insert DataFrame with UUID validation."""
    logger.debug(f"df_batch_insert: {name=}")
    # Validate no duplicate UUIDs in this batch
    if 'UUID' in df.columns:
        uuids = df['UUID'].tolist()
        unique_uuids = set(uuids)
        if len(uuids) != len(unique_uuids):
            duplicates = [uuid for uuid in unique_uuids if uuids.count(uuid) > 1]
            error_msg = f'DUPLICATE UUIDs DETECTED IN BATCH! Duplicates: {duplicates[:5]}'
            get_run_logger().error(error_msg)
            raise ValueError(error_msg)
        # Log batch identity for debugging
        get_run_logger().debug(
            f"Batch for {name}: {len(df)} rows, "
            f"first UUID: {uuids[0]}, last UUID: {uuids[-1]}"
        )
    with self.database.batch() as batch:
        batch.insert(
            name,
            columns=df.columns.tolist(),
            values=df.values.tolist()
        )
Cross-Worker Validation (Also Passes)
# Validate chunks have no overlapping UUIDs across workers
if 'UUID' in df.columns:
    all_uuids = []
    for worker_idx, worker_chunks in enumerate(chunks):
        for chunk_idx, chunk in enumerate(worker_chunks):
            chunk_uuids = chunk['UUID'].tolist()
            all_uuids.extend(chunk_uuids)
    unique_count = len(set(all_uuids))
    total_count = len(all_uuids)
    if unique_count != total_count:
        duplicates = [u for u in set(all_uuids) if all_uuids.count(u) > 1]
        error_msg = f'DUPLICATE UUIDs DETECTED ACROSS CHUNKS! Duplicates: {duplicates[:5]}'
        get_run_logger().error(error_msg)
        raise ValueError(error_msg)
    get_run_logger().info(
        f"✓ Chunk validation passed: {len(chunks)} workers, "
        f"{sum(len(w) for w in chunks)} total chunks, "
        f"{unique_count} unique UUIDs"
    )
Log Evidence
With google-cloud-spanner==3.59.0 (Working)
✓ Chunk validation passed: 8 workers, 79 total chunks, 33378 unique UUIDs
Batch for TABLE_NAME: 4172 rows, first UUID: 5b45e22f-..., last UUID: 8a3f...
Batch for TABLE_NAME: 4173 rows, first UUID: 9c2d..., last UUID: 7b1e...
...
[All 8 workers complete successfully]
AlreadyExists Errors: 0
Result: ✅ PASSED
With google-cloud-spanner==3.60.0 (Buggy)
✓ Chunk validation passed: 8 workers, 79 total chunks, 33378 unique UUIDs
Batch for TABLE_NAME: 4172 rows, first UUID: 5b45e22f-..., last UUID: 8a3f...
...
ERROR - 409 Row [5b45e22f-2f3c-4e69-9dfb-8cae1e9aedc8] in table TABLE_NAME already exists
ERROR - google.api_core.exceptions.AlreadyExists: 409 Row [5b45e22f-...] in table already exists
AlreadyExists Errors: 4
Result: ❌ FAILED
Key Observation: The UUID 5b45e22f-2f3c-4e69-9dfb-8cae1e9aedc8 appears in the error despite:
- Being validated as unique before insert
- Only being generated once in our code
- Being part of a single worker's batch
- Never being sent to Spanner more than once by our code
Additional Context
Our Testing Framework
We've developed a comprehensive regression testing framework to validate Spanner versions (a sketch of the harness follows this list):
- Automated Docker builds with specific Spanner versions
- Controlled test execution with identical datasets
- Detailed logging and error analysis
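A sketch of that harness; the image tag, the SPANNER_VERSION build argument, and the assumption that the container prints its errors to stdout are placeholders, not our exact setup:
import subprocess

VERSIONS = ['3.59.0', '3.60.0', '3.61.0']

for version in VERSIONS:
    tag = f'spanner-repro:{version}'
    # Build an otherwise-identical image with only the client version changed.
    subprocess.run(
        ['docker', 'build', '-t', tag,
         '--build-arg', f'SPANNER_VERSION={version}', '.'],
        check=True,
    )
    # Run the parallel-insert workload and count false AlreadyExists errors.
    result = subprocess.run(
        ['docker', 'run', '--rm', tag],
        capture_output=True, text=True,
    )
    errors = result.stdout.count('AlreadyExists: 409')
    print(f'{version}: {errors} AlreadyExists errors')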
Workaround
# In requirements.txt
google-cloud-spanner==3.59.0  # Pin to last known good version
Request to Google Team
1. Investigate changes between 3.59.0 and 3.60.0 related to:
   - Batch insert implementation
   - Mutation handling in concurrent contexts
   - Transaction boundary management
   - Retry/idempotency logic (see the idempotent-mutation sketch after this list)
2. Provide guidance on:
   - Recommended patterns for parallel batch inserts
   - Best practices for concurrent Spanner operations
   - Whether this is a known issue with a fix in progress
3. Timeline for a fix in upcoming releases
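On the retry/idempotency point, a minimal sketch of an idempotent variant using batch.insert_or_update() (upsert): a replayed commit cannot trip over it, because re-applying the same row is a no-op. Note this changes semantics from insert to upsert, so we offer it as a diagnostic rather than a drop-in fix:
def insert_chunk_idempotent(chunk_df, table_name):
    """Upsert variant: a retried commit that already applied is harmless."""
    with database.batch() as batch:
        # insert_or_update overwrites on key collision instead of raising
        # AlreadyExists, so duplicate application of the same mutation is safe.
        batch.insert_or_update(
            table_name,
            columns=chunk_df.columns.tolist(),
            values=chunk_df.values.tolist()
        )
If the false AlreadyExists errors disappear under insert_or_update while row counts stay correct, that would further implicate a duplicated-commit path in the client.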
Version Information
>>> import google.cloud.spanner
>>> google.cloud.spanner.__version__
'3.60.0' # or '3.61.0' - both exhibit the bug
>>> import sys
>>> sys.version
'3.10.14 (main, ...) [GCC 12.2.0]'
Note: We've validated this extensively through automated regression testing and are confident this is a client library bug producing false AlreadyExists errors. Our pre-insert validation conclusively proves all primary keys are unique when sent to Spanner, yet the client reports them as duplicates. This is not a user code issue.