False AlreadyExists 409 Errors During Parallel Batch Inserts (3.60.0+)
Summary
Versions 3.60.0 and 3.61.0 of the google-cloud-spanner Python client library exhibit a critical bug: during parallel batch insert operations, unique primary keys are incorrectly reported as already existing, causing spurious AlreadyExists: 409 errors despite rigorous pre-insert validation confirming that all keys are unique.
Environment
- google-cloud-spanner versions tested:
- 3.59.0 ✅ (working)
- 3.60.0 ❌ (bug introduced)
- 3.61.0 ❌ (bug persists)
- Python version: 3.10.14
- Operating System: Docker container (Debian-based, Python 3.10.14)
- Workload characteristics:
- 8 parallel workers
- 33,378 total rows across all workers (exact count from validation logs)
- ~4,172 rows per worker
- Batch insert operations using database.batch().insert()
- UUID column is the PRIMARY KEY
Bug Description
Observed Behavior
During parallel batch insert operations, the Spanner client randomly throws false AlreadyExists: 409 errors claiming that primary keys already exist in the table:
google.api_core.exceptions.AlreadyExists: 409 Row [5b45e22f-2f3c-4e69-9dfb-8cae1e9aedc8] in table TABLE_NAME already exists
Critical Evidence
Our validation PASSES before Spanner insert:
✓ Chunk validation passed: 8 workers, 79 total chunks, 33378 unique UUIDs
This proves:
- All 33,378 UUIDs are unique when sent to Spanner
- No duplicates exist in the data we're inserting
- The AlreadyExists errors are false - the bug is in the Spanner client library
Regression Test Results
We performed controlled regression testing by building identical Docker images with only the google-cloud-spanner version changed:
| Version | AlreadyExists Errors | Total Errors | Result |
|---|---|---|---|
| 3.59.0 | 0 | 2 | ✅ PASSED |
| 3.60.0 | 4 | 13 | ❌ FAILED |
| 3.61.0 | 2 | 11 | ❌ FAILED |
Conclusion: The bug was introduced between versions 3.59.0 and 3.60.0.
Reproduction Steps
1. Environment Setup
from google.cloud import spanner
from concurrent.futures import ThreadPoolExecutor
import numpy as np  # needed for np.array_split in step 3
import pandas as pd
import uuid

client = spanner.Client(project='your-project')
instance = client.instance('your-instance')
database = instance.database('your-database')
2. UUID Generation with Validation
def generate_dataframe_with_uuids(num_rows):
    """Generate DataFrame with unique UUIDs."""
    df = pd.DataFrame({
        'UUID': [str(uuid.uuid4()) for _ in range(num_rows)],
        'DATA': [f'row_{i}' for i in range(num_rows)],
        # ... other columns
    })
    # Validate uniqueness (THIS PASSES)
    uuids = df['UUID'].tolist()
    unique_uuids = set(uuids)
    assert len(uuids) == len(unique_uuids), "Duplicates detected in generation!"
    return df
3. Parallel Batch Insert
def insert_chunk(chunk_df, table_name):
    """Insert a chunk using batch insert."""
    with database.batch() as batch:
        batch.insert(
            table_name,
            columns=chunk_df.columns.tolist(),
            values=chunk_df.values.tolist()
        )

def parallel_insert(df, table_name, num_workers=8):
    """Perform parallel batch inserts."""
    chunks = np.array_split(df, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        futures = [
            executor.submit(insert_chunk, chunk, table_name)
            for chunk in chunks
        ]
        for future in futures:
            future.result()  # Will raise AlreadyExists with 3.60.0+
4. Trigger the Bug
# Using google-cloud-spanner==3.60.0 or 3.61.0
python parallel_insert_script.py
# Result: Random AlreadyExists: 409 errors
# Using google-cloud-spanner==3.59.0
python parallel_insert_script.py
# Result: Success, no errorsExpected Behavior
All unique UUIDs should insert successfully without AlreadyExists errors. Our validation confirms that:
- All UUIDs are generated uniquely
- No duplicates exist in the dataset
- Each UUID should be inserted exactly once
Actual Behavior (3.60.0+)
- UUID generation produces unique values (validation passes)
- Parallel batch insert operations randomly fail with false AlreadyExists: 409 errors
- The error claims the UUID already exists in the table, but validation proves it doesn't
- Hypothesis: the Spanner client is doing one of the following (see the diagnostic sketch after this list):
  - Incorrectly sending duplicate insert mutations for the same row
  - Mishandling transaction/batch boundaries in parallel operations
  - Retrying a commit whose first attempt actually succeeded server-side, so the replayed mutations collide with their own rows
  - Double-processing mutations due to issues with mutation buffering in parallel contexts
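To help localize the fault, here is a minimal instrumentation sketch (the wrapper name insert_chunk_instrumented and the shared key set are ours, not library API): it records every primary key handed to the client across all workers, so a duplicate submission would be caught at the call boundary instead of being attributed to the server.
import threading

_sent_keys = set()
_sent_lock = threading.Lock()

def insert_chunk_instrumented(chunk_df, table_name):
    """Fail fast if any primary key is ever handed to the client twice."""
    keys = chunk_df['UUID'].tolist()
    with _sent_lock:
        resent = [k for k in keys if k in _sent_keys]
        if resent:
            raise RuntimeError(f'Key(s) submitted to the client twice: {resent[:5]}')
        _sent_keys.update(keys)
    # Identical to insert_chunk from the reproduction; only the bookkeeping above is new.
    with database.batch() as batch:
        batch.insert(
            table_name,
            columns=chunk_df.columns.tolist(),
            values=chunk_df.values.tolist()
        )
If this guard never fires while AlreadyExists errors still occur, the duplication has to be happening inside the client (for example in a commit retry path), not in the submission code.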
Code Excerpts
Our UUID Validation (Always Passes)
From app/src/app_common/gcp/spanner.py:
def df_batch_insert(self, name, df):
    """Insert DataFrame with UUID validation."""
    logger.debug(f"df_batch_insert: {name=}")
    # Validate no duplicate UUIDs in this batch
    if 'UUID' in df.columns:
        uuids = df['UUID'].tolist()
        unique_uuids = set(uuids)
        if len(uuids) != len(unique_uuids):
            duplicates = [uuid for uuid in unique_uuids if uuids.count(uuid) > 1]
            error_msg = f'DUPLICATE UUIDs DETECTED IN BATCH! Duplicates: {duplicates[:5]}'
            get_run_logger().error(error_msg)
            raise ValueError(error_msg)
        # Log batch identity for debugging
        get_run_logger().debug(
            f"Batch for {name}: {len(df)} rows, "
            f"first UUID: {uuids[0]}, last UUID: {uuids[-1]}"
        )
    with self.database.batch() as batch:
        batch.insert(
            name,
            columns=df.columns.tolist(),
            values=df.values.tolist()
        )
Cross-Worker Validation (Also Passes)
# Validate chunks have no overlapping UUIDs across workers
if 'UUID' in df.columns:
    all_uuids = []
    for worker_idx, worker_chunks in enumerate(chunks):
        for chunk_idx, chunk in enumerate(worker_chunks):
            chunk_uuids = chunk['UUID'].tolist()
            all_uuids.extend(chunk_uuids)
    unique_count = len(set(all_uuids))
    total_count = len(all_uuids)
    if unique_count != total_count:
        duplicates = [u for u in set(all_uuids) if all_uuids.count(u) > 1]
        error_msg = f'DUPLICATE UUIDs DETECTED ACROSS CHUNKS! Duplicates: {duplicates[:5]}'
        get_run_logger().error(error_msg)
        raise ValueError(error_msg)
    get_run_logger().info(
        f"✓ Chunk validation passed: {len(chunks)} workers, "
        f"{sum(len(w) for w in chunks)} total chunks, "
        f"{unique_count} unique UUIDs"
    )
Log Evidence
With google-cloud-spanner==3.59.0 (Working)
✓ Chunk validation passed: 8 workers, 79 total chunks, 33378 unique UUIDs
Batch for TABLE_NAME: 4172 rows, first UUID: 5b45e22f-..., last UUID: 8a3f...
Batch for TABLE_NAME: 4173 rows, first UUID: 9c2d..., last UUID: 7b1e...
...
[All 8 workers complete successfully]
AlreadyExists Errors: 0
Result: ✅ PASSED
With google-cloud-spanner==3.60.0 (Buggy)
✓ Chunk validation passed: 8 workers, 79 total chunks, 33378 unique UUIDs
Batch for TABLE_NAME: 4172 rows, first UUID: 5b45e22f-..., last UUID: 8a3f...
...
ERROR - 409 Row [5b45e22f-2f3c-4e69-9dfb-8cae1e9aedc8] in table TABLE_NAME already exists
ERROR - google.api_core.exceptions.AlreadyExists: 409 Row [5b45e22f-...] in table already exists
AlreadyExists Errors: 4
Result: ❌ FAILED
Key Observation: The UUID 5b45e22f-2f3c-4e69-9dfb-8cae1e9aedc8 appears in the error despite:
- Being validated as unique before insert
- Only being generated once in our code
- Being part of a single worker's batch
- Never being sent to Spanner more than once by our code
Additional Context
Our Testing Framework
We've developed a comprehensive regression testing framework to validate Spanner versions (a sketch of the harness follows this list):
- Automated Docker builds with specific Spanner versions
- Controlled test execution with identical datasets
- Detailed logging and error analysis
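A sketch of that harness; the image tag, the SPANNER_VERSION build argument, and the assumption that the container prints its errors to stdout are placeholders, not our exact setup:
import subprocess

VERSIONS = ['3.59.0', '3.60.0', '3.61.0']

for version in VERSIONS:
    tag = f'spanner-repro:{version}'
    # Build an otherwise-identical image with only the client version changed.
    subprocess.run(
        ['docker', 'build', '-t', tag,
         '--build-arg', f'SPANNER_VERSION={version}', '.'],
        check=True,
    )
    # Run the parallel-insert workload and count false AlreadyExists errors.
    result = subprocess.run(
        ['docker', 'run', '--rm', tag],
        capture_output=True, text=True,
    )
    errors = result.stdout.count('AlreadyExists: 409')
    print(f'{version}: {errors} AlreadyExists errors')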
Workaround
# In requirements.txt
google-cloud-spanner==3.59.0  # Pin to last known good version
Request to Google Team
1. Investigate changes between 3.59.0 and 3.60.0 related to:
   - Batch insert implementation
   - Mutation handling in concurrent contexts
   - Transaction boundary management
   - Retry/idempotency logic (see the idempotent-mutation sketch after this list)
2. Provide guidance on:
   - Recommended patterns for parallel batch inserts
   - Best practices for concurrent Spanner operations
   - Whether this is a known issue with a fix in progress
3. Timeline for a fix in upcoming releases
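On the retry/idempotency point, a minimal sketch of an idempotent variant using batch.insert_or_update() (upsert): a replayed commit cannot trip over it, because re-applying the same row is a no-op. Note this changes semantics from insert to upsert, so we offer it as a diagnostic rather than a drop-in fix:
def insert_chunk_idempotent(chunk_df, table_name):
    """Upsert variant: a retried commit that already applied is harmless."""
    with database.batch() as batch:
        # insert_or_update overwrites on key collision instead of raising
        # AlreadyExists, so duplicate application of the same mutation is safe.
        batch.insert_or_update(
            table_name,
            columns=chunk_df.columns.tolist(),
            values=chunk_df.values.tolist()
        )
If the false AlreadyExists errors disappear under insert_or_update while row counts stay correct, that would further implicate a duplicated-commit path in the client.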
Version Information
>>> import google.cloud.spanner
>>> google.cloud.spanner.__version__
'3.60.0' # or '3.61.0' - both exhibit the bug
>>> import sys
>>> sys.version
'3.10.14 (main, ...) [GCC 12.2.0]'
Note: We've validated this extensively through automated regression testing and are confident this is a client library bug producing false AlreadyExists errors. Our pre-insert validation conclusively proves all primary keys are unique when sent to Spanner, yet the client reports them as duplicates. This is not a user code issue.