Broadside: support partitioning by submitted time · feature #4774
mauriceyap wants to merge 6 commits into master
Conversation
Greptile Summary
This PR adds a feature toggle to Broadside which range-partitions the job table by submitted time. Key changes and issues found:
Confidence Score: 3/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant I as Ingester
    participant PD as PostgresDatabase
    participant PG as PostgreSQL
    rect rgb(220, 240, 255)
        Note over PD,PG: InitialiseSchema with PartitionBySubmitted
        PD->>PG: "partition_up.sql: rename job to job_unpartitioned, create partitioned job"
        PD->>PG: "CREATE TABLE job_pYYYYMMDD per jobAgeDays + today+2 buffer"
        PD->>PG: "CREATE TABLE job_default PARTITION OF job DEFAULT"
        PD->>PG: "INSERT INTO job SELECT * FROM job_unpartitioned"
        PD->>PG: "DROP TABLE job_unpartitioned"
    end
    rect rgb(255, 240, 220)
        Note over I,PG: ExecuteIngestionQueryBatch partitioned path
        I->>PD: "ExecuteIngestionQueryBatch(queries)"
        PD->>PD: "buildJobSubmittedMap + queriesToInstructionSet"
        Note over PD: "Phase 1 sequential"
        PD->>PG: "CopyFrom job (createJobsPartitioned)"
        PD->>PG: "CopyFrom job_spec (createJobSpecsPartitioned)"
        Note over PD: "Phase 2 parallel"
        par
            PD->>PG: "UPDATE job via temp table (updateJobsPartitioned)"
        and
            PD->>PG: "CopyFrom job_run (createJobRunsPartitioned)"
        and
            PD->>PG: "CopyFrom job_error (createJobErrorsPartitioned)"
        end
        Note over PD: "Phase 3 via LookoutDb"
        PD->>PG: "UPDATE job_run (LookoutDb.UpdateJobRuns)"
    end
    rect rgb(220, 255, 220)
        Note over PD,PG: PopulateHistoricalJobs partitioned path
        loop "per age bucket (ageDays in JobAgeDays)"
            PD->>PD: "baseTime = Now().Truncate(24h) minus ageDays"
            PD->>PD: "partitionName = job_pYYYYMMDD"
            PD->>PG: "INSERT INTO job_pYYYYMMDD WHERE i%nBuckets=bucketIdx"
            PD->>PG: "INSERT INTO job_spec WHERE i%nBuckets=bucketIdx"
            PD->>PG: "INSERT INTO job_run WHERE i%nBuckets=bucketIdx"
            PD->>PG: "INSERT INTO job_error WHERE i%nBuckets=bucketIdx"
        end
    end
    rect rgb(255, 220, 220)
        Note over PD,PG: TearDown with PartitionBySubmitted
        PD->>PG: "TRUNCATE all tables"
        PD->>PG: "partition_down.sql: DROP TABLE job CASCADE, recreate unpartitioned job"
        PD->>PG: "DROP COLUMN submitted from job_run, job_spec, job_error"
    end
```
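The InitialiseSchema section above amounts to a table swap. A minimal sketch of what a `partition_up.sql`-style migration does is below — the column list, types, and dates are illustrative assumptions, not the actual script from this PR:

```sql
-- Sketch only: illustrates the rename-and-recreate flow, not the real script.
-- 1. Move the existing table aside.
ALTER TABLE job RENAME TO job_unpartitioned;

-- 2. Recreate job range-partitioned on submitted. The primary key of a
--    partitioned table must include the partition key, hence (job_id, submitted).
CREATE TABLE job (
    job_id    varchar(32) NOT NULL,
    submitted timestamp   NOT NULL,
    -- ... remaining columns as in job_unpartitioned ...
    PRIMARY KEY (job_id, submitted)
) PARTITION BY RANGE (submitted);

-- 3. One partition per day, covering jobAgeDays back plus a today+2 buffer;
--    e.g. for 2024-06-01 (an example date):
CREATE TABLE job_p20240601 PARTITION OF job
    FOR VALUES FROM ('2024-06-01') TO ('2024-06-02');

-- 4. Catch-all partition for rows outside the daily ranges.
CREATE TABLE job_default PARTITION OF job DEFAULT;

-- 5. Copy the data across, then drop the old table.
INSERT INTO job SELECT * FROM job_unpartitioned;
DROP TABLE job_unpartitioned;
```

The default partition matters here: without it, any row whose `submitted` falls outside the pre-created daily ranges would make the `INSERT ... SELECT` fail.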
Force-pushed from cf023e3 to 9850377, then from 323fb9b to fd8bbdd.

@Mergifyio refresh

✅ Pull request refreshed

@Mergifyio update

☑️ Nothing to do, the required conditions are not met

@Mergifyio refresh

✅ Pull request refreshed

@Mergifyio rebase

❌ Unable to rebase: Mergify can't impersonate
This adds a feature toggle to Broadside which range-partitions the Postgres `job` table by the `submitted` timestamp with daily partitions. This is mutually exclusive with the existing `hotColdSplit` toggle. When enabled:

- The `job` table PK becomes `(job_id, submitted)`.
- Child tables (`job_run`, `job_spec`, `job_error`) gain a `submitted` column.
- Ingestion bypasses `LookoutDb` (used in production), and instead uses direct SQL with `submitted` in WHERE clauses for partition pruning (this removes the risk of breaking anything running in production, at the cost of duplicating functionality - I think this is a good trade-off).
- Historical bulk inserts are split per age bucket so each INSERT targets a single partition.

This also fixes a bug where the `generate_series` SQL function hardcoded `submitted` to `NOW() - 24 hours`, ignoring the configured `jobAgeDays`. It now uses a LATERAL subquery to vary the base time per age bucket.

Signed-off-by: Maurice Yap <mauriceyap@hotmail.co.uk>
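The "direct SQL with `submitted` in WHERE clauses" point can be illustrated with a sketch — the column names here are assumed for illustration, not taken from the PR. The key idea is that an equality predicate on the partition key lets the planner prune to a single daily partition instead of scanning all of them:

```sql
-- Hypothetical update from the direct-SQL ingestion path (column names assumed).
-- Without the predicate on submitted, Postgres would have to consider every
-- partition of job; with it, the planner prunes to the one day's partition.
UPDATE job
SET    state = $1
WHERE  job_id = $2
  AND  submitted = $3;  -- equality on the partition key enables pruning
```

This is also why the ingester needs the `buildJobSubmittedMap` step shown in the diagram: updates keyed only by `job_id` must be enriched with each job's `submitted` value before they can target a single partition.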
Force-pushed from 6729360 to c6c8bc4.
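The `generate_series` fix described in the PR description can be sketched as follows — the table shape, row counts, and `jobAgeDays` values are made up for illustration, not the PR's actual query. A LATERAL subquery recomputes the base time for each age bucket, instead of every generated row sharing one hardcoded `NOW() - 24 hours`:

```sql
-- Sketch only: the real query and column list live in the PR, not here.
-- Each age bucket gets its own base_time via the LATERAL subquery, so its
-- generated rows land in the partition for that day.
INSERT INTO job (job_id, submitted)
SELECT 'job-' || a.age_days || '-' || s.i,
       t.base_time
FROM   unnest(ARRAY[1, 3, 7]) AS a(age_days)      -- stand-in for jobAgeDays
CROSS  JOIN LATERAL (
    SELECT date_trunc('day', now()) - make_interval(days => a.age_days)
           AS base_time
) AS t
CROSS  JOIN generate_series(1, 1000) AS s(i);
```

LATERAL is what makes this work: it allows the subquery to reference `a.age_days` from an earlier FROM item, so the base time varies per bucket within a single statement.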