Skip to content

Conversation

@aliculPix4D
Copy link

lets sync one more time...

analytically and others added 30 commits February 25, 2025 10:30
The maxValidFromFile function was incorrectly using the hardcoded uidMap
path regardless of the filename parameter passed to it. This caused the
MaxValidIds method to always read from /proc/self/uid_map twice instead
of reading from both uid_map and gid_map files.

Fixed by modifying maxValidFromFile to properly use the fname parameter
both when opening the file and when generating error messages.

Signed-off-by: Mathias Bogaert <mathias.bogaert@gmail.com>
Co-authored-by: Taylor Silva <dev@taydev.net>
Signed-off-by: Mathias Bogaert <mathias.bogaert@gmail.com>
Co-authored-by: Taylor Silva <dev@taydev.net>
In the case where we failed to create a container we would log the error
twice in the web node and also mark it as failed twice. The second call
to mark the container as failed would itself fail and return an error
because the container is already marked as failed.

Signed-off-by: Taylor Silva <dev@taydev.net>
We found UUID collisions occuring in our cluster. We thought it was
impossible but I decided to look into the library we were using for UUID
generation and found this issue: nu7hatch/gouuid#28

Which would explain some of the errors I've seen where the only possible
explanation was that we got the same UUID twice.

Signed-off-by: Taylor Silva <dev@taydev.net>
Fix bug in maxValidFromFile using hardcoded path
on systems where the test was slow to run it seemed that this testing of
the batch duration would flake a lot. I think there was also a general
race condition from accessing the NewRelicBatch array since it does get
cleared out in a go routine. I left NewRelicBatch public still because
it makes the testing setup easier, but now there's a Batch() function
that can be used to safely access the array.

Signed-off-by: Taylor Silva <dev@taydev.net>
Use github.com/google/uuid to generate UUIDv4
Makes process killing concurrent by launching Kill operations in
separate goroutines. The buffered error channel ensures we don't block
goroutines while waiting for errors to be processed. Returns on first
error to maintain original behavior. Each Kill operation executes
independently, allowing multiple processes to be killed simultaneously
while still respecting grace periods and proper error handling.

Signed-off-by: Mathias Bogaert <mathias.bogaert@gmail.com>
Co-authored-by: Taylor Silva <dev@taydev.net>
…om-go-jose-go-jose-v3-vulnerability

Update module github.com/go-jose/go-jose/v3 to v3.0.4 [SECURITY]
We can use the built-in `min` and `max` functions since Go 1.21.

Reference: https://go.dev/ref/spec#Min_and_max

Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
We found that if a container wasn't correctly made, specifically if the
`/etc/hosts` file of the container wasn't populated with the containers
IP, container cleanup would fail with this error:

```
resume container traffic: error getting container IP: ip not found for container handle: 0ce036f8-aa2a-4bea-acbd-8ee48e5fa63b
```

We can ignore this error when deleting a container because it tells us
there are no iptables rules to remove anyways. We can probably continue
deleting the container and will delete it successfully.

Signed-off-by: Taylor Silva <dev@taydev.net>
…lookup-errors

worker runtime: make container deletion more robust
Replace min/max helpers with built-in min/max
- Fix variable source diffs printing to stdout instead of provided writer
- Enhance practicallyDifferent to properly handle nil values
- nit: change interface{} to any, slowly modernizing go code

Signed-off-by: Mathias Bogaert <mathias.bogaert@gmail.com>
Co-authored-by: Taylor Silva <dev@taydev.net>
Signed-off-by: Mathias Bogaert <mathias.bogaert@gmail.com>
Co-authored-by: Taylor Silva <dev@taydev.net>
The error message when encountering an invalid placement strategy incorrectly
referenced the strategy variable (the accumulating result) rather than the
specific invalid strategy string that caused the error. This resulted in
misleading error messages that showed all strategies added so far instead
of just the invalid one.

Changed the error message to reference the specific invalid strategy string
for clearer troubleshooting.

nit: used newer `range int` feature in Go to make loop more readable

Signed-off-by: Mathias Bogaert <mathias.bogaert@gmail.com>
Co-authored-by: Taylor Silva <dev@taydev.net>
Replace mutex-based implementation with Go's atomic types to reduce contention
and improve performance under high concurrency. This change eliminates lock
overhead while maintaining thread safety.

Signed-off-by: Mathias Bogaert <mathias.bogaert@gmail.com>
Co-authored-by: Taylor Silva <dev@taydev.net>
…reds

Ensure the read lock is properly released using defer to guarantee thread safety
when iterating over credentials. This prevents a possible race condition if
multiple goroutines access the Tracker concurrently.

nit: modernize code by using any instead of interface{}

Signed-off-by: Mathias Bogaert <mathias.bogaert@gmail.com>
Co-authored-by: Taylor Silva <dev@taydev.net>
The scanner.scanResources method was creating a goroutine for each
resource without any concurrency limits. For deployments with many
resources, this could lead to resource exhaustion on the web node and
possibly the database.

This commit introduces a concurrency limit of 50 simultaneous resource
checks. The implementation spawns 50 go routines and works through
resources passed down a channel.

Additionally, improved context handling was added to ensure proper
cleanup when the scanner is being shut down or the context is otherwise
canceled. This prevents orphaned goroutines from continuing to run after
their parent process has been signaled to stop.

These changes make the scanner more robust and predictable under high
load while maintaining the existing parallel checking behavior.

Signed-off-by: Mathias Bogaert <mathias.bogaert@gmail.com>
Co-authored-by: Taylor Silva <dev@taydev.net>
Signed-off-by: Taylor Silva <dev@taydev.net>
…-test

CI: only run baggageclaimcmd test on linux
1. Fixes improper use of defer Close(rows) inside loops in the JobsToSchedule
  method which would cause resource leaks, as deferred calls would only
  execute when the function returns, not at the end of loop iterations.

2. Removes redundant database query in GetPendingBuilds method that was
  re-fetching job information already available in the job struct.

3. Adds missing defer Close(rows) calls in multiple methods

4. Ensures proper error handling when scanning database rows.

Signed-off-by: Mathias Bogaert <mathias.bogaert@gmail.com>
Co-authored-by: Taylor Silva <dev@taydev.net>
This commit adds implementations for the context-aware methods
in the logConn wrapper that were previously missing. The DbConn
interface requires several context-aware methods (ExecContext,
QueryContext, QueryRowContext, PrepareContext, BeginTx) which
were not explicitly implemented in logConn.

Without these implementations, any context-aware database operations
would bypass the logging functionality, leading to inconsistent
logging behavior. The fix ensures that all database operations,
both with and without context, are properly logged.

Additionally, this commit completes the logDbTx implementation by
adding missing context methods to maintain consistency throughout
the entire logging wrapper system.

Signed-off-by: Mathias Bogaert <mathias.bogaert@gmail.com>
Co-authored-by: Taylor Silva <dev@taydev.net>
Refactored the CleanUpInvalidTaskCaches function to use a single SQL
operation instead of two separate queries. This change improves code
readability and efficiency by using a subquery directly within the
DELETE statement to identify and remove inactive task caches in a
single database operation.

Signed-off-by: Mathias Bogaert <mathias.bogaert@gmail.com>
Co-authored-by: Taylor Silva <dev@taydev.net>
Signed-off-by: Mathias Bogaert <mathias.bogaert@gmail.com>
Co-authored-by: Taylor Silva <dev@taydev.net>
- Apply early continue pattern in constrainingCandidates() to reduce nesting and
 improve code clarity
- Pre-allocate slice capacity in orderJobs() to avoid multiple allocations during
 append operations

These changes maintain identical functionality while making the code more
maintainable and slightly more efficient.

Signed-off-by: Mathias Bogaert <mathias.bogaert@gmail.com>
Co-authored-by: Taylor Silva <dev@taydev.net>
The Get and Set methods of ContainerSpec incorrectly used strings.SplitN with
arguments in the wrong order (separator and string were swapped). This bug would
prevent proper parsing of environment variables for tracing propagation.

- Fixed strings.SplitN call in Get method
- Fixed strings.SplitN call in Set method

Signed-off-by: Mathias Bogaert <mathias.bogaert@gmail.com>
Co-authored-by: Taylor Silva <dev@taydev.net>
Fix several issues in pipeline cycle detection:
- Reset visited state for each root job to prevent false positives
- Replace numeric constants with JobState enum for better clarity
- Ensure immediate error propagation when cycle is detected

Signed-off-by: Mathias Bogaert <mathias.bogaert@gmail.com>
Co-authored-by: Taylor Silva <dev@taydev.net>
taylorsilva and others added 8 commits March 5, 2025 19:07
fly: improve performance of fly watch
web: Update frontend graphics dependencies
Signed-off-by: Maria Shaldybin <maria.shaldybin@broadcom.com>
 Mount /sys/fs/cgroup as cgroup2 type if supported
We found that when the worker is restarted that all containers would be
present, but any tasks running in them would disappear. This would result
in TaskNotFound errors, mostly in resource checks.

Resource checks are more prone to this type of error because we re-use
their containers over a long period of time. The other types of
containers we make (get, put, task) usually don't hang around that long.

We initially create a Task during container creation and assume that
task would still be there when we go to exec the actual executable that
we wanted to run (e.g. /opt/resource/{in,out,check}). For long running
containers this may not be true but we can gracefully recover in this
scenario.

Signed-off-by: Taylor Silva <dev@taydev.net>
Signed-off-by: Taylor Silva <dev@taydev.net>
Gracefully recover from containerd TaskNotFound errors
@aliculPix4D aliculPix4D requested a review from a team as a code owner March 7, 2025 10:17
taylorsilva and others added 13 commits March 7, 2025 11:39
We've had a few PR's that are specific to our linux-containerd
implementation: concourse#9017 and concourse#9094

After these PR's folks in the community found that they couldn't compile
Concourse for non-Linux platforms. This commit tries to make it easier
for future contributors to make changes to the containerd-linux runtime
and not accidently break things for non-linux platforms.

Signed-off-by: Taylor Silva <dev@taydev.net>
Signed-off-by: Taylor Silva <dev@taydev.net>
worker: split linux and non-linux features
Signed-off-by: IvanChalukov <ichalukov@gmail.com>
Previously we could spawn as many in-memory checks and they wouldn't be
checked for duplication until the atc/db/lock acquired a Mutex and
tried to get a lock from the database.

I'm _guessing_, based on the info from
concourse#8638 that the contention
for this lock would get really high. I'm guessing this and other users
cluster are very high usage clusters. It's possible that lidar may send
off multiple checks for the same resource which would result in multiple
go routines running and fighting over the same lock. Lidar would send
off multiple in-memory checks if every time it runs it still finds that
a resource has exceeded it's check interval.

I'm not 100% sure this is the problem and I don't have an easy way to
test this. This is all based on my reading of the code as it is today
and the pprof memory graph reported from users that showed a lot of
in-memory checks waiting on the db/lock Mutex.

Signed-off-by: Taylor Silva <dev@taydev.net>
this code isn't being used anywhere. Dead code.

Signed-off-by: Taylor Silva <dev@taydev.net>
and removed deprecated calls to rand.Seed() which is a no-op starting in
Go 1.24

Signed-off-by: Taylor Silva <dev@taydev.net>
web: avoid duplicate in-memory checks
…laim

worker: fix baggageclaim tests on windows
Fix `--team` flag in `order-pipelines` command
Signed-off-by: IvanChalukov <ichalukov@gmail.com>
…che-team

Add `--team` flag to `clear-resource-cache` command
@aliculPix4D aliculPix4D merged commit 83194c2 into master Mar 10, 2025
10 checks passed
@aliculPix4D aliculPix4D deleted the merge-upstream-2025-03-07 branch March 10, 2025 11:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants