Before we go down a rabbit hole: are buildProgram and createKernel parallelized? That is, within the same context, can we run several kernel compilations concurrently and actually see a benefit from parallel execution under the hood? We've been working on cutting down our load times, and while preloading our kernels got us most of the way, we still have a few load-time kernels where compilation is a noticeable bottleneck.
If concurrent CL compilation is believed to work but isn't well tested, we're happy to contribute to the test suite.
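To make the question concrete, here is a rough sketch of the kind of timing check we have in mind. It is only an illustration: `fakeCompile` is a hypothetical stand-in that simulates a fixed compile latency with a timer, not the real buildProgram/createKernel binding, and the `Compile` type, function names, and the 50 ms figure are all assumptions on our part.

```typescript
// Hypothetical stand-in for a real buildProgram + createKernel call.
// It only simulates a fixed compile latency; swap in the real binding to test.
type Compile = (source: string) => Promise<string>;

const fakeCompile: Compile = (source) =>
  new Promise((resolve) => setTimeout(() => resolve(`kernel:${source}`), 50));

// Compile one source at a time and return the elapsed wall-clock milliseconds.
async function timeSequential(compile: Compile, sources: string[]): Promise<number> {
  const start = Date.now();
  for (const source of sources) {
    await compile(source);
  }
  return Date.now() - start;
}

// Start every compilation up front, wait for all of them,
// and return the elapsed wall-clock milliseconds.
async function timeConcurrent(compile: Compile, sources: string[]): Promise<number> {
  const start = Date.now();
  await Promise.all(sources.map((source) => compile(source)));
  return Date.now() - start;
}

// Run both strategies against the same compile function and report both timings.
async function compare(
  compile: Compile,
  sources: string[],
): Promise<{ sequentialMs: number; concurrentMs: number }> {
  const sequentialMs = await timeSequential(compile, sources);
  const concurrentMs = await timeConcurrent(compile, sources);
  return { sequentialMs, concurrentMs };
}
```

With the stand-in, the concurrent number comes out well below the sequential one because the timers overlap. With the real binding that would only happen if the build actually runs off the main thread; if the call is synchronous, it blocks and the two numbers should come out about the same, which is exactly what we'd like to find out.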
(Been a while, @mikeseven!)