Description
Expected Behaviour
The agent batch logs (i.e. the ones that can be viewed by clicking the View Log button in an agent header in a job) contain few, or ideally no, gRPC failures.
Current Behaviour
The agent batch logs are full of the following errors:
Exception on log tailing task (<log-id>): Status(StatusCode="Internal", Detail="Error starting gRPC call. HttpRequestException: The HTTP/2 server reset the stream. HTTP/2 error code 'PROTOCOL_ERROR' (0x1). (HttpProtocolError) HttpProtocolException: The HTTP/2 server reset the stream. HTTP/2 error code 'PROTOCOL_ERROR' (0x1). (HttpProtocolError)", DebugException="System.Net.Http.HttpRequestException: The HTTP/2 server reset the stream. HTTP/2 error code 'PROTOCOL_ERROR' (0x1). (HttpProtocolError)")
It also sometimes makes jobs fail completely, when it triggers the following unhandled exception, which is also gRPC-related:
Unhandled exception. System.AggregateException: An error occurred while writing to logger(s). (Object reference not set to an instance of an object.)
---> System.NullReferenceException: Object reference not set to an instance of an object.
at EpicGames.Core.LoggerScopeCollection.GetProperties()+MoveNext() in <workspace>\Engine\Source\Programs\Shared\EpicGames.Core\Log.cs:line 831
at EpicGames.Core.LogEvent.MergedPropertyList.AddRange(IEnumerable`1 properties) in <workspace>\Engine\Source\Programs\Shared\EpicGames.Core\LogEvent.cs:line 106
at EpicGames.Core.LogEvent.AddProperties(IEnumerable`1 properties) in <workspace>\Engine\Source\Programs\Shared\EpicGames.Core\LogEvent.cs:line 177
at EpicGames.Core.DefaultLogger.Log[TState](LogLevel logLevel, EventId eventId, TState state, Exception exception, Func`3 formatter) in <workspace>\Engine\Source\Programs\Shared\EpicGames.Core\Log.cs:line 1283
at Microsoft.Extensions.Logging.Logger.<Log>g__LoggerLog|14_0[TState](LogLevel logLevel, EventId eventId, ILogger logger, Exception exception, Func`3 formatter, List`1& exceptions, TState& state)
--- End of inner exception stack trace ---
at Microsoft.Extensions.Logging.Logger.ThrowLoggingError(List`1 exceptions)
at Microsoft.Extensions.Logging.Logger.Log[TState](LogLevel logLevel, EventId eventId, TState state, Exception exception, Func`3 formatter)
at Microsoft.Extensions.Logging.LoggerMessage.<>c__DisplayClass8_0.<Define>g__Log|0(ILogger logger, Exception exception)
at Grpc.Net.Client.Internal.GrpcCallLog.ErrorExceedingDeadline(ILogger logger, Exception ex)
at Grpc.Net.Client.Internal.GrpcCall`2.DeadlineExceededCallback(Object state)
at System.Threading.TimerQueueTimer.Fire(Boolean isThreadPool)
at System.Threading.TimerQueue.FireNextTimers()
at System.Threading.ThreadPoolWorkQueue.Dispatch()
at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
Driver finished with exit code -532462766
The failures seem to be more frequent during long-running jobs, such as a cook with shader compilation.
I can provide logs with more info if necessary, but they do not make the errors much clearer.
Possible Solution
I have tried changing the gRPC target groups' protocol from HTTP2 (the current setting) to GRPC, because HTTP/2 ping frames are required for the keep-alive signal. This has not fixed the issue, but I think it may point toward the correct solution: this looks like a networking issue, and I have not experienced it with other Horde setups (such as Docker Compose).
Steps to Reproduce
- Deploy a Horde server using the CGD Toolkit
- Connect agents to it
  - Note that mine are regular EC2 machines controlled using the AwsRecycle strategy, but this shouldn't matter, as they are running the same Horde agent that any machine would.
- Run a packaged build job (helps to have a longer job)
- Look at an agent's batch logs
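To put a number on how full the logs are, the last step can be sketched as a small script. This is only a rough sketch and not part of Horde or the toolkit: it assumes the batch log has been saved locally as plain text, and the function name `count_grpc_failures` is made up for illustration. It counts lines that contain a gRPC `Status(...)` shaped like the error quoted above, grouped by status code and HTTP/2 error code.

```python
import re
import sys
from collections import Counter

# Patterns based on the error text quoted in this issue, e.g.
#   Status(StatusCode="Internal", ... HTTP/2 error code 'PROTOCOL_ERROR' (0x1) ...)
STATUS_RE = re.compile(r'StatusCode="(?P<status>\w+)"')
HTTP2_RE = re.compile(r"HTTP/2 error code '(?P<code>\w+)'")


def count_grpc_failures(lines):
    """Count lines carrying a gRPC status, keyed by (status, HTTP/2 code).

    Hypothetical helper: the HTTP/2 code is None when the line has no
    "HTTP/2 error code" fragment.
    """
    counts = Counter()
    for line in lines:
        status = STATUS_RE.search(line)
        if status is None:
            continue  # not a gRPC status line
        http2 = HTTP2_RE.search(line)
        key = (status.group("status"), http2.group("code") if http2 else None)
        counts[key] += 1
    return counts


# Usage (assumed): python count_grpc_failures.py <saved-batch-log.txt>
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log_file:
        for (status, code), n in count_grpc_failures(log_file).most_common():
            print(f"{n:6d}  StatusCode={status}  HTTP/2={code}")
```

Comparing the counts for a short job and a long cook would show whether the failures really scale with job duration.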
Cloud Game Development Toolkit version
latest