Bug: Horde failing some gRPC calls #855

@imjuniper

Expected Behaviour

The agent batch logs (i.e. the ones that can be viewed by clicking the View Log button in an agent header in a job) should contain few gRPC failures, and ideally none.

Current Behaviour

The agent batch logs are full of the following errors:

Exception on log tailing task (<log-id>): Status(StatusCode="Internal", Detail="Error starting gRPC call. HttpRequestException: The HTTP/2 server reset the stream. HTTP/2 error code 'PROTOCOL_ERROR' (0x1). (HttpProtocolError) HttpProtocolException: The HTTP/2 server reset the stream. HTTP/2 error code 'PROTOCOL_ERROR' (0x1). (HttpProtocolError)", DebugException="System.Net.Http.HttpRequestException: The HTTP/2 server reset the stream. HTTP/2 error code 'PROTOCOL_ERROR' (0x1). (HttpProtocolError)")

Sometimes this also causes jobs to fail completely, when it triggers the following unhandled exception, which is also gRPC-related:

Unhandled exception. System.AggregateException: An error occurred while writing to logger(s). (Object reference not set to an instance of an object.)
 ---> System.NullReferenceException: Object reference not set to an instance of an object.
   at EpicGames.Core.LoggerScopeCollection.GetProperties()+MoveNext() in <workspace>\Engine\Source\Programs\Shared\EpicGames.Core\Log.cs:line 831
   at EpicGames.Core.LogEvent.MergedPropertyList.AddRange(IEnumerable`1 properties) in <workspace>\Engine\Source\Programs\Shared\EpicGames.Core\LogEvent.cs:line 106
   at EpicGames.Core.LogEvent.AddProperties(IEnumerable`1 properties) in <workspace>\Engine\Source\Programs\Shared\EpicGames.Core\LogEvent.cs:line 177
   at EpicGames.Core.DefaultLogger.Log[TState](LogLevel logLevel, EventId eventId, TState state, Exception exception, Func`3 formatter) in <workspace>\Engine\Source\Programs\Shared\EpicGames.Core\Log.cs:line 1283
   at Microsoft.Extensions.Logging.Logger.<Log>g__LoggerLog|14_0[TState](LogLevel logLevel, EventId eventId, ILogger logger, Exception exception, Func`3 formatter, List`1& exceptions, TState& state)
   --- End of inner exception stack trace ---
   at Microsoft.Extensions.Logging.Logger.ThrowLoggingError(List`1 exceptions)
   at Microsoft.Extensions.Logging.Logger.Log[TState](LogLevel logLevel, EventId eventId, TState state, Exception exception, Func`3 formatter)
   at Microsoft.Extensions.Logging.LoggerMessage.<>c__DisplayClass8_0.<Define>g__Log|0(ILogger logger, Exception exception)
   at Grpc.Net.Client.Internal.GrpcCallLog.ErrorExceedingDeadline(ILogger logger, Exception ex)
   at Grpc.Net.Client.Internal.GrpcCall`2.DeadlineExceededCallback(Object state)
   at System.Threading.TimerQueueTimer.Fire(Boolean isThreadPool)
   at System.Threading.TimerQueue.FireNextTimers()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
Driver finished with exit code -532462766
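
For reference, the exit code above is easier to recognise as an unsigned 32-bit value. The sketch below only does that conversion; the helper name `to_unsigned32` is made up for illustration. The result, 0xE0434352, is the code the .NET runtime uses for an unhandled managed exception, which matches the stack trace above.

```python
def to_unsigned32(code: int) -> str:
    """Reinterpret a signed 32-bit exit code as an unsigned hex string."""
    # Masking with 0xFFFFFFFF maps a negative value to its two's-complement form.
    return f"0x{code & 0xFFFFFFFF:08X}"


print(to_unsigned32(-532462766))  # 0xE0434352
```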

The failures seem to be more frequent during long-running jobs, such as a cook with shader compilation.

I can provide logs with more info if necessary, but the errors are not much clearer.
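
To quantify how often these errors occur in a given batch log, something like the following could be used. It is a minimal sketch, assuming the log lines have the same shape as the "Exception on log tailing task" line quoted above; the function name `tally_grpc_failures` and the sample input are illustrative and not part of Horde.

```python
import re
from collections import Counter

# Patterns for the gRPC status code and the HTTP/2 error name,
# based only on the error line quoted in this issue.
STATUS_RE = re.compile(r'StatusCode="(?P<status>[^"]+)"')
H2_ERROR_RE = re.compile(r"HTTP/2 error code '(?P<h2>[A-Z_]+)'")


def tally_grpc_failures(lines):
    """Count (gRPC status, HTTP/2 error) pairs across log lines."""
    counts = Counter()
    for line in lines:
        status = STATUS_RE.search(line)
        if status is None:
            continue  # no gRPC status on this line
        h2 = H2_ERROR_RE.search(line)
        key = (status.group("status"), h2.group("h2") if h2 else None)
        counts[key] += 1
    return counts


# Illustrative input: one shortened error line plus an unrelated line.
sample = [
    'Exception on log tailing task (<log-id>): Status(StatusCode="Internal", '
    'Detail="Error starting gRPC call. HttpRequestException: The HTTP/2 server '
    "reset the stream. HTTP/2 error code 'PROTOCOL_ERROR' (0x1). (HttpProtocolError)\")",
    "Some unrelated log line",
]

for (status, h2_error), count in tally_grpc_failures(sample).items():
    print(status, h2_error, count)
```

Running it against a log from a short job and one from a long cook would show whether the failure count really grows with job duration.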

Possible Solution

I tried changing the gRPC target groups' protocol from HTTP2 (the current setting) to GRPC, because HTTP/2 ping frames are required for the keep-alive signal. This did not fix the issue, but I think it may point toward the correct solution: this looks like a networking issue, and I haven't seen it with other Horde setups (such as docker compose).

Steps to Reproduce

  • Deploy a Horde server using the CGD Toolkit
  • Connect agents to it
    • Note that mine are regular EC2 machines controlled using the AwsRecycle strategy, but this shouldn't matter, as they are running the same Horde agent that any machine would.
  • Run a packaged build job (helps to have a longer job)
  • Look at an agent's batch logs

Cloud Game Development Toolkit version

latest

Labels: bug, triage