It looks like the judge has a hard character limit of 3000, which is not communicated to the agent during any of the tasks. In practice, for several of the tasks (the blog post or the email analysis, for example) the judge is almost guaranteed to receive a cut-off response, causing artificially lower scores for these tasks.
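
To make the effect concrete, here is a minimal sketch of what I think is happening. It assumes the judge simply keeps the first 3000 characters of the response; I have not confirmed that this is how the cut is applied, and the names `JUDGE_CHAR_LIMIT`, `truncate_for_judge`, and `fraction_seen` are made up for illustration and are not taken from the codebase.

```python
# Hypothetical illustration only: the real judge code may cut responses differently.
JUDGE_CHAR_LIMIT = 3000  # the limit reported above; the constant name is an assumption


def truncate_for_judge(response: str, limit: int = JUDGE_CHAR_LIMIT) -> str:
    """Keep only the first `limit` characters, as a stand-in for the judge's input handling."""
    return response[:limit]


def fraction_seen(response: str, limit: int = JUDGE_CHAR_LIMIT) -> float:
    """Return the share of the response that survives the cut (1.0 means nothing is lost)."""
    if not response:
        return 1.0
    return len(truncate_for_judge(response, limit)) / len(response)


if __name__ == "__main__":
    # A synthetic 6000-character "blog post": only half of it would reach the judge.
    blog_post = "word " * 1200
    print(len(blog_post), fraction_seen(blog_post))
```

Under that assumption, any task whose natural answer runs past 3000 characters is graded on a partial response, while the agent has no way of knowing it should stay under the limit.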