Very large payloads cause issues (add a Content-Length upper limit?) #7

@magbb

Description

We should filter out records with very large payloads, e.g. text/html payloads larger than 100 MB, which often contain invalid HTML generated by server-side scripts. Parsing such files takes forever.

It could make sense to enforce a fixed byte limit on the Content-Length field of the HTTP header, perhaps with a separate limit per Content-Type. This could be implemented in the filter_warc() function.
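A minimal sketch of what such a check could look like. The limit table, the 100 MB fallback, and the `within_size_limit()` helper are all hypothetical; they would need to be wired into the actual `filter_warc()` function, whose record interface is not shown here.

```python
from typing import Optional

# Hypothetical fallback limit for Content-Types not listed below.
DEFAULT_MAX_BYTES = 100 * 1024 * 1024  # 100 MB

# Hypothetical per-Content-Type limits, in bytes.
CONTENT_LENGTH_LIMITS = {
    "text/html": 100 * 1024 * 1024,
    "application/pdf": 200 * 1024 * 1024,
}

def within_size_limit(content_type: str, content_length: Optional[str]) -> bool:
    """Return True if the record's declared Content-Length is acceptable."""
    if content_length is None:
        # No Content-Length header: cannot filter on declared size here.
        return True
    try:
        length = int(content_length)
    except ValueError:
        # Malformed Content-Length: treat the record as suspect and drop it.
        return False
    # Normalize e.g. "text/html; charset=utf-8" -> "text/html".
    base_type = content_type.split(";")[0].strip().lower()
    limit = CONTENT_LENGTH_LIMITS.get(base_type, DEFAULT_MAX_BYTES)
    return length <= limit
```

Note that Content-Length is only the declared size; a robust filter would also cap the bytes actually read, since the header can be missing or wrong.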
