We should filter out records with very large payloads, e.g. text/html responses larger than 100 MB. These often contain invalid HTML generated by server-side scripts, and parsing them is extremely slow.
It could make sense to enforce a fixed limit in bytes on the Content-Length field of the HTTP header, perhaps with one limit per Content-Type. This can be implemented in the filter_warc() function.
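
As a rough sketch of what such a check could look like, assuming the HTTP headers of a record are available as a plain name-to-value mapping: the helper name `exceeds_size_limit`, the `MAX_CONTENT_LENGTH` table and its values are hypothetical and not part of the existing code. How filter_warc() would call it depends on how that function accesses the headers.

```python
from typing import Mapping, Optional

# Hypothetical per-Content-Type limits in bytes; the values are placeholders.
MAX_CONTENT_LENGTH: Mapping[str, int] = {
    "text/html": 100 * 1024 * 1024,  # 100 MB, the example from this proposal
}
# Assumption: types without an entry are not limited (None means "no limit").
DEFAULT_MAX_CONTENT_LENGTH: Optional[int] = None


def exceeds_size_limit(
    http_headers: Mapping[str, str],
    limits: Mapping[str, int] = MAX_CONTENT_LENGTH,
    default_limit: Optional[int] = DEFAULT_MAX_CONTENT_LENGTH,
) -> bool:
    """Return True if the declared Content-Length is above the limit for its type."""
    # HTTP header names are case-insensitive, so normalise them first.
    headers = {name.lower(): value for name, value in http_headers.items()}

    # Drop parameters such as "; charset=utf-8" and normalise the media type.
    media_type = headers.get("content-type", "").split(";", 1)[0].strip().lower()

    limit = limits.get(media_type, default_limit)
    if limit is None:
        return False

    try:
        content_length = int(headers.get("content-length", "").strip())
    except ValueError:
        # Missing or malformed Content-Length: the header alone cannot decide,
        # so this sketch keeps the record.
        return False

    return content_length > limit
```

A record would then be skipped whenever `exceeds_size_limit(headers)` returns True. Note that this only looks at the declared length; records with a missing or malformed Content-Length pass through and would need a separate check on the actual payload size.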