find() does not support prefix= causing unnecessary full directory listing #354

@dhrp

TL;DR

tosfs glob performance for a pattern like /some-huge-folder/some-file.* can be greatly improved. Currently all files under /some-huge-folder/ are listed and then filtered locally in Python by fsspec, but the filtering can be done server side instead.

Background

This issue was first discovered in gcsfs, which itself supports prefix=, but fsspec does not yet pass it through. For that reason [fsspec/filesystem_spec#1995] proposes passing the literal filename stem before the first wildcard as prefix= to find, so that the storage backend can apply server-side filtering. For a glob like tos://bucket/data/2024/report*.csv, the stem report is extracted and passed as prefix="report" to find. Backends that support prefix= (gcsfs, adlfs) can push a ?prefix=report filter down to the storage API and return a fraction of the listing. Backends that don't support it silently ignore it via **kwargs.
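As an illustration of the stem extraction described above, here is a minimal sketch. The helper name split_glob is hypothetical and is not part of fsspec; it assumes wildcards start at the first *, ? or [ character and treats everything after the last / before that point as the literal stem.

```python
from __future__ import annotations

import re

# Characters that begin a wildcard in an fsspec-style glob pattern.
_WILDCARD = re.compile(r"[*?\[]")


def split_glob(pattern: str) -> tuple[str, str]:
    """Split a glob into (directory, literal stem before the first wildcard).

    Hypothetical helper for illustration only; not part of fsspec.
    """
    match = _WILDCARD.search(pattern)
    if match is None:
        # No wildcard at all: there is nothing to pre-filter on.
        return pattern, ""
    literal = pattern[: match.start()]
    # Everything up to the last "/" is the directory to list;
    # the remainder is the stem that could be sent as prefix=.
    directory, _, stem = literal.rpartition("/")
    return directory, stem


directory, stem = split_glob("/some-huge-folder/some-file.*")
# directory is "/some-huge-folder" and stem is "some-file."
```

When the wildcard sits in a directory component (for example bucket/data/*/x.csv), the stem is empty and no prefix would be sent.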

This optimization avoids loading the full directory listing — which for large buckets can mean hundreds of paginated API calls and seconds of latency — when only a handful of files match.

The problem with tosfs

tosfs.TosFileSystem.find() currently raises a ValueError if both prefix and maxdepth are passed together.

Because _glob always calls _find(..., maxdepth=depth, ...) with an integer depth (≥ 1 for any path with at least one /), this guard fires on every prefixed glob pattern. With the upstream optimization in place, such globs would raise an exception instead of returning results.
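To make the failure mode concrete, here is a minimal stand-in. It is not the tosfs source: GuardedFS, glob_with_prefix and the error message are all hypothetical, and the class only mimics the guard described above on top of an in-memory key list.

```python
class GuardedFS:
    """Toy filesystem whose find() rejects prefix together with maxdepth."""

    def __init__(self, keys):
        self.keys = sorted(keys)

    def find(self, path, maxdepth=None, prefix="", **kwargs):
        if prefix and maxdepth is not None:
            # Mimics the guard described above (message is made up).
            raise ValueError("prefix and maxdepth cannot be combined")
        root = path.rstrip("/") + "/"
        out = []
        for key in self.keys:
            if not key.startswith(root + prefix):
                continue
            # Depth 1 means a direct child of `path`.
            depth = key[len(root):].count("/") + 1
            if maxdepth is None or depth <= maxdepth:
                out.append(key)
        return out


def glob_with_prefix(fs, directory, stem, depth):
    """Hypothetical caller that always sends an integer depth plus the stem."""
    return fs.find(directory, maxdepth=depth, prefix=stem)


fs = GuardedFS(["bucket/data/report_a.csv", "bucket/data/other.csv"])
try:
    glob_with_prefix(fs, "bucket/data", "report", depth=1)
except ValueError as exc:
    print("glob failed:", exc)
```

The same find call without maxdepth succeeds, which is why the problem only appears once a caller sends both arguments.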

The guard appears to be an implementation workaround rather than an API constraint. When maxdepth is given, the current code delegates immediately to super().find() (the fsspec base class), which doesn't forward prefix= to the backend.

Rather than implement a native depth+prefix path, the combination was disallowed. However, the underlying TOS SDK's list_objects_type2 accepts prefix and delimiter as fully independent, orthogonal parameters — there is no API-level mutual exclusivity.

What a native fix would look like

For the common case of maxdepth=1, a single list_objects_type2 call with both prefix and delimiter="/" set would:

Return only objects whose keys begin with the root key followed by /<user_prefix> (filtered server side)
Stop at the first / after the root key via delimiter, so no recursion is needed
Require exactly one API round-trip (or a few, with pagination) instead of a full listing sweep

This mirrors how _ls_dirs_and_files already combines key + "/" + prefix when constructing the prefix for TOS API calls.
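A rough sketch of the maxdepth=1 path described above follows. It runs against an in-memory FakeClient rather than the TOS SDK: the method name list_objects_type2 and its prefix and delimiter arguments come from this issue, while the Page fields, the continuation_token argument and the find_depth1 helper are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Stand-in for one listing response (field names are assumptions)."""
    contents: list = field(default_factory=list)         # object keys
    common_prefixes: list = field(default_factory=list)  # "directory" keys
    is_truncated: bool = False
    next_continuation_token: str = ""


class FakeClient:
    """In-memory stand-in for an object-store client; not the TOS SDK."""

    def __init__(self, keys, page_size=2):
        self.keys = sorted(keys)
        self.page_size = page_size
        self.calls = 0

    def list_objects_type2(self, bucket, prefix="", delimiter="", continuation_token=""):
        self.calls += 1
        entries = []  # ordered, de-duplicated (kind, name) pairs
        for key in self.keys:
            if not key.startswith(prefix):
                continue
            rest = key[len(prefix):]
            if delimiter and delimiter in rest:
                # Collapse everything below the first delimiter into one entry.
                entry = ("dir", prefix + rest.split(delimiter, 1)[0] + delimiter)
            else:
                entry = ("file", key)
            if entry not in entries:
                entries.append(entry)
        # A real token would be opaque; here it is simply an offset.
        start = int(continuation_token or 0)
        chunk = entries[start:start + self.page_size]
        end = start + len(chunk)
        return Page(
            contents=[name for kind, name in chunk if kind == "file"],
            common_prefixes=[name for kind, name in chunk if kind == "dir"],
            is_truncated=end < len(entries),
            next_continuation_token=str(end),
        )


def find_depth1(client, bucket, key, prefix=""):
    """List direct children of `key` whose names start with `prefix`."""
    base = key.strip("/")
    api_prefix = (base + "/" if base else "") + prefix
    files, token = [], ""
    while True:
        page = client.list_objects_type2(
            bucket, prefix=api_prefix, delimiter="/", continuation_token=token
        )
        files.extend(f"{bucket}/{k}" for k in page.contents)
        if not page.is_truncated:
            return files
        token = page.next_continuation_token
```

Only matching entries are paged through, so the number of round-trips scales with the number of matches rather than with the size of the folder. Handling for withdirs and for maxdepth greater than 1 is left out of this sketch.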

Development

To reap the full benefit, tosfs and s3fs need to be fixed to support prefix= (in combination with maxdepth and withdirs). Once that is implemented, fsspec can be improved to send the prefix. gcsfs (Google) and adlfs (Azure) will benefit immediately, and oci can easily benefit as well.

What do you think?
