29 changes: 29 additions & 0 deletions README.md
@@ -440,6 +440,8 @@ cf.clear_locks()
```bash
# list cloud and local directories
cloudfiles ls gs://bucket-folder/
# list cloud and local directories into a sqlite db (resumable where supported)
cloudfiles ls --sqlite example.db gs://bucket-folder/
# parallel file transfer, no decompression
cloudfiles -p 2 cp --progress -r s3://bkt/ gs://bkt2/
# change compression type to brotli
@@ -497,6 +499,33 @@ cloudfiles ls -e "gs://bucket/prefix[ab]"
# cloudfiles ls gs://bucket/prefixb
```

### `ls` sqlite

When a cloud directory is very large, it can be useful to download the file listing and file sizes locally for easier searching and comparison. This feature lets you start a resumable download of the listing for a given bucket or directory on Google Cloud Storage or Amazon S3 endpoints. The `file`, `mem`, and `https` protocols do not support resumption (except for the Google Storage REST API, which does).

```bash
cloudfiles ls --sqlite example.db gs://bucket/prefix --progress
```

This starts downloading the listing to `example.db` in the current directory. You can then search for files in the `files` table.

```sql
SELECT sum(size) FROM files WHERE path LIKE '%example.jpg';
```
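
The same query can be run from Python with the standard-library `sqlite3` module. The sketch below is illustrative only: it assumes the `files` table has `path` and `size` columns (as the query above implies) and builds a small in-memory stand-in instead of opening a real `example.db`. The helper name `total_size` is hypothetical.

```python
import sqlite3

# Assumption: the listing database has a `files` table with `path` and `size`
# columns. An in-memory stand-in is used so the sketch runs on its own.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (path TEXT PRIMARY KEY, size INTEGER)")
conn.executemany(
    "INSERT INTO files (path, size) VALUES (?, ?)",
    [
        ("a/example.jpg", 100),
        ("b/example.jpg", 250),
        ("b/other.png", 40),
    ],
)

def total_size(conn, suffix):
    """Sum the sizes of all files whose path ends with `suffix`.

    Note: `%` and `_` inside `suffix` act as LIKE wildcards.
    """
    row = conn.execute(
        "SELECT COALESCE(SUM(size), 0) FROM files WHERE path LIKE ?",
        ("%" + suffix,),
    ).fetchone()
    return row[0]

print(total_size(conn, "example.jpg"))  # 350 for the stand-in rows above
```

Passing the pattern as a bound parameter avoids quoting problems when the suffix comes from user input.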

One use for this technique is checking whether a copy of a large dataset is missing files. Download the two listings to `original.db` and `copy.db`, then compute a set difference.

```sql
-- run inside the `sqlite3` shell
ATTACH DATABASE 'original.db' AS original_data;
ATTACH DATABASE 'copy.db' AS copied_data;

SELECT path FROM original_data.files
EXCEPT
SELECT path FROM copied_data.files;
```
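
The comparison can also be scripted. This is a minimal sketch using Python's `sqlite3` module, again assuming a `files` table with `path` and `size` columns. The helpers `make_listing` and `missing_paths` are hypothetical, and the two databases are small temporary stand-ins rather than real listings.

```python
import os
import sqlite3
import tempfile

def make_listing(db_path, rows):
    """Create a stand-in listing database (assumed schema: files(path, size))."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE files (path TEXT PRIMARY KEY, size INTEGER)")
    conn.executemany("INSERT INTO files (path, size) VALUES (?, ?)", rows)
    conn.commit()
    conn.close()

def missing_paths(original_db, copy_db):
    """Return paths present in the original listing but absent from the copy."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("ATTACH DATABASE ? AS original_data", (original_db,))
        conn.execute("ATTACH DATABASE ? AS copied_data", (copy_db,))
        rows = conn.execute(
            "SELECT path FROM original_data.files "
            "EXCEPT "
            "SELECT path FROM copied_data.files "
            "ORDER BY path"
        ).fetchall()
    finally:
        conn.close()
    return [r[0] for r in rows]

with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "original.db")
    copy = os.path.join(tmp, "copy.db")
    make_listing(original, [("a.bin", 1), ("b.bin", 2), ("c.bin", 3)])
    make_listing(copy, [("a.bin", 1), ("c.bin", 3)])
    missing = missing_paths(original, copy)

print(missing)  # ['b.bin'] for the stand-in data above
```

Comparing on `path` alone only detects missing files. Selecting `path, size` on both sides of the `EXCEPT` would also flag files whose sizes differ.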

### `alias` for Alternative S3 Endpoints

You can set your own protocols for S3 compatible endpoints by creating dynamic or persistent aliases. CloudFiles comes with two official s3 endpoints that are important for the Seung Lab, `matrix://` and `tigerdata://` which point to Princeton S3 endpoints. Official aliases can't be overridden.
15 changes: 13 additions & 2 deletions cloudfiles/cloudfiles.py
@@ -1138,7 +1138,12 @@ def touch(
# )

def list(
- self, prefix:str = "", flat:bool = False
+ self,
+ prefix:str = "",
+ flat:bool = False,
+ size:bool = False,
+ return_resume_token:bool = False,
+ resume_token:Optional[str] = None,
) -> Generator[str,None,None]:
"""
List files with the given prefix.
@@ -1160,7 +1165,13 @@ def list(
Return: generated sequence of file paths relative to cloudpath
"""
with self._get_connection() as conn:
- for f in conn.list_files(prefix, flat):
+ for f in conn.list_files(
+ prefix=prefix,
+ flat=flat,
+ size=size,
+ resume_token=resume_token,
+ return_resume_token=return_resume_token,
+ ):
yield f

def transfer_to(
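
The `resume_token` and `return_resume_token` parameters added to `list` suggest token-based pagination. The diff does not show how the tokens are produced or consumed, so the following is only a generic sketch of that pattern, not CloudFiles's implementation. `list_page`, `resumable_ls`, `ALL_FILES`, and the `progress` table are all hypothetical stand-ins. The idea is to write each page of results and the next token in one transaction, so an interrupted run can continue from the last saved token.

```python
import sqlite3

# Hypothetical stand-in for a remote bucket: (path, size) pairs.
ALL_FILES = [("file_%03d" % i, i * 10) for i in range(10)]

def list_page(resume_token=None, page_size=4):
    """Return (entries, next_token); next_token is None once the listing ends."""
    start = int(resume_token) if resume_token is not None else 0
    entries = ALL_FILES[start:start + page_size]
    end = start + len(entries)
    next_token = str(end) if end < len(ALL_FILES) else None
    return entries, next_token

def resumable_ls(conn, max_pages=None):
    """List into sqlite, resuming from a saved token. Returns True when complete.

    `max_pages` exists only to simulate an interruption in this sketch.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, size INTEGER)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS progress ("
        "id INTEGER PRIMARY KEY CHECK (id = 1), token TEXT, done INTEGER NOT NULL)"
    )
    row = conn.execute("SELECT token, done FROM progress WHERE id = 1").fetchone()
    if row is not None and row[1]:
        return True  # a previous run already finished
    token = row[0] if row is not None else None

    pages = 0
    while max_pages is None or pages < max_pages:
        entries, token = list_page(token)
        with conn:  # commit the page and its token together
            conn.executemany(
                "INSERT OR REPLACE INTO files (path, size) VALUES (?, ?)", entries
            )
            conn.execute(
                "INSERT OR REPLACE INTO progress (id, token, done) VALUES (1, ?, ?)",
                (token, int(token is None)),
            )
        pages += 1
        if token is None:
            return True
    return False

conn = sqlite3.connect(":memory:")
first_done = resumable_ls(conn, max_pages=1)  # simulate stopping after one page
count_after_first = conn.execute("SELECT COUNT(*) FROM files").fetchone()[0]
second_done = resumable_ls(conn)              # picks up from the saved token
count_final = conn.execute("SELECT COUNT(*) FROM files").fetchone()[0]
print(first_done, count_after_first, second_done, count_final)
```

Committing the token in the same transaction as the rows it follows means a crash cannot leave the saved token ahead of the stored data.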