This repository was archived by the owner on Dec 14, 2023. It is now read-only.

Fix disk space "leak" in extract-and-vector, limit disk usage on all services #809

@pypt

extract-and-vector workers tend to fill up /var/tmp with gigabytes of nearly identical files, each either 0 or 3332489 bytes in size:

$ docker exec -it 689b33c92426 bash
mediacloud@689b33c92426:/var/tmp$ ls -la
total 3314808
drwxrwxrwt 1 root       root         36864 Sep 24 16:07 .
drwxr-xr-x 1 root       root          4096 Jul 23 13:38 ..
-rw------- 1 root       root       3332489 Aug 31 06:34 jieba.cache
<...>
-rw------- 1 mediacloud mediacloud 3332489 Sep 13 03:20 tmp0fductxs
-rw------- 1 mediacloud mediacloud       0 Sep 23 11:17 tmp0gvvyssa
-rw------- 1 mediacloud mediacloud 3332489 Sep  9 19:46 tmp0hnsk5dl
<...>
-rw------- 1 mediacloud mediacloud 3332489 Sep 22 00:18 tmp0u38habl
-rw------- 1 mediacloud mediacloud       0 Sep 24 03:20 tmp0uaqfvu8
-rw------- 1 mediacloud mediacloud 3332489 Sep 12 05:47 tmp0uu31qqo
<...>
-rw------- 1 mediacloud mediacloud 3332489 Sep  4 08:31 tmp15uwsawk
-rw------- 1 mediacloud mediacloud       0 Sep 24 04:45 tmp163pb8nu
-rw------- 1 mediacloud mediacloud 3332489 Sep 16 20:46 tmp16nra4na
<...>
-rw------- 1 mediacloud mediacloud 3332489 Sep 12 08:41 tmp1toho273
-rw------- 1 mediacloud mediacloud       0 Sep 22 21:48 tmp1uc_jdij

It took me a while to notice that the temporary files with random names have exactly the same size as a file with a not-so-random name:

-rw------- 1 root       root       3332489 Aug 31 06:34 jieba.cache
<...>
-rw------- 1 mediacloud mediacloud 3332489 Sep 13 03:20 tmp0fductxs

Jieba is a Python library that we use for Chinese language tokenization. Because it relies on a dictionary, it has to pre-load some data, so we prebuild its cache at image build time:

# Prebuild Jieba dictionary cache
COPY bin/build_jieba_dict_cache.py /
RUN \
/build_jieba_dict_cache.py && \
rm /build_jieba_dict_cache.py && \
true

However, it seems that the resulting /var/tmp/jieba.cache is not accessible to the processes that need it: the file gets created with root:root ownership and 600 permissions, while the workers run as mediacloud:mediacloud. As a result, Jieba appears to rebuild the cache file on every call.
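To confirm this on a running container, a small diagnostic along these lines might help. It is only a sketch: the helper names `cache_is_readable` and `find_leaked_copies` are hypothetical, and it assumes the leaked files carry the `tmp` prefix seen in the listing above and are either empty or the same size as jieba.cache.

```python
import os
from pathlib import Path


def cache_is_readable(cache_path):
    """Return True if the current user is allowed to read the cache file."""
    return os.access(cache_path, os.R_OK)


def find_leaked_copies(tmp_dir, cache_name="jieba.cache", prefix="tmp"):
    """List temp files that look like failed or abandoned cache rebuilds.

    A file is a suspect if its name starts with `prefix` and its size is
    either 0 or identical to the size of the real cache file.
    """
    tmp_dir = Path(tmp_dir)
    # stat() only needs search permission on the directory, so this works
    # even when the cache file itself is unreadable.
    cache_size = (tmp_dir / cache_name).stat().st_size
    suspects = []
    for entry in tmp_dir.iterdir():
        if not entry.is_file() or not entry.name.startswith(prefix):
            continue
        if entry.stat().st_size in (0, cache_size):
            suspects.append(entry)
    return sorted(suspects)
```

Run as the mediacloud user, `cache_is_readable("/var/tmp/jieba.cache")` would be expected to return False if the diagnosis above is right, and `find_leaked_copies("/var/tmp")` would list the candidates for cleanup. Matching on size alone is a heuristic, so the list should be reviewed before deleting anything.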

@jtotoole, could you:

  1. Fix jieba.cache's file permissions at build time so that the Jieba library can access it; you probably just need to run the cache creation script as a different user in the Dockerfile.
  2. Limit the storage used by all service containers in production's docker-compose.yml where appropriate; you'll probably need storage_opt for that.
    • We try to put at least a liberal cap on every service's resources so that if a service goes rogue with CPU / RAM usage, it doesn't affect the host machine or crash other services. It now turns out that we can run out of disk space too. Systems monitoring would be a good but reactive way to deal with that; we need to be proactive as well and not let containers burn through the host machine's root partition (or, if they do run out of space, the damage should be confined to the container).
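For the first item, one alternative to running the script as a different user would be to relax the cache file's permissions once it has been built. The sketch below illustrates that idea; `make_cache_world_readable` is a hypothetical helper, not something that exists in build_jieba_dict_cache.py, and whether Jieba then picks up the cache still depends on its own checks.

```python
import os
import stat


def make_cache_world_readable(cache_path):
    """Add read permission for group and others, keeping existing bits.

    Returns the resulting permission bits so the caller can log them.
    """
    current = stat.S_IMODE(os.stat(cache_path).st_mode)
    os.chmod(cache_path, current | stat.S_IRGRP | stat.S_IROTH)
    return stat.S_IMODE(os.stat(cache_path).st_mode)
```

Called on /var/tmp/jieba.cache at the end of the build step, this would turn 600 into 644. Running the script as mediacloud, as suggested above, is probably still the cleaner fix, because it also lets that user replace the file later.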
