gh-114953: Support for multithreaded XZ compression in lzma.py #114954
mkomet wants to merge 1 commit into python:main
Conversation
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.
Could you create some benchmarks for this? Also, please align the code to follow PEP 7.
Aligned to PEP 7. Regarding the benchmarks: I ran a benchmark against an Ubuntu 16.04 image (similar to Phoronix, https://openbenchmarking.org/test/pts/compress-xz). Attached are the results from a 12-core i7-10750H (2.60 GHz) with 16 GiB RAM.

xz_bench.py

```python
"""
./python xz_bench.py
"""
import json
import lzma
import time

# Taken from:
# https://old-releases.ubuntu.com/releases/16.04.3/ubuntu-16.04.3-server-i386.img
with open("ubuntu-16.04.3-server-i386.img", "rb") as reader:
    data = reader.read()

# First 20 MiB
data = data[: 20 * (1 << 20)]

threads_list = [1, 0, 2, 4]
compression_results = {
    thread: {"times": [], "ratios": []}
    for thread in threads_list
}

for thread in threads_list:
    for preset in range(10):
        start_time = time.time()
        compressed_data = lzma.compress(data, preset=preset, threads=thread)
        end_time = time.time()
        compression_results[thread]["times"].append(end_time - start_time)
        ratio = len(data) / len(compressed_data)
        compression_results[thread]["ratios"].append(ratio)

with open("compression_results.json", "w") as result_file:
    json.dump(compression_results, result_file)
```
plot.py

```python
"""
python3 -m pip install matplotlib
python3 plot.py
"""
import json

import matplotlib.pyplot as plt

with open("compression_results.json", "r") as result_file:
    loaded_results = json.load(result_file)

presets = list(range(10))

# Create line graph for compression times
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
for thread, data in loaded_results.items():
    times = data["times"]
    plt.plot(presets, times, label=f"Threads={thread}")
plt.xlabel("Preset")
plt.ylabel("Time (s)")
plt.title("XZ Compression Benchmark - Compression Times (20 MiB)")
plt.legend()

# Create line graph for compression ratios
plt.subplot(1, 2, 2)
for thread, data in loaded_results.items():
    ratios = data["ratios"]
    plt.plot(presets, ratios, label=f"Threads={thread}")
plt.xlabel("Preset")
plt.ylabel("Compression Ratio")
plt.title("XZ Compression Benchmark - Compression Ratios (20 MiB)")
plt.legend()

plt.tight_layout()
plt.show()
```

Are there any other benchmarks you would like to see run?
Signed-off-by: Meir Komet <mskomet1@gmail.com>
thesamesam left a comment:
Could you wire this up for the decompressor too?
@thesamesam
(A possible implementation option would be to support multi-threaded decompression in the cpython
Your xz is old, I think. It does support it now. What version do you have? EDIT: Specifically, this was introduced in xz-utils-5.4.0, released on 2022-12-13, and mentioned in its release notes.
Please update the Autoconf script to detect the required functions/symbols, add C pre-processor defines for mt encoding/decoding, and use these to guard the newly added functionality. That way we won't break users with older XZ libs.
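As a quick runtime complement to that build-time check, here is a sketch (the `parse_xz_version` helper is hypothetical, not part of this PR) of parsing an `xz --version` banner to tell whether an installed XZ Utils is at least 5.4.0, the first stable release with threaded decoding:

```python
import re


def parse_xz_version(banner: str) -> tuple[int, int, int]:
    # Parse a banner such as "xz (XZ Utils) 5.4.0" into a version tuple.
    # Illustrative only; the real gating should happen at build time in
    # Autoconf against the liblzma symbols.
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", banner)
    if match is None:
        raise ValueError(f"unrecognized xz banner: {banner!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


# Threaded decoding first appeared in stable release 5.4.0.
assert parse_xz_version("xz (XZ Utils) 5.4.0") >= (5, 4, 0)
assert parse_xz_version("xz (XZ Utils) 5.2.5") < (5, 4, 0)
```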
```
threads: int(c_default="1") = 1

Number of threads to use for compression (only relevant with FORMAT_XZ).
```
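For context, a sketch of how the proposed parameter would be used from Python, with a fallback for interpreters built without this change (the `xz_compress` wrapper is hypothetical, not part of the PR):

```python
import lzma


def xz_compress(data: bytes, preset: int = 6, threads: int = 1) -> bytes:
    # `threads` is the parameter proposed in this PR; released CPython
    # versions reject it with a TypeError, so fall back to the
    # single-threaded path there.
    try:
        return lzma.compress(data, format=lzma.FORMAT_XZ,
                             preset=preset, threads=threads)
    except TypeError:
        return lzma.compress(data, format=lzma.FORMAT_XZ, preset=preset)


payload = b"sample payload " * 4096
blob = xz_compress(payload, preset=1, threads=0)  # 0 = all cores (proposed)
assert lzma.decompress(blob) == payload
```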
Would it be better to keep the Python API unchanged, and implicitly use the cores available?
Hmm. This would usually be an improvement. There is a little bit of risk in the way of someone running multiple instances in parallel.
(Also, is it possible to do a chunked compress (which allows parallel decompression) with just one thread? What happens when you tell encoder_mt to do 1 thread? Do we want to expose that potential difference?)
IIRC, yes, it's possible to use the chunked/parallel compressor with one thread. xz 5.6.0 will default to this for its command line interface at least, because it enables decompression in parallel for the archives.
In that case it may be beneficial to default to 1 thread with chunks. If you still want to have a "use the non-chunked thing" option, maybe we can pass some silly value like -1 threads to Compressor_init_xz.
Something along the lines of Artoria2e5@40029fb. (Treat as a long comment, not as a serious commit.)
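To illustrate the chunked idea being discussed (a sketch only, not liblzma's actual block-level threading): independently compressed chunks can simply be concatenated, since `lzma.decompress` already handles multi-stream .xz input, and each chunk could in principle be produced or consumed by a separate thread. Real threaded xz instead emits blocks within a single stream, so this is an approximation of the behavior, not the format.

```python
import lzma

CHUNK_SIZE = 1 << 20  # 1 MiB per chunk; an arbitrary choice for illustration


def chunked_xz_compress(data: bytes) -> bytes:
    # Each slice becomes an independent .xz stream; concatenating them
    # yields output that single-threaded decompressors still accept.
    return b"".join(
        lzma.compress(data[i:i + CHUNK_SIZE], format=lzma.FORMAT_XZ)
        for i in range(0, len(data), CHUNK_SIZE)
    )


data = bytes(range(256)) * 8192  # 2 MiB of sample data
blob = chunked_xz_compress(data)
# lzma.decompress transparently decodes concatenated xz streams.
assert lzma.decompress(blob) == data
```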
FTR:

```
xz.git git:(master)$ git tag --contains 4cce3e27f529af33e0e7749a8cbcec59954946b5 | head -n 1
v5.3.3alpha
```
Yeah, they do odd/even for development (not ABI stable) and stable/prod releases, so 5.4.0 was the first stable version it was available in.
Right; I forgot that was a thing. GNU LilyPond used to do that as well (and they probably still do; haven't checked in years.)
@mkomet Are you interested in finishing this off (at least for threaded compression; we can revisit for parallel decompression)? If not, I can put it on my list to try to get to, but I can't promise that will be immediately.
Note that XZ Utils < 5.8.1 has a CVE specifically about this function: https://nvd.nist.gov/vuln/detail/CVE-2025-31115. So if this goes in, that will need to be considered for public releases. |

Uh oh!
There was an error while loading. Please reload this page.