Skip to content

Selectable MP3 export format#852

Open
robustini wants to merge 9 commits intoace-step:mainfrom
robustini:mp3-320k-44k1
Open

Selectable MP3 export format#852
robustini wants to merge 9 commits intoace-step:mainfrom
robustini:mp3-320k-44k1

Conversation

@robustini
Copy link
Contributor

@robustini robustini commented Mar 16, 2026

Replaces the implicit MP3 export path in torchaudio/ffmpeg by adding the option to select the MP3 format.
If the UI does not support these settings, generation will default to the previous settings.

Summary by CodeRabbit

Release Notes

  • New Features
    • Added MP3 export support with customizable bitrate and sample rate configuration
    • Enabled batch audio file processing capabilities
    • Introduced audio format conversion utility for easy format switching
    • Added audio processing effects including fade and normalization

Replace the implicit torchaudio/ffmpeg MP3 export path with an explicit
ffmpeg/libmp3lame encode step that forces 320 kbps CBR at 44.1 kHz.

Why:
- The previous MP3 export path did not set an explicit bitrate, so the final
  output depended on backend defaults and could end up around 128 kbps.
- 128 kbps MP3 is a major audible bottleneck for music generation quality.
- 44.1 kHz is the native and more appropriate distribution target for music.
  It fully covers the audible band and avoids spending bitrate budget on
  ultrasonic bandwidth that is not perceptually useful in MP3 delivery.
- 48 kHz is mainly beneficial for video, broadcast, and production pipelines,
  but at a fixed lossy bitrate it does not provide a meaningful quality
  advantage for music distribution and can be slightly less efficient than
  44.1 kHz.
- For MP3 specifically, raising bitrate from 128 kbps to 320 kbps yields a
  much larger real-world quality improvement than keeping a 48 kHz sample rate.

What changed:
- Resample MP3 output to 44.1 kHz before encoding.
- Export MP3 through an explicit ffmpeg subprocess using libmp3lame.
- Force 320 kbps CBR instead of relying on backend defaults.
- Keep the existing save path for other formats unchanged.
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 16, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Replaces a minimal audio utility with a full AudioSaver implementation (multi-format saving including MP3 with ffmpeg, WAV32 fallback), adds helpers (apply_fade, normalize_audio, hashing/UUID), batch/convert utilities, MP3-focused tests, and forwards mp3 export options through generation flow.

Changes

Cohort / File(s) Summary
Audio Utilities Implementation
acestep/audio_utils.py
Introduces AudioSaver and global _default_saver; multi-format save (flac, wav, wav32, mp3, opus, aac), _save_mp3 using ffmpeg subprocess, torchaudio/soundfile backend selection and fallbacks, save_batch, convert_audio, and helper functions (apply_fade, normalize_audio, hashing/UUID utilities, save_audio wrapper).
Audio Utilities Tests
acestep/audio_utils_test.py
Adds MP3-focused tests verifying _save_mp3 usage via save_audio, forwarding of mp3_bitrate/mp3_sample_rate, legacy defaults and overrides, subprocess invocation expectations, and failure propagation (no silent fallback).
Generation & Orchestration
acestep/inference.py
Adds mp3_bitrate and mp3_sample_rate to GenerationConfig; builds a kwargs dict for dit_handler.generate_music, filters supported keys via inspect.signature, logs dropped keys, and forwards MP3 export settings into audio_saver.save_audio when saving generated audio.

Sequence Diagram(s)

sequenceDiagram
    participant Orchestrator as "generate_music (inference)"
    participant DiT as "dit_handler.generate_music"
    participant AudioSaver as "AudioSaver.save_audio"
    participant Backend as "torchaudio / soundfile"
    participant FFmpeg as "ffmpeg subprocess"
    participant FS as "FileSystem"

    Orchestrator->>Orchestrator: assemble generation params (incl. mp3_bitrate/mp3_sample_rate)
    Orchestrator->>DiT: call generate_music(filtered_kwargs)
    DiT-->>Orchestrator: returns audio tensor
    Orchestrator->>AudioSaver: save_audio(audio_tensor, format, mp3_* settings)
    AudioSaver->>Backend: prepare/write intermediate WAV or direct write
    alt format == "mp3"
        AudioSaver->>FFmpeg: run ffmpeg (libmp3lame, bitrate, sample_rate)
        FFmpeg->>FS: write .mp3
    else
        Backend->>FS: write target audio file
    end
    FS-->>AudioSaver: saved path
    AudioSaver-->>Orchestrator: saved path / status
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • ChuxiJ

Poem

🐇 I hopped through samples, fades, and streams,
Tuned bitrates, resampled dreams,
FFmpeg hummed to stitch the song,
WAV32, batches — hopping along,
A rabbit cheers: the audio's strong. 🎶

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: exposing MP3 encoding parameters (bitrate and sample rate) to make them user-selectable, which is the core objective of the PR.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

CodeRabbit can suggest fixes for GitHub Check annotations.

Configure the reviews.tools.github-checks setting to adjust the time to wait for GitHub Checks to complete.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
acestep/audio_utils.py (1)

189-201: Make the MP3 override visible in the public API docs.

save_audio still reads as if the caller-controlled sample_rate is always honored, but the MP3 branch now always emits 44.1 kHz/320 kbps and can raise RuntimeError when ffmpeg is missing or the encode fails. Please capture that in the method and wrapper docstrings so callers do not have to learn it from implementation details.

As per coding guidelines, "Docstrings are mandatory for all new/modified modules, classes, and functions. Docstrings must be concise and include purpose plus key inputs/outputs and raised exceptions when relevant."

Also applies to: 555-567

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@acestep/audio_utils.py` around lines 189 - 201, Update the save_audio
docstring (and the corresponding public wrapper docstrings noted around the
later section) to clearly state that when format == 'mp3' the function overrides
the requested sample_rate and forces encoding at 44.1 kHz and 320 kbps, and that
this branch may raise RuntimeError if ffmpeg is not available or the MP3 encode
fails; mention the inputs/outputs as already present and add the raised
exceptions line for clarity. Reference the save_audio function name in the docs
so callers see the MP3 behavior and error conditions in the public API
documentation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@acestep/audio_utils.py`:
- Around line 1-571: This file is too large and mixes DSP helpers, I/O,
batch/export, and hashing/UUID utilities; split it into focused modules (e.g.,
audio_dsp.py for apply_fade/normalize_audio, audio_io.py for
AudioSaver/_save_mp3/save_audio/save_batch/convert_audio, and audio_hash.py for
get_lora_weights_hash/get_audio_file_hash/generate_uuid_from_params/generate_uuid_from_audio_data)
and keep the original top-level module as a thin facade that re-exports the key
symbols (AudioSaver, save_audio, apply_fade, normalize_audio,
get_lora_weights_hash, generate_uuid_from_params, generate_uuid_from_audio_data)
so existing imports remain stable; update any internal imports to reference the
new modules and ensure the global _default_saver instance is preserved in the
facade and that AudioSaver._save_mp3 logic is moved alongside AudioSaver in
audio_io.py.
- Around line 245-313: The outer except currently catches errors from
_save_mp3() and other ffmpeg-backed saves and then attempts a soundfile
fallback, but ends up masking the original ffmpeg RuntimeError; change the logic
so that if the failing format is an ffmpeg-backed format (e.g., "mp3", "opus",
"aac" or when _save_mp3() is used) you immediately re-raise the original
exception instead of trying the soundfile fallback, whereas for PCM containers
("wav", "flac", "wav32") keep the current soundfile fallback path; locate
references to _save_mp3, the format variable checks, and the outer except block
to implement conditional re-raise for ffmpeg formats and only perform the
fallback for PCM formats.

---

Nitpick comments:
In `@acestep/audio_utils.py`:
- Around line 189-201: Update the save_audio docstring (and the corresponding
public wrapper docstrings noted around the later section) to clearly state that
when format == 'mp3' the function overrides the requested sample_rate and forces
encoding at 44.1 kHz and 320 kbps, and that this branch may raise RuntimeError
if ffmpeg is not available or the MP3 encode fails; mention the inputs/outputs
as already present and add the raised exceptions line for clarity. Reference the
save_audio function name in the docs so callers see the MP3 behavior and error
conditions in the public API documentation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6016d9b1-7de7-4874-b18c-58401b2921db

📥 Commits

Reviewing files that changed from the base of the PR and between eafe291 and 147325c.

📒 Files selected for processing (2)
  • acestep/audio_utils.py
  • acestep/audio_utils_test.py

…ormats

This follow-up fix corrects the MP3 export error handling introduced in the previous patch.

Changes:
- preserve the original ffmpeg error for MP3 export failures
- prevent the generic `soundfile` fallback from overriding explicit MP3 failures
- keep fallback behavior only for PCM/container formats where it actually makes sense
- add/update tests to verify that MP3 export errors are re-raised correctly

Why:
The previous MP3 patch correctly forced `44.1 kHz / 320 kbps CBR`, but its outer exception handling could still catch the explicit MP3 failure and retry through a generic fallback path. That behavior risked hiding the real ffmpeg/libmp3lame error and made debugging harder. This follow-up ensures compressed ffmpeg-backed formats fail loudly and correctly.
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

♻️ Duplicate comments (1)
acestep/audio_utils.py (1)

1-577: 🛠️ Refactor suggestion | 🟠 Major

This module is still far above the repository’s Python LOC cap.

This remains a large multi-responsibility module; please split by responsibility and keep acestep.audio_utils as a stable facade re-export.

As per coding guidelines, "Prefer single-responsibility modules with clear boundaries. Target module size: optimal <= 150 LOC, hard cap 200 LOC." and "Preserve stable facade imports when splitting large files so external callers are not broken." Based on learnings, module-size concerns should be raised when Python files exceed 200 LOC.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@acestep/audio_utils.py` around lines 1 - 577, This file is too large and must
be split by responsibility while keeping acestep.audio_utils as a stable facade
re-exporting the originals; create smaller modules (e.g.,
acestep.audio_processing with apply_fade and normalize_audio, acestep.io with
AudioSaver, save_audio, save_batch, convert_audio, and acestep.hashing with
get_audio_file_hash, generate_uuid_from_params, generate_uuid_from_audio_data,
get_lora_weights_hash) and move related helper functions/constants
(MP3_SAMPLE_RATE, MP3_BITRATE, _save_mp3) accordingly; then update the original
module to import and re-export the same symbols (AudioSaver, save_audio,
apply_fade, normalize_audio, get_lora_weights_hash, get_audio_file_hash,
generate_uuid_from_params, generate_uuid_from_audio_data, save_batch,
convert_audio) so external callers remain unchanged. Ensure tests/imports still
reference the same names and that any internal relative imports (torchaudio,
soundfile usage) are preserved in the new modules.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@acestep/audio_utils_test.py`:
- Around line 382-395: The test test_save_audio_mp3_preserves_ffmpeg_failure
references missing fixtures (self.temp_dir and self.sample_audio) defined in a
different test setup; add local setup for this test (or ensure the test class
setUp defines them) so the RuntimeError assertion runs instead of raising
AttributeError: initialize self.temp_dir (e.g., create a temp dir via
tempfile.mkdtemp()/TemporaryDirectory) and initialize self.sample_audio to the
same kind of audio tensor/array used elsewhere in tests (matching
self.sample_rate and AudioSaver expectations) before calling saver.save_audio.

In `@acestep/audio_utils.py`:
- Around line 224-233: The current heuristic that transposes when
audio_tensor.dim() == 2 and audio_tensor.shape[0] > audio_tensor.shape[1] can
mis-detect short clips; remove this shape-based transpose and instead trust the
channels_first flag: when channels_first is True treat incoming numpy as
[channels, samples] and convert directly, and when channels_first is False treat
incoming numpy as [samples, channels] and do not auto-transpose; update the
block that creates audio_tensor (references: channels_first, audio_data,
audio_tensor) to eliminate the heuristic transpose so orientation is
deterministic.
- Around line 359-361: Guard against deleting the converted output by checking
that input_path and output_path are not the same file before unlinking: inside
the block that handles remove_input (the branch referencing remove_input,
input_path, output_path and logger.debug("[AudioSaver]...")), compare the
resolved paths (or use input_path.samefile/output_path.exist checks) and only
call input_path.unlink() when they differ and the file exists; if they are the
same, skip deletion and optionally log that deletion was skipped to avoid
removing the freshly converted output.
- Around line 167-174: The ffmpeg subprocess call in the ffmpeg MP3 export block
currently calls subprocess.run(...) without a timeout; modify the call to
include timeout=120 and add an except subprocess.TimeoutExpired handler
(alongside the existing FileNotFoundError and subprocess.CalledProcessError
handlers) that raises a RuntimeError indicating the ffmpeg command timed out and
include any available output (stdout/stderr) from the exception. Update the
subprocess.run invocation and add the TimeoutExpired except branch in the same
try/except that wraps the ffmpeg invocation to ensure hung ffmpeg processes
don't block save requests.

---

Duplicate comments:
In `@acestep/audio_utils.py`:
- Around line 1-577: This file is too large and must be split by responsibility
while keeping acestep.audio_utils as a stable facade re-exporting the originals;
create smaller modules (e.g., acestep.audio_processing with apply_fade and
normalize_audio, acestep.io with AudioSaver, save_audio, save_batch,
convert_audio, and acestep.hashing with get_audio_file_hash,
generate_uuid_from_params, generate_uuid_from_audio_data, get_lora_weights_hash)
and move related helper functions/constants (MP3_SAMPLE_RATE, MP3_BITRATE,
_save_mp3) accordingly; then update the original module to import and re-export
the same symbols (AudioSaver, save_audio, apply_fade, normalize_audio,
get_lora_weights_hash, get_audio_file_hash, generate_uuid_from_params,
generate_uuid_from_audio_data, save_batch, convert_audio) so external callers
remain unchanged. Ensure tests/imports still reference the same names and that
any internal relative imports (torchaudio, soundfile usage) are preserved in the
new modules.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c81479c7-f945-4d5c-935f-a2d187bc02b8

📥 Commits

Reviewing files that changed from the base of the PR and between 147325c and 58b79b2.

📒 Files selected for processing (2)
  • acestep/audio_utils.py
  • acestep/audio_utils_test.py

- add ffmpeg timeout handling
- remove fragile channel transpose heuristic
- prevent deleting output when input == output
- fix broken MP3 failure test fixtures
- add regression coverage for the above cases
Copy link
Contributor

@ChuxiJ ChuxiJ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are still some other issues with 44.1kHz.
We still recommend 48kHz, and suggest making it user-selectable here instead of hardcoding it.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
acestep/audio_utils.py (1)

1-590: ⚠️ Potential issue | 🟠 Major

Module is still far above the repo hard cap (already reported).

This file remains a multi-responsibility module at ~590 LOC. Please split by responsibility (or provide a concrete deferred split plan in PR notes) before merge.

As per coding guidelines, "Prefer single-responsibility modules with clear boundaries. Target module size: optimal <= 150 LOC, hard cap 200 LOC."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@acestep/audio_utils.py` around lines 1 - 590, The file is too large and mixes
responsibilities; split it into smaller modules: move audio processing functions
(apply_fade, normalize_audio, generate_uuid_from_audio_data) into a new
audio_processing module, move I/O and saving logic (AudioSaver class, _save_mp3,
save_audio, save_batch, convert_audio and the global _default_saver) into an
audio_io module, and move hashing/utility functions (get_lora_weights_hash,
get_audio_file_hash, generate_uuid_from_params) into a utils/hashing module;
update imports/exports so callers use the new modules (or re-export them from a
lightweight aceitstep.audio_utils shim) and include a short deferred split plan
in PR notes if you cannot split in this PR. Ensure each new module has tests and
preserved behavior and adjust any relative imports that referenced AudioSaver or
the helper functions.
🧹 Nitpick comments (1)
acestep/audio_utils_test.py (1)

234-260: Strengthen the success-path MP3 test by asserting subprocess.run kwargs.

This test validates command arguments, but it does not pin timeout/check/capture_output. A regression there could slip through.

💡 Suggested assertion additions
             mock_run.assert_called_once()
             cmd = mock_run.call_args[0][0]
+            run_kwargs = mock_run.call_args[1]
             self.assertIn('ffmpeg', cmd)
             self.assertIn('libmp3lame', cmd)
             self.assertIn('44100', cmd)
             self.assertIn('320k', cmd)
+            self.assertEqual(run_kwargs.get("timeout"), 120)
+            self.assertTrue(run_kwargs.get("check"))
+            self.assertTrue(run_kwargs.get("capture_output"))
             self.assertTrue(result.endswith('.mp3'))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@acestep/audio_utils_test.py` around lines 234 - 260, The MP3 export test
should also assert the subprocess.run keyword arguments to prevent regressions;
in test_save_audio_mp3_uses_explicit_ffmpeg_export_settings, after obtaining
mock_run.call_args inspect call_args[1] and add assertions that 'check' is True,
'capture_output' is True, and that 'timeout' exists and is a positive integer
(e.g. assert mock_run.call_args[1].get('timeout') is not None and > 0) so the
test pins these important kwargs for the AudioSaver.save_audio ffmpeg
invocation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@acestep/audio_utils.py`:
- Around line 429-450: The docstring incorrectly states MD5 while the
implementation uses SHA-256; update the function docstring to say SHA-256 (or
"SHA-256 hash") and adjust any descriptive text to match (e.g., "Compute a
SHA-256 hash identifying the currently loaded LoRA adapter weights" and "returns
hex digest string"). Locate the docstring above the function that uses
dit_handler, lora_service and hash_obj = hashlib.sha256() and change the
algorithm name and any related wording to accurately reflect SHA-256.

---

Duplicate comments:
In `@acestep/audio_utils.py`:
- Around line 1-590: The file is too large and mixes responsibilities; split it
into smaller modules: move audio processing functions (apply_fade,
normalize_audio, generate_uuid_from_audio_data) into a new audio_processing
module, move I/O and saving logic (AudioSaver class, _save_mp3, save_audio,
save_batch, convert_audio and the global _default_saver) into an audio_io
module, and move hashing/utility functions (get_lora_weights_hash,
get_audio_file_hash, generate_uuid_from_params) into a utils/hashing module;
update imports/exports so callers use the new modules (or re-export them from a
lightweight aceitstep.audio_utils shim) and include a short deferred split plan
in PR notes if you cannot split in this PR. Ensure each new module has tests and
preserved behavior and adjust any relative imports that referenced AudioSaver or
the helper functions.

---

Nitpick comments:
In `@acestep/audio_utils_test.py`:
- Around line 234-260: The MP3 export test should also assert the subprocess.run
keyword arguments to prevent regressions; in
test_save_audio_mp3_uses_explicit_ffmpeg_export_settings, after obtaining
mock_run.call_args inspect call_args[1] and add assertions that 'check' is True,
'capture_output' is True, and that 'timeout' exists and is a positive integer
(e.g. assert mock_run.call_args[1].get('timeout') is not None and > 0) so the
test pins these important kwargs for the AudioSaver.save_audio ffmpeg
invocation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9ebc6175-4669-4548-8e84-96544a98c3e4

📥 Commits

Reviewing files that changed from the base of the PR and between 58b79b2 and 60acebe.

📒 Files selected for processing (2)
  • acestep/audio_utils.py
  • acestep/audio_utils_test.py

Comment on lines +429 to +450
"""Compute an MD5 hash identifying the currently loaded LoRA adapter weights.

Iterates over the handler's LoRA service registry to find adapter weight
file paths, then hashes each file to produce a combined fingerprint.

Args:
dit_handler: DiT handler instance with LoRA state attributes.

Returns:
Hex digest string uniquely identifying the loaded LoRA weights,
or empty string if no LoRA is active.
"""
if not getattr(dit_handler, "lora_loaded", False):
return ""
if not getattr(dit_handler, "use_lora", False):
return ""

lora_service = getattr(dit_handler, "_lora_service", None)
if lora_service is None or not lora_service.registry:
return ""

hash_obj = hashlib.sha256()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Hash algorithm docstring is inaccurate.

The docstring states MD5, but the implementation uses SHA-256 (hashlib.sha256). Please update the docstring to avoid API confusion.

💡 Proposed docstring fix
-def get_lora_weights_hash(dit_handler) -> str:
-    """Compute an MD5 hash identifying the currently loaded LoRA adapter weights.
+def get_lora_weights_hash(dit_handler) -> str:
+    """Compute a SHA-256 hash identifying the currently loaded LoRA adapter weights.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@acestep/audio_utils.py` around lines 429 - 450, The docstring incorrectly
states MD5 while the implementation uses SHA-256; update the function docstring
to say SHA-256 (or "SHA-256 hash") and adjust any descriptive text to match
(e.g., "Compute a SHA-256 hash identifying the currently loaded LoRA adapter
weights" and "returns hex digest string"). Locate the docstring above the
function that uses dit_handler, lora_service and hash_obj = hashlib.sha256() and
change the algorithm name and any related wording to accurately reflect SHA-256.

@robustini
Copy link
Contributor Author

There are still some other issues with 44.1kHz. We still recommend 48kHz, and suggest making it user-selectable here instead of hardcoding it.

That makes sense.
In the original implementation, MP3 export went through torchaudio.save(..., backend="ffmpeg"), but bitrate and MP3-specific ffmpeg options were not actually exposed/passed through there, so there was no real user-selectable path for kbps/sample-rate tuning.
My change introduced an explicit MP3 export path so those parameters can be controlled deterministically. I agree that 48 kHz should remain the default/recommended option, and 44.1 kHz should be exposed as a user-selectable export setting rather than hardcoded.
I can revise the PR in that direction.

Keep explicit MP3 export parameters provided by the UI all the way to final export.
If the UI sends mp3_bitrate and mp3_sample_rate, those values are now preserved and used for MP3 output. If they are not provided, export behavior remains unchanged and the existing default MP3 settings are used.
@robustini
Copy link
Contributor Author

@ChuxiJ the export path now honors explicit MP3 bitrate and sample-rate values when they are provided by the UI.
If those parameters are not sent, MP3 export falls back to the existing default behavior exactly as before.
I haven't touched the Gradio, do it yourself if you think it's appropriate, tested with AceFlow and it works.

immagine

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (3)
acestep/audio_utils.py (3)

435-436: ⚠️ Potential issue | 🟡 Minor

Hash algorithm docstring is incorrect.

The docstring says MD5, but implementation uses hashlib.sha256() (Line 456). Please align the docstring to SHA-256.

💡 Proposed fix
-def get_lora_weights_hash(dit_handler) -> str:
-    """Compute an MD5 hash identifying the currently loaded LoRA adapter weights.
+def get_lora_weights_hash(dit_handler) -> str:
+    """Compute a SHA-256 hash identifying the currently loaded LoRA adapter weights.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@acestep/audio_utils.py` around lines 435 - 436, The docstring for the
function that "Compute an MD5 hash identifying the currently loaded LoRA adapter
weights" is incorrect: the implementation uses hashlib.sha256() (see the
function that computes the adapter weights hash), so update the docstring to
state that the function computes a SHA-256 hash (not MD5) and, if present,
adjust any descriptive text to reference SHA-256 and its properties consistently
with the use of hashlib.sha256().

377-380: ⚠️ Potential issue | 🟠 Major

Guard against deleting the converted output when paths match.

remove_input=True currently unlinks the input unconditionally; if input and output resolve to the same file, this deletes the final output.

💡 Proposed fix
-        output_path = self.save_audio(
+        output_path = Path(self.save_audio(
             audio_tensor,
             output_path,
             sample_rate=sample_rate,
             format=output_format,
             channels_first=True
-        )
+        ))
@@
-        if remove_input:
-            input_path.unlink()
-            logger.debug(f"[AudioSaver] Removed input file: {input_path}")
+        if remove_input and input_path.exists():
+            if input_path.resolve() != output_path.resolve():
+                input_path.unlink()
+                logger.debug(f"[AudioSaver] Removed input file: {input_path}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@acestep/audio_utils.py` around lines 377 - 380, Guard against deleting the
converted output by checking paths before unlinking: in the routine that handles
saving/converting audio (refer to variables input_path, output_path and the
remove_input logic in AudioSaver or the function that contains this unlink
block), only call input_path.unlink() when remove_input is True AND
input_path.resolve() != output_path.resolve(); if they are equal, skip deletion
and log a debug/warning that removal was skipped to avoid deleting the output.

238-255: ⚠️ Potential issue | 🟡 Minor

Remove shape-based transpose heuristic for channels_first=False.

The shape[0] > shape[1] heuristic can still mis-orient valid short clips. Make orientation deterministic from channels_first only.

💡 Proposed fix
         if isinstance(audio_data, np.ndarray):
-            if channels_first:
-                # numpy already [channels, samples]
-                audio_tensor = torch.from_numpy(audio_data).float()
-            else:
-                # numpy [samples, channels] -> tensor [samples, channels] -> [channels, samples] (if transposed)
-                audio_tensor = torch.from_numpy(audio_data).float()
-                if audio_tensor.dim() == 2 and audio_tensor.shape[0] > audio_tensor.shape[1]:
-                     # Assume [samples, channels] if dim0 > dim1 (heuristic)
-                     audio_tensor = audio_tensor.T
+            audio_tensor = torch.from_numpy(audio_data).float()
+            if not channels_first and audio_tensor.dim() == 2:
+                audio_tensor = audio_tensor.transpose(0, 1)
         else:
             # torch tensor
             audio_tensor = audio_data.cpu().float()
             if not channels_first and audio_tensor.dim() == 2:
-                # [samples, channels] -> [channels, samples]
-                if audio_tensor.shape[0] > audio_tensor.shape[1]:
-                    audio_tensor = audio_tensor.T
+                audio_tensor = audio_tensor.transpose(0, 1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@acestep/audio_utils.py` around lines 238 - 255, Remove the shape-based
transpose heuristic and make orientation deterministic from channels_first only:
stop using audio_tensor.shape[0] > audio_tensor.shape[1] to decide transposition
and instead transpose solely based on the channels_first flag (for 2-D inputs).
Concretely, in the numpy branch (handling audio_data as np.ndarray) and the
torch branch (audio_data as tensor), if channels_first is True and
audio_tensor.dim() == 2 then transpose audio_tensor; otherwise do not
transpose—delete the conditional checks that compare shape[0] and shape[1].
Ensure you update both places referencing audio_data/audio_tensor so orientation
is driven exclusively by channels_first.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@acestep/audio_utils.py`:
- Around line 435-436: The docstring for the function that "Compute an MD5 hash
identifying the currently loaded LoRA adapter weights" is incorrect: the
implementation uses hashlib.sha256() (see the function that computes the adapter
weights hash), so update the docstring to state that the function computes a
SHA-256 hash (not MD5) and, if present, adjust any descriptive text to reference
SHA-256 and its properties consistently with the use of hashlib.sha256().
- Around line 377-380: Guard against deleting the converted output by checking
paths before unlinking: in the routine that handles saving/converting audio
(refer to variables input_path, output_path and the remove_input logic in
AudioSaver or the function that contains this unlink block), only call
input_path.unlink() when remove_input is True AND input_path.resolve() !=
output_path.resolve(); if they are equal, skip deletion and log a debug/warning
that removal was skipped to avoid deleting the output.
- Around line 238-255: Remove the shape-based transpose heuristic and make
orientation deterministic from channels_first only: stop using
audio_tensor.shape[0] > audio_tensor.shape[1] to decide transposition and
instead transpose solely based on the channels_first flag (for 2-D inputs).
Concretely, in the numpy branch (handling audio_data as np.ndarray) and the
torch branch (audio_data as tensor), if channels_first is True and
audio_tensor.dim() == 2 then transpose audio_tensor; otherwise do not
transpose—delete the conditional checks that compare shape[0] and shape[1].
Ensure you update both places referencing audio_data/audio_tensor so orientation
is driven exclusively by channels_first.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 48f64c76-1b68-4fd7-a254-93aeab3551e4

📥 Commits

Reviewing files that changed from the base of the PR and between 60acebe and 8790c2f.

📒 Files selected for processing (3)
  • acestep/audio_utils.py
  • acestep/audio_utils_test.py
  • acestep/inference.py

@robustini robustini changed the title Force MP3 export to 320 kbps CBR at 44.1 kHz Selectable MP3 export format Mar 17, 2026
@robustini robustini requested a review from ChuxiJ March 17, 2026 10:11
Copy link
Contributor

@ChuxiJ ChuxiJ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Verdict: Needs Changes ❌

The MP3 export feature itself is well-implemented (_save_mp3() with proper temp file cleanup, input validation, timeout), but the inference.py refactoring introduces a functional regression.

Critical: Dropped global_caption parameter

The refactoring converts the dit_handler.generate_music() call from explicit keywords to a dict + inspect.signature() filter. However, the new dict omits global_caption=params.global_caption which is present in the current main branch and accepted by the handler. The user's global_caption value will be silently ignored — this is a functional regression.

chunk_mask_mode is also missing from the dict, though its default "auto" matches the previous getattr fallback, making it benign.

Medium: inspect.signature() pattern is fragile

This pattern silently drops parameters without error, making bugs like the above easy to introduce. A misspelled key would be silently filtered out instead of raising TypeError. The logger.debug for dropped keys is easy to miss. The previous explicit-keyword approach was safer.

Required changes

  1. Add global_caption and chunk_mask_mode back to dit_generate_kwargs
  2. Consider raising a warning (not just debug) when parameters are dropped by the signature filter

Other suggestions (non-blocking)

  • Forward mp3_bitrate/mp3_sample_rate in convert_audio(), save_batch(), and the module-level save_audio() for consistency
  • Change logger.info at the export log line to logger.debug to reduce noise in batch generation
  • Minimize the audio_utils.py diff to only actual semantic changes (current diff is a full-file replacement)

This fix restores the missing global_caption forwarding in acestep/inference.py, which was accidentally dropped during the dict + inspect.signature() refactor and caused a functional regression.
It also restores chunk_mask_mode in the forwarded kwargs and changes the log for filtered-out parameters from debug to warning, so unexpected drops are more visible during development and review.
Scope intentionally kept minimal: only the file already involved in the PR was updated.
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
acestep/inference.py (1)

200-201: Consider documenting the new fields in the class docstring.

The new mp3_bitrate and mp3_sample_rate fields are functional with sensible defaults aligned with PR objectives (48kHz as the default). However, the GenerationConfig docstring (lines 179-192) doesn't document these new attributes.

📝 Suggested docstring addition

Add to the Attributes section around line 191:

         audio_format: Output audio format, one of "mp3", "wav", "flac", "wav32", "opus", "aac". Default: "flac"
+        mp3_bitrate: MP3 encoding bitrate (e.g., "128k", "192k", "320k"). Only used when audio_format="mp3".
+        mp3_sample_rate: MP3 output sample rate in Hz (e.g., 44100, 48000). Only used when audio_format="mp3".
     """
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@acestep/inference.py` around lines 200 - 201, The GenerationConfig class
docstring is missing descriptions for the newly added attributes mp3_bitrate and
mp3_sample_rate; update the GenerationConfig docstring (Attributes section) to
document mp3_bitrate: str (e.g., "MP3 bitrate for output, default '128k'") and
mp3_sample_rate: int (sample rate in Hz, default 48000), briefly stating
units/format and default values so readers and auto-doc tools see the new
fields; locate GenerationConfig and add these two lines to its Attributes block
near the other parameter docs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@acestep/inference.py`:
- Around line 200-201: The GenerationConfig class docstring is missing
descriptions for the newly added attributes mp3_bitrate and mp3_sample_rate;
update the GenerationConfig docstring (Attributes section) to document
mp3_bitrate: str (e.g., "MP3 bitrate for output, default '128k'") and
mp3_sample_rate: int (sample rate in Hz, default 48000), briefly stating
units/format and default values so readers and auto-doc tools see the new
fields; locate GenerationConfig and add these two lines to its Attributes block
near the other parameter docs.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 00fd3d5b-f249-496a-a0cb-4728ba1ff1aa

📥 Commits

Reviewing files that changed from the base of the PR and between 8790c2f and d0cf144.

📒 Files selected for processing (1)
  • acestep/inference.py

Copy link
Contributor

@ChuxiJ ChuxiJ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Overall: Well-implemented MP3 export with proper error handling, input validation, and test coverage.

Key Items to Address

  1. Medium - API gap for convert_audio and save_batch: These methods (and the module-level save_audio convenience function) do not accept the new mp3_bitrate/mp3_sample_rate parameters. Users calling these functions cannot control MP3 export settings. At minimum, thread the parameters through convert_audio and save_batch, or document this limitation.

  2. Medium - Whole-file rewrite inflates diff: The diff shows -516/+596 lines for audio_utils.py, but the actual changes are ~60 lines. This was likely caused by re-formatting. A cleaner diff would make review easier.

  3. Low - logger.info in hot path: The new logger.info at line 743 in inference.py logs on every audio export. For batch generation this is noisy. Consider using logger.debug to match the existing logging style in save_audio.

  4. Low - Test formatting: Multi-patch decorators on single lines are hard to read. Consider stacking or using with blocks.

Good Practices

  • subprocess.run with list args prevents shell injection
  • Input validation on bitrate and sample rate with sensible defaults
  • timeout=120 on subprocess prevents runaway ffmpeg processes
  • Temp file cleanup in finally block
  • inspect.signature filtering in inference.py is a clever forward-compatibility mechanism

No blocking issues.

Thread mp3_bitrate and mp3_sample_rate through save_audio, convert_audio, and save_batch so callers can control MP3 export settings consistently.
Also reduce noisy export logging in inference.py from info to debug, improve readability of multi-patch test formatting, and add coverage to verify MP3 parameter forwarding.
Copy link
Contributor

@ChuxiJ ChuxiJ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Thanks for the PR! The feature concept (selectable MP3 export format) is useful. However:

Concern: Scope

The diff rewrites the entire audio_utils.py file (516 lines deleted, 618 added). For adding MP3 format selection, the changes should be much more targeted — primarily adding parameters to the UI and passing them through to torchaudio.save().

Could you scope this down to only the changes needed for MP3 format selection? A full rewrite of audio_utils.py makes it very hard to review and introduces regression risk.

Specific feedback:

  1. The batch processing and audio effects features (fade, normalization) mentioned in the release notes appear to already exist in the current codebase. Please ensure you're not duplicating existing functionality.
  2. CodeRabbit has paused review due to rapid commits — please stabilize the branch before requesting final review.

Suggestion:

  • Rebase on latest main
  • Keep only the MP3 bitrate/sample rate selection changes
  • Add a test verifying the new MP3 export options work correctly

@ChuxiJ
Copy link
Contributor

ChuxiJ commented Mar 20, 2026

Code review

The core _save_mp3 mechanism (explicit ffmpeg subprocess with temp WAV) is sound, and test coverage is solid.

Found 2 issues:

  1. repaint_* fields in GenerationParams are dead code -- repaint_latent_crossfade_frames, repaint_wav_crossfade_sec, repaint_mode, and repaint_strength are declared in GenerationParams (lines 144-147) but never added to dit_generate_kwargs (lines 593-628). The inspect.signature filtering shim cannot pass keys that are not in the dict. Any non-default values set by callers will be silently dropped. These four keys need to be added to dit_generate_kwargs.

  2. Default bitrate (128k) contradicts stated purpose (320k) -- The PR title, branch name (mp3-320k-44k1), and early commit messages claim the goal is 320 kbps CBR. But MP3_DEFAULT_BITRATE = "128k" and mp3_bitrate: str = "128k" mean every user gets 128k unless they explicitly override. No UI wiring for mp3_bitrate/mp3_sample_rate is included in this PR, so the user-facing behavior is unchanged from before. Either update the description or raise the default.

Minor notes:

  • getattr(config, "mp3_bitrate", "128k") is redundant since both are proper dataclass fields
  • ffmpeg command uses ABR mode by default (not CBR as stated) -- for true CBR, needs -abr 0
  • Unused imports: io, Tuple in audio_utils.py

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

Copy link
Contributor

@ChuxiJ ChuxiJ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review: Selectable MP3 export format

Thanks for the contribution! The MP3 export configurability is useful.

Concerns

  1. Full file rewrite of audio_utils.py: This PR replaces the entire 516-line file with a 618-line rewrite (+777/-559). A full rewrite makes it very difficult to review what actually changed vs what was just reformatted. Could you restructure this as incremental changes on top of the existing file? This would make the diff much clearer and reduce merge conflict risk.

  2. Missing changes in Gradio UI: The PR description says "If the UI does not support these settings, generation will default to the previous settings" — but without UI controls, users can't actually select MP3 bitrate/sample rate. Is there a follow-up PR planned for the Gradio UI integration?

  3. inference.py changes: The +58/-43 diff in inference.py also appears to be partially a reformat. Please separate formatting changes from functional changes.

Suggestions

  • Break this into two PRs: (1) the MP3 export feature with minimal changes, (2) any refactoring/cleanup separately
  • Add MP3 bitrate/sample rate options to the Gradio UI and API request schema
  • Add a test that verifies MP3 output with non-default bitrate

The feature itself is welcome — just needs a cleaner diff. Happy to re-review once restructured.

- pass repaint params through dit_generate_kwargs
- set MP3 defaults to 320k and 44100
- force true CBR export with ffmpeg -abr 0
- update tests for new MP3 defaults
- remove redundant fallback access and unused imports
Add a minimal backend-only implementation for configurable MP3 export.

This change keeps the diff intentionally small and focused on the core feature:
- add optional MP3 bitrate and sample rate settings in the backend
- pass those settings through the generation pipeline
- add targeted tests for default and custom MP3 export behavior

The Gradio UI was not modified on purpose, to leave lead developers full freedom to integrate these settings where and how they see fit.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants