Self-Guidance: Enhancing Neural Codecs via Decoder Manifold Alignment


Anonymous Authors

Abstract: Neural speech codecs based on Vector-Quantized VAEs (VQ-VAEs) are core audio tokenizers for speech LLMs, yet their reconstruction fidelity is bottlenecked by quantization error. Instead of modifying the quantizer or increasing model capacity—common approaches that complicate downstream language modeling—we introduce self-guidance, a simple yet general training principle that enhances the decoder's robustness to quantization error. Our core idea is to align the decoder's internal feature manifolds when processing both the quantized tokens and their original continuous embeddings, using a lightweight feature-mapping loss. This requires minimal training overhead and no inference-time changes. Applied to XCodec2, self-guidance improves all reconstruction metrics, achieving state-of-the-art low-bitrate performance. It generalizes across codebook sizes, quantizer types, and network architectures, demonstrating value as a universal codec enhancer. Notably, it enables a 4× codebook reduction without fidelity loss, which downstream TTS experiments show significantly improves LLM-based synthesis by simplifying the token modeling space. Self-guidance thus establishes an efficient, broadly applicable method for high-fidelity neural audio coding.


Figure 1: Illustration of the VQ-VAE architecture and the proposed self-guidance (SG) mechanism

Quantitative demonstration of decoder feature alignment

In this section, we examine how self-guidance works by providing statistical evidence on two key measures illustrated in Figure 1:

  • Quantization error: the error between the pre-quantization latent embedding and the quantized token embedding
    [Equation: quantization error expression]
  • Hidden feature alignment MSE: the mean squared error between the decoder's hidden features computed from the pre-quantization latent embedding and those computed from the quantized token embedding
    [Equation: hidden feature alignment MSE expression]
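The exact expressions for these two measures are not reproduced in this text. As a hedged sketch only, one plausible formalization is given below. The symbols $z_e$ (pre-quantization latent embedding), $z_q$ (quantized token embedding), $h(\cdot)$ (the decoder's hidden feature map), and $d$ (hidden feature dimension), as well as the choice of the Euclidean norm for the quantization error, are assumptions introduced here for illustration and are not taken from the source.

```latex
% Assumed notation (not from the source):
%   z_e : pre-quantization latent embedding of one frame
%   z_q : quantized token embedding of the same frame
%   h   : decoder hidden feature map, with output dimension d
\begin{align}
e_{\mathrm{q}} &= \lVert z_e - z_q \rVert_2, \\
\mathrm{MSE}_{\mathrm{h}} &= \frac{1}{d}\, \lVert h(z_e) - h(z_q) \rVert_2^2 .
\end{align}
```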

We run inference on the LibriSpeech test-clean subset using models with varying codebook sizes, trained with and without self-guidance. The results show that self-guidance has little influence on quantization error but significantly reduces hidden feature alignment MSE. This finding indicates that self-guidance works by aligning the decoder's internal feature manifold, which makes the decoder more robust to quantization error, rather than by reducing the quantization error itself.
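To make the two measures concrete, the following minimal sketch computes them per frame on synthetic data. It is not the authors' evaluation code: the nearest-neighbour quantizer, the single-layer `toy_decoder_hidden` stand-in, the Euclidean norm used for the quantization error, and all array sizes are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def quantize(z_e, codebook):
    """Nearest-neighbour lookup: map each frame to its closest codebook entry."""
    # Pairwise squared distances between frames (T, D) and codes (K, D) -> (T, K).
    d2 = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)
    return codebook[idx], idx


def toy_decoder_hidden(z, weight):
    """Hypothetical stand-in for one decoder layer (not the real codec decoder)."""
    return np.tanh(z @ weight)


def quantization_error(z_e, z_q):
    """Per-frame Euclidean distance between continuous and quantized embeddings."""
    return np.linalg.norm(z_e - z_q, axis=-1)


def hidden_alignment_mse(h_e, h_q):
    """Per-frame mean squared error between the two sets of hidden features."""
    return ((h_e - h_q) ** 2).mean(axis=-1)


# Synthetic sizes (assumed): T frames, latent dim D, K codes, hidden dim H.
T, D, K, H = 200, 8, 64, 16
z_e = rng.normal(size=(T, D))
codebook = rng.normal(size=(K, D))
weight = rng.normal(size=(D, H))

z_q, _ = quantize(z_e, codebook)
q_err = quantization_error(z_e, z_q)
h_mse = hidden_alignment_mse(
    toy_decoder_hidden(z_e, weight), toy_decoder_hidden(z_q, weight)
)

print(f"quantization error: mean={q_err.mean():.3f} std={q_err.std():.3f}")
print(f"hidden MSE:         mean={h_mse.mean():.3f} std={h_mse.std():.3f}")
```

In this framing, self-guidance would be expected to leave the first statistic roughly unchanged while lowering the second, since it acts on the decoder rather than on the quantizer.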

Quantization Error

As shown in Figure 2, the distribution of quantization error is relatively stable with and without self-guidance, indicating that self-guidance does not significantly impact this metric. This is consistent with the statistics in the following table.

| Codebook size | With self-guidance | Quantization error mean | Quantization error std |
| --- | --- | --- | --- |
| 65536 | No | 0.858 | 0.120 |
| 65536 | Yes | 0.851 | 0.121 |
| 16384 | No | 0.798 | 0.121 |
| 16384 | Yes | 0.799 | 0.120 |
| 8192 | No | 0.741 | 0.120 |
| 8192 | Yes | 0.744 | 0.121 |

Figure 2: Quantization error histogram for different codebook sizes with and without self-guidance

Hidden feature alignment MSE

As shown in Figure 3, without self-guidance a noticeably larger portion of the distribution falls at high values (note that the x-axis is log-scaled), which is harmful to faithful reconstruction. The following table confirms this: self-guidance significantly reduces both the mean and the standard deviation of the hidden feature alignment MSE.

| Codebook size | With self-guidance | Hidden MSE mean | Hidden MSE std |
| --- | --- | --- | --- |
| 65536 | No | 9.439 | 29.203 |
| 65536 | Yes | 5.854 | 13.712 |
| 16384 | No | 13.551 | 59.137 |
| 16384 | Yes | 4.958 | 5.863 |
| 8192 | No | 23.605 | 109.458 |
| 8192 | Yes | 4.197 | 6.865 |
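As a small worked example, the relative reduction in mean hidden feature alignment MSE can be computed directly from the table values. The helper name `relative_reduction` is introduced here for illustration; the numbers are copied from the table.

```python
# Mean hidden feature alignment MSE from the table: (without SG, with SG).
hidden_mse_mean = {
    65536: (9.439, 5.854),
    16384: (13.551, 4.958),
    8192: (23.605, 4.197),
}


def relative_reduction(without_sg, with_sg):
    """Fraction by which self-guidance lowers a statistic."""
    return (without_sg - with_sg) / without_sg


for codebook_size, (without_sg, with_sg) in hidden_mse_mean.items():
    reduction = relative_reduction(without_sg, with_sg)
    print(f"{codebook_size}: {100 * reduction:.1f}% lower mean hidden MSE")

# 65536: 38.0% lower mean hidden MSE
# 16384: 63.4% lower mean hidden MSE
# 8192: 82.2% lower mean hidden MSE
```

By this measure, the relative reduction grows as the codebook shrinks, because the mean without self-guidance rises sharply for smaller codebooks while the mean with self-guidance stays in a narrow range.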

Figure 3: Hidden feature alignment MSE histogram for different codebook sizes with and without self-guidance
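Because the hidden feature alignment MSE is heavy-tailed, a histogram with logarithmically spaced bins makes the tail visible. The sketch below shows one way to build such a histogram with NumPy. The function names `log_histogram` and `tail_fraction`, the bin count, and the log-normal synthetic data are assumptions for illustration and do not reproduce the data behind Figure 3.

```python
import numpy as np


def log_histogram(values, n_bins=20):
    """Histogram with logarithmically spaced bin edges (positive values only)."""
    values = np.asarray(values, dtype=float)
    values = values[values > 0]  # a log axis cannot represent non-positive values
    lo, hi = values.min(), values.max()
    edges = np.logspace(np.log10(lo), np.log10(hi), n_bins + 1)
    # Pin the outer edges so rounding in logspace cannot exclude the extremes.
    edges[0], edges[-1] = lo, hi
    counts, edges = np.histogram(values, bins=edges)
    return counts, edges


def tail_fraction(values, threshold):
    """Share of values strictly above a threshold."""
    return float((np.asarray(values) > threshold).mean())


rng = np.random.default_rng(0)
# Synthetic heavy-tailed stand-in for per-frame hidden MSE values.
samples = rng.lognormal(mean=1.0, sigma=1.0, size=5000)

counts, edges = log_histogram(samples, n_bins=20)
print("counts per bin:", counts)
print("share above 10x the median:", tail_fraction(samples, 10 * np.median(samples)))
```

Comparing `tail_fraction` between two models at the same threshold gives a single number for the shift in the high-value portion that the histogram shows visually.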

Reconstruction results from different models

We provide a collection of audio samples, including the ground truth (GT) and the reconstruction results from the following neural codec models:

| Abbreviation | Frame rate | Total bitrate | Description |
| --- | --- | --- | --- |
| GT | - | - | Ground truth audio |
| BigCodec.40Hz | 40 Hz | 520 bps | A compact BigCodec model with a lower frame rate, serving as the lower bound for comparison |
| XCodec2 | 50 Hz | 800 bps | Default XCodec2 model, serving as the baseline |
| XCodec2+SG | 50 Hz | 800 bps | XCodec2 model with the proposed self-guidance |
| BigCodec | 80 Hz | 1040 bps | Default BigCodec model, serving as the upper bound |
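The bitrates in the table are consistent with a simple relation: bitrate = frame rate × bits per frame, where a single codebook of size K costs log2(K) bits per frame. The sketch below works through that arithmetic. The single-codebook assumption and the function names are introduced here for illustration; the table itself does not state codebook sizes, so the implied sizes are derived values rather than reported ones.

```python
import math


def bitrate_bps(frame_rate_hz, codebook_size, n_codebooks=1):
    """Bitrate of a token stream: frames per second times bits per frame."""
    return frame_rate_hz * n_codebooks * math.log2(codebook_size)


def implied_codebook_size(frame_rate_hz, total_bps):
    """Codebook size implied by a bitrate, assuming one codebook per frame."""
    return 2 ** (total_bps / frame_rate_hz)


# (frame rate in Hz, total bitrate in bps) from the table above.
models = {
    "BigCodec.40Hz": (40, 520),
    "XCodec2": (50, 800),
    "BigCodec": (80, 1040),
}

for name, (frame_rate, bps) in models.items():
    bits = bps / frame_rate
    size = implied_codebook_size(frame_rate, bps)
    print(f"{name}: {bits:.0f} bits/frame -> implied codebook size {size:.0f}")

# BigCodec.40Hz: 13 bits/frame -> implied codebook size 8192
# XCodec2: 16 bits/frame -> implied codebook size 65536
# BigCodec: 13 bits/frame -> implied codebook size 8192

# A 4x smaller codebook at 50 Hz (65536 -> 16384) saves 2 bits per frame.
print(bitrate_bps(50, 16384))  # 700.0
```

Under the same assumption, the 4× codebook reduction mentioned in the abstract corresponds to going from 16 to 14 bits per frame at 50 Hz.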

Each model is trained on the LibriSpeech training dataset (EN) for 600,000 iterations on 8 NVIDIA RTX 4090 GPUs. Audio samples are drawn from the LibriSpeech test-clean subset, together with their text transcripts. We encourage listeners to pay attention to differences in clarity and to the presence of artifacts across the reconstructions.

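| Text script | GT | BigCodec.40Hz | XCodec2 | XCodec2+SG | BigCodec |
| --- | --- | --- | --- | --- | --- |
| hello bertie any good in your mind | [audio] | [audio] | [audio] | [audio] | [audio] |
| if she could only see phronsie for just one moment | [audio] | [audio] | [audio] | [audio] | [audio] |
| father thee's unjust to philip he's going into business | [audio] | [audio] | [audio] | [audio] | [audio] |
| i say sir harry the little girl's going famously to night isn't she | [audio] | [audio] | [audio] | [audio] | [audio] |
| been looking up tooms county | [audio] | [audio] | [audio] | [audio] | [audio] |
| it's a stock company and rich | [audio] | [audio] | [audio] | [audio] | [audio] |
| you are mate replied the sailor | [audio] | [audio] | [audio] | [audio] | [audio] |
| i don't want to stand around and look on | [audio] | [audio] | [audio] | [audio] | [audio] |
| mister jago is an american philip | [audio] | [audio] | [audio] | [audio] | [audio] |
| don't worry sizzle dear it'll all come right pretty soon | [audio] | [audio] | [audio] | [audio] | [audio] |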

Qualitative demonstration of the fidelity enhancement from self-guidance

In this case study, we analyze a group of audio samples selected from the table above. By comparing samples reconstructed with and without self-guidance, we qualitatively demonstrate how the self-guidance mechanism reduces quantization artifacts and enhances the perceptual quality of the reconstructed speech.

Case 1: Smeared harmonics

Description: In this case, the harmonic structure of the speech reconstructed by the XCodec2 baseline is blurry at the word "phronsie", leading to bubbly artifacts in the audio. With the proposed self-guidance mechanism, the reconstructed speech shows a clearer and sharper harmonic structure.

Text script: "if she could only see phronsie for just one moment"

| Source | Audio | Spectrogram |
| --- | --- | --- |
| GT | [audio] | [Spectrogram of GT] |
| XCodec2 | [audio] | [Spectrogram of XCodec2] |
| XCodec2+SG | [audio] | [Spectrogram of XCodec2+SG] |

Case 2: Pitch spike

Description: In this case, there is an undesired pitch spike at the word "sir" in the speech reconstructed by the XCodec2 baseline, where the fundamental frequency suddenly jumps up. With the proposed self-guidance mechanism, this issue is effectively mitigated.

Text script: "i say sir harry the little girl's going famously to night isn't she"

| Source | Audio | Spectrogram |
| --- | --- | --- |
| GT | [audio] | [Spectrogram of GT] |
| XCodec2 | [audio] | [Spectrogram of XCodec2] |
| XCodec2+SG | [audio] | [Spectrogram of XCodec2+SG] |

Case 3: Oversmoothed harmonic shape

Description: In this case, the high-order harmonics of the speech reconstructed by the XCodec2 baseline are oversmoothed into flat lines at the phrase "you are mate", leading to repeating "echoes" in the audio. With the proposed self-guidance mechanism, the harmonics of the GT are better preserved.

Text script: "you are mate, replied the sailor"

| Source | Audio | Spectrogram |
| --- | --- | --- |
| GT | [audio] | [Spectrogram of GT] |
| XCodec2 | [audio] | [Spectrogram of XCodec2] |
| XCodec2+SG | [audio] | [Spectrogram of XCodec2+SG] |

Appendix: Failure Case and Limitation

While self-guidance significantly lowers the overall frequency of the artifacts shown above, it does not eliminate them entirely. Due to training dynamics, the proposed approach may still produce artifacts in certain cases.

Description: In this case, the audio reconstructed by the proposed approach exhibits a depressed pitch on the first word of the utterance, "also".

Text script: "also, there was a stripling page who turned into a mai"

| Source | Audio | Spectrogram |
| --- | --- | --- |
| GT | [audio] | [Spectrogram of GT] |
| XCodec2 | [audio] | [Spectrogram of XCodec2] |
| XCodec2+SG | [audio] | [Spectrogram of XCodec2+SG] |