Self-Guidance: Enhancing Neural Codecs via Decoder Manifold Alignment
Anonymous Authors
Abstract: Neural speech codecs based on Vector-Quantized VAEs (VQ-VAEs) are core audio tokenizers for speech LLMs, yet their reconstruction fidelity is bottlenecked by quantization error. Instead of modifying the quantizer or increasing model capacity—common approaches that complicate downstream language modeling—we introduce self-guidance, a simple yet general training principle that enhances the decoder's robustness to quantization error. Our core idea is to align the decoder's internal feature manifolds when processing both the quantized tokens and their original continuous embeddings, using a lightweight feature-mapping loss. This requires minimal training overhead and no inference-time changes. Applied to XCodec2, self-guidance improves all reconstruction metrics, achieving state-of-the-art low-bitrate performance. It generalizes across codebook sizes, quantizer types, and network architectures, demonstrating value as a universal codec enhancer. Notably, it enables a 4× codebook reduction without fidelity loss, which downstream TTS experiments show significantly improves LLM-based synthesis by simplifying the token modeling space. Self-guidance thus establishes an efficient, broadly applicable method for high-fidelity neural audio coding.
Figure 1: Illustration of the VQ-VAE architecture and the proposed self-guidance (SG) mechanism
Quantitative demonstration of decoder feature alignment
In this section, we show how self-guidance works by providing statistical evidence on two key measures, both defined with respect to the architecture in Figure 1:
- Quantization error: the error between the pre-quantization latent embedding and the quantized token embedding
- Hidden feature alignment MSE: the mean squared error between the decoder's hidden features computed from the pre-quantization latent embedding and those computed from the quantized token embedding
We run inference on the LibriSpeech test-clean subset using models with varying codebook sizes, with and without self-guidance. The results show that self-guidance has little influence on quantization error but significantly reduces hidden feature alignment MSE. This finding confirms that self-guidance works by aligning the decoder's internal feature manifold, which enhances robustness to quantization error, rather than by reducing quantization error directly.
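The two measures can be sketched on a toy example. This is a minimal illustration, not the authors' evaluation code: the nearest-neighbour quantizer, the one-layer `toy_decoder_hidden` function, the random codebook, and the use of Euclidean distance for quantization error are all assumptions made here for illustration.

```python
import math
import random

def nearest_code(z, codebook):
    """Return the codebook vector closest to z in Euclidean distance."""
    return min(codebook, key=lambda c: math.dist(z, c))

def toy_decoder_hidden(x, weights):
    """Hypothetical one-layer decoder stand-in: hidden = tanh(W x)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]

def quantization_error(z, z_q):
    """Distance between the pre-quantization and quantized embeddings (assumed Euclidean)."""
    return math.dist(z, z_q)

def hidden_alignment_mse(h_cont, h_quant):
    """Mean squared error between the two sets of decoder hidden features."""
    return sum((a - b) ** 2 for a, b in zip(h_cont, h_quant)) / len(h_cont)

# Toy setup: random codebook and random decoder weights (illustrative only).
rng = random.Random(0)
dim, hidden_dim, codebook_size = 4, 6, 16
codebook = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(codebook_size)]
weights = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(hidden_dim)]

z = [rng.gauss(0, 1) for _ in range(dim)]   # pre-quantization latent embedding
z_q = nearest_code(z, codebook)             # quantized token embedding

q_err = quantization_error(z, z_q)
h_mse = hidden_alignment_mse(
    toy_decoder_hidden(z, weights),         # decoder features from the continuous input
    toy_decoder_hidden(z_q, weights),       # decoder features from the quantized input
)
print(f"quantization error: {q_err:.3f}, hidden alignment MSE: {h_mse:.3f}")
```

In this framing, the quantization error depends only on the encoder output and the codebook, while the hidden alignment MSE also depends on how the decoder responds to the perturbation, which is why the two can move independently.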
Quantization Error
As shown in Figure 2, the distribution of quantization error is nearly unchanged with and without self-guidance, indicating that self-guidance does not significantly impact this metric. This is consistent with the statistics in the following table, where the means differ by at most 0.007 and the standard deviations by at most 0.001.
| Codebook size | With self-guidance | Quantization Error mean | Quantization Error std |
|---|---|---|---|
| 65536 | No | 0.858 | 0.120 |
| 65536 | Yes | 0.851 | 0.121 |
| 16384 | No | 0.798 | 0.121 |
| 16384 | Yes | 0.799 | 0.120 |
| 8192 | No | 0.741 | 0.120 |
| 8192 | Yes | 0.744 | 0.121 |
Figure 2: Quantization error histogram for different codebook sizes with and without self-guidance
Hidden feature alignment MSE
As shown in Figure 3, without self-guidance a noticeably larger portion of the distribution falls at higher values (note that the x-axis is log-scale), which is harmful to faithful reconstruction. The following table confirms this: self-guidance significantly reduces both the mean and the standard deviation at every codebook size.
| Codebook size | With self-guidance | Hidden MSE mean | Hidden MSE std |
|---|---|---|---|
| 65536 | No | 9.439 | 29.203 |
| 65536 | Yes | 5.854 | 13.712 |
| 16384 | No | 13.551 | 59.137 |
| 16384 | Yes | 4.958 | 5.863 |
| 8192 | No | 23.605 | 109.458 |
| 8192 | Yes | 4.197 | 6.865 |
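To make the size of the effect concrete, the relative reduction in mean hidden MSE can be computed directly from the table above. This is a small arithmetic sketch; the dictionary layout and the `relative_reduction` helper are illustrative names, and the numbers are copied from the table.

```python
# codebook size -> (mean hidden MSE without SG, mean hidden MSE with SG), from the table above
hidden_mse_mean = {
    65536: (9.439, 5.854),
    16384: (13.551, 4.958),
    8192: (23.605, 4.197),
}

def relative_reduction(before, after):
    """Fraction by which `after` is lower than `before`."""
    return (before - after) / before

reductions = {
    size: relative_reduction(before, after)
    for size, (before, after) in hidden_mse_mean.items()
}

for size, r in reductions.items():
    print(f"codebook {size}: mean hidden MSE reduced by {100 * r:.1f}%")
```

The reductions come out at roughly 38%, 63%, and 82% for codebook sizes 65536, 16384, and 8192 respectively, so the benefit grows as the codebook shrinks.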
Figure 3: Hidden feature alignment MSE histogram for different codebook sizes with and without self-guidance
Reconstruction results from different models
We provide a collection of audio samples, including the ground truth (GT) and the reconstruction results from the following neural codec models:
| Abbreviation | Frame rate | Total bitrate | Description |
|---|---|---|---|
| GT | - | - | Ground truth audio |
| BigCodec.40Hz | 40Hz | 520 bps | A compact BigCodec model with a lower frame rate, serving as the lower bound for comparison |
| XCodec2 | 50Hz | 800 bps | Default XCodec2 model, serving as a baseline |
| XCodec2+SG | 50Hz | 800 bps | XCodec2 model with the proposed self-guidance |
| BigCodec | 80Hz | 1040 bps | Default BigCodec model, serving as the upper limit |
Each model is trained on the LibriSpeech training dataset (EN) for 600,000 iterations with 8 Nvidia RTX 4090 GPUs. Audio samples are drawn from the LibriSpeech test-clean subset and are presented together with their text transcripts. We encourage listeners to pay attention to differences in clarity and the presence of artifacts across the reconstructions.
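The bitrates in the model table above are consistent with the usual single-codebook relation, bitrate = frame rate × log2(codebook size). The sketch below checks this arithmetic. The codebook sizes used here (8192 for both BigCodec variants, 65536 for XCodec2) are assumptions inferred from the listed bitrates, not values stated on this page, and `bitrate_bps` is an illustrative helper.

```python
import math

def bitrate_bps(frame_rate_hz, codebook_size, num_codebooks=1):
    """Bits per second for a codec emitting `num_codebooks` tokens per frame."""
    return frame_rate_hz * num_codebooks * math.log2(codebook_size)

# (frame rate in Hz, assumed codebook size) per model
configs = {
    "BigCodec.40Hz": (40, 8192),   # assumed 13-bit codebook
    "XCodec2": (50, 65536),        # assumed 16-bit codebook
    "BigCodec": (80, 8192),        # assumed 13-bit codebook
}

bitrates = {name: bitrate_bps(rate, size) for name, (rate, size) in configs.items()}
for name, bps in bitrates.items():
    print(f"{name}: {bps:.0f} bps")

# A 4x smaller codebook at the same 50 Hz frame rate removes 2 bits per token.
reduced_bps = bitrate_bps(50, 65536 // 4)
```

Under the same assumptions, a 4× codebook reduction from 65536 to 16384 entries at 50 Hz would correspond to 700 bps instead of 800 bps.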
| Text script | GT | BigCodec.40Hz | XCodec2 | XCodec2+SG | BigCodec |
|---|---|---|---|---|---|
| hello bertie any good in your mind | |||||
| if she could only see phronsie for just one moment | |||||
| father thee's unjust to philip he's going into business | |||||
| i say sir harry the little girl's going famously to night isn't she | |||||
| been looking up tooms county | |||||
| it's a stock company and rich | |||||
| you are mate replied the sailor | |||||
| i don't want to stand around and look on | |||||
| mister jago is an american philip | |||||
| don't worry sizzle dear it'll all come right pretty soon | |||||
Qualitative demonstration of the fidelity enhancement from self-guidance
In this case study, we analyze a group of audio samples selected from the above table. By comparing audio samples with and without self-guidance, we qualitatively demonstrate how the self-guidance mechanism reduces quantization artifacts and enhances the perceptual quality of reconstructed speech.
Case 1: Smeared harmonics
Description: In this case, the harmonic structure of the speech reconstructed by the XCodec2 baseline is blurry at the word "phronsie", leading to bubbly artifacts in the audio. With the proposed self-guidance mechanism, the reconstructed speech shows a clearer and sharper harmonic structure.
Text script: "if she could only see phronsie for just one moment"
| Source | Audio | Spectrogram |
|---|---|---|
| GT | | |
| XCodec2 | | |
| XCodec2+SG | | |
Case 2: Pitch spike
Description: In this case, there is an undesired pitch spike at the word "sir" in the speech reconstructed by the XCodec2 baseline, where the fundamental frequency suddenly jumps up. With the proposed self-guidance mechanism, this issue is effectively mitigated.
Text script: "i say sir harry the little girl's going famously to night isn't she"
| Source | Audio | Spectrogram |
|---|---|---|
| GT | | |
| XCodec2 | | |
| XCodec2+SG | | |
Case 3: Oversmoothed harmonic shape
Description: In this case, the high-order harmonics of the speech reconstructed by the XCodec2 baseline are oversmoothed to a flat line at the phrase "you are mate", leading to repeating "echoes" in the audio. With the proposed self-guidance mechanism, the harmonics of the GT are better preserved.
Text script: "you are mate, replied the sailor"
| Source | Audio | Spectrogram |
|---|---|---|
| GT | | |
| XCodec2 | | |
| XCodec2+SG | | |
Appendix: Failure Case and Limitation
While self-guidance significantly lowers the overall frequency of the artifacts shown above, it does not eliminate them entirely. Due to training dynamics, the proposed approach may still produce artifacts in some cases.
Description: In this case, the audio reconstructed by the proposed approach exhibits a depressed pitch on the opening word "also" of the utterance.
Text script: "also, there was a stripling page who turned into a mai"
| Source | Audio | Spectrogram |
|---|---|---|
| GT | | |
| XCodec2 | | |
| XCodec2+SG | | |