Self-Guidance: Enhancing Neural Codecs via Decoder Manifold Alignment

Xiang Li, Yixuan Zhou, Jingran Xie, Zhiyong Wu, Hui Wang

ICML 2026 Poster

Abstract: Neural speech codecs based on Vector-Quantized VAEs (VQ-VAEs) are core audio tokenizers for speech LLMs, yet their reconstruction fidelity is bottlenecked by quantization error. Instead of modifying the quantizer or increasing model capacity—common approaches that complicate downstream language modeling—we introduce self-guidance, a simple yet general training principle that enhances the decoder's robustness to quantization error. Our core idea is to align the decoder's internal feature manifolds when processing both the quantized tokens and their original continuous embeddings, using a lightweight feature-mapping loss. This requires minimal training overhead and no inference-time changes. Applied to XCodec2, self-guidance improves all reconstruction metrics, achieving state-of-the-art low-bitrate performance. It generalizes across codebook sizes, quantizer types, and network architectures, demonstrating value as a universal codec enhancer. Notably, it enables a 4× codebook reduction without fidelity loss, which downstream TTS experiments show significantly improves LLM-based synthesis by simplifying the token modeling space. Self-guidance thus establishes an efficient, broadly applicable method for high-fidelity neural audio coding.

Figure 1: Illustration of the VQ-VAE architecture and the proposed self-guidance (SG) mechanism
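
This page does not spell out the loss, so the following is only one plausible reading of the description above, written as a sketch. Here $z_e$ is the continuous encoder output, $z_q$ its quantized counterpart, $h_\theta(\cdot)$ the decoder's final hidden state over $T$ frames and $D$ channels, $\operatorname{sg}[\cdot]$ a stop-gradient, and $\lambda_{\mathrm{SG}}$ a loss weight. The squared-error distance, the stop-gradient on the continuous branch, and all symbol names are assumptions rather than details confirmed by the source.

```latex
% Assumed form of the self-guidance term (not confirmed by the source):
% the quantized branch is pulled toward the continuous branch.
\mathcal{L}_{\mathrm{SG}}
  = \frac{1}{T D} \sum_{t=1}^{T}
    \left\| h_\theta(z_q)_t - \operatorname{sg}\!\left[ h_\theta(z_e)_t \right] \right\|_2^2,
\qquad
\mathcal{L}_{\mathrm{total}}
  = \mathcal{L}_{\mathrm{codec}} + \lambda_{\mathrm{SG}} \, \mathcal{L}_{\mathrm{SG}}
```

Under this reading, the extra cost is one additional decoder pass during training, and nothing changes at inference because only the quantized branch is used there.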

Performance gain: 4× codebook reduction without fidelity loss

We compare reconstruction quality across codebook sizes. The results highlight a key advantage of self-guidance: the model with a 16,384-entry codebook matches the performance of the baseline with a 4× larger codebook (65,536 entries), demonstrating improved codec efficiency without increasing inference-time complexity.

Figure 2: Reconstruction performance across codebook sizes, with and without self-guidance, over the course of training. The horizontal axis shows training iterations.

Quantitative demonstration of decoder feature alignment

In this section, we show how self-guidance works by providing statistical evidence on two key measures illustrated in Figure 1:

Quantization error: the error between the pre-quantization latent embedding and the quantized token embedding

Hidden feature alignment MSE: the error between the decoder's output hidden features computed from the pre-quantization latent embedding and those computed from the quantized token embedding

We run inference on the LibriSpeech test-clean subset using models with varying codebook sizes, with and without self-guidance. The results show that self-guidance has little influence on quantization error but significantly reduces hidden feature alignment MSE. This finding confirms that self-guidance aligns the decoder's internal feature manifold and thereby enhances robustness to quantization error, rather than reducing quantization error directly.
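
The two measures can be illustrated with a minimal, self-contained sketch. The page does not state which distance is used for quantization error, so Euclidean distance is assumed here. `toy_decoder_hidden` is a hypothetical stand-in for the real decoder backbone, and the two-dimensional codebook is purely illustrative.

```python
import math

def nearest_code(z, codebook):
    """Return the codebook vector closest to z (Euclidean distance)."""
    return min(codebook, key=lambda c: math.dist(z, c))

def quantization_error(z, codebook):
    """Distance between a latent and its quantized embedding (assumed metric)."""
    return math.dist(z, nearest_code(z, codebook))

def toy_decoder_hidden(x):
    """Hypothetical stand-in for the decoder backbone: a fixed nonlinear map."""
    return [math.tanh(2.0 * x[0] - x[1]), math.tanh(x[0] + 0.5 * x[1])]

def hidden_alignment_mse(z, codebook, decoder=toy_decoder_hidden):
    """MSE between decoder features of the continuous and quantized inputs."""
    h_teacher = decoder(z)                          # continuous branch
    h_student = decoder(nearest_code(z, codebook))  # quantized branch
    return sum((a - b) ** 2 for a, b in zip(h_teacher, h_student)) / len(h_teacher)

codebook = [[0.0, 0.0], [1.0, 1.0]]
z = [0.9, 1.0]
print(quantization_error(z, codebook))    # distance to the nearest code [1, 1]
print(hidden_alignment_mse(z, codebook))  # feature gap seen by the decoder
```

The point of separating the two functions is that the first depends only on the encoder and quantizer, while the second also depends on how the decoder reacts to the same perturbation; a training change can therefore lower the second while leaving the first untouched.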

Quantization Error

As shown in Figure 3, the distribution of quantization error is nearly unchanged with and without self-guidance, indicating that self-guidance does not significantly impact this metric. This is consistent with the statistics in the following table, where the means differ by at most 0.007 for any given codebook size.

Codebook size	With self-guidance	Quantization Error mean	Quantization Error std
65536	No	0.858	0.120
65536	Yes	0.851	0.121
16384	No	0.798	0.121
16384	Yes	0.799	0.120
8192	No	0.741	0.120
8192	Yes	0.744	0.121

Figure 3: Quantization error histogram for different codebook sizes with and without self-guidance

Hidden feature alignment MSE

As shown in Figure 4, the proportion of high values increases noticeably when self-guidance is not activated (note that the x-axis is log-scaled), which is harmful to faithful reconstruction. The following table confirms this: self-guidance significantly reduces both the mean and the standard deviation for every codebook size.

Codebook size	With self-guidance	Hidden MSE mean	Hidden MSE std
65536	No	9.439	29.203
65536	Yes	5.854	13.712
16384	No	13.551	59.137
16384	Yes	4.958	5.863
8192	No	23.605	109.458
8192	Yes	4.197	6.865

Figure 4: Hidden feature alignment MSE histogram for different codebook sizes with and without self-guidance
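
To make the size of the effect easier to read, the relative reduction in mean hidden MSE can be computed directly from the table above. This is a small arithmetic sketch; the numbers are copied from the table, and the helper name `relative_reduction` is introduced here for illustration only.

```python
# Hidden MSE means copied from the table: (without SG, with SG).
hidden_mse_mean = {
    65536: (9.439, 5.854),
    16384: (13.551, 4.958),
    8192: (23.605, 4.197),
}

def relative_reduction(baseline, guided):
    """Fraction by which `guided` is lower than `baseline`."""
    return (baseline - guided) / baseline

for size, (baseline, guided) in hidden_mse_mean.items():
    pct = 100 * relative_reduction(baseline, guided)
    print(f"{size:>6}: {pct:.1f}% lower mean hidden MSE")
```

The reduction grows as the codebook shrinks, from roughly 38% at 65,536 entries to roughly 82% at 8,192 entries.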

t-SNE Visualization of Hidden States

We further visualize the decoder hidden states with joint t-SNE, comparing the continuous teacher branch and the quantized student branch. With self-guidance activated, the hidden features from both the continuous teacher (circles) and quantized student (triangles) cluster by token ID (color, 50 most common tokens), indicating that self-guidance preserves discriminative, token-specific information and does not induce feature collapse.

In the baseline approach, features from the two branches separate into distinct halves of the latent space (red dashed line), showing baseline manifold misalignment. This is not a collapse, but it visualizes the dual-path inconsistency that self-guidance is designed to mitigate.

Figure 5: Joint t-SNE visualization of decoder hidden states from the continuous teacher branch and quantized student branch, with and without self-guidance.

Layerwise Decoder Alignment Analysis

We also perform a layerwise analysis across all 12 Transformer blocks in the decoder, where linear CKA (Centered Kernel Alignment, higher is better) is computed between the outputs from the teacher and student branches. Although self-guidance only applies the feature mapping loss on the final hidden state of the Transformer backbone, it substantially improves alignment throughout the decoder.
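
For readers unfamiliar with the metric, linear CKA between two feature matrices with centered columns, $X$ and $Y$, is $\lVert Y^\top X \rVert_F^2 / (\lVert X^\top X \rVert_F \, \lVert Y^\top Y \rVert_F)$. The following is a minimal pure-Python sketch of that formula, not the evaluation code used for this page. Rows are assumed to be samples (for example frames) and columns feature dimensions.

```python
import math

def center_columns(X):
    """Subtract each column's mean so every feature has zero mean."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X]

def cross_frobenius_sq(A, B):
    """Squared Frobenius norm of A^T B for A (n x p) and B (n x q)."""
    total = 0.0
    for col_a in zip(*A):
        for col_b in zip(*B):
            dot = sum(a * b for a, b in zip(col_a, col_b))
            total += dot * dot
    return total

def linear_cka(X, Y):
    """Linear CKA between feature matrices that share the same rows."""
    Xc, Yc = center_columns(X), center_columns(Y)
    numerator = cross_frobenius_sq(Xc, Yc)
    denominator = math.sqrt(cross_frobenius_sq(Xc, Xc) * cross_frobenius_sq(Yc, Yc))
    return numerator / denominator

teacher = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
student = [[3.0 * v for v in row] for row in teacher]  # rescaled copy
print(linear_cka(teacher, student))  # identical up to scale, so close to 1
```

The score is invariant to isotropic scaling and to orthogonal transforms of the features, which is why it suits comparing two branches layer by layer.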

Figure 6: Linear CKA between teacher and student branch outputs across the 12 Transformer blocks in the decoder, with and without self-guidance.

Reconstruction results from different models

We provide a collection of audio samples, including the ground truth (GT) and the reconstruction results from the following neural codec models:

Abbreviation	Frame rate	Total bitrate	Description
GT	-	-	Ground truth audio
BigCodec.40Hz	40Hz	520 bps	A compact BigCodec model with a lower frame rate, serving as the lower bound for comparison
XCodec2	50Hz	800 bps	Default XCodec2 model, serving as a baseline
XCodec2+SG	50Hz	800 bps	XCodec2 model with the proposed self-guidance
BigCodec	80Hz	1040 bps	Default BigCodec model, serving as the upper limit
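
The bitrates in the table are consistent with a single codebook emitting one token per frame, in which case bitrate equals frame rate times log2 of the codebook size. The sketch below checks that arithmetic. The single-codebook assumption and the 8,192-entry size used for the BigCodec rows are inferred from the listed bitrates, not stated on this page.

```python
import math

def bitrate_bps(frame_rate_hz, codebook_size, num_codebooks=1):
    """Bits per second for a codec emitting `num_codebooks` tokens per frame."""
    return frame_rate_hz * num_codebooks * math.log2(codebook_size)

# (frame rate in Hz, assumed codebook size) for each row of the table.
configs = {
    "BigCodec.40Hz": (40, 8192),
    "XCodec2": (50, 65536),
    "BigCodec": (80, 8192),
}

for name, (rate, size) in configs.items():
    print(f"{name}: {bitrate_bps(rate, size):.0f} bps")

# Under the same assumption, a 16,384-entry codebook at 50 Hz needs 14 bits per token.
print(f"50 Hz, 16384 entries: {bitrate_bps(50, 16384):.0f} bps")
```

Under these assumptions, shrinking the codebook from 65,536 to 16,384 entries at 50 Hz lowers the token bitrate from 800 bps to 700 bps.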

Each model is trained on the LibriSpeech training dataset (EN) for 600,000 iterations with 8 NVIDIA RTX 4090 GPUs. Audio samples are drawn from the LibriSpeech test-clean subset, together with the text transcripts. We encourage listeners to pay attention to differences in clarity and the presence of artifacts across the reconstructions.

Text script	GT	BigCodec.40Hz	XCodec2	XCodec2+SG	BigCodec
hello bertie any good in your mind
if she could only see phronsie for just one moment
father thee's unjust to philip he's going into business
i say sir harry the little girl's going famously to night isn't she
been looking up tooms county
it's a stock company and rich
you are mate replied the sailor
i don't want to stand around and look on
mister jago is an american philip
don't worry sizzle dear it'll all come right pretty soon

Qualitative demonstration of the fidelity enhancement from self-guidance

In this case study, we analyze a group of audio samples selected from the above table. By comparing audio samples with and without self-guidance, we qualitatively demonstrate how the self-guidance mechanism reduces quantization artifacts and enhances the perceptual quality of reconstructed speech.

Case 1: Smeared harmonics

Description: In this case, the harmonic structure of the speech reconstructed by the XCodec2 baseline is blurry at the word "phronsie", leading to bubbly artifacts in the audio. With the proposed self-guidance mechanism, the reconstructed speech shows a clearer and sharper harmonic structure.

Text script: "if she could only see phronsie for just one moment"

Source	Audio	Spectrogram
GT
XCodec2
XCodec2+SG

Case 2: Pitch spike

Description: In this case, the speech reconstructed by the XCodec2 baseline contains an undesired pitch spike at the word "sir", where the fundamental frequency jumps abruptly. With the proposed self-guidance mechanism, this issue is effectively mitigated.

Text script: "i say sir harry the little girl's going famously to night isn't she"

Source	Audio	Spectrogram
GT
XCodec2
XCodec2+SG

Case 3: Oversmoothed harmonics shape

Description: In this case, the high-order harmonics of the speech reconstructed by the XCodec2 baseline are oversmoothed into a flat line at the phrase "you are mate", leading to repeating "echoes" in the audio. With the proposed self-guidance mechanism, the GT harmonics are better preserved.

Text script: "you are mate, replied the sailor"

Source	Audio	Spectrogram
GT
XCodec2
XCodec2+SG

Appendix: Failure Case and Limitation

While self-guidance significantly lowers the overall frequency of the artifacts shown above, it does not eliminate them entirely. Due to training dynamics, the proposed approach may still produce artifacts in some cases.

Description: In this case, the audio reconstructed by the proposed approach exhibits a depressed pitch on the opening word "also" of the utterance.

Text script: "also, there was a stripling page who turned into a mai"

Source	Audio	Spectrogram
GT
XCodec2
XCodec2+SG