LLaDA

Official PyTorch implementation for "Large Language Diffusion Models"

Context of this work and potential problems to be explored #10

Closed
Opened 2/26/2025 · 4 comments · by zhengkw18

I want to provide some background for readers who are unfamiliar with the context. This work is built on the theoretical framework of masked diffusion models (MDMs), a variant of the broader discrete diffusion family. MDM is very similar to masked models such as BERT, which are trained by randomly masking a portion of tokens and predicting them. Note that BERT targets language understanding rather than generation and masks at most 15% of tokens; later work [1] samples the masking ratio uniformly between 0 and 100% during training to make the model generative.

At first glance, MDM differs from masked models in that it involves a continuous time variable, just like typical diffusion models. Specifically, the MDM training objective is a time-weighted integral of cross-entropy terms whose coefficients make it an ELBO (a bound on the model likelihood, enabling maximum likelihood estimation), and MDM sampling proceeds through time discretization rather than token-by-token decoding. However, several recent works suggest that MDM is the same as masked models, and **the diffusion-related characteristics are unnecessary, or even harmful**. For example, [2] shows that:

- In training, the ELBO of MDM can be converted to the ELBO of masked models, as long as (1) the number of masked tokens is chosen uniformly and (2) the cross-entropy loss is averaged over all masked positions. **The ELBO form for masked models is also mentioned in the appendix of the LLaDA paper and claimed to have a smaller variance.**
- The MDM sampling process can be converted to a token-by-token decoding process of masked models, as long as the position to be unmasked is chosen uniformly. **Token-by-token decoding is not only more efficient but also numerically more stable.**

Therefore, MDM-based language models are essentially maximum-likelihood-trained masked language models, and it may be simpler to use the masked-model formulation directly (just remember to use the ELBO weighting if you still want maximum likelihood training). Minimal sketches of both the loss and the decoding loop in this formulation follow below.
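To make the first point concrete, here is a minimal PyTorch sketch of the masked-model form of the MDM ELBO under a linear (uniform) masking schedule: sample the number of masked tokens uniformly from {1, ..., L}, mask that many positions at random, and weight the cross-entropy summed over masked positions by L divided by the number of masked tokens (equivalently, L times the mean cross-entropy over masked positions). The `model` and `mask_id` arguments are hypothetical placeholders, not LLaDA's actual API.

```python
import torch
import torch.nn.functional as F

def masked_elbo_loss(model, x0, mask_id):
    """Masked-model form of the MDM ELBO (illustrative sketch).

    model: any bidirectional network mapping token ids (B, L) -> logits (B, L, V)
    x0:    clean token ids, shape (B, L)
    """
    B, L = x0.shape
    # (1) choose the number of masked tokens uniformly from {1, ..., L}
    num_masked = torch.randint(1, L + 1, (B,), device=x0.device)           # (B,)
    # mask exactly that many positions, chosen uniformly at random per sequence
    ranks = torch.rand(B, L, device=x0.device).argsort(dim=1).argsort(dim=1)
    mask = ranks < num_masked.unsqueeze(1)                                 # (B, L) bool
    xt = torch.where(mask, torch.full_like(x0, mask_id), x0)

    logits = model(xt)                                                     # (B, L, V)
    ce = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none")     # (B, L)
    # (2) sum cross-entropy over masked positions and weight by L / num_masked,
    #     i.e. L times the average cross-entropy over the masked positions
    per_seq = (ce * mask).sum(dim=1) * (L / num_masked)
    return per_seq.mean()
```

Note that the ELBO is just one particular weighting: replacing `L / num_masked` with a constant changes the weighting but not the optimal prediction network, which is exactly the point raised below about whether maximum likelihood training is necessary.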
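And a corresponding sketch of the token-by-token decoding in the second point: unmask one uniformly chosen position per step and sample that token from the model's prediction, with no time discretization at all. Again, `model`, `mask_id`, and `temperature` are illustrative placeholders rather than LLaDA's sampler.

```python
import torch

@torch.no_grad()
def decode_token_by_token(model, mask_id, length, temperature=1.0, device="cpu"):
    """Token-by-token decoding of a masked model (illustrative sketch)."""
    x = torch.full((1, length), mask_id, dtype=torch.long, device=device)
    for _ in range(length):
        masked = (x[0] == mask_id).nonzero(as_tuple=True)[0]
        # choose the position to unmask uniformly at random
        pos = masked[torch.randint(len(masked), (1,), device=device)]
        probs = (model(x)[0, pos] / temperature).softmax(dim=-1)           # (1, V)
        x[0, pos] = torch.multinomial(probs, num_samples=1).squeeze(-1)    # (1,)
    return x
```

The uniform choice of position is what makes this equivalent to MDM sampling; replacing it with a planned or confidence-based order is the direction explored in [7,8].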
However, LLaDA still seems to use the time variable, especially during sampling. I wonder:

- Does the time discretization cause numerical issues that lower the effective temperature, as discovered in [2]? **Would it be more efficient to switch to the token-by-token decoding process of masked models?**
- **Is maximum likelihood training (ELBO) really necessary for language?** Other loss weightings and masking-ratio schedules lead to the same optimal prediction network, just as in diffusion model training. **It is well known that maximum likelihood training often leads to inferior generation quality (e.g., measured by FID) in image diffusion models [3].** **For auto-regressive language modeling, we use maximum likelihood training only because we have no other choice.**

As mentioned in the paper, **a major limitation of masked language models is inference inefficiency: the bidirectional attention in masked models is incompatible with KV caching**. This can be a crucial obstacle for the development of LLaDA, since long context is the current trend in LLMs, which demands low-cost deployment and has inspired state-space models such as Mamba. One may consider the following improvements:

- **Fast solvers or distillation (such as [4,5,6])**. I know these two approaches have significantly accelerated diffusion models for image/video generation. Still, their success relies on the smoothness of the continuous data space and the Lipschitzness of pretrained diffusion models, which may not hold for discrete data. In the masked-model case, we can understand the pretrained model as predicting factorized distributions and the distilled model as predicting joint distributions over multiple tokens.
- **Interpolating between auto-regressive models and masked models, i.e., training semi-autoregressive models**. This becomes a battle between causal attention and bidirectional attention: what exactly are their respective advantageous scenarios? Is there a principled way to decide?
- Though decoding in random order is theoretically the most correct, the trained models are not perfect and may favor **other decoding orders** [7,8].

Hope these discussions can inspire future exploration!

[1] Mask-Predict: Parallel Decoding of Conditional Masked Language Models
[2] Masked Diffusion Models are Secretly Time-Agnostic Masked Models and Exploit Inaccurate Categorical Sampling
[3] Maximum Likelihood Training of Score-Based Diffusion Models
[4] Distillation of Discrete Diffusion through Dimensional Correlations
[5] Discrete Copula Diffusion
[6] Fast LLMs via Self-Distillation Through Time
[7] Path Planning for Masked Diffusion Model Sampling
[8] Train for the Worst, Plan for the Best: Understanding Token Ordering in Masked Diffusions

