Masked models use bidirectional attention, so they cannot be accelerated with a KV cache. How does their generation speed compare to auto-regressive counterparts? I previously tried masked image generation models (e.g., [MAR](https://github.com/LTH14/mar)) and found them extremely slow at inference compared to auto-regressive ones (e.g., [RAR](https://github.com/bytedance/1d-tokenizer/blob/main/README_RAR.md)). I know parallel decoding can reduce the number of inference steps needed, but benchmark performance will also degrade, am I right?
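
To make the cost asymmetry concrete, here is a minimal sketch of confidence-based parallel decoding (in the spirit of MaskGIT-style samplers) for a masked model. The `model`, `mask_id`, and step-scheduling details are hypothetical placeholders, not MAR's actual implementation; the point is that every step re-runs a full bidirectional forward pass over the whole sequence, so nothing can be cached across steps:

```python
import torch

@torch.no_grad()
def parallel_decode(model, seq_len, mask_id, num_steps=8):
    """Hypothetical MaskGIT-style sampler. `model` is assumed to be a
    bidirectional transformer returning per-position logits, so the full
    sequence must be re-encoded at EVERY step (no KV cache is possible)."""
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for step in range(num_steps):
        logits = model(tokens)            # full forward pass at every step
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)    # per-position confidence and argmax
        conf[tokens != mask_id] = -1.0    # never overwrite decoded positions
        # Unmask the k most confident positions; fewer steps => larger k,
        # which is where sample quality starts to suffer.
        remaining = (tokens == mask_id).sum().item()
        k = max(1, remaining // (num_steps - step))
        idx = conf.topk(k, dim=-1).indices
        tokens.scatter_(1, idx, pred.gather(1, idx))
    return tokens
```

With `num_steps` equal to `seq_len` this degenerates to one token per step, but without the KV cache that makes the auto-regressive equivalent cheap; shrinking `num_steps` trades that cost against exactly the quality drop mentioned above.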