Dong et al. (2019) and Tay et al. (2022) train on a mixture of denoising tasks with different attention masks (full, causal, and prefix attention) to close the performance gap with next-token pretraining on generative tasks.
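To make the three mask variants concrete, the sketch below constructs them explicitly; the helper name `build_attention_mask` and the exact construction are illustrative assumptions rather than the cited papers' implementations.

```python
import torch

def build_attention_mask(seq_len: int, kind: str, prefix_len: int = 0) -> torch.Tensor:
    """Return a boolean [seq_len, seq_len] mask where True means 'may attend'.

    kind: 'full'   -- bidirectional attention over the whole sequence
          'causal' -- each position attends only to itself and earlier positions
          'prefix' -- bidirectional attention within the first `prefix_len`
                      tokens, causal attention for the remaining tokens
    """
    if kind == "full":
        return torch.ones(seq_len, seq_len, dtype=torch.bool)
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    if kind == "causal":
        return causal
    if kind == "prefix":
        mask = causal.clone()
        # Prefix tokens may also attend to later prefix tokens (bidirectional prefix).
        mask[:prefix_len, :prefix_len] = True
        return mask
    raise ValueError(f"unknown mask kind: {kind}")

# Example: a prefix-LM mask for a length-6 sequence with a 3-token prefix.
print(build_attention_mask(6, "prefix", prefix_len=3).int())
```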
Incorporating state space models (SSMs) into deep neural networks offers an alternative approach to sequence modeling that improves the capacity, efficiency, and overall performance of neural architectures.
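As a rough sketch of the underlying primitive (not any particular published SSM architecture), a discrete linear SSM processes a sequence with a fixed-size recurrent state, so each step costs the same regardless of sequence length; the function name and toy parameters below are assumptions made for illustration.

```python
import torch

def ssm_scan(x: torch.Tensor, A: torch.Tensor, B: torch.Tensor, C: torch.Tensor) -> torch.Tensor:
    """Run a discrete linear state space model over a sequence.

    x: [seq_len, d_in]    input sequence
    A: [d_state, d_state] state transition matrix
    B: [d_state, d_in]    input projection
    C: [d_out, d_state]   output projection

    Recurrence: h_t = A h_{t-1} + B x_t,  y_t = C h_t.
    The per-step cost is constant in sequence length, which is the source of
    the efficiency benefit relative to full self-attention.
    """
    h = torch.zeros(A.shape[0])
    ys = []
    for t in range(x.shape[0]):
        h = A @ h + B @ x[t]
        ys.append(C @ h)
    return torch.stack(ys)

# Toy example: a 2-dimensional state driven by a scalar input stream.
torch.manual_seed(0)
x = torch.randn(5, 1)
A = 0.9 * torch.eye(2)
B = torch.ones(2, 1)
C = torch.ones(1, 2)
print(ssm_scan(x, A, B, C))
```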