Allowing for the LM to learn a soft-"multi-step program" to predict future tokens instead of learning to predict future tokens itself.