An end-to-end speech large language model.
Processes audio input and generates text output based on instructions.