What about the audio codec?

#1
by yukiarimo - opened

Hello! Very impressed with this release! I have a few questions:

  1. How can I fine-tune this audio codec?
  2. Put very simply, are audio in and audio out both tokens? What’s the rate?
  3. Is it possible to connect it to Gemma 3?
StepFun org

Step-Audio-2-mini is an end-to-end large audio language model.
It employs an audio encoder and adaptor to convert the input audio into 12.5Hz latent audio features, and generates discrete 25Hz audio tokens with an LLM decoder.

You can find more information in our technical report.

Thanks! Why is the input 12.5Hz, but the output 25Hz?

yukiarimo changed discussion status to closed
StepFun org

As indicated in our report, we employ the audio adaptor to further downsample the output of the audio encoder (25Hz) to 12.5Hz, mainly to reduce the number of input tokens and the computation cost.
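
To make the rates concrete, here is a rough back-of-the-envelope sketch of the sequence lengths they imply. Only the three rates come from this thread; the helper names `frames` and `summarize` and the choice to round up are assumptions made for illustration, not part of the Step-Audio-2-mini code.

```python
import math

# Rates quoted in this thread.
ENCODER_RATE_HZ = 25.0       # audio encoder output
ADAPTOR_RATE_HZ = 12.5       # latent features after the adaptor (LLM input)
OUTPUT_TOKEN_RATE_HZ = 25.0  # discrete audio tokens generated by the LLM decoder


def frames(duration_s: float, rate_hz: float) -> int:
    """Approximate frame/token count for a clip (rounding up is an assumption)."""
    return math.ceil(duration_s * rate_hz)


def summarize(duration_s: float) -> dict:
    """Sequence lengths for `duration_s` seconds of audio at each stage."""
    return {
        "encoder_frames": frames(duration_s, ENCODER_RATE_HZ),
        "llm_input_features": frames(duration_s, ADAPTOR_RATE_HZ),
        # Applies to generated audio of the same duration.
        "output_audio_tokens": frames(duration_s, OUTPUT_TOKEN_RATE_HZ),
    }


if __name__ == "__main__":
    print(summarize(10.0))
```

Under these assumptions, the adaptor halves the number of input positions the LLM has to attend over (a downsampling factor of 25 / 12.5 = 2), while generated audio is still produced at 25 tokens per second.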
