Artifacts at boundaries during streaming
Hi everyone,
I tried running the ONNX model in streaming mode by decoding every N output tokens, with N ranging from 10 to 100. The issue I face now is that when I put the audio together, I can hear clicks and pops at the chunk boundaries. This seems to be because the decoder produces slightly different audio depending on context, and possibly makes the audio quieter at the end of each decode run or adds silence at the start of each one. Simply removing the silence, or using lookback and lookahead strategies, does not seem to help.
Any advice on how this could be solved, or a direction to look in? I assume this decoder was not designed for true streaming?
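For concreteness, here is a minimal sketch of one lookback/lookahead scheme with a short linear crossfade at each boundary. It is not the code used in this thread. `decode_fn` is a hypothetical stand-in for whatever runs the decoder on a token slice, and the sketch assumes every token maps to a fixed number of samples (`hop`), which may not hold for this model. If the decoder's output level or phase drifts with context, a crossfade like this can soften the clicks but will not necessarily remove them.

```python
from typing import Callable, List, Sequence


def crossfade(prev_tail: Sequence[float], next_head: Sequence[float]) -> List[float]:
    """Linearly blend two equal-length segments that cover the same time span."""
    n = len(prev_tail)
    if n != len(next_head):
        raise ValueError("segments must have equal length")
    out = []
    for i in range(n):
        w = (i + 1) / (n + 1)  # weight of the newer chunk ramps up from ~0 to ~1
        out.append((1.0 - w) * prev_tail[i] + w * next_head[i])
    return out


def stream_decode(
    tokens: Sequence[int],
    decode_fn: Callable[[Sequence[int]], Sequence[float]],  # hypothetical decoder wrapper
    hop: int,          # assumed fixed number of samples per token
    chunk: int = 50,   # tokens emitted per decode run
    context: int = 10, # lookback and lookahead, in tokens
    fade: int = 64,    # crossfade length, in samples
) -> List[float]:
    """Decode in chunks with extra context, trim the context, crossfade the seams."""
    if fade > chunk * hop:
        raise ValueError("fade must not exceed the chunk length in samples")
    n = len(tokens)
    out: List[float] = []
    tail: List[float] = []  # samples the previous run produced past its boundary
    for s in range(0, n, chunk):
        e = min(s + chunk, n)
        lo, hi = max(0, s - context), min(n, e + context)
        audio = decode_fn(tokens[lo:hi])
        start = (s - lo) * hop              # drop the lookback audio
        end = (e - lo) * hop                # nominal end of this chunk
        extra = min(fade, (hi - e) * hop)   # lookahead audio kept for blending
        seg = list(audio[start:end + extra])
        k = len(tail)
        if k:
            # Both runs decoded this span; blend instead of hard-cutting.
            seg[:k] = crossfade(tail, seg[:k])
        if extra:
            out.extend(seg[:-extra])
            tail = seg[-extra:]
        else:
            out.extend(seg)
            tail = []
    return out
```

In a real streaming setup the body of the loop would emit `seg[:-extra]` as soon as each run finishes, which adds `context` tokens of latency for the lookahead.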
Thank you in advance!
UPDATE:
It looks like the non-ONNX model allows making the decoder outputs more deterministic. I plan to investigate in this direction.
How did you make this model stream output? Is there any code you can share so I can implement streaming for this model?