They are basically cheating, which is why they all sound like shit or have a lot of latency. Those are the only two choices. Why is this so hard for you to understand?
I'd probably describe it as masking, or estimating perhaps, not necessarily cheating.
I did a LOT of reading on this yesterday. Much more than I have in the past. So you've got a few issues that crop up:
- Latency that comes from the fundamental physics truth that you cannot reliably identify a frequency without at least 1 full cycle of that frequency. At 82.4 Hz (a guitar's low E), one cycle is already about 12 ms. This is a first principle. It is simply not debatable. Attempting to debate it is the height of ignorance, and insisting that it is wrong is the height of hubris.
- This is further complicated by the fact that real signals in the real world are not test sine waves. They're not easily predictable, and you cannot always guarantee that you'll even be able to recognise a full cycle of a given note: harmonics, noise floor, everything throws it off. So you often need 2-3 cycles for robustness. Essentially: no periodicity, no fundamental. No fundamental, no pitch. No pitch, no tracking. No tracking, no pitch shift. All of these are, again, first principles.
- Enter FFT analysis. Using an FFT you can do things like noise and harmonic suppression in order to arrive at the most likely fundamental frequency. You can also track stability and identify peaks more reliably.
- But this introduces its own problems, because FFT analysis itself introduces latency. This is determined by the FFT window size, the hop size, and the overlap. A window of 1024 samples at 48 kHz takes 21.33 ms just to fill. Even with overlap-add techniques, you cannot fully eliminate window latency.
- So how else could we get at the fundamental? We could perform an autocorrelation. You can think of an autocorrelation as the signal multiplied by a delayed version of itself: essentially sliding the signal against a shifted copy of itself and measuring how well they line up. When the correlation at a given shift (lag) shows a strong positive peak, the alignment is good and that lag is a likely fundamental period. If the peak is low, the alignment is worse and that lag is a poor candidate.
- But this also has latency. It is a little more controllable, but with caveats. Make your window too small and you get octave errors, jitter, instability, false pitches, and artifacts. Make it too long and you get... more latency.
- Enter resampling. Resampling is zero or near-zero latency. But the downsides are transient smearing, formant destruction, and modulation artifacts, all leading to a warbling robotic tone. Hello Whammy pedal. This is all fundamentally because you are not shifting pitch when you resample. You are shifting TIME.
- Phase vocoder. This is essentially a process of performing an STFT (Short-Time Fourier Transform), then phase manipulation, and then an inverse FFT with overlap-add. This is one of the most used techniques in pitch shifting, as I understand it.
- This gives you accurate, stable pitch shifting. But it also smears transients and introduces phase artifacts. That's pretty easy to understand if you know the physics behind it and how windowing a signal works. You get latency from the window size you decide to use. Long windows lead to good pitch extraction, but bad transients and more latency. Short windows lead to good transients, bad pitch resolution, and less latency.
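To put rough numbers on the window-size trade-off in the list above, here is a minimal Python sketch (NumPy assumed). The function names `window_latency_ms` and `estimate_f0_autocorr`, the 48 kHz sample rate, and the 60-1000 Hz search range are illustrative assumptions, not taken from any particular pedal or plugin. It is a bare-bones autocorrelation estimator, not a production pitch tracker.

```python
import numpy as np


def window_latency_ms(window_size, sample_rate):
    """Time needed just to fill one analysis window, in milliseconds."""
    return 1000.0 * window_size / sample_rate


def estimate_f0_autocorr(frame, sample_rate, fmin=60.0, fmax=1000.0):
    """Estimate the fundamental (Hz) of one frame via autocorrelation.

    Returns None if the frame is shorter than two periods of fmin, or if
    no usable peak is found. fmin/fmax are assumed search bounds.
    """
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    if len(frame) < 2 * max_lag:
        return None  # window too short for the lowest note we care about

    # Slide the frame against itself; keep lags 0..len(frame)-1.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if ac[0] <= 0:
        return None  # silent frame

    # Skip the trivial peak around lag 0: start after the first negative value.
    negative = np.where(ac < 0)[0]
    if negative.size == 0:
        return None
    start = max(min_lag, int(negative[0]))
    if start > max_lag:
        return None

    # The strongest remaining peak marks the most likely period.
    lag = start + int(np.argmax(ac[start:max_lag + 1]))
    return sample_rate / lag


sr = 48000  # assumed sample rate
print(round(window_latency_ms(1024, sr), 2))  # 21.33

t = np.arange(2048) / sr
tone = np.sin(2 * np.pi * 110.0 * t)          # clean 110 Hz test tone
print(estimate_f0_autocorr(tone, sr))         # close to 110
print(estimate_f0_autocorr(tone[:256], sr))   # None: frame too short
```

The 2048-sample frame that lets the estimator lock onto 110 Hz takes about 42.7 ms to fill at 48 kHz, while the 256-sample frame (about 5.3 ms) holds well under one cycle of that note and is rejected outright. A real guitar signal with harmonics and noise would need more care than this clean sine does.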
I might have some of that wrong in the details. But the principles are correct. My original statement that sent Sascha into a spazz attack, was correct.
I’m pretty sure this discussion started because you said that all you need for less pitch shifting latency is more CPU. In a discussion about polyphonic pitch shifting, correct? And then you’ve just moved the goalposts to address this cycle length discussion. And then you ChatGPT’d an answer to a question that wasn’t even being asked until you changed your point of contention from “cpu power” to “cycle time”.
Yes, indeed. He's completely moved the goalposts to deflect from his initial comment, which was completely wrong. I still maintain that it doesn't matter how much CPU you throw at the pitch shifting problem, there will ALWAYS be trade-offs because of the inherent latency and quality issues in all of the options we have to perform the operation.