Latency of pitch shifting algorithms cage fight

Our virtual friend says this:

---

Short answer:

No — for static pitch shifting (e.g., always shifting by a fixed interval like a perfect fifth), you generally do not need to detect the input pitch first.

Why pitch detection is not required
Plugins and hardware pitch-shifters that apply a fixed transposition amount (e.g., +7 semitones) usually rely on signal-processing algorithms that shift the entire waveform without needing to know what note is being played.

Common algorithms include:

  • FFT-based phase vocoder pitch shifters
  • Time-domain harmonic scaling (e.g., PSOLA)
  • Granular pitch shifters
  • Delay-based shifters (classic Eventide-style)

All of these methods mathematically “rescale” or recombine the waveform to raise or lower it by the desired interval.

None of these need pitch detection to shift by a fixed amount like +700 cents.
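As a toy illustration (a hypothetical sketch, not any product's actual algorithm), here is a crude Eventide-style dual-tap sweeping-delay shifter: the only musical input is the target interval, and the code never inspects what note is being played.

```python
import math

def delay_shift(x, semitones, sr, win_ms=40.0):
    """Toy Eventide-style dual-tap sweeping-delay pitch shifter.

    Hypothetical sketch for illustration only -- note that nothing
    here ever measures the input's pitch.
    """
    ratio = 2.0 ** (semitones / 12.0)      # e.g. +7 st -> ~1.498
    win = sr * win_ms / 1000.0             # sweep window in samples
    half = win / 2.0

    def read(pos):                         # fractional, zero-padded read
        i = math.floor(pos)
        frac = pos - i
        a = x[i] if 0 <= i < len(x) else 0.0
        b = x[i + 1] if 0 <= i + 1 < len(x) else 0.0
        return a * (1.0 - frac) + b * frac

    y = []
    for n in range(len(x)):
        # Two read taps sweep through the delay line at `ratio` samples
        # per output sample and crossfade as each one wraps around.
        ph_a = (n * (ratio - 1.0)) % win
        ph_b = (ph_a + half) % win
        g_a = 1.0 - abs(2.0 * ph_a / win - 1.0)  # triangular crossfade
        g_b = 1.0 - g_a
        y.append(g_a * read(n - win + ph_a) + g_b * read(n - win + ph_b))
    return y

# 0.25 s of 440 Hz at 48 kHz, shifted up a perfect fifth (~659 Hz)
sr = 48000
tone = [math.sin(2 * math.pi * 440 * n / sr) for n in range(sr // 4)]
up = delay_shift(tone, 7, sr)
```

The periodic splices are what give this family its characteristic warble on real material, and the taps trail the input by up to the sweep window — but at no point does anything need a full cycle of the input to be analyzed.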


When pitch detection is required

Pitch tracking becomes necessary when the processor works in a note-aware or pitch-dependent way, such as:

  • Autotune / pitch correction (maps detected notes to target notes)
  • Harmonizers that generate musical intervals (e.g., add a 3rd or 5th based on the current note)
  • Adaptive effects that react differently depending on the fundamental frequency
  • Audio-to-MIDI conversion

These systems must know which note is present to compute the musical interval properly.
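The harmonizer case is easy to make concrete. A quick sketch (hypothetical helper, C major hard-coded) of why the shift amount cannot be fixed in advance:

```python
# Diatonic "add a 3rd" in C major: the required shift depends on which
# note is detected, so a fixed transposition cannot do it.
# (Hypothetical helper; pitch classes 0-11, C major hard-coded.)
C_MAJOR = [0, 2, 4, 5, 7, 9, 11]           # C D E F G A B

def third_above(pitch_class):
    degree = C_MAJOR.index(pitch_class)
    target = C_MAJOR[(degree + 2) % 7]     # two scale steps up = a 3rd
    return (target - pitch_class) % 12     # shift amount in semitones

shifts = {pc: third_above(pc) for pc in C_MAJOR}
# C -> E is +4 semitones (major 3rd), but D -> F is only +3 (minor 3rd):
# impossible without knowing the input note first.
```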

---

I am perfectly aware that in the OG Helix thread Orvillain mentioned those "non-analysing" methods as well. But his first answer was still that a full-length wave cycle would have to be analyzed for any pitch shifting to work at all. Apparently he only looked up the other options later on (more on that below, btw), because otherwise, why would anyone lead with such an exclusive, generalizing (and, even more, plain wrong for most use cases discussed here) statement in the first place? But hey, fortunately we have Jay backing him up with the same nonsense.

Then, about the other ways of pitch shifting: I looked them up, at least as well as I could within my "user only brain" limitations.
And yes, I was wrong with this:

Well, it basically really only depends on CPU power. More juice = less calculation time needed.

And yes, it seems that only pretty bad-sounding pitch shifting algorithms (such as, say, delay-based ones) are able to work within, say, the boundaries of the buffer size you've set for your DAW. Most decent-sounding ones apparently need an extra chunk of latency, caused by some "analysis" or "grain length" window. And then there's also a DSP technique called "PSOLA", which indeed needs at least one full waveform cycle as its analysis window.

The latter is very apparently what Orvillain and Jay are referring to. Just that they came up with answers (at least in their initial forms) making it seem as if that were the only way pitch shifting could work at all. Which, as clear as a morning sky, is just complete bogus (and as said, yes, I know that Orvillain came up with some of the other options - but only after he got called out on that absolutist, generalizing "needs a full wave cycle" blurb).
In fact, most pitch shifting algorithms we are dealing with in our modelers and plugins are NOT using that very technique, as it'd cause too much latency, especially in case you also wanted to cover lower frequencies - which would render it useless for most realtime applications.
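The latency cost of a full-cycle window is easy to check with back-of-the-envelope numbers (standard physics, nothing vendor-specific):

```python
# One full cycle of the fundamental, expressed as latency.
# 82.4 Hz is guitar low E; 61.7 Hz is a 7-string's low B.
for name, f0 in [("low E", 82.4), ("low B", 61.7)]:
    cycle_ms = 1000.0 / f0
    print(f"{name} ({f0} Hz): 1 cycle = {cycle_ms:.1f} ms, "
          f"2-3 cycles = {2 * cycle_ms:.1f}-{3 * cycle_ms:.1f} ms")
```

So even a single cycle of low E is already around 12 ms before the shifter has done any actual work, which is why full-cycle analysis is a non-starter for low-latency realtime use.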

TL;DR, the nutshell:
- Yes, there are pitch shifting methods requiring at least one full-length wave cycle to work.
- But also yes: there are other methods which simply don't need a full-length wave cycle.
- And even more so: the vast majority of pitch shifting algorithms we're dealing with in modeling land are of the latter kind (which can easily be shown by measuring their latency).

And fwiw, as the last thing: I already admitted I was wrong with my initial statement. And yet, those methods not requiring a full-length wave cycle do in fact profit from faster CPUs, especially in case they're grain/granular based. At least that's what I got out of my (admittedly very incomplete) research. They don't profit as much as I thought they would, though.
So it seems to me that, at least as long as there are no new pitch shifting methods, we won't be seeing both decent-sounding pitch shifting and very low latencies.
It still hasn't got anything to do with full-length wave cycles in the vast majority of cases, regardless of what Orvillain and Jay want to make you believe.
 
They're not cheating. They're just using different algorithms.

They are using algorithms that can’t do a full and accurate polyphonic pitch shift. Call it cheating, taking a shortcut, doing a poor job, whatever. It’s still not going to be able to shift 6 notes and their harmonic content without errors and artifacts.
 
Why do you think after all these decades of digital signal processing no one has come up with a high quality low latency pitch shifter? Do you think no one has wanted to?
 
Why do you think after all these decades of digital signal processing no one has come up with a high quality low latency pitch shifter? Do you think no one has wanted to?

Please read my last longer post. I have done some reading up and learned a thing or two.
I still don't think it's "cheating", but it seems to be in the nature of things that you'll always have to deal with shortcomings.

Besides, that's not how this discussion started - it started with the wrong assertion that pitch shifting would require a full-length wave cycle in any case. Which it doesn't.
 
Besides, that's not how this discussion started - it started with the wrong assertion that pitch shifting would require a full-length wave cycle in any case. Which it doesn't.
In the other thread @Orvillain posted a list of different methods and their pros and cons. Here, I'll even link it for you.


Yet you got stuck at that one specific thing, and did not address your own claim of "throwing more CPU horsepower at it solves pitch shifting issues".
 
Please read my last longer post. I have done some reading up and learned a thing or two.
I still don't think it's "cheating", but it seems to be in the nature of things that you'll always have to deal with shortcomings.

Besides, that's not how this discussion started - it started with the wrong assertion that pitch shifting would require a full-length wave cycle in any case. Which it doesn't.
I’m pretty sure this discussion started because you said that all you need for less pitch shifting latency is more CPU. In a discussion about polyphonic pitch shifting, correct? And then you’ve just moved the goalposts to address this cycle length discussion. And then you ChatGPT’d an answer to a question that wasn’t even being asked until you changed from “cpu power” to “cycle time” as your point of contention.
 
In the other thread @Orvillain posted a list of different methods and their pros and cons. Here, I'll even link it for you.

I have already acknowledged that. Yet he started all of it with a post saying that analyzing a full-length wave cycle would be a requirement for any pitch shifting to happen. The other stuff came after the fact.

And then you’ve just moved the goalposts to address this cycle length discussion.

No, I didn't. The one starting the full-cycle-length discussion wasn't me.
I already said that I was wrong - but it absolutely wasn't because pitch shifters would need a full-length wave cycle (which is what Orvillain used as his entry point to the discussion), because that method typically isn't even used.
 
They are basically cheating which is why they all sound like shit or have a lot of latency. Those are the only two choices. Why is this so hard for you to understand?
I'd probably describe it as masking, or estimating perhaps, not necessarily cheating.

I did a LOT of reading on this yesterday. Much more than I have in the past. So you've got a few issues that crop up:

- Latency that comes from the fundamental physics truth that you cannot reliably determine a frequency without at least 1 full cycle of that frequency. This is a first principle. It is simply not debatable. Attempting to debate it is the height of ignorance, and the insistence that it is wrong is the height of hubris.

- This is further complicated by the fact that real signals in the real world are not test sine waves. They're not easily predictable, and you cannot always guarantee that you'll even be able to recognise a full cycle of a given note - harmonics, noise floor, everything throws it off. So you actually often need 2-3 cycles for robustness. Essentially: no periodicity, no fundamental. No fundamental, no pitch. No pitch, no tracking. No tracking, no pitch shift. All of these are, again, fundamental first principles.

- Enter FFT analysis. Using FFT you can perform things like noise and harmonic suppression in order to arrive at the most likely fundamental frequency. You can also track stability and identify peaks more reliably.

- But this introduces its own problems, because FFT analysis itself introduces latency, determined by the FFT window size, the hop size, and the overlap. A window size of 1024 samples alone accounts for 21.33 ms at a 48 kHz sample rate. Even with overlap-add techniques, you cannot fully eliminate window latency.
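The window figure is trivial to verify (assuming 48 kHz, which the 21.33 ms number implies):

```python
def window_latency_ms(window_size, sample_rate):
    """Latency (ms) contributed by buffering one analysis window."""
    return 1000.0 * window_size / sample_rate

print(window_latency_ms(1024, 48000))   # 21.33 ms at 48 kHz
print(window_latency_ms(1024, 44100))   # ~23.22 ms at 44.1 kHz
print(window_latency_ms(4096, 48000))   # 85.33 ms -- long windows hurt
```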

- So how else could we get at the fundamental? We could perform an autocorrelation. You can think of autocorrelation as multiplying the signal by a delayed version of itself - essentially sliding the signal against a shifted copy of itself and measuring how well they line up. When the correlation at a given lag shows a strong positive peak, the alignment is good and that lag is likely the period of the fundamental. If the peak is low, the alignment is worse, so the correlation is low.

- But this also has latency. It is a little more controllable, but with caveats. Make your window too small and you get octave errors, jitter, instability, false pitches, and artifacts. Make it too long, and you get... more latency.
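To make the trade-off concrete, here's a crude autocorrelation pitch detector (a hypothetical sketch, not production code). Note that the buffer must span at least one full period of the lowest frequency you want to find, which is exactly where the latency comes from:

```python
import math

def detect_f0(x, sr, f_min=60.0, f_max=1000.0):
    """Crude autocorrelation pitch detector (illustration only).

    Slides the signal against a delayed copy of itself and keeps the
    lag with the strongest positive correlation.
    """
    lag_min = int(sr / f_max)              # shortest period considered
    lag_max = int(sr / f_min)              # longest period considered
    best_lag, best_corr = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        corr = sum(x[n] * x[n + lag] for n in range(len(x) - lag))
        if corr > best_corr:
            best_corr, best_lag = corr, lag
    return sr / best_lag                   # estimated fundamental (Hz)

sr = 48000
sig = [math.sin(2 * math.pi * 220 * n / sr) for n in range(2048)]
f0 = detect_f0(sig, sr)                    # ~220 Hz
```

With `f_min=60`, the lag search alone needs 800 samples (~16.7 ms at 48 kHz) of signal before it can answer, and real material usually wants even more for robustness.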

- Enter resampling. Resampling is zero or near-zero latency. But the downsides are transient smearing, formant destruction, and modulation artifacts, all leading to a warbling, robotic tone. Hello Whammy pedal. This is all fundamentally because you are not shifting pitch when you resample - you are shifting TIME.
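A minimal sketch of that coupling (toy linear-interpolation resampler, illustration only):

```python
import math

def resample(x, ratio):
    """Toy linear-interpolation resampler. Reading the input `ratio`
    times faster raises the pitch by `ratio` -- but it also shortens
    the signal by the same factor: we shifted TIME, not just pitch."""
    out_len = int(len(x) / ratio)
    y = []
    for n in range(out_len):
        pos = n * ratio
        i = int(pos)
        frac = pos - i
        nxt = x[i + 1] if i + 1 < len(x) else x[i]
        y.append(x[i] * (1.0 - frac) + nxt * frac)
    return y

sr = 48000
ratio = 2.0 ** (7.0 / 12.0)               # up a perfect fifth
tone = [math.sin(2 * math.pi * 220 * n / sr) for n in range(sr)]  # 1 s
up = resample(tone, ratio)
# the 1-second 220 Hz note comes out around 330 Hz, but now lasts ~0.67 s
```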

- Phase vocoder. This is essentially a process of performing an STFT (Short-Time Fourier Transform), then phase manipulation, then an inverse FFT with overlap-add. As I understand it, this is one of the most used techniques in pitch shifting.

- This gives you accurate, stable pitch shifting. But it also smears transients and introduces phase artifacts. Pretty easy to understand if you know the physics behind it and how windowing a signal works. You get latency from the window size you decide to use: long windows give good pitch extraction but bad transients and more latency; short windows give good transients but bad pitch resolution and less latency.
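The phase manipulation at the core of it can be shown numerically. A sketch of the standard per-bin frequency refinement (names and symbols are mine, not from any particular library):

```python
import math

def true_bin_freq(phase_prev, phase_cur, k, fft_size, hop, sr):
    """Phase-vocoder core: refine bin k's frequency from the phase
    difference between two consecutive STFT frames `hop` samples apart."""
    bin_centre = k * sr / fft_size                     # bin centre (Hz)
    expected = 2.0 * math.pi * k * hop / fft_size      # expected phase advance
    dev = (phase_cur - phase_prev) - expected
    dev = (dev + math.pi) % (2.0 * math.pi) - math.pi  # wrap to [-pi, pi)
    return bin_centre + dev * sr / (2.0 * math.pi * hop)

# A 445 Hz sine lands in bin 9 (centred at 421.875 Hz for a 1024-point
# FFT at 48 kHz); the phase difference across one 256-sample hop is
# enough to recover the true frequency.
sr, fft_size, hop, k = 48000, 1024, 256, 9
f_in = 445.0
ph0 = 0.0
ph1 = 2.0 * math.pi * f_in * hop / sr   # phase advance of the input tone
f_est = true_bin_freq(ph0, ph1, k, fft_size, hop, sr)  # -> 445.0 Hz
```

This is why the pitch estimate is only as good as the window and hop you paid latency for: the phase trick refines the bin frequency, but the bins themselves come from the buffered STFT frames.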


I might have some of that wrong in the details. But the principles are correct. My original statement that sent Sascha into a spazz attack was correct.

I’m pretty sure this discussion started because you said that all you need for less pitch shifting latency is more CPU. In a discussion about polyphonic pitch shifting, correct? And then you’ve just moved the goalposts to address this cycle length discussion. And then you ChatGPT’d an answer to a question that wasn’t even being asked until you changed from “cpu power” to “cycle time” as your point of contention.

Yes, indeed. He's completely moved the goalposts to obfuscate his initial comment, which was completely wrong. I still maintain that it doesn't matter how much CPU you throw at the pitch shifting problem; there will ALWAYS be trade-offs because of inherent latency and quality issues in all of the options we have to perform the operation.
 
He's completely moved the goalposts to obfuscate his initial comment, which was completely wrong.

The one moving goalposts was you just as much, coming up with an "explanation" that reading out a full-length wave cycle would be mandatory for any pitch shifting to work at all. Which a) simply isn't the only truth, and b) isn't even what happens in most pitch shifters.
 
Do you need the full wave to detect pitch? Assuming a pure sine wave (which is not what a guitar low E string is), wouldn't half a wave suffice (0 to peak to 0)? Or even a quarter wave? (0 to peak... you'd need to detect that maximum and the falling edge.)

Now assuming a regular E string, which will also have a bunch of harmonics - could the harmonic makeup be used to estimate the probability that it is an E string, without waiting for the fundamental to complete its cycle?
Sorry if dumb questions.
 