The issue isn’t just the use of LUFS; it’s that null tests are useless beyond telling you whether two signals are identical or not.
My example above shows files that sound identical but have underlying differences that result in a poor null. How deeply two files null doesn’t really tell us anything useful, especially in this context. There is literally nothing of value in it, and Leo continually using this methodology and (even worse) framing it as “scientific” needs to stop.
Of course there are factors that can subvert a null test. That's why it's important to do the test properly to eliminate those factors. That obvious statement is true of almost any kind of test in almost any discipline. Done properly, the test is influenced very little by those factors. Further, in real-world tests between amps and amp sims that sound similar, those factors tend to be minor anyway.
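As a rough illustration of one such factor, here is a minimal sketch of how a timing offset alone can wreck a null, and how aligning the files removes it. This is a hypothetical example, not taken from any of the tests discussed in this thread: the 48 kHz sample rate, the 1 kHz test tone, and the helper names `null_depth_db` and `sine` are all assumptions made for illustration.

```python
import math

SAMPLE_RATE = 48_000  # Hz; assumed for illustration only


def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def null_depth_db(a, b):
    """Level of the residual (a - b) relative to a, in dB.

    More negative means a deeper null; -inf means the signals are identical.
    """
    n = min(len(a), len(b))
    residual = [a[i] - b[i] for i in range(n)]
    level = rms(residual)
    if level == 0.0:
        return float("-inf")
    return 20.0 * math.log10(level / rms(a[:n]))


def sine(freq_hz, n_samples, offset_samples=0):
    """Hypothetical test signal: a sine tone, optionally delayed by whole samples."""
    return [
        math.sin(2 * math.pi * freq_hz * (i - offset_samples) / SAMPLE_RATE)
        for i in range(n_samples)
    ]


reference = sine(1000.0, SAMPLE_RATE)
shifted = sine(1000.0, SAMPLE_RATE, offset_samples=1)  # same tone, one sample late

# Subtracting without alignment: the one-sample offset dominates the residual.
misaligned = null_depth_db(reference, shifted)

# Undo the one-sample offset before subtracting.
aligned = null_depth_db(reference[:-1], shifted[1:])

print(f"misaligned null: {misaligned:.1f} dB")
print(f"aligned null:    {aligned} dB")
```

With these assumed values the misaligned null comes out at only about -18 dB, even though the two signals are the same tone played one sample (roughly 21 microseconds) apart, while the aligned pair nulls completely. Real amp and amp sim captures are not pure tones, so the numbers will differ, but the same alignment step applies.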
The bottom line is his Kemper test matches quite well with subjective observations. That doesn't mean it's perfect, but it does mean it's hard to dismiss that kind of independent confirmation of his methods.