Some of the code used when the carry flag is known to be a
constant value is really not much better than just setting
the carry flag and then using the normal code, and with how
rarely this code runs, it isn't well tested either.
Might as well get rid of some of this code and simplify things.
These optimizations were already present, but only when d == a. They
also make sense when this condition does not hold.
- imm == 0
Before:
41 BB 00 00 00 00 mov r11d,0
45 2B DF sub r11d,r15d
After:
45 8B DF mov r11d,r15d
41 F7 DB neg r11d
- imm == -1
Before:
41 BD FF FF FF FF mov r13d,0FFFFFFFFh
44 2B EE sub r13d,esi
0F 93 45 68 setae byte ptr [rbp+68h]
After:
44 8B EE mov r13d,esi
41 F7 D5 not r13d
C6 45 68 01 mov byte ptr [rbp+68h],1
Fixes a regression from 5.0-12066, where setting the GFXBackend variable
to one other than the current global backend would crash Dolphin upon
launching the game.
fcmpX only updates the FPCC bits, not the C bit.
This was already correctly implemented in the interpreter.
Not known to affect any games, but affects a hardware test.
Not doing this can cause desyncs when TASing. (I don't know
how common such desyncs would be, though. For games that
don't change rounding modes, they shouldn't be a problem.)
When I added the software FMA path in 2c38d64 and made us use
it when determinism is enabled, I was assuming that either the
performance impact of software FMA wouldn't be too large or CPUs
that were too old to have FMA instructions were too slow to run
Dolphin well anyway. This was wrong. To give an example, the
netplay performance went from 60 FPS to 30 FPS in one case.
This change makes netplay clients negotiate whether FMA should
be used. If all clients use an x64 CPU that supports FMA, or
AArch64, then FMA is enabled, and otherwise FMA is disabled.
In other words, we sacrifice accuracy if needed to avoid massive
slowdown, but not otherwise. When not using netplay, whether to
enable FMA is simply based on whether the host CPU supports it.
The only remaining case where the software FMA path gets used
under normal circumstances is when an input recording is created
on a CPU with FMA support and then played back on a CPU without.
This is not an especially common scenario (though it can happen),
and TASers are generally less picky about performance and more
picky about accuracy than other users anyway.
With this change, FMA desyncs are avoided between AArch64 and
modern x64 CPUs (unlike before 2c38d64), but we do get FMA
desyncs between AArch64 and old x64 CPUs (like before 2c38d64).
This desync can be avoided by adding a non-FMA path to JitArm64 as
an option, which I will wait with for another pull request so that
we can get the performance regression fixed as quickly as possible.
https://bugs.dolphin-emu.org/issues/12542