We try to keep as many registers as possible in callee saved registers, so if we have guest registers in the correct registers and the interpreter
call we are falling back to doesn't need the registers then we can dump just those ones. Which means we don't have to dump 100% of our register state
when falling to the interpreter.
ComputeRC was a bit unclear by using 64bit registers for setting the immediate and then calling SXTW on a 6b4it register which is just a bit obscure.
When the source register is an immediate in cntlzwx, just use the built in GCC function instead of our own implementing for counting leading zeros.
Before block linking was enabled but it wasn't ever implemented.
Implements link blocks and destroy block functions and moves the downcount check in the WriteExit function so it doesn't get overwritten when linking.
Changes the dispatcher to make sure to we are saving the LR(X30) to the stack. Also makes sure to keep the stack aligned.
AArch64's AAPCS64 mandates the stack to be quad-word aligned.
Fixes the dispatcher from infinite looping due to a downcount check jumping to the dispatcher. This was because checking exceptions and the state
pointer wouldn't reset the global conditional flags. So it would leave the timing/exception, jump to the start of the dispatcher and then jump back
again due to the conditional branch.
Removes the REG_AWAY nonsense I was doing. I've got to get the JIT more up to speed before thinking of insane register cache things.
Also fixes a bug in immediate setting where if the register being set to an immediate already had a host register tied to it then it wouldn't free the
register it had. Resulting in register exhaustion.
lmw/stmw weren't properly setting input and output registers since they use multiple registers.
dcbz was just missing a flag in the instruction tables.
This code was obviously wrong, we were sign extending 8 bit unsigned values and loading from the wrong offset as well.
This fixes a bug in Muramasa where some colours were going insane.
If the inputs are both float singles, and the top half is known to be identical
to the bottom half, we can use packed arithmetic instead of scalar to skip
the movddup.
This is slower on a few rather old CPUs, plus the Atom+Silvermont, so detect
Atom and disable it in that case.
Also avoid PPC_FP on stores if we know that the output came from a float op.
Rather than playing terrible hacks to determine the start of input
frames, just update system input periodically. Specifically, every
60th of a second.