Now usagetracker is properly in place, indirect draw HLE can be used without requiring any hacks. Dirtiness is now ignored when fetching macro arguments, and it's now the duty of the HLE impls themselves to perform flushing if they require it.
Indirect draws are implemented by having the macro arguments overflow into a seperate GP Entry that points directly to the indirect argument buffer. To HLE indirect draws a buffer needs to be created from this pointer, and it cannot be dereferenced on the CPU at any point to avoid hitting traps.
Previously for methods with count > 1 the subchannel and engine would be looked up for each part of the method rather than only doing so at the start. Each call also needed to be looked up to see if it touched a macro or GPFIFO method. Fix this by doing checks outside of the main dispatch loop with templated helper lambdas to avoid needing to repeat lots of code. Maxwell3D is the only subchannel with a fast path for now but more can be added later if needed.
Fermi2D supports macros in addition to Maxwell3D, these both share code memory. To support this we rework the macro interpreter to support passing in a target engine and abstract the communications out into an interface that can be implemented by applicable engines.
```
GPFIFO <-> MME <-> Maxwell3D
^ ^---> Fermi2D
X------------> I2M
X------------> MaxwellComputeB
X--Flush-----> MaxwellDMA
```
These are used heavily in OpenGL games, which now, together with the
previous syncpoint changes, work perfectly. The actual implementation is
rather novel as rather than using a per-class state machine for all
methods we only use it for those that are known to be split across
GpEntry boundaries, as a result only a single bounds check is added to
the hot path of contiguous method execution and the performance loss is
negligible.
Allows the execution of multiple channels at the same time, with locking
being performed on the host GPU scheduler layer, address spaces can be
bound to one or more channels.
Using a u32 for the loop index prevents masking on all increments,
giving a moderate performance increase.
Passing methods as u32 parameters and stopping subChannel being passed
gives quite a significant increase when combined with the inlining
allowed by subchannel based engine selection.
We decided to restructure Skyline to draw a layer of separation between guest and host GPU. We're reserving the `gpu` namespace and directory for purely host GPU and creating a new `soc` directory and namespace for emulation of parts of the X1 SoC which is currently limited to guest GPU but will be expanded to contain components like the audio DSP down the line.