Keep a copy of the old TIC entry and view even after purge caches and use the execution number to check validity instead, if that doesn't match then just memcmp can be used as opposed to a full hash and map lookup.
When profiling SMO, it became obvious that the constant locking of textures and buffers in SyncDescriptors took up a large amount of CPU time (3-5%), a precious resource in intensive areas like Metro. This commit implements somewhat of a workaround to avoid constant relocking, if a buffer is frequently attached on the GPU and almost never used on the CPU we can keep the lock held between executions. Of course it's not that simple though, if the guest tries to lock a texture for the first time which has already been locked as preserve on the GPFIFO we need to avoid a deadlock. This is acheived through a combination of two things: first we periodically clear the locked attachments every 2*SlotCount submissions, preventing a complete deadlock on the CPU (just a long wait instead) and meaning that the next time the resource is attached on the GPU it will not be marked for preservation due to having been locked on the guest before; second, we always need to unlock everything when the GPU thread runs out of work, as the perioding clearing will not execute in this case which would otherwise leave the textures locked on the GPFIFO thread forever (if guest was waiting on a lock to submit work). It should be noted that we don't clear preserve attached resources in the latter scenario, only unlock them and then relock when more work is available.
Avoids one race where we would end up hogging all the locks of chained cycles and ourself when waiting for submission of previous cycles and prevent any forward progress due to another thread locking one of the chained cycles.
For the upcoming preserve attachment optimisation, which will keep buffers/textures locked on the GPU between executions, we don't want to preserve any which are frequently locked on the CPU as that would result in lots of needless waiting for a resource to be unlocked by the GPU when it occasionally frees all preserve attachments when it could have been done much sooner. By checking if a resource has ever been locked on the CPU and using that to choose whether we preserve it we can avoid such waiting.
Allowing for parallel execution of channels never really benefitted many games and prevented optimisations such as keeping frequently used resources always locked to avoid the constant overhead of locking on the hot path.
Ontop of the TIC cache from previous code a simple index based lookup has been added which vastly speeds things up by avoding the need to hash the TIC structure every time.
Introducing async record resulted in breaking the assumption that any work submitted through command scheduler would be submitted in order with graphics submits. Since async record now unlocks the texture before it's submitted a seperate mechanism is needed to ensure ordering of submits. This is achieved by building support into fence cycle itself, with a conditional variable that is waited on for submission before any fence waits occur.
GPFIFO code is very high throughput due to the sheer number of commands used for rendering. Adjust some types and switch to a if statement with hints to slightly increase processing speed.
Recording of command nodes into Vulkan command buffers is very easily parallelisable as it can effectively be treated as part of the GPU execution, which is inherently async. By moving it to a seperate thread we can shave off about 20% of GPFIFO execution time. It should be noted that the command scheduler command buffer infra is no longer used, since we need to record texture updates on the GPFIFO thread (while another slot is being recorded on the record thread) and then use the same command buffer on the record thread later. This ends up requiring a pool per slot, which is reasonable considering we only have four slots by default.
Using command executor for each state individual update was found to be infeasible due to the shear number of state updates per draw and it relying on per-node heap allocations. Instead this commit takes advantage of each state update being used only once to implement a system of linearly-allocated state update commands that are linked together. After setting up all draw state with StateUpdateBuilder, the built StateUpdater can then be used in the execution phase to record all of the draw state into the command buffer with almost zero ovehead.
SMO implements instanced draws by repeating the same draw just with a different constant buffer bound. Reduce the cost of this significantly by detecting such cases and instead of processing every descriptor, copy the previous descriptor set and update only the ones affected by the bound constant buffer.
Credits to ripinperiperi for the initial idea and making me aware of how SMO does these draws
When a buffer is trapped nearly every frame, the cost of trapping and synchronising its contents starts to quickly add up. By always using the megabuffer when this is the case, since megabuffer copies are done directly from the guest, we skip the need to synchronise/trap the backing.
The original intention was to cache on the user side, but especially with shader constant buffers that's difficult and costly. Instead we can cache on the buffer side, with a page-table like structure to hold variable sized allocations indexed by the aligned view base address. This avoids most redundant copies from repeated use of the same buffer without updates inbetween.
Avoids the need to hash PipelineState when we can guess the pipeline that will be used next. This could very easily be optimised in the future with generational, usage-based caching if necessary.
gm20b performs instanced draws by repeating draw methods for each instance, the code to detect this together with the cost of interpreting macros took up around 6% of GPFIFO time in Metro Kingdom. By detecting these specific macros and performing an instanced draw directly much of that cost can be avoided.
gpu-new will use a monolithic pipeline object for each pipeline to store state, keyed by the PackedPipelineState contents. This allows for a greater level of per-pipeline optimisations and a reduction in the overall number of lookups in a draw compared to the previous system.
Caching here was deemed unnecessary since it will be done implicitly by the pipeline cache and creates issues with the legacy attribute conversion pass. It now purely serves as a frontend for Hades.
It was determined that a general purpose Vulkan pipeline cache isn't viable for the significant performance reqs of Draw(), by using a Maxwell 3D specific key we can shrink state significantly more than if we used Vulkan structs.
Removes all usage of graphics_context.h from the codebase, exclusively using the new interconnect and its dirty tracking system. While porting the code a number of bugs were discovered such as not respecting the base instance or primitive type override, which have all been fixed. Currently only clears and constant buffer updates are implemented but due to the dirty state system allowing register handling on the interconnect end there shouldn't end up being many more changes.
This mainly distributes operations down to activeState and pipelineState, aside from clears which are implemented in-place. The exposed interface is much reduced as opposed to the previous GraphicsContext system due to the newly introduced dirty system, this should hopefully make the code more maintainable and keep actual rendering operations seperate from primitive restart state or whatever. Currently draws are unimplemented and the only full implemented things are clears and constant buffer operations.
Active state encapsulates all state that isn't part of a pipeline and can be set dynamically with Vulkan calls. This includes both dynamic state like stencil faces, and command buffer state like vertex buffer bindings.
Simililarly to the last commit, the main goal of this is to reduce the number of redundant work done per draw by employing dirty state as much as possible. Without using dirty state for this every active state operation would need to be performed every draw, which gets very expensive when things like buffer lookups end up being reqiored. Code has also been heavily cleaned up as is described in the previous commit.
The main goal of this is to reduce the number of redundant lookups and work done per draw as much as possible, this is mainly achived through heavy used of dirty tracking though other optimisations like heavily using the linear allocator are also in play. In addition to the goal of performance, the code has been cleaned up and abstracted significantly from its state in graphics_context, hopefully making the GPU interconnect code much more maintainable in the future and reducing the boilerplace needed to add even simple functionality. This commit includes partial pipeline state, enough for implementing clears + a slight bit extra.
Adepted from the previous code to use dirty state tracking. The cache has also been removed since with the new buffer view and GMMU optimisations it actually ended up slowing lookups down, another result of the buffer view optimisations is that raw pointers are no longer used for buffer views since destruction is now much cheaper.