By only using what we need, and mirroring the descriptor structs to allow for much tighter packing (while keeping the same member names) we can reduce pipeline memory to about 1/3 of what it was before.
All writes are done async into a staging file, which is then merged into the main pipeline cache file at the time of the next launch. Upon encountering file corruption the cache can be trimmed up to the last-known-good entry to avoid any excessive loss of data from just one error.
Introduces the base abstractions that will be used for pipeline caching, with a 'PipelineStateBundle' that can be (de)serialised to/from disk and an abstract accessor class to allow switching between creating disk-cached pipelines and fresh ones.
Symbol hooking is required for HLE implementations of certain features in the future such as `nvdec` and for more in-depth debugging of games as we can inspect them on a SDK function level which allows us to debug issues far more easily.
Previously, both I2M uploads and DMA copies would force GPU serialisation if they happened to hit a trap or were used to copy GPU dirty buffers. By using the buffer manager to implement them on the host GPU we can avoid such slowdowns entiely.
Ontop of the TIC cache from previous code a simple index based lookup has been added which vastly speeds things up by avoding the need to hash the TIC structure every time.
gm20b performs instanced draws by repeating draw methods for each instance, the code to detect this together with the cost of interpreting macros took up around 6% of GPFIFO time in Metro Kingdom. By detecting these specific macros and performing an instanced draw directly much of that cost can be avoided.
gpu-new will use a monolithic pipeline object for each pipeline to store state, keyed by the PackedPipelineState contents. This allows for a greater level of per-pipeline optimisations and a reduction in the overall number of lookups in a draw compared to the previous system.
Adepted from the previous code to use dirty state tracking. The cache has also been removed since with the new buffer view and GMMU optimisations it actually ended up slowing lookups down, another result of the buffer view optimisations is that raw pointers are no longer used for buffer views since destruction is now much cheaper.
Constant buffer updates result in a barrage of std::mutex calls that take a lot of time even under no contention (around 5%). Using a custom spinlock in cases like these allows inlining locking code reducing the cost of locks under no contention to almost 0.
Certain titles depend on HID LIFO entries being written out at a fixed frequency rather than on actual state change, not doing this can lead to applications freezing till the LIFO is filled up to maximum size, this behavior is seen in Super Mario Odyssey. In other cases such as Metroid Dread, the game can run into race conditions that would lead to crashes, these were worked around by smashing a button during loading prior.
This commit introduces a thread which sleeps and wakes up occasionally to write LIFO entries into HID shared memory at the desired frequencies. This alleviates any issues as it fills up the LIFO instantly and correctly emulates HID Shared Memory behavior expected by the guest.
Co-authored-by: Narr the Reg <juangerman-13@hotmail.com>
It is desirable for us to use a shader for blits to allow easily emulating out of bounds blits and blits between different swizzled colour formats. The helper shader infrastructure is designed to be generic so it can be reused by any other helper shaders that we may need in the future.
After the introduction of workahead a system to hold a single large megabuffer per submission was implemented, this worked fine for most cases however when many submissions were flight at the same time memory usage would increase dramatically due to the amount of megabuffers needed. Since only one megabuffer was allowed per execution, it forced the buffer to be fairly large in order to accomodate the upper-bound, even further increasing memory usage.
This commit implements a system to fix the memory usage issue described above by allowing multiple megabuffers to be allocated per execution, as well as reuse across executions. Allocations now go through a global allocator object which chooses which chunk to allocate into on a per-allocation scale, if all are in use by the GPU another chunk will be allocated, that can then be reused for future allocations too. This reduces Hollow Knight megabuffer memory usage by a factor 4 and SMO by even more.
The `Settings` class now has a pure virtual `Update` method, and uses inheritance over template specialization for platform-specific behavior override.
These applets are used by applications to display a custom error message to the user. Both the error message and the detailed error message are printed to the error log.
Co-authored-by: lynxnb <niccolo.betto@gmail.com>
Certain GPU vendors such as ARM's Mali do not have support for BCn textures whatsoever while other vendors such as AMD only have partial support (BC1-BC3). Most titles on the guest utilize BC textures and to address this on host GPUs without support for BCn, we need to decompress the texture on the CPU. This commit implements a CPU BCn texture decoder based off Swiftshader's BC decoder, it also adds the necessary infrastructure to have different formats for the `GuestTexture` and `Texture` objects.
The API for texture swizzling is now more concrete and abstracted out from `GuestTexture`, this allows for neater usage in certain areas such as MaxwellDMA while having a `GuestTexture` wrapper as well allowing for neater usage in those cases.
The code itself has also been cleaned up slightly with all usage of `u32`s being upgraded to `size_t` as this is simply more efficient due to the compiler not needing to emulate wraparound behavior for integer types smaller than the processor word size.
The Fermi 2D engine implements both image blit and resolve operations, supporting subpixel sampling with both linear and point filtering.
Resolve operations are performed by sampling from the center of each pixel in order to resolve the final image from the MSAA samples
MSAA images are stored in memory like regular images but each pixels dimensions are scaled: e.g for 2x2 MSAA
```
112233
112233
445566
445566
```
These would be sampled with both duDx and duDy as 2 (integer part), resolving to the following:
```
123
456
```
Blit operations are performed by sampling from the corner of each pixel, scaling the image as one would expect.
This implementation isn't fully complete as Vulkan blit doesn't support some combinations which Fermi does, most notably between colour and depth stencil. These will be implemented properly at a later date, likely after the texture manager rework.
Out of Bounds Blit, used by some OpenGL games is also missing since supporting it requires texture aliasing, this will also be supported after the texture manager rework.
Co-authored-by: Billy Laws <blaws05@gmail.com>
A basic `bcat:u` implementation to prevent titles such as "Kirby and the Forgotten Land" dependent on BCAT support from crashing due to the lack of an implementation.
Implements a cache for storing `VkFramebuffer` objects with a special path on devices with `VK_KHR_imageless_framebuffer` to allow for more cache hits due to an abstract image rather than a specific one.
Caching framebuffers is a fairly crucial optimization due to the cost of creating framebuffers on TBDRs since it involves calculating tiling memory allocations and in the case of Adreno's proprietary driver involves several kernel calls for mapping and allocating the corresponding framebuffer memory.