Some games rely on the vsync event to schedule frames, by matching its timing with presentation we can reduce needless waiting as the game will immediely be able to queue the next frame after presentation.
This allows for the presentation engine to grab the presentation image early when direct buffers are in use, since it'll handle sync on its own using semaphores it doesn't need to wait for GPU execution.
By importing guest memory directly onto the host GPU we can avoid many of the complexities that occur with memory tracking as well as the heavy performance overhead in some situations. Since it's still desired to support the traditional buffer method, as it's faster in some cases and more widely supported, most of the exposed buffer methods have been split into two variants with just a small amount of shared code. While in most cases the code is simpler, one area with more complexity is handling CPU accesses that need to be sequenced, since we don't have any place we can easily apply writes to on the GPFIFO thread that wont also impact the buffer on the GPU, to solve this, when the GPU is actively using a buffer's contents, an interval list is used to keep track of any GPFIO-written regions on the CPU and any CPU reads to them will instead be directed to a shadow of the buffer with just those writes applied. Once the GPU has finished using buffer contents the shadow can then be removed as all writes will have been done by the GPU.
The main caveat of this is that it requires tying host sync to guest sync, this can reduce performance in games which double buffer command buffers as it prevents us from fully saturating the CPU with the GPFIFO thread.
This is necessary for the upcoming direct buffer support, as in order to use guest buffers directly without trapping we need to recreate any guest GPU sync on the host GPU. This avoids the guest thinking work is done that isn't and overwriting in-use buffer contents.
Extends the profile picture stub into a full-fledged implementation with the ability for users to set their profile picture in settings while having the Skyline icon as the default profile picture.
HOS's TIDs are one-based rather than zero-based, certain titles such as Pokémon Arceus, Naruto Shippuden: Ultimate Ninja Storm 3, Splatoon 3, etc. use the TID being zero as a sentinel value but as we assigned this ID to our first thread prior it broke this logic which has now been fixed by this commit as it now matches HOS behavior.
All writes are done async into a staging file, which is then merged into the main pipeline cache file at the time of the next launch. Upon encountering file corruption the cache can be trimmed up to the last-known-good entry to avoid any excessive loss of data from just one error.
By distributing the load of shader compiling onto multiple threads and then only waiting for completion until absolutely neccessary we can reduce compilation stutters significantly.
Introduces the base abstractions that will be used for pipeline caching, with a 'PipelineStateBundle' that can be (de)serialised to/from disk and an abstract accessor class to allow switching between creating disk-cached pipelines and fresh ones.
When caching pipelines we can't cache whole images, only their formats so refactor PackedPipelineState so that it can be used for pipeline creation, as opposed to passing in a list of attachments.
The problem is in StoreOpenContext wasn't storing any user, but ListOpenContextStoredUsers was writing default user (when it's not stored by StoreOpenContext)
Related service calls are called in a loop by SM3DW. A variable tracking zero drift mode has been added to `npad_device`, but it's unused at the moment.
This can reuse a fair bit of the now-commonised Maxwell 3D code and mostly consists of compute-specific pipeline code which was deemed not suitable for being commonised (e.g. descriptor update code is somewhat duplicated). Of note is how compute lacks any active state at all de to its use of QMDs which bundle up all state into a single object in memory.
A lot of pipeline code is difficult to commonise due to the inherent difference between compute and graphics pipelines, however the binding layout is shared so we can at least commonise that
This will be shared with the compute engine implementation, the only thing of note with this is that the binding register is now passed as a param since it is part of the compute QMD which can't be dirty tracked.
Although rtld and IPC prevent TLS/IO and code from being above the 36-bit AS limit, nothing depends the heap being below it. We can take advantage of this by stealing as much AS as possible for code in the lower 36-bits.