By distributing the load of shader compiling onto multiple threads and then only waiting for completion until absolutely neccessary we can reduce compilation stutters significantly.
Introduces the base abstractions that will be used for pipeline caching, with a 'PipelineStateBundle' that can be (de)serialised to/from disk and an abstract accessor class to allow switching between creating disk-cached pipelines and fresh ones.
When caching pipelines we can't cache whole images, only their formats so refactor PackedPipelineState so that it can be used for pipeline creation, as opposed to passing in a list of attachments.
The problem is in StoreOpenContext wasn't storing any user, but ListOpenContextStoredUsers was writing default user (when it's not stored by StoreOpenContext)
Related service calls are called in a loop by SM3DW. A variable tracking zero drift mode has been added to `npad_device`, but it's unused at the moment.
This can reuse a fair bit of the now-commonised Maxwell 3D code and mostly consists of compute-specific pipeline code which was deemed not suitable for being commonised (e.g. descriptor update code is somewhat duplicated). Of note is how compute lacks any active state at all de to its use of QMDs which bundle up all state into a single object in memory.
A lot of pipeline code is difficult to commonise due to the inherent difference between compute and graphics pipelines, however the binding layout is shared so we can at least commonise that
This will be shared with the compute engine implementation, the only thing of note with this is that the binding register is now passed as a param since it is part of the compute QMD which can't be dirty tracked.
Although rtld and IPC prevent TLS/IO and code from being above the 36-bit AS limit, nothing depends the heap being below it. We can take advantage of this by stealing as much AS as possible for code in the lower 36-bits.
Exynos SoCs have a bug where the `CNTFRQ_EL0` register is either set to 0 or contain incoherent values. With this patch, the frequency value is loaded into a static variable and used instead of reading the register. The value will be initialised to the correct value for affected SoCs, while unaffected ones will use the value from the register.
* Close the input and output file streams before moving the output file to the final destination
* Clean up the destination path before moving the new file
* Introduce a `ImportResult` return value to differentiate between the possible causes of import errors
* Display more meaningful error messages in the UI
If a producer thread was waiting for the queue to have free space and the consumer thread hadn't yet acquired the production mutex a deadlock could occur
Capture HAT axes events ourselves instead of relying on the android framework to turn them into KeyCodes. Fixes handling of DPAD button presses on most controllers.
In the case of axis value being zero, polarity would favor one side of the stick resulting in invalid values. Fix that by taking into account axis history when calculating polarity.
This was incorrectly allocated in words, rather than bytes, meaning that guest allocations could overwrite the private memory and break inline syncpt operations
We need to use a shared_ptr to ensure that the present callback doesn't do any UAFs, also unlocks the GBP during presentation as if the queue is full a deadlock could a rise where the present callback wouldn't be able to run due to the (waiting) DequeueBuffer thread holding the lock.
Exiting from emulation has always been a big issue for Skyline, with guest and host threads that would keep running in the background unless the app was manually killed. Running emulation in a separate process allows us to kill it when we are done, avoiding the need for complex exiting management code.
The old message was being misinterpreted as if the device's gpu was not supported by the emulator. Reword that message to explicitly mention custom drivers.
* Add a drag indicator at the top
* Fix flex layout wrapping when buttons didn't fit on a single line
* Fix BottomSheetDialog peek height too small on landscape orientation
* General cleanup of the layout
A new `DragIndicatorView` had been introduced, which draws a small drag handle element. When used inside a `BottomSheetDialog`, this view will add a callback for hiding the indicator when the dialog is fully expanded.
Symbol hooking is required for HLE implementations of certain features in the future such as `nvdec` and for more in-depth debugging of games as we can inspect them on a SDK function level which allows us to debug issues far more easily.
The register wouldn't be cleared with a `MOVZ` when a value was zero due to the condition for writing an instruction requiring the `offsetValue` to be non-zero.
Since the register writes technically happen after the draw, issues can occur if they happen before: e.g. skyrim updates ctSelect and disables all RTs after a draw, but this would happen before it previously and crash the driver.
Vulkan doesn't allow sampling a texture and using it as an RT in the same RP, by tracking the texture usage status and splitting RPs when this occurs we can avoid such potential sync errors.
Previously, both I2M uploads and DMA copies would force GPU serialisation if they happened to hit a trap or were used to copy GPU dirty buffers. By using the buffer manager to implement them on the host GPU we can avoid such slowdowns entiely.
The lock release within the wait for submission means that another thread could end up signalling the cycle and then the VK wait still happen after when the lock has been reacquired.
Readback can be especially slow on mobile due to the varying load pattern it creates which often prevents the CPU/GPU from clocking up. Since some games perform texture readback but don't actually use it for anything significant implement a hack to skip it and significantly improve performance in such cases.
Due to the frequency at which is is called megabuffering performance is critical to the performance of the entire emulator, especially in high-drawcall-count scenarios. After the view redesign, megabuffering on a per-view level was no longer possible nor desirable, and thus megabuffering was modified to just copy for every usage of a view. This worked great at the time since there were other bottlenecks, however gpu-new has since removed almost all of them and megabuffering is now a major sore point. Fix this by megabuffering small chunks and storing them in a page-table like structure within the buffer, these chunks can be referenced by multiple views and will be smartly invalidated whenever the sequence number or execution number changes to avoid any sequencing issues. In addition to this, to help the case where almost the whole buffer is read every single frame across a set of multiple views, an optimisation to skip the chunked tracking and use one large single megabuffer allocation and one single memcpy has been introduced. This reduces the overall amount of time spent in memcpy since large memcpys are quicker.
Rather than using just bpb for format compat, additionally check that the exact component bit layout matches since many games end up reusing RTs for unrelated textures. The texture size requirements have also been weaked to only check the resulting layer size as opposed to width/height - this is somewhat hacky but it gets around the problem of blocklinear alignment.
Prevents situations where nothing would otherwise be waiting on the GPU and since presentation no longer blocks too many images would be submitted for presentation.
In some cases like presentation, it may be possible to avoid waiting on the CPU by using a semaphore to indicate GPU completion. Due to the binary nature of Vulkan semaphores this requires a fair bit of code as we need to ensure semaphores are always unsignalled before they are waited on and signalled again. This is achieved with a special kind of chained cycle that can be added even after guest GPFIFO processing for a given cycle, the main cycle's semaphore can be waited and then the cycle for the wait attached to the main cycle and it will be waited on before signalling.
TIC sizes may not be aligned to block linear dimensions whereas RT sizes are and then limited by the surface clip. By using this to determine surface size we are more likely to get a match in texture manager for any future usages.
Keep a copy of the old TIC entry and view even after purge caches and use the execution number to check validity instead, if that doesn't match then just memcmp can be used as opposed to a full hash and map lookup.
When profiling SMO, it became obvious that the constant locking of textures and buffers in SyncDescriptors took up a large amount of CPU time (3-5%), a precious resource in intensive areas like Metro. This commit implements somewhat of a workaround to avoid constant relocking, if a buffer is frequently attached on the GPU and almost never used on the CPU we can keep the lock held between executions. Of course it's not that simple though, if the guest tries to lock a texture for the first time which has already been locked as preserve on the GPFIFO we need to avoid a deadlock. This is acheived through a combination of two things: first we periodically clear the locked attachments every 2*SlotCount submissions, preventing a complete deadlock on the CPU (just a long wait instead) and meaning that the next time the resource is attached on the GPU it will not be marked for preservation due to having been locked on the guest before; second, we always need to unlock everything when the GPU thread runs out of work, as the perioding clearing will not execute in this case which would otherwise leave the textures locked on the GPFIFO thread forever (if guest was waiting on a lock to submit work). It should be noted that we don't clear preserve attached resources in the latter scenario, only unlock them and then relock when more work is available.