Some games, for example PGLE, have heavy contention in code that locks mutexes for only a brief period of time. This heavy contention over multiple threads results in futex latency (often ~20us) impacting performance heavily. Using an adaptive condition variable helps to reduce this latency.
By spin waiting for a small period before falling back to an actual condition variable, some of the overheads inherent to futex's can be avoided. The used constants were tuned for optimal performance on 8G1 on Skyrim and PGLE.
* Add bold text and antialiasing for osc buttons
* Fix osc dpad and button position (widder than taller)
* Set default OSC color to white background with black text
This helps to prevent issues that result from the overlapping of buffer and texture data, by only ever syncing back textures if they are actually used as RTs, which are much less likely to overlap buffers.
* Apply translations in fr
* Apply translations in ru
* Apply translations in b+zh+Hans
* Apply translations in b+zh+Hant
* Apply translations in de
* Apply translations in el
* Apply translations in ja
* Apply translations in ar
* Apply translations in ta
* Apply translations in pl
* Apply translations in ko
* Apply translations in es
* Apply translations in pl
* Apply translations in in
* Apply translations in it
* Apply translations in b+es+419
* Apply translations in hu
`PreferenceSettings` was removed in favor of:
* `AppSettings`: stores general purpose settings mostly used for UI configuration and state
* `EmulationSettings`: stores emulation-related settings, most of these are passed to native emulation code
This commit reverts PR #2037. Passing `NativeSettings` to emulation code through a member reference, instead of a local variable, caused unpredictable crashes when using custom GPU drivers (v615+) on some Qualcomm SoCs.
The exact cause of the issue remains unknown, my best guess is that it was caused by an incorrect optimization performed on the Kotlin bytecode in release mode, which caused an issue when reading memory that had been forked, because of running emulation in a separate process.
Runtime settings modification will be reimplemented in the future via an alternative method.
Indirect layers are used by the game to render layer on its own, the game allocates a buffer with the size from `GetIndirectLayerImageRequiredMemoryInfo` and uses `GetIndirectLayerImageMap` to draw the applet contents into the buffer.
As we don't LLE applet implementations nor do our HLE implementations draw equivalent applets, we cannot submit this to the guest. As a result, these functions are stubbed with the framebuffer being cleared to red.
Stubbing these functions allows titles such as Dark Souls to not crash while initializing indirect layers.
Accessing the settings class during the execution of the `OnChangedCallback` results in a deadlock, as accesses to values are protected by a mutex. Instead, we now keep a local copy of the relevant settings and update those with the new value.
I missed that addSubpass was only called once-per-subpass, meaning that if a new barrier req was discovered several draws into the RP it wouldn't be applied. Split out barriers into a seperate function to avoid this.
Full pipeline barriers between every RP can be extremely expensive on HW, by analysing the inputs and outputs of a draw it's possible to construct a much more optimal barrier that only syncs what is neccessary.
Sometimes view pointers may change despite the underlying Vulkan image view not actually changing, so use vk::ImageViews for tracking to keep RP breaks to a minimum.
The layer stride provided by the depth register in Maxwell3D needs to be shifted by 2, this caused the stride to be 1/4th of what it needed to be resulting in OOB access.
When calculating mip-level dimensions in terms of GOBs, they need to be divided by 2 while rounding upwards rather than downwards. This fixes corrupted textures and OOB access on lower mip levels across a substantial amount of titles, reducing arbitrary crashes as a result.
mapped
Since we align up when allocating, not doing so when deallocating would result in a gradual buildup of boundary pages that eventually fill the whole address space.
These are about 100x as expensive on adreno than nvidia due to the lack of a dedicated instruction, since some games work fine without them add a hack to disable them.
The vulkan guest driver doesn't expect a 0xB return code from SyncptEventWait, even though this is valid when an event is being signalled. Just ignore the intermediate state instead as doing so avoids races without causing any more.
The excessive blocking caused by initial compilation happening async to the guest caused issues in some cases, now we have a Vulkan pipeline cache to speed it up we can wait for a full compile before launch without too many issues.
By only using what we need, and mirroring the descriptor structs to allow for much tighter packing (while keeping the same member names) we can reduce pipeline memory to about 1/3 of what it was before.
Certain titles such as Super Smash Bros Ultimate can use SVC `UnmapPhysicalMemory` to punch holes into physical memory mappings, this wasn't handled correctly as we completely deleted the portion after the hole. It has now been fixed which results in these titles which depend on this behavior to work now.
Since the waitermutex is only ever locked for a short amount of time, spinning in contention-heavy scenarios ends up quite a bit more efficient than a kernel wait.
This removes the need to concatenate the variable multiple times, recycles the scaled bitmap after it has been stored, addresses the Android Studio complaint about that method name, and generates a preview of the current profile image as the preference icon.
The implementation for this service function wasn't added to the service function table. Additionally, the type for the output `ScalingMode` was implicitly `int` as it was unspecified in the `enum class` which has now been corrected to `u64` as it should be.
Due to broken drivers, it's possible to find no Vulkan physical devices but this can lead to a cryptic segfault. This explicitly checks for it instead and throws an exception which will be emitted into logcat thus can be easily caught.
Due to the trampoline and save/load context functions, `GetHookSectionSize` returned a non-zero size for when there were no hooked symbols supplied to it. This is problematic as it isn't required and hooking is currently not stable so it can lead to crashes or freezes in certain titles.
A thread can be paused while it is in a synchronization primitive which will do `RemoveThread`, we need to update the state of `insertThreadOnResume` in this case by clearing it so it isn't incorrectly reinserted on resuming the thread.
`Scheduler::UpdateCore` implicitly depended on `KThread::coreMigrationMutex` being locked during calls to it, this requirement has now been made explicit to avoid confusion.
When a timeout occurs in `ConditionVariableWait`, we used to check `waitMutex` which is cleared by `MutexUnlock` but when we hit the CAS case in `ConditionVariableSignal` then we don't clear `waitMutex`. It's far more reliable to check `waitThread` as an indication for if the thread has already been unlocked as it's cleared at the start of `ConditionVariableWait` and would implicitly stay cleared in the CAS case while being set in `MutexLock` and being unset in `MutexUnlock`.
There's multiple locations where a thread is yielded in the scheduler and all of them repeat the code of checking for `pendingYield` and signalling with an optional optimization of checking if the thread being yielded is the calling thread.
All this functionality has now been consolidated into `Scheduler::YieldThread` which checks for `pendingYield` and does the calling thread yield optimization. This should lead to better readability and better performance in cases where `UpdatePriority` would signal the calling thread.
`forceYield` was incorrectly not set when pausing running threads if the thread already had `pendingYield` set. This could lead to cases where `Rotate` would later throw an exception due to it being unset.
Blocking while inserting a paused thread can lead to deadlocks where the inserting thread later resumes the paused thread.
Co-authored-by: Billy Laws <blaws05@gmail.com>
As we didn't hold `coreMigrationMutex`, the thread could simply migrate during `InsertThread` which would lead to the thread potentially never waking up as it's been inserted on a non-resident core.
Co-authored-by: PixelyIon <pixelyion@protonmail.com>
`SignalToAddress`/`ConditionVariableSignal` need to wake waiters in priority order, while threads are inserted in order this doesn't remain the case as priority updates don't reinsert the thread into `syncWaiters`.
It was determined that reinsertion into `syncWaiters` would be fairly complex due to locking the `syncWaitersMutex` with the thread's mutexes. To avoid this, this commit instead sorts waiters by priority at signal time to always wake threads in the right order.
Calling `WaitSchedule` inside the block where `syncWaiterMutex` is locked causes a race with other threads which lock the core mutex and `syncWaiterMutex` together. This commit moves the `WaitSchedule` outside the block while simply setting a flag to wait later similar to `ConditionVariableWait`'s timeout case.
Co-authored-by: Billy Laws <blaws05@gmail.com>
This is a cause for a large amount of scheduler bugs so we should generally check for this on debug builds as it is a fairly easy way to check for issues for some performance cost.
The way we handled waking/timeouts of condition variables was fairly inaccurate to HOS as we moved locking of the mutex to the waker thread which could change the order of operations and would cause what were functionally spurious wakeups for all awoken threads.
This commit fixes it by doing all locks on the waker thread and only awakening the waiter thread once the condition variable was signalled and the mutex was unlocked. In addition, this fixes races between a timeout and a signal that could lead to double-insertion as a result of a refactor of how timeouts work in the new system.
`MutexLock` incorrectly returned `InvalidCurrentMemory` for cases where the userspace value didn't match the expected value. It's been corrected to return no error in those cases while preserving the error code for usage in `ConditionalVariableWait`.
We didn't read the values for arbitration atomically in all cases as we should have, this consolidates the reading of the value and uses the value across all cases.
A race could occur from the timeout path in `WaitForAddress` taking place at the same time as `SignalToAddress` has been caused, this causes a deadlock due to double-insertion.
Some games rely on the vsync event to schedule frames, by matching its timing with presentation we can reduce needless waiting as the game will immediely be able to queue the next frame after presentation.
This allows for the presentation engine to grab the presentation image early when direct buffers are in use, since it'll handle sync on its own using semaphores it doesn't need to wait for GPU execution.
By importing guest memory directly onto the host GPU we can avoid many of the complexities that occur with memory tracking as well as the heavy performance overhead in some situations. Since it's still desired to support the traditional buffer method, as it's faster in some cases and more widely supported, most of the exposed buffer methods have been split into two variants with just a small amount of shared code. While in most cases the code is simpler, one area with more complexity is handling CPU accesses that need to be sequenced, since we don't have any place we can easily apply writes to on the GPFIFO thread that wont also impact the buffer on the GPU, to solve this, when the GPU is actively using a buffer's contents, an interval list is used to keep track of any GPFIO-written regions on the CPU and any CPU reads to them will instead be directed to a shadow of the buffer with just those writes applied. Once the GPU has finished using buffer contents the shadow can then be removed as all writes will have been done by the GPU.
The main caveat of this is that it requires tying host sync to guest sync, this can reduce performance in games which double buffer command buffers as it prevents us from fully saturating the CPU with the GPFIFO thread.
This is necessary for the upcoming direct buffer support, as in order to use guest buffers directly without trapping we need to recreate any guest GPU sync on the host GPU. This avoids the guest thinking work is done that isn't and overwriting in-use buffer contents.
Extends the profile picture stub into a full-fledged implementation with the ability for users to set their profile picture in settings while having the Skyline icon as the default profile picture.
HOS's TIDs are one-based rather than zero-based, certain titles such as Pokémon Arceus, Naruto Shippuden: Ultimate Ninja Storm 3, Splatoon 3, etc. use the TID being zero as a sentinel value but as we assigned this ID to our first thread prior it broke this logic which has now been fixed by this commit as it now matches HOS behavior.
All writes are done async into a staging file, which is then merged into the main pipeline cache file at the time of the next launch. Upon encountering file corruption the cache can be trimmed up to the last-known-good entry to avoid any excessive loss of data from just one error.
By distributing the load of shader compiling onto multiple threads and then only waiting for completion until absolutely neccessary we can reduce compilation stutters significantly.
Introduces the base abstractions that will be used for pipeline caching, with a 'PipelineStateBundle' that can be (de)serialised to/from disk and an abstract accessor class to allow switching between creating disk-cached pipelines and fresh ones.
When caching pipelines we can't cache whole images, only their formats so refactor PackedPipelineState so that it can be used for pipeline creation, as opposed to passing in a list of attachments.
The problem is in StoreOpenContext wasn't storing any user, but ListOpenContextStoredUsers was writing default user (when it's not stored by StoreOpenContext)
Related service calls are called in a loop by SM3DW. A variable tracking zero drift mode has been added to `npad_device`, but it's unused at the moment.