I missed that addSubpass was only called once-per-subpass, meaning that if a new barrier req was discovered several draws into the RP it wouldn't be applied. Split out barriers into a seperate function to avoid this.
Full pipeline barriers between every RP can be extremely expensive on HW, by analysing the inputs and outputs of a draw it's possible to construct a much more optimal barrier that only syncs what is neccessary.
Sometimes view pointers may change despite the underlying Vulkan image view not actually changing, so use vk::ImageViews for tracking to keep RP breaks to a minimum.
The layer stride provided by the depth register in Maxwell3D needs to be shifted by 2, this caused the stride to be 1/4th of what it needed to be resulting in OOB access.
When calculating mip-level dimensions in terms of GOBs, they need to be divided by 2 while rounding upwards rather than downwards. This fixes corrupted textures and OOB access on lower mip levels across a substantial amount of titles, reducing arbitrary crashes as a result.
mapped
Since we align up when allocating, not doing so when deallocating would result in a gradual buildup of boundary pages that eventually fill the whole address space.
These are about 100x as expensive on adreno than nvidia due to the lack of a dedicated instruction, since some games work fine without them add a hack to disable them.
The vulkan guest driver doesn't expect a 0xB return code from SyncptEventWait, even though this is valid when an event is being signalled. Just ignore the intermediate state instead as doing so avoids races without causing any more.
The excessive blocking caused by initial compilation happening async to the guest caused issues in some cases, now we have a Vulkan pipeline cache to speed it up we can wait for a full compile before launch without too many issues.
By only using what we need, and mirroring the descriptor structs to allow for much tighter packing (while keeping the same member names) we can reduce pipeline memory to about 1/3 of what it was before.
Certain titles such as Super Smash Bros Ultimate can use SVC `UnmapPhysicalMemory` to punch holes into physical memory mappings, this wasn't handled correctly as we completely deleted the portion after the hole. It has now been fixed which results in these titles which depend on this behavior to work now.
Since the waitermutex is only ever locked for a short amount of time, spinning in contention-heavy scenarios ends up quite a bit more efficient than a kernel wait.
This removes the need to concatenate the variable multiple times, recycles the scaled bitmap after it has been stored, addresses the Android Studio complaint about that method name, and generates a preview of the current profile image as the preference icon.
The implementation for this service function wasn't added to the service function table. Additionally, the type for the output `ScalingMode` was implicitly `int` as it was unspecified in the `enum class` which has now been corrected to `u64` as it should be.
Due to broken drivers, it's possible to find no Vulkan physical devices but this can lead to a cryptic segfault. This explicitly checks for it instead and throws an exception which will be emitted into logcat thus can be easily caught.
Due to the trampoline and save/load context functions, `GetHookSectionSize` returned a non-zero size for when there were no hooked symbols supplied to it. This is problematic as it isn't required and hooking is currently not stable so it can lead to crashes or freezes in certain titles.
A thread can be paused while it is in a synchronization primitive which will do `RemoveThread`, we need to update the state of `insertThreadOnResume` in this case by clearing it so it isn't incorrectly reinserted on resuming the thread.
`Scheduler::UpdateCore` implicitly depended on `KThread::coreMigrationMutex` being locked during calls to it, this requirement has now been made explicit to avoid confusion.
When a timeout occurs in `ConditionVariableWait`, we used to check `waitMutex` which is cleared by `MutexUnlock` but when we hit the CAS case in `ConditionVariableSignal` then we don't clear `waitMutex`. It's far more reliable to check `waitThread` as an indication for if the thread has already been unlocked as it's cleared at the start of `ConditionVariableWait` and would implicitly stay cleared in the CAS case while being set in `MutexLock` and being unset in `MutexUnlock`.
There's multiple locations where a thread is yielded in the scheduler and all of them repeat the code of checking for `pendingYield` and signalling with an optional optimization of checking if the thread being yielded is the calling thread.
All this functionality has now been consolidated into `Scheduler::YieldThread` which checks for `pendingYield` and does the calling thread yield optimization. This should lead to better readability and better performance in cases where `UpdatePriority` would signal the calling thread.
`forceYield` was incorrectly not set when pausing running threads if the thread already had `pendingYield` set. This could lead to cases where `Rotate` would later throw an exception due to it being unset.
Blocking while inserting a paused thread can lead to deadlocks where the inserting thread later resumes the paused thread.
Co-authored-by: Billy Laws <blaws05@gmail.com>
As we didn't hold `coreMigrationMutex`, the thread could simply migrate during `InsertThread` which would lead to the thread potentially never waking up as it's been inserted on a non-resident core.
Co-authored-by: PixelyIon <pixelyion@protonmail.com>
`SignalToAddress`/`ConditionVariableSignal` need to wake waiters in priority order, while threads are inserted in order this doesn't remain the case as priority updates don't reinsert the thread into `syncWaiters`.
It was determined that reinsertion into `syncWaiters` would be fairly complex due to locking the `syncWaitersMutex` with the thread's mutexes. To avoid this, this commit instead sorts waiters by priority at signal time to always wake threads in the right order.
Calling `WaitSchedule` inside the block where `syncWaiterMutex` is locked causes a race with other threads which lock the core mutex and `syncWaiterMutex` together. This commit moves the `WaitSchedule` outside the block while simply setting a flag to wait later similar to `ConditionVariableWait`'s timeout case.
Co-authored-by: Billy Laws <blaws05@gmail.com>
This is a cause for a large amount of scheduler bugs so we should generally check for this on debug builds as it is a fairly easy way to check for issues for some performance cost.
The way we handled waking/timeouts of condition variables was fairly inaccurate to HOS as we moved locking of the mutex to the waker thread which could change the order of operations and would cause what were functionally spurious wakeups for all awoken threads.
This commit fixes it by doing all locks on the waker thread and only awakening the waiter thread once the condition variable was signalled and the mutex was unlocked. In addition, this fixes races between a timeout and a signal that could lead to double-insertion as a result of a refactor of how timeouts work in the new system.
`MutexLock` incorrectly returned `InvalidCurrentMemory` for cases where the userspace value didn't match the expected value. It's been corrected to return no error in those cases while preserving the error code for usage in `ConditionalVariableWait`.
We didn't read the values for arbitration atomically in all cases as we should have, this consolidates the reading of the value and uses the value across all cases.
A race could occur from the timeout path in `WaitForAddress` taking place at the same time as `SignalToAddress` has been caused, this causes a deadlock due to double-insertion.
Some games rely on the vsync event to schedule frames, by matching its timing with presentation we can reduce needless waiting as the game will immediely be able to queue the next frame after presentation.
This allows for the presentation engine to grab the presentation image early when direct buffers are in use, since it'll handle sync on its own using semaphores it doesn't need to wait for GPU execution.
By importing guest memory directly onto the host GPU we can avoid many of the complexities that occur with memory tracking as well as the heavy performance overhead in some situations. Since it's still desired to support the traditional buffer method, as it's faster in some cases and more widely supported, most of the exposed buffer methods have been split into two variants with just a small amount of shared code. While in most cases the code is simpler, one area with more complexity is handling CPU accesses that need to be sequenced, since we don't have any place we can easily apply writes to on the GPFIFO thread that wont also impact the buffer on the GPU, to solve this, when the GPU is actively using a buffer's contents, an interval list is used to keep track of any GPFIO-written regions on the CPU and any CPU reads to them will instead be directed to a shadow of the buffer with just those writes applied. Once the GPU has finished using buffer contents the shadow can then be removed as all writes will have been done by the GPU.
The main caveat of this is that it requires tying host sync to guest sync, this can reduce performance in games which double buffer command buffers as it prevents us from fully saturating the CPU with the GPFIFO thread.
This is necessary for the upcoming direct buffer support, as in order to use guest buffers directly without trapping we need to recreate any guest GPU sync on the host GPU. This avoids the guest thinking work is done that isn't and overwriting in-use buffer contents.
Extends the profile picture stub into a full-fledged implementation with the ability for users to set their profile picture in settings while having the Skyline icon as the default profile picture.
HOS's TIDs are one-based rather than zero-based, certain titles such as Pokémon Arceus, Naruto Shippuden: Ultimate Ninja Storm 3, Splatoon 3, etc. use the TID being zero as a sentinel value but as we assigned this ID to our first thread prior it broke this logic which has now been fixed by this commit as it now matches HOS behavior.
All writes are done async into a staging file, which is then merged into the main pipeline cache file at the time of the next launch. Upon encountering file corruption the cache can be trimmed up to the last-known-good entry to avoid any excessive loss of data from just one error.