Commit Graph

1648 Commits

Author SHA1 Message Date
PixelyIon
c3cf79cb39
Rework KThread::waiterMutex Locking
Two issues exist with locking of `KThread::waiterMutex`:
* It was not always locked when accessing waiter members such as `waitThread`, `waitKey` and `waitTag` which would lead to a race that could end up in a deadlock or most notably a segfault inside `UpdatePriorityInheritance`
* There could be a deadlock from `UpdatePriorityInheritance` locking `waiterMutex` of a thread and waiting to get the owner's `waiterMutex` while on another thread `MutexUnlock` holds the owner's `waiterMutex` and waits on locking the `waiterMutex` held by `UpdatePriorityInheritance`

This commit fixes both issues by adding appropriate locking to all locations where waiter members are accessed in addition to adding a fallback mechanism inside `UpdatePriorityInheritance` that unlocks `waiterMutex` on contention to avoid a deadlock.
2022-08-06 22:20:54 +05:30
PixelyIon
68615703c1
Fix KProcess/SetThreadPriority PI CAS
The condition for exiting the CAS loops is incorrect in several places which leads to additional loops, while this doesn't make the behavior incorrect it does lead to redundant iterations. 

Co-authored-by: Billy Laws <blaws05@gmail.com>
2022-08-06 22:20:54 +05:30
PixelyIon
8fc3cc7a16
Rework Descriptor Set Allocation/Updates
A substantial amount of time would be spent on creation/destruction of `VkDescriptorSet` which scales on titles doing a substantial amount of draws with bindings, this leads to poor performance on those titles as the frametime is dragged down by performing these tasks while they repeatedly create descriptor sets of the same layouts.

This commit fixes it by pooling descriptor sets per-layout in a dynamically resizable pool and keeping them around rather than destroying them after usage which leads to the vast majority of cases not requiring a new descriptor set to even be created. It leads to significantly improved performance where it would otherwise be spent on redundant destruction/recreation or push descriptor updates which took a substantial amount of time themselves.

Additionally, the `BaseDescriptorSizes` were not kept up to date with all of the descriptor types, it led to no crashes on Adreno/Mali as they were purely used for size calculations on either driver but has been corrected to avoid any future issues.
2022-08-06 22:20:54 +05:30
PixelyIon
e1a4325137
Introduce FenceCycle Waiter Thread
A substantial amount of time is spent destroying dependencies for any threads waiting or polling `FenceCycle`s, this is not optimal as it blocks them from moving onto other tasks while destruction is a fundamentally async task and can be delayed.

This commit solves this by introducing a thread that is dedicated to waiting on every `FenceCycle` then signalling and destroying all dependencies which entirely fixes the issue of destruction blocking on more important threads.
2022-08-06 22:20:54 +05:30
PixelyIon
5f8619f791
Optimize Buffer Lookups using Range Tables
Buffer lookups are a fairly expensive operation that we currently spend `O(log n)` on the simplest and most frequent case of which is a direct match, this is a very frequent operation where that may be insufficient. This commit optimizes that case to `O(1)` by utilizing a `RangeTable` at the cost of slightly higher insertion/deletion costs for setting ranges of values but these are minimal in frequency compared to lookups.
2022-08-06 22:20:54 +05:30
PixelyIon
578ae86cca
Implement Multi-Level Range Table
A data structure that can represent the same value for a range of addresses (pages) is required for fast lookup in certain cases. This commit implements a near optimal data structure for mass insertion and O(1) lookup of range-based data, this is achieved using the host MMU and implementing multiple levels of atomicity for the ranges. 

It should be noted that the table is limited to two levels but can be extended to a variable amount of ranges in the future, it was determined that additional levels of ranges can be beneficial for performance depending on the specific use-case.
2022-08-06 22:20:54 +05:30
Billy Laws
38eab80ed8
Disable Vulkan Push Descriptors on Adreno
Adreno drivers have certain errata which leads to Vulkan Push Descriptors to be broken on them in certain cases which leads to a descriptor set update being swallowed. This has been worked around by disabling push descriptors on Adreno drivers, this may lead to reduced performance on certain titles which frequently bind new descriptors.
2022-08-06 22:20:54 +05:30
Billy Laws
88fd491ed5
Submit after Maxwell3D Semaphore Release
Any semaphore releases are implicit synchronization events that can be utilized by the guest to pick up that the GPU has executed till a certain point and therefore we must submit all prior work accordingly.
2022-08-06 22:20:54 +05:30
Billy Laws
b77da1182f
Don't flush submission on DMA Copies
DMA copies utilized `SubmitWithFlush` instead of `Submit`, this is not required and incurs significant additional synchronization penalties which will no longer be required.
2022-08-06 22:20:54 +05:30
PixelyIon
0992fde028
Don't block on surface creation in GetTransformHint
We want to avoid blocking on surface creation unless necessary, this commit doesn't wait on the creation of the surface as it default initializes the value which'll generally be `Identity` or the transformation of the previous surface if it was lost.

Co-authored-by: Billy Laws <blaws05@gmail.com>
2022-08-06 22:20:54 +05:30
Billy Laws
35133381b6
Fix V-Sync KEvent construction order
The V-Sync `KEvent` would be used by the presentation thread prior to construction leading to dereferencing an invalid value, this has been fixed by changing the order of construction to move the construction of the presentation thread after the V-Sync event.
2022-08-06 22:20:54 +05:30
PixelyIon
ffad246d67
Split NCE Trap page-out functionality from TrapRegions
The `TrapRegions` function performed a page-out on any regions that were trapped as read-only, this wasn't optimal as it would tie them both into the same operation while Buffers/Textures require to protect then synchronize and page-out. The trap was being moved to after the synchronize to get around this limitation but that can cause a potential race due to certain writes being done after the synchronization but prior to the trap which would be lost. This commit fixes these issues by splitting paging out into `PageOutRegions` which can be called after `TrapRegions` by any API users.

Co-authored-by: Billy Laws <blaws05@gmail.com>
2022-08-06 22:20:54 +05:30
PixelyIon
da464d84bc
Consolidate NCE::TrapRegions functionality into CreateTrap
`NCE::TrapRegions` was a bit too overloaded as a method as it implicitly trapped which was unnecessary in all current usage cases, this has now been made more explicit by consolidating the functionality into `NCE::CreateTrap` which handles just creation of the trap and nothing past that, `RetrapRegions` has been renamed to `TrapRegions` and handles all trapping now.

Co-authored-by: Billy Laws <blaws05@gmail.com>
2022-08-06 22:20:54 +05:30
PixelyIon
8a62f8d37b
Rework Texture Synchronization API + Locking
Similar to `Buffer`s, `Texture`s suffered from unoptimal behavior due to using atomics for `DirtyState` and had certain bugs with replacement of the variable at times where it shouldn't be possible which have now been fixed by moving to using a mutex instead of atomics. This commit also updates the API to more closely match what is expected of it now and removes any functions that weren't utilized such as `SynchronizeGuestWithBuffer`.
2022-08-06 22:20:54 +05:30
Billy Laws
04bcd7e580
Rework Buffer DirtyState with BackingImmutability
Having a single variable denoting the exact state of a buffer and the operations that could be performed on it was found to be too restrictive, it's now been expanded into an additional `BackingImmutability` variable but due to these two. We can no longer use atomics without significant additional complexity so all accesses to the state are now mediated through `stateMutex`, a mutex specifically designed for tracking the state.

While designing the system around `stateMutex` it was determined to be more efficient than atomics as it would enforce blocking far less than it would generally have been compared to if the regular atomic fallback of locking the main resource lock which is locked for significantly longer generally.

Co-authored-by: PixelyIon <pixelyion@protonmail.com>
2022-08-06 22:20:54 +05:30
PixelyIon
1af781c0a5
Add Perfetto Tracing to NCE Trapping API
As a performance sensitive part of code, the NCE Trapping API benefits from having tracing and it helps us better determine where guest code is spending its time for more targeted optimizations.
2022-08-06 22:20:54 +05:30
PixelyIon
9d294b9ccc
Use weak_ptr for TrapHandler Callbacks
The lifetime of the `this` pointer in the trap callbacks could be invalid as the lifetime of the underlying `Buffer`/`Texture` object wasn't guaranteed, this commit fixes that by passing a `weak_ptr` of the objects into the callbacks which is locked during the callbacks and ensures that a destroyed object isn't accessed.

Co-authored-by: Billy Laws <blaws05@gmail.com>
2022-08-06 22:20:54 +05:30
Billy Laws
96d8676d5b
Fix SubmitWithFlush not updating MegaBuffer cycle
The `CommandExecutor`'s `MegaBuffer` was not being updated with the latest `FenceCycle` on being flushed in `SubmitWIthFlush`, this led to the megabuffer being overwritten prior to its GPU-side usage being complete. This commit fixes that by replacing the cycle to the latest cycle and prevents any races that occurred prior.
2022-08-06 22:20:54 +05:30
PixelyIon
3e9d84b0c3
Split FindOrCreate functionality across BufferManager
`FindOrCreate` ended up being monolithic function with poor readability, this commit addresses those concerns by refactoring the function to split it up into multiple member functions of `BufferManager`, while some of these member functions may only have a single call-site they are important to logically categorize tasks into individual functions. The end result is far neater logic which is far more readable and slightly better optimized by virtue of being abstracted better.
2022-08-06 22:20:54 +05:30
PixelyIon
d2a34b5f7a
Implement ContextLock Move-assignment operator
In certain cases the move constructor may not suffice and the move assignment operator is required, this commit implements that and moves to using a pointer for storing the `resource` member rather than a reference as its semantics matched what we desired more and allowed for assignment of the `resource`.
2022-08-06 22:20:54 +05:30
PixelyIon
38d3ff4300
Fix BufferManager::FindOrCreate Recreation Bugs
It was determined that `FindOrCreate` has several issues which this commit fixes:
* It wouldn't correctly handle locking of the newly created `Buffer` as the constructor would setup traps prior to being able to lock it which could lead to UB
* It wouldn't propagate the `usedByContext`/`everHadInlineUpdate` flags correctly
* It wouldn't correctly set the `dirtyState` of the buffer according to that of its source buffers
2022-08-06 22:20:54 +05:30
Billy Laws
d1a682eace
Fix setDirty behavior in Buffer::SynchronizeGuest
The condition for `setDirty` in the dirty state CAS was inverted from what it should've been resulting in synchronizing incorrectly, this commit fixes the condition to correct synchronization.
2022-08-06 22:20:54 +05:30
Billy Laws
00d434efdc
Remove Texture::CopyFrom format check
The formats of the textures involved in a texture were checked for equality, this broke certain copies as the presentation engine would invoke copies between textures of different yet compatible formats.

Co-authored-by: PixelyIon <pixelyion@protonmail.com>
2022-08-06 22:20:54 +05:30
Billy Laws
58174f255f
Improve ContextLock semantics
`ContextLock` had unoptimal semantics in the form of direct access to the `isFirst` member which wasn't clearly defined, it's now been broken up into function calls `IsFirstUsage` and `OwnsLock` with explicit move semantics and a function for releasing the lock.

Co-authored-by: PixelyIon <pixelyion@protonmail.com>
2022-08-06 22:20:54 +05:30
Billy Laws
561103d3da
Submit GPFIFO work prior to CircularQueue waiting
The position at which we call submit is a significant factor in performance and we did so at the end of PBs (PushBuffers), this isn't optimal as there could be multiple PBs queued up that would benefit from being in the same submission. We now delay the submission of the workload till we run out of PBs.
2022-08-06 22:20:54 +05:30
PixelyIon
3ac5ed8c06
Attach coalesced Buffer if any source Buffer is attached
A buffer that's attached to a context could be coalesced into a larger buffer which isn't attached, this would break as it wouldn't keep the buffer alive till the end of the associated context. To fix this if any source buffers are attached then the resulting coalesced buffer is also attached now.
2022-08-06 22:20:54 +05:30
PixelyIon
284ac53d88
Fix KThread Priority Inheritance CAS
The CAS condition for KThread PI was inverted which lead to entirely incorrect behavior for CAS conditions which while it might work in the vast majority of cases would lead to significantly inaccurate behavior.
2022-08-06 22:20:54 +05:30
PixelyIon
45cb8388cc
Fix NCE Trap API Lock Callback
The lock callback would `continue` which would end up skipping over the current item as it applied to the inner loop rather than the outer loop as intended. This has now been fixed by using `break` and a check instead.
2022-08-06 22:20:54 +05:30
PixelyIon
745d809e07
Fix Buffer::SynchronizeGuest Non-Blocking Behavior
The buffer's non-blocking behavior could lead to an invalid state where the dirty state doesn't adequately represent the buffer's true state, the check has now been moved inside the CAS loop as its behavior changes depending on the dirty state. In addition, `SynchronizeGuest` returns a boolean denoting if the synchronization was successful now to make code flows depending on non-blocking synchronization cleaner.
2022-08-06 22:20:54 +05:30
PixelyIon
c1f2445772
Set state to CpuDirty directly in SynchronizeGuest
`SynchronizeGuest` could only set the dirty state to `Clean` which was redundant since calls to it from inside the write trap handler would set it to `CpuDirty` directly after, this fixes that by doing it inside the function when necessary.
2022-08-06 22:20:54 +05:30
PixelyIon
4f6a67af36
Fix Texture Trap Data Race
The trap callbacks did not wait on the `Texture` to complete synchronization to the guest, this resulted in races where the contents written to the texture would be overwritten by the synced content. This commit fixes that by waiting on the fences at the end of the trap callback.
2022-08-06 22:20:54 +05:30
PixelyIon
cb7c3602e7
Attach TextureView to FenceCycle
The lifetime of `TextureView` objects wasn't correctly managed as they weren't being attached the the `FenceCycle` in `AttachTexture`, this led to them getting deleted and causing all sorts of UB.
2022-08-06 22:20:54 +05:30
PixelyIon
ffaefc82d3
Call all flush callbacks prior to CommandExecutor submission
The flush callbacks inside `CommandExecutor` weren't being called prior to submission as they should've been, this fixes that by calling them. It additionally removes the requirement to manually flush Maxwell3D at the end of `ChannelGpfifo` pushbuffers as it's a flush callback and will automatically be called by `Submit`.

Co-authored-by: Billy Laws <blaws05@gmail.com>
2022-08-06 22:20:54 +05:30
PixelyIon
e65707cd9d
Handle CommandExecutor submission at end of ChannelGpfifo PB
Any work that was done in a `ChannelGpfifo` pushbuffer needs to be submitted at the end of it, if it isn't done then the work might incorrectly be not done till the next submission. This commit fixes it by calling `CommandExecutor::Submit` at the end of a pushbuffer, submitting any buffers that would've been left over.

Co-authored-by: Billy Laws <blaws05@gmail.com>
2022-08-06 22:20:54 +05:30
PixelyIon
7b209c54a2
Only reallocate MegaBuffer on usage
Certain submissions might not utilize megabuffering but reserve a `MegaBuffer` regardless, this is not optimal since it can inflate the allocations and waste memory. This commit addresses the issue by eliding the allocation given the current submission doesn't utilize them.
2022-08-06 22:20:54 +05:30
PixelyIon
2366f81443
Fix Buffer::PollFence incorrectly handling null-FenceCycle
If a `FenceCycle` isn't attached then `PollFence` returned `false` while it should return if the buffer has any concurrent GPU usages in flight, this has now been fixed by returning `true` in those cases.
2022-08-06 22:20:54 +05:30
PixelyIon
34e1e39d1c
Always reset all attached resources on Submit
Certain resources can be attached to an empty `Submit` with no nodes, this can cause it to become a false dependency and not be removed till the next non-empty submission. This has now been fixed by doing a reset regardless of if any nodes exist.
2022-08-06 22:20:54 +05:30
PixelyIon
47db8e8cbc
Fix GPU inline copy callback for Buffer::Write
The GPU inline copy callback was broken for `Buffer::Write` as it wasn't always called when it needed to be and didn't handle attaching of the buffer to the executor which would cause it to be unlocked. This commit addresses both of these issues, it introduces a `AttachLockedBuffer` method to attach an already locked buffer to the executor.
2022-08-06 22:20:54 +05:30
PixelyIon
cd969316e9
Don't delete build folder in CI runs
The build folder was being deleted at the end of CI runs but it has to be cached and this deletion wasn't necessary as the disk would be wiped at the end of the CI build, this has now been fixed.
2022-08-06 22:20:54 +05:30
PixelyIon
2636a37b31
Introduce alternative FPS measurement for disabled frame throttling
The FPS is implicitly bound to the refresh rate due to the timestamp being that of the presentation time, this leads to a misleading FPS figure for disabled frame throttling. It has now been fixed by using the frame submission time rather than the presentation time when frame throttling is disabled and to make this more apparent the color of the OSD FPS has been changed.
2022-08-06 22:20:54 +05:30
PixelyIon
0f56d01e58
Fix Packed format component ordering in IsAdrenoAliasCompatible
All `Packed` formats have their components stored in the opposite ordering to the label, this was not followed for `IsAdrenoAliasCompatible` prior and the ordering has now been flipped.
2022-08-06 22:18:42 +05:30
PixelyIon
3ca56ef578
Fix NCE Trapping API Deadlock
A deadlock was caused by holding `trapMutex` while waiting on the lock of a resource inside a callback while another thread holding the resource's mutex waits on `trapMutex`. This has been fixed by no longer allowing blocking locks inside the callbacks and introducing a separate callback for locking the resource which is done after unlocking the `trapMutex` which can then be locked by any contending threads.
2022-08-06 22:18:42 +05:30
PixelyIon
a6599c30b4
Correct IntervalMap insertion end calculation
The `end` pointer for `interval` was incorrectly calculated as `interval.data() + interval.size_bytes()` which would be incorrect when the interval span type is not `u8` as the pointer derived from `interval.data()` would be a pointer to the span type rather than a byte pointer and be subject to arithmetic of that object's size rather than in terms of a byte.
2022-08-06 22:18:42 +05:30
PixelyIon
b0910e7b1a
Avoid locking Texture/Buffer in trap handler
We generally don't need to lock the `Texture`/`Buffer` in the trap handler, this is particularly problematic now as we hold the lock for the duration of a submission of any workloads. This leads to a large amount of contention for the lock and stalling in the signal handler when the resource may be `Clean` and can simply be switched over to `CpuDirty` without locking and utilizing atomics which is what this commit addresses.
2022-08-06 22:18:42 +05:30
PixelyIon
a60d6ec58f
Replace host immutability FenceCycle with GPU usage tracking
We utilized a `FenceCycle` to keep track of if the buffer was mutable or not and introduced another cycle to track GPU-side requirements only on fulfillment of which could the buffer be utilized on the host but due to the recent change in the behavior this system ended up being unoptimal. 

This commit replaces the cycle with a boolean tracking if there are any usages of the resource on the GPU within the current context that may prevent it from being mutated on the CPU. The fence of the context is simply attached to the buffer based off this which was allowed as the new behavior of buffer fences matches all the requirements for this.
2022-08-06 22:18:42 +05:30
PixelyIon
217d484cba
Abstract TextureView/BufferDelegate locking into LockableSharedPtr
An atomic transactional loop was performed on the backing `std::shared_ptr` inside `BufferView`/`TextureView`'s `lock`/`LockWithTag`/`try_lock` functions, these locks utilized `std::atomic_load` for atomically loading the value from the `shared_ptr` recursively till it was the same value pre/post-locking. 

This commit abstracts the locking functionality of `TextureView`/`BufferDelegate` into `LockableSharedPtr` to avoid code duplication and removes the usage of `std::atomic_load` in either case as it is not necessary due to the implicit memory barrier provided by locking a mutex.
2022-08-06 22:18:42 +05:30
PixelyIon
2d08886e4e
Utilize TextureView rather than Texture for presentation
`PresentationEngine` and `GraphicBufferProducer` methods that utilized textures for the surface utilized the `Texture` type rather than the `TextureView` type, this was never correct but at the time of authoring this code `TextureView` was not finalized and in a major flux which is why it was not utilized and `Texture` was utilized instead. Now that is is far more stable, it has been replaced with `TextureView`.
2022-08-06 22:18:42 +05:30
PixelyIon
c25ad6e71a
Update Project Inspection Profile
Android Studio has updated their inspection profile for projects in newer versions, this commit updates it on the repository.
2022-08-06 22:18:42 +05:30
PixelyIon
d7399e33c1
Avoid waiting on mutex in PresentationEngine::Present
We want to block on the host thread during presentation while the host surface isn't present to implicitly pause the game, this can end up being fairly costly as it involves locking the `PresentationEngine` mutex which can lead to a lot of contention with the presentation thread. This fixes the issue by polling if there is a surface and only if there isn't then doing the wait as it isn't mandatory to wait always, we'll eventually run into the guest thread stalling.
2022-08-06 22:18:42 +05:30
PixelyIon
30475ffc43
Fix queueBuffer GraphicBuffer Compatibility Check
Newer versions of the Deko3D homebrew were crashing due to this check and it was discovered that the check was incorrect and rather than comparing the `NvSurface` what had to be compared was the `GraphicBuffer` associated with the slot directly.

Co-authored-by: lynxnb <niccolo.betto@gmail.com>
2022-08-06 22:18:42 +05:30