A data structure that can represent the same value for a range of addresses (pages) is required for fast lookup in certain cases. This commit implements a near optimal data structure for mass insertion and O(1) lookup of range-based data, this is achieved using the host MMU and implementing multiple levels of atomicity for the ranges.
It should be noted that the table is limited to two levels but can be extended to a variable amount of ranges in the future, it was determined that additional levels of ranges can be beneficial for performance depending on the specific use-case.
Adreno drivers have certain errata which leads to Vulkan Push Descriptors to be broken on them in certain cases which leads to a descriptor set update being swallowed. This has been worked around by disabling push descriptors on Adreno drivers, this may lead to reduced performance on certain titles which frequently bind new descriptors.
Any semaphore releases are implicit synchronization events that can be utilized by the guest to pick up that the GPU has executed till a certain point and therefore we must submit all prior work accordingly.
DMA copies utilized `SubmitWithFlush` instead of `Submit`, this is not required and incurs significant additional synchronization penalties which will no longer be required.
We want to avoid blocking on surface creation unless necessary, this commit doesn't wait on the creation of the surface as it default initializes the value which'll generally be `Identity` or the transformation of the previous surface if it was lost.
Co-authored-by: Billy Laws <blaws05@gmail.com>
The V-Sync `KEvent` would be used by the presentation thread prior to construction leading to dereferencing an invalid value, this has been fixed by changing the order of construction to move the construction of the presentation thread after the V-Sync event.
The `TrapRegions` function performed a page-out on any regions that were trapped as read-only, this wasn't optimal as it would tie them both into the same operation while Buffers/Textures require to protect then synchronize and page-out. The trap was being moved to after the synchronize to get around this limitation but that can cause a potential race due to certain writes being done after the synchronization but prior to the trap which would be lost. This commit fixes these issues by splitting paging out into `PageOutRegions` which can be called after `TrapRegions` by any API users.
Co-authored-by: Billy Laws <blaws05@gmail.com>
`NCE::TrapRegions` was a bit too overloaded as a method as it implicitly trapped which was unnecessary in all current usage cases, this has now been made more explicit by consolidating the functionality into `NCE::CreateTrap` which handles just creation of the trap and nothing past that, `RetrapRegions` has been renamed to `TrapRegions` and handles all trapping now.
Co-authored-by: Billy Laws <blaws05@gmail.com>
Similar to `Buffer`s, `Texture`s suffered from unoptimal behavior due to using atomics for `DirtyState` and had certain bugs with replacement of the variable at times where it shouldn't be possible which have now been fixed by moving to using a mutex instead of atomics. This commit also updates the API to more closely match what is expected of it now and removes any functions that weren't utilized such as `SynchronizeGuestWithBuffer`.
Having a single variable denoting the exact state of a buffer and the operations that could be performed on it was found to be too restrictive, it's now been expanded into an additional `BackingImmutability` variable but due to these two. We can no longer use atomics without significant additional complexity so all accesses to the state are now mediated through `stateMutex`, a mutex specifically designed for tracking the state.
While designing the system around `stateMutex` it was determined to be more efficient than atomics as it would enforce blocking far less than it would generally have been compared to if the regular atomic fallback of locking the main resource lock which is locked for significantly longer generally.
Co-authored-by: PixelyIon <pixelyion@protonmail.com>
As a performance sensitive part of code, the NCE Trapping API benefits from having tracing and it helps us better determine where guest code is spending its time for more targeted optimizations.
The lifetime of the `this` pointer in the trap callbacks could be invalid as the lifetime of the underlying `Buffer`/`Texture` object wasn't guaranteed, this commit fixes that by passing a `weak_ptr` of the objects into the callbacks which is locked during the callbacks and ensures that a destroyed object isn't accessed.
Co-authored-by: Billy Laws <blaws05@gmail.com>
The `CommandExecutor`'s `MegaBuffer` was not being updated with the latest `FenceCycle` on being flushed in `SubmitWIthFlush`, this led to the megabuffer being overwritten prior to its GPU-side usage being complete. This commit fixes that by replacing the cycle to the latest cycle and prevents any races that occurred prior.
`FindOrCreate` ended up being monolithic function with poor readability, this commit addresses those concerns by refactoring the function to split it up into multiple member functions of `BufferManager`, while some of these member functions may only have a single call-site they are important to logically categorize tasks into individual functions. The end result is far neater logic which is far more readable and slightly better optimized by virtue of being abstracted better.
In certain cases the move constructor may not suffice and the move assignment operator is required, this commit implements that and moves to using a pointer for storing the `resource` member rather than a reference as its semantics matched what we desired more and allowed for assignment of the `resource`.
It was determined that `FindOrCreate` has several issues which this commit fixes:
* It wouldn't correctly handle locking of the newly created `Buffer` as the constructor would setup traps prior to being able to lock it which could lead to UB
* It wouldn't propagate the `usedByContext`/`everHadInlineUpdate` flags correctly
* It wouldn't correctly set the `dirtyState` of the buffer according to that of its source buffers
The condition for `setDirty` in the dirty state CAS was inverted from what it should've been resulting in synchronizing incorrectly, this commit fixes the condition to correct synchronization.
The formats of the textures involved in a texture were checked for equality, this broke certain copies as the presentation engine would invoke copies between textures of different yet compatible formats.
Co-authored-by: PixelyIon <pixelyion@protonmail.com>
`ContextLock` had unoptimal semantics in the form of direct access to the `isFirst` member which wasn't clearly defined, it's now been broken up into function calls `IsFirstUsage` and `OwnsLock` with explicit move semantics and a function for releasing the lock.
Co-authored-by: PixelyIon <pixelyion@protonmail.com>
The position at which we call submit is a significant factor in performance and we did so at the end of PBs (PushBuffers), this isn't optimal as there could be multiple PBs queued up that would benefit from being in the same submission. We now delay the submission of the workload till we run out of PBs.
A buffer that's attached to a context could be coalesced into a larger buffer which isn't attached, this would break as it wouldn't keep the buffer alive till the end of the associated context. To fix this if any source buffers are attached then the resulting coalesced buffer is also attached now.
The CAS condition for KThread PI was inverted which lead to entirely incorrect behavior for CAS conditions which while it might work in the vast majority of cases would lead to significantly inaccurate behavior.
The lock callback would `continue` which would end up skipping over the current item as it applied to the inner loop rather than the outer loop as intended. This has now been fixed by using `break` and a check instead.
The buffer's non-blocking behavior could lead to an invalid state where the dirty state doesn't adequately represent the buffer's true state, the check has now been moved inside the CAS loop as its behavior changes depending on the dirty state. In addition, `SynchronizeGuest` returns a boolean denoting if the synchronization was successful now to make code flows depending on non-blocking synchronization cleaner.
`SynchronizeGuest` could only set the dirty state to `Clean` which was redundant since calls to it from inside the write trap handler would set it to `CpuDirty` directly after, this fixes that by doing it inside the function when necessary.
The trap callbacks did not wait on the `Texture` to complete synchronization to the guest, this resulted in races where the contents written to the texture would be overwritten by the synced content. This commit fixes that by waiting on the fences at the end of the trap callback.
The lifetime of `TextureView` objects wasn't correctly managed as they weren't being attached the the `FenceCycle` in `AttachTexture`, this led to them getting deleted and causing all sorts of UB.
The flush callbacks inside `CommandExecutor` weren't being called prior to submission as they should've been, this fixes that by calling them. It additionally removes the requirement to manually flush Maxwell3D at the end of `ChannelGpfifo` pushbuffers as it's a flush callback and will automatically be called by `Submit`.
Co-authored-by: Billy Laws <blaws05@gmail.com>
Any work that was done in a `ChannelGpfifo` pushbuffer needs to be submitted at the end of it, if it isn't done then the work might incorrectly be not done till the next submission. This commit fixes it by calling `CommandExecutor::Submit` at the end of a pushbuffer, submitting any buffers that would've been left over.
Co-authored-by: Billy Laws <blaws05@gmail.com>
Certain submissions might not utilize megabuffering but reserve a `MegaBuffer` regardless, this is not optimal since it can inflate the allocations and waste memory. This commit addresses the issue by eliding the allocation given the current submission doesn't utilize them.
If a `FenceCycle` isn't attached then `PollFence` returned `false` while it should return if the buffer has any concurrent GPU usages in flight, this has now been fixed by returning `true` in those cases.
Certain resources can be attached to an empty `Submit` with no nodes, this can cause it to become a false dependency and not be removed till the next non-empty submission. This has now been fixed by doing a reset regardless of if any nodes exist.
The GPU inline copy callback was broken for `Buffer::Write` as it wasn't always called when it needed to be and didn't handle attaching of the buffer to the executor which would cause it to be unlocked. This commit addresses both of these issues, it introduces a `AttachLockedBuffer` method to attach an already locked buffer to the executor.
The FPS is implicitly bound to the refresh rate due to the timestamp being that of the presentation time, this leads to a misleading FPS figure for disabled frame throttling. It has now been fixed by using the frame submission time rather than the presentation time when frame throttling is disabled and to make this more apparent the color of the OSD FPS has been changed.
All `Packed` formats have their components stored in the opposite ordering to the label, this was not followed for `IsAdrenoAliasCompatible` prior and the ordering has now been flipped.
A deadlock was caused by holding `trapMutex` while waiting on the lock of a resource inside a callback while another thread holding the resource's mutex waits on `trapMutex`. This has been fixed by no longer allowing blocking locks inside the callbacks and introducing a separate callback for locking the resource which is done after unlocking the `trapMutex` which can then be locked by any contending threads.
The `end` pointer for `interval` was incorrectly calculated as `interval.data() + interval.size_bytes()` which would be incorrect when the interval span type is not `u8` as the pointer derived from `interval.data()` would be a pointer to the span type rather than a byte pointer and be subject to arithmetic of that object's size rather than in terms of a byte.
We generally don't need to lock the `Texture`/`Buffer` in the trap handler, this is particularly problematic now as we hold the lock for the duration of a submission of any workloads. This leads to a large amount of contention for the lock and stalling in the signal handler when the resource may be `Clean` and can simply be switched over to `CpuDirty` without locking and utilizing atomics which is what this commit addresses.
We utilized a `FenceCycle` to keep track of if the buffer was mutable or not and introduced another cycle to track GPU-side requirements only on fulfillment of which could the buffer be utilized on the host but due to the recent change in the behavior this system ended up being unoptimal.
This commit replaces the cycle with a boolean tracking if there are any usages of the resource on the GPU within the current context that may prevent it from being mutated on the CPU. The fence of the context is simply attached to the buffer based off this which was allowed as the new behavior of buffer fences matches all the requirements for this.
An atomic transactional loop was performed on the backing `std::shared_ptr` inside `BufferView`/`TextureView`'s `lock`/`LockWithTag`/`try_lock` functions, these locks utilized `std::atomic_load` for atomically loading the value from the `shared_ptr` recursively till it was the same value pre/post-locking.
This commit abstracts the locking functionality of `TextureView`/`BufferDelegate` into `LockableSharedPtr` to avoid code duplication and removes the usage of `std::atomic_load` in either case as it is not necessary due to the implicit memory barrier provided by locking a mutex.
`PresentationEngine` and `GraphicBufferProducer` methods that utilized textures for the surface utilized the `Texture` type rather than the `TextureView` type, this was never correct but at the time of authoring this code `TextureView` was not finalized and in a major flux which is why it was not utilized and `Texture` was utilized instead. Now that is is far more stable, it has been replaced with `TextureView`.
We want to block on the host thread during presentation while the host surface isn't present to implicitly pause the game, this can end up being fairly costly as it involves locking the `PresentationEngine` mutex which can lead to a lot of contention with the presentation thread. This fixes the issue by polling if there is a surface and only if there isn't then doing the wait as it isn't mandatory to wait always, we'll eventually run into the guest thread stalling.
Newer versions of the Deko3D homebrew were crashing due to this check and it was discovered that the check was incorrect and rather than comparing the `NvSurface` what had to be compared was the `GraphicBuffer` associated with the slot directly.
Co-authored-by: lynxnb <niccolo.betto@gmail.com>
The copyright headers for external project such as yuzu/Ryujinx were inconsistent in ordering, Skyline should always be the first item in the list. In addition, they didn't always link to the project's GitHub which has also been fixed.
Multiple threads concurrently accessing the `TextureManager`/`BufferManager` (Referred to as "resource managers") has a potential deadlock with a resource being locked while acquiring the resource manager lock while the thread owning it tries to acquire a lock on the resource resulting in a deadlock.
This has been fixed with locking of resource manager now being externally handled which ensures it can be locked prior to locking any resources, `CommandExecutor` provides accessors for retrieving the resource manager which automatically handles locking aside doing so on attachment of resources.
GPU resources have been designed with locking by fences in mind, fences were treated as implicit locks on a GPU, design paradigms such as `GraphicsContext` simply unlocking the texture mutex after attaching it which would set the fence cycle were considered fine prior but are unoptimal as it enforces that a `FenceCycle` effectively ensures exclusivity. This conflates the function of a mutex which is mutual exclusion and that of the fence which is to track GPU-side completion and led to tying if it was acceptable to use a GPU resource to GPU completion rather than simply if it was not currently being used by the CPU which is the function of the mutex.
This rework fixes this with the groundwork that has been laid with previous commits, as `Context` semantics are utilized to move back to using mutexes for locking of resources and tracking the usage on the GPU in a cleaner way rather than arbitrary fence comparisons. This also leads to cleaning up a lot of methods that involved usage of fences that no longer require it and therefore can be entirely removed, further cleaning up the codebase. It also opens the door for future improvements such as the removal of `hostImmutableCycle` and replacing them with better solutions, the implementation of which is broken at the moment regardless.
While moving to `Context`-based locking the question of multiple GPU workloads being in-flight while using overlapping resources came up which brought a fundamental limitation of `FenceCycle` to light which was that only one resource could be concurrently attached to a cycle and it could not adequately represent multi-cycle dependencies. `FenceCycle` chaining was designed to fix this inadequacy and allows for several different GPU workloads to be in-flight concurrently while utilizing the same resources as long as they can ensure GPU-GPU synchronization.
If we want to allow submitting multiple pieces of work to the GPU at once while still requiring CPU synchronization, we'll need to track all past fence cycles associated with a resource alongside the current one. To solve this the concept of chaining fences has been introduced, fences from past usages can be chained to the latest fence which'll then recursively forward operations to chained fences.
This change also ends up mandating a move away from `FenceCycleDependency` as it would prevent fences from concurrently locking the same resources which is required for chaining to work as two fences being chained fundamentally means they're locking the same resources. The `AtomicForwardList` is therefore used as the new container.
An implementation of a singly-linked list with atomic access to allow for lock-free access semantics, it eliminates the requirement for a mutex which can introduce additional consideration for synchronization.
Resources on the GPU can be fairly convoluted and involve overlaps which can lead to the same GPU resources being utilized with different views, we previously utilized fences to lock resources to prevent concurrent access but this was overly harsh as it would block usage of resources till GPU completion of the commands associated with a resource.
Fences have now been replaced with locks but locks run into the issue of being per-view and therefore to add a common object for tracking usage the concept of "tags" was introduced to track a single context so locks can be skipped if they're from the same context. This is important to prevent a deadlock when locking a resource which has been already locked from the current context with a different view.
We do not want to allow saving of user data on unsigned builds as they don't have a stable signature and will not properly handle reinstallation. This can lead to a situation where the user has to resort to complex techniques to completely uninstall the package such as ADB or calling into PM directly.
We currently present all frames synchronously on the thread that calls into SurfaceFlinger functions, this is unoptimal as it doesn't match guest behavior which can lead to delaying the guest from working on the next frame. This commit queuing up frames to non-blocking and handles all waiting then presenting the frame on a dedicated thread.
We utilize `pthread_setname_np` to set the thread names but didn't check for any errors which resulted in the `Skyline-Choreographer` and `ChannelCmdFifo` not having proper names as they exceeded the 16 character limit on thread names for the pthread function. This has now been fixed by changing the names and introducing error checking to invocations of this function.
All our normal alignment functions are designed to only handle power of 2 (`POT`) multiples as we only align or check alignment to `POT` multiples but there are cases where this is not possible and we deal with `NPOT` multiples which is why this function is required.
We waited on the host GPU after `Execute` but this isn't optimal as it causes a major stall on the CPU which can lead to several adverse effects such as downclocking by the governor and losing the opportunity to work in parallel with the GPU.
This has now been fixed by splitting `Execute`'s functionality into two functions: `Submit` and `SubmitWithFlush` which both execute all nodes and submit the resulting command buffer to the GPU but flushing will wait on the GPU to complete while the non-flush variant will not wait and work ahead of the GPU.
We need move-assignment semantics to viably utilize these objects as class members, they cannot be replaced without move-assign (or copy-assign but that is undesirable here). This commit fixes that by introducing a move assignment operator to them while making the `slot` a pointer which has the necessary nullability semantics.
This change lets items get the updated position of their view holder in the adapter. Fixes an issue where the position of items was not updated after being removed from a `SelectableGenericAdapter`.
This preference launches `GpuDriverActivity` for managing custom gpu drivers. When the device has an incompatible GPU, the preference will be disabled and greyed out.
The activity adds the following functionalities:
* Lists installed drivers
* Allows the user to install new drivers, or remove installed ones
* Allows the user to select the driver that will be used by the emulator
At some point we will call Submit within draws or constant buffer updates, to avoid any infinite recursion mark draw/cbuf pending as false before performing any operation
The previous name was chosen as an afterthought and didn't clearly indicate what the purpose of the class is. We needed a separate, simple class without delegates members (like PreferenceSettings), so that its fields can be easily accessed via JNI to get settings values from native code.
The `Settings` class now has a pure virtual `Update` method, and uses inheritance over template specialization for platform-specific behavior override.
A `Setting` delegate class has been introduced, holding the raw value of the setting and adding support for registering callbacks to that setting. Callbacks will then be called when the value of that setting changes.
As a result of this, raw setting values have been made accessible through pointer dereference semantics.
SharedPreferences will be partially swapped out in the future to support per-game settings. In the meantime, make it clear from which class settings are coming from.
Settings are now shared to the native side by passing an instance of the Kotlin's `Settings` class. This way the C++ `Settings` class doesn't need to parse the SharedPreferences xml anymore.
Mali GPU drivers utilize the `ppoll()` syscall inside `waitForFences` which isn't correctly restarted after a signal, which we can receive at any time on a guest thread. This commit fixes that by recursively calling the function on failure till it succeeds or returns an unexpected error.
Co-authored-by: PixelyIon <pixelyion@protonmail.com>
Co-authored-by: Billy Laws <blaws05@gmail.com>
These applets are used by applications to display a custom error message to the user. Both the error message and the detailed error message are printed to the error log.
Co-authored-by: lynxnb <niccolo.betto@gmail.com>
This conforms to the C++ 'Allocator' named requirement allowing it to be used with any STL type and allows drastically reducing allocation times in cases which are suited for linear allocation.
Certain non-indexed quad draws would mistakenly take the indexed quad path because of the assumption that they would not have a bound index buffer. This resulted in a crash for most games using quads due to a faulty exception `Indexed quad conversion is not supported`, when in fact they were not using indexed quads.
Co-authored-by: PixelyIon <pixelyion@protonmail.com>
Co-authored-by: Billy Laws <blaws05@gmail.com>
This commit implements several key optimisations in megabuffering that are all inherently interlinked.
- Megabuffering is moved from per-buffer to per-view copies, this makes megabuffering possible for small views into larger underlying buffers which is often the case with even the simplest of games,
- Megabuffering is no longer the default option, it is only enabled for buffer views that have had inline GPU writes applied to them in the past as that is the only case where they are beneficial. In any other case the cost of copying, even with a 128KiB limit can be significant.
- With both of these changes, there is now possibility for overlapping views where one uses megabuffering and one does not. In order to allow GPU inline writes to work consistently in such cases a system of 'host immutability' has been implemented, when a buffer is marked as host immutable for a given cycle, all writes to the buffer from that point to the point the cycle is signalled will be performed on the GPU, ensuring that the backing contents are correctly sequenced
We don't always have access to CI secrets, such as, when a certain CI action is triggered by a PR from an external repository then it won't have access to secrets and be signed. While we likely will allow for this in the future as all workflows do have to be approved, it is still important to not crash when keys are unavailable and have a graceful fallback for those situations.
Has the same guarantees of pointer stabilty while also being significantly faster in cases where a buffer has thousands of views. This is the case in RE4 and this change leads to an almost 1000% performance improvement in that game.
Uses an API found through RE since none of the AOSP APIs work, additionaly the code for setting RR was consolidated to a single function that can be ran after all display updates.
We currently have a global `MegaBuffer` instance that is shared across all channels, this is very problematic as `MegaBuffer` fundamentally works like a state machine with allocations (especially resetting/freeing) and is thread-specific. Therefore, we now have a pool of several `MegaBuffer`s which is allocated from by the `CommandExecutor` and kept channel specific as a result which also limits its usage to a single thread, this allows for individually resetting or freeing any allocations.
There was a lot of redundant code in the `CommandScheduler` when the same functionality could be achieved with much shorter and cleaner code which this commit fixes. This includes no changes to the user-facing API and does not require any changes on the user side as a result.
Some games remap rendertargets or map them late which would lead to weird graphical bugs or crashes. Drop the caching since VMM lookup is fairly cheap anyway.
The `VkBufferImageCopy` offset calculations were wrong inside `CopyIntoStagingBuffer` as it multiplied the mip level's linear size by `levelCount` rather than `layerCount`. This led to substantial UB in games which called this function as it led to an overflow and resulted in writing to other areas of the buffer which caused major issues such as vertex/index buffer corruption and corresponding graphical glitches alongside likely being the cause of some crashes.
BC7 CPU decoding had the red and blue channels swapped around as it outputted a BGRA image after decoding while we expected an RGBA image to be produced. This should fix the colors of certain textures in titles such as Cuphead or Sonic Forces.
The syncpoint maximum value represents the maximum possible syncpt value at a given time, however due to PBs being submitted before max was incremented, for a brief moment of time this is not the case which could lead to crashes or other such behaviour if a game waits on the fence at the right moment.
We used a `FileProvider` for log sharing prior, this is no longer necessary since it comes under the `DocumentsProvider` now which can be utilized to share the log document directly.
Any documents with the same name existing in a directory that is copied to would cause an exception due to existing already, this fixes that by handling conflict resolution in those cases and automatically determining a file name that would avoid a conflict.
Previously a broken state value was returned from GetState that caused crashes in games using newer SDKs and NFP, correctly handle state now by updating it after initialisation.
We can't render to a 3D texture through a 3D view, we instead have to create a 2D array view into it and render to that. The texture manager previously didn't support having a different view type/layer count between a guest texture view and the underlying storage texture that is required to support this so that was also implemented by reading the view layer count from the dimensions depth instead if the underlying texture is 3D (and the view type is 2D array). Additionally move away from our own view type enum to Vulkan, inline with other guest texture member types.
Sampler anisotropy was made a required feature in an earlier commit due to its widespread availability but this was determined to be incorrect as certain Mali GPUs that can otherwise run 2D games in Skyline do not have this feature, while they are still not officially supported as this was the only roadblock to support them, it has now been made an optional feature.