The Maxwell GPU supports 3D textures which are tiled with the block-linear layout which didn't handle swizzling 3D textures correctly till now. This commit addresses that by implementing proper swizzling for 3D textures. Titles such as Cluster Truck and Super Mario Odyssey utilize 3D textures alongside a vast majority of other titles.
As per VMA docs: 'Allocation size returned in this variable may be greater than the size requested for the resource e.g. as VkBufferCreateInfo::size. Whole size of the allocation is accessible for operations on memory e.g. using a pointer after mapping with vmaMapMemory(), but operations on the resource e.g. using vkCmdCopyBuffer must be limited to the size of the resource.'
There were two issues here:
- If a skyline span was passed as a param then the 'T &object' version would be called, filling the span itself with random values rather than its contents
- Random numbers were repeated every call since independent_bits_engine copied generator state and thus it was never actually updated
This calculation for the amount of lines on the Y axis relative to the start of the last block was wrong and would instead determine the amount of lines to the last Y-axis GOB which wasn't accurate when padding was considered, this resulted in titles like Celeste having broken texture decoding (on a 1922x1082 texture) for the last ROB as most pixels would be masked out.
Certain titles such as BOTW trigger behavior to reuse an attachment within the same subpass, this caused an exception inside `RenderPassNode::AddAttachment` as it cannot find corresponding subpass for attachment. To fix this issue, we now assume that when it cannot find a subpass for an existing attachment, it is attached to the latest subpass and return the attachment.
Certain textures may be unaligned with a GOB's height of 8 lines, we already handle the case of being unaligned with a GOB's width of 64-bytes. This case occurs on titles such as SMO when going in-game.
The function now returns from a segmentation fault when a debugger is present, this allows the entire context to be intact which can allow the debugger to correctly pick up variables from all stack frames while it could not extrapolate most variables when trapped inside the signal handler without the values of all registers.
In the Maxwell 3D engine, instanced draws are implemented by repeating the exact same draw in sequence with special flag set in vertexBeginGl. This flag allows either incrementing the instance counter or resetting it, since we need to supply an instance count to the host API we defer all draws until state changes occur. If there are no state changes between draws we can skip them and count the occurences to get the number of instances to draw.
Implements register state that corresponds to the size of a single point sprite in Maxwell 3D, this is emitted by the shader compiler in the preamble but needs to be only applied if the input topology is a point primitive and it is invalid to set the point size in any other case.
Earlier texture locking design required the lock to be retained but since the introduction of `AttachTexture`, this no longer needs to be done. This being done caused deadlocks when the depth texture is sampled by the fragment shader while being bound as an RT since it would attempt to lock the texture again.
A basic `bcat:u` implementation to prevent titles such as "Kirby and the Forgotten Land" dependent on BCAT support from crashing due to the lack of an implementation.
This is a widely supported feature that games may require conditionally but due to it being supported on effectively all target devices, it was made mandatory. This is used by titles such as ARMS.
Improves the readability of the log and replaces the previously uninformative prefix of `operator()` due to being in a lambda with `Controller support`.
Maxwell3D has a register for linking the TIC/TSC index in bindless texture handles, this is used by games to implement bindless combined texture-sampler handles.
Implements `GraphicsEnvironment::ReadCbufValue` & `GraphicsEnvironment::ReadTextureType` with a framework of heterogeneous lookups for caching and callbacks for querying constant buffer or TIC values with validation checks for successive draws to ensure unique IR is generated.
The `descriptorSetWrites` being filled is now optional and the case of it being empty is handled correctly, this is done by certain titles such as ARMS and is entirely valid behavior. It should be noted that not doing this leads to errors in the guest due to invalid GPU state while working on the host GPU.
SVC `SignalToAddress` had a bug with the behavior of `SignalAndModifyBasedOnWaitingThreadCountIfEqual` which was entirely incorrect and led to deadlocks in titles such as ARMS that were dependent on it. This commit corrects the behavior and refactors both SVCs and moves their arbitration/waiting to inside the corresponding `KProcess` function rather than the SVC to avoid redundancies and improve code readability.
Filtering of validation logs is now extended beyond BCn formats and now covers other format which have their feature set misreported by the driver, this significantly drives down the amount of logs depending on the title.
Implements an algorithm to determine formats that can be aliased as views without needing `VK_IMAGE_CREATE_MUTABLE_FORMAT_BIT`, this avoids spamming warning logs on view creation when the aliased formats will function in practice.
There was an oversight with exclusive subpasses which could lead to RPs with more than one subpass could be created even though one pass was exclusive, this oversight was not finishing the render pass at the end of `AddSubpass`. This could lead to a future subpass adding to the end of that RP even though it was intended to exclusively have a single subpass.
This case occurs in titles such as Celeste (in-game) and breaks rendering on GPUs that may require exclusive subpasses for proper functionality.
The Khronos Validation Layer can often generate warning/error logs due to our intentional breakage from Vulkan specification, these can occur several times a frame resulting in the logs being spammed and making it difficult to extract useful information out of logs. The scope of these logs has now been reduced with more general filtering and the introduction of specialized filtering to handle complex cases such as BCn hacks with `libadrenotools` on Adreno devices.
Descriptor set updates were broken on the non-push-descriptor path due to lifetime issues with VkDescriptorSetLayout's usage during the execution phase which entirely broke rendering on AMD/Mali GPUs due to them not supporting `VK_KHR_push_descriptor`.
This commit addresses that by moving the allocation of a descriptor set to outside the lambda and into the recording phase, it also simplifies the semantics and resources passed into the lambda by removing redundancies.
The Vulkan render pass cache was fundamentally broken since it was designed around the Render Pass Compatibility clause due to being designed for framebuffer compatibility initially. As this scope was extended to a general render pass cache, the amount of data in the key was not extended to include everything it should have. This commit introduces the missing pieces in the RP cache and simplifies the underlying code in the process.
The backing for shader data would implicitly be zero-initialized due to a `resize` on every shader parse, this was entirely unnecessary as we would overwrite the entire range regardless.
We avoid this by using statically allocated storage and a span over it containing the shader bytecode which avoids any unnecessary clear semantics without resorting to more complex solutions such as a custom allocator.
Implements a cache for storing `VkFramebuffer` objects with a special path on devices with `VK_KHR_imageless_framebuffer` to allow for more cache hits due to an abstract image rather than a specific one.
Caching framebuffers is a fairly crucial optimization due to the cost of creating framebuffers on TBDRs since it involves calculating tiling memory allocations and in the case of Adreno's proprietary driver involves several kernel calls for mapping and allocating the corresponding framebuffer memory.
There are a lot of cases of `VkImageView` being recreated arbitrarily due to it being tied to the ephemeral object `TextureView` rather than `Texture`, this commit flips that by storing all `VkImageView`s inside `Texture` with `TextureView` simply holding a copy of the handle to them. Additionally, this change results in stable `VkImageView` handles and helps in paving the path for framebuffer caching when `VK_KHR_imageless_framebuffer` is unavailable.
As we desire more accurate profiling data in certain circumstances, making the app explicitly profilable will allow for this, it will also remove the (annoying) prompt to do this in the Android Studio profiler.
Implements a cache for storing `VkRenderPass` objects which are often reused, they are not extremely expensive to create generally but this is a required step to build up to a framebuffer cache which is an extremely expensive object to create on TBDRs generally since it involves calculating tiling memory allocations and in the case of Adreno's proprietary driver involves several kernel calls for mapping and allocating the corresponding memory.
We run into a lot of successive subpasses with the exact same framebuffer configuration which we now exploit to avoid the creation of a new subpass due to the overhead involved with this. This provides significant performance boosts in certain cases due to the magnitude of difference in the amount of subpasses being created while providing next to no benefit in other cases.
The check for the fence cycle being the same as the current cycle was incorrectly inverted to be the opposite of what it should have been, leading to bugs.
The responsibility for synchronizing a texture and locking it is now on the `PresentationEngine` rather than the API-user as this'll allow more fine grained locking and delay waiting until necessary.
As we require a relaxed version of the Vulkan render pass compatibility clause for caching multi-subpass render passes, we now utilize a quirk to determine if this is supported which it is on Nvidia/Adreno while AMD/Mali where it isn't supported we force single-subpass render passes.
We found out that certain vendors such as Nvidia had a limitation on the global priority of a queue and requesting `VK_QUEUE_GLOBAL_PRIORITY_HIGH_EXT` would result in `VK_ERROR_NOT_PERMITTED_EXT`. A quirk has been introduced to supply the maximum supported global priority which is currently set on a per-vendor basis to avoid future crashes.
Implements a cache for storing `VkPipeline` objects which are fairly expensive to create and doing so on a per-frame basis was rather wasteful and consumed a significant part of frametime. It should be noted that this is **not** compliant with the Vulkan specification and **will** break unless the driver supports a relaxed version of the Vulkan specification's Render Pass Compatibility clause.
We can use inline push descriptors for writing to descriptor rather than allocating a descriptor set for a one time write and freeing it as this is rather inefficient while an inline push descriptor generally ends up being a direct `memcpy` on the driver side designed for this use-case.
We want Skyline to have the most favorable GPU scheduling possible due to low latency and high throughput requirements, we request high priority scheduling due to this reason.
This implements all Maxwell3D registers and HLE Vulkan state for Tessellation including invalidation of the TCS (Tessellation Control Shader) state during state changes.
Previously constant buffer updates would be handled on the CPU and only the end result would be synced to the GPU before execute. This caused issues as if the constant buffer contents was changed between each draw in a renderpass (e.g. text rendering) the draws themselves would only see the final resulting constant buffer.
We had earlier tried to fix this by using vkCmdUpdateBuffer however this caused significant performance loss due to an oversight in Adreno drivers. We could have worked around this simply by using vkCmdCopy buffer however there would still be a performance loss due to renderpasses being split up with copies inbetween.
To avoid this we introduce 'megabuffers', a brand new technique not done before in any other switch emulators. Rather than replaying the copies in sequence on the GPU, we take advantage of the fact that buffers are generally small in order to replay buffers on the GPU instead. Each write and subsequent usage of a buffer will cause a copy of the buffer with that write, and all prior applied to be pushed into the megabuffer, this way at the start of execute the megabuffer will hold all used states of the buffer simultaneously. Draws then reference these individual states in sequence to allow everything to work without any copies. In order to support this buffers have been moved to an immediate sync model, with synchronisation being done at usage-time rather than execute (in order to keep contents properly sequenced) and GPU-side writes now need to be explictly marked (since they prevent megabuffering). It should also be noted that a fallback path using cmdCopyBuffer exists for the cases where buffers are too large or GPU dirty.
As bindings weren't correctly handled due to the fact that `EmitSPIRV` would change the bindings, the shader module cache would not correctly function and have no cache hits in `find` and rather have them in `try_emplace` which negated any performance benefit of it. This has now been fixed by retaining the initial cache key for insertion into the cache while also storing the post-emit bindings and restoring them during a cache hit.
Implements caching of the compiled shader module (`VkShaderModule`) in an associative map based on the supplied IR, bindings and runtime state to avoid constant recompilation of shaders. This doesn't entirely address shader compilation as an issue since host shader compilation is tied to Vulkan pipeline objects rather than Vulkan shader modules, they need to be cached to prevent costly host shader recompilation.
This implements the first step of a full shader cache with caching any IR by treating the shared pointer as a handle and key for an associative map alongside hashing the Maxwell shader bytecode, it supports both single shader program and dual vertex program caching.
We desire the ability to hash and check equality of data across spans to use associative containers such as `std::unordered_map` with spans. The implemented functions provide an easy way to do that.
Mostly based off of yuzu's implementation, this will need to be extended in the future to open up a UI for configuring controllers according to the applications requirements.
As there was no check for the lack of a `GuestTexture`/`GuestBuffer`, it would lead to UB when a texture/buffer that had no guest such as the `zeroTexture` from `GraphicsContext` would be marked as dirty they would cause a call to `NCE::RetrapRegions` with a `nullptr` handle that would be dereferenced and cause a segmentation fault.
In certain situations such as constant buffer updates, we desire to use the guest buffer as a shadow buffer forwarding all writes directly to it while we update the host using inline buffer updates so they happen in-sequence. This requires special behavior as we cannot let any synchronization operations take place as they would break the shadow buffer, as a result, an external synchronization flag has been added to prevent this from happening.
It should be noted that this flag is not respected for buffer recreation which will lead to UB, this can and will break updates in certain cases and this change isn't complete without buffer manager support.
The offset of the view wasn't added to the `vkCmdUpdateBuffer`, this would cause the offset to be incorrect given the buffer was a view of a larger buffer that wasn't the start of it. This commit fixes that by adding the offset of the view to the buffer update.
We didn't call `MarkGpuDirty` on textures/buffers prior to GPU usage, this would cause them to not be R/W protected when they should be and provide outdated copies if there were any read accesses from the CPU (which are not possible at the moment since we assume all accesses are writes at the moment). This has now been fixed by calling it after synchronizing the resource.
The terminology "Non-Graphics pass" was deemed to be fairly inaccurate since it simply covered all Vulkan commands (not "passes") outside the render-pass scope, these may be graphical operations such as blits and therefore it is more accurate to use the new terminology of "Outside-RenderPass command" due to the lack of such an implication while being consistent with the Vulkan specification.
Previously constant buffer updates would be handled on the CPU and only the end result would be synced to the GPU before execute. This caused issues as if the constant buffer contents was changed between each draw in a renderpass (e.g. text rendering) the draws themselves would only see the final resulting constant buffer. Fix this by updating cbufs on the GPU/CPU seperately, only ever syncing them back at the start or after a guest side CPU write, at the moment only a single word is updated at a time however this can be optimised in the future to batch all consecutive updates into one large one.
We require certain buffers to only be on the host while being accessible through the same abstractions as a guest buffer as they must be interchangeable in usage.
We needed to block stack frame lookups past JNI code as Java doesn't follow the ARMv8 frame pointer ABI which leads to invalid pointer dereferences. Any JNI function that throws or handles exceptions must do this now or it may lead to a `SIGSEGV`.
Some games may pass empty TICs as inputs to shaders while not actually using them within the shader. Create an empty texture and pass this in instead when we hit this case, the nullDescriptor feature could be used but it's not supported by all devices so we chose to do it this way instead.
Skyline's `exception` class now stores a list of all stack frames during the invocation of the exception. These can later be parsed by the exception handler to generate a human-readable stack trace. To assist with more complete stack traces, `-fno-omit-frame-pointer` is now passed on debug builds which forces the inclusion of frames on function calls.
NCE is implicitly depended on by the `GPU` class due to the NCE Memory Trapping API so the destruction of it must take place after the destruction of the `GPU` class. Additionally, to prevent bugs the NCE destructor must set `staticNce` to `nullptr` as the signal handler will potentially access a destroyed instance of NCE otherwise.
Without this sRGB textures would be interpreted as RGB leading to colours being slighly off. The sRGB flag isn't stored as part of format word so we reuse the _pad_ field of it to store the flag for the switch case.
We don't want to actually exit the process as it'll automatically be restarted gracefully due to a timeout after being unable to exit within a fixed duration so we just want to infinite sleep during termination. This should fix issues where exiting any game would cause the app to force close after some time as exception signal handling would fail in the background, the app should stay open now and automatically restart itself when another game is loaded in.
A lot of logs are incomplete due to being unable to flush inside the signal handler, now we flush after any exceptions so that there is a guarantee of any exceptions being logged as this is crucial for proper debugging.
B5G6R5 isn't generally supported by the swapchain and the format is used for R5G6B5 with swapped R/B channels to avoid aliasing so we reverse that by using R5G6B5 as the underlying Vulkan format for the swapchain which should be automatically handled by the driver for any copies from B5G6R5 textures and the data representation should be the same as B5G6R5 with swapped R/B channels so not reporting the correct texture::Format should be fine.
The DMA engine is used to perform DMA buffer/texture copies directly on the GPU. It can deswizzle arbritary regions of input textures, perform component remapping and swizzle into output textures.
This impl only supports 1D buffer copies, 2D ones will come later.
If we have a Nx1x1 image then determining the type from dimensions will result in a 1D image being created thus preventing us from creating a 2D view. By using the image view type we can avoid this for textures from TICs since we know in advance how they will be used
This enforces that the depth RT outlives the draw, without this the depth RT could be freed while in active use by command executor leading to UAFs and crashes.
This was erroneously included while migrating from older code where stack creation was entirely handled with host constructs such as `mmap` directly to using `KPrivateMemory` to manage it, we would create a guard page with `mprotect` that the guest was unaware about and would cause a segfault when a guest accessed the extents of the stack as reported to the guest.
A partial implementation of the `GetThreadContext3` SVC, we cannot return the whole thread context as the kernel only stores the registers we need according to the ARMv8 ABI convention and so far usages of this SVC do not require the unavailable registers but all future usage must be monitored and potentially require extending the amount of saved registers.
The vibration device had to be set manually prior which led to it generally not being set at all even though a user might want vibration, this commit fixes that by making controller #0 use the built-in vibrator by default.
Any Skyline files that should have been user-accessible were moved from `/data/data/skyline.emu/files` to `/sdcard/Android/data/skyline.emu/files` as the former directory is entirely private and cannot be accessed without either adb or root. This made retrieving certain data such as saves or loading custom driver shared objects extremely hard to do while this can be trivially done now.
In some games such as SMO thousands of constant buffers are bound per frame which was causing an unreasonable number of lookups in both vmm and the buffer manager. Work around this by introducing a simple hashmap based cache, eviction is currently unsupported but not really necessary yet due to the small size of the buffers in the cache.
We cannot ignore accesses from the host to a region protected by the NCE Memory Trapping API, there's often access to regions which have overlap with a protected region unintentionally and those accesses need to be handled correctly rather than leading to a crash. This is done by implementing an additional signal handler `NCE::HostSignalHandler` to lookup any potential traps on a `SIGSEGV` and handle them correctly or when there isn't a corresponding trap raise a `SIGTRAP` when debugger is connected or delegate to `signal::ExceptionalSignalHandler` when it isn't.
To cut down memory usage we now page out memory that is RW trapped via the NCE memory trapping API, the callbacks are supposed to page in the memory. This behavior is backed up by Texture/Buffer syncing which would read the host copies of data and write it to the guest, by paging the corresponding data on the guest we're avoiding redundant memory usage.
The `FileDescriptor` class is a RAII wrapper over FDs which handles their lifetimes alongside other C++ semantics such as moving and copying. It has been used in `skyline::kernel::MemoryManager` to handle the lifetime of the ashmem FD correctly, it wasn't being destroyed earlier which can result in leaking FDs across runs.
Initially this commit was only intended to update LLVM but due to a compilation error on latest LLVM libcxx due to the C++ stdlib header `<algorithm>` being a transitive dependency that is no longer transitively included on the latest LLVM libcxx (as of https://reviews.llvm.org/D119667), this required changes in Skyline and Oboe which were done in https://github.com/google/oboe/pull/1521 and the submodule has been updated to include those changes.
These are mostly used in 3D games like SMO, support is still quite basic and synchronising block linear 3D texture will crash in most cases due to them being unimplemented.
Some games crash due to requiring an `audren` version greater than 7. The `audren` version can be increased without any issues as `audren` is stubbed and therefore the reported version doesn't matter.
Older Adreno proprietary drivers (5xx and below) will segfault while destroying the renderpass and associated objects if more than 64 subpasses are within a renderpass due to internal driver implementation details. This commit introduces checks to automatically break up a renderpass when that limit is hit.
We have support for overlapping buffers which allows us to merge a lot of smaller buffers located on a single page into a single larger buffer which allows for better performance. It additionally ensures that all host buffers match the alignment guarantees of the guest and adequately fulfill host alignment requirements.
This commit encapsulates a complex sequence of cascading changes in the process of supporting overlaps for buffers:
* We determined that it is impossible to resolve overlaps with multiple intervals per buffer within the constraints of each overlap being a contiguous view, support for multiple intervals was therefore dropped. The older buffer manager code was entirely reworked to be simpler due to only handling one interval per buffer with code now being based off `IntervalMap` but tailored specifically for buffers.
* During overlap resolution, the problem of how existing views into the buffer being recreated would be updated, it had to be replaced with a larger buffer that could contain all overlaps and all existing views would need to be repointed to it. This was addressed by a buffer owning all views to itself, we could automatically recalculate the offset of all views and update the buffers with it.
* We still needed to update usage of existing views which was done by handling all access (such as inside a recorded draw) to buffer view properties via `BufferView::RegisterUsage` which dispatches a callback with the view and the corresponding backing buffer. This callback can be stored and called during overlap resolution with the new buffer.
* We had issues with lifetime of the buffer with the handle-like semantics of `BufferView` introduced in the last buffer-related commit, if we updated the view to be owned by a new buffer we'd need to extend the lifetime of the new buffer not the older one and the only way to do this was a proxy owner object `BufferDelegate` which holds a shared pointer to the real `Buffer` which in-turn holds a pointer to all `BufferDelegate` objects to update on repointing. A `BufferView` is effectively just a wrapper around `std::shared_ptr<BufferDelegate>` with more favorable semantics but generally just forwarding calls.
It should be additionally noted that to support usage of `RegisterUsage` the code around buffers in `GraphicsContext` was refactored to defer truly binding till the recording phase.
Due to an oversight, we weren't clearing the list of buffers that needed to be synced after every execution which led to them building up. Due to the relatively cheap synchronization of buffers and only doing so on faults this wasn't caught until now, it does depress the framerate significantly over time due to the size of the list growing to be in the range of 100k buffer views depending on the title.
The Kepler compute engine is used to run compute jobs encapsulated in to QMDs on the GPU, this commit doesn't implement compute itself but adds the register and QMD structs that will be needed for it in the future.