common/swap: Improve codegen of the default swap fallbacks

Uses arithmetic that compilers can more easily recognize and optimize.
For example, rather than shifting the halves of the value and then
swapping and combining them, we can swap the bytes in place.

For the original swap32 code on x86-64, clang 8.0 would generate:

    mov     ecx, edi
    rol     cx, 8
    shl     ecx, 16
    shr     edi, 16
    rol     di, 8
    movzx   eax, di
    or      eax, ecx
    ret

while GCC 8.3 would generate the ideal:

    mov     eax, edi
    bswap   eax
    ret

Now both generate the same optimal output.

MSVC used to generate the following with the old code:

    mov     eax, ecx
    rol     cx, 8
    shr     eax, 16
    rol     ax, 8
    movzx   ecx, cx
    movzx   eax, ax
    shl     ecx, 16
    or      eax, ecx
    ret     0

Now MSVC also generates a result that is similar to, and just as optimal
as, the clang/GCC output:

    bswap   ecx
    mov     eax, ecx
    ret     0

====

In the swap64 case, for the original code, clang 8.0 would generate:

    mov     eax, edi
    bswap   eax
    shl     rax, 32
    shr     rdi, 32
    bswap   edi
    or      rax, rdi
    ret

(almost there, but still missing the mark)

while, again, GCC 8.3 would generate the more ideal:

    mov     rax, rdi
    bswap   rax
    ret

Now clang also generates the optimal sequence for this fallback.

This is a case where MSVC unfortunately falls short. Even with the new
code, it still generates a doozy of an output:

    mov     r8, rcx
    mov     r9, rcx
    mov     rax, 71776119061217280
    mov     rdx, r8
    and     r9, rax
    and     edx, 65280
    mov     rax, rcx
    shr     rax, 16
    or      r9, rax
    mov     rax, rcx
    shr     r9, 16
    mov     rcx, 280375465082880
    and     rax, rcx
    mov     rcx, 1095216660480
    or      r9, rax
    mov     rax, r8
    and     rax, rcx
    shr     r9, 16
    or      r9, rax
    mov     rcx, r8
    mov     rax, r8
    shr     r9, 8
    shl     rax, 16
    and     ecx, 16711680
    or      rdx, rax
    mov     eax, -16777216
    and     rax, r8
    shl     rdx, 16
    or      rdx, rcx
    shl     rdx, 16
    or      rax, rdx
    shl     rax, 8
    or      rax, r9
    ret     0

which is pretty unfortunate.
2024-11-30 13:24:16 +01:00 · 2019-04-11 21:20:22 -04:00 · b8d43d4dfb
commit b8d43d4dfb
parent 686d067271
1 changed file with 7 additions and 3 deletions
--- a/src/common/swap.h
+++ b/src/common/swap.h
@@ -83,15 +83,19 @@ namespace Common {
    return __builtin_bswap64(data);
 }
 #else
-// Slow generic implementation.
+// Generic implementation.
 [[nodiscard]] inline u16 swap16(u16 data) noexcept {
    return (data >> 8) | (data << 8);
 }
 [[nodiscard]] inline u32 swap32(u32 data) noexcept {
-    return (swap16(data) << 16) | swap16(data >> 16);
+    return ((data & 0xFF000000U) >> 24) | ((data & 0x00FF0000U) >> 8) |
           ((data & 0x0000FF00U) << 8) | ((data & 0x000000FFU) << 24);
 }
 [[nodiscard]] inline u64 swap64(u64 data) noexcept {
-    return ((u64)swap32(data) << 32) | swap32(data >> 32);
+    return ((data & 0xFF00000000000000ULL) >> 56) | ((data & 0x00FF000000000000ULL) >> 40) |
           ((data & 0x0000FF0000000000ULL) >> 24) | ((data & 0x000000FF00000000ULL) >> 8) |
           ((data & 0x00000000FF000000ULL) << 8) | ((data & 0x0000000000FF0000ULL) << 24) |
           ((data & 0x000000000000FF00ULL) << 40) | ((data & 0x00000000000000FFULL) << 56);
 }
 #endif
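
The change above can be sanity-checked outside the tree with a minimal, standalone sketch. It assumes `u16`/`u32`/`u64` are the usual fixed-width aliases (defined locally here, not taken from the project's headers). The names `swap32_nested`, `swap32_masked`, `swap64_masked`, and `check_swaps` are hypothetical and exist only for this illustration. It compares the old "swap the halves" shape with the new mask-and-shift shape; it does not claim anything about which instructions a given compiler emits.

```cpp
#include <cassert>
#include <cstdint>

// Local aliases for illustration; the real project defines its own.
using u16 = std::uint16_t;
using u32 = std::uint32_t;
using u64 = std::uint64_t;

constexpr u16 swap16(u16 data) noexcept {
    return static_cast<u16>((data >> 8) | (data << 8));
}

// Old shape: byte-swap each 16-bit half, then exchange the halves.
constexpr u32 swap32_nested(u32 data) noexcept {
    return (static_cast<u32>(swap16(static_cast<u16>(data))) << 16) |
           swap16(static_cast<u16>(data >> 16));
}

// New shape: isolate each byte with a mask and move it to its mirrored slot.
constexpr u32 swap32_masked(u32 data) noexcept {
    return ((data & 0xFF000000U) >> 24) | ((data & 0x00FF0000U) >> 8) |
           ((data & 0x0000FF00U) << 8) | ((data & 0x000000FFU) << 24);
}

constexpr u64 swap64_masked(u64 data) noexcept {
    return ((data & 0xFF00000000000000ULL) >> 56) | ((data & 0x00FF000000000000ULL) >> 40) |
           ((data & 0x0000FF0000000000ULL) >> 24) | ((data & 0x000000FF00000000ULL) >> 8) |
           ((data & 0x00000000FF000000ULL) << 8) | ((data & 0x0000000000FF0000ULL) << 24) |
           ((data & 0x000000000000FF00ULL) << 40) | ((data & 0x00000000000000FFULL) << 56);
}

// Compile-time checks on known byte patterns.
static_assert(swap32_masked(0x11223344U) == 0x44332211U, "swap32 byte order");
static_assert(swap64_masked(0x1122334455667788ULL) == 0x8877665544332211ULL,
              "swap64 byte order");

// Runtime check: both shapes agree, and swapping twice is the identity.
inline void check_swaps() {
    const u32 samples[] = {0x00000000U, 0x000000FFU, 0xDEADBEEFU, 0xFFFFFFFFU};
    for (u32 v : samples) {
        assert(swap32_masked(v) == swap32_nested(v));
        assert(swap32_masked(swap32_masked(v)) == v);
    }
}
```

Calling `check_swaps()` from a test harness confirms that the rewrite only changes the expression's shape, not its result.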