Squashed 'external/parallel-rdp/parallel-rdp-standalone/' content from commit 3f59f61f2c

git-subtree-dir: external/parallel-rdp/parallel-rdp-standalone
git-subtree-split: 3f59f61f2c1c56424356003041df5e4a10612049
This commit is contained in:
SimoneN64
2024-09-14 16:23:58 +02:00
commit d94eccab99
188 changed files with 126627 additions and 0 deletions

COMMIT Normal file

@@ -0,0 +1 @@
fe5becd13638873db90d46e7ba7d48255971f82a

LICENSE Normal file

@@ -0,0 +1,20 @@
Copyright (c) 2020 Themaister
Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:
The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

README.md Normal file

@@ -0,0 +1,265 @@
# paraLLEl-RDP
This project is a revival and complete rewrite of the old, defunct paraLLEl-RDP project.
The goal is to implement the Nintendo 64 RDP graphics chip as accurately as possible using Vulkan compute.
The implementation aims to be bitexact with the
[Angrylion-Plus](https://github.com/ata4/angrylion-rdp-plus) reference renderer where possible.
## Disclaimer
While paraLLEl-RDP uses [Angrylion-Plus](https://github.com/ata4/angrylion-rdp-plus)
as an implementation reference, it is not a port, and not a derived codebase of said project.
It is written from scratch by studying [Angrylion-Plus](https://github.com/ata4/angrylion-rdp-plus)
and trying to understand what is going on.
The test suite uses [Angrylion-Plus](https://github.com/ata4/angrylion-rdp-plus) as a reference
to validate the implementation and cross-check behavior.
## Use cases
- **Much** faster LLE RDP emulation of N64 compared to a CPU implementation
as parallel graphics workloads are offloaded to the GPU.
Emulation performance is now completely bound by CPU and LLE RSP performance.
Early benchmarking results suggest 2000 - 5000 VI/s being achieved on mid-range desktop GPUs based on timestamp data.
There is no way the CPU emulation can keep up with that, but that means this should
scale down to fairly gimped GPUs as well, assuming the driver requirements are met.
- A backend renderer for standalone engines which aim to efficiently reproduce faithful N64 graphics.
- Hopefully, an easier-to-understand implementation than the reference renderer.
- An esoteric use case of advanced Vulkan compute programming.
## Missing features
The implementation is quite complete, and compatibility is very high in the limited amount of content I've tested.
However, not every single feature is supported at this moment.
Ticking the last boxes depends mostly on real content making use of said features.
- Color combiner chroma keying
- Various "bugs" / questionable behavior that seems meaningless to emulate
- Certain extreme edge cases in TMEM upload. The implementation has tests for many "crazy" edge cases though.
- ... possibly other obscure features
The VI is essentially complete. A fancy deinterlacer might be useful to add since we have plenty of GPU cycles to spare in the graphics queue.
The VI filtering is always turned on if the game requests it, but features can be selectively turned off for the pixel purists.
## Environment variables for development / testing
### `RDP_DEBUG` / `RDP_DEBUG_X` / `RDP_DEBUG_Y`
Enables printf in shaders, which is extremely useful for drilling down into difficult bugs.
Printfs can be restricted to certain pixels to avoid spam.
### `VI_DEBUG` / `VI_DEBUG_X` / `VI_DEBUG_Y`
Same as `RDP_DEBUG` but for the VI.
### `PARALLEL_RDP_MEASURE_SYNC_TIME`
Measures time stalled in `CommandProcessor::wait_for_timeline`. Useful to measure
CPU overhead in hard-synced emulator integrations.
### `PARALLEL_RDP_SMALL_TYPES=0`
Force-disables 8/16-bit arithmetic support. Useful when suspecting driver bugs.
### `PARALLEL_RDP_UBERSHADER=1`
Forces the use of ubershaders. Can be extremely slow depending on the shader compiler.
### `PARALLEL_RDP_FORCE_SYNC_SHADER=1`
Disables the async pipeline optimization and blocks on every shader compile.
Only use this if the ubershader crashes, since it adds the dreaded shader compilation stalls.
### `PARALLEL_RDP_BENCH=1`
Measures RDP rendering time spent on GPU using Vulkan timestamps.
At the end of a run, it reports the average time spent per render pass
and how many render passes are flushed per frame.
### `PARALLEL_RDP_SUBGROUP=0`
Force-disables use of Vulkan subgroup operations,
which are used to optimize the tile binning algorithm.
### `PARALLEL_RDP_ALLOW_EXTERNAL_HOST=0`
Disables use of `VK_EXT_external_memory_host`. For testing.
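All of these toggles follow the same boolean environment variable pattern. As a rough sketch of how such a toggle could be parsed (`env_flag` is a hypothetical helper for illustration, not the actual paraLLEl-RDP code):

```cpp
#include <cstdlib>
#include <cstring>

// Rough sketch of how a boolean toggle such as PARALLEL_RDP_UBERSHADER=1
// could be parsed. env_flag is a hypothetical helper, not paraLLEl-RDP API.
static bool env_flag(const char *name, bool default_value)
{
	const char *value = getenv(name);
	if (!value)
		return default_value;
	// "0" disables, anything else enables.
	return strcmp(value, "0") != 0;
}
```

For example, `env_flag("PARALLEL_RDP_SUBGROUP", true)` would stay true unless the variable is explicitly set to `0`.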
## Vulkan driver requirements
paraLLEl-RDP requires up-to-date Vulkan implementations. A lot of the great improvements over the previous implementation
come from the idea that we can implement the N64's UMA by simply importing RDRAM directly as an SSBO and performing 8- and 16-bit
data access over the bus. With the tile-based architecture in paraLLEl-RDP, this works very well, and actual
PCI-e traffic is massively reduced. The bandwidth required for this is also trivial. On iGPU systems this works really well too, since
it's all the same memory anyway.
Thus, the requirements are as follows. All of these features are widely supported, or will soon be in drivers.
paraLLEl-RDP does not aim for compatibility with ancient hardware and drivers.
Just use the reference renderer for that. This is enthusiast software for a niche audience.
- Vulkan 1.1
- VK_KHR_8bit_storage / VK_KHR_16bit_storage
- Optionally VK_KHR_shader_float16_int8 which enables small integer arithmetic
- Optionally subgroup support with VK_EXT_subgroup_size_control
- For integration in emulators, VK_EXT_external_memory_host is currently required (may be relaxed later at some performance cost)
### Tested drivers
paraLLEl-RDP has been tested on Linux and Windows on all desktop vendors.
- Intel Mesa (20.0.6) - Passes conformance
- Intel Windows - Passes conformance (**CAVEAT**. Intel Windows requires 64 KiB alignment for host memory import, make sure to add some padding around RDRAM in an emulator to make this work well.)
- AMD RADV LLVM (20.0.6) - Passes conformance
- AMD RADV ACO - Passes conformance with bleeding edge drivers and `PARALLEL_RDP_SMALL_TYPES=0`.
- Linux AMDGPU-PRO - Passes conformance, with the caveat that 8/16-bit arithmetic does not work correctly for some tests.
paraLLEl-RDP automatically disables small integer arithmetic for the proprietary AMD drivers.
- AMD Windows - Passes conformance with the same caveat and workaround as AMDGPU-PRO.
- NVIDIA Linux - Passes conformance (**MAJOR CAVEAT**, NVIDIA Linux does not support VK_EXT_external_memory_host as of 2020-05-12.)
- NVIDIA Windows - Passes conformance
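The Intel Windows caveat boils down to simple alignment math: with `VK_EXT_external_memory_host`, the host pointer and allocation size handed to the driver must be multiples of the driver's `minImportedHostPointerAlignment` (64 KiB there). A minimal sketch of that padding computation, using hypothetical helper names rather than paraLLEl-RDP's actual API:

```cpp
#include <cstddef>
#include <cstdint>

// Sketch of the padding implied by VK_EXT_external_memory_host: the imported
// host pointer must be rounded down and the size rounded up so both are
// multiples of minImportedHostPointerAlignment. Hypothetical helper, not
// paraLLEl-RDP API.
struct ImportRange
{
	uintptr_t base; // pointer rounded down to the alignment
	size_t size;    // size rounded up so the range still covers all of RDRAM
};

static ImportRange compute_import_range(uintptr_t rdram, size_t rdram_size, size_t alignment)
{
	uintptr_t mask = ~(uintptr_t(alignment) - 1);
	uintptr_t base = rdram & mask;
	uintptr_t end = (rdram + rdram_size + alignment - 1) & mask;
	return { base, size_t(end - base) };
}
```

This is why the README suggests adding padding around RDRAM in an emulator: the aligned range can extend beyond the original allocation unless extra bytes are reserved.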
## Implementation strategy
This project uses Vulkan compute shaders to implement a fully programmable rasterization pipeline.
The overall rendering architecture is reused from [RetroWarp](https://github.com/Themaister/RetroWarp)
with some further refinements.
The lower level Vulkan backend comes from [Granite](https://github.com/Themaister/Granite).
### Asynchronous pipeline optimization
Toggleable paths in RDP state are expressed as specialization constants. The rendering thread will
detect new state combinations and kick off building pipelines which specify only the exact state needed to render.
This is a massive performance optimization.
The same shaders are used for an "ubershader" fallback when pipelines are not ready.
In this case, specialization constants are simply not used.
The same SPIR-V modules are reused to great effect using this Vulkan feature.
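The ready-or-fallback decision can be sketched as follows. This uses illustrative types only; the real pipeline cache compiles asynchronously on worker threads and lives in the Vulkan backend:

```cpp
#include <cstdint>
#include <unordered_map>

// Illustrative sketch (not paraLLEl-RDP's actual types) of the fallback
// pattern: look up a pipeline specialized for the packed render state; if the
// async compile has not finished yet, render with the generic ubershader
// instead of stalling.
struct Pipeline { bool specialized; };

class PipelineCache
{
public:
	const Pipeline &query(uint32_t state_bits)
	{
		auto itr = pipelines.find(state_bits);
		if (itr != pipelines.end())
			return itr->second;
		// A real implementation would enqueue an async compile here;
		// we pretend it completes before the next query.
		pipelines.emplace(state_bits, Pipeline{ true });
		return ubershader;
	}

private:
	std::unordered_map<uint32_t, Pipeline> pipelines;
	Pipeline ubershader{ false };
};
```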
### Tile-based rendering
See [RetroWarp](https://github.com/Themaister/RetroWarp) for more details.
### GPU-driven TMEM management
TMEM management is fully GPU-driven, but the implementation is very complicated.
Certain combinations of formats are not supported, but such cases would produce
meaningless results, and it is unclear whether applications can make meaningful use of these "weird" uploads.
### Synchronization
Synchronizing the GPU and CPU emulation is one of the hot-button issues of N64 emulation.
The integration code is designed around a timeline of synchronization points which can be waited on by the CPU
when appropriate. For accurate emulation, an OpSyncFull is generally followed by a full wait,
but most games can be more relaxed and only synchronize with the CPU N frames later.
Implementation of this behavior is outside the scope of paraLLEl-RDP, and is left up to the integration code.
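As a minimal sketch of the timeline idea (a hypothetical `Timeline` type, not the actual `CommandProcessor::wait_for_timeline` API): GPU completion signals monotonically increasing values, and the CPU blocks only when it needs to observe results.

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>

// Hypothetical timeline primitive illustrating the synchronization scheme:
// the render side signals increasing values as work completes, and the
// emulator waits for a value only when accuracy demands it.
class Timeline
{
public:
	void signal(uint64_t value)
	{
		std::lock_guard<std::mutex> holder{lock};
		completed = value;
		cond.notify_all();
	}

	void wait_for(uint64_t value)
	{
		std::unique_lock<std::mutex> holder{lock};
		cond.wait(holder, [&]() { return completed >= value; });
	}

private:
	std::mutex lock;
	std::condition_variable cond;
	uint64_t completed = 0;
};
```

A relaxed integration would call `wait_for(frame - N)` to lag N frames behind rather than syncing fully on every `OpSyncFull`.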
### Asynchronous compute
A GPU with a dedicated compute queue is recommended for optimal performance, since
RDP shading work can happen on the compute queue and won't be blocked by graphics workloads happening
in the graphics queue, which will typically be VI scanout and the frontend applying shaders on top.
## Project structure
This project includes several subprojects which are quite useful.
### rdp-replayer
This app replays RDP dump files, which are produced by running content through an RDP dumper.
An implementation can be found in e.g. parallel-N64. The file format is very simple and essentially
contains a record of RDRAM changes and RDP command streams.
The dump is replayed, and the output of the reference renderer can be compared live against paraLLEl-RDP
with visual output. The UI is extremely crude and not user-friendly, but good enough for my use.
### rdp-conformance
I made a somewhat comprehensive test suite for the RDP, with a custom higher-level RDP command stream generator.
There are roughly 150 fuzz tests which exercise many aspects of the RDP.
In order to pass the test, paraLLEl-RDP must produce bit-exact results compared to Angrylion,
so the test condition is as stringent as possible.
#### A note on bitexactness
There are a few cases where bit-exactness is a meaningless term, such as the noise feature of the RDP.
It is not particularly meaningful to exactly reproduce noise, since it is by its very nature unpredictable.
For that reason, this repo references a fork of the reference renderer which implements deterministic "undefined behavior"
where appropriate. The exact formulation of the noise generator is not very interesting as long as
the correct entropy and output range are reproduced.
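For illustration only, a deterministic noise source with those properties could be as simple as a seeded xorshift. This is not the formulation used by the fork; it merely demonstrates "deterministic, full-range noise":

```cpp
#include <cstdint>

// Purely illustrative deterministic noise: a per-pixel xorshift32 seeded from
// position and frame counter, so two renderers agree exactly. This is NOT
// the formulation used by the forked reference renderer.
static uint32_t xorshift32(uint32_t seed)
{
	seed ^= seed << 13;
	seed ^= seed >> 17;
	seed ^= seed << 5;
	return seed;
}

// 9-bit noise value, e.g. as a dither / combiner noise input.
static uint32_t pixel_noise(uint32_t x, uint32_t y, uint32_t frame)
{
	return xorshift32(x * 7919u + y * 104729u + frame * 15485863u + 1u) & 0x1ffu;
}
```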
##### Intentional differences from reference renderer
Certain effects invoke "undefined behavior" in the RDP and require cycle accuracy to resolve bit-accurately against a real RDP.
The reference renderer attempts to emulate these effects, but reproducing this behavior breaks any form of multi-threading.
To be able to validate dumps in a sensible way with buggy content, I modified the reference slightly to make certain
"undefined behavior" deterministic. This doesn't meaningfully change the rendered output in the cases I've seen in the wild.
Some of these effects would be possible to emulate,
but at the cost of a lot of added complexity, and it wouldn't be quite correct anyway given the cycle accuracy issue.
- CombinedColor/Alpha in the first cycle is cleared to zero. Some games read this in the first cycle,
and the reference renderer will read whatever was generated for the last pixel.
This causes issues in some cases where cycle accuracy would have caused the feedback to converge to zero over time.
- Reading LODFrac in 1-cycle mode. This is currently ignored. The results generated seem nonsensical. Never seen this in the wild.
- Using TexLOD in copy mode. This is currently ignored. The results generated seem nonsensical. Never seen this in the wild.
- Reading MemoryColor in the first blender cycle in 2-cycle mode. The reference seems to wait until the second cycle before updating this value,
despite memory coverage being updated right away. The sensible thing to do is to allow reading memory color in the first cycle.
- Alpha testing in 2-cycle mode reads combined alpha from next pixel in reference.
Just doing alpha testing in first cycle on current pixel is good enough.
If this is correct hardware behavior, I consider this a hardware bug.
- Reading Texel1 in cycle 1 of 2-cycle mode reads Texel0 from the next pixel.
In the few cases I've seen this, the rendered output is slightly buggy, but it's hardly visible in motion.
The workaround is just to read Texel0 from the current pixel, which still renders fine.
### vi-conformance
This is the same kind of conformance suite, but for the video interface (VI) unit.
### rdp-validate-dump
This tool replays an RDP dump headless and compares outputs between reference renderer and paraLLEl-RDP.
To pass, bitexact output must be generated.
## Build
Check out submodules. This pulls in Angrylion-Plus as well as Granite.
```
git submodule update --init --recursive
```
Standard CMake build.
```
mkdir build
cd build
cmake ..
cmake --build . --parallel (--config Release on MSVC)
```
### Run test suite
You can run rdp-conformance and vi-conformance with ctest to verify that your driver is behaving correctly.
```
ctest (-C Release on MSVC)
```
### Embedding shaders in a C++ header
If embedding paraLLEl-RDP in an emulator project, it is helpful to pre-compile and bake SPIR-V shaders into a C++ header.
Build slangmosh from Granite, and then run:
```
slangmosh parallel-rdp/shaders/slangmosh.json --output slangmosh.hpp --vk11 --strip -O --namespace RDP
```
### Generating a standalone code base for emulator integration
Run the `generate_standalone_codebase.sh $OUTDIR` script, where `$OUTDIR/` is the output directory, to generate a standalone code base which can be built without any special build system support.
Include `$OUTDIR/config.mk` if building with Make to make your life easier.
Note that `slangmosh` must be in your path for this script to run. It executes the command above to build `slangmosh.hpp`.
## License
paraLLEl-RDP is licensed under the permissive MIT license. See the included LICENSE file.
This implementation builds heavily on the knowledge (but not code) gained from studying the reference implementation,
thus it felt fair to release it under a permissive license, so my work could be reused more easily.

config.mk Normal file

@@ -0,0 +1,56 @@
# For use in standalone implementations.
PARALLEL_RDP_CFLAGS :=
PARALLEL_RDP_CXXFLAGS :=
PARALLEL_RDP_SOURCES_CXX := \
$(wildcard $(PARALLEL_RDP_IMPLEMENTATION)/parallel-rdp/*.cpp) \
$(PARALLEL_RDP_IMPLEMENTATION)/vulkan/buffer.cpp \
$(PARALLEL_RDP_IMPLEMENTATION)/vulkan/buffer_pool.cpp \
$(PARALLEL_RDP_IMPLEMENTATION)/vulkan/command_buffer.cpp \
$(PARALLEL_RDP_IMPLEMENTATION)/vulkan/command_pool.cpp \
$(PARALLEL_RDP_IMPLEMENTATION)/vulkan/context.cpp \
$(PARALLEL_RDP_IMPLEMENTATION)/vulkan/cookie.cpp \
$(PARALLEL_RDP_IMPLEMENTATION)/vulkan/descriptor_set.cpp \
$(PARALLEL_RDP_IMPLEMENTATION)/vulkan/device.cpp \
$(PARALLEL_RDP_IMPLEMENTATION)/vulkan/event_manager.cpp \
$(PARALLEL_RDP_IMPLEMENTATION)/vulkan/fence.cpp \
$(PARALLEL_RDP_IMPLEMENTATION)/vulkan/fence_manager.cpp \
$(PARALLEL_RDP_IMPLEMENTATION)/vulkan/image.cpp \
$(PARALLEL_RDP_IMPLEMENTATION)/vulkan/indirect_layout.cpp \
$(PARALLEL_RDP_IMPLEMENTATION)/vulkan/memory_allocator.cpp \
$(PARALLEL_RDP_IMPLEMENTATION)/vulkan/pipeline_event.cpp \
$(PARALLEL_RDP_IMPLEMENTATION)/vulkan/query_pool.cpp \
$(PARALLEL_RDP_IMPLEMENTATION)/vulkan/render_pass.cpp \
$(PARALLEL_RDP_IMPLEMENTATION)/vulkan/sampler.cpp \
$(PARALLEL_RDP_IMPLEMENTATION)/vulkan/semaphore.cpp \
$(PARALLEL_RDP_IMPLEMENTATION)/vulkan/semaphore_manager.cpp \
$(PARALLEL_RDP_IMPLEMENTATION)/vulkan/shader.cpp \
$(PARALLEL_RDP_IMPLEMENTATION)/vulkan/texture/texture_format.cpp \
$(PARALLEL_RDP_IMPLEMENTATION)/util/arena_allocator.cpp \
$(PARALLEL_RDP_IMPLEMENTATION)/util/logging.cpp \
$(PARALLEL_RDP_IMPLEMENTATION)/util/thread_id.cpp \
$(PARALLEL_RDP_IMPLEMENTATION)/util/aligned_alloc.cpp \
$(PARALLEL_RDP_IMPLEMENTATION)/util/timer.cpp \
$(PARALLEL_RDP_IMPLEMENTATION)/util/timeline_trace_file.cpp \
$(PARALLEL_RDP_IMPLEMENTATION)/util/environment.cpp \
$(PARALLEL_RDP_IMPLEMENTATION)/util/thread_name.cpp
PARALLEL_RDP_SOURCES_C := \
$(PARALLEL_RDP_IMPLEMENTATION)/volk/volk.c
PARALLEL_RDP_INCLUDE_DIRS := \
-I$(PARALLEL_RDP_IMPLEMENTATION)/parallel-rdp \
-I$(PARALLEL_RDP_IMPLEMENTATION)/volk \
-I$(PARALLEL_RDP_IMPLEMENTATION)/vulkan \
-I$(PARALLEL_RDP_IMPLEMENTATION)/vulkan-headers/include \
-I$(PARALLEL_RDP_IMPLEMENTATION)/util
PARALLEL_RDP_LDFLAGS := -pthread
ifeq (,$(findstring win,$(platform)))
PARALLEL_RDP_LDFLAGS += -ldl
else
PARALLEL_RDP_CFLAGS += -DVK_USE_PLATFORM_WIN32_KHR
PARALLEL_RDP_LDFLAGS += -lwinmm
endif


@@ -0,0 +1,135 @@
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#include <chrono>
#include "command_ring.hpp"
#include "rdp_device.hpp"
#include "thread_id.hpp"
#include <assert.h>
namespace RDP
{
void CommandRing::init(
#ifdef PARALLEL_RDP_SHADER_DIR
Granite::Global::GlobalManagersHandle global_handles_,
#endif
CommandProcessor *processor_, unsigned count)
{
assert((count & (count - 1)) == 0);
teardown_thread();
processor = processor_;
ring.resize(count);
write_count = 0;
read_count = 0;
#ifdef PARALLEL_RDP_SHADER_DIR
global_handles = std::move(global_handles_);
#endif
thr = std::thread(&CommandRing::thread_loop, this);
}
void CommandRing::teardown_thread()
{
if (thr.joinable())
{
enqueue_command(0, nullptr);
thr.join();
}
}
CommandRing::~CommandRing()
{
teardown_thread();
}
void CommandRing::drain()
{
std::unique_lock<std::mutex> holder{lock};
cond.wait(holder, [this]() {
return write_count == completed_count;
});
}
void CommandRing::enqueue_command(unsigned num_words, const uint32_t *words)
{
std::unique_lock<std::mutex> holder{lock};
cond.wait(holder, [this, num_words]() {
return write_count + num_words + 1 <= read_count + ring.size();
});
size_t mask = ring.size() - 1;
ring[write_count++ & mask] = num_words;
for (unsigned i = 0; i < num_words; i++)
ring[write_count++ & mask] = words[i];
cond.notify_one();
}
void CommandRing::thread_loop()
{
Util::register_thread_index(0);
#ifdef PARALLEL_RDP_SHADER_DIR
// Here to let the RDP play nice with full Granite.
// When we move to standalone Granite, we won't need to interact with global subsystems like this.
Granite::Global::set_thread_context(*global_handles);
global_handles.reset();
#endif
std::vector<uint32_t> tmp_buffer;
tmp_buffer.reserve(64);
size_t mask = ring.size() - 1;
for (;;)
{
bool is_idle = false;
{
std::unique_lock<std::mutex> holder{lock};
if (cond.wait_for(holder, std::chrono::microseconds(500), [this]() { return write_count > read_count; }))
{
uint32_t num_words = ring[read_count++ & mask];
tmp_buffer.resize(num_words);
for (uint32_t i = 0; i < num_words; i++)
tmp_buffer[i] = ring[read_count++ & mask];
}
else
{
// If we don't receive commands at a steady pace,
// notify rendering thread that we should probably kick some work.
tmp_buffer.resize(1);
tmp_buffer[0] = uint32_t(Op::MetaIdle) << 24;
is_idle = true;
}
}
if (tmp_buffer.empty())
break;
processor->enqueue_command_direct(tmp_buffer.size(), tmp_buffer.data());
if (!is_idle)
{
std::lock_guard<std::mutex> holder{lock};
completed_count = read_count;
cond.notify_one();
}
}
}
}


@@ -0,0 +1,67 @@
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#pragma once
#include <thread>
#include <mutex>
#include <condition_variable>
#include <vector>
#ifdef PARALLEL_RDP_SHADER_DIR
#include "global_managers.hpp"
#endif
namespace RDP
{
class CommandProcessor;
class CommandRing
{
public:
void init(
#ifdef PARALLEL_RDP_SHADER_DIR
Granite::Global::GlobalManagersHandle global_handles,
#endif
CommandProcessor *processor, unsigned count);
~CommandRing();
void drain();
void enqueue_command(unsigned num_words, const uint32_t *words);
private:
CommandProcessor *processor = nullptr;
std::thread thr;
std::mutex lock;
std::condition_variable cond;
std::vector<uint32_t> ring;
uint64_t write_count = 0;
uint64_t read_count = 0;
uint64_t completed_count = 0;
void thread_loop();
void teardown_thread();
#ifdef PARALLEL_RDP_SHADER_DIR
Granite::Global::GlobalManagersHandle global_handles;
#endif
};
}

parallel-rdp/luts.hpp Normal file

File diff suppressed because it is too large.

parallel-rdp/rdp_common.hpp Normal file

@@ -0,0 +1,410 @@
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#pragma once
namespace Vulkan
{
class Program;
class Shader;
}
namespace RDP
{
template <typename Program, typename Shader> struct Shaders;
using ShaderBank = Shaders<Vulkan::Program *, Vulkan::Shader *>;
// list of command IDs
enum class Op
{
Nop = 0,
MetaSignalTimeline = 1,
MetaFlush = 2,
MetaIdle = 3,
MetaSetQuirks = 4,
FillTriangle = 0x08,
FillZBufferTriangle = 0x09,
TextureTriangle = 0x0a,
TextureZBufferTriangle = 0x0b,
ShadeTriangle = 0x0c,
ShadeZBufferTriangle = 0x0d,
ShadeTextureTriangle = 0x0e,
ShadeTextureZBufferTriangle = 0x0f,
TextureRectangle = 0x24,
TextureRectangleFlip = 0x25,
SyncLoad = 0x26,
SyncPipe = 0x27,
SyncTile = 0x28,
SyncFull = 0x29,
SetKeyGB = 0x2a,
SetKeyR = 0x2b,
SetConvert = 0x2c,
SetScissor = 0x2d,
SetPrimDepth = 0x2e,
SetOtherModes = 0x2f,
LoadTLut = 0x30,
SetTileSize = 0x32,
LoadBlock = 0x33,
LoadTile = 0x34,
SetTile = 0x35,
FillRectangle = 0x36,
SetFillColor = 0x37,
SetFogColor = 0x38,
SetBlendColor = 0x39,
SetPrimColor = 0x3a,
SetEnvColor = 0x3b,
SetCombine = 0x3c,
SetTextureImage = 0x3d,
SetMaskImage = 0x3e,
SetColorImage = 0x3f
};
enum class RGBMul : uint8_t
{
Combined = 0,
Texel0 = 1,
Texel1 = 2,
Primitive = 3,
Shade = 4,
Env = 5,
KeyScale = 6,
CombinedAlpha = 7,
Texel0Alpha = 8,
Texel1Alpha = 9,
PrimitiveAlpha = 10,
ShadeAlpha = 11,
EnvAlpha = 12,
LODFrac = 13,
PrimLODFrac = 14,
ConvertK5 = 15,
Zero = 16
};
enum class RGBMulAdd : uint8_t
{
Combined = 0,
Texel0 = 1,
Texel1 = 2,
Primitive = 3,
Shade = 4,
Env = 5,
One = 6,
Noise = 7,
Zero = 8
};
enum class RGBMulSub : uint8_t
{
Combined = 0,
Texel0 = 1,
Texel1 = 2,
Primitive = 3,
Shade = 4,
Env = 5,
KeyCenter = 6,
ConvertK4 = 7,
Zero = 8
};
enum class RGBAdd : uint8_t
{
Combined = 0,
Texel0 = 1,
Texel1 = 2,
Primitive = 3,
Shade = 4,
Env = 5,
One = 6,
Zero = 7
};
enum class AlphaAddSub : uint8_t
{
CombinedAlpha = 0,
Texel0Alpha = 1,
Texel1Alpha = 2,
PrimitiveAlpha = 3,
ShadeAlpha = 4,
EnvAlpha = 5,
One = 6,
Zero = 7
};
enum class AlphaMul : uint8_t
{
LODFrac = 0,
Texel0Alpha = 1,
Texel1Alpha = 2,
PrimitiveAlpha = 3,
ShadeAlpha = 4,
EnvAlpha = 5,
PrimLODFrac = 6,
Zero = 7
};
enum class TextureSize : uint8_t
{
Bpp4 = 0,
Bpp8 = 1,
Bpp16 = 2,
Bpp32 = 3
};
enum class TextureFormat : uint8_t
{
RGBA = 0,
YUV = 1,
CI = 2,
IA = 3,
I = 4
};
enum class RGBDitherMode : uint8_t
{
Magic = 0,
Bayer = 1,
Noise = 2,
Off = 3
};
enum class AlphaDitherMode : uint8_t
{
Pattern = 0,
InvPattern = 1,
Noise = 2,
Off = 3
};
enum class CycleType : uint8_t
{
Cycle1 = 0,
Cycle2 = 1,
Copy = 2,
Fill = 3
};
enum class BlendMode1A : uint8_t
{
PixelColor = 0,
MemoryColor = 1,
BlendColor = 2,
FogColor = 3
};
enum class BlendMode1B : uint8_t
{
PixelAlpha = 0,
FogAlpha = 1,
ShadeAlpha = 2,
Zero = 3
};
enum class BlendMode2A : uint8_t
{
PixelColor = 0,
MemoryColor = 1,
BlendColor = 2,
FogColor = 3
};
enum class BlendMode2B : uint8_t
{
InvPixelAlpha = 0,
MemoryAlpha = 1,
One = 2,
Zero = 3
};
enum class CoverageMode : uint8_t
{
Clamp = 0,
Wrap = 1,
Zap = 2,
Save = 3
};
enum class ZMode : uint8_t
{
Opaque = 0,
Interpenetrating = 1,
Transparent = 2,
Decal = 3
};
enum TileInfoFlagBits
{
TILE_INFO_CLAMP_S_BIT = 1 << 0,
TILE_INFO_MIRROR_S_BIT = 1 << 1,
TILE_INFO_CLAMP_T_BIT = 1 << 2,
TILE_INFO_MIRROR_T_BIT = 1 << 3
};
using TileInfoFlags = uint8_t;
struct TileSize
{
uint32_t slo = 0;
uint32_t shi = 0;
uint32_t tlo = 0;
uint32_t thi = 0;
};
struct TileMeta
{
uint32_t offset = 0;
uint32_t stride = 0;
TextureFormat fmt = TextureFormat::RGBA;
TextureSize size = TextureSize::Bpp16;
uint8_t palette = 0;
uint8_t mask_s = 0;
uint8_t shift_s = 0;
uint8_t mask_t = 0;
uint8_t shift_t = 0;
TileInfoFlags flags = 0;
};
struct TileInfo
{
TileSize size;
TileMeta meta;
};
struct CombinerInputsRGB
{
RGBMulAdd muladd;
RGBMulSub mulsub;
RGBMul mul;
RGBAdd add;
};
struct CombinerInputsAlpha
{
AlphaAddSub muladd;
AlphaAddSub mulsub;
AlphaMul mul;
AlphaAddSub add;
};
struct CombinerInputs
{
CombinerInputsRGB rgb;
CombinerInputsAlpha alpha;
};
struct BlendModes
{
BlendMode1A blend_1a;
BlendMode1B blend_1b;
BlendMode2A blend_2a;
BlendMode2B blend_2b;
};
static_assert(sizeof(TileInfo) == 32, "TileInfo must be 32 bytes.");
enum class VIRegister
{
Control = 0,
Origin,
Width,
Intr,
VCurrentLine,
Timing,
VSync,
HSync,
Leap,
HStart,
VStart,
VBurst,
XScale,
YScale,
Count
};
enum VIControlFlagBits
{
VI_CONTROL_TYPE_BLANK_BIT = 0 << 0,
VI_CONTROL_TYPE_RESERVED_BIT = 1 << 0,
VI_CONTROL_TYPE_RGBA5551_BIT = 2 << 0,
VI_CONTROL_TYPE_RGBA8888_BIT = 3 << 0,
VI_CONTROL_TYPE_MASK = 3 << 0,
VI_CONTROL_GAMMA_DITHER_ENABLE_BIT = 1 << 2,
VI_CONTROL_GAMMA_ENABLE_BIT = 1 << 3,
VI_CONTROL_DIVOT_ENABLE_BIT = 1 << 4,
VI_CONTROL_SERRATE_BIT = 1 << 6,
VI_CONTROL_AA_MODE_RESAMP_EXTRA_ALWAYS_BIT = 0 << 8,
VI_CONTROL_AA_MODE_RESAMP_EXTRA_BIT = 1 << 8,
VI_CONTROL_AA_MODE_RESAMP_ONLY_BIT = 2 << 8,
VI_CONTROL_AA_MODE_RESAMP_REPLICATE_BIT = 3 << 8,
VI_CONTROL_AA_MODE_MASK = 3 << 8,
VI_CONTROL_DITHER_FILTER_ENABLE_BIT = 1 << 16,
VI_CONTROL_META_AA_BIT = 1 << 17,
VI_CONTROL_META_SCALE_BIT = 1 << 18
};
using VIControlFlags = uint32_t;
static inline uint32_t make_vi_start_register(uint32_t start_value, uint32_t end_value)
{
return ((start_value & 0x3ff) << 16) | (end_value & 0x3ff);
}
static inline uint32_t make_vi_scale_register(uint32_t scale_factor, uint32_t bias)
{
return ((bias & 0xfff) << 16) | (scale_factor & 0xfff);
}
constexpr uint32_t VI_V_SYNC_NTSC = 525;
constexpr uint32_t VI_V_SYNC_PAL = 625;
constexpr uint32_t VI_H_OFFSET_NTSC = 108;
constexpr uint32_t VI_H_OFFSET_PAL = 128;
constexpr uint32_t VI_V_OFFSET_NTSC = 34;
constexpr uint32_t VI_V_OFFSET_PAL = 44;
constexpr uint32_t VI_V_RES_NTSC = 480;
constexpr uint32_t VI_V_RES_PAL = 576;
constexpr int VI_SCANOUT_WIDTH = 640;
constexpr uint32_t VI_V_RES_MAX = VI_V_RES_PAL > VI_V_RES_NTSC ? VI_V_RES_PAL : VI_V_RES_NTSC;
// Handle odd v_start as well. Needed for rounding to work in our favor.
constexpr uint32_t VI_V_END_PAL = (VI_V_OFFSET_PAL + VI_V_RES_PAL) | 1;
constexpr uint32_t VI_V_END_NTSC = (VI_V_OFFSET_NTSC + VI_V_RES_NTSC) | 1;
constexpr uint32_t VI_V_END_MAX = VI_V_END_PAL > VI_V_END_NTSC ? VI_V_END_PAL : VI_V_END_NTSC;
constexpr uint32_t VI_MAX_OUTPUT_SCANLINES = (VI_V_RES_MAX >> 1);
static inline uint32_t make_default_v_start()
{
return make_vi_start_register(VI_V_OFFSET_NTSC, VI_V_OFFSET_NTSC + 224 * 2);
}
static inline uint32_t make_default_h_start()
{
return make_vi_start_register(VI_H_OFFSET_NTSC, VI_H_OFFSET_NTSC + VI_SCANOUT_WIDTH);
}
template <int bits>
static int32_t sext(int32_t v)
{
struct { int32_t dummy : bits; } d;
d.dummy = v;
return d.dummy;
}
}


@@ -0,0 +1,392 @@
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#pragma once
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include "rdp_common.hpp"
namespace RDP
{
enum TriangleSetupFlagBits
{
TRIANGLE_SETUP_FLIP_BIT = 1 << 0,
TRIANGLE_SETUP_DO_OFFSET_BIT = 1 << 1,
TRIANGLE_SETUP_SKIP_XFRAC_BIT = 1 << 2,
TRIANGLE_SETUP_INTERLACE_FIELD_BIT = 1 << 3,
TRIANGLE_SETUP_INTERLACE_KEEP_ODD_BIT = 1 << 4,
TRIANGLE_SETUP_DISABLE_UPSCALING_BIT = 1 << 5,
TRIANGLE_SETUP_NATIVE_LOD_BIT = 1 << 6,
TRIANGLE_SETUP_FILL_COPY_RASTER_BIT = 1 << 7
};
using TriangleSetupFlags = uint8_t;
enum StaticRasterizationFlagBits
{
RASTERIZATION_INTERLACE_FIELD_BIT = 1 << 0,
RASTERIZATION_INTERLACE_KEEP_ODD_BIT = 1 << 1,
RASTERIZATION_AA_BIT = 1 << 2,
RASTERIZATION_PERSPECTIVE_CORRECT_BIT = 1 << 3,
RASTERIZATION_TLUT_BIT = 1 << 4,
RASTERIZATION_TLUT_TYPE_BIT = 1 << 5,
RASTERIZATION_CVG_TIMES_ALPHA_BIT = 1 << 6,
RASTERIZATION_ALPHA_CVG_SELECT_BIT = 1 << 7,
RASTERIZATION_MULTI_CYCLE_BIT = 1 << 8,
RASTERIZATION_TEX_LOD_ENABLE_BIT = 1 << 9,
RASTERIZATION_SHARPEN_LOD_ENABLE_BIT = 1 << 10,
RASTERIZATION_DETAIL_LOD_ENABLE_BIT = 1 << 11,
RASTERIZATION_FILL_BIT = 1 << 12,
RASTERIZATION_COPY_BIT = 1 << 13,
RASTERIZATION_SAMPLE_MODE_BIT = 1 << 14,
RASTERIZATION_ALPHA_TEST_BIT = 1 << 15,
RASTERIZATION_ALPHA_TEST_DITHER_BIT = 1 << 16,
RASTERIZATION_SAMPLE_MID_TEXEL_BIT = 1 << 17,
RASTERIZATION_USES_TEXEL0_BIT = 1 << 18,
RASTERIZATION_USES_TEXEL1_BIT = 1 << 19,
RASTERIZATION_USES_LOD_BIT = 1 << 20,
RASTERIZATION_USES_PIPELINED_TEXEL1_BIT = 1 << 21,
RASTERIZATION_CONVERT_ONE_BIT = 1 << 22,
RASTERIZATION_BILERP_0_BIT = 1 << 23,
RASTERIZATION_BILERP_1_BIT = 1 << 24,
RASTERIZATION_NEED_NOISE_DUAL_BIT = 1 << 25,
RASTERIZATION_UPSCALING_LOG2_BIT_OFFSET = 26,
	// Bits 26 and 27 hold the upscaling factor in LOG2.
RASTERIZATION_NEED_NOISE_BIT = 1 << 28,
RASTERIZATION_USE_STATIC_TEXTURE_SIZE_FORMAT_BIT = 1 << 29,
RASTERIZATION_USE_SPECIALIZATION_CONSTANT_BIT = 1 << 30
};
using StaticRasterizationFlags = uint32_t;
enum DepthBlendFlagBits
{
DEPTH_BLEND_DEPTH_TEST_BIT = 1 << 0,
DEPTH_BLEND_DEPTH_UPDATE_BIT = 1 << 1,
DEPTH_BLEND_FORCE_BLEND_BIT = 1 << 3,
DEPTH_BLEND_IMAGE_READ_ENABLE_BIT = 1 << 4,
DEPTH_BLEND_COLOR_ON_COVERAGE_BIT = 1 << 5,
DEPTH_BLEND_MULTI_CYCLE_BIT = 1 << 6,
DEPTH_BLEND_AA_BIT = 1 << 7,
DEPTH_BLEND_DITHER_ENABLE_BIT = 1 << 8
};
using DepthBlendFlags = uint32_t;
struct TriangleSetup
{
int32_t xh, xm, xl;
int16_t yh, ym;
int32_t dxhdy, dxmdy, dxldy;
int16_t yl;
TriangleSetupFlags flags;
uint8_t tile;
};
struct AttributeSetup
{
int32_t r, g, b, a;
int32_t drdx, dgdx, dbdx, dadx;
int32_t drde, dgde, dbde, dade;
int32_t drdy, dgdy, dbdy, dady;
int32_t s, t, z, w;
int32_t dsdx, dtdx, dzdx, dwdx;
int32_t dsde, dtde, dzde, dwde;
int32_t dsdy, dtdy, dzdy, dwdy;
};
struct ConstantCombinerInputs
{
uint8_t muladd[4];
uint8_t mulsub[4];
uint8_t mul[4];
uint8_t add[4];
};
// Per-primitive state which is very dynamic in nature and does not change anything about the shader itself.
struct DerivedSetup
{
ConstantCombinerInputs constants[2];
uint8_t fog_color[4];
uint8_t blend_color[4];
uint32_t fill_color;
uint16_t dz;
uint8_t dz_compressed;
uint8_t min_lod;
int16_t convert_factors[4];
};
static_assert((sizeof(TriangleSetup) & 15) == 0, "TriangleSetup must be aligned to 16 bytes.");
static_assert((sizeof(AttributeSetup) & 15) == 0, "AttributeSetup must be aligned to 16 bytes.");
static_assert(sizeof(DerivedSetup) == 56, "DerivedSetup is not 56 bytes.");
struct ScissorState
{
uint32_t xlo;
uint32_t ylo;
uint32_t xhi;
uint32_t yhi;
};
struct StaticRasterizationState
{
CombinerInputs combiner[2];
StaticRasterizationFlags flags;
uint32_t dither;
uint32_t texture_size;
uint32_t texture_fmt;
};
static_assert(sizeof(StaticRasterizationState) == 32, "StaticRasterizationState must be 32 bytes.");
struct DepthBlendState
{
BlendModes blend_cycles[2];
DepthBlendFlags flags;
CoverageMode coverage_mode;
ZMode z_mode;
uint8_t padding[2];
};
static_assert(sizeof(DepthBlendState) == 16, "DepthBlendState must be 16 bytes.");
struct InstanceIndices
{
uint8_t static_index;
uint8_t depth_blend_index;
uint8_t tile_instance_index;
uint8_t padding[5];
uint8_t tile_indices[8];
};
static_assert((sizeof(InstanceIndices) & 15) == 0, "InstanceIndices must be aligned to 16 bytes.");
struct UploadInfo
{
int32_t width, height;
float min_t_mod, max_t_mod;
int32_t vram_addr;
int32_t vram_width;
int32_t vram_size;
int32_t vram_effective_width;
int32_t tmem_offset;
int32_t tmem_stride_words;
int32_t tmem_size;
int32_t tmem_fmt;
int32_t mode;
float inv_tmem_stride_words;
int32_t dxt;
int32_t padding;
};
static_assert((sizeof(UploadInfo) & 15) == 0, "UploadInfo must be aligned to 16 bytes.");
struct SpanSetup
{
int32_t r, g, b, a;
int32_t s, t, w, z;
int16_t xlo[4];
int16_t xhi[4];
int32_t interpolation_base_x;
int32_t start_x;
int32_t end_x;
int16_t lodlength;
uint16_t valid_line;
};
static_assert((sizeof(SpanSetup) & 15) == 0, "SpanSetup is not aligned to 16 bytes.");
struct SpanInfoOffsets
{
int32_t offset, ylo, yhi, padding;
};
static_assert((sizeof(SpanInfoOffsets) == 16), "SpanInfoOffsets is not 16 bytes.");
struct SpanInterpolationJob
{
uint16_t primitive_index, base_y, max_y, padding;
};
static_assert((sizeof(SpanInterpolationJob) == 8), "SpanInterpolationJob is not 8 bytes.");
struct GlobalState
{
uint32_t addr_index;
uint32_t depth_addr_index;
uint32_t fb_width, fb_height;
uint32_t group_mask;
};
struct TileRasterWork
{
uint32_t tile_x, tile_y;
uint32_t tile_instance;
uint32_t primitive;
};
static_assert((sizeof(TileRasterWork) == 16), "TileRasterWork is not 16 bytes.");
struct GlobalFBInfo
{
uint32_t dx_shift;
uint32_t dx_mask;
uint32_t fb_size;
uint32_t base_primitive_index;
};
template <typename T, unsigned N>
class StateCache
{
public:
unsigned add(const T &t)
{
if (cached_index >= 0)
if (memcmp(&elements[cached_index], &t, sizeof(T)) == 0)
return unsigned(cached_index);
for (int i = int(count) - 1; i >= 0; i--)
{
if (memcmp(&elements[i], &t, sizeof(T)) == 0)
{
cached_index = i;
return unsigned(i);
}
}
assert(count < N);
memcpy(elements + count, &t, sizeof(T));
unsigned ret = count++;
cached_index = int(ret);
return ret;
}
bool full() const
{
return count == N;
}
unsigned size() const
{
return count;
}
unsigned byte_size() const
{
return size() * sizeof(T);
}
const T *data() const
{
return elements;
}
void reset()
{
count = 0;
cached_index = -1;
}
bool empty() const
{
return count == 0;
}
private:
unsigned count = 0;
int cached_index = -1;
T elements[N];
};
template <typename T, unsigned N>
class StreamCache
{
public:
void add(const T &t)
{
assert(count < N);
memcpy(&elements[count++], &t, sizeof(T));
}
bool full() const
{
return count == N;
}
unsigned size() const
{
return count;
}
unsigned byte_size() const
{
return size() * sizeof(T);
}
const T *data() const
{
return elements;
}
void reset()
{
count = 0;
}
bool empty() const
{
return count == 0;
}
private:
unsigned count = 0;
T elements[N];
};
namespace Limits
{
constexpr unsigned MaxPrimitives = 256;
constexpr unsigned MaxStaticRasterizationStates = 64;
constexpr unsigned MaxDepthBlendStates = 64;
constexpr unsigned MaxTileInfoStates = 256;
constexpr unsigned NumSyncStates = 32;
constexpr unsigned MaxNumTiles = 8;
constexpr unsigned MaxTMEMInstances = 256;
constexpr unsigned MaxSpanSetups = 32 * 1024;
constexpr unsigned MaxWidth = 1024;
constexpr unsigned MaxHeight = 1024;
constexpr unsigned MaxTileInstances = 0x8000;
}
namespace ImplementationConstants
{
constexpr unsigned DefaultWorkgroupSize = 64;
constexpr unsigned TileWidth = 8;
constexpr unsigned TileHeight = 8;
constexpr unsigned MaxTilesX = Limits::MaxWidth / TileWidth;
constexpr unsigned MaxTilesY = Limits::MaxHeight / TileHeight;
constexpr unsigned IncoherentPageSize = 1024;
constexpr unsigned MaxPendingRenderPassesBeforeFlush = 8;
constexpr unsigned MinimumPrimitivesForIdleFlush = 32;
constexpr unsigned MinimumRenderPassesForIdleFlush = 2;
}
}

1264 parallel-rdp/rdp_device.cpp Normal file

File diff suppressed because it is too large

284 parallel-rdp/rdp_device.hpp Normal file

@@ -0,0 +1,284 @@
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#pragma once
#include <memory>
#include <thread>
#include <queue>
#include "device.hpp"
#include "video_interface.hpp"
#include "rdp_renderer.hpp"
#include "rdp_common.hpp"
#include "command_ring.hpp"
#include "worker_thread.hpp"
#include "rdp_dump_write.hpp"
namespace RDP
{
struct RGBA
{
uint8_t r, g, b, a;
};
enum CommandProcessorFlagBits
{
COMMAND_PROCESSOR_FLAG_HOST_VISIBLE_HIDDEN_RDRAM_BIT = 1 << 0,
COMMAND_PROCESSOR_FLAG_HOST_VISIBLE_TMEM_BIT = 1 << 1,
COMMAND_PROCESSOR_FLAG_UPSCALING_2X_BIT = 1 << 2,
COMMAND_PROCESSOR_FLAG_UPSCALING_4X_BIT = 1 << 3,
COMMAND_PROCESSOR_FLAG_UPSCALING_8X_BIT = 1 << 4,
COMMAND_PROCESSOR_FLAG_SUPER_SAMPLED_READ_BACK_BIT = 1 << 5,
COMMAND_PROCESSOR_FLAG_SUPER_SAMPLED_DITHER_BIT = 1 << 6
};
using CommandProcessorFlags = uint32_t;
struct CoherencyCopy
{
size_t src_offset = 0;
size_t mask_offset = 0;
size_t dst_offset = 0;
size_t size = 0;
std::atomic_uint32_t *counter_base = nullptr;
unsigned counters = 0;
};
struct CoherencyOperation
{
Vulkan::Fence fence;
uint64_t timeline_value = 0;
uint8_t *dst = nullptr;
const Vulkan::Buffer *src = nullptr;
std::vector<CoherencyCopy> copies;
std::atomic_uint32_t *unlock_cookie = nullptr;
};
// These options control various behavior when upscaling, to work around glitches which arise naturally as part of upscaling.
struct Quirks
{
inline Quirks()
{
u.options.native_resolution_tex_rect = true;
u.options.native_texture_lod = false;
}
inline void set_native_resolution_tex_rect(bool enable)
{
u.options.native_resolution_tex_rect = enable;
}
inline void set_native_texture_lod(bool enable)
{
u.options.native_texture_lod = enable;
}
union
{
struct Opts
{
// If true, force TEX_RECT and TEX_RECT_FLIP to render without upscaling.
		// Works around bilinear filtering bugs in Cycle1/Cycle2 mode where the game assumed a 1:1 pixel transfer.
bool native_resolution_tex_rect;
// Forces LOD to be computed as 1x upscale.
// Fixes content which relies on LOD computation to select textures in clever ways.
bool native_texture_lod;
} options;
uint32_t words[1];
} u;
};
class CommandProcessor
{
public:
CommandProcessor(Vulkan::Device &device,
void *rdram_ptr,
size_t rdram_offset,
size_t rdram_size,
size_t hidden_rdram_size,
CommandProcessorFlags flags);
~CommandProcessor();
void set_validation_interface(ValidationInterface *iface);
bool device_is_supported() const;
// Synchronization.
void flush();
uint64_t signal_timeline();
void wait_for_timeline(uint64_t index);
void idle();
void begin_frame_context();
// Queues up state and drawing commands.
void enqueue_command(unsigned num_words, const uint32_t *words);
void enqueue_command_direct(unsigned num_words, const uint32_t *words);
void set_quirks(const Quirks &quirks);
// Interact with memory.
void *begin_read_rdram();
void end_write_rdram();
void *begin_read_hidden_rdram();
void end_write_hidden_rdram();
size_t get_rdram_size() const;
size_t get_hidden_rdram_size() const;
void *get_tmem();
// Sets VI register
void set_vi_register(VIRegister reg, uint32_t value);
Vulkan::ImageHandle scanout(const ScanoutOptions &opts = {});
void scanout_sync(std::vector<RGBA> &colors, unsigned &width, unsigned &height, const ScanoutOptions &opts = {});
void scanout_async_buffer(VIScanoutBuffer &buffer, const ScanoutOptions &opts = {});
// Support for modifying certain registers per-scanline.
// The idea is that before we scanout(), we use set_vi_register() to
// set frame-global VI register state.
// While scanning out, we can support changing some state, in particular HStart and XStart
// which allows various raster effects ala HDMA.
// For sanity's sake, scanout() reads all memory at once. A fully beam-raced implementation
	// would render out an image every scanline, but that would cripple performance, and it's questionable
	// how useful this would be, especially on a 3D console. The only failure case of this style of implementation
// would be if a demo attempted to modify VRAM *after* it has been scanned out, i.e. a write-after-read
// hazard.
// Latch registers are initialized to the values in set_vi_register() for each respective register.
// After scanout(), the flags state is cleared to 0.
void begin_vi_register_per_scanline(VideoInterface::PerScanlineRegisterFlags flags);
void set_vi_register_for_scanline(VideoInterface::PerScanlineRegisterBits reg, uint32_t value);
// Between begin_vi_register_per_scanline() and scanout(), line must be monotonically increasing,
// or the call is ignored. Initial value for the line counter is 0
// (to set parameters for line 0, use global VI register state).
// Currently set registers in set_vi_register_for_scanline() are considered to be the active VI register
// values starting with VI line "vi_line", until the bottom of the frame or a new vi_line is set.
// Register state is assumed to have been fixed from the last latched scanline up until vi_line.
//
// The units used for this value matches the hardware YStart registers,
// i.e. the first active scanline is not 0, but VI_H_OFFSET_{NTSC,PAL}.
// For every scanned line, vi_line should increment by 2.
// vi_line must be less than VI_V_END_MAX (really, VI_V_END_{NTSC,PAL}), or it is ignored.
void latch_vi_register_for_scanline(unsigned vi_line);
// Assumes that scanline register state does not change until end of frame.
// Must be called before scanout(), or all per-scanline register state is ignored for the scanout.
void end_vi_register_per_scanline();
// Intended flow is something like:
// set_vi_register(reg, value0) // value0 used for line [0, 99]
// begin_vi_register_per_scanline(flags);
// set_vi_register_for_scanline(reg, value1); // value1 used for line [100, 199]
// latch_vi_register_for_scanline(100);
// set_vi_register_for_scanline(reg, value2);
// latch_vi_register_for_scanline(200); // value2 used for line [200, VBlank]
// end_vi_register_per_scanline();
// scanout();
private:
Vulkan::Device &device;
Vulkan::BufferHandle rdram;
Vulkan::BufferHandle hidden_rdram;
Vulkan::BufferHandle tmem;
size_t rdram_offset;
size_t rdram_size;
CommandProcessorFlags flags;
#ifndef PARALLEL_RDP_SHADER_DIR
std::unique_ptr<ShaderBank> shader_bank;
#endif
// Tear-down order is important here.
Renderer renderer;
VideoInterface vi;
CommandRing ring;
void clear_hidden_rdram();
void clear_tmem();
void clear_buffer(Vulkan::Buffer &buffer, uint32_t value);
void init_renderer();
void enqueue_command_inner(unsigned num_words, const uint32_t *words);
Vulkan::ImageHandle scanout(const ScanoutOptions &opts, VkImageLayout target_layout);
#define OP(x) void op_##x(const uint32_t *words)
OP(fill_triangle); OP(fill_z_buffer_triangle); OP(texture_triangle); OP(texture_z_buffer_triangle);
OP(shade_triangle); OP(shade_z_buffer_triangle); OP(shade_texture_triangle); OP(shade_texture_z_buffer_triangle);
OP(texture_rectangle); OP(texture_rectangle_flip); OP(sync_load); OP(sync_pipe);
OP(sync_tile); OP(sync_full); OP(set_key_gb); OP(set_key_r);
OP(set_convert); OP(set_scissor); OP(set_prim_depth); OP(set_other_modes);
OP(load_tlut); OP(set_tile_size); OP(load_block);
OP(load_tile); OP(set_tile); OP(fill_rectangle); OP(set_fill_color);
OP(set_fog_color); OP(set_blend_color); OP(set_prim_color); OP(set_env_color);
OP(set_combine); OP(set_texture_image); OP(set_mask_image); OP(set_color_image);
#undef OP
ScissorState scissor_state = {};
StaticRasterizationState static_state = {};
DepthBlendState depth_blend = {};
struct
{
uint32_t addr;
uint32_t width;
TextureFormat fmt;
TextureSize size;
} texture_image = {};
uint64_t timeline_value = 0;
uint64_t thread_timeline_value = 0;
struct FenceExecutor
{
explicit inline FenceExecutor(Vulkan::Device *device_, uint64_t *ptr)
: device(device_), value(ptr)
{
}
Vulkan::Device *device;
uint64_t *value;
bool is_sentinel(const CoherencyOperation &work) const;
void perform_work(CoherencyOperation &work);
void notify_work_locked(const CoherencyOperation &work);
};
WorkerThread<CoherencyOperation, FenceExecutor> timeline_worker;
uint8_t *host_rdram = nullptr;
bool measure_stall_time = false;
bool single_threaded_processing = false;
bool is_supported = false;
bool is_host_coherent = true;
bool timestamp = false;
friend class Renderer;
void enqueue_coherency_operation(CoherencyOperation &&op);
void drain_command_ring();
void decode_triangle_setup(TriangleSetup &setup, const uint32_t *words) const;
Quirks quirks;
std::unique_ptr<RDPDumpWriter> dump_writer;
bool dump_in_command_list = false;
};
}


@@ -0,0 +1,151 @@
/* Copyright (c) 2021 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#include "rdp_dump_write.hpp"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
namespace RDP
{
RDPDumpWriter::~RDPDumpWriter()
{
end();
if (file)
fclose(file);
}
bool RDPDumpWriter::init(const char *path, uint32_t dram_size, uint32_t hidden_dram_size)
{
if (file)
return false;
rdp_dram_cache.clear();
rdp_dram_cache.resize(dram_size);
rdp_hidden_dram_cache.clear();
rdp_hidden_dram_cache.resize(hidden_dram_size);
file = fopen(path, "wb");
if (!file)
return false;
fwrite("RDPDUMP2", 8, 1, file);
fwrite(&dram_size, sizeof(dram_size), 1, file);
fwrite(&hidden_dram_size, sizeof(hidden_dram_size), 1, file);
return true;
}
void RDPDumpWriter::end_frame()
{
if (!file)
return;
uint32_t cmd = RDP_DUMP_CMD_END_FRAME;
fwrite(&cmd, sizeof(cmd), 1, file);
}
void RDPDumpWriter::end()
{
if (!file)
return;
uint32_t cmd = RDP_DUMP_CMD_EOF;
fwrite(&cmd, sizeof(cmd), 1, file);
fclose(file);
file = nullptr;
rdp_dram_cache.clear();
rdp_hidden_dram_cache.clear();
}
void RDPDumpWriter::flush(const void *dram_, uint32_t size,
RDPDumpCmd block_cmd, RDPDumpCmd flush_cmd,
uint8_t *cache)
{
if (!file)
return;
const auto *dram = static_cast<const uint8_t *>(dram_);
const uint32_t block_size = 4 * 1024;
uint32_t i = 0;
for (i = 0; i < size; i += block_size)
{
if (memcmp(dram + i, cache + i, block_size) != 0)
{
uint32_t cmd = block_cmd;
fwrite(&cmd, sizeof(cmd), 1, file);
fwrite(&i, sizeof(i), 1, file);
fwrite(&block_size, sizeof(block_size), 1, file);
fwrite(dram + i, 1, block_size, file);
memcpy(cache + i, dram + i, block_size);
}
}
uint32_t cmd = flush_cmd;
fwrite(&cmd, sizeof(cmd), 1, file);
}
void RDPDumpWriter::flush_dram(const void *dram_, uint32_t size)
{
flush(dram_, size, RDP_DUMP_CMD_UPDATE_DRAM, RDP_DUMP_CMD_UPDATE_DRAM_FLUSH, rdp_dram_cache.data());
}
void RDPDumpWriter::flush_hidden_dram(const void *dram_, uint32_t size)
{
flush(dram_, size, RDP_DUMP_CMD_UPDATE_HIDDEN_DRAM, RDP_DUMP_CMD_UPDATE_HIDDEN_DRAM_FLUSH, rdp_hidden_dram_cache.data());
}
void RDPDumpWriter::signal_complete()
{
if (!file)
return;
uint32_t cmd = RDP_DUMP_CMD_SIGNAL_COMPLETE;
fwrite(&cmd, sizeof(cmd), 1, file);
}
void RDPDumpWriter::emit_command(uint32_t command, const uint32_t *cmd_data, uint32_t cmd_words)
{
if (!file)
return;
uint32_t cmd = RDP_DUMP_CMD_RDP_COMMAND;
fwrite(&cmd, sizeof(cmd), 1, file);
fwrite(&command, sizeof(command), 1, file);
fwrite(&cmd_words, sizeof(cmd_words), 1, file);
fwrite(cmd_data, sizeof(*cmd_data), cmd_words, file);
}
void RDPDumpWriter::set_vi_register(uint32_t vi_register, uint32_t value)
{
if (!file)
return;
uint32_t cmd = RDP_DUMP_CMD_SET_VI_REGISTER;
fwrite(&cmd, sizeof(cmd), 1, file);
fwrite(&vi_register, sizeof(vi_register), 1, file);
fwrite(&value, sizeof(value), 1, file);
}
}


@@ -0,0 +1,65 @@
/* Copyright (c) 2021 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#pragma once
#include <stdint.h>
#include <stdio.h>
#include <vector>
namespace RDP
{
class RDPDumpWriter
{
public:
~RDPDumpWriter();
bool init(const char *path, uint32_t dram_size, uint32_t hidden_dram_size);
void flush_dram(const void *dram, uint32_t size);
void flush_hidden_dram(const void *dram, uint32_t size);
void signal_complete();
void emit_command(uint32_t command, const uint32_t *cmd_data, uint32_t cmd_words);
void set_vi_register(uint32_t vi_register, uint32_t value);
void end_frame();
private:
enum RDPDumpCmd : uint32_t
{
RDP_DUMP_CMD_INVALID = 0,
RDP_DUMP_CMD_UPDATE_DRAM = 1,
RDP_DUMP_CMD_RDP_COMMAND = 2,
RDP_DUMP_CMD_SET_VI_REGISTER = 3,
RDP_DUMP_CMD_END_FRAME = 4,
RDP_DUMP_CMD_SIGNAL_COMPLETE = 5,
RDP_DUMP_CMD_EOF = 6,
RDP_DUMP_CMD_UPDATE_DRAM_FLUSH = 7,
RDP_DUMP_CMD_UPDATE_HIDDEN_DRAM = 8,
RDP_DUMP_CMD_UPDATE_HIDDEN_DRAM_FLUSH = 9,
RDP_DUMP_CMD_INT_MAX = 0x7fffffff
};
FILE *file = nullptr;
std::vector<uint8_t> rdp_dram_cache;
std::vector<uint8_t> rdp_hidden_dram_cache;
void flush(const void *dram_, uint32_t size, RDPDumpCmd block_cmd, RDPDumpCmd flush_cmd, uint8_t *cache);
void end();
};
}

File diff suppressed because it is too large


@@ -0,0 +1,420 @@
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#pragma once
#include "rdp_data_structures.hpp"
#include "device.hpp"
#include "rdp_common.hpp"
#include "worker_thread.hpp"
#include <unordered_set>
namespace RDP
{
struct CoherencyOperation;
struct SyncObject
{
Vulkan::Fence fence;
};
enum class FBFormat : uint32_t
{
I4 = 0,
I8 = 1,
RGBA5551 = 2,
IA88 = 3,
RGBA8888 = 4
};
enum class UploadMode : uint32_t
{
Tile = 0,
TLUT = 1,
Block = 2
};
struct LoadTileInfo
{
uint32_t tex_addr;
uint32_t tex_width;
uint16_t slo, tlo, shi, thi;
TextureFormat fmt;
TextureSize size;
UploadMode mode;
};
class CommandProcessor;
struct RendererOptions
{
unsigned upscaling_factor = 1;
bool super_sampled_readback = false;
bool super_sampled_readback_dither = false;
};
enum class ValidationError
{
Fill4bpp,
LoadTile4bpp,
InvalidMultilineLoadTlut,
FillDepthTest,
FillDepthWrite,
FillImageReadEnable,
Copy32bpp
};
class ValidationInterface
{
public:
virtual ~ValidationInterface() = default;
	// report_rdp_crash() may be called from a worker thread as errors are encountered.
// Reports situations that would cause fatal error on a real RDP.
// We only opt to report these situations rather than deliberately crashing the renderer.
// Handling crashes is only relevant during development of N64 homebrew.
virtual void report_rdp_crash(ValidationError err, const char *msg) = 0;
};
class Renderer : public Vulkan::DebugChannelInterface
{
public:
explicit Renderer(CommandProcessor &processor);
~Renderer();
void set_device(Vulkan::Device *device);
void set_validation_interface(ValidationInterface *iface);
	// If coherent is false, RDRAM is a buffer split into data in the lower half and writemask state in the upper half, each half being size bytes.
// offset must be 0 in this case.
void set_rdram(Vulkan::Buffer *buffer, uint8_t *host_rdram, size_t offset, size_t size, bool coherent);
void set_hidden_rdram(Vulkan::Buffer *buffer);
void set_tmem(Vulkan::Buffer *buffer);
void set_shader_bank(const ShaderBank *bank);
bool init_renderer(const RendererOptions &options);
// setup may be mutated to apply various fixups to triangle setup.
void draw_flat_primitive(TriangleSetup &setup);
void draw_shaded_primitive(TriangleSetup &setup, const AttributeSetup &attr);
void set_color_framebuffer(uint32_t addr, uint32_t width, FBFormat fmt);
void set_depth_framebuffer(uint32_t addr);
void set_scissor_state(const ScissorState &state);
void set_static_rasterization_state(const StaticRasterizationState &state);
void set_depth_blend_state(const DepthBlendState &state);
void set_tile(uint32_t tile, const TileMeta &info);
void set_tile_size(uint32_t tile, uint32_t slo, uint32_t shi, uint32_t tlo, uint32_t thi);
void load_tile(uint32_t tile, const LoadTileInfo &info);
void load_tile_iteration(uint32_t tile, const LoadTileInfo &info, uint32_t tmem_offset);
void set_blend_color(uint32_t color);
void set_fog_color(uint32_t color);
void set_env_color(uint32_t color);
void set_primitive_color(uint8_t min_level, uint8_t prim_lod_frac, uint32_t color);
void set_fill_color(uint32_t color);
void set_primitive_depth(uint16_t prim_depth, uint16_t prim_dz);
void set_enable_primitive_depth(bool enable);
void set_convert(uint16_t k0, uint16_t k1, uint16_t k2, uint16_t k3, uint16_t k4, uint16_t k5);
void set_color_key(unsigned component, uint32_t width, uint32_t center, uint32_t scale);
// Called when the command thread has not seen any activity in a given period of time.
// This is useful so we don't needlessly queue up work when we might as well kick it to the GPU.
void notify_idle_command_thread();
void flush_and_signal();
int resolve_shader_define(const char *name, const char *define) const;
void resolve_coherency_external(unsigned offset, unsigned length);
void submit_update_upscaled_domain_external(Vulkan::CommandBuffer &cmd,
unsigned addr, unsigned pixels, unsigned pixel_size_log2);
unsigned get_scaling_factor() const;
const Vulkan::Buffer *get_upscaled_rdram_buffer() const;
const Vulkan::Buffer *get_upscaled_hidden_rdram_buffer() const;
void lock_command_processing();
void unlock_command_processing();
private:
CommandProcessor &processor;
Vulkan::Device *device = nullptr;
Vulkan::Buffer *rdram = nullptr;
ValidationInterface *validation_iface = nullptr;
Vulkan::BufferHandle upscaling_reference_rdram;
Vulkan::BufferHandle upscaling_multisampled_rdram;
Vulkan::BufferHandle upscaling_multisampled_hidden_rdram;
void validate_draw_state() const;
struct
{
uint8_t *host_rdram = nullptr;
Vulkan::BufferHandle staging_rdram;
Vulkan::BufferHandle staging_readback;
std::unique_ptr<std::atomic_uint32_t[]> pending_writes_for_page;
std::vector<uint32_t> page_to_direct_copy;
std::vector<uint32_t> page_to_masked_copy;
std::vector<uint32_t> page_to_pending_readback;
unsigned num_pages = 0;
unsigned staging_readback_pages = 0;
unsigned staging_readback_index = 0; // Ringbuffer the readbacks.
} incoherent;
size_t rdram_offset = 0;
size_t rdram_size = 0;
bool is_host_coherent = false;
Vulkan::Buffer *hidden_rdram = nullptr;
Vulkan::Buffer *tmem = nullptr;
const ShaderBank *shader_bank = nullptr;
bool init_caps();
void init_blender_lut();
void init_buffers(const RendererOptions &options);
bool init_internal_upscaling_factor(const RendererOptions &options);
struct
{
uint32_t addr = 0;
uint32_t depth_addr = 0;
uint32_t width = 0;
uint32_t deduced_height = 0;
FBFormat fmt = FBFormat::I8;
bool depth_write_pending = false;
bool color_write_pending = false;
} fb;
struct StreamCaches
{
ScissorState scissor_state = {};
StaticRasterizationState static_raster_state = {};
DepthBlendState depth_blend_state = {};
StateCache<StaticRasterizationState, Limits::MaxStaticRasterizationStates> static_raster_state_cache;
StateCache<DepthBlendState, Limits::MaxDepthBlendStates> depth_blend_state_cache;
StateCache<TileInfo, Limits::MaxTileInfoStates> tile_info_state_cache;
StreamCache<TriangleSetup, Limits::MaxPrimitives> triangle_setup;
StreamCache<ScissorState, Limits::MaxPrimitives> scissor_setup;
StreamCache<AttributeSetup, Limits::MaxPrimitives> attribute_setup;
StreamCache<DerivedSetup, Limits::MaxPrimitives> derived_setup;
StreamCache<InstanceIndices, Limits::MaxPrimitives> state_indices;
StreamCache<SpanInfoOffsets, Limits::MaxPrimitives> span_info_offsets;
StreamCache<SpanInterpolationJob, Limits::MaxSpanSetups> span_info_jobs;
std::vector<UploadInfo> tmem_upload_infos;
unsigned max_shaded_tiles = 0;
Vulkan::CommandBufferHandle cmd;
} stream;
void ensure_command_buffer();
TileInfo tiles[Limits::MaxNumTiles];
Vulkan::BufferHandle tmem_instances;
Vulkan::BufferHandle span_setups;
Vulkan::BufferHandle blender_divider_lut_buffer;
Vulkan::BufferViewHandle blender_divider_buffer;
Vulkan::BufferHandle tile_binning_buffer;
Vulkan::BufferHandle tile_binning_buffer_coarse;
Vulkan::BufferHandle indirect_dispatch_buffer;
Vulkan::BufferHandle tile_work_list;
Vulkan::BufferHandle per_tile_offsets;
Vulkan::BufferHandle per_tile_shaded_color;
Vulkan::BufferHandle per_tile_shaded_depth;
Vulkan::BufferHandle per_tile_shaded_shaded_alpha;
Vulkan::BufferHandle per_tile_shaded_coverage;
struct MappedBuffer
{
Vulkan::BufferHandle buffer;
bool is_host = false;
};
struct RenderBuffers
{
void init(Vulkan::Device &device, Vulkan::BufferDomain domain, RenderBuffers *borrow);
static MappedBuffer create_buffer(Vulkan::Device &device, Vulkan::BufferDomain domain, VkDeviceSize size, MappedBuffer *borrow);
MappedBuffer triangle_setup;
MappedBuffer attribute_setup;
MappedBuffer derived_setup;
MappedBuffer scissor_setup;
MappedBuffer static_raster_state;
MappedBuffer depth_blend_state;
MappedBuffer tile_info_state;
MappedBuffer state_indices;
MappedBuffer span_info_offsets;
MappedBuffer span_info_jobs;
Vulkan::BufferViewHandle span_info_jobs_view;
};
struct RenderBuffersUpdater
{
void init(Vulkan::Device &device);
void upload(Vulkan::Device &device, const StreamCaches &caches, Vulkan::CommandBuffer &cmd);
template <typename Cache>
void upload(Vulkan::CommandBuffer &cmd, Vulkan::Device &device,
const MappedBuffer &gpu, const MappedBuffer &cpu, const Cache &cache, bool &did_upload);
RenderBuffers cpu, gpu;
};
struct InternalSynchronization
{
Vulkan::Fence fence;
};
struct Constants
{
uint32_t blend_color = 0;
uint32_t fog_color = 0;
uint32_t env_color = 0;
uint32_t primitive_color = 0;
uint32_t fill_color = 0;
uint8_t min_level = 0;
uint8_t prim_lod_frac = 0;
int32_t prim_depth = 0;
uint16_t prim_dz = 0;
uint16_t convert[6] = {};
uint16_t key_width[3] = {};
uint8_t key_center[3] = {};
uint8_t key_scale[3] = {};
bool use_prim_depth = false;
} constants;
RenderBuffersUpdater buffer_instances[Limits::NumSyncStates];
InternalSynchronization internal_sync[Limits::NumSyncStates];
uint32_t sync_indices_needs_flush = 0;
unsigned buffer_instance = 0;
uint32_t base_primitive_index = 0;
unsigned pending_render_passes = 0;
unsigned pending_render_passes_upscaled = 0;
unsigned pending_primitives = 0;
unsigned pending_primitives_upscaled = 0;
bool tmem_upload_needs_flush(uint32_t addr) const;
bool render_pass_is_upscaled() const;
bool should_render_upscaled() const;
void flush_queues();
void submit_render_pass(Vulkan::CommandBuffer &cmd);
void submit_render_pass_upscaled(Vulkan::CommandBuffer &cmd);
void submit_render_pass_end(Vulkan::CommandBuffer &cmd);
void submit_to_queue();
void begin_new_context();
void reset_context();
bool need_flush() const;
void maintain_queues();
void maintain_queues_idle();
void update_tmem_instances(Vulkan::CommandBuffer &cmd);
void submit_span_setup_jobs(Vulkan::CommandBuffer &cmd, bool upscaled);
void update_deduced_height(const TriangleSetup &setup);
void submit_tile_binning_combined(Vulkan::CommandBuffer &cmd, bool upscaled);
void clear_indirect_buffer(Vulkan::CommandBuffer &cmd);
void submit_rasterization(Vulkan::CommandBuffer &cmd, Vulkan::Buffer &tmem, bool upscaled);
void submit_depth_blend(Vulkan::CommandBuffer &cmd, Vulkan::Buffer &tmem, bool upscaled, bool force_write_mask);
enum class ResolveStage { Pre, Post, SSAAResolve };
void submit_update_upscaled_domain(Vulkan::CommandBuffer &cmd, ResolveStage stage);
void submit_update_upscaled_domain(Vulkan::CommandBuffer &cmd, ResolveStage stage,
unsigned addr, unsigned depth_addr,
unsigned width, unsigned height,
unsigned pixel_size_log2);
void submit_clear_super_sample_write_mask(Vulkan::CommandBuffer &cmd, unsigned width, unsigned height);
SpanInfoOffsets allocate_span_jobs(const TriangleSetup &setup);
DerivedSetup build_derived_attributes(const AttributeSetup &attr) const;
void build_combiner_constants(DerivedSetup &setup, unsigned cycle) const;
int filter_debug_channel_x = -1;
int filter_debug_channel_y = -1;
bool debug_channel = false;
void message(const std::string &tag, uint32_t code,
uint32_t x, uint32_t y, uint32_t z,
uint32_t num_words, const Vulkan::DebugChannelInterface::Word *words) override;
bool can_support_minimum_subgroup_size(unsigned size) const;
bool supports_subgroup_size_control(uint32_t minimum_size, uint32_t maximum_size) const;
std::unordered_set<Util::Hash> pending_async_pipelines;
unsigned compute_conservative_max_num_tiles(const TriangleSetup &setup) const;
void deduce_static_texture_state(unsigned tile, unsigned max_lod_level);
void deduce_noise_state();
static StaticRasterizationState normalize_static_state(StaticRasterizationState state);
void fixup_triangle_setup(TriangleSetup &setup) const;
struct Caps
{
int timestamp = 0;
bool force_sync = false;
bool ubershader = false;
bool supports_small_integer_arithmetic = false;
bool subgroup_tile_binning = false;
bool subgroup_depth_blend = false;
bool super_sample_readback = false;
bool super_sample_readback_dither = false;
unsigned upscaling = 1;
unsigned max_num_tile_instances = Limits::MaxTileInstances;
unsigned max_tiles_x = ImplementationConstants::MaxTilesX;
unsigned max_tiles_y = ImplementationConstants::MaxTilesY;
unsigned max_width = Limits::MaxWidth;
unsigned max_height = Limits::MaxHeight;
} caps;
struct PipelineExecutor
{
Vulkan::Device *device;
bool is_sentinel(const Vulkan::DeferredPipelineCompile &compile) const;
void perform_work(const Vulkan::DeferredPipelineCompile &compile) const;
void notify_work_locked(const Vulkan::DeferredPipelineCompile &compile) const;
};
std::unique_ptr<WorkerThread<Vulkan::DeferredPipelineCompile, PipelineExecutor>> pipeline_worker;
void resolve_coherency_host_to_gpu(Vulkan::CommandBuffer &cmd);
void resolve_coherency_gpu_to_host(CoherencyOperation &op, Vulkan::CommandBuffer &cmd);
uint32_t get_byte_size_for_bound_color_framebuffer() const;
uint32_t get_byte_size_for_bound_depth_framebuffer() const;
void mark_pages_for_gpu_read(uint32_t base_addr, uint32_t byte_count);
void lock_pages_for_gpu_write(uint32_t base_addr, uint32_t byte_count);
std::atomic_uint32_t active_submissions;
void enqueue_fence_wait(Vulkan::Fence fence);
uint64_t last_submit_ns = 0;
std::mutex idle_lock;
};
}


@@ -0,0 +1,146 @@
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#ifndef BINNING_H_
#define BINNING_H_
// There are 4 critical Y coordinates to test when binning. Top, bottom, mid, and mid - 1.
const int SUBPIXELS_Y = 4;
ivec4 quantize_x(ivec4 x)
{
return x >> 15;
}
int minimum4(ivec4 v)
{
ivec2 minimum2 = min(v.xy, v.zw);
return min(minimum2.x, minimum2.y);
}
int maximum4(ivec4 v)
{
ivec2 maximum2 = max(v.xy, v.zw);
return max(maximum2.x, maximum2.y);
}
ivec4 madd_32_64(ivec4 a, int b, int c, out ivec4 hi_bits)
{
ivec4 lo, hi;
imulExtended(a, ivec4(b), hi, lo);
uvec4 carry;
lo = ivec4(uaddCarry(lo, uvec4(c), carry));
hi += ivec4(carry);
hi_bits = hi;
return lo;
}
ivec2 interpolate_xs(TriangleSetup setup, ivec4 ys, bool flip, int scaling)
{
int yh_interpolation_base = setup.yh & ~(SUBPIXELS_Y - 1);
int ym_interpolation_base = setup.ym;
yh_interpolation_base *= scaling;
ym_interpolation_base *= scaling;
// Interpolate in 64-bit so we can detect quirky overflow scenarios.
ivec4 xh_hi, xm_hi, xl_hi;
ivec4 xh = madd_32_64(ys - yh_interpolation_base, setup.dxhdy, scaling * setup.xh, xh_hi);
ivec4 xm = madd_32_64(ys - yh_interpolation_base, setup.dxmdy, scaling * setup.xm, xm_hi);
ivec4 xl = madd_32_64(ys - ym_interpolation_base, setup.dxldy, scaling * setup.xl, xl_hi);
xl = mix(xl, xm, lessThan(ys, ivec4(scaling * setup.ym)));
xl_hi = mix(xl_hi, xm_hi, lessThan(ys, ivec4(scaling * setup.ym)));
// Handle overflow scenarios. Saturate 64-bit signed to 32-bit signed without 64-bit math.
xh = mix(xh, ivec4(0x7fffffff), greaterThan(xh_hi, ivec4(0)));
xh = mix(xh, ivec4(-0x80000000), lessThan(xh_hi, ivec4(-1)));
xl = mix(xl, ivec4(0x7fffffff), greaterThan(xl_hi, ivec4(0)));
xl = mix(xl, ivec4(-0x80000000), lessThan(xl_hi, ivec4(-1)));
ivec4 xh_shifted = quantize_x(xh);
ivec4 xl_shifted = quantize_x(xl);
ivec4 xleft, xright;
if (flip)
{
xleft = xh_shifted;
xright = xl_shifted;
}
else
{
xleft = xl_shifted;
xright = xh_shifted;
}
// If one of the results is out of range, we have overflow and need to be conservative when binning.
int max_range = maximum4(max(abs(xleft), abs(xright)));
ivec2 range;
if (max_range <= 2047 * scaling)
range = ivec2(minimum4(xleft), maximum4(xright));
else
range = ivec2(0, 0x7fffffff);
return range;
}
bool bin_primitive(TriangleSetup setup, ivec2 lo, ivec2 hi, int scaling, ScissorState scissor)
{
// First clip Y range based on scissor.
lo.y = max(lo.y, scaling * (scissor.ylo >> 2));
hi.y = min(hi.y, scaling * ((scissor.yhi + 3) >> 2) - 1);
int start_y = lo.y * SUBPIXELS_Y;
int end_y = (hi.y * SUBPIXELS_Y) + (SUBPIXELS_Y - 1);
// Clip start/end against the primitive's yh/yl.
start_y = max(start_y, scaling * int(setup.yh));
end_y = min(end_y, scaling * int(setup.yl) - 1);
// Y is clipped out, exit early.
if (end_y < start_y)
return false;
bool flip = (setup.flags & TRIANGLE_SETUP_FLIP_BIT) != 0;
// Sample the X ranges for min and max Y, and potentially the mid-point as well.
ivec4 ys = ivec4(start_y, end_y, clamp(setup.ym * scaling + ivec2(-1, 0), ivec2(start_y), ivec2(end_y)));
ivec2 x_range = interpolate_xs(setup, ys, flip, scaling);
// For FILL_COPY_RASTER_BIT we're inclusive; otherwise, exclusive.
int x_bias = (setup.flags & TRIANGLE_SETUP_FILL_COPY_RASTER_BIT) != 0 ? 4 : 3;
ivec2 scissor_x = ivec2(scaling * (scissor.xlo >> 2), scaling * ((scissor.xhi + x_bias) >> 2) - 1);
// Scissor is applied through a clamp, with a mask generated for overshoot that affects whether a line is valid.
// Since this is a conservative test we don't compute line validity here, so we have to assume the line is valid.
// In FILL/COPY modes we can end up creating fake coverage in some cases
// if we clamp scissor to outside the primitive's range, as long as at least one sub-line passes the scissor test.
// The x_range ends up being degenerate, but these fill modes are conservative and generate one pixel of coverage
// anyway.
x_range = clamp(x_range, scissor_x.xx, scissor_x.yy);
x_range.x = max(x_range.x, lo.x);
x_range.y = min(x_range.y, hi.x);
return x_range.x <= x_range.y;
}
#endif
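As a CPU-side sanity check (a hypothetical helper, not part of the renderer), the `madd_32_64` helper combined with the `mix()`-based saturation in `interpolate_xs` can be modeled step for step: keep the low 32 bits of `a * b + c`, and use the high word (product high bits plus the `uaddCarry` carry) only to detect gross overflow.

```cpp
#include <cassert>
#include <cstdint>

// Model of madd_32_64 + the saturation selects in interpolate_xs.
// Note that c is added as an unsigned 32-bit word with carry only,
// exactly as uaddCarry does in the shader.
int32_t madd_saturate(int32_t a, int32_t b, int32_t c)
{
    int64_t product = int64_t(a) * int64_t(b);
    uint32_t lo = uint32_t(uint64_t(product));      // imulExtended low word
    int32_t hi = int32_t(uint64_t(product) >> 32);  // imulExtended high word
    uint64_t sum = uint64_t(lo) + uint32_t(c);      // uaddCarry
    lo = uint32_t(sum);
    hi += int32_t(sum >> 32);                       // add carry bit into hi

    if (hi > 0)
        return INT32_MAX;   // positive overflow -> clamp, as mix(..., 0x7fffffff, ...)
    if (hi < -1)
        return INT32_MIN;   // negative overflow -> clamp, as mix(..., -0x80000000, ...)
    return int32_t(lo);     // in range: keep the low word
}
```

Because only the high word is inspected, this saturation is approximate by design; the later `max_range` test in `interpolate_xs` catches the remaining out-of-range cases conservatively.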


@@ -0,0 +1,145 @@
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#ifndef BLENDER_H_
#define BLENDER_H_
struct BlendInputs
{
u8x4 pixel_color;
u8x4 memory_color;
u8x4 fog_color;
u8x4 blend_color;
u8 shade_alpha;
};
const int BLEND_MODE_1A_PIXEL_COLOR = 0;
const int BLEND_MODE_1A_MEMORY_COLOR = 1;
const int BLEND_MODE_1A_BLEND_COLOR = 2;
const int BLEND_MODE_1A_FOG_COLOR = 3;
const int BLEND_MODE_1B_PIXEL_ALPHA = 0;
const int BLEND_MODE_1B_FOG_ALPHA = 1;
const int BLEND_MODE_1B_SHADE_ALPHA = 2;
const int BLEND_MODE_1B_ZERO = 3;
const int BLEND_MODE_2A_PIXEL_COLOR = 0;
const int BLEND_MODE_2A_MEMORY_COLOR = 1;
const int BLEND_MODE_2A_BLEND_COLOR = 2;
const int BLEND_MODE_2A_FOG_COLOR = 3;
const int BLEND_MODE_2B_INV_PIXEL_ALPHA = 0;
const int BLEND_MODE_2B_MEMORY_ALPHA = 1;
const int BLEND_MODE_2B_ONE = 2;
const int BLEND_MODE_2B_ZERO = 3;
u8x3 blender(BlendInputs inputs, u8x4 blend_modes,
bool force_blend, bool blend_en, bool color_on_coverage, bool coverage_wrap, u8x2 blend_shift,
bool final_cycle)
{
u8x3 rgb1;
switch (int(blend_modes.z))
{
case BLEND_MODE_2A_PIXEL_COLOR: rgb1 = inputs.pixel_color.rgb; break;
case BLEND_MODE_2A_MEMORY_COLOR: rgb1 = inputs.memory_color.rgb; break;
case BLEND_MODE_2A_BLEND_COLOR: rgb1 = inputs.blend_color.rgb; break;
case BLEND_MODE_2A_FOG_COLOR: rgb1 = inputs.fog_color.rgb; break;
}
if (final_cycle)
{
if (color_on_coverage && !coverage_wrap)
return rgb1;
}
u8x3 rgb0;
switch (int(blend_modes.x))
{
case BLEND_MODE_1A_PIXEL_COLOR: rgb0 = inputs.pixel_color.rgb; break;
case BLEND_MODE_1A_MEMORY_COLOR: rgb0 = inputs.memory_color.rgb; break;
case BLEND_MODE_1A_BLEND_COLOR: rgb0 = inputs.blend_color.rgb; break;
case BLEND_MODE_1A_FOG_COLOR: rgb0 = inputs.fog_color.rgb; break;
}
if (final_cycle)
{
if (!blend_en || (blend_modes.y == BLEND_MODE_1B_PIXEL_ALPHA &&
blend_modes.w == BLEND_MODE_2B_INV_PIXEL_ALPHA &&
inputs.pixel_color.a == U8_C(0xff)))
{
return rgb0;
}
}
u8 a0;
u8 a1;
switch (int(blend_modes.y))
{
case BLEND_MODE_1B_PIXEL_ALPHA: a0 = inputs.pixel_color.a; break;
case BLEND_MODE_1B_FOG_ALPHA: a0 = inputs.fog_color.a; break;
case BLEND_MODE_1B_SHADE_ALPHA: a0 = inputs.shade_alpha; break;
case BLEND_MODE_1B_ZERO: a0 = U8_C(0); break;
}
switch (int(blend_modes.w))
{
case BLEND_MODE_2B_INV_PIXEL_ALPHA: a1 = ~a0 & U8_C(0xff); break;
case BLEND_MODE_2B_MEMORY_ALPHA: a1 = inputs.memory_color.a; break;
case BLEND_MODE_2B_ONE: a1 = U8_C(0xff); break;
case BLEND_MODE_2B_ZERO: a1 = U8_C(0); break;
}
a0 >>= U8_C(3);
a1 >>= U8_C(3);
if (blend_modes.w == BLEND_MODE_2B_MEMORY_ALPHA)
{
a0 = (a0 >> blend_shift.x) & U8_C(0x3c);
a1 = (a1 >> blend_shift.y) | U8_C(3);
}
i16x3 blended = i16x3(rgb0) * i16(a0) + i16x3(rgb1) * (i16(a1) + I16_C(1));
if (!final_cycle || force_blend)
{
rgb0 = u8x3(blended >> I16_C(5));
}
else
{
// Serious funk here. Somehow the RDP implemented a divider to deal with weighted average.
// Typically relevant when using blender shifters from interpenetrating Z mode.
// Under normal conditions, this is implemented as a straight integer divider, but
// for edge cases, we need a look-up table. The results make no sense.
int blend_sum = (int(a0) >> 2) + (int(a1) >> 2) + 1;
blended >>= I16_C(2);
blended &= I16_C(0x7ff);
rgb0.r = u8(texelFetch(uBlenderDividerLUT, (blend_sum << 11) | blended.x).x);
rgb0.g = u8(texelFetch(uBlenderDividerLUT, (blend_sum << 11) | blended.y).x);
rgb0.b = u8(texelFetch(uBlenderDividerLUT, (blend_sum << 11) | blended.z).x);
}
return rgb0 & U8_C(0xff);
}
#endif
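A scalar sketch of the `force_blend` path above (hypothetical helper names; only the pixel-alpha / inverse-pixel-alpha mode selection is hard-coded here): the blend factors are the top 5 bits of alpha, `rgb1` is weighted by `a1 + 1`, and normalization is a plain `>> 5` rather than the divider LUT.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of one channel of the blend equation when force_blend
// is set, with BLEND_MODE_1B_PIXEL_ALPHA and BLEND_MODE_2B_INV_PIXEL_ALPHA.
int blend_force(int rgb0, int rgb1, int pixel_alpha)
{
    int a0 = (pixel_alpha >> 3) & 0x1f;           // pixel alpha, 5-bit
    int a1 = ((~pixel_alpha & 0xff) >> 3) & 0x1f; // inverse pixel alpha, 5-bit
    int blended = rgb0 * a0 + rgb1 * (a1 + 1);
    return (blended >> 5) & 0xff;
}
```

Note that with pixel alpha 0xff this yields `(rgb0 * 31 + rgb1) >> 5`, not `rgb0` exactly; that inexactness is why `blender()` early-outs with `rgb0` in the final cycle when the modes are pixel-alpha / inv-pixel-alpha and pixel alpha is 0xff.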


@@ -0,0 +1,78 @@
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#ifndef CLAMPING_H_
#define CLAMPING_H_
#if SMALL_TYPES && 0
// This path is buggy on RADV LLVM, disabled for the time being.
i16x4 clamp_9bit_notrunc(i16x4 color)
{
// [-256, -129] should clamp to 0xff; subtracting 0x80 underflows back into positive numbers.
// [-128, -1] should clamp to 0.
color -= I16_C(0x80);
// Sign-extend to 9-bit.
color <<= I16_C(7);
color >>= I16_C(7);
color += I16_C(0x80);
return clamp(color, i16x4(0), i16x4(0xff));
}
#else
i16x4 clamp_9bit_notrunc(ivec4 color)
{
// [-256, -129] should clamp to 0xff; subtracting 0x80 underflows back into positive numbers.
// [-128, -1] should clamp to 0.
color -= 0x80;
// Sign-extend to 9-bit.
color = bitfieldExtract(color, 0, 9);
color += 0x80;
return i16x4(clamp(color, ivec4(0), ivec4(0xff)));
}
#endif
u8x4 clamp_9bit(i16x4 color)
{
return u8x4(clamp_9bit_notrunc(color));
}
int clamp_9bit(int color)
{
return clamp(bitfieldExtract(color - 0x80, 0, 9) + 0x80, 0, 0xff);
}
// Returns 18-bit UNORM depth.
int clamp_z(int z)
{
// Similar to RGBA, we reserve an extra bit to deal with overflow and underflow.
z -= (1 << 17);
z <<= (31 - 18);
z >>= (31 - 18);
z += (1 << 17);
// [0x00000, 0x3ffff] maps to self.
// [0x40000, 0x5ffff] maps to 0x3ffff.
// [0x60000, 0x7ffff] maps to 0.
return clamp(z, 0, 0x3ffff);
}
#endif
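The scalar `clamp_9bit` above can be mirrored on the CPU (hypothetical helper name); GLSL's signed `bitfieldExtract(v, 0, 9)` becomes a mask plus a two's-complement fixup.

```cpp
#include <algorithm>
#include <cassert>

// CPU model of the scalar clamp_9bit: bias down by 0x80, sign-extend
// the low 9 bits, bias back up, then clamp to [0, 0xff].
int clamp_9bit_model(int color)
{
    int v = (color - 0x80) & 0x1ff; // take low 9 bits
    if (v & 0x100)
        v -= 0x200;                 // sign-extend bit 8
    return std::clamp(v + 0x80, 0, 0xff);
}
```

This reproduces the documented behavior: values in [-256, -129] wrap up to 0xff, values in [-128, -1] clamp to 0, and in-range values pass through.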


@@ -0,0 +1,33 @@
#version 450
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
layout(local_size_x_id = 0) in;
layout(set = 0, binding = 0, std430) writeonly buffer ClearIndirectBuffer
{
uvec4 indirects[];
};
void main()
{
indirects[gl_GlobalInvocationID.x] = uvec4(0, 1, 1, 0);
}
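The `uvec4(0, 1, 1, 0)` pattern lines up with a `VkDispatchIndirectCommand` (x = 0 groups, y = z = 1) padded to 16 bytes; resetting to this value leaves an empty but valid indirect dispatch. A CPU-side sketch of that layout assumption (the struct and padding interpretation here are illustrative, not taken from the renderer):

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout of one element of the indirect dispatch buffer:
// VkDispatchIndirectCommand (x, y, z) plus one padding/aux word,
// matching the uvec4 stride used by the clear shader.
struct IndirectClear
{
    uint32_t x, y, z, pad;
};

constexpr IndirectClear cleared = { 0, 1, 1, 0 };
```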


@@ -0,0 +1,34 @@
#version 450
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
layout(local_size_x_id = 0) in;
layout(set = 0, binding = 0, std430) writeonly buffer ToClear
{
uint elems[];
} mask_ram;
void main()
{
mask_ram.elems[gl_GlobalInvocationID.x] = 0u;
}


@@ -0,0 +1,42 @@
#version 450
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
layout(local_size_x_id = 0) in;
layout(constant_id = 1) const int PAGE_STRIDE = 256;
layout(set = 0, binding = 0, std430) writeonly buffer SSBO
{
uint write_mask[];
};
layout(set = 1, binding = 0, std140) uniform UBO
{
uvec4 offsets[1024];
};
void main()
{
uint offset = offsets[gl_WorkGroupID.x >> 2u][gl_WorkGroupID.x & 3u];
offset *= PAGE_STRIDE;
write_mask[offset + gl_LocalInvocationIndex] = 0u;
}
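The `offsets[i >> 2][i & 3]` indexing works around std140's 16-byte array element stride: four scalar page offsets are packed per `uvec4` instead of one `uint` per (padded) element. A hypothetical CPU-side model of the unpack:

```cpp
#include <cassert>
#include <cstdint>

// Model of the std140 unpack in the clear shader: scalar index i maps
// to offsets[i >> 2][i & 3] when uints are packed four per uvec4.
uint32_t unpack_offset(const uint32_t (*offsets)[4], uint32_t i)
{
    return offsets[i >> 2][i & 3];
}
```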


@@ -0,0 +1,284 @@
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#ifndef COMBINER_H_
#define COMBINER_H_
#include "clamping.h"
ivec4 special_expand(ivec4 value)
{
// Special sign-extend without explicit clamp.
return bitfieldExtract(value - 0x80, 0, 9) + 0x80;
}
i16x4 combiner_equation(ivec4 a, ivec4 b, ivec4 c, ivec4 d)
{
// Sign-extend multiplier to 9 bits.
c = bitfieldExtract(c, 0, 9);
// Need this to deal with very specific 9-bit sign bits ...
a = special_expand(a);
b = special_expand(b);
d = special_expand(d);
ivec4 color = (a - b) * c;
color += 0x80;
return i16x4(color >> 8) + i16x4(d);
}
struct CombinerInputs
{
u8x4 constant_muladd;
u8x4 constant_mulsub;
u8x4 constant_mul;
u8x4 constant_add;
u8x4 shade;
i16x4 combined;
i16x4 texel0;
i16x4 texel1;
i16 lod_frac;
i16 noise;
};
const int RGB_MULADD_COMBINED = 0;
const int RGB_MULADD_TEXEL0 = 1;
const int RGB_MULADD_TEXEL1 = 2;
const int RGB_MULADD_SHADE = 4;
const int RGB_MULADD_ONE = 6;
const int RGB_MULADD_NOISE = 7;
const int RGB_MULSUB_COMBINED = 0;
const int RGB_MULSUB_TEXEL0 = 1;
const int RGB_MULSUB_TEXEL1 = 2;
const int RGB_MULSUB_SHADE = 4;
const int RGB_MULSUB_K4 = 7;
const int RGB_MUL_COMBINED = 0;
const int RGB_MUL_TEXEL0 = 1;
const int RGB_MUL_TEXEL1 = 2;
const int RGB_MUL_SHADE = 4;
const int RGB_MUL_COMBINED_ALPHA = 7;
const int RGB_MUL_TEXEL0_ALPHA = 8;
const int RGB_MUL_TEXEL1_ALPHA = 9;
const int RGB_MUL_SHADE_ALPHA = 11;
const int RGB_MUL_LOD_FRAC = 13;
const int RGB_MUL_K5 = 15;
const int RGB_ADD_COMBINED = 0;
const int RGB_ADD_TEXEL0 = 1;
const int RGB_ADD_TEXEL1 = 2;
const int RGB_ADD_SHADE = 4;
const int RGB_ADD_ONE = 6;
const int ALPHA_ADDSUB_COMBINED = 0;
const int ALPHA_ADDSUB_TEXEL0_ALPHA = 1;
const int ALPHA_ADDSUB_TEXEL1_ALPHA = 2;
const int ALPHA_ADDSUB_SHADE_ALPHA = 4;
const int ALPHA_ADDSUB_ONE = 6;
const int ALPHA_MUL_LOD_FRAC = 0;
const int ALPHA_MUL_TEXEL0_ALPHA = 1;
const int ALPHA_MUL_TEXEL1_ALPHA = 2;
const int ALPHA_MUL_SHADE_ALPHA = 4;
ivec4 select_muladd(CombinerInputs inputs, int selector_rgb, int selector_alpha)
{
ivec3 res;
switch (selector_rgb)
{
case RGB_MULADD_COMBINED: res = inputs.combined.rgb; break;
case RGB_MULADD_TEXEL0: res = inputs.texel0.rgb; break;
case RGB_MULADD_TEXEL1: res = inputs.texel1.rgb; break;
case RGB_MULADD_SHADE: res = inputs.shade.rgb; break;
case RGB_MULADD_NOISE: res = ivec3(inputs.noise); break;
case RGB_MULADD_ONE: res = ivec3(0x100); break;
default: res = inputs.constant_muladd.rgb; break;
}
int alpha;
switch (selector_alpha)
{
case ALPHA_ADDSUB_COMBINED: alpha = inputs.combined.a; break;
case ALPHA_ADDSUB_TEXEL0_ALPHA: alpha = inputs.texel0.a; break;
case ALPHA_ADDSUB_TEXEL1_ALPHA: alpha = inputs.texel1.a; break;
case ALPHA_ADDSUB_SHADE_ALPHA: alpha = inputs.shade.a; break;
case ALPHA_ADDSUB_ONE: alpha = 0x100; break;
default: alpha = inputs.constant_muladd.a; break;
}
return ivec4(res, alpha);
}
ivec4 select_mulsub(CombinerInputs inputs, int selector_rgb, int selector_alpha)
{
ivec3 res;
switch (selector_rgb)
{
case RGB_MULSUB_COMBINED: res = inputs.combined.rgb; break;
case RGB_MULSUB_TEXEL0: res = inputs.texel0.rgb; break;
case RGB_MULSUB_TEXEL1: res = inputs.texel1.rgb; break;
case RGB_MULSUB_SHADE: res = inputs.shade.rgb; break;
case RGB_MULSUB_K4: res = ivec3((int(inputs.constant_mulsub.g) << 8) | inputs.constant_mulsub.b); break;
default: res = inputs.constant_mulsub.rgb; break;
}
int alpha;
switch (selector_alpha)
{
case ALPHA_ADDSUB_COMBINED: alpha = inputs.combined.a; break;
case ALPHA_ADDSUB_TEXEL0_ALPHA: alpha = inputs.texel0.a; break;
case ALPHA_ADDSUB_TEXEL1_ALPHA: alpha = inputs.texel1.a; break;
case ALPHA_ADDSUB_SHADE_ALPHA: alpha = inputs.shade.a; break;
case ALPHA_ADDSUB_ONE: alpha = 0x100; break;
default: alpha = inputs.constant_mulsub.a; break;
}
return ivec4(res, alpha);
}
ivec4 select_mul(CombinerInputs inputs, int selector_rgb, int selector_alpha)
{
ivec3 res;
switch (selector_rgb)
{
case RGB_MUL_COMBINED: res = inputs.combined.rgb; break;
case RGB_MUL_COMBINED_ALPHA: res = inputs.combined.aaa; break;
case RGB_MUL_TEXEL0: res = inputs.texel0.rgb; break;
case RGB_MUL_TEXEL1: res = inputs.texel1.rgb; break;
case RGB_MUL_SHADE: res = inputs.shade.rgb; break;
case RGB_MUL_TEXEL0_ALPHA: res = inputs.texel0.aaa; break;
case RGB_MUL_TEXEL1_ALPHA: res = inputs.texel1.aaa; break;
case RGB_MUL_SHADE_ALPHA: res = inputs.shade.aaa; break;
case RGB_MUL_LOD_FRAC: res = ivec3(inputs.lod_frac); break;
case RGB_MUL_K5: res = ivec3((int(inputs.constant_mul.g) << 8) | inputs.constant_mul.b); break;
default: res = inputs.constant_mul.rgb; break;
}
int alpha;
switch (selector_alpha)
{
case ALPHA_MUL_LOD_FRAC: alpha = inputs.lod_frac; break;
case ALPHA_MUL_TEXEL0_ALPHA: alpha = inputs.texel0.a; break;
case ALPHA_MUL_TEXEL1_ALPHA: alpha = inputs.texel1.a; break;
case ALPHA_MUL_SHADE_ALPHA: alpha = inputs.shade.a; break;
default: alpha = inputs.constant_mul.a; break;
}
return ivec4(res, alpha);
}
ivec4 select_add(CombinerInputs inputs, int selector_rgb, int selector_alpha)
{
ivec3 res;
switch (selector_rgb)
{
case RGB_ADD_COMBINED: res = inputs.combined.rgb; break;
case RGB_ADD_TEXEL0: res = inputs.texel0.rgb; break;
case RGB_ADD_TEXEL1: res = inputs.texel1.rgb; break;
case RGB_ADD_SHADE: res = inputs.shade.rgb; break;
case RGB_ADD_ONE: res = ivec3(0x100); break;
default: res = inputs.constant_add.rgb; break;
}
int alpha;
switch (selector_alpha)
{
case ALPHA_ADDSUB_COMBINED: alpha = inputs.combined.a; break;
case ALPHA_ADDSUB_TEXEL0_ALPHA: alpha = inputs.texel0.a; break;
case ALPHA_ADDSUB_TEXEL1_ALPHA: alpha = inputs.texel1.a; break;
case ALPHA_ADDSUB_SHADE_ALPHA: alpha = inputs.shade.a; break;
case ALPHA_ADDSUB_ONE: alpha = 0x100; break;
default: alpha = inputs.constant_add.a; break;
}
return ivec4(res, alpha);
}
i16x4 combiner_cycle0(CombinerInputs inputs, u8x4 combiner_inputs_rgb, u8x4 combiner_inputs_alpha, int alpha_dith,
int coverage, bool cvg_times_alpha, bool alpha_cvg_select, bool alpha_test, out u8 alpha_test_reference)
{
ivec4 muladd = select_muladd(inputs, combiner_inputs_rgb.x, combiner_inputs_alpha.x);
ivec4 mulsub = select_mulsub(inputs, combiner_inputs_rgb.y, combiner_inputs_alpha.y);
ivec4 mul = select_mul(inputs, combiner_inputs_rgb.z, combiner_inputs_alpha.z);
ivec4 add = select_add(inputs, combiner_inputs_rgb.w, combiner_inputs_alpha.w);
i16x4 combined = combiner_equation(muladd, mulsub, mul, add);
if (alpha_test)
{
int clamped_alpha = clamp_9bit(combined.a);
// Expands 0xff to 0x100 to avoid having to divide by 2**n - 1.
int expanded_alpha = clamped_alpha + ((clamped_alpha + 1) >> 8);
if (alpha_cvg_select)
{
int modulated_alpha;
if (cvg_times_alpha)
modulated_alpha = (expanded_alpha * coverage + 4) >> 3;
else
modulated_alpha = coverage << 5;
expanded_alpha = modulated_alpha;
}
else
expanded_alpha += alpha_dith;
alpha_test_reference = u8(clamp(expanded_alpha, 0, 0xff));
}
else
alpha_test_reference = U8_C(0);
return combined;
}
i16x4 combiner_cycle1(CombinerInputs inputs, u8x4 combiner_inputs_rgb, u8x4 combiner_inputs_alpha, int alpha_dith,
inout int coverage, bool cvg_times_alpha, bool alpha_cvg_select)
{
ivec4 muladd = select_muladd(inputs, combiner_inputs_rgb.x, combiner_inputs_alpha.x);
ivec4 mulsub = select_mulsub(inputs, combiner_inputs_rgb.y, combiner_inputs_alpha.y);
ivec4 mul = select_mul(inputs, combiner_inputs_rgb.z, combiner_inputs_alpha.z);
ivec4 add = select_add(inputs, combiner_inputs_rgb.w, combiner_inputs_alpha.w);
i16x4 combined = combiner_equation(muladd, mulsub, mul, add);
combined = clamp_9bit_notrunc(combined);
// Expands 0xff to 0x100 to avoid having to divide by 2**n - 1.
int expanded_alpha = combined.a + ((combined.a + 1) >> 8);
int modulated_alpha;
if (cvg_times_alpha)
{
modulated_alpha = (expanded_alpha * coverage + 4) >> 3;
coverage = modulated_alpha >> 5;
}
else
modulated_alpha = coverage << 5;
if (alpha_cvg_select)
expanded_alpha = modulated_alpha;
else
expanded_alpha += alpha_dith;
combined.a = i16(clamp(expanded_alpha, 0, 0xff));
return combined;
}
#endif
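For one channel, `combiner_equation` plus `special_expand` can be checked on the CPU (hypothetical helper names): sign-extend the multiplier to 9 bits, apply the special 9-bit expansion to the other operands, then compute `((a - b) * c + 0x80) >> 8 + d`.

```cpp
#include <cassert>

// Scalar model of combiner_equation for one channel:
// (a - b) * c with rounding and >> 8 scaling, plus d,
// using the same 9-bit sign handling as the shader.
int combiner_eq(int a, int b, int c, int d)
{
    auto sext9 = [](int v) { v &= 0x1ff; return (v & 0x100) ? v - 0x200 : v; }; // bitfieldExtract(v, 0, 9)
    auto special_expand = [&](int v) { return sext9(v - 0x80) + 0x80; };
    c = sext9(c);
    a = special_expand(a);
    b = special_expand(b);
    d = special_expand(d);
    return (((a - b) * c + 0x80) >> 8) + d;
}
```

For example, the "one" input 0x100 multiplied by 0x80 (roughly 0.5) yields 0x80, and 0xff * 0xff yields 254, matching the 0x100-denominator fixed-point convention used throughout the combiner.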


@@ -0,0 +1,81 @@
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#ifndef COVERAGE_H_
#define COVERAGE_H_
#include "data_structures.h"
const int SUBPIXELS_LOG2 = 2;
const int SUBPIXELS = 1 << SUBPIXELS_LOG2;
u8 compute_coverage(u16x4 xleft, u16x4 xright, int x)
{
u16x4 xshift = u16x4(0, 4, 2, 6) + (u16(x) << U16_C(3));
bvec4 clip_lo_x01 = lessThan(xshift, xleft.xxyy);
bvec4 clip_lo_x23 = lessThan(xshift, xleft.zzww);
bvec4 clip_hi_x01 = greaterThanEqual(xshift, xright.xxyy);
bvec4 clip_hi_x23 = greaterThanEqual(xshift, xright.zzww);
u8x4 clip_x0 = u8x4(clip_lo_x01) | u8x4(clip_hi_x01);
u8x4 clip_x1 = u8x4(clip_lo_x23) | u8x4(clip_hi_x23);
u8x4 clip_x = clip_x0 * u8x4(1, 2, 4, 8) + clip_x1 * u8x4(16, 32, 64, 128);
u8 clip_coverage = (clip_x.x | clip_x.y) | (clip_x.z | clip_x.w);
return ~clip_coverage & U8_C(0xff);
}
const int COVERAGE_CLAMP = 0;
const int COVERAGE_WRAP = 1;
const int COVERAGE_ZAP = 2;
const int COVERAGE_SAVE = 3;
int blend_coverage(int coverage, int memory_coverage, bool blend_en, int mode)
{
int res = 0;
switch (mode)
{
case COVERAGE_CLAMP:
{
if (blend_en)
res = min(7, memory_coverage + coverage); // image_read_en reads memory coverage; otherwise it's 7.
else
res = (coverage - 1) & 7;
break;
}
case COVERAGE_WRAP:
res = (coverage + memory_coverage) & 7;
break;
case COVERAGE_ZAP:
res = 7;
break;
case COVERAGE_SAVE:
res = memory_coverage;
break;
}
return res;
}
#endif
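The four memory-coverage blend modes above are small enough to sanity-check on the CPU; a C sketch of `blend_coverage` under the same semantics (enum names mirror the GLSL constants):

```c
/* C port of the RDP coverage blend modes from coverage.h. */
enum { COVERAGE_CLAMP, COVERAGE_WRAP, COVERAGE_ZAP, COVERAGE_SAVE };

static int blend_coverage(int coverage, int memory_coverage, int blend_en, int mode)
{
    switch (mode)
    {
    case COVERAGE_CLAMP:
        /* With blending, saturate at full coverage (7); otherwise the
         * new coverage minus one, wrapped to 3 bits. */
        if (blend_en)
            return memory_coverage + coverage > 7 ? 7 : memory_coverage + coverage;
        return (coverage - 1) & 7;
    case COVERAGE_WRAP:
        return (coverage + memory_coverage) & 7;
    case COVERAGE_ZAP:
        return 7;
    case COVERAGE_SAVE:
        return memory_coverage;
    }
    return 0;
}
```

Note the CLAMP mode without `blend_en`: `(coverage - 1) & 7` means a zero input wraps to 7, matching the shader bit-for-bit.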


@@ -0,0 +1,347 @@
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#ifndef DATA_STRUCTURES_H_
#define DATA_STRUCTURES_H_
// Data structures which are supposed to match up with rdp_data_structures.hpp.
// A little dirty to duplicate like this, but it's non-trivial to share headers with C++,
// especially when we need to deal with small integer types.
const int TRIANGLE_SETUP_FLIP_BIT = 1 << 0;
const int TRIANGLE_SETUP_DO_OFFSET_BIT = 1 << 1;
const int TRIANGLE_SETUP_SKIP_XFRAC_BIT = 1 << 2;
const int TRIANGLE_SETUP_INTERLACE_FIELD_BIT = 1 << 3;
const int TRIANGLE_SETUP_INTERLACE_KEEP_ODD_BIT = 1 << 4;
const int TRIANGLE_SETUP_DISABLE_UPSCALING_BIT = 1 << 5;
const int TRIANGLE_SETUP_NATIVE_LOD_BIT = 1 << 6;
const int TRIANGLE_SETUP_FILL_COPY_RASTER_BIT = 1 << 7;
const int RASTERIZATION_INTERLACE_FIELD_BIT = 1 << 0;
const int RASTERIZATION_INTERLACE_KEEP_ODD_BIT = 1 << 1;
const int RASTERIZATION_AA_BIT = 1 << 2;
const int RASTERIZATION_PERSPECTIVE_CORRECT_BIT = 1 << 3;
const int RASTERIZATION_TLUT_BIT = 1 << 4;
const int RASTERIZATION_TLUT_TYPE_BIT = 1 << 5;
const int RASTERIZATION_CVG_TIMES_ALPHA_BIT = 1 << 6;
const int RASTERIZATION_ALPHA_CVG_SELECT_BIT = 1 << 7;
const int RASTERIZATION_MULTI_CYCLE_BIT = 1 << 8;
const int RASTERIZATION_TEX_LOD_ENABLE_BIT = 1 << 9;
const int RASTERIZATION_SHARPEN_LOD_ENABLE_BIT = 1 << 10;
const int RASTERIZATION_DETAIL_LOD_ENABLE_BIT = 1 << 11;
const int RASTERIZATION_FILL_BIT = 1 << 12;
const int RASTERIZATION_COPY_BIT = 1 << 13;
const int RASTERIZATION_SAMPLE_MODE_BIT = 1 << 14;
const int RASTERIZATION_ALPHA_TEST_BIT = 1 << 15;
const int RASTERIZATION_ALPHA_TEST_DITHER_BIT = 1 << 16;
const int RASTERIZATION_SAMPLE_MID_TEXEL_BIT = 1 << 17;
const int RASTERIZATION_USES_TEXEL0_BIT = 1 << 18;
const int RASTERIZATION_USES_TEXEL1_BIT = 1 << 19;
const int RASTERIZATION_USES_LOD_BIT = 1 << 20;
const int RASTERIZATION_USES_PIPELINED_TEXEL1_BIT = 1 << 21;
const int RASTERIZATION_CONVERT_ONE_BIT = 1 << 22;
const int RASTERIZATION_BILERP_0_BIT = 1 << 23;
const int RASTERIZATION_BILERP_1_BIT = 1 << 24;
const int RASTERIZATION_NEED_NOISE_DUAL_BIT = 1 << 25;
const int RASTERIZATION_UPSCALING_LOG2_BIT_OFFSET = 26;
const int RASTERIZATION_NEED_NOISE_BIT = 1 << 28;
const int RASTERIZATION_USE_STATIC_TEXTURE_SIZE_FORMAT_BIT = 1 << 29;
const int RASTERIZATION_USE_SPECIALIZATION_CONSTANT_BIT = 1 << 30;
const int DEPTH_BLEND_DEPTH_TEST_BIT = 1 << 0;
const int DEPTH_BLEND_DEPTH_UPDATE_BIT = 1 << 1;
const int DEPTH_BLEND_FORCE_BLEND_BIT = 1 << 3;
const int DEPTH_BLEND_IMAGE_READ_ENABLE_BIT = 1 << 4;
const int DEPTH_BLEND_COLOR_ON_COVERAGE_BIT = 1 << 5;
const int DEPTH_BLEND_MULTI_CYCLE_BIT = 1 << 6;
const int DEPTH_BLEND_AA_BIT = 1 << 7;
const int DEPTH_BLEND_DITHER_ENABLE_BIT = 1 << 8;
struct TriangleSetupMem
{
int xh, xm, xl;
mem_i16 yh, ym;
int dxhdy, dxmdy, dxldy;
mem_i16 yl; mem_u8 flags; mem_u8 tile;
};
#if SMALL_TYPES
#define TriangleSetup TriangleSetupMem
#else
struct TriangleSetup
{
int xh, xm, xl;
i16 yh, ym;
int dxhdy, dxmdy, dxldy;
i16 yl; u8 flags; u8 tile;
};
#endif
struct AttributeSetupMem
{
ivec4 rgba;
ivec4 drgba_dx;
ivec4 drgba_de;
ivec4 drgba_dy;
ivec4 stzw;
ivec4 dstzw_dx;
ivec4 dstzw_de;
ivec4 dstzw_dy;
};
#define AttributeSetup AttributeSetupMem
struct SpanSetupMem
{
ivec4 rgba;
ivec4 stzw;
mem_u16x4 xleft;
mem_u16x4 xright;
int interpolation_base_x;
int start_x;
int end_x;
mem_i16 lodlength;
mem_u16 valid_line;
};
#if SMALL_TYPES
#define SpanSetup SpanSetupMem
#else
struct SpanSetup
{
ivec4 rgba;
ivec4 stzw;
u16x4 xleft;
u16x4 xright;
int interpolation_base_x;
int start_x;
int end_x;
i16 lodlength;
u16 valid_line;
};
#endif
struct SpanInfoOffsetsMem
{
int offset;
int ylo;
int yhi;
int padding;
};
#define SpanInfoOffsets SpanInfoOffsetsMem
struct DerivedSetupMem
{
mem_u8x4 constant_muladd0;
mem_u8x4 constant_mulsub0;
mem_u8x4 constant_mul0;
mem_u8x4 constant_add0;
mem_u8x4 constant_muladd1;
mem_u8x4 constant_mulsub1;
mem_u8x4 constant_mul1;
mem_u8x4 constant_add1;
mem_u8x4 fog_color;
mem_u8x4 blend_color;
uint fill_color;
mem_u16 dz;
mem_u8 dz_compressed;
mem_u8 min_lod;
mem_i16x4 factors;
};
#if SMALL_TYPES
#define DerivedSetup DerivedSetupMem
#else
struct DerivedSetup
{
u8x4 constant_muladd0;
u8x4 constant_mulsub0;
u8x4 constant_mul0;
u8x4 constant_add0;
u8x4 constant_muladd1;
u8x4 constant_mulsub1;
u8x4 constant_mul1;
u8x4 constant_add1;
u8x4 fog_color;
u8x4 blend_color;
uint fill_color;
u16 dz;
u8 dz_compressed;
u8 min_lod;
i16x4 factors;
};
#endif
#define ScissorStateMem ivec4
struct ScissorState
{
int xlo, ylo, xhi, yhi;
};
const int TILE_INFO_CLAMP_S_BIT = 1 << 0;
const int TILE_INFO_MIRROR_S_BIT = 1 << 1;
const int TILE_INFO_CLAMP_T_BIT = 1 << 2;
const int TILE_INFO_MIRROR_T_BIT = 1 << 3;
struct TileInfoMem
{
uint slo;
uint shi;
uint tlo;
uint thi;
uint offset;
uint stride;
mem_u8 fmt;
mem_u8 size;
mem_u8 palette;
mem_u8 mask_s;
mem_u8 shift_s;
mem_u8 mask_t;
mem_u8 shift_t;
mem_u8 flags;
};
#if SMALL_TYPES
#define TileInfo TileInfoMem
#else
struct TileInfo
{
uint slo;
uint shi;
uint tlo;
uint thi;
uint offset;
uint stride;
u8 fmt;
u8 size;
u8 palette;
u8 mask_s;
u8 shift_s;
u8 mask_t;
u8 shift_t;
u8 flags;
};
#endif
struct StaticRasterizationStateMem
{
mem_u8x4 combiner_inputs_rgb0;
mem_u8x4 combiner_inputs_alpha0;
mem_u8x4 combiner_inputs_rgb1;
mem_u8x4 combiner_inputs_alpha1;
uint flags;
int dither;
int texture_size;
int texture_fmt;
};
#if SMALL_TYPES
#define StaticRasterizationState StaticRasterizationStateMem
#else
struct StaticRasterizationState
{
u8x4 combiner_inputs_rgb0;
u8x4 combiner_inputs_alpha0;
u8x4 combiner_inputs_rgb1;
u8x4 combiner_inputs_alpha1;
uint flags;
int dither;
int texture_size;
int texture_fmt;
};
#endif
struct DepthBlendStateMem
{
mem_u8x4 blend_modes0;
mem_u8x4 blend_modes1;
uint flags;
mem_u8 coverage_mode;
mem_u8 z_mode;
mem_u8 padding0;
mem_u8 padding1;
};
#if SMALL_TYPES
#define DepthBlendState DepthBlendStateMem
#else
struct DepthBlendState
{
u8x4 blend_modes0;
u8x4 blend_modes1;
uint flags;
u8 coverage_mode;
u8 z_mode;
u8 padding0;
u8 padding1;
};
#endif
struct InstanceIndicesMem
{
mem_u8x4 static_depth_tmem;
mem_u8x4 other;
mem_u8 tile_infos[8];
};
struct TMEMInstance16Mem
{
mem_u16 elems[2048];
};
struct TMEMInstance8Mem
{
mem_u8 elems[4096];
};
struct ShadedData
{
u8x4 combined;
int z_dith;
u8 coverage_count;
u8 shade_alpha;
};
const int COVERAGE_FILL_BIT = 0x40;
const int COVERAGE_COPY_BIT = 0x20;
struct GlobalFBInfo
{
int dx_shift;
int dx_mask;
int fb_size;
uint base_primitive_index;
};
#endif


@@ -0,0 +1,134 @@
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#ifndef DATA_STRUCTURES_BUFFERS_H_
#define DATA_STRUCTURES_BUFFERS_H_
#include "data_structures.h"
layout(set = 0, binding = 0, std430) buffer VRAM32
{
uint data[];
} vram32;
layout(set = 0, binding = 0, std430) buffer VRAM16
{
mem_u16 data[];
} vram16;
layout(set = 0, binding = 0, std430) buffer VRAM8
{
mem_u8 data[];
} vram8;
layout(set = 0, binding = 1, std430) buffer HiddenVRAM
{
mem_u8 data[];
} hidden_vram;
layout(set = 0, binding = 2, std430) readonly buffer TMEM16
{
TMEMInstance16Mem instances[];
} tmem16;
layout(set = 0, binding = 2, std430) readonly buffer TMEM8
{
TMEMInstance8Mem instances[];
} tmem8;
layout(set = 1, binding = 0, std430) readonly buffer TriangleSetupBuffer
{
TriangleSetupMem elems[];
} triangle_setup;
#include "load_triangle_setup.h"
layout(set = 1, binding = 1, std430) readonly buffer AttributeSetupBuffer
{
AttributeSetupMem elems[];
} attribute_setup;
#include "load_attribute_setup.h"
layout(set = 1, binding = 2, std430) readonly buffer DerivedSetupBuffer
{
DerivedSetupMem elems[];
} derived_setup;
#include "load_derived_setup.h"
layout(set = 1, binding = 3, std430) readonly buffer ScissorStateBuffer
{
ScissorStateMem elems[];
} scissor_state;
#include "load_scissor_state.h"
layout(set = 1, binding = 4, std430) readonly buffer StaticRasterStateBuffer
{
StaticRasterizationStateMem elems[];
} static_raster_state;
#include "load_static_raster_state.h"
layout(set = 1, binding = 5, std430) readonly buffer DepthBlendStateBuffer
{
DepthBlendStateMem elems[];
} depth_blend_state;
#include "load_depth_blend_state.h"
layout(set = 1, binding = 6, std430) readonly buffer StateIndicesBuffer
{
InstanceIndicesMem elems[];
} state_indices;
layout(set = 1, binding = 7, std430) readonly buffer TileInfoBuffer
{
TileInfoMem elems[];
} tile_infos;
#include "load_tile_info.h"
layout(set = 1, binding = 8, std430) readonly buffer SpanSetups
{
SpanSetupMem elems[];
} span_setups;
#include "load_span_setup.h"
layout(set = 1, binding = 9, std430) readonly buffer SpanInfoOffsetBuffer
{
SpanInfoOffsetsMem elems[];
} span_offsets;
#include "load_span_offsets.h"
layout(set = 1, binding = 10) uniform utextureBuffer uBlenderDividerLUT;
layout(set = 1, binding = 11, std430) readonly buffer TileBinning
{
uint elems[];
} tile_binning;
layout(set = 1, binding = 12, std430) readonly buffer TileBinningCoarse
{
uint elems[];
} tile_binning_coarse;
layout(set = 2, binding = 0, std140) uniform GlobalConstants
{
GlobalFBInfo fb_info;
} global_constants;
#endif


@@ -0,0 +1,151 @@
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#ifndef DEBUG_H_
#define DEBUG_H_
#if defined(DEBUG_ENABLE) && DEBUG_ENABLE
#include "debug_channel.h"
const uint CODE_ASSERT_EQUAL = 0;
const uint CODE_ASSERT_NOT_EQUAL = 1;
const uint CODE_ASSERT_LESS_THAN = 2;
const uint CODE_ASSERT_LESS_THAN_EQUAL = 3;
const uint CODE_GENERIC = 4;
const uint CODE_HEX = 5;
void ASSERT_EQUAL_(int line, int a, int b)
{
if (a != b)
add_debug_message(CODE_ASSERT_EQUAL, gl_GlobalInvocationID, ivec3(line, a, b));
}
void ASSERT_NOT_EQUAL_(int line, int a, int b)
{
if (a == b)
add_debug_message(CODE_ASSERT_NOT_EQUAL, gl_GlobalInvocationID, ivec3(line, a, b));
}
void ASSERT_LESS_THAN_(int line, int a, int b)
{
if (a >= b)
add_debug_message(CODE_ASSERT_LESS_THAN, gl_GlobalInvocationID, ivec3(line, a, b));
}
void ASSERT_LESS_THAN_EQUAL_(int line, int a, int b)
{
if (a > b)
add_debug_message(CODE_ASSERT_LESS_THAN_EQUAL, gl_GlobalInvocationID, ivec3(line, a, b));
}
void ASSERT_EQUAL_(int line, uint a, uint b)
{
if (a != b)
add_debug_message(CODE_ASSERT_EQUAL, gl_GlobalInvocationID, ivec3(line, a, b));
}
void ASSERT_NOT_EQUAL_(int line, uint a, uint b)
{
if (a == b)
add_debug_message(CODE_ASSERT_NOT_EQUAL, gl_GlobalInvocationID, ivec3(line, a, b));
}
void ASSERT_LESS_THAN_(int line, uint a, uint b)
{
if (a >= b)
add_debug_message(CODE_ASSERT_LESS_THAN, gl_GlobalInvocationID, ivec3(line, a, b));
}
void ASSERT_LESS_THAN_EQUAL_(int line, uint a, uint b)
{
if (a > b)
add_debug_message(CODE_ASSERT_LESS_THAN_EQUAL, gl_GlobalInvocationID, ivec3(line, a, b));
}
void GENERIC_MESSAGE_(int line)
{
add_debug_message(CODE_GENERIC, gl_GlobalInvocationID, line);
}
void GENERIC_MESSAGE_(int line, uint v)
{
add_debug_message(CODE_GENERIC, gl_GlobalInvocationID, uvec2(line, v));
}
void GENERIC_MESSAGE_(int line, uvec2 v)
{
add_debug_message(CODE_GENERIC, gl_GlobalInvocationID, uvec3(line, v));
}
void GENERIC_MESSAGE_(int line, uvec3 v)
{
add_debug_message(CODE_GENERIC, gl_GlobalInvocationID, uvec4(line, v));
}
void HEX_MESSAGE_(int line)
{
add_debug_message(CODE_HEX, gl_GlobalInvocationID, line);
}
void HEX_MESSAGE_(int line, uint v)
{
add_debug_message(CODE_HEX, gl_GlobalInvocationID, uvec2(line, v));
}
void HEX_MESSAGE_(int line, uvec2 v)
{
add_debug_message(CODE_HEX, gl_GlobalInvocationID, uvec3(line, v));
}
void HEX_MESSAGE_(int line, uvec3 v)
{
add_debug_message(CODE_HEX, gl_GlobalInvocationID, uvec4(line, v));
}
#define ASSERT_EQUAL(a, b) ASSERT_EQUAL_(__LINE__, a, b)
#define ASSERT_NOT_EQUAL(a, b) ASSERT_NOT_EQUAL_(__LINE__, a, b)
#define ASSERT_LESS_THAN(a, b) ASSERT_LESS_THAN_(__LINE__, a, b)
#define ASSERT_LESS_THAN_EQUAL(a, b) ASSERT_LESS_THAN_EQUAL_(__LINE__, a, b)
#define GENERIC_MESSAGE0() GENERIC_MESSAGE_(__LINE__)
#define GENERIC_MESSAGE1(a) GENERIC_MESSAGE_(__LINE__, a)
#define GENERIC_MESSAGE2(a, b) GENERIC_MESSAGE_(__LINE__, uvec2(a, b))
#define GENERIC_MESSAGE3(a, b, c) GENERIC_MESSAGE_(__LINE__, uvec3(a, b, c))
#define HEX_MESSAGE0() HEX_MESSAGE_(__LINE__)
#define HEX_MESSAGE1(a) HEX_MESSAGE_(__LINE__, a)
#define HEX_MESSAGE2(a, b) HEX_MESSAGE_(__LINE__, uvec2(a, b))
#define HEX_MESSAGE3(a, b, c) HEX_MESSAGE_(__LINE__, uvec3(a, b, c))
#else
#define ASSERT_EQUAL(a, b)
#define ASSERT_NOT_EQUAL(a, b)
#define ASSERT_LESS_THAN(a, b)
#define ASSERT_LESS_THAN_EQUAL(a, b)
#define GENERIC_MESSAGE0()
#define GENERIC_MESSAGE1(a)
#define GENERIC_MESSAGE2(a, b)
#define GENERIC_MESSAGE3(a, b, c)
#define HEX_MESSAGE0()
#define HEX_MESSAGE1(a)
#define HEX_MESSAGE2(a, b)
#define HEX_MESSAGE3(a, b, c)
#endif
#endif


@@ -0,0 +1,149 @@
#version 450
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#if SUBGROUP
#extension GL_KHR_shader_subgroup_basic : require
#extension GL_KHR_shader_subgroup_vote : require
#extension GL_KHR_shader_subgroup_ballot : require
#extension GL_KHR_shader_subgroup_arithmetic : require
#endif
#include "small_types.h"
layout(local_size_x_id = 3, local_size_y_id = 4) in;
#include "noise.h"
#include "debug.h"
#include "data_structures_buffers.h"
#include "memory_interfacing.h"
layout(set = 0, binding = 3, std430) readonly buffer ColorBuffer
{
mem_u8x4 elems[];
} color;
layout(set = 0, binding = 3, std430) readonly buffer ColorRawBuffer
{
uint elems[];
} raw_color;
layout(set = 0, binding = 4, std430) readonly buffer DepthBuffer
{
int elems[];
} depth;
layout(set = 0, binding = 5, std430) readonly buffer ShadeAlpha
{
mem_u8 elems[];
} shade_alpha;
layout(set = 0, binding = 6, std430) readonly buffer Coverage
{
mem_i8 elems[];
} coverage;
layout(std430, set = 0, binding = 7) readonly buffer TileInstanceOffset
{
uint elems[];
} tile_instance_offsets;
layout(push_constant, std430) uniform Registers
{
uint fb_addr_index;
uint fb_depth_addr_index;
uint fb_width;
uint fb_height;
uint group_mask;
} registers;
layout(constant_id = 5) const int MAX_PRIMITIVES = 256;
layout(constant_id = 6) const int MAX_WIDTH = 1024;
const int TILE_BINNING_STRIDE = MAX_PRIMITIVES / 32;
const int MAX_TILES_X = MAX_WIDTH / int(gl_WorkGroupSize.x);
// Overall architecture of the tiling is from RetroWarp.
void main()
{
int x = int(gl_GlobalInvocationID.x);
int y = int(gl_GlobalInvocationID.y);
ivec2 tile = ivec2(gl_WorkGroupID.xy);
int linear_tile = tile.x + tile.y * MAX_TILES_X;
int linear_tile_base = linear_tile * TILE_BINNING_STRIDE;
uint coarse_binned = tile_binning_coarse.elems[linear_tile] & registers.group_mask;
if (coarse_binned == 0u)
return;
init_tile(gl_GlobalInvocationID.xy,
registers.fb_width, registers.fb_height,
registers.fb_addr_index, registers.fb_depth_addr_index);
while (coarse_binned != 0u)
{
int mask_index = findLSB(coarse_binned);
coarse_binned &= ~uint(1 << mask_index);
uint tile_instance = tile_instance_offsets.elems[linear_tile_base + mask_index];
uint binned = tile_binning.elems[linear_tile_base + mask_index];
while (binned != 0u)
{
int i = findLSB(binned);
binned &= ~uint(1 << i);
uint primitive_index = uint(i + 32 * mask_index);
uint index = tile_instance * (gl_WorkGroupSize.x * gl_WorkGroupSize.y) + gl_LocalInvocationIndex;
int coverage = int(coverage.elems[index]);
if (coverage >= 0)
{
if ((coverage & COVERAGE_FILL_BIT) != 0)
{
fill_color(derived_setup.elems[primitive_index].fill_color);
}
else if ((coverage & COVERAGE_COPY_BIT) != 0)
{
uint word = raw_color.elems[index];
copy_pipeline(word, primitive_index);
}
else
{
ShadedData shaded;
shaded.combined = u8x4(color.elems[index]);
shaded.z_dith = depth.elems[index];
shaded.shade_alpha = u8(shade_alpha.elems[index]);
shaded.coverage_count = u8(coverage);
depth_blend(x, y, primitive_index, shaded);
}
}
tile_instance++;
}
}
finish_tile(gl_GlobalInvocationID.xy,
registers.fb_width, registers.fb_height,
registers.fb_addr_index, registers.fb_depth_addr_index);
}
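The dispatch loop above walks each set bit of the 32-bit binning masks with `findLSB`, then clears that bit before continuing. A C sketch of the same pattern using a count-trailing-zeros builtin (the helper name and out-array interface are mine):

```c
#include <stdint.h>

/* Visit the set bits of a mask in ascending order, like the
 * findLSB loop in the shader's main(). Writes bit indices to
 * `out` and returns how many were found.
 * `mask &= mask - 1` clears the lowest set bit, equivalent to
 * the shader's `binned &= ~uint(1 << i)`. */
static int for_each_set_bit(uint32_t mask, int *out)
{
    int n = 0;
    while (mask != 0u)
    {
        int bit = __builtin_ctz(mask); /* index of lowest set bit */
        mask &= mask - 1u;
        out[n++] = bit;
    }
    return n;
}
```

The zero-mask early-out matters: `__builtin_ctz(0)` is undefined in C (and `findLSB(0)` returns -1 in GLSL), so both loops test the mask before extracting a bit.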


@@ -0,0 +1,146 @@
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#ifndef DEPTH_TEST_H_
#define DEPTH_TEST_H_
#include "z_encode.h"
const int Z_MODE_OPAQUE = 0;
const int Z_MODE_INTERPENETRATING = 1;
const int Z_MODE_TRANSPARENT = 2;
const int Z_MODE_DECAL = 3;
int combine_dz(int dz)
{
// Find largest POT which is <= dz.
if (dz != 0)
dz = 1 << findMSB(dz);
return dz;
}
bool depth_test(int z, int dz, int dz_compressed,
u16 current_depth, u8 current_dz,
inout int coverage_count, int current_coverage_count,
bool z_compare, int z_mode,
bool force_blend, bool aa_enable,
out bool blend_en, out bool coverage_wrap, out u8x2 blend_shift)
{
bool depth_pass;
if (z_compare)
{
int memory_z = z_decompress(current_depth);
int memory_dz = dz_decompress(current_dz);
int precision_factor = (int(current_depth) >> 11) & 0xf;
bool coplanar = false;
blend_shift.x = u8(clamp(dz_compressed - current_dz, 0, 4));
blend_shift.y = u8(clamp(current_dz - dz_compressed, 0, 4));
if (precision_factor < 3)
{
if (memory_dz != 0x8000)
memory_dz = max(memory_dz << 1, 16 >> precision_factor);
else
{
coplanar = true;
memory_dz = 0xffff;
}
}
int combined_dz = combine_dz(dz | memory_dz);
int combined_dz_interpenetrate = combined_dz;
combined_dz <<= 3;
bool farther = coplanar || ((z + combined_dz) >= memory_z);
bool overflow = (coverage_count + current_coverage_count) >= 8;
blend_en = force_blend || (!overflow && aa_enable && farther);
coverage_wrap = overflow;
depth_pass = false;
bool max_z = memory_z == 0x3ffff;
bool front = z < memory_z;
int z_closest_possible = z - combined_dz;
bool nearer = coplanar || (z_closest_possible <= memory_z);
switch (z_mode)
{
case Z_MODE_OPAQUE:
{
// The OPAQUE mode is normal less-than.
// However, if z is sufficiently close enough to memory Z, we assume that we have the same surface
// and we should simply increment coverage (blend_en).
// If we overflow coverage, it is clear that we have a different surface, and here we should only
// consider pure in-front test and overwrite coverage.
depth_pass = max_z || (overflow ? front : nearer);
break;
}
case Z_MODE_INTERPENETRATING:
{
// This one is ... interesting as it affects coverage.
if (!front || !farther || !overflow)
{
// If there is no decal-like intersect, treat this as normal opaque mode.
depth_pass = max_z || (overflow ? front : nearer);
}
else
{
				// Modify coverage based on how far apart the current and memory Z values are (exact hardware rationale unclear).
combined_dz_interpenetrate = dz_compress(combined_dz_interpenetrate & 0xffff);
int cvg_coeff = ((memory_z >> combined_dz_interpenetrate) - (z >> combined_dz_interpenetrate)) & 0xf;
coverage_count = min((cvg_coeff * coverage_count) >> 3, 8);
depth_pass = true;
}
break;
}
case Z_MODE_TRANSPARENT:
{
depth_pass = front || max_z;
break;
}
case Z_MODE_DECAL:
{
// Decals pass if |z - memory_z| <= max(dz, memory_dz).
depth_pass = farther && nearer && !max_z;
break;
}
}
}
else
{
blend_shift.x = u8(0);
blend_shift.y = u8(min(0xf - dz_compressed, 4));
bool overflow = (coverage_count + current_coverage_count) >= 8;
blend_en = force_blend || (!overflow && aa_enable);
coverage_wrap = overflow;
depth_pass = true;
}
return depth_pass;
}
#endif
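`combine_dz` above isolates the largest power of two that is <= dz via `findMSB`. The equivalent in C, using a count-leading-zeros builtin to find the highest set bit:

```c
/* C port of combine_dz: keep only the highest set bit of dz,
 * i.e. the largest power of two <= dz. For a 32-bit value,
 * the highest bit index is 31 - clz(dz), matching findMSB. */
static int combine_dz(int dz)
{
    if (dz != 0)
        dz = 1 << (31 - __builtin_clz((unsigned)dz));
    return dz;
}
```

As in the shader, zero must be special-cased: both `findMSB(0)` and `__builtin_clz(0)` have no valid bit index to return.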


@@ -0,0 +1,70 @@
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#ifndef DITHER_H_
#define DITHER_H_
const u8 dither_matrices[2][16] = u8[][](
u8[](U8_C(0), U8_C(6), U8_C(1), U8_C(7), U8_C(4), U8_C(2), U8_C(5), U8_C(3), U8_C(3), U8_C(5), U8_C(2), U8_C(4), U8_C(7), U8_C(1), U8_C(6), U8_C(0)),
u8[](U8_C(0), U8_C(4), U8_C(1), U8_C(5), U8_C(4), U8_C(0), U8_C(5), U8_C(1), U8_C(3), U8_C(7), U8_C(2), U8_C(6), U8_C(7), U8_C(3), U8_C(6), U8_C(2)));
u8x3 rgb_dither(ivec3 orig_rgb, int dith)
{
ivec3 rgb_dith = (ivec3(dith) >> ivec3(0, 3, 6)) & 7;
ivec3 rgb = mix((orig_rgb & 0xf8) + 8, ivec3(255), greaterThan(orig_rgb, ivec3(247)));
ivec3 replace_sign = (rgb_dith - (orig_rgb & 7)) >> 31;
ivec3 dither_diff = rgb - orig_rgb;
rgb = orig_rgb + (dither_diff & replace_sign);
return u8x3(rgb & 0xff);
}
void dither_coefficients(int x, int y, int dither_mode_rgb, int dither_mode_alpha, out int rgb_dither, out int alpha_dither)
{
const int DITHER_SPLAT = (1 << 0) | (1 << 3) | (1 << 6);
if (dither_mode_rgb < 2)
rgb_dither = int(dither_matrices[dither_mode_rgb][(y & 3) * 4 + (x & 3)]) * DITHER_SPLAT;
else if (dither_mode_rgb == 2)
rgb_dither = noise_get_dither_color();
else
rgb_dither = 0;
if (dither_mode_alpha == 3)
alpha_dither = 0;
else
{
if (dither_mode_alpha == 2)
{
alpha_dither = noise_get_dither_alpha();
}
else
{
alpha_dither = dither_mode_rgb >= 2 ?
int(dither_matrices[dither_mode_rgb & 1][(y & 3) * 4 + (x & 3)]) : (rgb_dither & 7);
if (dither_mode_alpha == 1)
alpha_dither = ~alpha_dither & 7;
}
}
}
#endif
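The branchless select inside `rgb_dither` is easier to follow one channel at a time. A scalar C sketch of the per-channel logic (the function name is mine; the shader does all three channels at once with ivec3):

```c
/* One channel of rgb_dither: round the value up to the next
 * multiple of 8 (capped at 255 for inputs above 247) only when
 * the 3-bit dither threshold is below the channel's low 3 bits.
 * The select is branchless: an arithmetic right shift by 31
 * turns the comparison into an all-ones or all-zeros mask. */
static int rgb_dither_channel(int orig, int dith)
{
    int up = orig > 247 ? 255 : (orig & 0xf8) + 8;  /* rounded-up target */
    int replace = (dith - (orig & 7)) >> 31;        /* -1 if dith < low bits, else 0 */
    return (orig + ((up - orig) & replace)) & 0xff;
}
```

So a channel of 10 (low bits 2) rounds up to 16 when the dither value is 0 or 1, and stays at 10 for dither values 2 through 7; values already on a multiple of 8 never move.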


@@ -0,0 +1,107 @@
#version 450
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#include "small_types.h"
layout(local_size_x = 16, local_size_y = 8) in;
// Copies VRAM into a texture which is then consumed by VI scanout.
layout(set = 0, binding = 0, rgba8ui) uniform writeonly uimage2D uAAInput;
layout(set = 0, binding = 1, std430) readonly buffer RDRAM16
{
mem_u16 elems[];
} vram16;
layout(set = 0, binding = 1, std430) readonly buffer RDRAM32
{
uint elems[];
} vram32;
layout(set = 0, binding = 2, std430) readonly buffer HiddenRDRAM
{
mem_u8 elems[];
} hidden_vram;
layout(push_constant, std430) uniform Registers
{
int fb_offset;
int fb_width;
ivec2 offset;
ivec2 resolution;
} registers;
layout(constant_id = 0) const int RDRAM_SIZE = 0;
const int RDRAM_MASK_8 = RDRAM_SIZE - 1;
const int RDRAM_MASK_16 = RDRAM_MASK_8 >> 1;
const int RDRAM_MASK_32 = RDRAM_MASK_16 >> 1;
layout(constant_id = 2) const int SCALING_LOG2 = 0;
const int SCALING_FACTOR = 1 << SCALING_LOG2;
#include "vi_status.h"
uvec4 fetch_color(ivec2 coord)
{
ivec2 slice2d = coord & (SCALING_FACTOR - 1);
coord >>= SCALING_LOG2;
int slice = slice2d.y * SCALING_FACTOR + slice2d.x;
uvec4 color;
if (FMT_RGBA8888)
{
int linear_coord = coord.y * registers.fb_width + coord.x + registers.fb_offset;
linear_coord &= RDRAM_MASK_32;
linear_coord += slice * (RDRAM_SIZE >> 2);
uint word = uint(vram32.elems[linear_coord]);
color = (uvec4(word) >> uvec4(24, 16, 8, 5)) & uvec4(0xff, 0xff, 0xff, 7);
}
else if (FMT_RGBA5551)
{
int linear_coord = coord.y * registers.fb_width + coord.x + registers.fb_offset;
linear_coord &= RDRAM_MASK_16;
linear_coord += slice * (RDRAM_SIZE >> 1);
uint word = uint(vram16.elems[linear_coord ^ 1]);
uint hidden_word = uint(hidden_vram.elems[linear_coord]);
uint r = (word >> 8u) & 0xf8u;
uint g = (word >> 3u) & 0xf8u;
uint b = (word << 2u) & 0xf8u;
uint a = ((word & 1u) << 2u) | hidden_word;
color = uvec4(r, g, b, a);
}
else
color = uvec4(0);
if (!FETCH_AA)
color.a = 7u;
return color;
}
void main()
{
if (any(greaterThanEqual(gl_GlobalInvocationID.xy, registers.resolution)))
return;
ivec2 coord = ivec2(gl_GlobalInvocationID.xy) + registers.offset;
uvec4 col = fetch_color(coord);
imageStore(uAAInput, ivec2(gl_GlobalInvocationID.xy), col);
}
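The RGBA5551 path in `fetch_color` packs each 5-bit channel into the top bits of an 8-bit channel and rebuilds the 3-bit coverage from the low color bit plus the 2-bit hidden-RDRAM value. A C sketch of that unpacking (function name is mine):

```c
#include <stdint.h>

/* Unpack one RGBA5551 framebuffer word as the fetch shader does:
 * 5-bit r/g/b land in bits 7..3 of each output channel, and the
 * 3-bit coverage ("alpha") is the word's low bit shifted up,
 * ORed with the 2-bit hidden-RDRAM byte. */
static void unpack_rgba5551(uint16_t word, uint8_t hidden, uint8_t out[4])
{
    out[0] = (word >> 8) & 0xf8;         /* r: word bits 15..11 */
    out[1] = (word >> 3) & 0xf8;         /* g: word bits 10..6  */
    out[2] = (word << 2) & 0xf8;         /* b: word bits 5..1   */
    out[3] = ((word & 1) << 2) | hidden; /* 3-bit coverage      */
}
```

A fully set word with full hidden bits yields 0xf8 per channel and coverage 7, matching the shader's shift-and-mask expressions exactly.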


@@ -0,0 +1,10 @@
#ifndef FB_FORMATS_H_
#define FB_FORMATS_H_
const int FB_FMT_I4 = 0;
const int FB_FMT_I8 = 1;
const int FB_FMT_RGBA5551 = 2;
const int FB_FMT_IA88 = 3;
const int FB_FMT_RGBA8888 = 4;
#endif


@@ -0,0 +1,32 @@
#version 450
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
void main()
{
if (gl_VertexIndex == 0)
gl_Position = vec4(-1.0, -1.0, 0.0, 1.0);
else if (gl_VertexIndex == 1)
gl_Position = vec4(-1.0, +3.0, 0.0, 1.0);
else
gl_Position = vec4(+3.0, -1.0, 0.0, 1.0);
}
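The vertex shader above emits the classic single oversized triangle that covers the whole viewport without a vertex buffer. A quick Python check (hypothetical helper, using barycentric sign tests) that the three clip-space vertices, with w = 1, cover the entire [-1, 1] NDC square:

```python
def fullscreen_triangle_covers(x, y):
    """Test whether NDC point (x, y) lies inside the oversized triangle
    (-1,-1), (-1,+3), (+3,-1) emitted by the vertex shader."""
    ax, ay = -1.0, -1.0
    bx, by = -1.0, 3.0
    cx, cy = 3.0, -1.0

    def edge(px, py, qx, qy):
        # Signed area of the edge (p->q) against the sample point.
        return (qx - px) * (y - py) - (qy - py) * (x - px)

    e0 = edge(ax, ay, bx, by)
    e1 = edge(bx, by, cx, cy)
    e2 = edge(cx, cy, ax, ay)
    # Inside if all edge functions agree in sign (either winding).
    return (e0 >= 0 and e1 >= 0 and e2 >= 0) or \
           (e0 <= 0 and e1 <= 0 and e2 <= 0)
```

All four NDC corners pass, so a single draw of three vertices rasterizes every pixel exactly once.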

View File

@@ -0,0 +1,255 @@
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#ifndef INTERPOLATION_H_
#define INTERPOLATION_H_
#include "data_structures.h"
#include "clamping.h"
#include "perspective.h"
u8x4 interpolate_rgba(ivec4 rgba, ivec4 drgba_dx, ivec4 drgba_dy, int dx, int coverage)
{
rgba += ((drgba_dx & ~0x1f) >> SCALING_LOG2) * dx;
// RGBA is interpolated to 9-bit. The last bit is used to deal with clamping.
// Slight underflow below 0 is clamped to 0 and slight overflow above 0xff is clamped to 0xff.
// Keep 2 sign bits of precision before we complete the centroid interpolation.
i16x4 snapped_rgba = i16x4(rgba >> 14);
// Centroid clipping is based on the first coverage bit, and we interpolate at the first subpixel in scanline order.
// With this layout we can just use findLSB to get correct result.
// 0x01 0x02
// 0x04 0x08
// 0x10 0x20
// 0x40 0x80
int first_coverage = findLSB(coverage);
i16 yoff = i16(first_coverage >> 1);
i16 xoff = i16((first_coverage & 1) << 1) + (yoff & I16_C(1));
snapped_rgba <<= I16_C(2 + SCALING_LOG2);
snapped_rgba += xoff * i16x4(drgba_dx >> 14) + yoff * i16x4(drgba_dy >> 14);
snapped_rgba >>= I16_C(4 + SCALING_LOG2);
return clamp_9bit(snapped_rgba);
}
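The coverage-bit layout documented in the comment above is chosen so that `findLSB` directly yields the first covered subpixel in scanline order. A Python sketch (hypothetical helper name) of that centroid-offset derivation:

```python
def centroid_offsets(coverage):
    """Given a nonzero 8-bit subpixel coverage mask laid out as
    0x01 0x02 / 0x04 0x08 / 0x10 0x20 / 0x40 0x80,
    return (xoff, yoff) of the first covered subpixel in scanline order,
    matching the shader's findLSB-based computation."""
    first = (coverage & -coverage).bit_length() - 1  # findLSB equivalent
    yoff = first >> 1
    xoff = ((first & 1) << 1) + (yoff & 1)           # odd rows are staggered
    return xoff, yoff
```

For example, full coverage selects the top-left subpixel, while a mask with only bit 3 set selects the second subpixel of the second row.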
void interpolate_st_copy(SpanSetup span, ivec4 dstzw_dx, int x, bool perspective, bool flip,
out ivec2 st, out int s_offset)
{
int dx = flip ? (x - span.start_x) : (span.end_x - x);
// For the copy pipe, we should duplicate pixels when scaling; there is no filtering we can (or should!) do.
dx >>= SCALING_LOG2;
// Snap DX to where we perform interpolation (once per N output pixels).
int snapped_dx = dx & global_constants.fb_info.dx_mask;
s_offset = dx - snapped_dx;
int lerp_dx = (dx >> global_constants.fb_info.dx_shift) * (flip ? 1 : -1);
ivec3 stw = span.stzw.xyw + (dstzw_dx.xyw & ~0x1f) * lerp_dx;
if (perspective)
{
bool st_overflow;
st = perspective_divide(stw >> 16, st_overflow);
}
else
st = no_perspective_divide(stw >> 16);
}
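`interpolate_st_copy()` above snaps `dx` so the copy pipe only interpolates once per N output pixels and carries the remainder as `s_offset`. A Python sketch of that snapping, under the assumption (not stated in this file) that `fb_info.dx_mask` equals `~((1 << dx_shift) - 1)`:

```python
def snap_copy_dx(dx, dx_shift):
    """Snap the horizontal step for the copy pipe so interpolation happens
    once per (1 << dx_shift) output pixels, mirroring interpolate_st_copy().
    Assumes dx_mask == ~((1 << dx_shift) - 1)."""
    dx_mask = ~((1 << dx_shift) - 1)
    snapped_dx = dx & dx_mask       # position of the last interpolation point
    s_offset = dx - snapped_dx      # pixels to step past that point
    lerp_dx = dx >> dx_shift        # number of interpolation steps taken
    return lerp_dx, s_offset
```

With `dx_shift = 2`, pixel 13 takes three interpolation steps and then offsets one texel past the snapped position.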
ivec2 interpolate_st_single(ivec4 stzw, ivec4 dstzw_dx, int dx, bool perspective)
{
ivec3 stw = stzw.xyw + ((dstzw_dx.xyw & ~0x1f) >> SCALING_LOG2) * dx;
stw >>= 16;
ivec2 st;
if (perspective)
{
bool st_overflow;
st = perspective_divide(stw, st_overflow);
}
else
st = no_perspective_divide(stw);
return st;
}
void interpolate_stz(ivec4 stzw, ivec4 dstzw_dx, ivec4 dstzw_dy, int dx, int coverage, bool perspective, bool uses_lod,
int flip_direction, out ivec2 st, out ivec2 st_dx, out ivec2 st_dy, out int z, inout bool st_overflow)
{
ivec3 stw = stzw.xyw + ((dstzw_dx.xyw & ~0x1f) >> SCALING_LOG2) * dx;
ivec3 stw_dx, stw_dy;
if (uses_lod)
{
stw_dx = stw + flip_direction * ((dstzw_dx.xyw & ~0x1f) >> SCALING_LOG2);
if (SCALING_FACTOR > 1)
stw_dy = stw + abs(flip_direction) * ((dstzw_dy.xyw & ~0x7fff) >> SCALING_LOG2);
else
stw_dy = stw + ((dstzw_dy.xyw & ~0x7fff) >> SCALING_LOG2);
}
if (perspective)
{
st = perspective_divide(stw >> 16, st_overflow);
if (uses_lod)
{
st_dx = perspective_divide(stw_dx >> 16, st_overflow);
st_dy = perspective_divide(stw_dy >> 16, st_overflow);
}
}
else
{
st = no_perspective_divide(stw >> 16);
if (uses_lod)
{
st_dx = no_perspective_divide(stw_dx >> 16);
st_dy = no_perspective_divide(stw_dy >> 16);
}
}
// Ensure that interpolation snaps as we expect on every "main" pixel;
// for subpixels, interpolate with a quantized step factor.
z = stzw.z + dstzw_dx.z * (dx >> SCALING_LOG2) + (dstzw_dx.z >> SCALING_LOG2) * (dx & (SCALING_FACTOR - 1));
int snapped_z = z >> 10;
int first_coverage = findLSB(coverage);
int yoff = first_coverage >> 1;
int xoff = ((first_coverage & 1) << 1) + (yoff & 1);
snapped_z <<= 2 + SCALING_LOG2;
snapped_z += xoff * (dstzw_dx.z >> 10) + yoff * (dstzw_dy.z >> 10);
snapped_z >>= 5 + SCALING_LOG2;
z = clamp_z(snapped_z);
}
#if 0
u8x4 interpolate_rgba(TriangleSetup setup, AttributeSetup attr, int x, int y, int coverage)
{
bool do_offset = (setup.flags & TRIANGLE_SETUP_DO_OFFSET_BIT) != 0;
int y_interpolation_base = int(setup.yh) >> 2;
int xh = setup.xh + (y - y_interpolation_base) * (setup.dxhdy << 2);
ivec4 drgba_diff = ivec4(0);
// In do_offset mode, varyings are latched at last subpixel line instead of first (for some reason).
if (do_offset)
{
xh += 3 * setup.dxhdy;
ivec4 drgba_deh = attr.drgba_de & ~0x1ff;
ivec4 drgba_dyh = attr.drgba_dy & ~0x1ff;
drgba_diff = drgba_deh - (drgba_deh >> 2) - drgba_dyh + (drgba_dyh >> 2);
}
int base_x = xh >> 16;
int xfrac = (xh >> 8) & 0xff;
ivec4 rgba = attr.rgba;
rgba += attr.drgba_de * (y - y_interpolation_base);
rgba = ((rgba & ~0x1ff) + drgba_diff - xfrac * ((attr.drgba_dx >> 8) & ~1)) & ~0x3ff;
rgba += (attr.drgba_dx & ~0x1f) * (x - base_x);
// RGBA is interpolated to 9-bit. The last bit is used to deal with clamping.
// Slight underflow below 0 is clamped to 0 and slight overflow above 0xff is clamped to 0xff.
// Keep 2 sign bits of precision before we complete the centroid interpolation.
i16x4 snapped_rgba = i16x4(rgba >> 14);
// Centroid clipping is based on the first coverage bit, and we interpolate at the first subpixel in scanline order.
// FWIW, Angrylion has a very different coverage bit assignment, but we need this layout to avoid an awkward LUT.
// With this layout we can just use findLSB instead.
// 0x01 0x02
// 0x04 0x08
// 0x10 0x20
// 0x40 0x80
int first_coverage = findLSB(coverage);
i16 yoff = i16(first_coverage >> 1);
i16 xoff = i16((first_coverage & 1) << 1) + (yoff & I16_C(1));
snapped_rgba <<= I16_C(2);
snapped_rgba += xoff * i16x4(attr.drgba_dx >> 14) + yoff * i16x4(attr.drgba_dy >> 14);
snapped_rgba >>= I16_C(4);
return clamp_9bit(snapped_rgba);
}
ivec3 interpolate_stw(TriangleSetup setup, AttributeSetup attr, int x, int y)
{
bool do_offset = (setup.flags & TRIANGLE_SETUP_DO_OFFSET_BIT) != 0;
int y_interpolation_base = int(setup.yh) >> 2;
int xh = setup.xh + (y - y_interpolation_base) * (setup.dxhdy << 2);
ivec3 dstw_diff = ivec3(0);
// In do_offset mode, varyings are latched at last subpixel line instead of first (for some reason).
if (do_offset)
{
xh += 3 * setup.dxhdy;
ivec3 dstw_deh = attr.dstzw_de.xyw & ~0x1ff;
ivec3 dstw_dyh = attr.dstzw_dy.xyw & ~0x1ff;
dstw_diff = dstw_deh - (dstw_deh >> 2) - dstw_dyh + (dstw_dyh >> 2);
}
int base_x = xh >> 16;
int xfrac = (xh >> 8) & 0xff;
ivec3 stw = attr.stzw.xyw;
stw += attr.dstzw_de.xyw * (y - y_interpolation_base);
stw = ((stw & ~0x1ff) + dstw_diff - xfrac * ((attr.dstzw_dx.xyw >> 8) & ~1)) & ~0x3ff;
stw += (attr.dstzw_dx.xyw & ~0x1f) * (x - base_x);
ivec3 snapped_stw = stw >> 16;
return snapped_stw;
}
int interpolate_z(TriangleSetup setup, AttributeSetup attr, int x, int y, int coverage)
{
bool do_offset = (setup.flags & TRIANGLE_SETUP_DO_OFFSET_BIT) != 0;
int y_interpolation_base = int(setup.yh) >> 2;
int xh = setup.xh + (y - y_interpolation_base) * (setup.dxhdy << 2);
int dzdiff = 0;
// In do_offset mode, varyings are latched at last subpixel line instead of first (for some reason).
if (do_offset)
{
xh += 3 * setup.dxhdy;
int dzdeh = attr.dstzw_de.z & ~0x1ff;
int dzdyh = attr.dstzw_dy.z & ~0x1ff;
dzdiff = dzdeh - (dzdeh >> 2) - dzdyh + (dzdyh >> 2);
}
int base_x = xh >> 16;
int xfrac = (xh >> 8) & 0xff;
int z = attr.stzw.z;
z += attr.dstzw_de.z * (y - y_interpolation_base);
z = ((z & ~0x1ff) + dzdiff - xfrac * ((attr.dstzw_dx.z >> 8) & ~1)) & ~0x3ff;
z += attr.dstzw_dx.z * (x - base_x);
int snapped_z = z >> 10;
int first_coverage = findLSB(coverage);
int yoff = first_coverage >> 1;
int xoff = ((first_coverage & 1) << 1) + (yoff & 1);
snapped_z <<= 2;
snapped_z += xoff * (attr.dstzw_dx.z >> 10) + yoff * (attr.dstzw_dy.z >> 10);
snapped_z >>= 5;
return clamp_z(snapped_z);
}
#endif
#endif

View File

@@ -0,0 +1,31 @@
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#ifndef LOAD_ATTRIBUTE_SETUP_H_
#define LOAD_ATTRIBUTE_SETUP_H_
AttributeSetup load_attribute_setup(uint index)
{
return attribute_setup.elems[index];
}
#endif

View File

@@ -0,0 +1,41 @@
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#ifndef LOAD_DEPTH_BLEND_STATE_H_
#define LOAD_DEPTH_BLEND_STATE_H_
DepthBlendState load_depth_blend_state(uint index)
{
#if SMALL_TYPES
return depth_blend_state.elems[index];
#else
return DepthBlendState(
u8x4(depth_blend_state.elems[index].blend_modes0),
u8x4(depth_blend_state.elems[index].blend_modes1),
depth_blend_state.elems[index].flags,
u8(depth_blend_state.elems[index].coverage_mode),
u8(depth_blend_state.elems[index].z_mode),
u8(0), u8(0));
#endif
}
#endif

View File

@@ -0,0 +1,50 @@
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#ifndef LOAD_DERIVED_SETUP_H_
#define LOAD_DERIVED_SETUP_H_
DerivedSetup load_derived_setup(uint index)
{
#if SMALL_TYPES
return derived_setup.elems[index];
#else
return DerivedSetup(
u8x4(derived_setup.elems[index].constant_muladd0),
u8x4(derived_setup.elems[index].constant_mulsub0),
u8x4(derived_setup.elems[index].constant_mul0),
u8x4(derived_setup.elems[index].constant_add0),
u8x4(derived_setup.elems[index].constant_muladd1),
u8x4(derived_setup.elems[index].constant_mulsub1),
u8x4(derived_setup.elems[index].constant_mul1),
u8x4(derived_setup.elems[index].constant_add1),
u8x4(derived_setup.elems[index].fog_color),
u8x4(derived_setup.elems[index].blend_color),
uint(derived_setup.elems[index].fill_color),
u16(derived_setup.elems[index].dz),
u8(derived_setup.elems[index].dz_compressed),
u8(derived_setup.elems[index].min_lod),
i16x4(derived_setup.elems[index].factors));
#endif
}
#endif

View File

@@ -0,0 +1,32 @@
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#ifndef LOAD_SCISSOR_STATE_H_
#define LOAD_SCISSOR_STATE_H_
ScissorState load_scissor_state(uint index)
{
ivec4 values = scissor_state.elems[index];
return ScissorState(values.x, values.y, values.z, values.w);
}
#endif

View File

@@ -0,0 +1,31 @@
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#ifndef LOAD_SPAN_OFFSETS_H_
#define LOAD_SPAN_OFFSETS_H_
SpanInfoOffsets load_span_offsets(uint index)
{
return span_offsets.elems[index];
}
#endif

View File

@@ -0,0 +1,44 @@
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#ifndef LOAD_SPAN_SETUP_H_
#define LOAD_SPAN_SETUP_H_
SpanSetup load_span_setup(uint index)
{
#if SMALL_TYPES
return span_setups.elems[index];
#else
return SpanSetup(
span_setups.elems[index].rgba,
span_setups.elems[index].stzw,
u16x4(uvec4(span_setups.elems[index].xleft)),
u16x4(uvec4(span_setups.elems[index].xright)),
span_setups.elems[index].interpolation_base_x,
span_setups.elems[index].start_x,
span_setups.elems[index].end_x,
i16(span_setups.elems[index].lodlength),
u16(span_setups.elems[index].valid_line));
#endif
}
#endif

View File

@@ -0,0 +1,42 @@
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#ifndef LOAD_STATIC_RASTER_STATE_H_
#define LOAD_STATIC_RASTER_STATE_H_
StaticRasterizationState load_static_rasterization_state(uint index)
{
#if SMALL_TYPES
return static_raster_state.elems[index];
#else
return StaticRasterizationState(
u8x4(static_raster_state.elems[index].combiner_inputs_rgb0),
u8x4(static_raster_state.elems[index].combiner_inputs_alpha0),
u8x4(static_raster_state.elems[index].combiner_inputs_rgb1),
u8x4(static_raster_state.elems[index].combiner_inputs_alpha1),
static_raster_state.elems[index].flags,
static_raster_state.elems[index].dither,
0, 0);
#endif
}
#endif

View File

@@ -0,0 +1,49 @@
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#ifndef LOAD_TILE_INFO_H_
#define LOAD_TILE_INFO_H_
TileInfo load_tile_info(uint index)
{
#if SMALL_TYPES
return tile_infos.elems[index];
#else
return TileInfo(
tile_infos.elems[index].slo,
tile_infos.elems[index].shi,
tile_infos.elems[index].tlo,
tile_infos.elems[index].thi,
tile_infos.elems[index].offset,
tile_infos.elems[index].stride,
u8(tile_infos.elems[index].fmt),
u8(tile_infos.elems[index].size),
u8(tile_infos.elems[index].palette),
u8(tile_infos.elems[index].mask_s),
u8(tile_infos.elems[index].shift_s),
u8(tile_infos.elems[index].mask_t),
u8(tile_infos.elems[index].shift_t),
u8(tile_infos.elems[index].flags));
#endif
}
#endif

View File

@@ -0,0 +1,46 @@
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#ifndef LOAD_TRIANGLE_SETUP_H_
#define LOAD_TRIANGLE_SETUP_H_
TriangleSetup load_triangle_setup(uint index)
{
#if SMALL_TYPES
return triangle_setup.elems[index];
#else
return TriangleSetup(
triangle_setup.elems[index].xh,
triangle_setup.elems[index].xm,
triangle_setup.elems[index].xl,
i16(triangle_setup.elems[index].yh),
i16(triangle_setup.elems[index].ym),
triangle_setup.elems[index].dxhdy,
triangle_setup.elems[index].dxmdy,
triangle_setup.elems[index].dxldy,
i16(triangle_setup.elems[index].yl),
u8(triangle_setup.elems[index].flags),
u8(triangle_setup.elems[index].tile));
#endif
}
#endif

View File

@@ -0,0 +1,70 @@
#version 450
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
layout(local_size_x_id = 0) in;
layout(constant_id = 1) const int PAGE_STRIDE = 256;
layout(set = 0, binding = 0, std430) buffer RDRAM
{
uint rdram[];
};
layout(set = 0, binding = 1, std430) readonly buffer StagingRDRAM
{
uint staging_rdram[];
};
layout(set = 0, binding = 2, std430) readonly buffer WriteMaskRDRAM
{
uint writemask[];
};
layout(set = 1, binding = 0, std140) uniform UBO
{
uvec4 offsets[1024];
};
void main()
{
uint offset = offsets[gl_WorkGroupID.x >> 2u][gl_WorkGroupID.x & 3u];
offset *= PAGE_STRIDE;
offset += gl_LocalInvocationIndex;
uint mask = writemask[offset];
if (mask == ~0u)
{
return;
}
else if (mask == 0u)
{
uint staging = staging_rdram[offset];
rdram[offset] = staging;
}
else
{
uint word = rdram[offset];
uint staging = staging_rdram[offset];
word = (word & mask) | (staging & ~mask);
rdram[offset] = word;
}
}

View File

@@ -0,0 +1,572 @@
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#ifndef MEMORY_INTERFACING_H_
#define MEMORY_INTERFACING_H_
#include "dither.h"
#include "z_encode.h"
#include "blender.h"
#include "depth_test.h"
#include "coverage.h"
#include "fb_formats.h"
layout(constant_id = 0) const uint RDRAM_SIZE = 0;
layout(constant_id = 7) const int RDRAM_INCOHERENT_SCALING = 0;
const bool RDRAM_INCOHERENT = (RDRAM_INCOHERENT_SCALING & 1) != 0;
const int SCALING_LOG2 = RDRAM_INCOHERENT_SCALING >> 1;
const int SCALING_FACTOR = 1 << SCALING_LOG2;
const bool RDRAM_UNSCALED_WRITE_MASK = RDRAM_INCOHERENT && SCALING_LOG2 == 0;
const bool RDRAM_SCALED_WRITE_MASK = RDRAM_INCOHERENT && SCALING_LOG2 != 0;
const uint RDRAM_MASK_8 = RDRAM_SIZE - 1u;
const uint RDRAM_MASK_16 = RDRAM_MASK_8 >> 1u;
const uint RDRAM_MASK_32 = RDRAM_MASK_8 >> 2u;
layout(constant_id = 1) const int FB_FMT = 0;
layout(constant_id = 2) const bool FB_COLOR_DEPTH_ALIAS = false;
u8x4 current_color;
bool current_color_dirty;
u16 current_depth;
u8 current_dz;
bool current_depth_dirty;
void load_vram_color(uint index, uint slice)
{
switch (FB_FMT)
{
case FB_FMT_I4:
case FB_FMT_I8:
{
index &= RDRAM_MASK_8;
index += slice * RDRAM_SIZE;
u8 word = u8(vram8.data[index ^ 3u]);
current_color = u8x4(word, word, word, u8(hidden_vram.data[index >> 1]));
break;
}
case FB_FMT_RGBA5551:
{
index &= RDRAM_MASK_16;
index += slice * (RDRAM_SIZE >> 1);
uint word = uint(vram16.data[index ^ 1u]);
uvec3 rgb = uvec3(word >> 8u, word >> 3u, word << 2u) & 0xf8u;
current_color = u8x4(rgb, (u8(hidden_vram.data[index]) << U8_C(5)) | u8((word & 1) << 7));
break;
}
case FB_FMT_IA88:
{
index &= RDRAM_MASK_16;
index += slice * (RDRAM_SIZE >> 1);
uint word = uint(vram16.data[index ^ 1u]);
current_color = u8x4(u8x3(word >> 8u), word & 0xff);
break;
}
case FB_FMT_RGBA8888:
{
index &= RDRAM_MASK_32;
index += slice * (RDRAM_SIZE >> 2);
uint word = vram32.data[index];
current_color = u8x4((uvec4(word) >> uvec4(24, 16, 8, 0)) & uvec4(0xff));
break;
}
}
}
void alias_color_to_depth()
{
/* Inherit memory depth from color. */
switch (FB_FMT)
{
case FB_FMT_RGBA5551:
{
current_dz = (current_color.a >> U8_C(3)) | (current_color.b & U8_C(8));
uint word = (current_color.r & 0xf8u) << 6u;
word |= (current_color.g & 0xf8u) << 1u;
word |= (current_color.b & 0xf8u) >> 4u;
current_depth = u16(word);
break;
}
case FB_FMT_IA88:
{
uvec2 col = current_color.ra;
uint word = (col.x << 8u) | col.y;
uint hidden_word = (word & 1u) * 3u;
current_depth = u16(word >> 2u);
current_dz = u8(((word & 3u) << 2u) | hidden_word);
break;
}
}
}
void alias_depth_to_color()
{
uint word = (uint(current_depth) << 4u) | current_dz;
switch (FB_FMT)
{
case FB_FMT_RGBA5551:
{
current_color.r = u8((word >> 10u) & 0xf8u);
current_color.g = u8((word >> 5u) & 0xf8u);
current_color.b = u8((word >> 0u) & 0xf8u);
current_color.a = u8((word & 7u) << 5u);
break;
}
case FB_FMT_IA88:
{
current_color.r = u8((word >> 10u) & 0xffu);
current_color.a = u8((word >> 2u) & 0xffu);
break;
}
}
current_color_dirty = true;
}
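`alias_depth_to_color()` and `alias_color_to_depth()` above round-trip the 18-bit depth word `(depth << 4) | dz` through RGBA5551-style channels when the color and depth buffers alias. A Python sketch (hypothetical helper names) showing that the 14-bit depth survives the trip; note that `dz` comes back in the hardware's shuffled encoding, not as the original 4-bit value:

```python
def depth_to_color5551(depth, dz):
    """Pack the 18-bit depth word into RGBA5551-style channels,
    mirroring alias_depth_to_color() for FB_FMT_RGBA5551."""
    word = (depth << 4) | dz
    r = (word >> 10) & 0xF8
    g = (word >> 5) & 0xF8
    b = (word >> 0) & 0xF8
    a = (word & 7) << 5
    return r, g, b, a

def color5551_to_depth(r, g, b, a):
    """Recover depth (and shuffled dz) from the aliased color channels,
    mirroring alias_color_to_depth()."""
    word = ((r & 0xF8) << 6) | ((g & 0xF8) << 1) | ((b & 0xF8) >> 4)
    dz = (a >> 3) | (b & 8)
    return word, dz
```

Packing depth 0x1234 with dz 5 and unpacking it recovers depth 0x1234 exactly.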
void load_vram_depth(uint index, uint slice)
{
index &= RDRAM_MASK_16;
index += slice * (RDRAM_SIZE >> 1);
u16 word = u16(vram16.data[index ^ 1u]);
current_depth = word >> U16_C(2);
current_dz = u8(hidden_vram.data[index]) | u8((word & U16_C(3)) << U16_C(2));
}
void store_unscaled_write_mask(uint index)
{
if (current_color_dirty)
{
switch (FB_FMT)
{
case FB_FMT_I4:
case FB_FMT_I8:
vram8.data[(index ^ 3u) + RDRAM_SIZE] = mem_u8(0xff);
break;
case FB_FMT_RGBA5551:
case FB_FMT_IA88:
vram16.data[(index ^ 1u) + (RDRAM_SIZE >> 1u)] = mem_u16(0xffff);
break;
case FB_FMT_RGBA8888:
vram32.data[index + (RDRAM_SIZE >> 2u)] = ~0u;
break;
}
}
}
void store_vram_color(uint index, uint slice)
{
if (current_color_dirty)
{
switch (FB_FMT)
{
case FB_FMT_I4:
{
index &= RDRAM_MASK_8;
index += slice * RDRAM_SIZE;
vram8.data[index ^ 3u] = mem_u8(0);
if ((index & 1u) != 0u)
hidden_vram.data[index >> 1u] = mem_u8(current_color.a);
break;
}
case FB_FMT_I8:
{
index &= RDRAM_MASK_8;
index += slice * RDRAM_SIZE;
vram8.data[index ^ 3u] = mem_u8(current_color.r);
if ((index & 1u) != 0u)
hidden_vram.data[index >> 1u] = mem_u8((current_color.r & 1) * 3);
break;
}
case FB_FMT_RGBA5551:
{
index &= RDRAM_MASK_16;
index += slice * (RDRAM_SIZE >> 1);
uvec4 c = uvec4(current_color);
c.rgb &= 0xf8u;
uint cov = c.w >> 5u;
uint word = (c.x << 8u) | (c.y << 3u) | (c.z >> 2u) | (cov >> 2u);
vram16.data[index ^ 1u] = mem_u16(word);
hidden_vram.data[index] = mem_u8(cov & U8_C(3));
break;
}
case FB_FMT_IA88:
{
index &= RDRAM_MASK_16;
index += slice * (RDRAM_SIZE >> 1);
uvec2 col = current_color.ra;
uint word = (col.x << 8u) | col.y;
vram16.data[index ^ 1u] = mem_u16(word);
hidden_vram.data[index] = mem_u8((col.y & 1) * 3);
break;
}
case FB_FMT_RGBA8888:
{
index &= RDRAM_MASK_32;
index += slice * (RDRAM_SIZE >> 2);
uvec4 col = current_color;
uint word = (col.r << 24u) | (col.g << 16u) | (col.b << 8u) | (col.a << 0u);
vram32.data[index] = word;
hidden_vram.data[2u * index] = mem_u8((current_color.g & 1) * 3);
hidden_vram.data[2u * index + 1u] = mem_u8((current_color.a & 1) * 3);
break;
}
}
}
if (RDRAM_UNSCALED_WRITE_MASK)
{
// Need this memory barrier to ensure the mask readback does not read
// an invalid value from RDRAM. If the mask is seen, the valid RDRAM value is
// also coherent.
memoryBarrierBuffer();
store_unscaled_write_mask(index);
}
}
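As an editorial aside, the RGBA5551 store path above can be sanity-checked host-side. The sketch below (Python, illustrative only; `pack_rgba5551` is not a name from the source) mirrors the packing: 5 bits per channel taken from the masked 8-bit values, the coverage MSB landing in bit 0 of the 16-bit word, and the two coverage LSBs spilling into hidden RDRAM:

```python
def pack_rgba5551(r, g, b, a):
    """Host-side mirror of the shader's RGBA5551 store path (illustrative)."""
    r &= 0xf8; g &= 0xf8; b &= 0xf8   # keep the top 5 bits of each channel
    cov = a >> 5                       # 3-bit coverage derived from alpha
    word = (r << 8) | (g << 3) | (b >> 2) | (cov >> 2)
    hidden = cov & 3                   # low 2 coverage bits -> hidden RDRAM byte
    return word, hidden

word, hidden = pack_rgba5551(0xff, 0x80, 0x08, 0xff)  # -> 0xfc03, 3
```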
void store_vram_depth(uint index, uint slice)
{
if (!FB_COLOR_DEPTH_ALIAS)
{
if (current_depth_dirty)
{
index &= RDRAM_MASK_16;
index += slice * (RDRAM_SIZE >> 1);
vram16.data[index ^ 1u] = mem_u16((current_depth << U16_C(2)) | (current_dz >> U16_C(2)));
hidden_vram.data[index] = mem_u8(current_dz & U16_C(3));
}
if (RDRAM_UNSCALED_WRITE_MASK)
{
// Need this memory barrier to ensure the mask readback does not read
// an invalid value from RDRAM. If the mask is seen, the valid RDRAM value is
// also coherent.
memoryBarrierBuffer();
if (current_depth_dirty)
vram16.data[(index ^ 1) + (RDRAM_SIZE >> 1u)] = mem_u16(0xffff);
}
}
}
uint color_fb_index;
void init_tile(uvec2 coord, uint fb_width, uint fb_height, uint fb_addr_index, uint fb_depth_addr_index)
{
current_color_dirty = false;
current_depth_dirty = false;
if (all(lessThan(coord, uvec2(fb_width, fb_height))))
{
uvec2 slice2d = coord & (SCALING_FACTOR - 1);
coord >>= SCALING_LOG2;
uint slice = slice2d.y * SCALING_FACTOR + slice2d.x;
uint index = fb_addr_index + (fb_width >> SCALING_LOG2) * coord.y + coord.x;
color_fb_index = index;
load_vram_color(index, slice);
index = fb_depth_addr_index + (fb_width >> SCALING_LOG2) * coord.y + coord.x;
load_vram_depth(index, slice);
}
}
void emit_scaled_write_masks(uvec2 unscaled_coord, uint unscaled_fb_width)
{
// Merge write masks across pixels.
// We reserved a chunk of memory after scaled RDRAM to store 2 bits per pixel holding
// a write mask for color and depth. The resolve stage will only resolve a pixel
// and trigger a write if any sub-sample was marked as written.
// Write masks are organized in 4x4 blocks of unscaled pixels for locality purposes.
// This guarantees a minimum number of loop iterations to resolve the write masks.
uint unscaled_block = (unscaled_coord.y >> 2u) * ((unscaled_fb_width + 3u) >> 2u) + (unscaled_coord.x >> 2u);
uvec2 unscaled_sub = unscaled_coord & 3u;
uint word = uint(current_color_dirty) + 2u * uint(current_depth_dirty);
word <<= 2u * (unscaled_sub.x + unscaled_sub.y * 4u);
#if SUBGROUP
// This should only need one iteration.
bool is_active = true;
do
{
if (subgroupBroadcastFirst(unscaled_block) == unscaled_block)
{
uint merged = subgroupOr(word);
if (subgroupElect())
atomicOr(vram32.data[SCALING_FACTOR * SCALING_FACTOR * (RDRAM_SIZE >> 2) + unscaled_block], merged);
is_active = false;
}
} while (is_active);
#else
// Just use atomics directly. With subgroup support, we can be a bit smarter about it.
if (word != 0u)
atomicOr(vram32.data[SCALING_FACTOR * SCALING_FACTOR * (RDRAM_SIZE >> 2) + unscaled_block], word);
#endif
}
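The 4x4-block write-mask layout described in the comment above fits 16 pixels times 2 bits into one 32-bit word, which is why a single atomicOr per block suffices. A host-side Python sketch of the addressing (illustrative; `write_mask_slot` is not a name from the source):

```python
def write_mask_slot(x, y, fb_width, color_dirty, depth_dirty):
    """Which 32-bit word and which 2-bit lane a pixel's write mask lands in.

    Mirrors emit_scaled_write_masks: 4x4 unscaled pixels per word,
    bit 0 of each lane = color written, bit 1 = depth written.
    """
    block = (y >> 2) * ((fb_width + 3) >> 2) + (x >> 2)
    sub_x, sub_y = x & 3, y & 3
    word = (int(color_dirty) + 2 * int(depth_dirty)) << (2 * (sub_x + sub_y * 4))
    return block, word
```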
void finish_tile(uvec2 coord, uint fb_width, uint fb_height, uint fb_addr_index, uint fb_depth_addr_index)
{
// MSL portability: Need to maintain uniform control flow.
if (any(greaterThanEqual(coord, uvec2(fb_width, fb_height))))
{
current_color_dirty = false;
current_depth_dirty = false;
}
uint unscaled_fb_width = fb_width >> SCALING_LOG2;
uvec2 slice2d = coord & (SCALING_FACTOR - 1);
coord >>= SCALING_LOG2;
uint slice = slice2d.y * SCALING_FACTOR + slice2d.x;
uint index = fb_addr_index + unscaled_fb_width * coord.y + coord.x;
store_vram_color(index, slice);
index = fb_depth_addr_index + unscaled_fb_width * coord.y + coord.x;
store_vram_depth(index, slice);
if (RDRAM_SCALED_WRITE_MASK)
emit_scaled_write_masks(coord, unscaled_fb_width);
}
u8x4 decode_memory_color(bool image_read_en)
{
u8 memory_coverage = image_read_en ? (current_color.a & U8_C(0xe0)) : U8_C(0xe0);
u8x3 color;
switch (FB_FMT)
{
case FB_FMT_I4:
color = u8x3(0);
memory_coverage = U8_C(0xe0);
break;
case FB_FMT_I8:
color = current_color.rrr;
memory_coverage = U8_C(0xe0);
break;
case FB_FMT_RGBA5551:
color = current_color.rgb & U8_C(0xf8);
break;
case FB_FMT_IA88:
color = current_color.rrr;
break;
case FB_FMT_RGBA8888:
color = current_color.rgb;
break;
}
return u8x4(color, memory_coverage);
}
void write_color(u8x4 col)
{
if (FB_FMT == FB_FMT_I4)
current_color.rgb = col.rgb;
else
current_color = col;
current_color_dirty = true;
}
void copy_pipeline(uint word, uint primitive_index)
{
switch (FB_FMT)
{
case FB_FMT_I4:
{
current_color = u8x4(0);
current_color_dirty = true;
break;
}
case FB_FMT_I8:
{
// Alpha testing needs to only look at the low byte for some bizarre reason.
// I don't think alpha testing is supposed to be used at all with 8-bit FB ...
word &= 0xffu;
write_color(u8x4(word));
break;
}
case FB_FMT_RGBA5551:
{
uint r = (word >> 8) & 0xf8u;
uint g = (word >> 3) & 0xf8u;
uint b = (word << 2) & 0xf8u;
uint a = (word & 1) * 0xe0u;
write_color(u8x4(r, g, b, a));
break;
}
}
if (FB_COLOR_DEPTH_ALIAS)
alias_color_to_depth();
}
void fill_color(uint col)
{
switch (FB_FMT)
{
case FB_FMT_RGBA8888:
{
uint r = (col >> 24u) & 0xffu;
uint g = (col >> 16u) & 0xffu;
uint b = (col >> 8u) & 0xffu;
uint a = (col >> 0u) & 0xffu;
write_color(u8x4(r, g, b, a));
break;
}
case FB_FMT_RGBA5551:
{
col >>= ((color_fb_index & 1u) ^ 1u) * 16u;
uint r = (col >> 8u) & 0xf8u;
uint g = (col >> 3u) & 0xf8u;
uint b = (col << 2u) & 0xf8u;
uint a = (col & 1u) * 0xe0u;
write_color(u8x4(r, g, b, a));
break;
}
case FB_FMT_IA88:
{
col >>= ((color_fb_index & 1u) ^ 1u) * 16u;
col &= 0xffffu;
uint r = (col >> 8u) & 0xffu;
uint a = (col >> 0u) & 0xffu;
write_color(u8x4(r, r, r, a));
break;
}
case FB_FMT_I8:
{
col >>= ((color_fb_index & 3u) ^ 3u) * 8u;
col &= 0xffu;
write_color(u8x4(col));
break;
}
}
if (FB_COLOR_DEPTH_ALIAS)
alias_color_to_depth();
}
void depth_blend(int x, int y, uint primitive_index, ShadedData shaded)
{
int z = shaded.z_dith >> 9;
int dith = shaded.z_dith & 0x1ff;
int coverage_count = shaded.coverage_count;
u8x4 combined = shaded.combined;
u8 shade_alpha = shaded.shade_alpha;
uint blend_state_index = uint(state_indices.elems[primitive_index].static_depth_tmem.y);
DerivedSetup derived = load_derived_setup(primitive_index);
DepthBlendState depth_blend = load_depth_blend_state(blend_state_index);
bool force_blend = (depth_blend.flags & DEPTH_BLEND_FORCE_BLEND_BIT) != 0;
bool z_compare = (depth_blend.flags & DEPTH_BLEND_DEPTH_TEST_BIT) != 0;
bool z_update = (depth_blend.flags & DEPTH_BLEND_DEPTH_UPDATE_BIT) != 0;
bool image_read_enable = (depth_blend.flags & DEPTH_BLEND_IMAGE_READ_ENABLE_BIT) != 0;
bool color_on_coverage = (depth_blend.flags & DEPTH_BLEND_COLOR_ON_COVERAGE_BIT) != 0;
bool blend_multicycle = (depth_blend.flags & DEPTH_BLEND_MULTI_CYCLE_BIT) != 0;
bool aa_enable = (depth_blend.flags & DEPTH_BLEND_AA_BIT) != 0;
bool dither_en = (depth_blend.flags & DEPTH_BLEND_DITHER_ENABLE_BIT) != 0;
bool blend_en;
bool coverage_wrap;
u8x2 blend_shift;
u8x4 memory_color = decode_memory_color(image_read_enable);
u8 memory_coverage = memory_color.a >> U8_C(5);
bool z_pass = depth_test(z, derived.dz, derived.dz_compressed,
current_depth, current_dz,
coverage_count, memory_coverage,
z_compare, depth_blend.z_mode,
force_blend, aa_enable,
blend_en, coverage_wrap, blend_shift);
GENERIC_MESSAGE3(combined.x, combined.y, combined.z);
// Pixel tests.
if (z_pass && (!aa_enable || coverage_count != 0))
{
// Blending
BlendInputs blender_inputs =
BlendInputs(combined, memory_color,
derived.fog_color, derived.blend_color, shade_alpha);
u8x4 blend_modes = depth_blend.blend_modes0;
if (blend_multicycle)
{
blender_inputs.pixel_color.rgb =
blender(blender_inputs,
blend_modes,
force_blend, blend_en, color_on_coverage, coverage_wrap, blend_shift, false);
blend_modes = depth_blend.blend_modes1;
}
u8x3 rgb = blender(blender_inputs,
blend_modes,
force_blend, blend_en, color_on_coverage, coverage_wrap, blend_shift, true);
// Dither
if (dither_en)
rgb = rgb_dither(rgb, dith);
// Coverage blending
int new_coverage = blend_coverage(coverage_count, memory_coverage, blend_en, depth_blend.coverage_mode);
GENERIC_MESSAGE3(rgb.x, rgb.y, rgb.z);
// Writeback
write_color(u8x4(rgb, new_coverage << 5));
// Z-writeback.
if (z_update)
{
current_depth = z_compress(z);
current_dz = u8(derived.dz_compressed);
current_depth_dirty = true;
if (FB_COLOR_DEPTH_ALIAS)
alias_depth_to_color();
}
else if (FB_COLOR_DEPTH_ALIAS)
alias_color_to_depth();
}
}
#endif
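depth_blend above unpacks `shaded.z_dith` into a depth value and a 9-bit dither field; the matching pack happens at the end of `shade_pixel` as `(z << 9) | rgb_dith`. A minimal Python round-trip of that encoding (illustrative helper names, relying on arithmetic right shift for negative z as in GLSL int):

```python
def pack_z_dith(z, dith):
    """Depth in the high bits, 9 bits of dither in the low bits."""
    assert 0 <= dith < 0x200
    return (z << 9) | dith

def unpack_z_dith(z_dith):
    # Arithmetic shift recovers a signed z; the low 9 bits are the dither.
    return z_dith >> 9, z_dith & 0x1ff
```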


@@ -0,0 +1,71 @@
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#ifndef NOISE_H_
#define NOISE_H_
u16 seeded_noise = U16_C(0);
// From: https://www.shadertoy.com/view/XlXcW4 with slight modifications.
void reseed_noise(uint x, uint y, uint primitive_offset)
{
const uint NOISE_PRIME = 1103515245u;
uvec3 seed = uvec3(x, y, primitive_offset);
seed = ((seed >> 8u) ^ seed.yzx) * NOISE_PRIME;
seed = ((seed >> 8u) ^ seed.yzx) * NOISE_PRIME;
seed = ((seed >> 8u) ^ seed.yzx) * NOISE_PRIME;
seeded_noise = u16(seed.x >> 16u);
}
i16 noise_get_combiner()
{
return i16(((seeded_noise & U16_C(7u)) << U16_C(6u)) | U16_C(0x20u));
}
int noise_get_dither_alpha()
{
return int(seeded_noise & U16_C(7u));
}
int noise_get_dither_color()
{
// 3 bits of noise for RGB separately.
return int(seeded_noise & U16_C(0x1ff));
}
u8 noise_get_blend_threshold()
{
return u8(seeded_noise & U16_C(0xffu));
}
uvec3 noise_get_full_gamma_dither()
{
uint seed = seeded_noise;
return uvec3(seed & 0x3f, (seed >> 6u) & 0x3f, ((seed >> 9u) & 0x38) | (seed & 7u));
}
uvec3 noise_get_partial_gamma_dither()
{
return (uvec3(seeded_noise) >> uvec3(0, 1, 2)) & 1u;
}
#endif
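The reseed hash above is a straight componentwise 32-bit computation, so it ports directly to the host for testing. A Python sketch (illustrative; assumes 32-bit wraparound on the multiply, as GLSL uint arithmetic provides):

```python
def reseed_noise(x, y, primitive_offset):
    """Python port of the shader's reseed_noise uvec3 hash (illustrative)."""
    PRIME = 1103515245
    M = 0xffffffff
    s = [x & M, y & M, primitive_offset & M]
    for _ in range(3):
        # seed = ((seed >> 8u) ^ seed.yzx) * NOISE_PRIME, componentwise, mod 2^32
        s = [(((s[i] >> 8) ^ s[(i + 1) % 3]) * PRIME) & M for i in range(3)]
    return s[0] >> 16  # 16-bit seeded noise value
```

Note that an all-zero input fixes the hash at zero, which is why the shader mixes in the primitive offset alongside the pixel coordinates.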


@@ -0,0 +1,114 @@
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*
*/
#ifndef PERSPECTIVE_H_
#define PERSPECTIVE_H_
const i16x2 perspective_table[64] = i16x2[](
i16x2(0x4000, -252 * 4), i16x2(0x3f04, -244 * 4), i16x2(0x3e10, -238 * 4), i16x2(0x3d22, -230 * 4),
i16x2(0x3c3c, -223 * 4), i16x2(0x3b5d, -218 * 4), i16x2(0x3a83, -210 * 4), i16x2(0x39b1, -205 * 4),
i16x2(0x38e4, -200 * 4), i16x2(0x381c, -194 * 4), i16x2(0x375a, -189 * 4), i16x2(0x369d, -184 * 4),
i16x2(0x35e5, -179 * 4), i16x2(0x3532, -175 * 4), i16x2(0x3483, -170 * 4), i16x2(0x33d9, -166 * 4),
i16x2(0x3333, -162 * 4), i16x2(0x3291, -157 * 4), i16x2(0x31f4, -155 * 4), i16x2(0x3159, -150 * 4),
i16x2(0x30c3, -147 * 4), i16x2(0x3030, -143 * 4), i16x2(0x2fa1, -140 * 4), i16x2(0x2f15, -137 * 4),
i16x2(0x2e8c, -134 * 4), i16x2(0x2e06, -131 * 4), i16x2(0x2d83, -128 * 4), i16x2(0x2d03, -125 * 4),
i16x2(0x2c86, -123 * 4), i16x2(0x2c0b, -120 * 4), i16x2(0x2b93, -117 * 4), i16x2(0x2b1e, -115 * 4),
i16x2(0x2aab, -113 * 4), i16x2(0x2a3a, -110 * 4), i16x2(0x29cc, -108 * 4), i16x2(0x2960, -106 * 4),
i16x2(0x28f6, -104 * 4), i16x2(0x288e, -102 * 4), i16x2(0x2828, -100 * 4), i16x2(0x27c4, -98 * 4),
i16x2(0x2762, -96 * 4), i16x2(0x2702, -94 * 4), i16x2(0x26a4, -92 * 4), i16x2(0x2648, -91 * 4),
i16x2(0x25ed, -89 * 4), i16x2(0x2594, -87 * 4), i16x2(0x253d, -86 * 4), i16x2(0x24e7, -85 * 4),
i16x2(0x2492, -83 * 4), i16x2(0x243f, -81 * 4), i16x2(0x23ee, -80 * 4), i16x2(0x239e, -79 * 4),
i16x2(0x234f, -77 * 4), i16x2(0x2302, -76 * 4), i16x2(0x22b6, -74 * 4), i16x2(0x226c, -74 * 4),
i16x2(0x2222, -72 * 4), i16x2(0x21da, -71 * 4), i16x2(0x2193, -70 * 4), i16x2(0x214d, -69 * 4),
i16x2(0x2108, -67 * 4), i16x2(0x20c5, -67 * 4), i16x2(0x2082, -65 * 4), i16x2(0x2041, -65 * 4)
);
ivec2 perspective_get_lut(int w)
{
int shift = min(14 - findMSB(w), 14);
int normout = (w << shift) & 0x3fff;
int wnorm = normout & 0xff;
ivec2 table = ivec2(perspective_table[normout >> 8]);
int rcp = ((table.y * wnorm) >> 10) + table.x;
return ivec2(rcp, shift);
}
ivec2 no_perspective_divide(ivec3 stw)
{
return stw.xy;
}
// s16 divided by s1.15.
// Classic approximation of a (x * rcp) >> shift with a LUT to find rcp.
ivec2 perspective_divide(ivec3 stw, inout bool overflow)
{
int w = stw.z;
bool w_carry = w <= 0;
w &= 0x7fff;
ivec2 table = perspective_get_lut(w);
int shift = table.y;
ivec2 prod = stw.xy * table.x;
int temp_mask = ((1 << 30) - 1) & -((1 << 29) >> shift);
ivec2 out_of_bounds = prod & temp_mask;
ivec2 temp;
if (shift != 14)
temp = prod = prod >> (13 - shift);
else
temp = prod << 1;
if (any(notEqual(out_of_bounds, ivec2(0))))
{
if (out_of_bounds.x != temp_mask && out_of_bounds.x != 0)
{
if ((prod.x & (1 << 29)) == 0)
temp.x = 0x7fff;
else
temp.x = -0x8000;
overflow = true;
}
if (out_of_bounds.y != temp_mask && out_of_bounds.y != 0)
{
if ((prod.y & (1 << 29)) == 0)
temp.y = 0x7fff;
else
temp.y = -0x8000;
overflow = true;
}
}
if (w_carry)
{
temp = ivec2(0x7fff);
overflow = true;
}
// Perspective divide produces a 17-bit signed coordinate, which is later clamped to 16-bit signed.
// However, the LOD computation happens in 17 bits ...
return clamp(temp, ivec2(-0x10000), ivec2(0xffff));
}
#endif
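The `.x` column of `perspective_table` above appears to follow `round(0x4000 * 64 / (64 + i))`, i.e. a 2.14 fixed-point reciprocal of `1 + i/64`; this closed form is an editorial observation, not taken from the source. A Python check against a few table rows:

```python
def rcp_table_entry(i):
    """Candidate closed form for perspective_table[i].x (rounding division)."""
    n = 0x4000 * 64            # 1.0 in 2.14 fixed point, scaled by the 64 steps
    d = 64 + i                 # normalized w in [1.0, 2.0) in steps of 1/64
    return (2 * n + d) // (2 * d)
```

Entries spot-checked here are i = 0..3 and 63; `perspective_get_lut` then refines the looked-up value linearly using the `.y` slope column and the low 8 bits of the normalized w.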


@@ -0,0 +1,191 @@
#version 450
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#include "small_types.h"
layout(local_size_x_id = 0, local_size_y_id = 1) in;
#include "debug.h"
#include "data_structures.h"
layout(set = 0, binding = 0, std430) readonly buffer TriangleSetupBuffer
{
TriangleSetupMem elems[];
} triangle_setup;
#include "load_triangle_setup.h"
layout(set = 0, binding = 1, std430) readonly buffer AttributeSetupBuffer
{
AttributeSetupMem elems[];
} attribute_setup;
#include "load_attribute_setup.h"
layout(set = 0, binding = 2, std430) readonly buffer DerivedSetupBuffer
{
DerivedSetupMem elems[];
} derived_setup;
#include "load_derived_setup.h"
layout(set = 0, binding = 3, std430) readonly buffer StaticRasterStateBuffer
{
StaticRasterizationStateMem elems[];
} static_raster_state;
#include "load_static_raster_state.h"
layout(set = 0, binding = 4, std430) readonly buffer StateIndicesBuffer
{
InstanceIndicesMem elems[];
} state_indices;
layout(set = 0, binding = 5, std430) readonly buffer SpanInfoOffsetBuffer
{
SpanInfoOffsetsMem elems[];
} span_offsets;
#include "load_span_offsets.h"
layout(set = 0, binding = 6, std430) readonly buffer SpanSetups
{
SpanSetupMem elems[];
} span_setups;
#include "load_span_setup.h"
layout(set = 0, binding = 7, std430) readonly buffer TMEM16
{
TMEMInstance16Mem instances[];
} tmem16;
layout(set = 0, binding = 7, std430) readonly buffer TMEM8
{
TMEMInstance8Mem instances[];
} tmem8;
layout(set = 0, binding = 8, std430) readonly buffer TileInfoBuffer
{
TileInfoMem elems[];
} tile_infos;
#include "load_tile_info.h"
layout(set = 2, binding = 0, std140) uniform GlobalConstants
{
GlobalFBInfo fb_info;
} global_constants;
layout(constant_id = 2) const int STATIC_STATE_FLAGS = 0;
layout(constant_id = 3) const int COMBINER_INPUTS_RGB0 = 0;
layout(constant_id = 4) const int COMBINER_INPUTS_ALPHA0 = 0;
layout(constant_id = 5) const int COMBINER_INPUTS_RGB1 = 0;
layout(constant_id = 6) const int COMBINER_INPUTS_ALPHA1 = 0;
layout(constant_id = 7) const int DITHER_TEX_SIZE_TEX_FMT = 0;
const int COMBINER_INPUT_RGB0_MULADD = (COMBINER_INPUTS_RGB0 >> 0) & 0xff;
const int COMBINER_INPUT_RGB0_MULSUB = (COMBINER_INPUTS_RGB0 >> 8) & 0xff;
const int COMBINER_INPUT_RGB0_MUL = (COMBINER_INPUTS_RGB0 >> 16) & 0xff;
const int COMBINER_INPUT_RGB0_ADD = (COMBINER_INPUTS_RGB0 >> 24) & 0xff;
const int COMBINER_INPUT_ALPHA0_MULADD = (COMBINER_INPUTS_ALPHA0 >> 0) & 0xff;
const int COMBINER_INPUT_ALPHA0_MULSUB = (COMBINER_INPUTS_ALPHA0 >> 8) & 0xff;
const int COMBINER_INPUT_ALPHA0_MUL = (COMBINER_INPUTS_ALPHA0 >> 16) & 0xff;
const int COMBINER_INPUT_ALPHA0_ADD = (COMBINER_INPUTS_ALPHA0 >> 24) & 0xff;
const int COMBINER_INPUT_RGB1_MULADD = (COMBINER_INPUTS_RGB1 >> 0) & 0xff;
const int COMBINER_INPUT_RGB1_MULSUB = (COMBINER_INPUTS_RGB1 >> 8) & 0xff;
const int COMBINER_INPUT_RGB1_MUL = (COMBINER_INPUTS_RGB1 >> 16) & 0xff;
const int COMBINER_INPUT_RGB1_ADD = (COMBINER_INPUTS_RGB1 >> 24) & 0xff;
const int COMBINER_INPUT_ALPHA1_MULADD = (COMBINER_INPUTS_ALPHA1 >> 0) & 0xff;
const int COMBINER_INPUT_ALPHA1_MULSUB = (COMBINER_INPUTS_ALPHA1 >> 8) & 0xff;
const int COMBINER_INPUT_ALPHA1_MUL = (COMBINER_INPUTS_ALPHA1 >> 16) & 0xff;
const int COMBINER_INPUT_ALPHA1_ADD = (COMBINER_INPUTS_ALPHA1 >> 24) & 0xff;
const int DITHER = (DITHER_TEX_SIZE_TEX_FMT >> 0) & 0xff;
const int TEX_SIZE = (DITHER_TEX_SIZE_TEX_FMT >> 8) & 0xff;
const int TEX_FMT = (DITHER_TEX_SIZE_TEX_FMT >> 16) & 0xff;
#define RASTERIZER_SPEC_CONSTANT
#include "noise.h"
#include "shading.h"
layout(set = 0, binding = 9, std430) writeonly buffer ColorBuffer
{
mem_u8x4 elems[];
} color;
layout(set = 0, binding = 9, std430) writeonly buffer ColorBufferRaw
{
uint elems[];
} raw_color;
layout(set = 0, binding = 10, std430) writeonly buffer DepthBuffer
{
int elems[];
} depth;
layout(set = 0, binding = 11, std430) writeonly buffer ShadeAlpha
{
mem_u8 elems[];
} shade_alpha;
layout(set = 0, binding = 12, std430) writeonly buffer Coverage
{
mem_i8 elems[];
} coverage;
layout(set = 1, binding = 0, std430) readonly buffer TileWorkList
{
uvec4 elems[];
} tile_work_list;
void main()
{
uvec4 work = tile_work_list.elems[gl_WorkGroupID.x];
int x = int(work.x * gl_WorkGroupSize.x + gl_LocalInvocationID.x);
int y = int(work.y * gl_WorkGroupSize.y + gl_LocalInvocationID.y);
uint tile_instance = work.z;
uint primitive_index = work.w;
ShadedData shaded;
i8 coverage_value;
uint index = tile_instance * (gl_WorkGroupSize.x * gl_WorkGroupSize.y) + gl_LocalInvocationIndex;
if (shade_pixel(x, y, primitive_index, shaded))
{
coverage_value = i8(shaded.coverage_count);
if (coverage_value <= I8_C(8))
{
// Workaround curious bug with glslang, need to cast manually to uvec4 first.
color.elems[index] = mem_u8x4(uvec4(shaded.combined));
shade_alpha.elems[index] = mem_u8(shaded.shade_alpha);
depth.elems[index] = shaded.z_dith;
}
else if ((coverage_value & COVERAGE_COPY_BIT) != 0)
{
// For copy pipe, we use a raw 32-bit word to represent the loaded texel.
raw_color.elems[index] = shaded.z_dith;
}
}
else
coverage_value = I8_C(-1);
coverage.elems[index] = mem_i8(coverage_value);
}


@@ -0,0 +1,361 @@
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#ifndef SHADING_H_
#define SHADING_H_
#ifdef RASTERIZER_SPEC_CONSTANT
const int SCALING_LOG2 = (STATIC_STATE_FLAGS >> RASTERIZATION_UPSCALING_LOG2_BIT_OFFSET) & 3;
const int SCALING_FACTOR = 1 << SCALING_LOG2;
#endif
#include "coverage.h"
#include "interpolation.h"
#include "perspective.h"
#include "texture.h"
#include "dither.h"
#include "combiner.h"
bool shade_pixel(int x, int y, uint primitive_index, out ShadedData shaded)
{
SpanInfoOffsets span_offsets = load_span_offsets(primitive_index);
if ((y < (SCALING_FACTOR * span_offsets.ylo)) || (y > (span_offsets.yhi * SCALING_FACTOR + (SCALING_FACTOR - 1))))
return false;
uint setup_flags = uint(triangle_setup.elems[primitive_index].flags);
if (SCALING_FACTOR > 1)
{
if ((setup_flags & TRIANGLE_SETUP_DISABLE_UPSCALING_BIT) != 0u)
{
x &= ~(SCALING_FACTOR - 1);
y &= ~(SCALING_FACTOR - 1);
}
}
SpanSetup span_setup = load_span_setup(SCALING_FACTOR * span_offsets.offset + (y - SCALING_FACTOR * span_offsets.ylo));
if (span_setup.valid_line == U16_C(0))
return false;
uint setup_tile = uint(triangle_setup.elems[primitive_index].tile);
AttributeSetup attr = load_attribute_setup(primitive_index);
uvec4 states = uvec4(state_indices.elems[primitive_index].static_depth_tmem);
uint static_state_index = states.x;
uint tmem_instance_index = states.z;
StaticRasterizationState static_state = load_static_rasterization_state(static_state_index);
uint static_state_flags = static_state.flags;
int static_state_dither = static_state.dither;
u8x4 combiner_inputs_rgb0 = static_state.combiner_inputs_rgb0;
u8x4 combiner_inputs_alpha0 = static_state.combiner_inputs_alpha0;
u8x4 combiner_inputs_rgb1 = static_state.combiner_inputs_rgb1;
u8x4 combiner_inputs_alpha1 = static_state.combiner_inputs_alpha1;
#ifdef RASTERIZER_SPEC_CONSTANT
if ((STATIC_STATE_FLAGS & RASTERIZATION_USE_SPECIALIZATION_CONSTANT_BIT) != 0)
{
static_state_flags = STATIC_STATE_FLAGS;
static_state_dither = DITHER;
combiner_inputs_rgb0.x = u8(COMBINER_INPUT_RGB0_MULADD);
combiner_inputs_rgb0.y = u8(COMBINER_INPUT_RGB0_MULSUB);
combiner_inputs_rgb0.z = u8(COMBINER_INPUT_RGB0_MUL);
combiner_inputs_rgb0.w = u8(COMBINER_INPUT_RGB0_ADD);
combiner_inputs_alpha0.x = u8(COMBINER_INPUT_ALPHA0_MULADD);
combiner_inputs_alpha0.y = u8(COMBINER_INPUT_ALPHA0_MULSUB);
combiner_inputs_alpha0.z = u8(COMBINER_INPUT_ALPHA0_MUL);
combiner_inputs_alpha0.w = u8(COMBINER_INPUT_ALPHA0_ADD);
combiner_inputs_rgb1.x = u8(COMBINER_INPUT_RGB1_MULADD);
combiner_inputs_rgb1.y = u8(COMBINER_INPUT_RGB1_MULSUB);
combiner_inputs_rgb1.z = u8(COMBINER_INPUT_RGB1_MUL);
combiner_inputs_rgb1.w = u8(COMBINER_INPUT_RGB1_ADD);
combiner_inputs_alpha1.x = u8(COMBINER_INPUT_ALPHA1_MULADD);
combiner_inputs_alpha1.y = u8(COMBINER_INPUT_ALPHA1_MULSUB);
combiner_inputs_alpha1.z = u8(COMBINER_INPUT_ALPHA1_MUL);
combiner_inputs_alpha1.w = u8(COMBINER_INPUT_ALPHA1_ADD);
}
#endif
// This is a great case for specialization constants.
bool tlut = (static_state_flags & RASTERIZATION_TLUT_BIT) != 0;
bool tlut_type = (static_state_flags & RASTERIZATION_TLUT_TYPE_BIT) != 0;
bool sample_quad = (static_state_flags & RASTERIZATION_SAMPLE_MODE_BIT) != 0;
bool cvg_times_alpha = (static_state_flags & RASTERIZATION_CVG_TIMES_ALPHA_BIT) != 0;
bool alpha_cvg_select = (static_state_flags & RASTERIZATION_ALPHA_CVG_SELECT_BIT) != 0;
bool perspective = (static_state_flags & RASTERIZATION_PERSPECTIVE_CORRECT_BIT) != 0;
bool tex_lod_en = (static_state_flags & RASTERIZATION_TEX_LOD_ENABLE_BIT) != 0;
bool sharpen_lod_en = (static_state_flags & RASTERIZATION_SHARPEN_LOD_ENABLE_BIT) != 0;
bool detail_lod_en = (static_state_flags & RASTERIZATION_DETAIL_LOD_ENABLE_BIT) != 0;
bool aa_enable = (static_state_flags & RASTERIZATION_AA_BIT) != 0;
bool multi_cycle = (static_state_flags & RASTERIZATION_MULTI_CYCLE_BIT) != 0;
bool interlace_en = (static_state_flags & RASTERIZATION_INTERLACE_FIELD_BIT) != 0;
bool fill_en = (static_state_flags & RASTERIZATION_FILL_BIT) != 0;
bool copy_en = (static_state_flags & RASTERIZATION_COPY_BIT) != 0;
bool alpha_test = (static_state_flags & RASTERIZATION_ALPHA_TEST_BIT) != 0;
bool alpha_test_dither = (static_state_flags & RASTERIZATION_ALPHA_TEST_DITHER_BIT) != 0;
bool mid_texel = (static_state_flags & RASTERIZATION_SAMPLE_MID_TEXEL_BIT) != 0;
bool uses_texel0 = (static_state_flags & RASTERIZATION_USES_TEXEL0_BIT) != 0;
bool uses_texel1 = (static_state_flags & RASTERIZATION_USES_TEXEL1_BIT) != 0;
bool uses_pipelined_texel1 = (static_state_flags & RASTERIZATION_USES_PIPELINED_TEXEL1_BIT) != 0;
bool uses_lod = (static_state_flags & RASTERIZATION_USES_LOD_BIT) != 0;
bool convert_one = (static_state_flags & RASTERIZATION_CONVERT_ONE_BIT) != 0;
bool bilerp0 = (static_state_flags & RASTERIZATION_BILERP_0_BIT) != 0;
bool bilerp1 = (static_state_flags & RASTERIZATION_BILERP_1_BIT) != 0;
if ((static_state_flags & RASTERIZATION_NEED_NOISE_BIT) != 0)
reseed_noise(x, y, primitive_index + global_constants.fb_info.base_primitive_index);
bool flip = (setup_flags & TRIANGLE_SETUP_FLIP_BIT) != 0;
if (copy_en)
{
bool valid = x >= span_setup.start_x && x <= span_setup.end_x;
if (!valid)
return false;
ivec2 st;
int s_offset;
interpolate_st_copy(span_setup, attr.dstzw_dx, x, perspective, flip, st, s_offset);
uint tile0 = uint(setup_tile) & 7u;
uint tile_info_index0 = uint(state_indices.elems[primitive_index].tile_infos[tile0]);
TileInfo tile_info0 = load_tile_info(tile_info_index0);
#ifdef RASTERIZER_SPEC_CONSTANT
if ((STATIC_STATE_FLAGS & RASTERIZATION_USE_STATIC_TEXTURE_SIZE_FORMAT_BIT) != 0)
{
tile_info0.fmt = u8(TEX_FMT);
tile_info0.size = u8(TEX_SIZE);
}
#endif
int texel0 = sample_texture_copy(tile_info0, tmem_instance_index, st, s_offset, tlut, tlut_type);
shaded.z_dith = texel0;
shaded.coverage_count = U8_C(COVERAGE_COPY_BIT);
if (alpha_test && global_constants.fb_info.fb_size == 2 && (texel0 & 1) == 0)
return false;
return true;
}
else if (fill_en)
{
shaded.coverage_count = U8_C(COVERAGE_FILL_BIT);
return x >= span_setup.start_x && x <= span_setup.end_x;
}
int coverage = compute_coverage(span_setup.xleft, span_setup.xright, x);
// There is no way we can gain coverage here.
// Reject work as fast as possible.
if (coverage == 0)
return false;
int coverage_count = bitCount(coverage);
// If we're not using AA, only the first coverage bit is relevant.
if (!aa_enable && (coverage & 1) == 0)
return false;
DerivedSetup derived = load_derived_setup(primitive_index);
int dx = x - span_setup.interpolation_base_x;
int interpolation_direction = flip ? 1 : -1;
// Interpolate attributes.
u8x4 shade = interpolate_rgba(span_setup.rgba, attr.drgba_dx, attr.drgba_dy,
dx, coverage);
ivec2 st, st_dx, st_dy;
int z;
bool perspective_overflow = false;
int tex_interpolation_direction = interpolation_direction;
if (SCALING_FACTOR > 1 && uses_lod)
if ((setup_flags & TRIANGLE_SETUP_NATIVE_LOD_BIT) != 0)
tex_interpolation_direction *= SCALING_FACTOR;
interpolate_stz(span_setup.stzw, attr.dstzw_dx, attr.dstzw_dy, dx, coverage, perspective, uses_lod,
tex_interpolation_direction, st, st_dx, st_dy, z, perspective_overflow);
// Sample textures.
uint tile0 = uint(setup_tile) & 7u;
uint tile1 = (tile0 + 1) & 7u;
uint max_level = uint(setup_tile) >> 3u;
int min_lod = derived.min_lod;
i16 lod_frac;
if (uses_lod)
{
compute_lod_2cycle(tile0, tile1, lod_frac, max_level, min_lod, st, st_dx, st_dy, perspective_overflow,
tex_lod_en, sharpen_lod_en, detail_lod_en);
}
i16x4 texel0, texel1;
if (uses_texel0)
{
uint tile_info_index0 = uint(state_indices.elems[primitive_index].tile_infos[tile0]);
TileInfo tile_info0 = load_tile_info(tile_info_index0);
#ifdef RASTERIZER_SPEC_CONSTANT
if ((STATIC_STATE_FLAGS & RASTERIZATION_USE_STATIC_TEXTURE_SIZE_FORMAT_BIT) != 0)
{
tile_info0.fmt = u8(TEX_FMT);
tile_info0.size = u8(TEX_SIZE);
}
#endif
texel0 = sample_texture(tile_info0, tmem_instance_index, st, tlut, tlut_type,
sample_quad, mid_texel, false, bilerp0, derived.factors, i16x4(0));
}
// A very awkward mechanism where we peek into the next pixel, or in some cases, the next scanline's first pixel.
if (uses_pipelined_texel1)
{
bool valid_line = uint(span_setups.elems[SCALING_FACTOR * span_offsets.offset + (y - SCALING_FACTOR * span_offsets.ylo + 1)].valid_line) != 0u;
bool long_span = span_setup.lodlength >= 8;
bool end_span = x == (flip ? span_setup.end_x : span_setup.start_x);
if (end_span && long_span && valid_line)
{
ivec3 stw = span_setups.elems[SCALING_FACTOR * span_offsets.offset + (y - SCALING_FACTOR * span_offsets.ylo + 1)].stzw.xyw >> 16;
if (perspective)
{
bool st_overflow;
st = perspective_divide(stw, st_overflow);
}
else
st = no_perspective_divide(stw);
}
else
st = interpolate_st_single(span_setup.stzw, attr.dstzw_dx, dx + interpolation_direction * SCALING_FACTOR, perspective);
tile1 = tile0;
uses_texel1 = true;
}
if (uses_texel1)
{
if (convert_one && !bilerp1)
{
texel1 = texture_convert_factors(texel0, derived.factors);
}
else
{
uint tile_info_index1 = uint(state_indices.elems[primitive_index].tile_infos[tile1]);
TileInfo tile_info1 = load_tile_info(tile_info_index1);
#ifdef RASTERIZER_SPEC_CONSTANT
if ((STATIC_STATE_FLAGS & RASTERIZATION_USE_STATIC_TEXTURE_SIZE_FORMAT_BIT) != 0)
{
tile_info1.fmt = u8(TEX_FMT);
tile_info1.size = u8(TEX_SIZE);
}
#endif
texel1 = sample_texture(tile_info1, tmem_instance_index, st, tlut, tlut_type, sample_quad, mid_texel,
convert_one, bilerp1, derived.factors, texel0);
}
}
int rgb_dith, alpha_dith;
dither_coefficients(x, y >> int(interlace_en), static_state_dither >> 2, static_state_dither & 3, rgb_dith, alpha_dith);
// Run combiner.
u8x4 combined;
u8 alpha_reference;
if (multi_cycle)
{
CombinerInputs combined_inputs =
CombinerInputs(derived.constant_muladd0, derived.constant_mulsub0, derived.constant_mul0, derived.constant_add0,
shade, u8x4(0), texel0, texel1, lod_frac, noise_get_combiner());
combined_inputs.combined = combiner_cycle0(combined_inputs,
combiner_inputs_rgb0,
combiner_inputs_alpha0,
alpha_dith, coverage_count, cvg_times_alpha, alpha_cvg_select,
alpha_test, alpha_reference);
combined_inputs.constant_muladd = derived.constant_muladd1;
combined_inputs.constant_mulsub = derived.constant_mulsub1;
combined_inputs.constant_mul = derived.constant_mul1;
combined_inputs.constant_add = derived.constant_add1;
// Pipelining, texel1 is promoted to texel0 in cycle1.
// I don't think the hardware ever intended for you to access texels in the second cycle, given this pipelining.
i16x4 tmp_texel = combined_inputs.texel0;
combined_inputs.texel0 = combined_inputs.texel1;
// Following the pipelining, texel1 should become texel0 of next pixel,
// but let's not go there ...
combined_inputs.texel1 = tmp_texel;
// Resample the noise at some arbitrary other offset.
// This only matters if both noise combiner inputs take noise (very weird).
if ((static_state_flags & RASTERIZATION_NEED_NOISE_DUAL_BIT) != 0)
{
reseed_noise(x + 1023, y + 7, primitive_index + global_constants.fb_info.base_primitive_index + 11);
combined_inputs.noise = noise_get_combiner();
}
combined = u8x4(combiner_cycle1(combined_inputs,
combiner_inputs_rgb1,
combiner_inputs_alpha1,
alpha_dith, coverage_count, cvg_times_alpha, alpha_cvg_select));
}
else
{
CombinerInputs combined_inputs =
CombinerInputs(derived.constant_muladd1, derived.constant_mulsub1, derived.constant_mul1, derived.constant_add1,
shade, u8x4(0), texel0, texel1, lod_frac, noise_get_combiner());
combined = u8x4(combiner_cycle1(combined_inputs,
combiner_inputs_rgb1,
combiner_inputs_alpha1,
alpha_dith, coverage_count, cvg_times_alpha, alpha_cvg_select));
alpha_reference = combined.a;
}
// After combiner, color can be modified to 0 through alpha-to-cvg, so check for potential write_enable here.
// If we're not using AA, the first coverage bit is used instead, coverage count is ignored.
if (aa_enable && coverage_count == 0)
return false;
if (alpha_test)
{
u8 alpha_threshold;
if (alpha_test_dither)
alpha_threshold = noise_get_blend_threshold();
else
alpha_threshold = derived.blend_color.a;
if (alpha_reference < alpha_threshold)
return false;
}
shaded.combined = combined;
shaded.z_dith = (z << 9) | rgb_dith;
shaded.coverage_count = u8(coverage_count);
// Shade alpha needs to be passed separately since it might affect the blending stage.
shaded.shade_alpha = u8(min(shade.a + alpha_dith, 0xff));
return true;
}
#endif
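The packing above (`shaded.z_dith = (z << 9) | rgb_dith`) stores the depth value and the dither coefficient in one integer. A minimal C sketch of the pack/unpack (hypothetical helper names, assuming the dither coefficient fits in the low 9 bits as the shift implies):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical pack/unpack mirroring `shaded.z_dith = (z << 9) | rgb_dith`.
 * Assumes rgb_dith occupies fewer than 9 bits. */
static uint32_t pack_z_dith(uint32_t z, uint32_t rgb_dith)
{
    return (z << 9) | rgb_dith;
}

static uint32_t unpack_z(uint32_t z_dith)    { return z_dith >> 9; }
static uint32_t unpack_dith(uint32_t z_dith) { return z_dith & 0x1ffu; }
```

The round trip is lossless as long as the dither value stays below 512, which is why a single field can carry both through the depth-blend stage.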

File diff suppressed because it is too large


@@ -0,0 +1,126 @@
{
"include": [ "../../Granite/assets/shaders/inc" ],
"shaders": [
{
"name": "tmem_update",
"compute": true,
"path": "tmem_update.comp"
},
{
"name": "span_setup",
"compute": true,
"path": "span_setup.comp"
},
{
"name": "clear_indirect_buffer",
"compute": true,
"path": "clear_indirect_buffer.comp"
},
{
"name": "tile_binning_combined",
"compute": true,
"path": "tile_binning_combined.comp",
"variants": [
{ "define": "SUBGROUP", "count": 2, "resolve": true },
{ "define": "UBERSHADER", "count": 2, "resolve": true },
{ "define": "SMALL_TYPES", "count": 2, "resolve": true }
]
},
{
"name": "ubershader",
"path": "ubershader.comp",
"compute": true,
"variants": [
{ "define": "SUBGROUP", "count": 2, "resolve": true },
{ "define": "SMALL_TYPES", "count": 2, "resolve": true }
]
},
{
"name": "depth_blend",
"path": "depth_blend.comp",
"compute": true,
"variants": [
{ "define": "SUBGROUP", "count": 2, "resolve": true },
{ "define": "SMALL_TYPES", "count": 2, "resolve": true }
]
},
{
"name": "rasterizer",
"path": "rasterizer.comp",
"compute": true,
"variants": [
{ "define": "SMALL_TYPES", "count": 2, "resolve": true }
]
},
{
"name": "fullscreen",
"path": "fullscreen.vert"
},
{
"name": "vi_scale",
"path": "vi_scale.frag"
},
{
"name": "vi_divot",
"path": "vi_divot.frag",
"variants": [
{ "define": "FETCH_BUG", "count": 2 }
]
},
{
"name": "vi_fetch",
"path": "vi_fetch.frag",
"variants": [
{ "define": "FETCH_BUG", "count": 2 }
]
},
{
"name": "vi_blend_fields",
"path": "vi_blend_fields.frag"
},
{
"name": "extract_vram",
"path": "extract_vram.comp",
"compute": true
},
{
"name": "masked_rdram_resolve",
"path": "masked_rdram_resolve.comp",
"compute": true
},
{
"name": "clear_write_mask",
"path": "clear_write_mask.comp",
"compute": true
},
{
"name": "update_upscaled_domain_post",
"path": "update_upscaled_domain_post.comp",
"compute": true
},
{
"name": "update_upscaled_domain_pre",
"path": "update_upscaled_domain_pre.comp",
"compute": true
},
{
"name": "update_upscaled_domain_resolve",
"path": "update_upscaled_domain_resolve.comp",
"compute": true
},
{
"name": "clear_super_sampled_write_mask",
"path": "clear_super_sampled_write_mask.comp",
"compute": true
},
{
"name": "vi_deinterlace_vert",
"path": "vi_deinterlace.vert"
},
{
"name": "vi_deinterlace_frag",
"path": "vi_deinterlace.frag"
}
]
}


@@ -0,0 +1,121 @@
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
// Utility header to smooth over the difference between
// 8/16-bit integer arithmetic vs. just 8/16-bit storage.
#ifndef SMALL_INTEGERS_H_
#define SMALL_INTEGERS_H_
#extension GL_EXT_shader_16bit_storage : require
#extension GL_EXT_shader_8bit_storage : require
#if SMALL_TYPES
#extension GL_EXT_shader_explicit_arithmetic_types_int8 : require
#extension GL_EXT_shader_explicit_arithmetic_types_int16 : require
#define mem_u8 uint8_t
#define mem_u16 uint16_t
#define mem_u8x2 u8vec2
#define mem_u16x2 u16vec2
#define mem_u8x3 u8vec3
#define mem_u16x3 u16vec3
#define mem_u8x4 u8vec4
#define mem_u16x4 u16vec4
#define mem_i8 int8_t
#define mem_i16 int16_t
#define mem_i8x2 i8vec2
#define mem_i16x2 i16vec2
#define mem_i8x3 i8vec3
#define mem_i16x3 i16vec3
#define mem_i8x4 i8vec4
#define mem_i16x4 i16vec4
#define u8 uint8_t
#define u16 uint16_t
#define u8x2 u8vec2
#define u16x2 u16vec2
#define u8x3 u8vec3
#define u16x3 u16vec3
#define u8x4 u8vec4
#define u16x4 u16vec4
#define i8 int8_t
#define i16 int16_t
#define i8x2 i8vec2
#define i16x2 i16vec2
#define i8x3 i8vec3
#define i16x3 i16vec3
#define i8x4 i8vec4
#define i16x4 i16vec4
#define U8_C(x) uint8_t(x)
#define I8_C(x) int8_t(x)
#define U16_C(x) uint16_t(x)
#define I16_C(x) int16_t(x)
#else
#define mem_u8 uint8_t
#define mem_u16 uint16_t
#define mem_u8x2 u8vec2
#define mem_u16x2 u16vec2
#define mem_u8x3 u8vec3
#define mem_u16x3 u16vec3
#define mem_u8x4 u8vec4
#define mem_u16x4 u16vec4
#define mem_i8 int8_t
#define mem_i16 int16_t
#define mem_i8x2 i8vec2
#define mem_i16x2 i16vec2
#define mem_i8x3 i8vec3
#define mem_i16x3 i16vec3
#define mem_i8x4 i8vec4
#define mem_i16x4 i16vec4
#define u8 int
#define u16 int
#define u8x2 ivec2
#define u16x2 ivec2
#define u8x3 ivec3
#define u16x3 ivec3
#define u8x4 ivec4
#define u16x4 ivec4
#define i8 int
#define i16 int
#define i8x2 ivec2
#define i16x2 ivec2
#define i8x3 ivec3
#define i16x3 ivec3
#define i8x4 ivec4
#define i16x4 ivec4
#define U8_C(x) int(x)
#define I8_C(x) int(x)
#define U16_C(x) int(x)
#define I16_C(x) int(x)
#endif
#endif


@@ -0,0 +1,227 @@
#version 450
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#include "small_types.h"
#include "debug.h"
layout(local_size_x_id = 0) in;
layout(constant_id = 1) const int SCALING_LOG2 = 0;
const int SCALING_FACTOR = 1 << SCALING_LOG2;
#include "data_structures.h"
layout(std430, set = 0, binding = 0) readonly buffer TriangleSetupBuffer
{
TriangleSetupMem elems[];
} triangle_setup;
#include "load_triangle_setup.h"
layout(std430, set = 0, binding = 1) readonly buffer AttributeSetupBuffer
{
AttributeSetupMem elems[];
} attribute_setup;
#include "load_attribute_setup.h"
layout(set = 0, binding = 2, std430) readonly buffer ScissorStateBuffer
{
ScissorStateMem elems[];
} scissor_state;
#include "load_scissor_state.h"
layout(std430, set = 0, binding = 3) writeonly buffer SpanSetups
{
SpanSetupMem elems[];
} span_setups;
#include "store_span_setup.h"
layout(set = 1, binding = 0) uniform utextureBuffer uInterpolationJobs;
const int SUBPIXELS = 4;
const int SUBPIXELS_LOG2 = 2;
// Convert a 16.16 signed value to 16.3. We have 8 subpixels in X direction after snapping.
ivec4 quantize_x(ivec4 x)
{
ivec4 sticky = ivec4(notEqual(x & 0xfff, ivec4(0)));
ivec4 snapped = ivec4((x >> 12) | sticky);
return snapped;
}
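The sticky-bit snap in `quantize_x` can be checked scalar-wise on the CPU. A minimal C sketch (hypothetical helper name) of the same shift-and-sticky operation on one lane:

```c
#include <assert.h>
#include <stdint.h>

/* Scalar re-implementation of one lane of quantize_x: drop the low 12 bits
 * of a 16.16 coordinate, but OR a sticky bit into the LSB so any nonzero
 * discarded fraction still influences the snapped edge position. */
static int32_t quantize_x_scalar(int32_t x)
{
    int32_t sticky = (x & 0xfff) != 0 ? 1 : 0;
    return (x >> 12) | sticky;
}
```

Note the sticky bit is ORed in rather than added, so it only matters when the shifted result has a zero LSB, matching the vectorized GLSL above.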
int min4(ivec4 v)
{
ivec2 v2 = min(v.xy, v.zw);
return min(v2.x, v2.y);
}
int max4(ivec4 v)
{
ivec2 v2 = max(v.xy, v.zw);
return max(v2.x, v2.y);
}
ivec4 interpolate_snapped(ivec4 dvalue, int dy)
{
int dy_shifted = dy >> SCALING_LOG2;
int dy_masked = dy & (SCALING_FACTOR - 1);
return dy_shifted * dvalue + dy_masked * (dvalue >> SCALING_LOG2);
}
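`interpolate_snapped` splits an upscaled step count into whole native pixels, stepped with the full-precision derivative, plus a sub-pixel remainder stepped with the quantized derivative `dvalue >> SCALING_LOG2`. A scalar C sketch of one component (hypothetical helper name):

```c
#include <assert.h>

/* One component of interpolate_snapped: whole native steps use the exact
 * derivative, the fractional upscaled steps use the scaled-down derivative. */
static int interpolate_snapped_1d(int dvalue, int dy, int scaling_log2)
{
    int mask = (1 << scaling_log2) - 1;
    return (dy >> scaling_log2) * dvalue + (dy & mask) * (dvalue >> scaling_log2);
}
```

At whole native pixels (dy a multiple of the scaling factor) this reproduces the native interpolation exactly, which is what keeps upscaled rendering bit-matched at 1x sample positions.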
void main()
{
ivec3 job_indices = ivec3(texelFetch(uInterpolationJobs, int(gl_WorkGroupID.x)).xyz);
int primitive_index = job_indices.x;
int base_y = job_indices.y * SCALING_FACTOR;
int max_y = job_indices.z * SCALING_FACTOR + (SCALING_FACTOR - 1);
int y = base_y + int(gl_LocalInvocationIndex);
if (y > max_y)
return;
TriangleSetup setup = load_triangle_setup(primitive_index);
AttributeSetup attr = load_attribute_setup(primitive_index);
ScissorState scissor = load_scissor_state(primitive_index);
bool flip = (setup.flags & TRIANGLE_SETUP_FLIP_BIT) != 0;
bool interlace_en = (setup.flags & TRIANGLE_SETUP_INTERLACE_FIELD_BIT) != 0;
bool keep_odd_field = (setup.flags & TRIANGLE_SETUP_INTERLACE_KEEP_ODD_BIT) != 0;
SpanSetup span_setup;
// Interpolate RGBA, STZW to their scanline.
{
bool do_offset = (setup.flags & TRIANGLE_SETUP_DO_OFFSET_BIT) != 0;
bool skip_xfrac = (setup.flags & TRIANGLE_SETUP_SKIP_XFRAC_BIT) != 0;
int y_interpolation_base = int(setup.yh) >> 2;
y_interpolation_base *= SCALING_FACTOR;
// For high-resolution interpolation, make sure we snap interpolation correctly at whole pixels,
// and quantize derivatives in-between pixels.
int dy = y - y_interpolation_base;
int xh = setup.xh * SCALING_FACTOR + dy * (setup.dxhdy << 2);
ivec4 drgba_diff = ivec4(0);
ivec4 dstzw_diff = ivec4(0);
// In do_offset mode, varyings are latched at last subpixel line instead of first (for some reason).
if (do_offset)
{
xh += (SCALING_FACTOR * 3) * setup.dxhdy;
ivec4 drgba_deh = attr.drgba_de & ~0x1ff;
ivec4 drgba_dyh = attr.drgba_dy & ~0x1ff;
drgba_diff = drgba_deh - (drgba_deh >> 2) - drgba_dyh + (drgba_dyh >> 2);
ivec4 dstzw_deh = attr.dstzw_de & ~0x1ff;
ivec4 dstzw_dyh = attr.dstzw_dy & ~0x1ff;
dstzw_diff = dstzw_deh - (dstzw_deh >> 2) - dstzw_dyh + (dstzw_dyh >> 2);
}
int base_x = xh >> 15;
int xfrac = skip_xfrac ? 0 : ((xh >> 7) & 0xff);
ivec4 rgba = attr.rgba + interpolate_snapped(attr.drgba_de, dy);
rgba = ((rgba & ~0x1ff) + drgba_diff - interpolate_snapped((attr.drgba_dx >> 8) & ~1, xfrac)) & ~0x3ff;
ivec4 stzw = attr.stzw + interpolate_snapped(attr.dstzw_de, dy);
stzw = ((stzw & ~0x1ff) + dstzw_diff - interpolate_snapped((attr.dstzw_dx >> 8) & ~1, xfrac)) & ~0x3ff;
span_setup.rgba = rgba;
span_setup.stzw = stzw;
span_setup.interpolation_base_x = base_x;
}
// Check Y dimension.
int yh_interpolation_base = int(setup.yh) & ~(SUBPIXELS - 1);
int ym_interpolation_base = int(setup.ym);
yh_interpolation_base *= SCALING_FACTOR;
ym_interpolation_base *= SCALING_FACTOR;
int y_sub = int(y * SUBPIXELS);
ivec4 y_subs = y_sub + ivec4(0, 1, 2, 3);
int ylo = max(setup.yh, scissor.ylo) * SCALING_FACTOR;
int yhi = min(setup.yl, scissor.yhi) * SCALING_FACTOR;
bvec4 clip_lo_y = lessThan(y_subs, ivec4(ylo));
bvec4 clip_hi_y = greaterThanEqual(y_subs, ivec4(yhi));
uvec4 clip_y = uvec4(clip_lo_y) | uvec4(clip_hi_y);
// Interpolate X at all 4 Y-subpixels.
ivec4 xh = setup.xh * SCALING_FACTOR + (y_subs - yh_interpolation_base) * setup.dxhdy;
ivec4 xm = setup.xm * SCALING_FACTOR + (y_subs - yh_interpolation_base) * setup.dxmdy;
ivec4 xl = setup.xl * SCALING_FACTOR + (y_subs - ym_interpolation_base) * setup.dxldy;
xl = mix(xl, xm, lessThan(y_subs, ivec4(SCALING_FACTOR * setup.ym)));
	// If we have overflows, we can become sensitive to this in the invalid_line check, where
	// checks that should pass fail, and vice versa.
// Note that we shaved off one bit in triangle setup for upscaling purposes,
// so this should be 28 bits normally.
xl = bitfieldExtract(xl, 0, 27 + SCALING_LOG2);
xh = bitfieldExtract(xh, 0, 27 + SCALING_LOG2);
ivec4 xh_shifted = quantize_x(xh);
ivec4 xl_shifted = quantize_x(xl);
ivec4 xleft, xright;
if (flip)
{
xleft = xh_shifted;
xright = xl_shifted;
}
else
{
xleft = xl_shifted;
xright = xh_shifted;
}
bvec4 invalid_line = greaterThan(xleft >> 1, xright >> 1);
ivec4 lo_scissor = ivec4(SCALING_FACTOR * (scissor.xlo << 1));
ivec4 hi_scissor = ivec4(SCALING_FACTOR * (scissor.xhi << 1));
bool all_over = all(greaterThanEqual(min(xleft, xright), hi_scissor));
bool all_under = all(lessThan(max(xleft, xright), lo_scissor));
xleft = max(xleft, lo_scissor);
xleft = min(xleft, hi_scissor);
xright = max(xright, lo_scissor);
xright = min(xright, hi_scissor);
invalid_line = bvec4(uvec4(invalid_line) | clip_y);
xleft = mix(xleft, ivec4(0xffff), invalid_line);
xright = mix(xright, ivec4(0), invalid_line);
int start_x = min4(xleft) >> 3;
int end_x = max4(xright) >> 3;
span_setup.xleft = xleft;
span_setup.xright = xright;
span_setup.start_x = start_x;
span_setup.end_x = end_x;
span_setup.valid_line = int(!all(invalid_line) && !all_over && !all_under);
if (interlace_en)
if (((y >> SCALING_LOG2) & 1) != int(keep_odd_field))
span_setup.valid_line = U16_C(0);
span_setup.lodlength = int(flip ? (end_x - span_setup.interpolation_base_x) : (span_setup.interpolation_base_x - start_x));
store_span_setup(gl_GlobalInvocationID.x, span_setup);
}


@@ -0,0 +1,43 @@
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#ifndef STORE_SPAN_SETUP_H_
#define STORE_SPAN_SETUP_H_
void store_span_setup(uint index, SpanSetup setup)
{
#if SMALL_TYPES
span_setups.elems[index] = setup;
#else
span_setups.elems[index].rgba = setup.rgba;
span_setups.elems[index].stzw = setup.stzw;
span_setups.elems[index].xleft = mem_u16x4(uvec4(setup.xleft));
span_setups.elems[index].xright = mem_u16x4(uvec4(setup.xright));
span_setups.elems[index].interpolation_base_x = setup.interpolation_base_x;
span_setups.elems[index].start_x = setup.start_x;
span_setups.elems[index].end_x = setup.end_x;
span_setups.elems[index].lodlength = mem_i16(setup.lodlength);
span_setups.elems[index].valid_line = mem_u16(setup.valid_line);
#endif
}
#endif

File diff suppressed because it is too large


@@ -0,0 +1,270 @@
#version 450
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
// Consumes result from tile_binning_prepass.comp, bins at a finer resolution (8x8 or 16x16 blocks).
#include "small_types.h"
#if SUBGROUP
#extension GL_KHR_shader_subgroup_basic : require
#extension GL_KHR_shader_subgroup_vote : require
#extension GL_KHR_shader_subgroup_ballot : require
#extension GL_KHR_shader_subgroup_arithmetic : require
layout(local_size_x_id = 0) in;
#else
// Reasonable default. For AMD (64 threads), subgroups are definitely supported, so this won't be hit.
layout(local_size_x = 32) in;
#endif
#include "debug.h"
#include "data_structures.h"
#include "binning.h"
layout(constant_id = 1) const int TILE_WIDTH = 8;
layout(constant_id = 2) const int TILE_HEIGHT = 8;
layout(constant_id = 3) const int MAX_PRIMITIVES = 256;
layout(constant_id = 4) const int MAX_WIDTH = 1024;
layout(constant_id = 5) const int TILE_INSTANCE_STRIDE = 0x8000;
layout(constant_id = 6) const int SCALE_FACTOR = 1;
const int TILE_BINNING_STRIDE = MAX_PRIMITIVES / 32;
const int MAX_TILES_X = MAX_WIDTH / TILE_WIDTH;
layout(set = 0, binding = 0, std430) readonly buffer TriangleSetupBuffer
{
TriangleSetupMem elems[];
} triangle_setup;
#include "load_triangle_setup.h"
layout(set = 0, binding = 1, std430) readonly buffer ScissorStateBuffer
{
ScissorStateMem elems[];
} scissor_state;
#include "load_scissor_state.h"
layout(set = 0, binding = 2, std430) readonly buffer StateIndicesBuffer
{
InstanceIndicesMem elems[];
} state_indices;
layout(std430, set = 0, binding = 3) writeonly buffer TileBitmask
{
uint binned_bitmask[];
};
layout(std430, set = 0, binding = 4) writeonly buffer TileBitmaskCoarse
{
uint binned_bitmask_coarse[];
};
#if !UBERSHADER
layout(std430, set = 0, binding = 5) writeonly buffer TileInstanceOffset
{
uint elems[];
} tile_instance_offsets;
layout(std430, set = 0, binding = 6) buffer IndirectBuffer
{
uvec4 elems[];
} indirect_counts;
// This can actually be uint16_t, but AMD doesn't seem to support loading uint16_t in the SMEM unit;
// the memory traffic for this data structure is not relevant anyway.
struct TileRasterWork
{
uint tile_x, tile_y;
uint tile_instance;
uint primitive;
};
layout(std430, set = 0, binding = 7) writeonly buffer WorkList
{
uvec4 elems[];
} tile_raster_work;
#endif
#if !UBERSHADER
uint allocate_work_offset(uint variant_index)
{
#if !SUBGROUP
return atomicAdd(indirect_counts.elems[variant_index].x, 1u);
#else
// Merge atomic operations. Compiler would normally do this,
// but it might not have figured out that variant_index is uniform.
uvec4 active_mask = subgroupBallot(true);
uint count = subgroupBallotBitCount(active_mask);
uint work_offset = 0u;
if (subgroupElect())
work_offset = atomicAdd(indirect_counts.elems[variant_index].x, count);
work_offset = subgroupBroadcastFirst(work_offset);
work_offset += subgroupBallotExclusiveBitCount(active_mask);
return work_offset;
#endif
}
#endif
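The subgroup path above merges per-lane atomics: one elected lane performs a single atomicAdd of the ballot popcount, and every active lane derives its own slot from the exclusive bit count of lanes below it. A C sketch of that offset derivation (hypothetical helper, simulating a 32-wide ballot mask):

```c
#include <assert.h>
#include <stdint.h>

/* Portable popcount for the sketch. */
static int popcount32(uint32_t v)
{
    int c = 0;
    while (v) { v &= v - 1u; c++; }
    return c;
}

/* Given an active-lane ballot mask and a base offset obtained from a single
 * atomicAdd of popcount(mask) by the elected lane, lane N's slot is
 * base + popcount of the active lanes strictly below N. */
static uint32_t lane_work_offset(uint32_t base, uint32_t mask, int lane)
{
    uint32_t below = (lane == 0) ? 0u : (mask & ((1u << lane) - 1u));
    return base + (uint32_t)popcount32(below);
}
```

With mask 0xB (lanes 0, 1, 3 active) and base 10, the lanes get slots 10, 11, 12 from one atomic instead of three, mirroring `subgroupBallotExclusiveBitCount`.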
layout(push_constant, std430) uniform Registers
{
uvec2 resolution;
int primitive_count;
} fb_info;
#if !SUBGROUP
shared uint merged_mask_shared;
#endif
void main()
{
int group_index = int(gl_WorkGroupID.x);
ivec2 meta_tile = ivec2(gl_WorkGroupID.yz);
const int TILES_X = 8;
const int TILES_Y = int(gl_WorkGroupSize.x) >> 3;
#if SUBGROUP
	// The spec is unclear on how gl_LocalInvocationIndex maps to gl_SubgroupInvocationID, so synthesize our own mapping.
// We know the subgroups are fully occupied with VK_EXT_subgroup_size_control already.
int local_index = int(gl_SubgroupInvocationID);
int SUBGROUP_TILES_Y = int(gl_SubgroupSize) >> 3;
#else
int local_index = int(gl_LocalInvocationIndex);
#endif
int inner_tile_x = local_index & 7;
int inner_tile_y = local_index >> 3;
#if SUBGROUP
inner_tile_y += SUBGROUP_TILES_Y * int(gl_SubgroupID);
#endif
ivec2 tile = meta_tile * ivec2(TILES_X, TILES_Y) + ivec2(inner_tile_x, inner_tile_y);
int linear_tile = tile.y * MAX_TILES_X + tile.x;
ivec2 base_coord_meta = meta_tile * ivec2(TILE_WIDTH * TILES_X, TILE_HEIGHT * TILES_Y);
#if SUBGROUP
base_coord_meta.y += SUBGROUP_TILES_Y * TILE_HEIGHT * int(gl_SubgroupID);
ivec2 end_coord_meta = min(base_coord_meta + ivec2(TILE_WIDTH * TILES_X, TILE_HEIGHT * SUBGROUP_TILES_Y), ivec2(fb_info.resolution)) - 1;
#else
ivec2 end_coord_meta = min(base_coord_meta + ivec2(TILE_WIDTH * TILES_X, TILE_HEIGHT * TILES_Y), ivec2(fb_info.resolution)) - 1;
#endif
ivec2 base_coord = tile * ivec2(TILE_WIDTH, TILE_HEIGHT);
ivec2 end_coord = min(base_coord + ivec2(TILE_WIDTH, TILE_HEIGHT), ivec2(fb_info.resolution)) - 1;
int primitive_count = fb_info.primitive_count;
#if !SUBGROUP
if (local_index == 0)
merged_mask_shared = 0u;
barrier();
#endif
bool binned = false;
if (local_index < 32)
{
uint primitive_index = group_index * 32 + local_index;
if (primitive_index < primitive_count)
{
ScissorState scissor = load_scissor_state(primitive_index);
TriangleSetup setup = load_triangle_setup(primitive_index);
binned = bin_primitive(setup, base_coord_meta, end_coord_meta, SCALE_FACTOR, scissor);
}
}
#if SUBGROUP
uint merged_mask = subgroupBallot(binned).x;
#else
if (binned)
atomicOr(merged_mask_shared, 1u << local_index);
barrier();
uint merged_mask = merged_mask_shared;
#endif
uint binned_mask = 0u;
while (merged_mask != 0u)
{
int bit = findLSB(merged_mask);
merged_mask &= ~(1u << bit);
uint primitive_index = group_index * 32 + bit;
ScissorState scissor = load_scissor_state(primitive_index);
TriangleSetup setup = load_triangle_setup(primitive_index);
if (bin_primitive(setup, base_coord, end_coord, SCALE_FACTOR, scissor))
binned_mask |= 1u << bit;
}
binned_bitmask[linear_tile * TILE_BINNING_STRIDE + group_index] = binned_mask;
if (binned_mask != 0u)
atomicOr(binned_bitmask_coarse[linear_tile], 1u << group_index);
else
atomicAnd(binned_bitmask_coarse[linear_tile], ~(1u << group_index));
#if SUBGROUP
#if !UBERSHADER
uint bit_count = uint(bitCount(binned_mask));
uint instance_offset = 0u;
if (subgroupAny(bit_count != 0u))
{
// Allocate tile instance space for all threads in subgroup in one go.
uint total_bit_count = subgroupAdd(bit_count);
if (subgroupElect())
if (total_bit_count != 0u)
instance_offset = atomicAdd(indirect_counts.elems[0].w, total_bit_count);
instance_offset = subgroupBroadcastFirst(instance_offset);
instance_offset += subgroupInclusiveAdd(bit_count) - bit_count;
}
#endif
#else
#if !UBERSHADER
uint bit_count = uint(bitCount(binned_mask));
uint instance_offset = 0u;
if (bit_count != 0u)
instance_offset = atomicAdd(indirect_counts.elems[0].w, bit_count);
#endif
#endif
#if !UBERSHADER
if (bit_count != 0u)
tile_instance_offsets.elems[linear_tile * TILE_BINNING_STRIDE + group_index] = instance_offset;
#if SUBGROUP
uint variant_mask = subgroupOr(binned_mask);
#else
uint variant_mask = binned_mask;
#endif
while (variant_mask != 0u)
{
int bit = findLSB(variant_mask);
variant_mask &= ~(1u << bit);
int primitive_index = group_index * 32 + bit;
if ((binned_mask & (1u << bit)) != 0u)
{
uint variant_index = uint(state_indices.elems[primitive_index].static_depth_tmem.x);
uint work_offset = allocate_work_offset(variant_index);
tile_raster_work.elems[work_offset + uint(TILE_INSTANCE_STRIDE) * variant_index] =
uvec4(tile.x, tile.y, instance_offset, primitive_index);
instance_offset++;
}
}
#endif
}


@@ -0,0 +1,577 @@
#version 450
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#include "debug.h"
#include "small_types.h"
layout(local_size_x_id = 0) in;
layout(set = 0, binding = 0, std430) readonly buffer VRAM8Buffer
{
mem_u8 data[];
} vram8;
layout(set = 0, binding = 0, std430) readonly buffer VRAM16Buffer
{
mem_u16 data[];
} vram16;
layout(set = 0, binding = 0, std430) readonly buffer VRAM32Buffer
{
uint data[];
} vram32;
layout(set = 0, binding = 1, std430) buffer TMEM16Buffer
{
mem_u16 data[2048];
} tmem16;
struct TileInstance
{
mem_u16 data[2048];
};
layout(set = 0, binding = 2, std430) writeonly buffer TMEMInstances
{
TileInstance instances[];
} tile_instances;
layout(push_constant, std430) uniform Registers
{
int num_uploads;
} registers;
const int TEXTURE_FMT_RGBA = 0;
const int TEXTURE_FMT_YUV = 1;
const int TEXTURE_FMT_CI = 2;
const int TEXTURE_FMT_IA = 3;
const int TEXTURE_FMT_I = 4;
const int UPLOAD_MODE_TILE = 0;
const int UPLOAD_MODE_TLUT = 1;
const int UPLOAD_MODE_BLOCK = 2;
struct UploadInfo
{
int width, height;
float min_t_mod, max_t_mod;
int vram_addr;
int vram_width;
int vram_size;
int vram_effective_width;
int tmem_offset;
int tmem_stride_words;
int tmem_size;
int tmem_fmt;
int mode;
float inv_tmem_stride_words;
int dxt;
int padding;
};
layout(set = 1, binding = 0, std140) uniform UploadInfos
{
UploadInfo upload_info[256];
};
bool tmem_dirty;
uint current_tmem_value;
int compute_upload_t(int offset, float inv_stride)
{
// This is still exact for all relevant inputs, and much faster than integer divide.
return int((float(offset) + 0.5) * inv_stride);
}
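The comment above claims the reciprocal multiply stays exact for all relevant inputs: the +0.5 bias keeps the true quotient at least 0.5/stride away from any integer boundary, which dwarfs the roughly 2^-24 relative rounding error of a float multiply. A brute-force C check over a plausible TMEM-sized operand range (the bounds here are assumptions, not derived from the renderer):

```c
#include <assert.h>

/* Mirror of compute_upload_t: floor((offset + 0.5) * (1 / stride)). */
static int compute_upload_t_float(int offset, float inv_stride)
{
    return (int)(((float)offset + 0.5f) * inv_stride);
}

/* Verify the float path agrees with exact integer division over an
 * assumed range of strides and offsets. */
static int check_exactness(void)
{
    for (int stride = 1; stride <= 64; stride++) {
        float inv = 1.0f / (float)stride;
        for (int offset = 0; offset < 2048; offset++)
            if (compute_upload_t_float(offset, inv) != offset / stride)
                return 0;
    }
    return 1;
}
```

Since offset + 0.5 is exactly representable in a float up to these magnitudes, the only error source is the reciprocal and multiply, and the 0.5/stride guard band absorbs it.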
// In 32bpp upload mode we read 64 bits and split the result over the lower and upper TMEM.
void update_tmem_32(UploadInfo info, int tmem16_index, bool upper_tmem, bool yuv)
{
int tmem16_offset = (info.tmem_offset & 0x7ff) >> 1;
int tmem16_stride = info.tmem_stride_words;
int pixel_offset = (tmem16_index - tmem16_offset) & 0x3ff;
int upload_x, upload_y;
int upload_x_xor = 0;
if (info.mode == UPLOAD_MODE_BLOCK)
{
int word_offset = pixel_offset >> 1;
if (info.tmem_stride_words == 0)
{
// Trivial case, we can just compute T factor directly and set upload_x_xor.
// Other than that, it works like a simple 1D upload.
// However, if DxT is weird, we might end up in a situation where this word is written multiple times,
// or zero times.
int iteration_candidate_first = word_offset & ~1;
int iteration_candidate_second = iteration_candidate_first + 1;
int first_t = (iteration_candidate_first * info.dxt) >> 16;
int second_t = (iteration_candidate_second * info.dxt) >> 16;
if (first_t != second_t)
{
int iteration_candidate_first_write_index = iteration_candidate_first ^ (first_t & 1);
int iteration_candidate_second_write_index = iteration_candidate_second ^ (second_t & 1);
if (iteration_candidate_second_write_index == word_offset)
upload_x_xor = (second_t & 1) << 1;
else if (iteration_candidate_first_write_index == word_offset)
upload_x_xor = (first_t & 1) << 1;
else
return;
}
else
upload_x_xor ^= (first_t & 1) << 1;
}
else
{
// Welp ... This is pure insanity, but if we want to be completely correct ...
int min_t = compute_upload_t(word_offset & ~1, info.min_t_mod);
int max_t = compute_upload_t(word_offset | 1, info.max_t_mod);
// If t has a range, then the solution to Y = (t = floor(X * dt / 2048)) * stride + X has a range space of:
// Y - t_max * stride <= X <= Y - t_min * stride.
int max_word_candidate = (word_offset | 1) - tmem16_stride * min_t;
int min_word_candidate = (word_offset & ~1) - tmem16_stride * max_t;
			// If we have constraints for X, we constrain T further.
min_t = max(min_t, (min_word_candidate * info.dxt) >> 16);
max_t = min(max_t, (max_word_candidate * info.dxt) >> 16);
bool found_candidate = false;
for (int t = max_t; t >= min_t; t--)
{
// Check to see if t is a solution to the equation.
// Potentially two targets could write here.
int candidate_solution_first = (word_offset & ~1) - tmem16_stride * t;
int candidate_solution_second = (word_offset | 1) - tmem16_stride * t;
int candidate_t_first = (candidate_solution_first * info.dxt) >> 16;
int candidate_t_second = (candidate_solution_second * info.dxt) >> 16;
if (((candidate_solution_second + candidate_t_second * tmem16_stride) ^ (candidate_t_second & 1)) == word_offset)
{
found_candidate = true;
pixel_offset = (candidate_solution_second << 1) + (pixel_offset & 1);
break;
}
else if (((candidate_solution_first + candidate_t_first * tmem16_stride) ^ (candidate_t_first & 1)) == word_offset)
{
found_candidate = true;
pixel_offset = (candidate_solution_first << 1) + (pixel_offset & 1);
break;
}
}
// We strided over this 64bpp word.
if (!found_candidate)
return;
}
upload_x = pixel_offset;
upload_y = 0;
}
else if (tmem16_stride == 0)
{
// For TMEM stride of 0 we're essentially replaying the same line over and over and the final visible result
// is what happened in Y == height - 1.
upload_x = pixel_offset;
upload_y = info.height - 1;
}
else
{
upload_y = compute_upload_t(pixel_offset, info.inv_tmem_stride_words);
upload_x = pixel_offset - upload_y * tmem16_stride;
// If stride is smaller than width, we'll need to unroll the last line.
if (upload_y >= info.height)
{
upload_x += tmem16_stride * (upload_y - info.height + 1);
upload_y = info.height - 1;
}
}
int last_line_upload_x = upload_x ^ ((upload_y & 1) << 1);
if (last_line_upload_x >= info.width && upload_y > 0)
{
// If the last line won't trigger a write, the previous line probably did.
upload_y--;
upload_x += tmem16_stride;
}
int iteration_offset;
upload_x ^= ((upload_y & 1) << 1) | upload_x_xor;
if (info.vram_size == 3 || yuv)
{
iteration_offset = 4 * (upload_x & ~1);
}
else if (info.vram_size == 2)
{
// In 16bpp VRAM mode, we are supposed to step 4 pixels at a time (8 bytes), which will form 2 complete pixels.
// However, in 32bpp tile mode we're not shifting the X value appropriately.
// So, we're writing texels [0, 1, ..., 4, 5, ...], etc.
if ((upload_x & 2) != 0)
{
// We're not writing in this line, but the previous line might have!
// Interleaving patterns will form ...
if (upload_y > 0)
{
upload_y--;
upload_x += tmem16_stride;
upload_x ^= 2;
}
else
{
// These 2 words will never be written to.
return;
}
}
iteration_offset = 2 * (upload_x & ~1);
}
else if (info.vram_size == 1)
{
// 4 potential mirrors.
for (int i = 0; i < 4 && upload_y > 0 && (upload_x & 6) != 0; i++)
{
upload_y--;
upload_x += tmem16_stride;
upload_x ^= 2;
}
if ((upload_x & 6) != 0)
{
// These 6 words will never be written to.
return;
}
iteration_offset = upload_x & ~1;
}
if (upload_x >= info.width)
return;
int line_rdram_addr = info.vram_addr + ((upload_y * info.vram_width) << (info.vram_size - 1));
// The loading pipeline reads 64 bits per iteration.
int rdram_addr = line_rdram_addr + iteration_offset + 4 * (upload_x & 1);
uint word;
if ((rdram_addr & 3) == 0)
{
word = uint(vram32.data[rdram_addr >> 2]);
}
else
{
word = (uint(vram8.data[rdram_addr ^ 3]) << 24) |
(uint(vram8.data[(rdram_addr + 1) ^ 3]) << 16) |
(uint(vram8.data[(rdram_addr + 2) ^ 3]) << 8) |
uint(vram8.data[(rdram_addr + 3) ^ 3]);
}
if (yuv)
{
// Lower TMEM receives interleaved UV samples, while upper receives Y.
if (upper_tmem)
{
uint y0 = (word >> 16u) & 0xffu;
uint y1 = (word >> 0u) & 0xffu;
word = (y0 << 8u) | y1;
}
else
{
uint u = (word >> 24u) & 0xffu;
uint v = (word >> 8u) & 0xffu;
word = (u << 8u) | v;
}
}
else
{
word >>= 16u - 16u * uint(upper_tmem);
word &= 0xffffu;
}
current_tmem_value = word;
tmem_dirty = true;
}
void update_tmem_16(UploadInfo info, int tmem16_index)
{
int tmem16_offset = (info.tmem_offset & 0xfff) >> 1;
int tmem16_stride = info.tmem_stride_words;
int pixel_offset = (tmem16_index - tmem16_offset) & 0x7ff;
int upload_x, upload_y;
int upload_x_xor = 0;
if (info.mode == UPLOAD_MODE_BLOCK)
{
int word_offset = pixel_offset >> 2;
if (info.tmem_stride_words == 0)
{
// Trivial case, we can just compute T factor directly and set upload_x_xor.
// Other than that, it works like a simple 1D upload.
upload_x_xor = (((word_offset * info.dxt) >> 16) & 1) << 1;
}
else
{
// Welp ... This is pure insanity, but if we want to be completely correct ...
int min_t = compute_upload_t(word_offset, info.min_t_mod);
int max_t = compute_upload_t(word_offset, info.max_t_mod);
// If t has a range, then the solution X to Y = X + t * stride, with t = floor(X * dxt / 2048),
// is bounded by: Y - t_max * stride <= X <= Y - t_min * stride.
int max_word_candidate = word_offset - tmem16_stride * min_t;
int min_word_candidate = word_offset - tmem16_stride * max_t;
// Given the constraints on X, we constrain T further.
min_t = max(min_t, (min_word_candidate * info.dxt) >> 16);
max_t = min(max_t, (max_word_candidate * info.dxt) >> 16);
bool found_candidate = false;
for (int t = max_t; t >= min_t; t--)
{
// Check to see if t is a solution to the equation.
int candidate_solution = word_offset - tmem16_stride * t;
int computed_t = (candidate_solution * info.dxt) >> 16;
if (candidate_solution + computed_t * tmem16_stride == word_offset)
{
found_candidate = true;
upload_x_xor = (computed_t & 1) << 1;
pixel_offset = (candidate_solution << 2) + (pixel_offset & 3);
}
}
// We strided over this 64bpp word.
if (!found_candidate)
return;
}
upload_x = pixel_offset;
upload_y = 0;
}
else if (tmem16_stride == 0)
{
// For TMEM stride of 0 we're essentially replaying the same line over and over and the final visible result
// is what happened in Y == height - 1.
upload_x = pixel_offset;
upload_y = info.height - 1;
}
else
{
upload_y = compute_upload_t(pixel_offset, info.inv_tmem_stride_words);
upload_x = pixel_offset - upload_y * tmem16_stride;
// If stride is smaller than width, we'll need to unroll the last line.
if (upload_y >= info.height)
{
upload_x += tmem16_stride * (upload_y - info.height + 1);
upload_y = info.height - 1;
}
}
// This is pure bullshit magic which arises as an edge case when
// tile pixel size does not match texture image size.
// Should not happen in normal applications.
// This is basically doing scatter-as-gather, so we need to figure out
// if there is no write to our texel after all (striding), or if there are multiple writes
// to our texel, in which case we need to figure out the last writer.
// This code is black magic, and it's made with blood, sweat and tears from testing with lots of trial and error.
int iteration_offset;
if (info.tmem_size != info.vram_size)
{
if (info.vram_size - info.tmem_size == 1)
{
// If TMEM is N bpp but VRAM is 2N bpp, we will get mirrored writes here.
// Select which half of the 2N bpp load we observe in TMEM.
iteration_offset = (upload_x & ~3) * 4;
if ((upload_x & ~3) + 2 < (info.vram_effective_width >> (3 - info.vram_size)))
iteration_offset += 8;
}
else if (info.tmem_size == 2 && info.vram_size == 1)
{
// In 8bpp VRAM mode, we are supposed to step 8 pixels at a time (8 bytes), which will form 4 complete pixels.
// However, in 16bpp tile mode we're not shifting the X value appropriately.
// So, we're writing texels [0, 1, 2, 3, ..., 8, 9, 10, 11], etc.
if ((upload_x & 4) != 0)
{
// We're not writing in this line, but the previous line might have!
// Interleaving patterns will form ...
if ((tmem16_stride & 4) != 0 && upload_y > 0)
{
upload_y--;
upload_x += tmem16_stride;
}
else
{
// These 4 words will never be written to.
return;
}
}
iteration_offset = upload_x & ~3;
}
}
else
{
// Normal case: TMEM size aligns with VRAM size.
iteration_offset = (upload_x & ~3) * 2;
}
if (upload_x >= info.width)
return;
int line_rdram_addr = info.vram_addr + ((upload_y * info.vram_width) << (info.vram_size - 1));
upload_x ^= ((upload_y & 1) << 1) | upload_x_xor;
// The loading pipeline reads 64 bits per iteration.
int rdram_addr = line_rdram_addr + iteration_offset + 2 * (upload_x & 3);
uint word;
if ((rdram_addr & 1) == 0)
word = uint(vram16.data[(rdram_addr >> 1) ^ 1]);
else
word = (uint(vram8.data[rdram_addr ^ 3]) << 8) | uint(vram8.data[(rdram_addr + 1) ^ 3]);
current_tmem_value = word;
tmem_dirty = true;
}
void update_tmem_lut(UploadInfo info, int tmem16_index)
{
int tmem16_offset = (info.tmem_offset & 0xfff) >> 1;
int pixel_offset = (tmem16_index - tmem16_offset) & 0x7ff;
int pixel_offset_splat;
if (info.vram_size - info.tmem_size == 2)
{
pixel_offset_splat = pixel_offset >> 2;
pixel_offset_splat <<= info.vram_size - 2;
if (pixel_offset_splat >= info.vram_effective_width)
return;
}
else if (info.vram_size - info.tmem_size == 1)
{
if ((pixel_offset & 4) == 0)
{
int shamt = info.tmem_size + (info.vram_size == 2 ? 2 : 0);
pixel_offset_splat = (pixel_offset & ~7) >> shamt;
if (pixel_offset_splat >= info.vram_effective_width)
return;
}
else
{
return;
}
}
else if (info.vram_size == info.tmem_size)
{
if ((pixel_offset & 0xc) == 0)
{
int shamt = info.tmem_size + (info.vram_size == 2 ? 2 : 0);
pixel_offset_splat = (pixel_offset & ~3) >> shamt;
if (pixel_offset_splat >= info.vram_effective_width)
return;
}
else
{
return;
}
}
else if (info.vram_size - info.tmem_size == -1)
{
if ((pixel_offset & 0x1c) == 0)
{
int shamt = info.tmem_size;
pixel_offset_splat = (pixel_offset >> shamt) & ~7;
if (pixel_offset_splat >= info.vram_effective_width)
return;
}
else
{
return;
}
}
else
{
// 4bpp tile, 32bpp VRAM. Mirrored writes.
int span_iteration = pixel_offset >> 2;
span_iteration = span_iteration * 2;
int span_pixel = span_iteration * 2;
if (span_pixel + 2 < info.vram_effective_width)
span_pixel += 2;
if (span_pixel >= info.vram_effective_width)
return;
pixel_offset_splat = span_pixel;
}
int rdram_addr = info.vram_addr + (pixel_offset_splat << (info.vram_size - 1));
// Odd behavior when we have unaligned TLUT uploads.
rdram_addr += 2 * (rdram_addr & 1) * (pixel_offset & 3);
uint word;
if ((rdram_addr & 1) == 0)
word = uint(vram16.data[(rdram_addr >> 1) ^ 1]);
else
word = (uint(vram8.data[rdram_addr ^ 3]) << 8) | uint(vram8.data[(rdram_addr + 1) ^ 3]);
current_tmem_value = word;
tmem_dirty = true;
}
void main()
{
tmem_dirty = false;
current_tmem_value = uint(tmem16.data[gl_GlobalInvocationID.x]);
int tmem16_index = int(gl_GlobalInvocationID.x) ^ 1;
bool upper_tmem = tmem16_index >= 0x400;
tile_instances.instances[0].data[gl_GlobalInvocationID.x] = mem_u16(current_tmem_value);
int num_uploads = registers.num_uploads;
for (int i = 0; i < num_uploads; i++)
{
UploadInfo info = upload_info[i];
if (info.mode == UPLOAD_MODE_TLUT)
{
update_tmem_lut(info, tmem16_index);
}
else
{
bool yuv = info.tmem_fmt == TEXTURE_FMT_YUV;
if (info.tmem_size == 3 || yuv)
update_tmem_32(info, tmem16_index & 0x3ff, upper_tmem, yuv);
else
update_tmem_16(info, tmem16_index);
}
tile_instances.instances[i + 1].data[gl_GlobalInvocationID.x] = mem_u16(current_tmem_value);
}
if (tmem_dirty)
tmem16.data[gl_GlobalInvocationID.x] = mem_u16(current_tmem_value);
}
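The scatter-as-gather solve at the heart of the LOAD_BLOCK paths in `update_tmem_16` and `update_tmem_32` above can be illustrated host-side. The following C sketch uses hypothetical names and is a simplified model: it ignores the interleave XORs and the min_t/max_t pruning the shader performs, but it demonstrates the same idea of inverting the scatter `X + t(X) * stride` by scanning candidate t values.

```c
#include <assert.h>

/* Host-side sketch of the scatter-as-gather solve used for LOAD_BLOCK.
 * All names here are hypothetical.
 *
 * Scatter: source word X lands at TMEM word X + t(X) * stride, where
 * t(X) = (X * dxt) >> 16 models the DxT line counter. */
static int block_t(int x, int dxt)
{
    return (x * dxt) >> 16;
}

/* Gather: given a TMEM word offset, recover the source word X that wrote
 * it, or -1 if the upload strided over this word. With a positive stride
 * and monotonically increasing t(X) the solution is unique, so scanning
 * candidate t values in either direction finds it. */
static int solve_gather(int word_offset, int dxt, int stride, int max_words)
{
    for (int t = block_t(word_offset, dxt); t >= 0; t--)
    {
        int x = word_offset - stride * t;
        if (x < 0 || x >= max_words)
            continue;
        if (x + block_t(x, dxt) * stride == word_offset)
            return x;
    }
    return -1;
}
```

Brute-forcing the forward scatter and comparing against the gather confirms the two agree for simple dxt/stride combinations.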


@@ -0,0 +1,103 @@
#version 450
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
// RIP to any GPU which attempts to execute this monstrosity :)
#if SUBGROUP
#extension GL_KHR_shader_subgroup_basic : require
#extension GL_KHR_shader_subgroup_vote : require
#extension GL_KHR_shader_subgroup_ballot : require
#extension GL_KHR_shader_subgroup_arithmetic : require
#endif
#include "small_types.h"
layout(local_size_x_id = 3, local_size_y_id = 4) in;
#include "debug.h"
#include "data_structures_buffers.h"
#include "noise.h"
#include "memory_interfacing.h"
#include "shading.h"
layout(push_constant, std430) uniform Registers
{
uint fb_addr_index;
uint fb_depth_addr_index;
uint fb_width;
uint fb_height;
uint group_mask;
} registers;
layout(constant_id = 5) const int MAX_PRIMITIVES = 256;
layout(constant_id = 6) const int MAX_WIDTH = 1024;
const int TILE_BINNING_STRIDE = MAX_PRIMITIVES / 32;
const int MAX_TILES_X = MAX_WIDTH / int(gl_WorkGroupSize.x);
void main()
{
int x = int(gl_GlobalInvocationID.x);
int y = int(gl_GlobalInvocationID.y);
ivec2 tile = ivec2(gl_WorkGroupID.xy);
int linear_tile = tile.x + tile.y * MAX_TILES_X;
int linear_tile_base = linear_tile * TILE_BINNING_STRIDE;
uint coarse_binned = tile_binning_coarse.elems[linear_tile] & registers.group_mask;
if (coarse_binned == 0u)
return;
init_tile(gl_GlobalInvocationID.xy,
registers.fb_width, registers.fb_height,
registers.fb_addr_index, registers.fb_depth_addr_index);
while (coarse_binned != 0u)
{
int mask_index = findLSB(coarse_binned);
coarse_binned &= ~uint(1 << mask_index);
uint binned = tile_binning.elems[linear_tile_base + mask_index];
while (binned != 0u)
{
int i = findLSB(binned);
binned &= ~uint(1 << i);
uint primitive_index = uint(i + 32 * mask_index);
ShadedData shaded;
if (shade_pixel(x, y, primitive_index, shaded))
{
if ((shaded.coverage_count & COVERAGE_FILL_BIT) != 0)
fill_color(derived_setup.elems[primitive_index].fill_color);
else if ((shaded.coverage_count & COVERAGE_COPY_BIT) != 0)
copy_pipeline(shaded.z_dith, primitive_index);
else
depth_blend(x, y, primitive_index, shaded);
}
}
}
finish_tile(gl_GlobalInvocationID.xy,
registers.fb_width, registers.fb_height,
registers.fb_addr_index, registers.fb_depth_addr_index);
}
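The double findLSB loop in `main()` above walks a coarse 32-bit mask of 32-primitive groups, then a fine mask per group, visiting primitive indices in ascending order. A host-side C sketch of the same walk (hypothetical names; GCC/Clang `__builtin_ctz` standing in for GLSL `findLSB`):

```c
#include <assert.h>
#include <stdint.h>

/* Walks a two-level bin structure: 'coarse' marks which 32-primitive
 * groups touch the tile, fine[group] marks primitives within a group.
 * Writes primitive indices to 'out' in ascending order, returns count. */
static int walk_bins(uint32_t coarse, const uint32_t *fine, int *out)
{
    int count = 0;
    while (coarse != 0u)
    {
        int group = __builtin_ctz(coarse); /* findLSB */
        coarse &= coarse - 1u;             /* clear lowest set bit */
        uint32_t bits = fine[group];
        while (bits != 0u)
        {
            int bit = __builtin_ctz(bits);
            bits &= bits - 1u;
            out[count++] = bit + 32 * group; /* primitive index */
        }
    }
    return count;
}
```

Clearing the lowest set bit with `mask &= mask - 1` matches the shader's `binned &= ~uint(1 << i)` since `i` is always the lowest set bit.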


@@ -0,0 +1,119 @@
#version 450
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#include "small_types.h"
#include "fb_formats.h"
layout(local_size_x_id = 3) in;
layout(constant_id = 0) const int RDRAM_SIZE = 8 * 1024 * 1024;
const int RDRAM_MASK_8 = RDRAM_SIZE - 1;
const int RDRAM_MASK_16 = RDRAM_MASK_8 >> 1;
const int RDRAM_MASK_32 = RDRAM_MASK_8 >> 2;
layout(constant_id = 1) const int FB_SIZE_LOG2 = 0;
layout(constant_id = 2) const bool COLOR_DEPTH_ALIAS = false;
layout(constant_id = 4) const int NUM_SAMPLES = 1;
layout(push_constant) uniform Registers
{
uint num_pixels, fb_addr, fb_depth_addr;
} registers;
layout(set = 0, binding = 0) readonly buffer RDRAMSingleSampled8
{
uint8_t elems[];
} vram8;
layout(set = 0, binding = 0) readonly buffer RDRAMSingleSampled16
{
uint16_t elems[];
} vram16;
layout(set = 0, binding = 0) readonly buffer RDRAMSingleSampled32
{
uint elems[];
} vram32;
layout(set = 0, binding = 2) buffer RDRAMUpscalingReference8
{
uint8_t elems[];
} vram_reference8;
layout(set = 0, binding = 2) buffer RDRAMUpscalingReference16
{
uint16_t elems[];
} vram_reference16;
layout(set = 0, binding = 2) buffer RDRAMUpscalingReference32
{
uint elems[];
} vram_reference32;
void copy_rdram_8(uint index)
{
index &= RDRAM_MASK_8;
uint real_word = uint(vram8.elems[index]);
vram_reference8.elems[index] = uint8_t(real_word);
}
void copy_rdram_16(uint index)
{
index &= RDRAM_MASK_16;
uint real_word = uint(vram16.elems[index]);
vram_reference16.elems[index] = uint16_t(real_word);
}
void copy_rdram_32(uint index)
{
index &= RDRAM_MASK_32;
uint real_word = vram32.elems[index];
vram_reference32.elems[index] = real_word;
}
void main()
{
uint index = gl_GlobalInvocationID.x;
if (index >= registers.num_pixels)
return;
uint depth_index = index + registers.fb_depth_addr;
uint color_index = index + registers.fb_addr;
switch (FB_SIZE_LOG2)
{
case 0:
copy_rdram_8(color_index);
break;
case 1:
copy_rdram_16(color_index);
break;
case 2:
copy_rdram_32(color_index);
break;
}
if (!COLOR_DEPTH_ALIAS)
copy_rdram_16(depth_index);
}
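The `^ 3` and `^ 1` address swizzles used throughout these shaders account for RDRAM being stored as native little-endian 32-bit words while the RDP addresses it big-endian. A host-side C sketch of the same addressing (hypothetical names), with the 16-bit read composed from two byte reads as in the shaders' unaligned path:

```c
#include <assert.h>
#include <stdint.h>

/* RDRAM bytes are stored little-endian within each 32-bit word, but the
 * RDP address space is big-endian, so byte accesses XOR the address with 3. */
static uint8_t rdram_read_u8(const uint8_t *ram, uint32_t addr)
{
    return ram[addr ^ 3u];
}

/* Big-endian 16-bit read built from two byte reads. */
static uint16_t rdram_read_u16(const uint8_t *ram, uint32_t addr)
{
    return (uint16_t)((rdram_read_u8(ram, addr) << 8) | rdram_read_u8(ram, addr + 1u));
}
```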


@@ -0,0 +1,185 @@
#version 450
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#include "small_types.h"
layout(local_size_x_id = 3) in;
layout(constant_id = 0) const int RDRAM_SIZE = 8 * 1024 * 1024;
const int RDRAM_MASK_8 = RDRAM_SIZE - 1;
const int RDRAM_MASK_16 = RDRAM_MASK_8 >> 1;
const int RDRAM_MASK_32 = RDRAM_MASK_8 >> 2;
layout(constant_id = 1) const int FB_SIZE_LOG2 = 0;
layout(constant_id = 2) const bool COLOR_DEPTH_ALIAS = false;
layout(constant_id = 4) const int NUM_SAMPLES = 1;
layout(push_constant) uniform Registers
{
uint num_pixels, fb_addr, fb_depth_addr;
} registers;
layout(set = 0, binding = 0) readonly buffer RDRAMSingleSampled8
{
uint8_t elems[];
} vram8;
layout(set = 0, binding = 0) readonly buffer RDRAMSingleSampled16
{
uint16_t elems[];
} vram16;
layout(set = 0, binding = 0) readonly buffer RDRAMSingleSampled32
{
uint elems[];
} vram32;
layout(set = 0, binding = 1) readonly buffer RDRAMHiddenSingleSampled
{
uint8_t elems[];
} hidden_vram;
layout(set = 0, binding = 2) buffer RDRAMUpscalingReference8
{
uint8_t elems[];
} vram_reference8;
layout(set = 0, binding = 2) buffer RDRAMUpscalingReference16
{
uint16_t elems[];
} vram_reference16;
layout(set = 0, binding = 2) buffer RDRAMUpscalingReference32
{
uint elems[];
} vram_reference32;
layout(set = 0, binding = 3) buffer RDRAMUpscaling8
{
uint8_t elems[];
} vram_upscaled8;
layout(set = 0, binding = 3) buffer RDRAMUpscaling16
{
uint16_t elems[];
} vram_upscaled16;
layout(set = 0, binding = 3) buffer RDRAMUpscaling32
{
uint elems[];
} vram_upscaled32;
layout(set = 0, binding = 4) buffer RDRAMHiddenUpscaling
{
uint8_t elems[];
} hidden_vram_upscaled;
void update_rdram_8(uint index)
{
index &= RDRAM_MASK_8;
uint real_word = uint(vram8.elems[index]);
uint reference_word = uint(vram_reference8.elems[index]);
if (real_word != reference_word)
{
uint mirrored_index = index ^ 3u;
uint real_hidden_word = uint(hidden_vram.elems[mirrored_index >> 1u]);
for (int i = 0; i < NUM_SAMPLES; i++)
{
vram_upscaled8.elems[index + i * RDRAM_SIZE] = uint8_t(real_word);
if ((mirrored_index & 1u) != 0u)
hidden_vram_upscaled.elems[(mirrored_index >> 1u) + i * (RDRAM_SIZE >> 1)] = uint8_t(real_hidden_word);
}
vram_reference8.elems[index] = uint8_t(real_word);
}
}
void update_rdram_16(uint index)
{
index &= RDRAM_MASK_16;
uint real_word = uint(vram16.elems[index]);
uint reference_word = uint(vram_reference16.elems[index]);
if (real_word != reference_word)
{
uint mirrored_index = index ^ 1u;
uint real_hidden_word = uint(hidden_vram.elems[mirrored_index]);
for (int i = 0; i < NUM_SAMPLES; i++)
{
vram_upscaled16.elems[index + i * (RDRAM_SIZE >> 1)] = uint16_t(real_word);
hidden_vram_upscaled.elems[mirrored_index + i * (RDRAM_SIZE >> 1)] = uint8_t(real_hidden_word);
}
vram_reference16.elems[index] = uint16_t(real_word);
}
}
void update_rdram_32(uint index)
{
index &= RDRAM_MASK_32;
uint real_word = vram32.elems[index];
uint reference_word = vram_reference32.elems[index];
if (real_word != reference_word)
{
uint real_hidden_word0 = uint(hidden_vram.elems[2u * index]);
uint real_hidden_word1 = uint(hidden_vram.elems[2u * index + 1u]);
for (int i = 0; i < NUM_SAMPLES; i++)
{
vram_upscaled32.elems[index + i * (RDRAM_SIZE >> 2)] = real_word;
hidden_vram_upscaled.elems[2u * index + i * (RDRAM_SIZE >> 1)] = uint8_t(real_hidden_word0);
hidden_vram_upscaled.elems[2u * index + 1u + i * (RDRAM_SIZE >> 1)] = uint8_t(real_hidden_word1);
}
vram_reference32.elems[index] = real_word;
}
}
void main()
{
uint index = gl_GlobalInvocationID.x;
if (index >= registers.num_pixels)
return;
uint depth_index = index + registers.fb_depth_addr;
uint color_index = index + registers.fb_addr;
switch (FB_SIZE_LOG2)
{
case 0:
update_rdram_8(color_index);
break;
case 1:
update_rdram_16(color_index);
break;
case 2:
update_rdram_32(color_index);
break;
}
if (!COLOR_DEPTH_ALIAS)
update_rdram_16(depth_index);
}
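The per-word update above boils down to: diff the unscaled RDRAM against a reference copy, and broadcast any changed word to all upscaled samples so CPU writes become visible there. A simplified host-side C model of `update_rdram_16` (hypothetical names; hidden-RDRAM handling omitted):

```c
#include <assert.h>
#include <stdint.h>

enum { NUM_SAMPLES = 4 };

/* If the CPU changed real[index] since the last sync (it differs from the
 * reference copy), broadcast the new word to every upscaled sample plane
 * and update the reference. 'words' is the per-sample plane size in words. */
static void update_word16(const uint16_t *real, uint16_t *reference,
                          uint16_t *upscaled, uint32_t index, uint32_t words)
{
    if (real[index] != reference[index])
    {
        for (int i = 0; i < NUM_SAMPLES; i++)
            upscaled[index + (uint32_t)i * words] = real[index];
        reference[index] = real[index];
    }
}
```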


@@ -0,0 +1,277 @@
#version 450
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#include "small_types.h"
#include "fb_formats.h"
layout(local_size_x_id = 3) in;
layout(constant_id = 0) const int RDRAM_SIZE = 8 * 1024 * 1024;
const int RDRAM_MASK_8 = RDRAM_SIZE - 1;
const int RDRAM_MASK_16 = RDRAM_MASK_8 >> 1;
const int RDRAM_MASK_32 = RDRAM_MASK_8 >> 2;
layout(constant_id = 1) const int FB_SIZE_LOG2 = 0;
layout(constant_id = 2) const bool COLOR_DEPTH_ALIAS = false;
layout(constant_id = 4) const int NUM_SAMPLES = 1;
layout(constant_id = 5) const bool DITHER = false;
layout(constant_id = 6) const bool RDRAM_UNSCALED_WRITE_MASK = false;
layout(push_constant) uniform Registers
{
uint num_pixels, fb_addr, fb_depth_addr, width, height;
} registers;
layout(set = 0, binding = 0) writeonly buffer RDRAMSingleSampled8
{
uint8_t elems[];
} vram8;
layout(set = 0, binding = 0) writeonly buffer RDRAMSingleSampled16
{
uint16_t elems[];
} vram16;
layout(set = 0, binding = 0) writeonly buffer RDRAMSingleSampled32
{
uint elems[];
} vram32;
layout(set = 0, binding = 2) writeonly buffer RDRAMUpscalingReference8
{
uint8_t elems[];
} vram_reference8;
layout(set = 0, binding = 2) writeonly buffer RDRAMUpscalingReference16
{
uint16_t elems[];
} vram_reference16;
layout(set = 0, binding = 2) writeonly buffer RDRAMUpscalingReference32
{
uint elems[];
} vram_reference32;
layout(set = 0, binding = 3) readonly buffer RDRAMUpscaling8
{
uint8_t elems[];
} vram_upscaled8;
layout(set = 0, binding = 3) readonly buffer RDRAMUpscaling16
{
uint16_t elems[];
} vram_upscaled16;
layout(set = 0, binding = 3) readonly buffer RDRAMUpscaling32
{
uint elems[];
} vram_upscaled32;
layout(set = 0, binding = 4) readonly buffer RDRAMHiddenUpscaling
{
uint8_t elems[];
} hidden_vram_upscaled;
void copy_rdram_8(uint index)
{
uint r = 0u;
for (int i = 0; i < NUM_SAMPLES; i++)
{
uint real_word = uint(vram_upscaled8.elems[index + i * RDRAM_SIZE]);
r += real_word;
}
r = (r + (NUM_SAMPLES >> 1)) / NUM_SAMPLES;
vram_reference8.elems[index] = uint8_t(r);
vram8.elems[index] = uint8_t(r);
}
uvec4 decode_rgba5551(uint word)
{
return (uvec4(word) >> uvec4(11, 6, 1, 0)) & uvec4(0x1f, 0x1f, 0x1f, 1);
}
uint encode_rgba5551(uvec4 color)
{
return (color.r << 11u) | (color.g << 6u) | (color.b << 1u) | color.a;
}
const uint bayer_dither_lut[16] = uint[](
0, 4, 1, 5,
4, 0, 5, 1,
3, 7, 2, 6,
7, 3, 6, 2);
void copy_rdram_16(uint index, uint x, uint y)
{
uvec4 rgba = uvec4(0u);
for (int i = 0; i < NUM_SAMPLES; i++)
{
uint real_word = uint(vram_upscaled16.elems[index + i * (RDRAM_SIZE >> 1)]);
rgba += decode_rgba5551(real_word);
}
if (DITHER)
{
uint dither_value = bayer_dither_lut[(y & 3u) * 4u + (x & 3u)] * NUM_SAMPLES;
rgba = (8u * rgba + dither_value) / (8 * NUM_SAMPLES);
}
else
{
rgba = (rgba + (NUM_SAMPLES >> 1)) / NUM_SAMPLES;
}
uint encoded = encode_rgba5551(rgba);
vram16.elems[index] = uint16_t(encoded);
vram_reference16.elems[index] = uint16_t(encoded);
}
void copy_rdram_16_single_sample(uint index)
{
// Copies the first sample. We cannot meaningfully filter depth samples.
// The first sample should overlap exactly with the single-sampled version.
// Coverage clipping might slightly change the result, but shouldn't be different enough to break things.
uint upscaled_word = uint(vram_upscaled16.elems[index]);
vram16.elems[index] = uint16_t(upscaled_word);
vram_reference16.elems[index] = uint16_t(upscaled_word);
}
uvec4 decode_rgba8(uint word)
{
return (uvec4(word) >> uvec4(24, 16, 8, 0)) & uvec4(0xff);
}
uint encode_rgba8(uvec4 color)
{
return (color.r << 24u) | (color.g << 16u) | (color.b << 8u) | (color.a << 0u);
}
void copy_rdram_32(uint index)
{
uvec4 rgba = uvec4(0u);
for (int i = 0; i < NUM_SAMPLES; i++)
{
uint real_word = vram_upscaled32.elems[index + i * (RDRAM_SIZE >> 2)];
rgba += decode_rgba8(real_word);
}
rgba = (rgba + (NUM_SAMPLES >> 1)) / NUM_SAMPLES;
uint encoded = encode_rgba8(rgba);
vram32.elems[index] = encoded;
vram_reference32.elems[index] = encoded;
}
void main()
{
uvec2 coord = gl_GlobalInvocationID.xy;
uint index = coord.y * registers.width + coord.x;
uint depth_index = index + registers.fb_depth_addr;
uint color_index = index + registers.fb_addr;
uvec2 mask_coord = coord >> 2u;
uint mask_index = mask_coord.x + mask_coord.y * ((registers.width + 3) >> 2u);
uint write_mask;
if (coord.x < registers.width)
write_mask = vram_upscaled32.elems[NUM_SAMPLES * (RDRAM_SIZE >> 2) + mask_index];
else
write_mask = 0u;
uint shamt = 2u * ((coord.x & 3u) + 4u * (coord.y & 3u));
write_mask = write_mask >> shamt;
bool color_write_mask = (write_mask & 1u) != 0u;
bool depth_write_mask = (write_mask & 2u) != 0u;
if (color_write_mask)
{
switch (FB_SIZE_LOG2)
{
case 0:
color_index &= RDRAM_MASK_8;
color_index ^= 3u;
copy_rdram_8(color_index);
break;
case 1:
color_index &= RDRAM_MASK_16;
color_index ^= 1u;
copy_rdram_16(color_index, coord.x, coord.y);
break;
case 2:
color_index &= RDRAM_MASK_32;
copy_rdram_32(color_index);
break;
}
}
// Metal portability: Memory barriers must happen in uniform control flow.
if (RDRAM_UNSCALED_WRITE_MASK)
{
// Need this memory barrier to ensure the mask readback does not read
// an invalid value from RDRAM. If the mask is seen, the valid RDRAM value is
// also coherent.
memoryBarrierBuffer();
if (color_write_mask)
{
switch (FB_SIZE_LOG2)
{
case 0:
vram8.elems[color_index + RDRAM_SIZE] = mem_u8(0xff);
break;
case 1:
vram16.elems[color_index + (RDRAM_SIZE >> 1)] = mem_u16(0xffff);
break;
case 2:
vram32.elems[color_index + (RDRAM_SIZE >> 2)] = ~0u;
break;
}
}
}
// Don't bother writing back hidden VRAM. It is not visible to the host anyway, and coverage is meaningless once filtered.
// If the host later modifies the CPU memory, the hidden VRAM values become completely bogus either way.
if (!COLOR_DEPTH_ALIAS)
{
if (depth_write_mask)
{
depth_index &= RDRAM_MASK_16;
depth_index ^= 1u;
copy_rdram_16_single_sample(depth_index);
}
// Metal portability: Memory barriers must happen in uniform control flow.
if (RDRAM_UNSCALED_WRITE_MASK)
{
// Need this memory barrier to ensure the mask readback does not read
// an invalid value from RDRAM. If the mask is seen, the valid RDRAM value is
// also coherent.
memoryBarrierBuffer();
if (depth_write_mask)
vram16.elems[depth_index + (RDRAM_SIZE >> 1u)] = mem_u16(0xffff);
}
}
}
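The 16-bit resolve path above averages RGBA5551 samples per channel and, when DITHER is set, rounds with the 4x4 Bayer value instead of a fixed +0.5. A host-side C model of the same arithmetic (hypothetical names):

```c
#include <assert.h>
#include <stdint.h>

static const unsigned bayer_lut[16] = {
    0, 4, 1, 5,
    4, 0, 5, 1,
    3, 7, 2, 6,
    7, 3, 6, 2,
};

static void decode_rgba5551(uint16_t w, unsigned rgba[4])
{
    rgba[0] = (w >> 11) & 0x1f;
    rgba[1] = (w >> 6) & 0x1f;
    rgba[2] = (w >> 1) & 0x1f;
    rgba[3] = w & 1;
}

static uint16_t encode_rgba5551(const unsigned rgba[4])
{
    return (uint16_t)((rgba[0] << 11) | (rgba[1] << 6) | (rgba[2] << 1) | rgba[3]);
}

/* Sum n samples per channel, then divide with Bayer-weighted rounding:
 * out = floor((8 * sum + bayer * n) / (8 * n)). Identical samples always
 * resolve to themselves since bayer * n < 8 * n. */
static uint16_t resolve_rgba5551(const uint16_t *samples, int n, unsigned x, unsigned y)
{
    unsigned sum[4] = {0, 0, 0, 0};
    for (int i = 0; i < n; i++)
    {
        unsigned rgba[4];
        decode_rgba5551(samples[i], rgba);
        for (int c = 0; c < 4; c++)
            sum[c] += rgba[c];
    }
    unsigned dither = bayer_lut[(y & 3) * 4 + (x & 3)] * (unsigned)n;
    unsigned out[4];
    for (int c = 0; c < 4; c++)
        out[c] = (8 * sum[c] + dither) / (8 * (unsigned)n);
    return encode_rgba5551(out);
}
```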


@@ -0,0 +1,33 @@
#version 450
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#extension GL_EXT_samplerless_texture_functions : require
layout(location = 0) out vec4 FragColor;
layout(set = 0, binding = 0) uniform texture2D uImage;
void main()
{
// A persistent pixel does not propagate more than one frame.
vec4 input_pixel = texelFetch(uImage, ivec2(gl_FragCoord.xy), 0);
FragColor = vec4(input_pixel.rgb * input_pixel.a, 0.0);
}


@@ -0,0 +1,60 @@
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#ifndef VI_DEBUG_H_
#define VI_DEBUG_H_
#if defined(DEBUG_ENABLE) && DEBUG_ENABLE
#include "debug_channel.h"
void GENERIC_MESSAGE_(int line)
{
add_debug_message(0, uvec3(gl_FragCoord.xy, 0), line);
}
void GENERIC_MESSAGE_(int line, uint v)
{
add_debug_message(0, uvec3(gl_FragCoord.xy, 0), uvec2(line, v));
}
void GENERIC_MESSAGE_(int line, uvec2 v)
{
add_debug_message(0, uvec3(gl_FragCoord.xy, 0), uvec3(line, v));
}
void GENERIC_MESSAGE_(int line, uvec3 v)
{
add_debug_message(0, uvec3(gl_FragCoord.xy, 0), uvec4(line, v));
}
#define GENERIC_MESSAGE0() GENERIC_MESSAGE_(__LINE__)
#define GENERIC_MESSAGE1(a) GENERIC_MESSAGE_(__LINE__, a)
#define GENERIC_MESSAGE2(a, b) GENERIC_MESSAGE_(__LINE__, uvec2(a, b))
#define GENERIC_MESSAGE3(a, b, c) GENERIC_MESSAGE_(__LINE__, uvec3(a, b, c))
#else
#define GENERIC_MESSAGE0()
#define GENERIC_MESSAGE1(a)
#define GENERIC_MESSAGE2(a, b)
#define GENERIC_MESSAGE3(a, b, c)
#endif
#endif


@@ -0,0 +1,31 @@
#version 450
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
layout(location = 0) in vec2 vUV;
layout(set = 0, binding = 0) uniform sampler2D uSampler;
layout(location = 0) out vec4 FragColor;
void main()
{
FragColor = textureLod(uSampler, vUV, 0.0);
}


@@ -0,0 +1,41 @@
#version 450
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
layout(location = 0) out vec2 vUV;
layout(push_constant) uniform UBO
{
float y_offset;
} registers;
void main()
{
if (gl_VertexIndex == 0)
gl_Position = vec4(-1.0, -1.0, 0.0, 1.0);
else if (gl_VertexIndex == 1)
gl_Position = vec4(-1.0, +3.0, 0.0, 1.0);
else
gl_Position = vec4(+3.0, -1.0, 0.0, 1.0);
vUV = vec2(gl_Position.x * 0.5 + 0.5, gl_Position.y * 0.5 + 0.5 + registers.y_offset);
}


@@ -0,0 +1,93 @@
#version 450
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#extension GL_EXT_samplerless_texture_functions : require
#include "vi_debug.h"
layout(location = 0) out uvec4 FragColor;
#if defined(FETCH_BUG) && FETCH_BUG
layout(location = 1) out uvec4 FragColorFetchBug;
#endif
layout(set = 0, binding = 0) uniform mediump utexture2DArray uFetchCache;
void swap(inout uint a, inout uint b)
{
uint tmp = a;
a = b;
b = tmp;
}
// Manual median-of-three via compare/swap; workaround for Metal's native median3.
uint Median3(uint left, uint center, uint right)
{
if (left < center)
swap(left, center);
if (center < right)
swap(center, right);
if (left < center)
swap(left, center);
return center;
}
void main()
{
ivec2 pix = ivec2(gl_FragCoord.xy);
uvec4 left = texelFetch(uFetchCache, ivec3(pix, 0), 0);
uvec4 mid = texelFetchOffset(uFetchCache, ivec3(pix, 0), 0, ivec2(1, 0));
uvec4 right = texelFetchOffset(uFetchCache, ivec3(pix, 0), 0, ivec2(2, 0));
if ((left.a & mid.a & right.a) == 7u)
{
FragColor = mid;
}
else
{
// Median filter. TODO: Optimize with mid3?
uint r = Median3(left.r, mid.r, right.r);
uint g = Median3(left.g, mid.g, right.g);
uint b = Median3(left.b, mid.b, right.b);
FragColor = uvec4(r, g, b, mid.a);
}
#if defined(FETCH_BUG) && FETCH_BUG
left = texelFetch(uFetchCache, ivec3(pix, 1), 0);
mid = texelFetchOffset(uFetchCache, ivec3(pix, 1), 0, ivec2(1, 0));
right = texelFetchOffset(uFetchCache, ivec3(pix, 1), 0, ivec2(2, 0));
if ((left.a & mid.a & right.a) == 7u)
{
FragColorFetchBug = mid;
}
else
{
// Median filter. TODO: Optimize with mid3?
uint r = Median3(left.r, mid.r, right.r);
uint g = Median3(left.g, mid.g, right.g);
uint b = Median3(left.b, mid.b, right.b);
FragColorFetchBug = uvec4(r, g, b, mid.a);
}
#endif
}


@@ -0,0 +1,164 @@
#version 450
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#extension GL_EXT_samplerless_texture_functions : require
#include "small_types.h"
#include "vi_status.h"
#include "vi_debug.h"
layout(set = 0, binding = 0) uniform mediump utexture2D uAAInput;
layout(location = 0) out uvec4 FragColor;
#if defined(FETCH_BUG) && FETCH_BUG
layout(location = 1) out uvec4 FragColorFetchBug;
#endif
layout(push_constant) uniform Registers
{
ivec2 offset;
} registers;
ivec2 pix;
uvec4 fetch_color_offset(ivec2 offset)
{
return texelFetch(uAAInput, pix + offset, 0);
}
void check_neighbor(uvec4 candidate,
inout uvec3 lo, inout uvec3 hi,
inout uvec3 second_lo, inout uvec3 second_hi)
{
if (candidate.a == 7u)
{
second_lo = min(second_lo, max(candidate.rgb, lo));
second_hi = max(second_hi, min(candidate.rgb, hi));
lo = min(candidate.rgb, lo);
hi = max(candidate.rgb, hi);
}
}
void main()
{
pix = ivec2(gl_FragCoord.xy) + registers.offset;
uvec4 mid_pixel = fetch_color_offset(ivec2(0));
// AA-filter. If coverage is not full, we blend the current pixel against the background.
uvec3 color;
#if defined(FETCH_BUG) && FETCH_BUG
uvec3 color_bug;
#endif
if (mid_pixel.a != 7u)
{
uvec3 lo = mid_pixel.rgb;
uvec3 hi = lo;
uvec3 second_lo = lo;
uvec3 second_hi = lo;
// Somehow, we're supposed to find the second lowest and second highest neighbor.
uvec4 left_up = fetch_color_offset(ivec2(-1, -1));
uvec4 right_up = fetch_color_offset(ivec2(+1, -1));
uvec4 to_left = fetch_color_offset(ivec2(-2, 0));
uvec4 to_right = fetch_color_offset(ivec2(+2, 0));
uvec4 left_down = fetch_color_offset(ivec2(-1, +1));
uvec4 right_down = fetch_color_offset(ivec2(+1, +1));
check_neighbor(left_up, lo, hi, second_lo, second_hi);
check_neighbor(right_up, lo, hi, second_lo, second_hi);
check_neighbor(to_left, lo, hi, second_lo, second_hi);
check_neighbor(to_right, lo, hi, second_lo, second_hi);
#if defined(FETCH_BUG) && FETCH_BUG
// In the fetch-bug state, we apparently do not read the lower neighbors.
// Instead, the left and right values are used in their place.
uvec3 lo_bug = lo;
uvec3 hi_bug = hi;
uvec3 second_lo_bug = second_lo;
uvec3 second_hi_bug = second_hi;
#endif
check_neighbor(left_down, lo, hi, second_lo, second_hi);
check_neighbor(right_down, lo, hi, second_lo, second_hi);
#if defined(FETCH_BUG) && FETCH_BUG
check_neighbor(to_left, lo_bug, hi_bug, second_lo_bug, second_hi_bug);
check_neighbor(to_right, lo_bug, hi_bug, second_lo_bug, second_hi_bug);
second_lo = mix(second_lo, lo, equal(mid_pixel.rgb, lo));
second_hi = mix(second_hi, hi, equal(mid_pixel.rgb, hi));
second_lo_bug = mix(second_lo_bug, lo_bug, equal(mid_pixel.rgb, lo_bug));
second_hi_bug = mix(second_hi_bug, hi_bug, equal(mid_pixel.rgb, hi_bug));
#endif
uvec3 offset = second_lo + second_hi - (mid_pixel.rgb << 1u);
uint coeff = 7u - mid_pixel.a;
color = mid_pixel.rgb + (((offset * coeff) + 4u) >> 3u);
color &= 0xffu;
#if defined(FETCH_BUG) && FETCH_BUG
uvec3 offset_bug = second_lo_bug + second_hi_bug - (mid_pixel.rgb << 1u);
color_bug = mid_pixel.rgb + (((offset_bug * coeff) + 4u) >> 3u);
color_bug &= 0xffu;
#endif
}
else if (DITHER_ENABLE)
{
// Dither filter.
ivec3 tmp_color = ivec3(mid_pixel.rgb >> 3u);
ivec3 tmp_accum = ivec3(0);
for (int y = -1; y <= 0; y++)
{
for (int x = -1; x <= 1; x++)
{
ivec3 col = ivec3(fetch_color_offset(ivec2(x, y)).rgb >> 3u);
tmp_accum += clamp(col - tmp_color, ivec3(-1), ivec3(1));
}
}
#if defined(FETCH_BUG) && FETCH_BUG
ivec3 tmp_accum_bug = tmp_accum;
#endif
tmp_accum += clamp(ivec3(fetch_color_offset(ivec2(-1, 1)).rgb >> 3u) - tmp_color, ivec3(-1), ivec3(1));
tmp_accum += clamp(ivec3(fetch_color_offset(ivec2(+1, 1)).rgb >> 3u) - tmp_color, ivec3(-1), ivec3(1));
tmp_accum += clamp(ivec3(fetch_color_offset(ivec2(0, 1)).rgb >> 3u) - tmp_color, ivec3(-1), ivec3(1));
color = (mid_pixel.rgb & 0xf8u) + tmp_accum;
#if defined(FETCH_BUG) && FETCH_BUG
tmp_accum_bug += clamp(ivec3(fetch_color_offset(ivec2(-1, 0)).rgb >> 3u) - tmp_color, ivec3(-1), ivec3(1));
tmp_accum_bug += clamp(ivec3(fetch_color_offset(ivec2(+1, 0)).rgb >> 3u) - tmp_color, ivec3(-1), ivec3(1));
color_bug = (mid_pixel.rgb & 0xf8u) + tmp_accum_bug;
#endif
}
else
{
color = mid_pixel.rgb;
#if defined(FETCH_BUG) && FETCH_BUG
color_bug = mid_pixel.rgb;
#endif
}
FragColor = uvec4(color, mid_pixel.a);
#if defined(FETCH_BUG) && FETCH_BUG
FragColorFetchBug = uvec4(color_bug, mid_pixel.a);
#endif
}


@@ -0,0 +1,154 @@
#version 450
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#extension GL_EXT_samplerless_texture_functions : require
#include "small_types.h"
#include "vi_status.h"
#include "vi_debug.h"
#include "noise.h"
layout(set = 0, binding = 0) uniform mediump utexture2DArray uDivotOutput;
layout(set = 0, binding = 1) uniform itextureBuffer uHorizontalInfo;
layout(set = 1, binding = 0) uniform mediump utextureBuffer uGammaTable;
layout(location = 0) out vec4 FragColor;
layout(push_constant, std430) uniform Registers
{
ivec2 frag_coord_offset;
int v_start;
int y_add;
int frame_count;
int serrate_shift;
int serrate_mask;
int serrate_select;
int info_y_shift;
} registers;
uvec3 vi_lerp(uvec3 a, uvec3 b, uint l)
{
return (a + (((b - a) * l + 16u) >> 5u)) & 0xffu;
}
uvec3 integer_gamma(uvec3 color)
{
uvec3 res;
if (GAMMA_DITHER)
{
color = (color << 6) + noise_get_full_gamma_dither() + 256u;
res = uvec3(
texelFetch(uGammaTable, int(color.r)).r,
texelFetch(uGammaTable, int(color.g)).r,
texelFetch(uGammaTable, int(color.b)).r);
}
else
{
res = uvec3(
texelFetch(uGammaTable, int(color.r)).r,
texelFetch(uGammaTable, int(color.g)).r,
texelFetch(uGammaTable, int(color.b)).r);
}
return res;
}
layout(constant_id = 2) const bool FETCH_BUG = false;
void main()
{
// Handles crop where we start scanning out at an offset.
ivec2 coord = ivec2(gl_FragCoord.xy) + registers.frag_coord_offset;
int info_index = coord.y >> registers.info_y_shift;
ivec4 horiz_info0 = texelFetch(uHorizontalInfo, 2 * info_index + 0);
ivec4 horiz_info1 = texelFetch(uHorizontalInfo, 2 * info_index + 1);
int h_start = horiz_info0.x;
int h_start_clamp = horiz_info0.y;
int h_end_clamp = horiz_info0.z;
int x_start = horiz_info0.w;
int x_add = horiz_info1.x;
int y_start = horiz_info1.y;
int y_add = horiz_info1.z;
int y_base = horiz_info1.w;
// Rebase Y relative to YStart.
coord.y -= registers.v_start;
// Scissor against HStart/End, also handles serrate where we skip every other line.
if (coord.x < h_start_clamp || coord.x >= h_end_clamp ||
((coord.y & registers.serrate_mask) != registers.serrate_select))
discard;
// Shift the X coord to be relative to sampling, this can change per scanline.
coord.x -= h_start;
// Rebase Y in terms of progressive scan.
coord.y >>= registers.serrate_shift;
if (GAMMA_DITHER)
reseed_noise(coord.x, coord.y, registers.frame_count);
int x = coord.x * x_add + x_start;
int y = (coord.y - y_base) * y_add + y_start;
ivec2 base_coord = ivec2(x, y) >> 10;
uvec3 c00 = texelFetch(uDivotOutput, ivec3(base_coord, 0), 0).rgb;
int bug_offset = 0;
if (FETCH_BUG)
{
// This is super awkward.
// Basically there seems to be some kind of issue where if we interpolate in Y,
// we're going to get buggy output.
// If we hit this case, the next line we filter against will come from the "buggy" array slice.
// Why this makes sense, I have no idea.
//
// XXX: This assumes constant YAdd.
// No idea how this is supposed to work if YAdd can vary per scanline.
int prev_y = (y - registers.y_add) >> 10;
int next_y = (y + registers.y_add) >> 10;
if (coord.y != 0 && base_coord.y == prev_y && base_coord.y != next_y)
bug_offset = 1;
}
if (SCALE_AA)
{
int x_frac = (x >> 5) & 31;
int y_frac = (y >> 5) & 31;
uvec3 c10 = texelFetchOffset(uDivotOutput, ivec3(base_coord, 0), 0, ivec2(1, 0)).rgb;
uvec3 c01 = texelFetchOffset(uDivotOutput, ivec3(base_coord, bug_offset), 0, ivec2(0, 1)).rgb;
uvec3 c11 = texelFetchOffset(uDivotOutput, ivec3(base_coord, bug_offset), 0, ivec2(1)).rgb;
c00 = vi_lerp(c00, c01, y_frac);
c10 = vi_lerp(c10, c11, y_frac);
c00 = vi_lerp(c00, c10, x_frac);
}
if (GAMMA_ENABLE)
c00 = integer_gamma(c00);
else if (GAMMA_DITHER)
c00 = min(c00 + noise_get_partial_gamma_dither(), uvec3(0xff));
FragColor = vec4(vec3(c00) / 255.0, 1.0);
}


@@ -0,0 +1,48 @@
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#ifndef VI_STATUS_H_
#define VI_STATUS_H_
layout(constant_id = 1) const int VI_STATUS = 0;
const int VI_CONTROL_TYPE_BLANK_BIT = 0 << 0;
const int VI_CONTROL_TYPE_RESERVED_BIT = 1 << 0;
const int VI_CONTROL_TYPE_RGBA5551_BIT = 2 << 0;
const int VI_CONTROL_TYPE_RGBA8888_BIT = 3 << 0;
const int VI_CONTROL_TYPE_MASK = 3 << 0;
const int VI_CONTROL_GAMMA_DITHER_ENABLE_BIT = 1 << 2;
const int VI_CONTROL_GAMMA_ENABLE_BIT = 1 << 3;
const int VI_CONTROL_DIVOT_ENABLE_BIT = 1 << 4;
const int VI_CONTROL_SERRATE_BIT = 1 << 6;
const int VI_CONTROL_DITHER_FILTER_ENABLE_BIT = 1 << 16;
const int VI_CONTROL_META_AA_BIT = 1 << 17;
const int VI_CONTROL_META_SCALE_BIT = 1 << 18;
const bool FMT_RGBA5551 = (VI_STATUS & VI_CONTROL_TYPE_MASK) == VI_CONTROL_TYPE_RGBA5551_BIT;
const bool FMT_RGBA8888 = (VI_STATUS & VI_CONTROL_TYPE_MASK) == VI_CONTROL_TYPE_RGBA8888_BIT;
const bool DITHER_ENABLE = (VI_STATUS & VI_CONTROL_DITHER_FILTER_ENABLE_BIT) != 0;
const bool FETCH_AA = (VI_STATUS & VI_CONTROL_META_AA_BIT) != 0;
const bool SCALE_AA = (VI_STATUS & VI_CONTROL_META_SCALE_BIT) != 0;
const bool GAMMA_ENABLE = (VI_STATUS & VI_CONTROL_GAMMA_ENABLE_BIT) != 0;
const bool GAMMA_DITHER = (VI_STATUS & VI_CONTROL_GAMMA_DITHER_ENABLE_BIT) != 0;
#endif


@@ -0,0 +1,58 @@
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#ifndef Z_ENCODE_H_
#define Z_ENCODE_H_
// The Z compression is kind of clever: it uses an inverted floating-point encoding, with more precision close to 1.
// The compressed Z result is 14 bits, and decompresses to 18-bit UNORM.
int z_decompress(u16 z_)
{
int z = int(z_);
int exponent = z >> 11;
int mantissa = z & 0x7ff;
int shift = max(6 - exponent, 0);
int base = 0x40000 - (0x40000 >> exponent);
return (mantissa << shift) + base;
}
u16 z_compress(int z)
{
int inv_z = max(0x3ffff - z, 1);
int exponent = 17 - findMSB(inv_z);
exponent = clamp(exponent, 0, 7);
int shift = max(6 - exponent, 0);
int mantissa = (z >> shift) & 0x7ff;
return u16((exponent << 11) + mantissa);
}
int dz_decompress(int dz)
{
return 1 << dz;
}
int dz_compress(int dz)
{
return max(findMSB(dz), 0);
}
#endif

File diff suppressed because it is too large


@@ -0,0 +1,240 @@
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#pragma once
#include <stdint.h>
#include "device.hpp"
#include "rdp_common.hpp"
namespace RDP
{
struct ScanoutOptions
{
// Simple (obsolete) crop method. If crop_rect.enable is false, this
// crops top / bottom by this many pixels (doubled if interlaced),
// and crops left / right in an aspect-preserving way.
// If crop_rect.enable is true, this field is ignored and the
// crop_rect struct is used instead.
// Crop pixels are adjusted for upscaling; they are assumed to
// be specified for the original resolution.
unsigned crop_overscan_pixels = 0;
struct CropRect
{
unsigned left = 0;
unsigned right = 0;
unsigned top = 0; // Doubled if interlace
unsigned bottom = 0; // Doubled if interlace
bool enable = false;
} crop_rect;
unsigned downscale_steps = 0;
// Works around certain game bugs. Considered a hack if enabled.
bool persist_frame_on_invalid_input = false;
// Blend the previous frame to be equivalent to reference behavior,
// where pixels persist for an extra frame.
// Not hardware accurate, but needed for weave interlace mode.
bool blend_previous_frame = false;
// Upscale deinterlacing deinterlaces by upscaling in Y, with a Y coordinate offset matching the field.
// If disabled, weave interlacing is used.
// Weave deinterlacing should *not* be used, except to run the test suite!
bool upscale_deinterlacing = true;
struct
{
bool aa = true;
bool scale = true;
bool serrate = true;
bool dither_filter = true;
bool divot_filter = true;
bool gamma_dither = true;
} vi;
// External memory support.
// If true, the scanout image will be created with external memory support.
// persist_frame_on_invalid_input must be false when using exports.
VkExternalMemoryHandleTypeFlagBits export_handle_type = {};
bool export_scanout = false;
};
struct VIScanoutBuffer
{
Vulkan::BufferHandle buffer;
Vulkan::Fence fence;
unsigned width = 0;
unsigned height = 0;
};
class Renderer;
class VideoInterface : public Vulkan::DebugChannelInterface
{
public:
void set_device(Vulkan::Device *device);
void set_renderer(Renderer *renderer);
void set_vi_register(VIRegister reg, uint32_t value);
void set_rdram(const Vulkan::Buffer *rdram, size_t offset, size_t size);
void set_hidden_rdram(const Vulkan::Buffer *hidden_rdram);
int resolve_shader_define(const char *name, const char *define) const;
Vulkan::ImageHandle scanout(VkImageLayout target_layout, const ScanoutOptions &options = {}, unsigned scale_factor = 1);
void scanout_memory_range(unsigned &offset, unsigned &length) const;
void set_shader_bank(const ShaderBank *bank);
enum PerScanlineRegisterBits
{
// Currently supported bits.
PER_SCANLINE_HSTART_BIT = 1 << 0,
PER_SCANLINE_XSCALE_BIT = 1 << 1
};
using PerScanlineRegisterFlags = uint32_t;
void begin_vi_register_per_scanline(PerScanlineRegisterFlags flags);
void set_vi_register_for_scanline(PerScanlineRegisterBits reg, uint32_t value);
void latch_vi_register_for_scanline(unsigned vi_line);
void end_vi_register_per_scanline();
private:
Vulkan::Device *device = nullptr;
Renderer *renderer = nullptr;
uint32_t vi_registers[unsigned(VIRegister::Count)] = {};
struct PerScanlineRegisterState
{
uint32_t latched_state;
uint32_t line_state[VI_V_END_MAX];
};
struct
{
PerScanlineRegisterState h_start;
PerScanlineRegisterState x_scale;
PerScanlineRegisterFlags flags = 0;
unsigned line = 0;
bool ended = false;
} per_line_state;
const Vulkan::Buffer *rdram = nullptr;
const Vulkan::Buffer *hidden_rdram = nullptr;
Vulkan::BufferHandle gamma_lut;
Vulkan::BufferViewHandle gamma_lut_view;
const ShaderBank *shader_bank = nullptr;
void init_gamma_table();
bool previous_frame_blank = false;
bool debug_channel = false;
int filter_debug_channel_x = -1;
int filter_debug_channel_y = -1;
void message(const std::string &tag, uint32_t code,
uint32_t x, uint32_t y, uint32_t z,
uint32_t num_words, const Vulkan::DebugChannelInterface::Word *words) override;
// Frame state.
uint32_t frame_count = 0;
uint32_t last_valid_frame_count = 0;
Vulkan::ImageHandle prev_scanout_image;
VkImageLayout prev_image_layout = VK_IMAGE_LAYOUT_UNDEFINED;
bool prev_image_is_external = false;
size_t rdram_offset = 0;
size_t rdram_size = 0;
bool timestamp = false;
struct HorizontalInfo
{
int32_t h_start;
int32_t h_start_clamp;
int32_t h_end_clamp;
int32_t x_start;
int32_t x_add;
int32_t y_start;
int32_t y_add;
int32_t y_base;
};
struct HorizontalInfoLines
{
HorizontalInfo lines[VI_MAX_OUTPUT_SCANLINES];
};
static void bind_horizontal_info_view(Vulkan::CommandBuffer &cmd, const HorizontalInfoLines &lines);
struct Registers
{
int vi_width;
int vi_offset;
int v_current_line;
bool is_pal;
uint32_t status;
int init_y_add;
// Global scale pass scissor box.
int h_start_clamp, h_res_clamp;
int h_start, h_res;
int v_start, v_res;
// For AA stages.
int max_x, max_y;
};
Registers decode_vi_registers(HorizontalInfoLines *lines) const;
void clear_per_scanline_state();
Vulkan::ImageHandle vram_fetch_stage(const Registers &registers,
unsigned scaling_factor) const;
Vulkan::ImageHandle aa_fetch_stage(Vulkan::CommandBuffer &cmd,
Vulkan::Image &vram_image,
const Registers &registers,
unsigned scaling_factor) const;
Vulkan::ImageHandle divot_stage(Vulkan::CommandBuffer &cmd,
Vulkan::Image &aa_image,
const Registers &registers,
unsigned scaling_factor) const;
Vulkan::ImageHandle scale_stage(Vulkan::CommandBuffer &cmd,
const Vulkan::Image *divot_image,
Registers registers,
const HorizontalInfoLines &lines,
unsigned scaling_factor,
bool degenerate,
const ScanoutOptions &options,
bool final_pass) const;
Vulkan::ImageHandle downscale_stage(Vulkan::CommandBuffer &cmd,
Vulkan::Image &scale_image,
unsigned scaling_factor,
unsigned downscale_factor,
const ScanoutOptions &options,
bool final_pass) const;
Vulkan::ImageHandle upscale_deinterlace(Vulkan::CommandBuffer &cmd,
Vulkan::Image &scale_image,
unsigned scaling_factor, bool field_select,
const ScanoutOptions &options) const;
static bool need_fetch_bug_emulation(const Registers &reg, unsigned scaling_factor);
};
}


@@ -0,0 +1,127 @@
/* Copyright (c) 2020 Themaister
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#pragma once
#include <queue>
#include <mutex>
#include <thread>
#include <condition_variable>
#include <utility>
#include "thread_id.hpp"
#ifdef PARALLEL_RDP_SHADER_DIR
#include "global_managers.hpp"
#endif
namespace RDP
{
template <typename T, typename Executor>
class WorkerThread
{
public:
explicit WorkerThread(
#ifdef PARALLEL_RDP_SHADER_DIR
Granite::Global::GlobalManagersHandle globals,
#endif
Executor exec)
: executor(std::move(exec))
#ifdef PARALLEL_RDP_SHADER_DIR
, handles(std::move(globals))
#endif
{
thr = std::thread(&WorkerThread::main_loop, this);
}
~WorkerThread()
{
if (thr.joinable())
{
{
std::lock_guard<std::mutex> holder{to_thread_mutex};
work_queue.push({});
to_thread_cond.notify_one();
}
thr.join();
}
}
template <typename Cond>
void wait(Cond &&cond)
{
std::unique_lock<std::mutex> holder{to_main_mutex};
to_main_cond.wait(holder, std::forward<Cond>(cond));
}
void push(T &&t)
{
std::lock_guard<std::mutex> holder{to_thread_mutex};
work_queue.push(std::move(t));
to_thread_cond.notify_one();
}
private:
std::thread thr;
std::mutex to_thread_mutex;
std::condition_variable to_thread_cond;
std::mutex to_main_mutex;
std::condition_variable to_main_cond;
std::queue<T> work_queue;
Executor executor;
#ifdef PARALLEL_RDP_SHADER_DIR
Granite::Global::GlobalManagersHandle handles;
#endif
void main_loop()
{
#ifdef PARALLEL_RDP_SHADER_DIR
Granite::Global::set_thread_context(*handles);
handles.reset();
#endif
// Avoid benign errors in logging.
// This thread never actually needs the thread ID.
Util::register_thread_index(0);
for (;;)
{
T value;
{
std::unique_lock<std::mutex> holder{to_thread_mutex};
to_thread_cond.wait(holder, [this]() { return !work_queue.empty(); });
value = std::move(work_queue.front());
work_queue.pop();
}
if (executor.is_sentinel(value))
break;
executor.perform_work(value);
std::lock_guard<std::mutex> holder{to_main_mutex};
executor.notify_work_locked(value);
to_main_cond.notify_one();
}
}
};
}

83
util/aligned_alloc.cpp Normal file

@@ -0,0 +1,83 @@
/* Copyright (c) 2017-2023 Hans-Kristian Arntzen
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#include "aligned_alloc.hpp"
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#ifdef _WIN32
#include <malloc.h>
#endif
namespace Util
{
void *memalign_alloc(size_t boundary, size_t size)
{
#if defined(_WIN32)
return _aligned_malloc(size, boundary);
#elif defined(_ISOC11_SOURCE)
return aligned_alloc(boundary, (size + boundary - 1) & ~(boundary - 1));
#elif (_POSIX_C_SOURCE >= 200112L) || (_XOPEN_SOURCE >= 600)
void *ptr = nullptr;
if (posix_memalign(&ptr, boundary, size) != 0)
return nullptr;
return ptr;
#else
// Align stuff ourselves. Kinda ugly, but will work anywhere.
void **place;
uintptr_t addr = 0;
void *ptr = malloc(boundary + size + sizeof(uintptr_t));
if (ptr == nullptr)
return nullptr;
addr = ((uintptr_t)ptr + sizeof(uintptr_t) + boundary) & ~(boundary - 1);
place = (void **) addr;
place[-1] = ptr;
return (void *) addr;
#endif
}
void *memalign_calloc(size_t boundary, size_t size)
{
void *ret = memalign_alloc(boundary, size);
if (ret)
memset(ret, 0, size);
return ret;
}
void memalign_free(void *ptr)
{
#if defined(_WIN32)
_aligned_free(ptr);
#elif !defined(_ISOC11_SOURCE) && !((_POSIX_C_SOURCE >= 200112L) || (_XOPEN_SOURCE >= 600))
if (ptr != nullptr)
{
void **p = (void **) ptr;
free(p[-1]);
}
#else
free(ptr);
#endif
}
}

68
util/aligned_alloc.hpp Normal file

@@ -0,0 +1,68 @@
/* Copyright (c) 2017-2023 Hans-Kristian Arntzen
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#pragma once
#include <stddef.h>
#include <stdexcept>
#include <new>
namespace Util
{
void *memalign_alloc(size_t boundary, size_t size);
void *memalign_calloc(size_t boundary, size_t size);
void memalign_free(void *ptr);
struct AlignedDeleter { void operator()(void *ptr) { memalign_free(ptr); }};
template <typename T>
struct AlignedAllocation
{
static void *operator new(size_t size)
{
void *ret = ::Util::memalign_alloc(alignof(T), size);
#ifdef __EXCEPTIONS
if (!ret) throw std::bad_alloc();
#endif
return ret;
}
static void *operator new[](size_t size)
{
void *ret = ::Util::memalign_alloc(alignof(T), size);
#ifdef __EXCEPTIONS
if (!ret) throw std::bad_alloc();
#endif
return ret;
}
static void operator delete(void *ptr)
{
return ::Util::memalign_free(ptr);
}
static void operator delete[](void *ptr)
{
return ::Util::memalign_free(ptr);
}
};
}
util/arena_allocator.cpp Normal file
@@ -0,0 +1,197 @@
/* Copyright (c) 2017-2023 Hans-Kristian Arntzen
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#include "arena_allocator.hpp"
#include "bitops.hpp"
#include <assert.h>
namespace Util
{
void LegionAllocator::allocate(uint32_t num_blocks, uint32_t &out_mask, uint32_t &out_offset)
{
assert(NumSubBlocks >= num_blocks);
assert(num_blocks != 0);
uint32_t block_mask;
if (num_blocks == NumSubBlocks)
block_mask = ~0u;
else
block_mask = ((1u << num_blocks) - 1u);
uint32_t mask = free_blocks[num_blocks - 1];
uint32_t b = trailing_zeroes(mask);
assert(((free_blocks[0] >> b) & block_mask) == block_mask);
uint32_t sb = block_mask << b;
free_blocks[0] &= ~sb;
update_longest_run();
out_mask = sb;
out_offset = b;
}
void LegionAllocator::free(uint32_t mask)
{
assert((free_blocks[0] & mask) == 0);
free_blocks[0] |= mask;
update_longest_run();
}
void LegionAllocator::update_longest_run()
{
uint32_t f = free_blocks[0];
longest_run = 0;
while (f)
{
free_blocks[longest_run++] = f;
f &= f >> 1;
}
}
bool SliceSubAllocator::allocate_backing_heap(AllocatedSlice *allocation)
{
uint32_t count = sub_block_size * Util::LegionAllocator::NumSubBlocks;
if (parent)
{
return parent->allocate(count, allocation);
}
else if (global_allocator)
{
uint32_t index = global_allocator->allocate(count);
if (index == UINT32_MAX)
return false;
*allocation = {};
allocation->count = count;
allocation->buffer_index = index;
return true;
}
else
{
return false;
}
}
void SliceSubAllocator::free_backing_heap(AllocatedSlice *allocation) const
{
if (parent)
parent->free(allocation->heap, allocation->mask);
else if (global_allocator)
global_allocator->free(allocation->buffer_index);
}
void SliceSubAllocator::prepare_allocation(AllocatedSlice *allocation, Util::IntrusiveList<MiniHeap>::Iterator heap,
const Util::SuballocationResult &suballoc)
{
allocation->buffer_index = heap->allocation.buffer_index;
allocation->offset = heap->allocation.offset + suballoc.offset;
allocation->count = suballoc.size;
allocation->mask = suballoc.mask;
allocation->heap = heap;
allocation->alloc = this;
}
void SliceAllocator::init(uint32_t sub_block_size, uint32_t num_sub_blocks_in_arena_log2,
Util::SliceBackingAllocator *alloc)
{
global_allocator = alloc;
assert(num_sub_blocks_in_arena_log2 < SliceAllocatorCount * 5 && num_sub_blocks_in_arena_log2 >= 5);
unsigned num_hierarchies = (num_sub_blocks_in_arena_log2 + 4) / 5;
assert(num_hierarchies <= SliceAllocatorCount);
for (unsigned i = 0; i < num_hierarchies - 1; i++)
allocators[i].parent = &allocators[i + 1];
allocators[num_hierarchies - 1].global_allocator = alloc;
unsigned shamt[SliceAllocatorCount] = {};
shamt[num_hierarchies - 1] = num_sub_blocks_in_arena_log2 - Util::floor_log2(Util::LegionAllocator::NumSubBlocks);
// Spread out the multiplier if possible.
for (unsigned i = num_hierarchies - 1; i > 1; i--)
{
shamt[i - 1] = shamt[i] - shamt[i] / (i);
assert(shamt[i] - shamt[i - 1] <= Util::floor_log2(Util::LegionAllocator::NumSubBlocks));
}
for (unsigned i = 0; i < num_hierarchies; i++)
{
allocators[i].set_sub_block_size(sub_block_size << shamt[i]);
allocators[i].set_object_pool(&object_pool);
}
}
void SliceAllocator::free(const Util::AllocatedSlice &slice)
{
if (slice.alloc)
slice.alloc->free(slice.heap, slice.mask);
else if (slice.buffer_index != UINT32_MAX)
global_allocator->free(slice.buffer_index);
}
void SliceAllocator::prime(const void *opaque_meta)
{
for (auto &alloc : allocators)
{
if (alloc.global_allocator)
{
alloc.global_allocator->prime(alloc.get_sub_block_size() * Util::LegionAllocator::NumSubBlocks, opaque_meta);
break;
}
}
}
bool SliceAllocator::allocate(uint32_t count, Util::AllocatedSlice *slice)
{
for (auto &alloc : allocators)
{
uint32_t max_alloc_size = alloc.get_max_allocation_size();
if (count <= max_alloc_size)
return alloc.allocate(count, slice);
}
LOGE("Allocation of %u elements is too large for SliceAllocator.\n", count);
return false;
}
void SliceBackingAllocatorVA::free(uint32_t)
{
allocated = false;
}
uint32_t SliceBackingAllocatorVA::allocate(uint32_t)
{
if (allocated)
return UINT32_MAX;
else
{
allocated = true;
return 0;
}
}
void SliceBackingAllocatorVA::prime(uint32_t, const void *)
{
}
}
util/arena_allocator.hpp Normal file
@@ -0,0 +1,336 @@
/* Copyright (c) 2017-2023 Hans-Kristian Arntzen
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#pragma once
#include <stdint.h>
#include <assert.h>
#include "intrusive_list.hpp"
#include "logging.hpp"
#include "object_pool.hpp"
#include "bitops.hpp"
namespace Util
{
// Expands the buddy allocator to consider 32 "buddies".
// The allocator is logical and works in terms of units, not bytes.
class LegionAllocator
{
public:
enum
{
NumSubBlocks = 32u,
AllFree = ~0u
};
LegionAllocator(const LegionAllocator &) = delete;
void operator=(const LegionAllocator &) = delete;
LegionAllocator()
{
for (auto &v : free_blocks)
v = AllFree;
longest_run = 32;
}
~LegionAllocator()
{
if (free_blocks[0] != AllFree)
LOGE("Memory leak in block detected.\n");
}
inline bool full() const
{
return free_blocks[0] == 0;
}
inline bool empty() const
{
return free_blocks[0] == AllFree;
}
inline uint32_t get_longest_run() const
{
return longest_run;
}
void allocate(uint32_t num_blocks, uint32_t &mask, uint32_t &offset);
void free(uint32_t mask);
private:
uint32_t free_blocks[NumSubBlocks];
uint32_t longest_run = 0;
void update_longest_run();
};
// Represents that a legion heap is backed by some kind of allocation.
template <typename BackingAllocation>
struct LegionHeap : Util::IntrusiveListEnabled<LegionHeap<BackingAllocation>>
{
BackingAllocation allocation;
Util::LegionAllocator heap;
};
template <typename BackingAllocation>
struct AllocationArena
{
Util::IntrusiveList<LegionHeap<BackingAllocation>> heaps[Util::LegionAllocator::NumSubBlocks];
Util::IntrusiveList<LegionHeap<BackingAllocation>> full_heaps;
uint32_t heap_availability_mask = 0;
};
struct SuballocationResult
{
uint32_t offset;
uint32_t size;
uint32_t mask;
};
template <typename DerivedAllocator, typename BackingAllocation>
class ArenaAllocator
{
public:
using MiniHeap = LegionHeap<BackingAllocation>;
~ArenaAllocator()
{
bool error = false;
if (heap_arena.full_heaps.begin())
error = true;
for (auto &h : heap_arena.heaps)
if (h.begin())
error = true;
if (error)
LOGE("Memory leaked in class allocator!\n");
}
inline void set_sub_block_size(uint32_t size)
{
assert(Util::is_pow2(size));
sub_block_size_log2 = Util::floor_log2(size);
sub_block_size = size;
}
inline uint32_t get_max_allocation_size() const
{
return sub_block_size * Util::LegionAllocator::NumSubBlocks;
}
inline uint32_t get_sub_block_size() const
{
return sub_block_size;
}
inline uint32_t get_block_alignment() const
{
return get_sub_block_size();
}
inline bool allocate(uint32_t size, BackingAllocation *alloc)
{
unsigned num_blocks = (size + sub_block_size - 1) >> sub_block_size_log2;
uint32_t size_mask = (1u << (num_blocks - 1)) - 1;
uint32_t index = trailing_zeroes(heap_arena.heap_availability_mask & ~size_mask);
if (index < LegionAllocator::NumSubBlocks)
{
auto itr = heap_arena.heaps[index].begin();
assert(itr);
assert(index >= (num_blocks - 1));
auto &heap = *itr;
static_cast<DerivedAllocator *>(this)->prepare_allocation(alloc, itr, suballocate(num_blocks, heap));
unsigned new_index = heap.heap.get_longest_run() - 1;
if (heap.heap.full())
{
heap_arena.full_heaps.move_to_front(heap_arena.heaps[index], itr);
if (!heap_arena.heaps[index].begin())
heap_arena.heap_availability_mask &= ~(1u << index);
}
else if (new_index != index)
{
auto &new_heap = heap_arena.heaps[new_index];
new_heap.move_to_front(heap_arena.heaps[index], itr);
heap_arena.heap_availability_mask |= 1u << new_index;
if (!heap_arena.heaps[index].begin())
heap_arena.heap_availability_mask &= ~(1u << index);
}
return true;
}
// We didn't find a vacant heap, make a new one.
auto *node = object_pool->allocate();
if (!node)
return false;
auto &heap = *node;
if (!static_cast<DerivedAllocator *>(this)->allocate_backing_heap(&heap.allocation))
{
object_pool->free(node);
return false;
}
// This cannot fail.
static_cast<DerivedAllocator *>(this)->prepare_allocation(alloc, node, suballocate(num_blocks, heap));
if (heap.heap.full())
{
heap_arena.full_heaps.insert_front(node);
}
else
{
unsigned new_index = heap.heap.get_longest_run() - 1;
heap_arena.heaps[new_index].insert_front(node);
heap_arena.heap_availability_mask |= 1u << new_index;
}
return true;
}
inline void free(typename IntrusiveList<MiniHeap>::Iterator itr, uint32_t mask)
{
auto *heap = itr.get();
auto &block = heap->heap;
bool was_full = block.full();
unsigned index = block.get_longest_run() - 1;
block.free(mask);
unsigned new_index = block.get_longest_run() - 1;
if (block.empty())
{
static_cast<DerivedAllocator *>(this)->free_backing_heap(&heap->allocation);
if (was_full)
heap_arena.full_heaps.erase(heap);
else
{
heap_arena.heaps[index].erase(heap);
if (!heap_arena.heaps[index].begin())
heap_arena.heap_availability_mask &= ~(1u << index);
}
object_pool->free(heap);
}
else if (was_full)
{
heap_arena.heaps[new_index].move_to_front(heap_arena.full_heaps, heap);
heap_arena.heap_availability_mask |= 1u << new_index;
}
else if (index != new_index)
{
heap_arena.heaps[new_index].move_to_front(heap_arena.heaps[index], heap);
heap_arena.heap_availability_mask |= 1u << new_index;
if (!heap_arena.heaps[index].begin())
heap_arena.heap_availability_mask &= ~(1u << index);
}
}
inline void set_object_pool(ObjectPool<MiniHeap> *object_pool_)
{
object_pool = object_pool_;
}
protected:
AllocationArena<BackingAllocation> heap_arena;
ObjectPool<LegionHeap<BackingAllocation>> *object_pool = nullptr;
uint32_t sub_block_size = 1;
uint32_t sub_block_size_log2 = 0;
private:
inline SuballocationResult suballocate(uint32_t num_blocks, MiniHeap &heap)
{
SuballocationResult res = {};
res.size = num_blocks << sub_block_size_log2;
heap.heap.allocate(num_blocks, res.mask, res.offset);
res.offset <<= sub_block_size_log2;
return res;
}
};
struct SliceSubAllocator;
struct AllocatedSlice
{
uint32_t buffer_index = UINT32_MAX;
uint32_t offset = 0;
uint32_t count = 0;
uint32_t mask = 0;
SliceSubAllocator *alloc = nullptr;
Util::IntrusiveList<Util::LegionHeap<AllocatedSlice>>::Iterator heap = {};
};
struct SliceBackingAllocator
{
virtual ~SliceBackingAllocator() = default;
virtual uint32_t allocate(uint32_t count) = 0;
virtual void free(uint32_t index) = 0;
virtual void prime(uint32_t count, const void *opaque_meta) = 0;
};
struct SliceBackingAllocatorVA : SliceBackingAllocator
{
uint32_t allocate(uint32_t count) override;
void free(uint32_t index) override;
void prime(uint32_t count, const void *opaque_meta) override;
bool allocated = false;
};
struct SliceSubAllocator : Util::ArenaAllocator<SliceSubAllocator, AllocatedSlice>
{
SliceSubAllocator *parent = nullptr;
SliceBackingAllocator *global_allocator = nullptr;
// Implements the curiously recurring template pattern (CRTP) callbacks.
bool allocate_backing_heap(AllocatedSlice *allocation);
void free_backing_heap(AllocatedSlice *allocation) const;
void prepare_allocation(AllocatedSlice *allocation, Util::IntrusiveList<MiniHeap>::Iterator heap,
const Util::SuballocationResult &suballoc);
};
class SliceAllocator
{
public:
bool allocate(uint32_t count, Util::AllocatedSlice *slice);
void free(const Util::AllocatedSlice &slice);
void prime(const void *opaque_meta);
protected:
SliceAllocator() = default;
void init(uint32_t sub_block_size, uint32_t num_sub_blocks_in_arena_log2, SliceBackingAllocator *alloc);
private:
Util::ObjectPool<Util::LegionHeap<Util::AllocatedSlice>> object_pool;
SliceBackingAllocator *global_allocator = nullptr;
enum { SliceAllocatorCount = 5 };
Util::SliceSubAllocator allocators[SliceAllocatorCount];
};
}
util/bitops.hpp Normal file
@@ -0,0 +1,173 @@
/* Copyright (c) 2017-2023 Hans-Kristian Arntzen
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#pragma once
#ifdef _MSC_VER
#include <intrin.h>
#endif
namespace Util
{
#ifdef __GNUC__
#define leading_zeroes_(x) ((x) == 0 ? 32 : __builtin_clz(x))
#define trailing_zeroes_(x) ((x) == 0 ? 32 : __builtin_ctz(x))
#define trailing_ones_(x) __builtin_ctz(~uint32_t(x))
#define leading_zeroes64_(x) ((x) == 0 ? 64 : __builtin_clzll(x))
#define trailing_zeroes64_(x) ((x) == 0 ? 64 : __builtin_ctzll(x))
#define trailing_ones64_(x) __builtin_ctzll(~uint64_t(x))
#define popcount32_(x) __builtin_popcount(x)
static inline uint32_t leading_zeroes(uint32_t x) { return leading_zeroes_(x); }
static inline uint32_t trailing_zeroes(uint32_t x) { return trailing_zeroes_(x); }
static inline uint32_t trailing_ones(uint32_t x) { return trailing_ones_(x); }
static inline uint32_t leading_zeroes64(uint64_t x) { return leading_zeroes64_(x); }
static inline uint32_t trailing_zeroes64(uint64_t x) { return trailing_zeroes64_(x); }
static inline uint32_t trailing_ones64(uint64_t x) { return trailing_ones64_(x); }
static inline uint32_t popcount32(uint32_t x) { return popcount32_(x); }
#elif defined(_MSC_VER)
namespace Internal
{
static inline uint32_t popcount32(uint32_t x)
{
return __popcnt(x);
}
static inline uint32_t clz(uint32_t x)
{
unsigned long result;
if (_BitScanReverse(&result, x))
return 31 - result;
else
return 32;
}
static inline uint32_t ctz(uint32_t x)
{
unsigned long result;
if (_BitScanForward(&result, x))
return result;
else
return 32;
}
static inline uint32_t clz64(uint64_t x)
{
unsigned long result;
if (_BitScanReverse64(&result, x))
return 63 - result;
else
return 64;
}
static inline uint32_t ctz64(uint64_t x)
{
unsigned long result;
if (_BitScanForward64(&result, x))
return result;
else
return 64;
}
}
static inline uint32_t leading_zeroes(uint32_t x) { return Internal::clz(x); }
static inline uint32_t trailing_zeroes(uint32_t x) { return Internal::ctz(x); }
static inline uint32_t trailing_ones(uint32_t x) { return Internal::ctz(~x); }
static inline uint32_t leading_zeroes64(uint64_t x) { return Internal::clz64(x); }
static inline uint32_t trailing_zeroes64(uint64_t x) { return Internal::ctz64(x); }
static inline uint32_t trailing_ones64(uint64_t x) { return Internal::ctz64(~x); }
static inline uint32_t popcount32(uint32_t x) { return Internal::popcount32(x); }
#else
#error "Implement me."
#endif
template <typename T>
inline void for_each_bit64(uint64_t value, const T &func)
{
while (value)
{
uint32_t bit = trailing_zeroes64(value);
func(bit);
value &= ~(1ull << bit);
}
}
template <typename T>
inline void for_each_bit(uint32_t value, const T &func)
{
while (value)
{
uint32_t bit = trailing_zeroes(value);
func(bit);
value &= ~(1u << bit);
}
}
template <typename T>
inline void for_each_bit_range(uint32_t value, const T &func)
{
if (value == ~0u)
{
func(0, 32);
return;
}
uint32_t bit_offset = 0;
while (value)
{
uint32_t bit = trailing_zeroes(value);
bit_offset += bit;
value >>= bit;
uint32_t range = trailing_ones(value);
func(bit_offset, range);
value &= ~((1u << range) - 1);
}
}
template <typename T>
inline bool is_pow2(T value)
{
return (value & (value - T(1))) == T(0);
}
inline uint32_t next_pow2(uint32_t v)
{
v--;
v |= v >> 16;
v |= v >> 8;
v |= v >> 4;
v |= v >> 2;
v |= v >> 1;
return v + 1;
}
inline uint32_t prev_pow2(uint32_t v)
{
return next_pow2(v + 1) >> 1;
}
inline uint32_t floor_log2(uint32_t v)
{
return 31 - leading_zeroes(v);
}
}
util/dynamic_array.hpp Normal file
@@ -0,0 +1,59 @@
/* Copyright (c) 2017-2023 Hans-Kristian Arntzen
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#pragma once
#include "aligned_alloc.hpp"
#include <memory>
#include <algorithm>
#include <type_traits>
namespace Util
{
template <typename T>
class DynamicArray
{
public:
// Only POD-like types work here since we don't invoke placement new or delete.
static_assert(std::is_trivially_default_constructible<T>::value, "T must be trivially constructible.");
static_assert(std::is_trivially_destructible<T>::value, "T must be trivially destructible.");
inline void reserve(size_t n)
{
if (n > N)
{
buffer.reset(static_cast<T *>(memalign_alloc(std::max<size_t>(64, alignof(T)), n * sizeof(T))));
N = n;
}
}
inline T &operator[](size_t index) { return buffer.get()[index]; }
inline const T &operator[](size_t index) const { return buffer.get()[index]; }
inline T *data() { return buffer.get(); }
inline const T *data() const { return buffer.get(); }
inline size_t get_capacity() const { return N; }
private:
std::unique_ptr<T, AlignedDeleter> buffer;
size_t N = 0;
};
}
util/enum_cast.hpp Normal file
@@ -0,0 +1,34 @@
/* Copyright (c) 2017-2023 Hans-Kristian Arntzen
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#pragma once
#include <type_traits>
namespace Util
{
template <typename T>
constexpr typename std::underlying_type<T>::type ecast(T x)
{
return static_cast<typename std::underlying_type<T>::type>(x);
}
}
util/environment.cpp Normal file
@@ -0,0 +1,96 @@
/* Copyright (c) 2017-2023 Hans-Kristian Arntzen
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#ifdef _WIN32
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#endif
#include "environment.hpp"
#include <string>
#include <stdlib.h>
namespace Util
{
bool get_environment(const char *env, std::string &str)
{
#ifdef _WIN32
char buf[4096];
DWORD count = GetEnvironmentVariableA(env, buf, sizeof(buf));
// A return value >= sizeof(buf) means the value was truncated; treat it
// as unset rather than constructing a string past the buffer.
if (count && count < sizeof(buf))
{
str = { buf, buf + count };
return true;
}
else
return false;
#else
if (const char *v = getenv(env))
{
str = v;
return true;
}
else
return false;
#endif
}
void set_environment(const char *env, const char *value)
{
#ifdef _WIN32
SetEnvironmentVariableA(env, value);
#else
setenv(env, value, 1);
#endif
}
std::string get_environment_string(const char *env, const char *default_value)
{
std::string v;
if (!get_environment(env, v))
v = default_value;
return v;
}
unsigned get_environment_uint(const char *env, unsigned default_value)
{
unsigned value = default_value;
std::string v;
if (get_environment(env, v))
value = unsigned(std::stoul(v));
return value;
}
int get_environment_int(const char *env, int default_value)
{
int value = default_value;
std::string v;
if (get_environment(env, v))
value = int(std::stol(v));
return value;
}
bool get_environment_bool(const char *env, bool default_value)
{
return get_environment_int(env, int(default_value)) != 0;
}
}
util/environment.hpp Normal file
@@ -0,0 +1,35 @@
/* Copyright (c) 2017-2023 Hans-Kristian Arntzen
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#pragma once
#include <string>
namespace Util
{
bool get_environment(const char *env, std::string &str);
std::string get_environment_string(const char *env, const char *default_value);
unsigned get_environment_uint(const char *env, unsigned default_value);
int get_environment_int(const char *env, int default_value);
bool get_environment_bool(const char *env, bool default_value);
void set_environment(const char *env, const char *value);
}
util/hash.hpp Normal file
@@ -0,0 +1,105 @@
/* Copyright (c) 2017-2023 Hans-Kristian Arntzen
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#pragma once
#include <stdint.h>
#include <string>
namespace Util
{
using Hash = uint64_t;
class Hasher
{
public:
explicit Hasher(Hash h_)
: h(h_)
{
}
Hasher() = default;
template <typename T>
inline void data(const T *data_, size_t size)
{
size /= sizeof(*data_);
for (size_t i = 0; i < size; i++)
h = (h * 0x100000001b3ull) ^ data_[i];
}
inline void u32(uint32_t value)
{
h = (h * 0x100000001b3ull) ^ value;
}
inline void s32(int32_t value)
{
u32(uint32_t(value));
}
inline void f32(float value)
{
union
{
float f32;
uint32_t u32;
} u;
u.f32 = value;
u32(u.u32);
}
inline void u64(uint64_t value)
{
u32(value & 0xffffffffu);
u32(value >> 32);
}
template <typename T>
inline void pointer(T *ptr)
{
u64(reinterpret_cast<uintptr_t>(ptr));
}
inline void string(const char *str)
{
char c;
u32(0xff);
while ((c = *str++) != '\0')
u32(uint8_t(c));
}
inline void string(const std::string &str)
{
u32(0xff);
for (auto &c : str)
u32(uint8_t(c));
}
inline Hash get() const
{
return h;
}
private:
Hash h = 0xcbf29ce484222325ull;
};
}
util/intrusive.hpp Normal file
@@ -0,0 +1,310 @@
/* Copyright (c) 2017-2023 Hans-Kristian Arntzen
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#pragma once
#include <stddef.h>
#include <utility>
#include <memory>
#include <atomic>
#include <type_traits>
namespace Util
{
class SingleThreadCounter
{
public:
inline void add_ref()
{
count++;
}
inline bool release()
{
return --count == 0;
}
private:
uint32_t count = 1;
};
class MultiThreadCounter
{
public:
MultiThreadCounter()
{
count.store(1, std::memory_order_relaxed);
}
inline void add_ref()
{
count.fetch_add(1, std::memory_order_relaxed);
}
inline bool release()
{
auto result = count.fetch_sub(1, std::memory_order_acq_rel);
return result == 1;
}
private:
std::atomic_uint32_t count;
};
template <typename T>
class IntrusivePtr;
template <typename T, typename Deleter = std::default_delete<T>, typename ReferenceOps = SingleThreadCounter>
class IntrusivePtrEnabled
{
public:
using IntrusivePtrType = IntrusivePtr<T>;
using EnabledBase = T;
using EnabledDeleter = Deleter;
using EnabledReferenceOp = ReferenceOps;
void release_reference()
{
if (reference_count.release())
Deleter()(static_cast<T *>(this));
}
void add_reference()
{
reference_count.add_ref();
}
IntrusivePtrEnabled() = default;
IntrusivePtrEnabled(const IntrusivePtrEnabled &) = delete;
void operator=(const IntrusivePtrEnabled &) = delete;
protected:
Util::IntrusivePtr<T> reference_from_this();
private:
ReferenceOps reference_count;
};
template <typename T>
class IntrusivePtr
{
public:
template <typename U>
friend class IntrusivePtr;
IntrusivePtr() = default;
explicit IntrusivePtr(T *handle)
: data(handle)
{
}
T &operator*()
{
return *data;
}
const T &operator*() const
{
return *data;
}
T *operator->()
{
return data;
}
const T *operator->() const
{
return data;
}
explicit operator bool() const
{
return data != nullptr;
}
bool operator==(const IntrusivePtr &other) const
{
return data == other.data;
}
bool operator!=(const IntrusivePtr &other) const
{
return data != other.data;
}
T *get()
{
return data;
}
const T *get() const
{
return data;
}
void reset()
{
using ReferenceBase = IntrusivePtrEnabled<
typename T::EnabledBase,
typename T::EnabledDeleter,
typename T::EnabledReferenceOp>;
// Static up-cast here to avoid potential issues with multiple intrusive inheritance.
// Also makes sure that the pointer type actually inherits from this type.
if (data)
static_cast<ReferenceBase *>(data)->release_reference();
data = nullptr;
}
template <typename U>
IntrusivePtr &operator=(const IntrusivePtr<U> &other)
{
static_assert(std::is_base_of<T, U>::value,
"Cannot safely assign downcasted intrusive pointers.");
using ReferenceBase = IntrusivePtrEnabled<
typename T::EnabledBase,
typename T::EnabledDeleter,
typename T::EnabledReferenceOp>;
reset();
data = static_cast<T *>(other.data);
// Static up-cast here to avoid potential issues with multiple intrusive inheritance.
// Also makes sure that the pointer type actually inherits from this type.
if (data)
static_cast<ReferenceBase *>(data)->add_reference();
return *this;
}
IntrusivePtr &operator=(const IntrusivePtr &other)
{
using ReferenceBase = IntrusivePtrEnabled<
typename T::EnabledBase,
typename T::EnabledDeleter,
typename T::EnabledReferenceOp>;
if (this != &other)
{
reset();
data = other.data;
if (data)
static_cast<ReferenceBase *>(data)->add_reference();
}
return *this;
}
template <typename U>
IntrusivePtr(const IntrusivePtr<U> &other)
{
*this = other;
}
IntrusivePtr(const IntrusivePtr &other)
{
*this = other;
}
~IntrusivePtr()
{
reset();
}
template <typename U>
IntrusivePtr &operator=(IntrusivePtr<U> &&other) noexcept
{
reset();
data = other.data;
other.data = nullptr;
return *this;
}
IntrusivePtr &operator=(IntrusivePtr &&other) noexcept
{
if (this != &other)
{
reset();
data = other.data;
other.data = nullptr;
}
return *this;
}
template <typename U>
IntrusivePtr(IntrusivePtr<U> &&other) noexcept
{
*this = std::move(other);
}
IntrusivePtr(IntrusivePtr &&other) noexcept
{
*this = std::move(other);
}
T *release() &
{
T *ret = data;
data = nullptr;
return ret;
}
T *release() &&
{
T *ret = data;
data = nullptr;
return ret;
}
private:
T *data = nullptr;
};
template <typename T, typename Deleter, typename ReferenceOps>
IntrusivePtr<T> IntrusivePtrEnabled<T, Deleter, ReferenceOps>::reference_from_this()
{
add_reference();
return IntrusivePtr<T>(static_cast<T *>(this));
}
template <typename Derived>
using DerivedIntrusivePtrType = IntrusivePtr<Derived>;
template <typename T, typename... P>
DerivedIntrusivePtrType<T> make_handle(P &&... p)
{
return DerivedIntrusivePtrType<T>(new T(std::forward<P>(p)...));
}
template <typename Base, typename Derived, typename... P>
typename Base::IntrusivePtrType make_derived_handle(P &&... p)
{
return typename Base::IntrusivePtrType(new Derived(std::forward<P>(p)...));
}
template <typename T>
using ThreadSafeIntrusivePtrEnabled = IntrusivePtrEnabled<T, std::default_delete<T>, MultiThreadCounter>;
}
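The core idea of `IntrusivePtrEnabled`/`IntrusivePtr` above, a reference count embedded in the object itself so the smart pointer is just a raw pointer with no separate control block, can be sketched in a minimal, self-contained form. `Node`, `NodePtr`, and `ref_count_after_copy` are illustrative names for this sketch, not part of the header:

```cpp
#include <cassert>

// Minimal sketch of the intrusive-pointer idea: the count lives inside
// the object, copies bump it, and the last release() deletes the object.
struct Node
{
	int value = 0;
	unsigned refs = 1; // starts at 1, owned by the creating pointer

	void add_ref() { ++refs; }
	bool release() { return --refs == 0; }
};

class NodePtr
{
public:
	explicit NodePtr(Node *n = nullptr) : data(n) {}
	NodePtr(const NodePtr &other) : data(other.data)
	{
		if (data)
			data->add_ref();
	}
	NodePtr &operator=(const NodePtr &other)
	{
		if (this != &other)
		{
			reset();
			data = other.data;
			if (data)
				data->add_ref();
		}
		return *this;
	}
	~NodePtr() { reset(); }

	void reset()
	{
		// Only the pointer that drops the count to zero frees the object.
		if (data && data->release())
			delete data;
		data = nullptr;
	}
	Node *get() const { return data; }

private:
	Node *data;
};

inline unsigned ref_count_after_copy()
{
	NodePtr a(new Node{42});
	NodePtr b = a;        // bumps the intrusive count to 2
	return a.get()->refs; // observed while both pointers are alive
}
```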

util/intrusive_hash_map.hpp Normal file

@@ -0,0 +1,690 @@
/* Copyright (c) 2017-2023 Hans-Kristian Arntzen
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#pragma once
#include "hash.hpp"
#include "intrusive_list.hpp"
#include "object_pool.hpp"
#include "read_write_lock.hpp"
#include <assert.h>
#include <vector>
namespace Util
{
template <typename T>
class IntrusiveHashMapEnabled : public IntrusiveListEnabled<T>
{
public:
IntrusiveHashMapEnabled() = default;
IntrusiveHashMapEnabled(Util::Hash hash)
: intrusive_hashmap_key(hash)
{
}
void set_hash(Util::Hash hash)
{
intrusive_hashmap_key = hash;
}
Util::Hash get_hash() const
{
return intrusive_hashmap_key;
}
private:
Hash intrusive_hashmap_key = 0;
};
template <typename T>
struct IntrusivePODWrapper : public IntrusiveHashMapEnabled<IntrusivePODWrapper<T>>
{
template <typename U>
explicit IntrusivePODWrapper(U&& value_)
: value(std::forward<U>(value_))
{
}
IntrusivePODWrapper() = default;
T& get()
{
return value;
}
const T& get() const
{
return value;
}
T value = {};
};
// This HashMap is non-owning. It just arranges a list of pointers.
// It is a special-purpose container used by the Vulkan backend.
// Memory ownership is handled through composition by a different class.
// T must inherit from IntrusiveHashMapEnabled<T>.
// Each instance of T can only be part of one hashmap.
template <typename T>
class IntrusiveHashMapHolder
{
public:
enum { InitialSize = 16, InitialLoadCount = 3 };
T *find(Hash hash) const
{
if (values.empty())
return nullptr;
Hash hash_mask = values.size() - 1;
auto masked = hash & hash_mask;
for (unsigned i = 0; i < load_count; i++)
{
if (values[masked] && get_hash(values[masked]) == hash)
return values[masked];
masked = (masked + 1) & hash_mask;
}
return nullptr;
}
template <typename P>
bool find_and_consume_pod(Hash hash, P &p) const
{
T *t = find(hash);
if (t)
{
p = t->get();
return true;
}
else
return false;
}
// Inserts a value. If an element with the same hash already exists, no insertion happens;
// instead, value is rebound to the existing element, and the rejected pointer is
// returned so the caller can delete it (or similar).
// Returns nullptr if nothing was in the hashmap for this key.
T *insert_yield(T *&value)
{
if (values.empty())
grow();
Hash hash_mask = values.size() - 1;
auto hash = get_hash(value);
auto masked = hash & hash_mask;
for (unsigned i = 0; i < load_count; i++)
{
if (values[masked] && get_hash(values[masked]) == hash)
{
T *ret = value;
value = values[masked];
return ret;
}
else if (!values[masked])
{
values[masked] = value;
list.insert_front(value);
return nullptr;
}
masked = (masked + 1) & hash_mask;
}
grow();
return insert_yield(value);
}
T *insert_replace(T *value)
{
if (values.empty())
grow();
Hash hash_mask = values.size() - 1;
auto hash = get_hash(value);
auto masked = hash & hash_mask;
for (unsigned i = 0; i < load_count; i++)
{
if (values[masked] && get_hash(values[masked]) == hash)
{
std::swap(values[masked], value);
list.erase(value);
list.insert_front(values[masked]);
return value;
}
else if (!values[masked])
{
assert(!values[masked]);
values[masked] = value;
list.insert_front(value);
return nullptr;
}
masked = (masked + 1) & hash_mask;
}
grow();
return insert_replace(value);
}
T *erase(Hash hash)
{
Hash hash_mask = values.size() - 1;
auto masked = hash & hash_mask;
for (unsigned i = 0; i < load_count; i++)
{
if (values[masked] && get_hash(values[masked]) == hash)
{
auto *value = values[masked];
list.erase(value);
values[masked] = nullptr;
return value;
}
masked = (masked + 1) & hash_mask;
}
return nullptr;
}
void erase(T *value)
{
erase(get_hash(value));
}
void clear()
{
list.clear();
values.clear();
load_count = 0;
}
typename IntrusiveList<T>::Iterator begin() const
{
return list.begin();
}
typename IntrusiveList<T>::Iterator end() const
{
return list.end();
}
IntrusiveList<T> &inner_list()
{
return list;
}
const IntrusiveList<T> &inner_list() const
{
return list;
}
private:
inline bool compare_key(Hash masked, Hash hash) const
{
return get_key_for_index(masked) == hash;
}
inline Hash get_hash(const T *value) const
{
return static_cast<const IntrusiveHashMapEnabled<T> *>(value)->get_hash();
}
inline Hash get_key_for_index(Hash masked) const
{
return get_hash(values[masked]);
}
bool insert_inner(T *value)
{
Hash hash_mask = values.size() - 1;
auto hash = get_hash(value);
auto masked = hash & hash_mask;
for (unsigned i = 0; i < load_count; i++)
{
if (!values[masked])
{
values[masked] = value;
return true;
}
masked = (masked + 1) & hash_mask;
}
return false;
}
void grow()
{
bool success;
do
{
for (auto &v : values)
v = nullptr;
if (values.empty())
{
values.resize(InitialSize);
load_count = InitialLoadCount;
//LOGI("Growing hashmap to %u elements.\n", InitialSize);
}
else
{
values.resize(values.size() * 2);
//LOGI("Growing hashmap to %u elements.\n", unsigned(values.size()));
load_count++;
}
// Re-insert.
success = true;
for (auto &t : list)
{
if (!insert_inner(&t))
{
success = false;
break;
}
}
} while (!success);
}
std::vector<T *> values;
IntrusiveList<T> list;
unsigned load_count = 0;
};
template <typename T>
class IntrusiveHashMap
{
public:
~IntrusiveHashMap()
{
clear();
}
IntrusiveHashMap() = default;
IntrusiveHashMap(const IntrusiveHashMap &) = delete;
void operator=(const IntrusiveHashMap &) = delete;
void clear()
{
auto &list = hashmap.inner_list();
auto itr = list.begin();
while (itr != list.end())
{
auto *to_free = itr.get();
itr = list.erase(itr);
pool.free(to_free);
}
hashmap.clear();
}
T *find(Hash hash) const
{
return hashmap.find(hash);
}
T &operator[](Hash hash)
{
auto *t = find(hash);
if (!t)
t = emplace_yield(hash);
return *t;
}
template <typename P>
bool find_and_consume_pod(Hash hash, P &p) const
{
return hashmap.find_and_consume_pod(hash, p);
}
void erase(T *value)
{
hashmap.erase(value);
pool.free(value);
}
void erase(Hash hash)
{
auto *value = hashmap.erase(hash);
if (value)
pool.free(value);
}
template <typename... P>
T *emplace_replace(Hash hash, P&&... p)
{
T *t = allocate(std::forward<P>(p)...);
return insert_replace(hash, t);
}
template <typename... P>
T *emplace_yield(Hash hash, P&&... p)
{
T *t = allocate(std::forward<P>(p)...);
return insert_yield(hash, t);
}
template <typename... P>
T *allocate(P&&... p)
{
return pool.allocate(std::forward<P>(p)...);
}
void free(T *value)
{
pool.free(value);
}
T *insert_replace(Hash hash, T *value)
{
static_cast<IntrusiveHashMapEnabled<T> *>(value)->set_hash(hash);
T *to_delete = hashmap.insert_replace(value);
if (to_delete)
pool.free(to_delete);
return value;
}
T *insert_yield(Hash hash, T *value)
{
static_cast<IntrusiveHashMapEnabled<T> *>(value)->set_hash(hash);
T *to_delete = hashmap.insert_yield(value);
if (to_delete)
pool.free(to_delete);
return value;
}
typename IntrusiveList<T>::Iterator begin() const
{
return hashmap.begin();
}
typename IntrusiveList<T>::Iterator end() const
{
return hashmap.end();
}
IntrusiveHashMap &get_thread_unsafe()
{
return *this;
}
const IntrusiveHashMap &get_thread_unsafe() const
{
return *this;
}
private:
IntrusiveHashMapHolder<T> hashmap;
ObjectPool<T> pool;
};
template <typename T>
using IntrusiveHashMapWrapper = IntrusiveHashMap<IntrusivePODWrapper<T>>;
template <typename T>
class ThreadSafeIntrusiveHashMap
{
public:
T *find(Hash hash) const
{
lock.lock_read();
T *t = hashmap.find(hash);
lock.unlock_read();
// We can race with the intrusive list internal pointers,
// but that's an internal detail which should never be touched outside the hashmap.
return t;
}
template <typename P>
bool find_and_consume_pod(Hash hash, P &p) const
{
lock.lock_read();
bool ret = hashmap.find_and_consume_pod(hash, p);
lock.unlock_read();
return ret;
}
void clear()
{
lock.lock_write();
hashmap.clear();
lock.unlock_write();
}
// Assumption is that readers will not be erased while in use by any other thread.
void erase(T *value)
{
lock.lock_write();
hashmap.erase(value);
lock.unlock_write();
}
void erase(Hash hash)
{
lock.lock_write();
hashmap.erase(hash);
lock.unlock_write();
}
template <typename... P>
T *allocate(P&&... p)
{
lock.lock_write();
T *t = hashmap.allocate(std::forward<P>(p)...);
lock.unlock_write();
return t;
}
void free(T *value)
{
lock.lock_write();
hashmap.free(value);
lock.unlock_write();
}
T *insert_replace(Hash hash, T *value)
{
lock.lock_write();
value = hashmap.insert_replace(hash, value);
lock.unlock_write();
return value;
}
T *insert_yield(Hash hash, T *value)
{
lock.lock_write();
value = hashmap.insert_yield(hash, value);
lock.unlock_write();
return value;
}
// This one is very sketchy, since callers need to make sure there are no readers of this hash.
template <typename... P>
T *emplace_replace(Hash hash, P&&... p)
{
lock.lock_write();
T *t = hashmap.emplace_replace(hash, std::forward<P>(p)...);
lock.unlock_write();
return t;
}
template <typename... P>
T *emplace_yield(Hash hash, P&&... p)
{
lock.lock_write();
T *t = hashmap.emplace_yield(hash, std::forward<P>(p)...);
lock.unlock_write();
return t;
}
// Not supposed to be called in racy conditions,
// we could have a global read lock and unlock while iterating if necessary.
typename IntrusiveList<T>::Iterator begin()
{
return hashmap.begin();
}
typename IntrusiveList<T>::Iterator end()
{
return hashmap.end();
}
IntrusiveHashMap<T> &get_thread_unsafe()
{
return hashmap;
}
const IntrusiveHashMap<T> &get_thread_unsafe() const
{
return hashmap;
}
private:
IntrusiveHashMap<T> hashmap;
mutable RWSpinLock lock;
};
// A special purpose hashmap which is split into a read-only, immutable portion and a plain thread-safe one.
// User can move read-write thread-safe portion to read-only portion when user knows it's safe to do so.
template <typename T>
class ThreadSafeIntrusiveHashMapReadCached
{
public:
~ThreadSafeIntrusiveHashMapReadCached()
{
clear();
}
T *find(Hash hash) const
{
T *t = read_only.find(hash);
if (t)
return t;
lock.lock_read();
t = read_write.find(hash);
lock.unlock_read();
return t;
}
void move_to_read_only()
{
auto &list = read_write.inner_list();
auto itr = list.begin();
while (itr != list.end())
{
auto *to_move = itr.get();
read_write.erase(to_move);
T *to_delete = read_only.insert_yield(to_move);
if (to_delete)
object_pool.free(to_delete);
itr = list.begin();
}
}
template <typename P>
bool find_and_consume_pod(Hash hash, P &p) const
{
if (read_only.find_and_consume_pod(hash, p))
return true;
lock.lock_read();
bool ret = read_write.find_and_consume_pod(hash, p);
lock.unlock_read();
return ret;
}
void clear()
{
lock.lock_write();
clear_list(read_only.inner_list());
clear_list(read_write.inner_list());
read_only.clear();
read_write.clear();
lock.unlock_write();
}
template <typename... P>
T *allocate(P&&... p)
{
lock.lock_write();
T *t = object_pool.allocate(std::forward<P>(p)...);
lock.unlock_write();
return t;
}
void free(T *ptr)
{
lock.lock_write();
object_pool.free(ptr);
lock.unlock_write();
}
T *insert_yield(Hash hash, T *value)
{
static_cast<IntrusiveHashMapEnabled<T> *>(value)->set_hash(hash);
lock.lock_write();
T *to_delete = read_write.insert_yield(value);
if (to_delete)
object_pool.free(to_delete);
lock.unlock_write();
return value;
}
template <typename... P>
T *emplace_yield(Hash hash, P&&... p)
{
T *t = allocate(std::forward<P>(p)...);
return insert_yield(hash, t);
}
IntrusiveHashMapHolder<T> &get_read_only()
{
return read_only;
}
IntrusiveHashMapHolder<T> &get_read_write()
{
return read_write;
}
private:
IntrusiveHashMapHolder<T> read_only;
IntrusiveHashMapHolder<T> read_write;
ObjectPool<T> object_pool;
mutable RWSpinLock lock;
void clear_list(IntrusiveList<T> &list)
{
auto itr = list.begin();
while (itr != list.end())
{
auto *to_free = itr.get();
itr = list.erase(itr);
object_pool.free(to_free);
}
}
};
}
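The probing scheme used by `IntrusiveHashMapHolder` (power-of-two table, linear probing, and a bounded probe length `load_count`, with the real container calling `grow()` and rehashing when probes are exhausted) can be shown standalone. `ProbeTable` is a hypothetical illustration of that scheme, not the real container:

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Sketch of bounded linear probing over a power-of-two table.
// Keys are assumed non-zero; 0 marks an empty slot.
struct ProbeTable
{
	std::vector<uint64_t> slots;
	unsigned load_count;

	ProbeTable(size_t pow2_size, unsigned probes)
		: slots(pow2_size, 0), load_count(probes) {}

	bool insert(uint64_t key)
	{
		uint64_t mask = slots.size() - 1;
		uint64_t idx = key & mask;
		for (unsigned i = 0; i < load_count; i++)
		{
			if (slots[idx] == 0 || slots[idx] == key)
			{
				slots[idx] = key;
				return true;
			}
			idx = (idx + 1) & mask; // linear probe, wraps around
		}
		return false; // the real container would grow() and retry here
	}

	bool find(uint64_t key) const
	{
		uint64_t mask = slots.size() - 1;
		uint64_t idx = key & mask;
		for (unsigned i = 0; i < load_count; i++)
		{
			if (slots[idx] == key)
				return true;
			idx = (idx + 1) & mask;
		}
		return false;
	}
};
```

With a 16-slot table and `load_count` of 3, keys 1, 17, and 33 all hash to slot 1 and fill three consecutive slots; a fourth colliding key exhausts the probe budget, which is the condition that triggers growth in the real map.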

util/intrusive_list.hpp Normal file

@@ -0,0 +1,197 @@
/* Copyright (c) 2017-2023 Hans-Kristian Arntzen
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#pragma once
namespace Util
{
template <typename T>
struct IntrusiveListEnabled
{
IntrusiveListEnabled<T> *prev = nullptr;
IntrusiveListEnabled<T> *next = nullptr;
};
template <typename T>
class IntrusiveList
{
public:
void clear()
{
head = nullptr;
tail = nullptr;
}
class Iterator
{
public:
friend class IntrusiveList<T>;
Iterator(IntrusiveListEnabled<T> *node_)
: node(node_)
{
}
Iterator() = default;
explicit operator bool() const
{
return node != nullptr;
}
bool operator==(const Iterator &other) const
{
return node == other.node;
}
bool operator!=(const Iterator &other) const
{
return node != other.node;
}
T &operator*()
{
return *static_cast<T *>(node);
}
const T &operator*() const
{
return *static_cast<T *>(node);
}
T *get()
{
return static_cast<T *>(node);
}
const T *get() const
{
return static_cast<const T *>(node);
}
T *operator->()
{
return static_cast<T *>(node);
}
const T *operator->() const
{
return static_cast<T *>(node);
}
Iterator &operator++()
{
node = node->next;
return *this;
}
Iterator &operator--()
{
node = node->prev;
return *this;
}
private:
IntrusiveListEnabled<T> *node = nullptr;
};
Iterator begin() const
{
return Iterator(head);
}
Iterator rbegin() const
{
return Iterator(tail);
}
Iterator end() const
{
return Iterator();
}
Iterator erase(Iterator itr)
{
auto *node = itr.get();
auto *next = node->next;
auto *prev = node->prev;
if (prev)
prev->next = next;
else
head = next;
if (next)
next->prev = prev;
else
tail = prev;
return next;
}
void insert_front(Iterator itr)
{
auto *node = itr.get();
if (head)
head->prev = node;
else
tail = node;
node->next = head;
node->prev = nullptr;
head = node;
}
void insert_back(Iterator itr)
{
auto *node = itr.get();
if (tail)
tail->next = node;
else
head = node;
node->prev = tail;
node->next = nullptr;
tail = node;
}
void move_to_front(IntrusiveList<T> &other, Iterator itr)
{
other.erase(itr);
insert_front(itr);
}
void move_to_back(IntrusiveList<T> &other, Iterator itr)
{
other.erase(itr);
insert_back(itr);
}
bool empty() const
{
return head == nullptr;
}
private:
IntrusiveListEnabled<T> *head = nullptr;
IntrusiveListEnabled<T> *tail = nullptr;
};
}
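The structure above keeps the `prev`/`next` links inside the node itself, so insertion and erasure are pointer swaps with no allocation. A minimal self-contained sketch of the same front-insert and unlink logic; `LNode` and `LList` are illustrative names, not from the header:

```cpp
// Intrusive doubly-linked list sketch: links live in the node.
struct LNode
{
	LNode *prev = nullptr;
	LNode *next = nullptr;
	int value = 0;
};

struct LList
{
	LNode *head = nullptr;
	LNode *tail = nullptr;

	void insert_front(LNode *node)
	{
		if (head)
			head->prev = node;
		else
			tail = node; // first element is also the tail
		node->next = head;
		node->prev = nullptr;
		head = node;
	}

	void erase(LNode *node)
	{
		// Unlink by patching neighbours; head/tail when at an end.
		if (node->prev)
			node->prev->next = node->next;
		else
			head = node->next;
		if (node->next)
			node->next->prev = node->prev;
		else
			tail = node->prev;
	}
};
```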

util/logging.cpp Normal file

@@ -0,0 +1,72 @@
/* Copyright (c) 2017-2023 Hans-Kristian Arntzen
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#include "logging.hpp"
#ifdef _WIN32
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#endif
namespace Util
{
static thread_local LoggingInterface *logging_iface;
bool interface_log(const char *tag, const char *fmt, ...)
{
if (!logging_iface)
return false;
va_list va;
va_start(va, fmt);
bool ret = logging_iface->log(tag, fmt, va);
va_end(va);
return ret;
}
void set_thread_logging_interface(LoggingInterface *iface)
{
logging_iface = iface;
}
#ifdef _WIN32
void debug_output_log(const char *tag, const char *fmt, ...)
{
if (!IsDebuggerPresent())
return;
va_list va;
va_start(va, fmt);
// A va_list is consumed by vsnprintf, so measure the length with a copy
// and keep the original for the actual formatting pass.
va_list va_size;
va_copy(va_size, va);
auto len = vsnprintf(nullptr, 0, fmt, va_size);
va_end(va_size);
if (len > 0)
{
size_t tag_len = strlen(tag);
char *buf = new char[len + tag_len + 1];
memcpy(buf, tag, tag_len);
vsnprintf(buf + tag_len, len + 1, fmt, va);
OutputDebugStringA(buf);
delete[] buf;
}
va_end(va);
}
#endif
}
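The measure-then-format `vsnprintf` pattern used by `debug_output_log` (first call sizes the output, second call writes it, with a tag prepended) can be shown standalone. `format_log` is a hypothetical helper for this sketch, not part of the library:

```cpp
#include <cstdarg>
#include <cstdio>
#include <string>

// Sketch: format a tag + printf-style message into a std::string.
// The va_list is copied before the sizing pass, since vsnprintf
// consumes the list it is given.
inline std::string format_log(const char *tag, const char *fmt, ...)
{
	va_list va;
	va_start(va, fmt);
	va_list va_size;
	va_copy(va_size, va);
	int len = vsnprintf(nullptr, 0, fmt, va_size); // measure first ...
	va_end(va_size);

	std::string out(tag);
	if (len > 0)
	{
		size_t off = out.size();
		out.resize(off + size_t(len) + 1); // room for the NUL terminator
		vsnprintf(&out[off], size_t(len) + 1, fmt, va); // ... then format
		out.resize(off + size_t(len)); // drop the terminator again
	}
	va_end(va);
	return out;
}
```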

util/logging.hpp Normal file

@@ -0,0 +1,96 @@
/* Copyright (c) 2017-2023 Hans-Kristian Arntzen
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#pragma once
#include <stdio.h>
#include <string.h>
#include <stdarg.h>
namespace Util
{
class LoggingInterface
{
public:
virtual ~LoggingInterface() = default;
virtual bool log(const char *tag, const char *fmt, va_list va) = 0;
};
bool interface_log(const char *tag, const char *fmt, ...);
void set_thread_logging_interface(LoggingInterface *iface);
}
#if defined(_WIN32)
namespace Util
{
void debug_output_log(const char *tag, const char *fmt, ...);
}
#define LOGE_FALLBACK(...) do { \
fprintf(stderr, "[ERROR]: " __VA_ARGS__); \
fflush(stderr); \
::Util::debug_output_log("[ERROR]: ", __VA_ARGS__); \
} while(false)
#define LOGW_FALLBACK(...) do { \
fprintf(stderr, "[WARN]: " __VA_ARGS__); \
fflush(stderr); \
::Util::debug_output_log("[WARN]: ", __VA_ARGS__); \
} while(false)
#define LOGI_FALLBACK(...) do { \
fprintf(stderr, "[INFO]: " __VA_ARGS__); \
fflush(stderr); \
::Util::debug_output_log("[INFO]: ", __VA_ARGS__); \
} while(false)
#elif defined(ANDROID)
#include <android/log.h>
#define LOGE_FALLBACK(...) do { __android_log_print(ANDROID_LOG_ERROR, "Granite", __VA_ARGS__); } while(0)
#define LOGW_FALLBACK(...) do { __android_log_print(ANDROID_LOG_WARN, "Granite", __VA_ARGS__); } while(0)
#define LOGI_FALLBACK(...) do { __android_log_print(ANDROID_LOG_INFO, "Granite", __VA_ARGS__); } while(0)
#else
#define LOGE_FALLBACK(...) \
do \
{ \
fprintf(stderr, "[ERROR]: " __VA_ARGS__); \
fflush(stderr); \
} while (false)
#define LOGW_FALLBACK(...) \
do \
{ \
fprintf(stderr, "[WARN]: " __VA_ARGS__); \
fflush(stderr); \
} while (false)
#define LOGI_FALLBACK(...) \
do \
{ \
fprintf(stderr, "[INFO]: " __VA_ARGS__); \
fflush(stderr); \
} while (false)
#endif
#define LOGE(...) do { if (!::Util::interface_log("[ERROR]: ", __VA_ARGS__)) { LOGE_FALLBACK(__VA_ARGS__); }} while(0)
#define LOGW(...) do { if (!::Util::interface_log("[WARN]: ", __VA_ARGS__)) { LOGW_FALLBACK(__VA_ARGS__); }} while(0)
#define LOGI(...) do { if (!::Util::interface_log("[INFO]: ", __VA_ARGS__)) { LOGI_FALLBACK(__VA_ARGS__); }} while(0)

util/object_pool.hpp Normal file

@@ -0,0 +1,132 @@
/* Copyright (c) 2017-2023 Hans-Kristian Arntzen
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#pragma once
#include <memory>
#include <mutex>
#include <vector>
#include <algorithm>
#include <stdlib.h>
#include "aligned_alloc.hpp"
//#define OBJECT_POOL_DEBUG
namespace Util
{
template<typename T>
class ObjectPool
{
public:
template<typename... P>
T *allocate(P &&... p)
{
#ifndef OBJECT_POOL_DEBUG
if (vacants.empty())
{
unsigned num_objects = 64u << memory.size();
T *ptr = static_cast<T *>(memalign_alloc(std::max<size_t>(64, alignof(T)),
num_objects * sizeof(T)));
if (!ptr)
return nullptr;
for (unsigned i = 0; i < num_objects; i++)
vacants.push_back(&ptr[i]);
memory.emplace_back(ptr);
}
T *ptr = vacants.back();
vacants.pop_back();
new(ptr) T(std::forward<P>(p)...);
return ptr;
#else
return new T(std::forward<P>(p)...);
#endif
}
void free(T *ptr)
{
#ifndef OBJECT_POOL_DEBUG
ptr->~T();
vacants.push_back(ptr);
#else
delete ptr;
#endif
}
void clear()
{
#ifndef OBJECT_POOL_DEBUG
vacants.clear();
memory.clear();
#endif
}
protected:
#ifndef OBJECT_POOL_DEBUG
std::vector<T *> vacants;
struct MallocDeleter
{
void operator()(T *ptr)
{
memalign_free(ptr);
}
};
std::vector<std::unique_ptr<T, MallocDeleter>> memory;
#endif
};
template<typename T>
class ThreadSafeObjectPool : private ObjectPool<T>
{
public:
template<typename... P>
T *allocate(P &&... p)
{
std::lock_guard<std::mutex> holder{lock};
return ObjectPool<T>::allocate(std::forward<P>(p)...);
}
void free(T *ptr)
{
#ifndef OBJECT_POOL_DEBUG
ptr->~T();
std::lock_guard<std::mutex> holder{lock};
this->vacants.push_back(ptr);
#else
delete ptr;
#endif
}
void clear()
{
std::lock_guard<std::mutex> holder{lock};
ObjectPool<T>::clear();
}
private:
std::mutex lock;
};
}
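The growth policy in `ObjectPool::allocate` sizes each new backing block as `64u << memory.size()`, so capacity roughly doubles with every block (64, 128, 256, ...), amortizing allocation cost. The arithmetic can be sketched standalone; these helper names are illustrative, not from the header:

```cpp
#include <cstddef>

// Number of objects in the next backing block, given how many
// blocks have already been allocated (mirrors 64u << memory.size()).
inline size_t block_object_count(size_t blocks_already_allocated)
{
	return size_t(64) << blocks_already_allocated;
}

// Total pooled capacity after allocating `blocks` backing blocks:
// a geometric series summing to 64 * (2^blocks - 1).
inline size_t total_capacity(size_t blocks)
{
	size_t total = 0;
	for (size_t i = 0; i < blocks; i++)
		total += block_object_count(i);
	return total;
}
```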

util/read_write_lock.hpp Normal file

@@ -0,0 +1,149 @@
/* Copyright (c) 2017-2023 Hans-Kristian Arntzen
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#pragma once
#include <atomic>
#ifdef __SSE2__
#include <emmintrin.h>
#endif
namespace Util
{
class RWSpinLock
{
public:
enum { Reader = 2, Writer = 1 };
RWSpinLock()
{
counter.store(0);
}
inline void lock_read()
{
unsigned v = counter.fetch_add(Reader, std::memory_order_acquire);
while ((v & Writer) != 0)
{
#ifdef __SSE2__
_mm_pause();
#endif
v = counter.load(std::memory_order_acquire);
}
}
inline bool try_lock_read()
{
unsigned v = counter.fetch_add(Reader, std::memory_order_acquire);
if ((v & Writer) != 0)
{
unlock_read();
return false;
}
return true;
}
inline void unlock_read()
{
counter.fetch_sub(Reader, std::memory_order_release);
}
inline void lock_write()
{
uint32_t expected = 0;
while (!counter.compare_exchange_weak(expected, Writer,
std::memory_order_acquire,
std::memory_order_relaxed))
{
#ifdef __SSE2__
_mm_pause();
#endif
expected = 0;
}
}
inline bool try_lock_write()
{
uint32_t expected = 0;
return counter.compare_exchange_strong(expected, Writer,
std::memory_order_acquire,
std::memory_order_relaxed);
}
inline void unlock_write()
{
counter.fetch_and(~Writer, std::memory_order_release);
}
inline void promote_reader_to_writer()
{
uint32_t expected = Reader;
if (!counter.compare_exchange_strong(expected, Writer,
std::memory_order_acquire,
std::memory_order_relaxed))
{
unlock_read();
lock_write();
}
}
private:
std::atomic_uint32_t counter;
};
class RWSpinLockReadHolder
{
public:
explicit RWSpinLockReadHolder(RWSpinLock &lock_)
: lock(lock_)
{
lock.lock_read();
}
~RWSpinLockReadHolder()
{
lock.unlock_read();
}
private:
RWSpinLock &lock;
};
class RWSpinLockWriteHolder
{
public:
explicit RWSpinLockWriteHolder(RWSpinLock &lock_)
: lock(lock_)
{
lock.lock_write();
}
~RWSpinLockWriteHolder()
{
lock.unlock_write();
}
private:
RWSpinLock &lock;
};
}
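`RWSpinLock` packs both roles into one atomic counter: bit 0 is the writer flag, and each reader adds 2, so a writer can only enter when the whole counter is zero, and a reader that observes the writer bit backs out its increment. A self-contained sketch of that encoding, using free functions rather than the class itself:

```cpp
#include <atomic>
#include <cstdint>

enum : uint32_t { WriterBit = 1, ReaderUnit = 2 };

// Reader optimistically adds its unit, then backs off if a writer held the lock.
inline bool try_read(std::atomic<uint32_t> &counter)
{
	uint32_t v = counter.fetch_add(ReaderUnit, std::memory_order_acquire);
	if (v & WriterBit)
	{
		counter.fetch_sub(ReaderUnit, std::memory_order_release); // back off
		return false;
	}
	return true;
}

// Writer only succeeds when no readers and no writer are present (counter == 0).
inline bool try_write(std::atomic<uint32_t> &counter)
{
	uint32_t expected = 0;
	return counter.compare_exchange_strong(expected, WriterBit,
	                                       std::memory_order_acquire,
	                                       std::memory_order_relaxed);
}
```

A single reader (counter 2) blocks the writer; once the reader subtracts its unit the writer's compare-exchange of 0 to 1 succeeds, and subsequent readers see the writer bit and back off.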

util/small_vector.hpp Normal file

@@ -0,0 +1,456 @@
/* Copyright (c) 2019-2023 Hans-Kristian Arntzen
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#pragma once
#include <stddef.h>
#include <stdlib.h>
#include <utility>
#include <exception>
#include <algorithm>
#include <initializer_list>
namespace Util
{
// std::aligned_storage does not support size == 0, so roll our own.
template <typename T, size_t N>
class AlignedBuffer
{
public:
T *data()
{
return reinterpret_cast<T *>(aligned_char);
}
private:
alignas(T) char aligned_char[sizeof(T) * N];
};
template <typename T>
class AlignedBuffer<T, 0>
{
public:
T *data()
{
return nullptr;
}
};
// An immutable version of SmallVector which erases type information about storage.
template <typename T>
class VectorView
{
public:
T &operator[](size_t i)
{
return ptr[i];
}
const T &operator[](size_t i) const
{
return ptr[i];
}
bool empty() const
{
return buffer_size == 0;
}
size_t size() const
{
return buffer_size;
}
T *data()
{
return ptr;
}
const T *data() const
{
return ptr;
}
T *begin()
{
return ptr;
}
T *end()
{
return ptr + buffer_size;
}
const T *begin() const
{
return ptr;
}
const T *end() const
{
return ptr + buffer_size;
}
T &front()
{
return ptr[0];
}
const T &front() const
{
return ptr[0];
}
T &back()
{
return ptr[buffer_size - 1];
}
const T &back() const
{
return ptr[buffer_size - 1];
}
// Avoid sliced copies. Base class should only be read as a reference.
VectorView(const VectorView &) = delete;
void operator=(const VectorView &) = delete;
protected:
VectorView() = default;
T *ptr = nullptr;
size_t buffer_size = 0;
};
// Simple vector which supports up to N elements inline, without malloc/free.
// We use a lot of throwaway vectors all over the place which triggers allocations.
// This class only implements the subset of std::vector we need in SPIRV-Cross.
// It is *NOT* a drop-in replacement in general projects.
template <typename T, size_t N = 8>
class SmallVector : public VectorView<T>
{
public:
SmallVector()
{
this->ptr = stack_storage.data();
buffer_capacity = N;
}
SmallVector(const T *arg_list_begin, const T *arg_list_end)
: SmallVector()
{
auto count = size_t(arg_list_end - arg_list_begin);
reserve(count);
for (size_t i = 0; i < count; i++, arg_list_begin++)
new (&this->ptr[i]) T(*arg_list_begin);
this->buffer_size = count;
}
SmallVector(SmallVector &&other) noexcept : SmallVector()
{
*this = std::move(other);
}
SmallVector(const std::initializer_list<T> &init_list) : SmallVector()
{
insert(this->end(), init_list.begin(), init_list.end());
}
SmallVector &operator=(SmallVector &&other) noexcept
{
clear();
if (other.ptr != other.stack_storage.data())
{
// Pilfer allocated pointer.
if (this->ptr != stack_storage.data())
free(this->ptr);
this->ptr = other.ptr;
this->buffer_size = other.buffer_size;
buffer_capacity = other.buffer_capacity;
other.ptr = nullptr;
other.buffer_size = 0;
other.buffer_capacity = 0;
}
else
{
// Need to move the stack contents individually.
reserve(other.buffer_size);
for (size_t i = 0; i < other.buffer_size; i++)
{
new (&this->ptr[i]) T(std::move(other.ptr[i]));
other.ptr[i].~T();
}
this->buffer_size = other.buffer_size;
other.buffer_size = 0;
}
return *this;
}
SmallVector(const SmallVector &other)
: SmallVector()
{
*this = other;
}
SmallVector &operator=(const SmallVector &other)
{
clear();
reserve(other.buffer_size);
for (size_t i = 0; i < other.buffer_size; i++)
new (&this->ptr[i]) T(other.ptr[i]);
this->buffer_size = other.buffer_size;
return *this;
}
explicit SmallVector(size_t count)
: SmallVector()
{
resize(count);
}
~SmallVector()
{
clear();
if (this->ptr != stack_storage.data())
free(this->ptr);
}
void clear()
{
for (size_t i = 0; i < this->buffer_size; i++)
this->ptr[i].~T();
this->buffer_size = 0;
}
void push_back(const T &t)
{
reserve(this->buffer_size + 1);
new (&this->ptr[this->buffer_size]) T(t);
this->buffer_size++;
}
void push_back(T &&t)
{
reserve(this->buffer_size + 1);
new (&this->ptr[this->buffer_size]) T(std::move(t));
this->buffer_size++;
}
void pop_back()
{
// Work around false positive warning on GCC 8.3.
// Calling pop_back on empty vector is undefined.
if (!this->empty())
resize(this->buffer_size - 1);
}
template <typename... Ts>
void emplace_back(Ts &&... ts)
{
reserve(this->buffer_size + 1);
new (&this->ptr[this->buffer_size]) T(std::forward<Ts>(ts)...);
this->buffer_size++;
}
void reserve(size_t count)
{
if (count > buffer_capacity)
{
size_t target_capacity = buffer_capacity;
if (target_capacity == 0)
target_capacity = 1;
if (target_capacity < N)
target_capacity = N;
while (target_capacity < count)
target_capacity <<= 1u;
T *new_buffer =
target_capacity > N ? static_cast<T *>(malloc(target_capacity * sizeof(T))) : stack_storage.data();
if (!new_buffer)
std::terminate();
// new_buffer can equal ptr when both refer to the inline stack storage; skip the move in that case.
if (new_buffer != this->ptr)
{
// We don't deal with types which can throw in move constructor.
for (size_t i = 0; i < this->buffer_size; i++)
{
new (&new_buffer[i]) T(std::move(this->ptr[i]));
this->ptr[i].~T();
}
}
if (this->ptr != stack_storage.data())
free(this->ptr);
this->ptr = new_buffer;
buffer_capacity = target_capacity;
}
}
void insert(T *itr, const T *insert_begin, const T *insert_end)
{
auto count = size_t(insert_end - insert_begin);
if (itr == this->end())
{
reserve(this->buffer_size + count);
for (size_t i = 0; i < count; i++, insert_begin++)
new (&this->ptr[this->buffer_size + i]) T(*insert_begin);
this->buffer_size += count;
}
else
{
if (this->buffer_size + count > buffer_capacity)
{
auto target_capacity = this->buffer_size + count;
if (target_capacity == 0)
target_capacity = 1;
if (target_capacity < N)
target_capacity = N;
while (target_capacity < count)
target_capacity <<= 1u;
// Need to allocate new buffer. Move everything to a new buffer.
T *new_buffer =
target_capacity > N ? static_cast<T *>(malloc(target_capacity * sizeof(T))) : stack_storage.data();
if (!new_buffer)
std::terminate();
// First, move elements from source buffer to new buffer.
// We don't deal with types which can throw in move constructor.
auto *target_itr = new_buffer;
auto *original_source_itr = this->begin();
if (new_buffer != this->ptr)
{
while (original_source_itr != itr)
{
new (target_itr) T(std::move(*original_source_itr));
original_source_itr->~T();
++original_source_itr;
++target_itr;
}
}
// Copy-construct new elements.
for (auto *source_itr = insert_begin; source_itr != insert_end; ++source_itr, ++target_itr)
new (target_itr) T(*source_itr);
// Move over the other half.
if (new_buffer != this->ptr || insert_begin != insert_end)
{
while (original_source_itr != this->end())
{
new (target_itr) T(std::move(*original_source_itr));
original_source_itr->~T();
++original_source_itr;
++target_itr;
}
}
if (this->ptr != stack_storage.data())
free(this->ptr);
this->ptr = new_buffer;
buffer_capacity = target_capacity;
}
else
{
// Move in place, need to be a bit careful about which elements are constructed and which are not.
// Move the end and construct the new elements.
auto *target_itr = this->end() + count;
auto *source_itr = this->end();
while (target_itr != this->end() && source_itr != itr)
{
--target_itr;
--source_itr;
new (target_itr) T(std::move(*source_itr));
}
// For already constructed elements we can move-assign.
std::move_backward(itr, source_itr, target_itr);
// For the inserts which go to already constructed elements, we can do a plain copy.
while (itr != this->end() && insert_begin != insert_end)
*itr++ = *insert_begin++;
// For inserts into newly allocated memory, we must copy-construct instead.
while (insert_begin != insert_end)
{
new (itr) T(*insert_begin);
++itr;
++insert_begin;
}
}
this->buffer_size += count;
}
}
void insert(T *itr, const T &value)
{
insert(itr, &value, &value + 1);
}
T *erase(T *itr)
{
std::move(itr + 1, this->end(), itr);
this->ptr[--this->buffer_size].~T();
return itr;
}
void erase(T *start_erase, T *end_erase)
{
if (end_erase == this->end())
{
resize(size_t(start_erase - this->begin()));
}
else
{
auto new_size = this->buffer_size - (end_erase - start_erase);
std::move(end_erase, this->end(), start_erase);
resize(new_size);
}
}
void resize(size_t new_size)
{
if (new_size < this->buffer_size)
{
for (size_t i = new_size; i < this->buffer_size; i++)
this->ptr[i].~T();
}
else if (new_size > this->buffer_size)
{
reserve(new_size);
for (size_t i = this->buffer_size; i < new_size; i++)
new (&this->ptr[i]) T();
}
this->buffer_size = new_size;
}
private:
size_t buffer_capacity = 0;
AlignedBuffer<T, N> stack_storage;
};
}
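The key trick in SmallVector is that `stack_storage` is raw aligned bytes, not an array of `T`: elements only come into existence via placement new, and destructors run by hand. A minimal standalone illustration of that idiom (`InlineStorage` is a made-up name, not the header's API):

```cpp
#include <cstddef>
#include <new>
#include <string>

// Raw aligned bytes on the stack; objects are created with placement new
// and destroyed manually, so no heap allocation ever happens for <= N items.
template <typename T, size_t N>
struct InlineStorage
{
    alignas(T) unsigned char bytes[sizeof(T) * N];
    size_t count = 0;

    T *data() { return reinterpret_cast<T *>(bytes); }

    void push_back(const T &t)
    {
        // Construct in place inside the buffer instead of allocating.
        new (data() + count) T(t);
        count++;
    }

    ~InlineStorage()
    {
        // Raw storage never runs T's destructor implicitly; do it by hand.
        for (size_t i = 0; i < count; i++)
            data()[i].~T();
    }
};
```

SmallVector layers the growth logic of `reserve()` on top of exactly this: it starts with the inline bytes and only calls `malloc` once the capacity exceeds `N`.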

util/stack_allocator.hpp (new file)
@@ -0,0 +1,62 @@
/* Copyright (c) 2017-2023 Hans-Kristian Arntzen
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#pragma once
#include <stddef.h>
#include <algorithm>
namespace Util
{
template <typename T, size_t N>
class StackAllocator
{
public:
T *allocate(size_t count)
{
if (count == 0)
return nullptr;
if (offset + count > N)
return nullptr;
T *ret = buffer + offset;
offset += count;
return ret;
}
T *allocate_cleared(size_t count)
{
T *ret = allocate(count);
if (ret)
std::fill(ret, ret + count, T());
return ret;
}
void reset()
{
offset = 0;
}
private:
T buffer[N];
size_t offset = 0;
};
}
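Allocation here is a single bump of `offset`; there is no per-allocation free, only `reset()`, which makes the allocator well suited to per-frame scratch data. A self-contained sketch of the same discipline (`BumpPool` is a hypothetical mirror of the class above):

```cpp
#include <cstddef>

// Bump allocator over a fixed array: allocate() is pointer arithmetic,
// reset() reclaims everything at once.
template <typename T, size_t N>
class BumpPool
{
public:
    T *allocate(size_t count)
    {
        if (count == 0 || offset + count > N)
            return nullptr; // out of space: caller must handle failure
        T *ret = buffer + offset;
        offset += count;
        return ret;
    }

    void reset() { offset = 0; }

private:
    T buffer[N];
    size_t offset = 0;
};
```

A typical caller allocates freely during a frame and calls `reset()` at the start of the next one, turning many small allocations into zero heap traffic.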

util/temporary_hashmap.hpp (new file)
@@ -0,0 +1,177 @@
/* Copyright (c) 2017-2023 Hans-Kristian Arntzen
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#pragma once
#include "hash.hpp"
#include "object_pool.hpp"
#include "intrusive_list.hpp"
#include "intrusive_hash_map.hpp"
#include <vector>
namespace Util
{
template <typename T>
class TemporaryHashmapEnabled
{
public:
void set_hash(Hash hash_)
{
hash = hash_;
}
void set_index(unsigned index_)
{
index = index_;
}
Hash get_hash() const
{
return hash;
}
unsigned get_index() const
{
return index;
}
private:
Hash hash = 0;
unsigned index = 0;
};
// RingSize must be a power of two; begin_frame() masks the frame index with (RingSize - 1).
template <typename T, unsigned RingSize = 4, bool ReuseObjects = false>
class TemporaryHashmap
{
public:
~TemporaryHashmap()
{
clear();
}
void clear()
{
for (auto &ring : rings)
{
while (!ring.empty())
{
auto itr = ring.begin();
ring.erase(itr);
auto &node = *itr;
object_pool.free(static_cast<T *>(&node));
}
}
hashmap.clear();
for (auto &vacant : vacants)
object_pool.free(static_cast<T *>(&*vacant));
vacants.clear();
object_pool.clear();
}
void begin_frame()
{
index = (index + 1) & (RingSize - 1);
auto &ring = rings[index];
while (!ring.empty())
{
auto itr = ring.begin();
ring.erase(itr);
auto &node = *itr;
hashmap.erase(node.get_hash());
free_object(&node, ReuseTag<ReuseObjects>());
}
}
T *request(Hash hash)
{
auto *v = hashmap.find(hash);
if (v)
{
auto node = v->get();
if (node->get_index() != index)
{
rings[index].move_to_front(rings[node->get_index()], node);
node->set_index(index);
}
return &*node;
}
else
return nullptr;
}
template <typename... P>
void make_vacant(P &&... p)
{
vacants.push_back(object_pool.allocate(std::forward<P>(p)...));
}
T *request_vacant(Hash hash)
{
if (vacants.empty())
return nullptr;
auto top = vacants.back();
vacants.pop_back();
top->set_index(index);
top->set_hash(hash);
hashmap.emplace_replace(hash, top);
rings[index].insert_front(top);
return &*top;
}
template <typename... P>
T *emplace(Hash hash, P &&... p)
{
auto *node = object_pool.allocate(std::forward<P>(p)...);
node->set_index(index);
node->set_hash(hash);
hashmap.emplace_replace(hash, node);
rings[index].insert_front(node);
return node;
}
private:
IntrusiveList<T> rings[RingSize];
ObjectPool<T> object_pool;
unsigned index = 0;
IntrusiveHashMap<IntrusivePODWrapper<typename IntrusiveList<T>::Iterator>> hashmap;
std::vector<typename IntrusiveList<T>::Iterator> vacants;
template <bool reuse>
struct ReuseTag
{
};
void free_object(T *object, const ReuseTag<false> &)
{
object_pool.free(object);
}
void free_object(T *object, const ReuseTag<true> &)
{
vacants.push_back(object);
}
};
}
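Conceptually this is an N-frame LRU: every object sits in the ring of the frame that last touched it, and `begin_frame()` frees whatever has gone `RingSize` frames without use. A standalone sketch of the eviction scheme using `std::unordered_map` instead of the intrusive containers (`Cache`, `touch` and `alive` are hypothetical names):

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Frame-ring eviction sketch: an entry untouched for RING_SIZE frames dies.
constexpr unsigned RING_SIZE = 4; // power of two, matching the class above

struct Cache
{
    std::unordered_map<uint64_t, unsigned> last_used; // hash -> ring index
    std::vector<uint64_t> rings[RING_SIZE];
    unsigned index = 0;

    void begin_frame()
    {
        index = (index + 1) & (RING_SIZE - 1);
        // Entries in this ring were last touched RING_SIZE frames ago;
        // evict them unless a later frame re-touched the same hash.
        for (uint64_t hash : rings[index])
        {
            auto it = last_used.find(hash);
            if (it != last_used.end() && it->second == index)
                last_used.erase(it);
        }
        rings[index].clear();
    }

    void touch(uint64_t hash)
    {
        last_used[hash] = index;
        rings[index].push_back(hash);
    }

    bool alive(uint64_t hash) const
    {
        return last_used.count(hash) != 0;
    }
};
```

The real class does the same bookkeeping with intrusive lists so that moving an object between rings (`move_to_front` in `request()`) costs a few pointer swaps and no allocation.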

util/thread_id.cpp (new file)
@@ -0,0 +1,45 @@
/* Copyright (c) 2017-2023 Hans-Kristian Arntzen
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#include "thread_id.hpp"
#include "logging.hpp"
namespace Util
{
static thread_local unsigned thread_id_to_index = ~0u;
unsigned get_current_thread_index()
{
auto ret = thread_id_to_index;
if (ret == ~0u)
{
LOGE("Thread was not registered with the thread manager; falling back to index 0 (main thread).\n");
return 0;
}
return ret;
}
void register_thread_index(unsigned index)
{
thread_id_to_index = index;
}
}
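The pattern is a plain `thread_local` slot: a thread pool registers each worker's index at startup, and any code running on that thread can later recover it without locks or map lookups. A standalone sketch (`register_index` / `current_index` are illustrative names):

```cpp
#include <thread>

// Each thread carries its own copy; ~0u marks "never registered".
static thread_local unsigned tls_index = ~0u;

void register_index(unsigned index)
{
    tls_index = index;
}

unsigned current_index()
{
    // Unregistered threads fall back to 0, treated as the main thread.
    return tls_index == ~0u ? 0u : tls_index;
}
```

Because the slot is `thread_local`, a freshly spawned thread that never calls `register_index` still gets a well-defined answer.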

util/thread_id.hpp (new file)
@@ -0,0 +1,29 @@
/* Copyright (c) 2017-2023 Hans-Kristian Arntzen
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#pragma once
namespace Util
{
unsigned get_current_thread_index();
void register_thread_index(unsigned thread_index);
}

util/thread_name.cpp (new file)
@@ -0,0 +1,59 @@
/* Copyright (c) 2017-2023 Hans-Kristian Arntzen
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#include "thread_name.hpp"
#if !defined(_WIN32)
#include <pthread.h>
#else
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <string>
#endif
namespace Util
{
void set_current_thread_name(const char *name)
{
#if defined(__linux__)
pthread_setname_np(pthread_self(), name);
#elif defined(__APPLE__)
pthread_setname_np(name);
#elif defined(_WIN32)
using PFN_SetThreadDescription = HRESULT (WINAPI *)(HANDLE, PCWSTR);
auto module = GetModuleHandleA("kernel32.dll");
PFN_SetThreadDescription SetThreadDescription = module ? reinterpret_cast<PFN_SetThreadDescription>(
(void *)GetProcAddress(module, "SetThreadDescription")) : nullptr;
if (SetThreadDescription)
{
std::wstring wname;
// Naive char-by-char widening; assumes the thread name is plain ASCII.
while (*name != '\0')
{
wname.push_back(*name);
name++;
}
SetThreadDescription(GetCurrentThread(), wname.c_str());
}
#endif
}
}

util/thread_name.hpp (new file)
@@ -0,0 +1,28 @@
/* Copyright (c) 2017-2023 Hans-Kristian Arntzen
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#pragma once
namespace Util
{
void set_current_thread_name(const char *name);
}

@@ -0,0 +1,185 @@
/* Copyright (c) 2017-2023 Hans-Kristian Arntzen
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#include "logging.hpp"
#include "timeline_trace_file.hpp"
#include "thread_name.hpp"
#include "timer.hpp"
#include <string.h>
#include <stdio.h>
namespace Util
{
static thread_local char trace_tid[32];
static thread_local TimelineTraceFile *trace_file;
void TimelineTraceFile::set_tid(const char *tid)
{
snprintf(trace_tid, sizeof(trace_tid), "%s", tid);
}
void TimelineTraceFile::set_per_thread(TimelineTraceFile *file)
{
trace_file = file;
}
TimelineTraceFile *TimelineTraceFile::get_per_thread()
{
return trace_file;
}
void TimelineTraceFile::Event::set_desc(const char *desc_)
{
snprintf(desc, sizeof(desc), "%s", desc_);
}
void TimelineTraceFile::Event::set_tid(const char *tid_)
{
snprintf(tid, sizeof(tid), "%s", tid_);
}
TimelineTraceFile::Event *TimelineTraceFile::begin_event(const char *desc, uint32_t pid)
{
auto *e = event_pool.allocate();
e->pid = pid;
e->set_tid(trace_tid);
e->set_desc(desc);
e->start_ns = get_current_time_nsecs();
return e;
}
TimelineTraceFile::Event *TimelineTraceFile::allocate_event()
{
auto *e = event_pool.allocate();
e->desc[0] = '\0';
e->tid[0] = '\0';
e->pid = 0;
e->start_ns = 0;
e->end_ns = 0;
return e;
}
void TimelineTraceFile::submit_event(Event *e)
{
std::lock_guard<std::mutex> holder{lock};
queued_events.push(e);
cond.notify_one();
}
void TimelineTraceFile::end_event(Event *e)
{
e->end_ns = get_current_time_nsecs();
submit_event(e);
}
TimelineTraceFile::TimelineTraceFile(const std::string &path)
{
thr = std::thread(&TimelineTraceFile::looper, this, path);
}
void TimelineTraceFile::looper(std::string path)
{
set_current_thread_name("json-trace-io");
FILE *file = fopen(path.c_str(), "w");
if (!file)
LOGE("Failed to open file: %s.\n", path.c_str());
if (file)
fputs("[\n", file);
uint64_t base_ts = get_current_time_nsecs();
for (;;)
{
Event *e;
{
std::unique_lock<std::mutex> holder{lock};
cond.wait(holder, [this]() {
return !queued_events.empty();
});
e = queued_events.front();
queued_events.pop();
}
if (!e)
break;
auto start_us = int64_t(e->start_ns - base_ts) * 1e-3;
auto end_us = int64_t(e->end_ns - base_ts) * 1e-3;
if (file && start_us <= end_us)
{
fprintf(file, "{ \"name\": \"%s\", \"ph\": \"B\", \"tid\": \"%s\", \"pid\": \"%u\", \"ts\": %f },\n",
e->desc, e->tid, e->pid, start_us);
fprintf(file, "{ \"name\": \"%s\", \"ph\": \"E\", \"tid\": \"%s\", \"pid\": \"%u\", \"ts\": %f },\n",
e->desc, e->tid, e->pid, end_us);
}
event_pool.free(e);
}
// The JSON array is intentionally left unterminated so every record can end with ","; Chrome's trace viewer accepts the truncated form.
if (file)
fclose(file);
}
TimelineTraceFile::~TimelineTraceFile()
{
submit_event(nullptr);
if (thr.joinable())
thr.join();
}
TimelineTraceFile::ScopedEvent::ScopedEvent(TimelineTraceFile *file_, const char *tag, uint32_t pid)
: file(file_)
{
if (file && tag && *tag != '\0')
event = file->begin_event(tag, pid);
}
TimelineTraceFile::ScopedEvent::~ScopedEvent()
{
if (event)
file->end_event(event);
}
TimelineTraceFile::ScopedEvent &
TimelineTraceFile::ScopedEvent::operator=(TimelineTraceFile::ScopedEvent &&other) noexcept
{
if (this != &other)
{
if (event)
file->end_event(event);
event = other.event;
file = other.file;
other.event = nullptr;
other.file = nullptr;
}
return *this;
}
TimelineTraceFile::ScopedEvent::ScopedEvent(TimelineTraceFile::ScopedEvent &&other) noexcept
{
*this = std::move(other);
}
}
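The output is the Chrome trace-event JSON format (loadable in chrome://tracing or Perfetto): each scope becomes a "B" (begin) and "E" (end) record sharing name, tid and pid, with timestamps in microseconds. A sketch of one record pair (`format_event` is a hypothetical helper with the same field layout as the fprintf calls above):

```cpp
#include <cstdio>
#include <string>

// Format one begin/end pair in Chrome trace-event JSON, as looper() does.
std::string format_event(const char *name, const char *tid,
                         unsigned pid, double start_us, double end_us)
{
    char buf[512];
    int n = snprintf(buf, sizeof(buf),
        "{ \"name\": \"%s\", \"ph\": \"B\", \"tid\": \"%s\", \"pid\": \"%u\", \"ts\": %f },\n"
        "{ \"name\": \"%s\", \"ph\": \"E\", \"tid\": \"%s\", \"pid\": \"%u\", \"ts\": %f },\n",
        name, tid, pid, start_us, name, tid, pid, end_us);
    return std::string(buf, n < 0 ? 0 : size_t(n));
}
```

Matching "B"/"E" pairs are what lets the viewer reconstruct nesting per thread; the `ts` fields must be monotonically sourced, which is why looper() derives them from `get_current_time_nsecs()`.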

@@ -0,0 +1,96 @@
/* Copyright (c) 2017-2023 Hans-Kristian Arntzen
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#pragma once
#include <string>
#include <thread>
#include <condition_variable>
#include <mutex>
#include <memory>
#include <queue>
#include "object_pool.hpp"
namespace Util
{
class TimelineTraceFile
{
public:
explicit TimelineTraceFile(const std::string &path);
~TimelineTraceFile();
static void set_tid(const char *tid);
static TimelineTraceFile *get_per_thread();
static void set_per_thread(TimelineTraceFile *file);
struct Event
{
char desc[256];
char tid[32];
uint32_t pid;
uint64_t start_ns, end_ns;
void set_desc(const char *desc);
void set_tid(const char *tid);
};
Event *begin_event(const char *desc, uint32_t pid = 0);
void end_event(Event *e);
Event *allocate_event();
void submit_event(Event *e);
struct ScopedEvent
{
ScopedEvent(TimelineTraceFile *file, const char *tag, uint32_t pid = 0);
ScopedEvent() = default;
~ScopedEvent();
void operator=(const ScopedEvent &) = delete;
ScopedEvent(const ScopedEvent &) = delete;
ScopedEvent(ScopedEvent &&other) noexcept;
ScopedEvent &operator=(ScopedEvent &&other) noexcept;
TimelineTraceFile *file = nullptr;
Event *event = nullptr;
};
private:
void looper(std::string path);
std::thread thr;
std::mutex lock;
std::condition_variable cond;
ThreadSafeObjectPool<Event> event_pool;
std::queue<Event *> queued_events;
};
#ifndef GRANITE_SHIPPING
#define GRANITE_MACRO_CONCAT_IMPL(a, b) a##b
#define GRANITE_MACRO_CONCAT(a, b) GRANITE_MACRO_CONCAT_IMPL(a, b)
#define GRANITE_SCOPED_TIMELINE_EVENT(str) \
::Util::TimelineTraceFile::ScopedEvent GRANITE_MACRO_CONCAT(_timeline_scoped_count_, __COUNTER__){GRANITE_THREAD_GROUP() ? GRANITE_THREAD_GROUP()->get_timeline_trace_file() : nullptr, str}
#define GRANITE_SCOPED_TIMELINE_EVENT_FILE(file, str) \
::Util::TimelineTraceFile::ScopedEvent GRANITE_MACRO_CONCAT(_timeline_scoped_count_, __COUNTER__){file, str}
#else
#define GRANITE_SCOPED_TIMELINE_EVENT(...) ((void)0)
#define GRANITE_SCOPED_TIMELINE_EVENT_FILE(...) ((void)0)
#endif
}

util/timer.cpp (new file)
@@ -0,0 +1,131 @@
/* Copyright (c) 2017-2023 Hans-Kristian Arntzen
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#include "timer.hpp"
#ifdef _WIN32
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#else
#include <time.h>
#endif
namespace Util
{
FrameTimer::FrameTimer()
{
reset();
}
void FrameTimer::reset()
{
start = get_time();
last = start;
last_period = 0;
}
void FrameTimer::enter_idle()
{
idle_start = get_time();
}
void FrameTimer::leave_idle()
{
auto idle_end = get_time();
idle_time += idle_end - idle_start;
}
double FrameTimer::get_frame_time() const
{
return double(last_period) * 1e-9;
}
double FrameTimer::frame()
{
auto new_time = get_time() - idle_time;
last_period = new_time - last;
last = new_time;
return double(last_period) * 1e-9;
}
double FrameTimer::frame(double frame_time)
{
last_period = int64_t(frame_time * 1e9);
last += last_period;
return frame_time;
}
double FrameTimer::get_elapsed() const
{
return double(last - start) * 1e-9;
}
int64_t FrameTimer::get_time()
{
return get_current_time_nsecs();
}
#ifdef _WIN32
struct QPCFreq
{
QPCFreq()
{
LARGE_INTEGER freq;
QueryPerformanceFrequency(&freq);
inv_freq = 1e9 / double(freq.QuadPart);
}
double inv_freq;
} static static_qpc_freq;
#endif
int64_t get_current_time_nsecs()
{
#ifdef _WIN32
LARGE_INTEGER li;
if (!QueryPerformanceCounter(&li))
return 0;
return int64_t(double(li.QuadPart) * static_qpc_freq.inv_freq);
#else
struct timespec ts = {};
#if defined(ANDROID) || defined(__FreeBSD__)
constexpr auto timebase = CLOCK_MONOTONIC;
#else
constexpr auto timebase = CLOCK_MONOTONIC_RAW;
#endif
if (clock_gettime(timebase, &ts) < 0)
return 0;
return ts.tv_sec * 1000000000ll + ts.tv_nsec;
#endif
}
void Timer::start()
{
t = get_current_time_nsecs();
}
double Timer::end()
{
auto nt = get_current_time_nsecs();
return double(nt - t) * 1e-9;
}
}
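`get_current_time_nsecs()` abstracts over QueryPerformanceCounter and `clock_gettime` to deliver monotonic nanoseconds. On any C++11 platform, `std::chrono::steady_clock` gives the same monotonic guarantee; a portable sketch of the `Timer` pattern built on it (`ChronoTimer` is illustrative, not what the header uses):

```cpp
#include <chrono>

// Portable analogue of Timer: monotonic timing via steady_clock.
struct ChronoTimer
{
    std::chrono::steady_clock::time_point t;

    void start()
    {
        t = std::chrono::steady_clock::now();
    }

    // Seconds elapsed since start(), like Timer::end() above.
    double end() const
    {
        auto nt = std::chrono::steady_clock::now();
        return std::chrono::duration<double>(nt - t).count();
    }
};
```

The hand-rolled version exists because it predates reliable `steady_clock` implementations and because raw integer nanoseconds compose cleanly with the idle-time accounting in FrameTimer.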

util/timer.hpp (new file)
@@ -0,0 +1,63 @@
/* Copyright (c) 2017-2023 Hans-Kristian Arntzen
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#pragma once
#include <stdint.h>
namespace Util
{
class FrameTimer
{
public:
FrameTimer();
void reset();
double frame();
double frame(double frame_time);
double get_elapsed() const;
double get_frame_time() const;
void enter_idle();
void leave_idle();
private:
int64_t start;
int64_t last;
int64_t last_period;
int64_t idle_start = 0;
int64_t idle_time = 0;
int64_t get_time();
};
class Timer
{
public:
void start();
double end();
private:
int64_t t = 0;
};
int64_t get_current_time_nsecs();
}

Some files were not shown because too many files have changed in this diff.