Advanced GPU Optimization: Metal & Vulkan Compute from Zero to Hero

February 14, 2026
Leyton Cleveley

Welcome to the exciting world of GPU compute programming! In this comprehensive guide, we'll explore two powerful, low‑level APIs: Metal for Apple devices (especially Apple Silicon) and Vulkan Compute for cross‑platform high‑performance computing. By the end, you'll be able to write efficient compute kernels, manage GPU resources, and optimize for real‑world tasks like matrix multiplication and image processing.

Whether you're a macOS/iOS developer wanting to leverage the GPU on Apple Silicon (note that the Neural Engine is a separate accelerator reached through Core ML, not Metal), or a cross‑platform developer aiming for maximum performance on Windows, Linux, or Android, this tutorial has you covered.

Introduction: Why Metal and Vulkan Compute?

Metal

Metal is Apple's graphics and compute API, available on all modern Apple devices (iPhone, iPad, Mac). With the advent of Apple Silicon (M1, M2, M3 families), the GPU is tightly integrated with the CPU, offering unified memory and exceptional performance. Metal provides a low‑overhead, explicit API that gives you fine‑grained control over the GPU.

Vulkan Compute

Vulkan is a cross‑platform graphics and compute API from the Khronos Group. Its compute capabilities are exposed through compute pipelines and shaders written in GLSL or HLSL (compiled to SPIR‑V). Vulkan is designed for high‑performance, low‑latency applications and runs on Windows, Linux, Android, macOS (via MoltenVK), and more.

Both Metal and Vulkan follow a similar explicit, command‑buffer‑based model. Understanding one makes learning the other easier. Let's dive in!

Part 1: Metal on Apple Silicon

1.1 Setting Up Metal

To develop with Metal, you'll need:

  • A Mac with Xcode installed (Xcode includes the Metal compiler, framework, and debugging tools).
  • An Apple Silicon Mac (M1 or later) for native testing; Intel Macs also work but performance characteristics differ.
  • For iOS/tvOS, you'll need an Apple Developer account (free for device testing).

Create a new Xcode project (e.g., a command‑line tool for macOS) and add the Metal framework. In your source file, import Metal:

#import <Metal/Metal.h>
// or in Swift: import Metal

1.2 Metal Shading Language (MSL)

Metal uses its own C++‑based shading language, MSL. For compute, you write kernels (functions declared with the kernel qualifier). Basic MSL features:

  • Data types: float, int, half, vectors (float4, int2), matrices (float4x4).
  • Address spaces: device (global memory), threadgroup (shared memory within a workgroup), constant (read‑only), thread (private).
  • Built‑in variables:
    • uint3 [[thread_position_in_grid]] – global thread ID.
    • uint3 [[thread_position_in_threadgroup]] – local thread ID within threadgroup.
    • uint3 [[threadgroup_position_in_grid]] – threadgroup ID.
    • uint3 [[threads_per_threadgroup]] – size of threadgroup.

Example kernel that adds two vectors:

#include <metal_stdlib>
using namespace metal;

kernel void vecAdd(device const float* a [[buffer(0)]],
                   device const float* b [[buffer(1)]],
                   device float* c [[buffer(2)]],
                   uint id [[thread_position_in_grid]]) {
    c[id] = a[id] + b[id];
}

Note the use of [[buffer(n)]] to bind resources.

1.3 Metal Compute Pipeline

The host side (Objective‑C or Swift) sets up the pipeline:

  1. MTLDevice: The abstract representation of the GPU.
  2. MTLCommandQueue: Queue for submitting command buffers.
  3. MTLComputePipelineState: Compiled compute pipeline (kernel + device).
  4. MTLCommandBuffer: Container for commands.
  5. MTLComputeCommandEncoder: Encodes compute commands.
  6. Dispatch: Specify threadgroup dimensions.

1.4 Your First Metal Compute Program: Vector Addition

Let's walk through the host code in Objective‑C. We'll assume you have a Metal device and have compiled the kernel source.

Step 1: Get the default device

id<MTLDevice> device = MTLCreateSystemDefaultDevice();

Step 2: Create a command queue

id<MTLCommandQueue> commandQueue = [device newCommandQueue];

Step 3: Load the kernel

// Assume you have a .metal file in your project named "kernel.metal"
id<MTLLibrary> library = [device newDefaultLibrary];
id<MTLFunction> kernelFunction = [library newFunctionWithName:@"vecAdd"];

Step 4: Create the pipeline state

NSError *error = nil;
id<MTLComputePipelineState> pipelineState = [device newComputePipelineStateWithFunction:kernelFunction error:&error];

Step 5: Create buffers

NSUInteger dataSize = VEC_SIZE * sizeof(float);

id<MTLBuffer> bufferA = [device newBufferWithBytes:h_a length:dataSize options:MTLResourceStorageModeShared];
id<MTLBuffer> bufferB = [device newBufferWithBytes:h_b length:dataSize options:MTLResourceStorageModeShared];
id<MTLBuffer> bufferC = [device newBufferWithLength:dataSize options:MTLResourceStorageModeShared];

MTLResourceStorageModeShared means the buffer is accessible by both CPU and GPU (unified memory on Apple Silicon).

Step 6: Create command buffer and encoder

id<MTLCommandBuffer> commandBuffer = [commandQueue commandBuffer];
id<MTLComputeCommandEncoder> encoder = [commandBuffer computeCommandEncoder];
[encoder setComputePipelineState:pipelineState];
[encoder setBuffer:bufferA offset:0 atIndex:0];
[encoder setBuffer:bufferB offset:0 atIndex:1];
[encoder setBuffer:bufferC offset:0 atIndex:2];

Step 7: Dispatch threads

MTLSize threadgroupSize = MTLSizeMake(256, 1, 1); // threads per threadgroup
MTLSize threadgroupCount = MTLSizeMake((VEC_SIZE + 255) / 256, 1, 1);
[encoder dispatchThreadgroups:threadgroupCount threadsPerThreadgroup:threadgroupSize];
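Because the threadgroup count is rounded up, the grid may exceed VEC_SIZE; either make VEC_SIZE a multiple of 256, pass the length into the kernel and bounds‑check, or use dispatchThreads, which supports non‑uniform threadgroup sizes on modern Apple GPUs. A minimal sketch of the latter:

// Alternative: Metal trims the final threadgroup automatically.
// (Requires a GPU/feature set with non-uniform threadgroup support.)
[encoder dispatchThreads:MTLSizeMake(VEC_SIZE, 1, 1)
   threadsPerThreadgroup:threadgroupSize];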

Step 8: End encoding and commit

[encoder endEncoding];
[commandBuffer commit];
[commandBuffer waitUntilCompleted]; // or use completion handler
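If you'd rather not block the CPU, register a completion handler instead; a minimal sketch (the handler must be added before commit):

[commandBuffer addCompletedHandler:^(id<MTLCommandBuffer> cb) {
    // Runs on a Metal-owned thread once the GPU finishes.
    NSLog(@"GPU finished with status %lu", (unsigned long)cb.status);
}];
[commandBuffer commit];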

Step 9: Read results

float *c_ptr = (float*)[bufferC contents];
// c_ptr now contains the result

Step 10: Verify and clean up (ARC will handle objects).

That's it! You've just run your first Metal compute kernel.

1.5 Memory Management in Metal

Metal buffers are created with a storage mode:

  • MTLResourceStorageModeShared: CPU and GPU access same memory (ideal for Apple Silicon).
  • MTLResourceStorageModePrivate: Only GPU accessible; used for intermediate buffers.
  • MTLResourceStorageModeManaged: For discrete GPUs on macOS (e.g., Intel Macs); keeps a CPU copy and a GPU copy and requires explicit synchronization (didModifyRange: after CPU writes, or a blit synchronization before CPU reads).

On Apple Silicon, use Shared for simplicity and performance.

You can also create textures (id<MTLTexture>) for image data. Textures are useful for compute kernels that need spatial locality.
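A minimal sketch of creating a compute‑writable texture (the RGBA8 format and 1024×1024 size are illustrative assumptions):

MTLTextureDescriptor *desc =
    [MTLTextureDescriptor texture2DDescriptorWithPixelFormat:MTLPixelFormatRGBA8Unorm
                                                       width:1024
                                                      height:1024
                                                   mipmapped:NO];
desc.usage = MTLTextureUsageShaderRead | MTLTextureUsageShaderWrite; // readable and writable from kernels
id<MTLTexture> texture = [device newTextureWithDescriptor:desc];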

1.6 Performance Considerations

Threadgroup memory (shared memory)

Metal provides threadgroup memory (like CUDA shared memory) for fast data exchange within a workgroup. Declare it in the kernel:

kernel void myKernel(device float* in [[buffer(0)]],
                     device float* out [[buffer(1)]],
                     threadgroup float* shared [[threadgroup(0)]],
                     uint tid [[thread_position_in_threadgroup]],
                     uint gid [[thread_position_in_grid]]) {
    // use shared[tid] ...
}

On the host, you specify the size of threadgroup memory when dispatching.
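A sketch of that host call (the 256‑float size is an assumption matching a 256‑thread threadgroup):

// Back the kernel's [[threadgroup(0)]] argument with 256 floats of threadgroup memory.
[encoder setThreadgroupMemoryLength:256 * sizeof(float) atIndex:0];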

Coalescing and bank conflicts

Make global memory accesses coalesced: adjacent threads in a SIMD‑group (Apple's equivalent of a CUDA warp, 32 threads wide on Apple GPUs) should touch adjacent addresses so the hardware can combine them into wide transactions. In threadgroup memory, avoid bank conflicts by having the threads of a SIMD‑group access different banks.

Occupancy and threadgroup size

Use pipelineState.threadExecutionWidth to get the SIMD width the pipeline executes at (typically 32), and pipelineState.maxTotalThreadsPerThreadgroup for the upper bound; choose a threadgroup size that is a multiple of the former and no larger than the latter.
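A common sizing pattern (a sketch; kernels with heavy register or threadgroup‑memory use may need smaller groups):

NSUInteger width = pipelineState.threadExecutionWidth;                  // SIMD width, e.g. 32
NSUInteger height = pipelineState.maxTotalThreadsPerThreadgroup / width;
MTLSize threadgroupSize = MTLSizeMake(width, height, 1);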

1.7 Advanced Metal Compute

Using textures

Compute kernels can read/write textures. Useful for image processing.

kernel void sobel(texture2d<float, access::read> inTexture [[texture(0)]],
                  texture2d<float, access::write> outTexture [[texture(1)]],
                  uint2 gid [[thread_position_in_grid]]) {
    float4 color = inTexture.read(gid);
    float4 result = color; // placeholder: a real filter derives this from neighboring texels (full Sobel in Part 4)
    outTexture.write(result, gid);
}

Multiple dispatches and synchronization

You can encode multiple dispatches in one command buffer. With the default serial dispatch type, Metal orders dispatches that touch the same tracked resources, so later dispatches see earlier results. You can also call setThreadgroupMemoryLength:atIndex: between dispatches to change the threadgroup memory allocation per dispatch.

Profiling

Use Xcode's GPU Debugger and Metal System Trace to analyze performance, view occupancy, and find bottlenecks.

1.8 Example: Tiled Matrix Multiplication in Metal

Let's implement a tiled matrix multiplication kernel using threadgroup memory.

Kernel (MSL):

kernel void matMulTiled(device const float* A [[buffer(0)]],
                        device const float* B [[buffer(1)]],
                        device float* C [[buffer(2)]],
                        constant uint& N [[buffer(3)]],
                        threadgroup float* As [[threadgroup(0)]],
                        threadgroup float* Bs [[threadgroup(1)]],
                        uint2 tgid [[threadgroup_position_in_grid]],
                        uint2 lid [[thread_position_in_threadgroup]],
                        uint2 dims [[threads_per_threadgroup]]) {
    uint tileSize = dims.x; // assuming square threadgroup
    uint row = tgid.y * tileSize + lid.y;
    uint col = tgid.x * tileSize + lid.x;

    float sum = 0.0f;
    for (uint tile = 0; tile < N / tileSize; ++tile) {
        // Load tile of A and B
        As[lid.y * tileSize + lid.x] = A[row * N + (tile * tileSize + lid.x)];
        Bs[lid.y * tileSize + lid.x] = B[(tile * tileSize + lid.y) * N + col];
        threadgroup_barrier(mem_flags::mem_threadgroup);

        // Compute partial product
        for (uint k = 0; k < tileSize; ++k) {
            sum += As[lid.y * tileSize + k] * Bs[k * tileSize + lid.x];
        }
        threadgroup_barrier(mem_flags::mem_threadgroup);
    }
    C[row * N + col] = sum;
}

Host setup:

  • Create buffers A, B, and C.
  • Pass N at buffer index 3 (setBytes works well for small constants).
  • Set the threadgroup memory lengths for As and Bs, then dispatch with threadgroup size (16,16) and grid size (N/16, N/16), as in the sketch below. The kernel assumes N is a multiple of the tile size.
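A hedged host‑side sketch (matMulPipeline, bufferA/B/C, and a uint32_t N are assumed from earlier steps; N must be a multiple of 16):

const NSUInteger tile = 16;
[encoder setComputePipelineState:matMulPipeline];
[encoder setBuffer:bufferA offset:0 atIndex:0];
[encoder setBuffer:bufferB offset:0 atIndex:1];
[encoder setBuffer:bufferC offset:0 atIndex:2];
[encoder setBytes:&N length:sizeof(uint32_t) atIndex:3];                    // inline constant, no buffer needed
[encoder setThreadgroupMemoryLength:tile * tile * sizeof(float) atIndex:0]; // As
[encoder setThreadgroupMemoryLength:tile * tile * sizeof(float) atIndex:1]; // Bs
[encoder dispatchThreadgroups:MTLSizeMake(N / tile, N / tile, 1)
        threadsPerThreadgroup:MTLSizeMake(tile, tile, 1)];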

This kernel leverages threadgroup memory for data reuse, significantly improving performance.

Part 2: Vulkan Compute

2.1 Setting Up Vulkan

To develop Vulkan compute applications, you'll need:

  • Vulkan SDK from LunarG. Install and set VULKAN_SDK environment variable.
  • A Vulkan‑capable GPU and drivers (most modern GPUs from NVIDIA, AMD, Intel).
  • For macOS, you can use MoltenVK (included in the SDK), which translates Vulkan to Metal; compute performance through MoltenVK is generally good.

Create a C++ project and include vulkan/vulkan.h. Link against the Vulkan loader (vulkan-1 on Windows, vulkan on Linux).

2.2 Vulkan Compute Pipeline

The Vulkan compute pipeline involves several steps:

  1. Create a VkInstance (with validation layers for debugging).
  2. Pick a physical device (VkPhysicalDevice) that supports compute.
  3. Create a logical device (VkDevice) with a compute queue.
  4. Create a command pool and allocate command buffers.
  5. Create shader module from SPIR‑V.
  6. Create descriptor set layout and pipeline layout.
  7. Create compute pipeline.
  8. Create buffers and allocate device memory.
  9. Update descriptor sets to bind buffers.
  10. Record command buffer: bind pipeline, bind descriptor sets, dispatch.
  11. Submit to queue and synchronize.

Let's go through each step with code.

2.3 GLSL for Compute Shaders

We'll write compute shaders in GLSL (or HLSL compiled to SPIR‑V). A simple vector addition shader:

#version 450
layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;

layout(std430, binding = 0) readonly buffer InputA { float data[]; } a;
layout(std430, binding = 1) readonly buffer InputB { float data[]; } b;
layout(std430, binding = 2) writeonly buffer OutputC { float data[]; } c;

void main() {
    uint id = gl_GlobalInvocationID.x;
    c.data[id] = a.data[id] + b.data[id];
}

Compile this to SPIR‑V using glslc (or glslangValidator with its -V flag for Vulkan semantics). For example:

glslc vecAdd.comp -o vecAdd.spv
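The host then needs the SPIR‑V words in memory. A minimal loader sketch (the readSpirv name is our own):

#include <cstdint>
#include <fstream>
#include <vector>

// Reads a SPIR-V binary; data() and size() * 4 feed pCode and codeSize below.
std::vector<uint32_t> readSpirv(const char* path) {
    std::ifstream file(path, std::ios::binary | std::ios::ate);
    size_t bytes = static_cast<size_t>(file.tellg());
    std::vector<uint32_t> words(bytes / sizeof(uint32_t));
    file.seekg(0);
    file.read(reinterpret_cast<char*>(words.data()), bytes);
    return words;
}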

2.4 Memory and Resources

Vulkan uses VkBuffer objects backed by VkDeviceMemory. You must allocate memory, bind it, and map it for host access.

Descriptor sets are used to bind resources (buffers, images) to shaders. You need:

  • VkDescriptorSetLayout describing the bindings.
  • VkDescriptorPool to allocate descriptor sets.
  • VkDescriptorSet containing actual buffer references.

Push constants are small amounts of data that can be set directly in command buffers (fast).

2.5 Your First Vulkan Compute Program: Vector Addition

Let's outline the host code in C++. Error checking omitted for brevity.

Step 1: Create instance and pick device (standard Vulkan initialization). We'll assume you have a VkDevice and a compute queue.

Step 2: Create shader module

VkShaderModuleCreateInfo createInfo = {};
createInfo.sType = VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO;
createInfo.codeSize = codeSize; // size in bytes, must be a multiple of 4
createInfo.pCode = reinterpret_cast<const uint32_t*>(spvCode);
vkCreateShaderModule(device, &createInfo, nullptr, &shaderModule);

Step 3: Create descriptor set layout (bindings for three buffers)

VkDescriptorSetLayoutBinding bindings[3] = {};
bindings[0].binding = 0;
bindings[0].descriptorType = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER;
bindings[0].descriptorCount = 1;
bindings[0].stageFlags = VK_SHADER_STAGE_COMPUTE_BIT;
// Bindings 1 and 2 are identical except for .binding.
VkDescriptorSetLayoutCreateInfo layoutInfo = {};
layoutInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO;
layoutInfo.bindingCount = 3; layoutInfo.pBindings = bindings;
vkCreateDescriptorSetLayout(device, &layoutInfo, nullptr, &descriptorSetLayout);

Step 4: Create pipeline layout

VkPipelineLayoutCreateInfo pipelineLayoutInfo = {};
pipelineLayoutInfo.sType = VK_STRUCTURE_TYPE_PIPELINE_LAYOUT_CREATE_INFO;
pipelineLayoutInfo.setLayoutCount = 1; pipelineLayoutInfo.pSetLayouts = &descriptorSetLayout;
vkCreatePipelineLayout(device, &pipelineLayoutInfo, nullptr, &pipelineLayout);

Step 5: Create compute pipeline

VkComputePipelineCreateInfo pipelineInfo = {};
pipelineInfo.sType = VK_STRUCTURE_TYPE_COMPUTE_PIPELINE_CREATE_INFO;
pipelineInfo.stage.sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
pipelineInfo.stage.stage = VK_SHADER_STAGE_COMPUTE_BIT;
pipelineInfo.stage.module = shaderModule;
pipelineInfo.stage.pName = "main";
pipelineInfo.layout = pipelineLayout;
vkCreateComputePipelines(device, VK_NULL_HANDLE, 1, &pipelineInfo, nullptr, &pipeline);

Step 6: Create buffers and allocate memory

VkBufferCreateInfo bufferInfo = {};
bufferInfo.sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
bufferInfo.size = dataSize;
bufferInfo.usage = VK_BUFFER_USAGE_STORAGE_BUFFER_BIT; // add VK_BUFFER_USAGE_TRANSFER_DST_BIT only if you upload via a staging copy
bufferInfo.sharingMode = VK_SHARING_MODE_EXCLUSIVE;
vkCreateBuffer(device, &bufferInfo, nullptr, &bufferA);
// Similarly for B, C

// Get memory requirements, allocate memory, bind buffers
VkMemoryRequirements memReqs;
vkGetBufferMemoryRequirements(device, bufferA, &memReqs);
VkMemoryAllocateInfo allocInfo = {};
allocInfo.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
allocInfo.allocationSize = memReqs.size;
allocInfo.memoryTypeIndex = findMemoryType(memReqs.memoryTypeBits, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT);
vkAllocateMemory(device, &allocInfo, nullptr, &memoryA);
vkBindBufferMemory(device, bufferA, memoryA, 0);

Step 7: Map memory and copy initial data

void* mapped;
vkMapMemory(device, memoryA, 0, dataSize, 0, &mapped);
memcpy(mapped, h_a, dataSize);
vkUnmapMemory(device, memoryA);
// Similarly for B

Step 8: Create descriptor pool and descriptor set

VkDescriptorPoolSize poolSize = {};
poolSize.type = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER;
poolSize.descriptorCount = 3;
VkDescriptorPoolCreateInfo poolInfo = {};
poolInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_POOL_CREATE_INFO;
poolInfo.maxSets = 1;
poolInfo.poolSizeCount = 1; poolInfo.pPoolSizes = &poolSize;
vkCreateDescriptorPool(device, &poolInfo, nullptr, &descriptorPool);

VkDescriptorSetAllocateInfo setAllocInfo = {}; // renamed to avoid colliding with allocInfo from Step 6
setAllocInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_ALLOCATE_INFO;
setAllocInfo.descriptorPool = descriptorPool;
setAllocInfo.descriptorSetCount = 1; setAllocInfo.pSetLayouts = &descriptorSetLayout;
vkAllocateDescriptorSets(device, &setAllocInfo, &descriptorSet);

Step 9: Update descriptor set with buffer info

VkDescriptorBufferInfo bufferInfoA = { bufferA, 0, dataSize };
VkWriteDescriptorSet writeA = {};
writeA.sType = VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET;
writeA.dstSet = descriptorSet;
writeA.dstBinding = 0;
writeA.descriptorCount = 1;
writeA.descriptorType = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER;
writeA.pBufferInfo = &bufferInfoA;
// Similarly writeB and writeC for bindings 1 and 2, then gather all three:
VkWriteDescriptorSet writes[3] = { writeA, writeB, writeC };
vkUpdateDescriptorSets(device, 3, writes, 0, nullptr);

Step 10: Create command pool and command buffer

VkCommandPoolCreateInfo poolCmdInfo = {};
poolCmdInfo.sType = VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO;
poolCmdInfo.queueFamilyIndex = computeQueueFamilyIndex;
vkCreateCommandPool(device, &poolCmdInfo, nullptr, &commandPool);

VkCommandBufferAllocateInfo cmdAllocInfo = {};
cmdAllocInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO;
cmdAllocInfo.commandPool = commandPool;
cmdAllocInfo.level = VK_COMMAND_BUFFER_LEVEL_PRIMARY;
cmdAllocInfo.commandBufferCount = 1;
vkAllocateCommandBuffers(device, &cmdAllocInfo, &commandBuffer);

Step 11: Record commands

VkCommandBufferBeginInfo beginInfo = {};
beginInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
vkBeginCommandBuffer(commandBuffer, &beginInfo);

vkCmdBindPipeline(commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipeline);
vkCmdBindDescriptorSets(commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipelineLayout, 0, 1, &descriptorSet, 0, nullptr);

uint32_t groupCountX = (VEC_SIZE + 255) / 256; // rounds up: make VEC_SIZE a multiple of 256 or bounds-check in the shader
vkCmdDispatch(commandBuffer, groupCountX, 1, 1);

vkEndCommandBuffer(commandBuffer);

Step 12: Submit and wait

VkSubmitInfo submitInfo = {};
submitInfo.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submitInfo.commandBufferCount = 1; submitInfo.pCommandBuffers = &commandBuffer;
vkQueueSubmit(computeQueue, 1, &submitInfo, VK_NULL_HANDLE);
vkQueueWaitIdle(computeQueue);

Step 13: Read back results (map memory of buffer C and copy).

That's the complete flow. It's verbose, but gives you full control.

2.6 Synchronization

Vulkan uses fences (GPU‑to‑CPU sync) and semaphores (GPU‑to‑GPU sync). In simple compute programs you can just wait for the queue to go idle, as in Step 12. For finer control, submit with a fence and call vkWaitForFences.
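A fence‑based sketch (reusing submitInfo from Step 12):

VkFenceCreateInfo fenceInfo = {};
fenceInfo.sType = VK_STRUCTURE_TYPE_FENCE_CREATE_INFO;
VkFence fence;
vkCreateFence(device, &fenceInfo, nullptr, &fence);

vkQueueSubmit(computeQueue, 1, &submitInfo, fence);
vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX); // block until the GPU signals the fence
vkDestroyFence(device, fence, nullptr);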

Memory dependencies are handled by pipeline barriers. For example, if you write to a buffer then read it in a later dispatch, you need a barrier to ensure visibility.
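For example, a compute‑to‑compute barrier recorded between two dispatches that share a buffer might look like this (a sketch):

VkMemoryBarrier barrier = {};
barrier.sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT; // writes from the first dispatch...
barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;  // ...made visible to reads in the second
vkCmdPipelineBarrier(commandBuffer,
                     VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                     VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                     0, 1, &barrier, 0, nullptr, 0, nullptr);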

2.7 Performance Optimization

Workgroup size

Choose a workgroup size that is a multiple of the subgroup size (typically 32 or 64) and that maximizes occupancy. The relevant limits live in VkPhysicalDeviceProperties::limits: maxComputeWorkGroupSize and maxComputeWorkGroupInvocations.

Shared memory

Declare shared memory in GLSL with shared float tile[16][16]; and use barrier() for synchronization. Similar to Metal threadgroup memory.

Subgroup operations

Subgroup operations (Vulkan's analogue of warp intrinsics) are core as of Vulkan 1.1 and are exposed in GLSL through the GL_KHR_shader_subgroup_* extensions. They allow efficient reductions, scans, and shuffles within a subgroup.
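For instance, a per‑subgroup partial sum needs no shared memory at all; a minimal GLSL sketch (buffer names and layout are illustrative):

#version 450
#extension GL_KHR_shader_subgroup_basic : enable
#extension GL_KHR_shader_subgroup_arithmetic : enable
layout(local_size_x = 64) in;
layout(std430, binding = 0) readonly buffer In { float x[]; } data;
layout(std430, binding = 1) writeonly buffer Out { float sums[]; } partials;

void main() {
    float v = data.x[gl_GlobalInvocationID.x];
    float s = subgroupAdd(v);   // sum across the whole subgroup in one instruction
    if (subgroupElect())        // exactly one invocation per subgroup writes the result
        partials.sums[gl_WorkGroupID.x * gl_NumSubgroups + gl_SubgroupID] = s;
}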

Avoiding bank conflicts – same principle as Metal.

2.8 Example: Tiled Matrix Multiplication in Vulkan

Shader (GLSL):

#version 450
layout(local_size_x = 16, local_size_y = 16) in;

layout(std430, binding = 0) readonly buffer InputA { float data[]; } A;
layout(std430, binding = 1) readonly buffer InputB { float data[]; } B;
layout(std430, binding = 2) writeonly buffer OutputC { float data[]; } C;
layout(push_constant) uniform PushConsts { uint N; } pc;

shared float As[16][16];
shared float Bs[16][16];

void main() {
    uint row = gl_GlobalInvocationID.y;
    uint col = gl_GlobalInvocationID.x;
    uint localRow = gl_LocalInvocationID.y;
    uint localCol = gl_LocalInvocationID.x;

    float sum = 0.0;
    for (uint tile = 0; tile < pc.N / 16; ++tile) {
        As[localRow][localCol] = A.data[row * pc.N + (tile * 16 + localCol)];
        Bs[localRow][localCol] = B.data[(tile * 16 + localRow) * pc.N + col];
        barrier();

        for (uint k = 0; k < 16; ++k) {
            sum += As[localRow][k] * Bs[k][localCol];
        }
        barrier();
    }
    C.data[row * pc.N + col] = sum;
}

Host changes:

  • Declare a push‑constant range in the pipeline layout and pass N with vkCmdPushConstants (see the sketch after this list).
  • Dispatch with grid size (N/16, N/16) and local size (16,16); the shader assumes N is a multiple of 16.
  • Ensure the descriptor set binds buffers A, B, and C.
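A hedged sketch of those push‑constant changes (pipelineLayoutInfo, pipelineLayout, and commandBuffer are the objects from Part 2.5; N = 1024 is an assumed size):

// At pipeline-layout creation (extends Step 4):
VkPushConstantRange pcRange = {};
pcRange.stageFlags = VK_SHADER_STAGE_COMPUTE_BIT;
pcRange.offset = 0;
pcRange.size = sizeof(uint32_t);
pipelineLayoutInfo.pushConstantRangeCount = 1;
pipelineLayoutInfo.pPushConstantRanges = &pcRange;

// While recording, before the dispatch:
uint32_t N = 1024; // assumed matrix dimension, a multiple of 16
vkCmdPushConstants(commandBuffer, pipelineLayout, VK_SHADER_STAGE_COMPUTE_BIT,
                   0, sizeof(uint32_t), &N);
vkCmdDispatch(commandBuffer, N / 16, N / 16, 1);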

Part 3: Comparative Analysis and Cross‑Platform Strategies

3.1 Metal vs Vulkan Compute: Similarities and Differences

  • API style: Metal is Objective‑C/Swift (objects, references); Vulkan is C (handles, structures).
  • Shader language: MSL (C++‑based) vs. GLSL/HLSL compiled to SPIR‑V.
  • Resource binding: [[buffer(n)]] and [[texture(n)]] attributes vs. descriptor sets with bindings.
  • Memory model: unified on Apple Silicon (discrete otherwise) vs. fully explicit memory management.
  • Workgroup/threadgroup: threadgroup memory and threadgroup_barrier vs. shared memory and barrier().
  • Error handling: NSError and Swift errors vs. VkResult return codes.
  • Debugging/profiling: Xcode GPU Debugger and Metal System Trace vs. RenderDoc, the validation layers, and GPU vendor tools.

Both are explicit, low‑level, and provide fine‑grained control. Metal is slightly simpler to use, especially on Apple platforms, while Vulkan offers cross‑platform portability.

3.2 Writing Portable Compute Code

If you need to support both Metal and Vulkan, consider:

  • Writing compute kernels in a common source language (e.g., OpenCL C or SYCL, which aim to target multiple backends; that's beyond this guide).
  • Maintaining two separate codebases with shared algorithms.
  • Using conditional compilation and abstraction layers, or running Vulkan code on Apple platforms through MoltenVK (whose compute performance is generally good).

For compute kernels, the core logic is often identical; only the syntax for resource binding and built‑in variables differs. You can use macros to abstract differences.

Example kernel snippet (pseudo‑code; illustrative only, since Metal binds thread IDs through parameter attributes while GLSL exposes them as built‑in variables):

#ifdef METAL
#define GLOBAL_ID [[thread_position_in_grid]]
#define SHARED threadgroup
#define BARRIER threadgroup_barrier(mem_flags::mem_threadgroup)
#else
#define GLOBAL_ID gl_GlobalInvocationID
#define SHARED shared
#define BARRIER barrier()
#endif

Part 4: Advanced Topics

4.1 Indirect Dispatch

Both APIs support indirect dispatch where the workgroup counts are read from a buffer, allowing GPU‑generated workloads.

Metal: [encoder dispatchThreadgroupsWithIndirectBuffer:indirectBuffer indirectBufferOffset:0 threadsPerThreadgroup:threadgroupSize];

Vulkan: vkCmdDispatchIndirect with a buffer containing the dispatch dimensions.
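On the Vulkan side, the indirect buffer holds a VkDispatchIndirectCommand at the given offset; a sketch (the buffer must be created with VK_BUFFER_USAGE_INDIRECT_BUFFER_BIT):

// Three uint32_t group counts, written by the CPU or by a previous compute pass.
VkDispatchIndirectCommand cmd = { groupCountX, 1, 1 };
// ... copy cmd into indirectBuffer, then record:
vkCmdDispatchIndirect(commandBuffer, indirectBuffer, 0);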

4.2 GPU Timestamps and Queries

Metal: Use MTLCounterSampleBuffer (on supported devices) to sample GPU timestamps.

Vulkan: Use query pools (VK_QUERY_TYPE_TIMESTAMP) to record timestamps in command buffers.
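A sketch of timestamp queries around a dispatch (convert ticks to nanoseconds with VkPhysicalDeviceLimits::timestampPeriod):

VkQueryPoolCreateInfo queryInfo = {};
queryInfo.sType = VK_STRUCTURE_TYPE_QUERY_POOL_CREATE_INFO;
queryInfo.queryType = VK_QUERY_TYPE_TIMESTAMP;
queryInfo.queryCount = 2;
VkQueryPool queryPool;
vkCreateQueryPool(device, &queryInfo, nullptr, &queryPool);

// While recording:
vkCmdResetQueryPool(commandBuffer, queryPool, 0, 2);
vkCmdWriteTimestamp(commandBuffer, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, queryPool, 0);
vkCmdDispatch(commandBuffer, groupCountX, 1, 1);
vkCmdWriteTimestamp(commandBuffer, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, queryPool, 1);

// After the submission completes:
uint64_t ticks[2];
vkGetQueryPoolResults(device, queryPool, 0, 2, sizeof(ticks), ticks,
                      sizeof(uint64_t), VK_QUERY_RESULT_64_BIT | VK_QUERY_RESULT_WAIT_BIT);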

4.3 Interop with Graphics Pipelines

Both APIs allow mixing compute and graphics in the same command buffer. For example, you can use a compute shader to generate data for a vertex buffer, then render it.

4.4 Multi‑GPU (Vulkan) and Multi‑Device (Metal)

Vulkan supports multi‑GPU through device groups (a logical device spanning multiple physical devices, Vulkan 1.1+), but it's complex. In Metal, each GPU is a separate MTLDevice; you create separate command queues and distribute workloads manually.

4.5 Real‑world Example: Sobel Edge Detection

Let's outline a Sobel kernel in both APIs.

Metal Sobel (texture version):

constant float sobelX[3][3] = {{-1.0f, 0.0f, 1.0f}, {-2.0f, 0.0f, 2.0f}, {-1.0f, 0.0f, 1.0f}};
constant float sobelY[3][3] = {{-1.0f, -2.0f, -1.0f}, {0.0f, 0.0f, 0.0f}, {1.0f, 2.0f, 1.0f}};

kernel void sobel(texture2d<float, access::read> inTex [[texture(0)]],
                  texture2d<float, access::write> outTex [[texture(1)]],
                  uint2 gid [[thread_position_in_grid]]) {
    if (gid.x >= outTex.get_width() || gid.y >= outTex.get_height()) return;
    float sumX = 0.0f, sumY = 0.0f;
    for (int dy = -1; dy <= 1; dy++) {
        for (int dx = -1; dx <= 1; dx++) {
            // Clamp to the image edge to avoid out-of-bounds reads (gid is unsigned).
            uint2 coord = uint2(clamp(int(gid.x) + dx, 0, int(inTex.get_width()) - 1),
                                clamp(int(gid.y) + dy, 0, int(inTex.get_height()) - 1));
            float val = inTex.read(coord).r;
            sumX += val * sobelX[dy + 1][dx + 1];
            sumY += val * sobelY[dy + 1][dx + 1];
        }
    }
    float mag = min(1.0f, length(float2(sumX, sumY)));
    outTex.write(float4(mag, mag, mag, 1.0f), gid);
}

Vulkan Sobel (buffer version, assuming image stored as float array):

#version 450
layout(local_size_x = 16, local_size_y = 16) in;
layout(std430, binding = 0) readonly buffer Input { float pixels[]; } inBuf;
layout(std430, binding = 1) writeonly buffer Output { float pixels[]; } outBuf;
layout(push_constant) uniform Dims { uint width; uint height; } dims;

const float sobelX[3][3] = {{-1.0, 0.0, 1.0}, {-2.0, 0.0, 2.0}, {-1.0, 0.0, 1.0}};
const float sobelY[3][3] = {{-1.0, -2.0, -1.0}, {0.0, 0.0, 0.0}, {1.0, 2.0, 1.0}};

void main() {
    uint x = gl_GlobalInvocationID.x;
    uint y = gl_GlobalInvocationID.y;
    if (x >= dims.width || y >= dims.height) return;
    float sumX = 0.0, sumY = 0.0;
    for (int dy = -1; dy <= 1; dy++) {
        for (int dx = -1; dx <= 1; dx++) {
            int nx = int(x) + dx, ny = int(y) + dy;
            if (nx < 0 || nx >= int(dims.width) || ny < 0 || ny >= int(dims.height)) continue;
            float val = inBuf.pixels[ny * int(dims.width) + nx];
            sumX += val * sobelX[dy + 1][dx + 1];
            sumY += val * sobelY[dy + 1][dx + 1];
        }
    }
    float mag = min(1.0, length(vec2(sumX, sumY)));
    outBuf.pixels[y * dims.width + x] = mag;
}

Conclusion

Congratulations! You've journeyed through the depths of Metal compute on Apple Silicon and Vulkan compute. You now know how to set up the environment, write kernels, manage memory, optimize performance, and implement complex patterns like matrix multiplication and edge detection.

Next Steps

  • Metal: Explore Metal Performance Shaders (MPS) for highly optimized primitives (convolution, matrix multiplication, etc.). Look into Metal Ray Tracing for advanced graphics.
  • Vulkan: Dive into the Vulkan specification, experiment with subgroup operations, and try the Vulkan Ray Tracing extensions. Explore Vulkan Video for encode/decode.
  • Cross‑platform: Consider using SYCL or OpenCL for broader portability, but now you have the low‑level knowledge to understand what's happening under the hood.

The world of GPU compute is vast and exciting. Keep experimenting, profiling, and optimizing. And remember – the best way to learn is to write code. So fire up Xcode or your Vulkan SDK and start computing!

If you found this guide helpful, share it with your fellow developers. Have questions or want to share your own experiences? Leave a comment below. Happy coding!
