Have you ever looked at the x86 SSE or AVX registers in a debugger and wanted to use their bytes as a contiguous on-core buffer?

XMM registers

Clearly you can't get pointers into a buffer fashioned out of registers, but it turns out that such a construction isn't hard to implement with some careful considerations in mind. We'll be using the g++ compiler in this post because MSVC doesn't support inline assembly for x64 targets. We will be focusing on Windows and will be using MinGW to compile our code. We could also write this technique entirely in assembly, but that option isn't as enticing because switch statements, templates, and classes make things very convenient. We will be targeting 64-bit processors that support SSE4.1 (needed for pinsrq and pextrq), which allows us to only use XMM registers (as opposed to the newer YMM or ZMM extensions). The goal is to create an array ADT whose backing storage is the XMM registers.

The reason we won't be considering other operating systems is that the System V ABI doesn't preserve any of the XMM registers across calls and puts the burden on the caller to save them on the stack. If you think about it, that defeats the purpose of a register buffer, since we would always be pushing our bytes to memory in user space.

Now that that's out of the way, let's take a look at the implementation.

XMM Registers

In 64-bit mode there are 16 XMM registers, each 16 bytes wide, which are meant to be used for SIMD and floating point operations. We won't be using them for their intended purpose and instead just want their bytes as storage space. The Microsoft x64 ABI treats XMM0-XMM5 as volatile registers: they aren't preserved across function calls and are used, among other things, to pass doubles to functions and to perform floating point operations (x87 is all but deprecated these days). Registers XMM6-XMM15, however, must be preserved by the callee. We'll be using these registers to form our buffer so that we won't have to worry about functions inadvertently clobbering our bytes. If a function decides to use one of our buffer registers, it has to save and restore it using the stack per the rules of the ABI. This sort of makes the title of this post a misnomer, but the kernel also saves our registers on a context switch, so RAM cannot be completely avoided. We're just having fun in the end anyway.

Since we'll be using XMM6-XMM15 that gives us 10 registers * 16 bytes = 160 bytes to play with in our buffer. But how do we manipulate the bytes in our buffer? There are a few instructions that will come in handy:


The movdqa instruction moves a double quadword (2 * 8 = 16 bytes) between two XMM registers. An example of its use would be:

movdqa xmm0, xmm5   ; mov the dqword value of xmm5 into xmm0

We won't be using this to move memory into an XMM register, so we don't have to worry about its alignment requirement (a movdqa memory operand must be 16-byte aligned). There are different mov mnemonics for different sizes of data. Another MMX/SSE variant is movq, which deals with 8-byte qword values rather than 16 bytes at a time.


The pinsr* family of instructions allows us to insert data at a specified offset into an XMM register. The suffix specifiers to this instruction will select the size of data we want to insert into the XMM. Let's look at some examples:

; Insert the qword value of rax into the 0th location of xmm0.
; If xmm0 = 0x00000000000000000000000000000000 and rax = 0xAAAAAAAAAAAAAAAA
; after the following instruction executes xmm0 = 0x0000000000000000AAAAAAAAAAAAAAAA
pinsrq xmm0, rax, 0

; Using the same initial values as above, after this instruction executes
; xmm0 = 0xAAAAAAAAAAAAAAAA0000000000000000. The immediate constant specifies
; the offset multiple of the data size rather than the byte offset.
pinsrq xmm0, rax, 1 ; insert the qword value of rax into the 1*8th byte location of xmm0.

; Last example. Assume that ebx = 0xFFFFFFFF and xmm0 is zero'd out. This instruction
; will produce xmm0 = 0x00000000FFFFFFFF0000000000000000 because it's putting the 
; dword of -1 into the 2*4=8th byte position in the number.
pinsrd xmm0, ebx, 2

We'll be using this instruction to insert data into our buffer. However, we will only be using the pinsrq size variant.
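To make the indexing rule concrete, here is a minimal portable sketch that models the pinsr* behaviour described above in ordinary memory. The `Xmm` struct and the `insert_lane` and `qword_at` helpers are hypothetical names invented for this illustration; they are not instructions or part of the implementation below, and the model assumes a little-endian host such as x86.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for one XMM register: 16 bytes, byte 0 least significant.
struct Xmm {
    unsigned char bytes[16] = {};
};

// Models pinsrb/pinsrw/pinsrd/pinsrq: `lane` counts in units of sizeof(T),
// not in bytes, just like the immediate operand of the real instructions.
template <typename T>
void insert_lane(Xmm& reg, T value, std::size_t lane) {
    assert(lane < sizeof(reg.bytes) / sizeof(T));
    std::memcpy(reg.bytes + lane * sizeof(T), &value, sizeof(T));
}

// Reads the low (lane 0) or high (lane 1) 8 bytes back as an integer.
// Assumes a little-endian host.
std::uint64_t qword_at(const Xmm& reg, std::size_t lane) {
    assert(lane < 2);
    std::uint64_t out;
    std::memcpy(&out, reg.bytes + lane * 8, sizeof(out));
    return out;
}
```

With a zeroed `Xmm`, `insert_lane<std::uint32_t>(reg, 0xFFFFFFFF, 2)` writes bytes 8 through 11, so the high qword reads back as 0x00000000FFFFFFFF while the low qword stays zero, matching the pinsrd example.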


The pextr* family of instructions extracts data at specified offsets. It's the inverse of pinsr*. We will be using this instruction to extract data out of our buffer.

; Get the first 8 bytes out of the xmm0 register and move them into rax
pextrq rax, xmm0, 0

; Get the second 8 bytes out of the xmm0 register and move them into rbx
pextrq rbx, xmm0, 1

That's it. We'll only be using these three instructions to do everything we need. Let's move on to seeing how we can write this in C++.

The Implementation

The first thing we want to do is write a helper function which takes a byte offset and returns the qword at that offset. Every XMM register holds two qwords, so we also need a function called get_reg_in_xmm0 which copies the dqword XMM register containing our qword into XMM0, a volatile register whose value we don't need to preserve. Once the dqword we need is in XMM0, we extract the correct 8 bytes from it. The pextrq instruction requires the extraction index to be an immediate constant rather than a register, so we check whether we need the high or low qword and use the matching instruction to move it into rax. After that we can return the value that we extracted.

uint64_t register_buffer::get_qword(std::size_t offset)
{
    // Copy the XMM register that holds this offset into the scratch register xmm0.
    get_reg_in_xmm0(offset);

    // is offset an odd multiple of 8?
    if((offset / 8) & 1) {
        asm("pextrq rax, xmm0, 1"); // extract the high qword
    } else {
        asm("pextrq rax, xmm0, 0"); // extract the low qword
    }

    uint64_t buf_data;
    asm("mov %0, rax":"=r"(buf_data));

    return buf_data;
}

For the above function to work we need to be able to move the value of the XMM register corresponding to the requested offset into XMM0. This is easy enough using integer division to find the exact register that holds the data we need:

void register_buffer::get_reg_in_xmm0(std::size_t offset) {
    switch(offset / 16) {
        case 0: asm("movdqa xmm0, xmm6");  break;
        case 1: asm("movdqa xmm0, xmm7");  break;
        case 2: asm("movdqa xmm0, xmm8");  break;
        case 3: asm("movdqa xmm0, xmm9");  break;
        case 4: asm("movdqa xmm0, xmm10"); break;
        case 5: asm("movdqa xmm0, xmm11"); break;
        case 6: asm("movdqa xmm0, xmm12"); break;
        case 7: asm("movdqa xmm0, xmm13"); break;
        case 8: asm("movdqa xmm0, xmm14"); break;
        case 9: asm("movdqa xmm0, xmm15"); break;
    }
}
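The offset arithmetic used by these helpers can be checked at compile time without touching any registers. The sketch below is only an illustration: `Location`, `locate`, and the `k`-prefixed constants are hypothetical names that do not appear in the class.

```cpp
#include <cstddef>

constexpr std::size_t kFirstXmm = 6;    // XMM6 is the first callee-saved register
constexpr std::size_t kXmmCount = 10;   // XMM6 through XMM15
constexpr std::size_t kXmmBytes = 16;
constexpr std::size_t kBufferBytes = kXmmCount * kXmmBytes;

struct Location {
    std::size_t xmm;    // register number, 6..15
    std::size_t qword;  // 0 = low 8 bytes, 1 = high 8 bytes
    std::size_t byte;   // byte position inside that qword, 0..7
};

// Maps a byte offset in the buffer to the register, qword, and byte it lands in.
constexpr Location locate(std::size_t offset) {
    return Location{kFirstXmm + offset / kXmmBytes, (offset / 8) & 1, offset % 8};
}

static_assert(kBufferBytes == 160, "10 registers * 16 bytes");
static_assert(locate(0).xmm == 6 && locate(0).qword == 0, "offset 0 is the low half of XMM6");
static_assert(locate(24).xmm == 7 && locate(24).qword == 1, "offset 24 is the high half of XMM7");
static_assert(locate(159).xmm == 15 && locate(159).byte == 7, "offset 159 is the top byte of XMM15");
```

The `offset / 16` term matches the switch in get_reg_in_xmm0, and `(offset / 8) & 1` matches the high/low test in get_qword.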

We can get qwords out of our buffer, but we need to be able to set them as well. To do that we repeat the process of copying the XMM register that contains the offset into XMM0. Then we check whether the offset corresponds to the high or low qword currently in XMM0, move the user-specified uint64_t value into rax, and insert it into the right spot within XMM0. Finally we do the reverse of get_reg_in_xmm0 and move the dirty value we just created back to the correct XMM register for the specified offset.

void register_buffer::set_qword(std::size_t offset, uint64_t val)
{
    // Copy the XMM register that holds this offset into the scratch register xmm0.
    get_reg_in_xmm0(offset);

    // Overwrite the high or low qword of xmm0 with val.
    if((offset / 8) & 1) {
        asm("mov rax, %0"::"r"(val));
        asm("pinsrq xmm0, rax, 1");
    } else {
        asm("mov rax, %0"::"r"(val));
        asm("pinsrq xmm0, rax, 0");
    }

    // Write the modified dqword back to the register it came from.
    switch(offset / 16) {
        case 0: asm("movdqa xmm6,  xmm0"); break;
        case 1: asm("movdqa xmm7,  xmm0"); break;
        case 2: asm("movdqa xmm8,  xmm0"); break;
        case 3: asm("movdqa xmm9,  xmm0"); break;
        case 4: asm("movdqa xmm10, xmm0"); break;
        case 5: asm("movdqa xmm11, xmm0"); break;
        case 6: asm("movdqa xmm12, xmm0"); break;
        case 7: asm("movdqa xmm13, xmm0"); break;
        case 8: asm("movdqa xmm14, xmm0"); break;
        case 9: asm("movdqa xmm15, xmm0"); break;
    }
}

The reason that we have been working on qwords instead of other datatypes right away is primarily because the pinsr* and pextr* instructions require an immediate constant offset into the specified XMM register. Currently we only need to specify 0 or 1 as the offset because there are only two qwords per XMM register. If we had more offsets because the datatype used was smaller, we would need to specify yet more offsets and have even more tests to see which exact part the user-specified offset fell under. By just operating on qwords and then (as we'll see in a moment) modifying the data within the obtained qwords at a higher level of abstraction, we save ourselves a lot of trouble.
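As a rough illustration of that cost, the sketch below models dword extraction in portable C++. The lane is a template parameter to mirror the immediate operand, so a lane chosen at runtime needs one case per possible immediate. `Xmm`, `extract_dword`, and `extract_dword_runtime` are hypothetical names for this model only.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model of one XMM register viewed as four dword lanes.
using Xmm = std::array<std::uint32_t, 4>;

// The lane is a compile-time constant, the same way pextrd takes it as an immediate.
template <std::size_t Lane>
std::uint32_t extract_dword(const Xmm& reg) {
    static_assert(Lane < 4, "a 16-byte register has four dword lanes");
    return reg[Lane];
}

// A lane known only at runtime has to be dispatched to one instantiation per
// immediate: four cases for dwords, and it would be sixteen for bytes.
std::uint32_t extract_dword_runtime(const Xmm& reg, std::size_t lane) {
    assert(lane < 4);
    switch (lane) {
        case 0:  return extract_dword<0>(reg);
        case 1:  return extract_dword<1>(reg);
        case 2:  return extract_dword<2>(reg);
        default: return extract_dword<3>(reg);
    }
}
```

Sticking to qwords keeps that dispatch down to a single high/low test per access.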

Now that we have our support functions for getting/setting qword values, let's extend that to getting/setting any datatype as long as it is of byte/word/dword/qword size:

#include <cstddef>
#include <cstdint>

class register_buffer
{
public:
    register_buffer(register_buffer const&) = delete;
    void operator=(register_buffer const&) = delete;

    // register_buffer is a singleton because there is only one instance of 
    // the XMM registers. You cannot share this buffer between multiple threads as each
    // thread gets its own buffer (that is not accounted for in this implementation)
    static register_buffer& instance()
    {
        static register_buffer rb;
        return rb;
    }

    template<typename T>
    T get(std::size_t index)
    {
        static_assert(sizeof(T) == 1 ||
                      sizeof(T) == 2 ||
                      sizeof(T) == 4 ||
                      sizeof(T) == 8, "Invalid get<T> type size");

        // Get logical index into our buffer ADT and obtain the containing qword
        // from the logical index
        std::size_t offset = index * sizeof(T);
        uint64_t qword_val = get_qword(offset);

        // Position the requested value to be in the least significant bytes of 
        // the qword before casting the qword to be of size T.
        uint64_t ret = qword_val >> (offset % 8) * 8;
        return *reinterpret_cast<T*>(&ret);
    }

    template<typename T>
    void set(std::size_t index, T val)
    {
        static_assert(sizeof(T) == 1 ||
                      sizeof(T) == 2 ||
                      sizeof(T) == 4 ||
                      sizeof(T) == 8, "Invalid set<T> type size");

        // Get previously existing qword that contains the bytes we
        // want to set
        std::size_t offset = index * sizeof(T);
        uint64_t qword_val = get_qword(offset);

        // Round down to the nearest multiple of 8 and then obtain
        // the distance between that multiple and the current offset.
        // This lets us easily set the correct bytes by using
        // pointer arithmetic
        std::size_t reg_index = (offset - (offset & ~7)) / sizeof(T);
        *(reinterpret_cast<T*>(&qword_val) + reg_index) = val;

        set_qword(offset, qword_val);
    }

    // num xmm regs in buf * xmm reg size
    const std::size_t size = 10 * 16;

private:
    static void get_reg_in_xmm0(std::size_t offset);
    static uint64_t get_qword(std::size_t offset);
    static void set_qword(std::size_t offset, uint64_t val);

    register_buffer() {}
};
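The shifting and byte-placement logic in get and set can be exercised on any machine by swapping the registers for a plain array. The following is a sketch of that idea, not the class above: `model_buffer` is a hypothetical name, the storage lives in ordinary memory, and it uses memcpy where the class uses pointer casts. It assumes a little-endian host.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for XMM6-XMM15: twenty qwords held in ordinary memory.
class model_buffer {
public:
    static constexpr std::size_t size = 10 * 16;

    template <typename T>
    T get(std::size_t index) const {
        static_assert(sizeof(T) == 1 || sizeof(T) == 2 ||
                      sizeof(T) == 4 || sizeof(T) == 8, "Invalid get<T> type size");
        const std::size_t offset = index * sizeof(T);
        assert(offset + sizeof(T) <= size);
        // Shift the requested bytes down to the bottom of the containing qword,
        // then keep only the lowest sizeof(T) bytes.
        const std::uint64_t shifted = qwords_[offset / 8] >> ((offset % 8) * 8);
        T out;
        std::memcpy(&out, &shifted, sizeof(T));
        return out;
    }

    template <typename T>
    void set(std::size_t index, T val) {
        static_assert(sizeof(T) == 1 || sizeof(T) == 2 ||
                      sizeof(T) == 4 || sizeof(T) == 8, "Invalid set<T> type size");
        const std::size_t offset = index * sizeof(T);
        assert(offset + sizeof(T) <= size);
        // Read-modify-write: patch sizeof(T) bytes inside a copy of the qword,
        // then store the whole qword back.
        std::uint64_t qword = qwords_[offset / 8];
        std::memcpy(reinterpret_cast<unsigned char*>(&qword) + offset % 8, &val, sizeof(T));
        qwords_[offset / 8] = qword;
    }

private:
    std::uint64_t qwords_[size / 8] = {};
};
```

Because a T-sized value at index * sizeof(T) never straddles a qword boundary, one read-modify-write of a single qword is always enough.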

Now that our class is fully implemented let's play around with our buffer:

#include <cstdio>
#include <cstdint>

#include "register_buffer.h"

int main()
{
    auto& rb = register_buffer::instance();

    rb.set<uint32_t>(0, 0xAAAAAAAA);
    rb.set<uint32_t>(1, 0xBBBBBBBB);
    rb.set<uint32_t>(2, 0xCCCCCCCC);
    rb.set<uint32_t>(3, 0xDDDDDDDD);
    rb.set<uint32_t>(4, 0xEEEEEEEE);
    rb.set<uint32_t>(5, 0xFFFFFFFF);

    printf("%016llX\n", rb.get<uint64_t>(0));
    printf("%016llX\n", rb.get<uint64_t>(1));
    printf("%016llX\n\n", rb.get<uint64_t>(2));

    rb.set<uint64_t>(1, 0x1111111111111111);
    printf("%016llX\n\n", rb.get<uint64_t>(1));

    rb.set<uint16_t>(5, 0x2222);
    printf("%016llX\n", rb.get<uint64_t>(1));
    rb.set<uint16_t>(6, 0x3333);
    printf("%016llX\n", rb.get<uint64_t>(1));
    rb.set<uint16_t>(7, 0x4444);
    printf("%016llX\n\n", rb.get<uint64_t>(1));

    rb.set<uint8_t>(8, 0x00);
    printf("%016llX\n\n", rb.get<uint64_t>(1));
}

Or how about something more silly? Let's store a string in there:

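#include <cstddef>
#include <iostream>
#include <string>

#include "register_buffer.h"

int main()
{
    auto& rb = register_buffer::instance();

    std::string input;
    std::cin >> input;

    // Copy at most rb.size characters of the input into the register buffer.
    for(std::size_t i = 0; i < input.size() && i < rb.size; i++) {
        rb.set<char>(i, input[i]);
    }

    // Read the same characters back out and print them.
    for(std::size_t i = 0; i < input.size() && i < rb.size; i++) {
        std::cout << rb.get<char>(i);
    }
}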

This is great and all, but what about higher optimization levels? That's where things get hairy. It's best to completely disable gcc optimizations for this class, as the implementation shown here has problems at -O2 and above: the asm statements don't declare which registers they read or write, so the compiler is free to use rax and xmm0 for its own purposes between them. Readability is a major factor here, and we wouldn't gain much by obfuscating our code (and increasing its size) to make it work under higher optimization levels.

A commenter pointed out on the Hacker News post that inline assembly shouldn't be split up around the code as shown here. That is true and if you were to actually use this idea (for some reason) you should be more careful with how the compiler clobbers registers or assumes registers to contain certain values. For this article I think that it's best to demonstrate the concept rather than write the most correct code.
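For comparison, here is a sketch of what a more careful extraction could look like: one asm statement with declared operands, so nothing has to survive in rax or xmm0 between statements. This is not the article's class. `extract_qword` is a hypothetical helper, the compiler chooses the XMM register rather than us pinning XMM6-XMM15, and the `{att|intel}` alternatives are an assumption that lets the same string assemble with or without -masm=intel on GCC-compatible x86-64 compilers.

```cpp
#include <cstdint>
#include <emmintrin.h>  // __m128i and _mm_set_epi64x (SSE2, baseline on x86-64)

// Hypothetical helper: extracts the low or high qword of a 16-byte value.
// "x" asks for any XMM register as input and "=r" for any general purpose
// register as output, so the compiler knows exactly what the statement touches.
inline std::uint64_t extract_qword(__m128i reg, bool high) {
    std::uint64_t out;
    if (high) {
        asm("pextrq {$1, %1, %0|%0, %1, 1}" : "=r"(out) : "x"(reg));
    } else {
        asm("pextrq {$0, %1, %0|%0, %1, 0}" : "=r"(out) : "x"(reg));
    }
    return out;
}
```

The trade-off is that the data now lives wherever the compiler puts it, which may well be the stack, so this illustrates the commenter's point rather than replacing the register buffer.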

The test program was compiled with the following command:

x86_64-w64-mingw32-g++ -std=c++11 -masm=intel -static-libgcc \
                       -static-libstdc++ -o regbuf.exe main.cpp register_buffer.cpp


And that's it. Now we can treat XMM registers as a single buffer. Hopefully you found this interesting if you made it this far. If you find a good use for this technique I would love to hear from you!

If you want to learn more about x86 or x64 assembly language check out the bottom of my import by hash post where I link to multiple resources.