Problem

.bss (Byte Saving Section 😉) sections are extremely useful for saving on size when dealing with uninitialized data. However, in tiny executables where every byte counts, they can take up more space in the binary than you might want.

Consider an entry in your PE's section table:

> dt -r IMAGE_SECTION_HEADER
uxtheme!IMAGE_SECTION_HEADER
   +0x000 Name             : [8] UChar
   +0x008 Misc             : _IMAGE_SECTION_HEADER::<unnamed-type-Misc>
      +0x000 PhysicalAddress  : Uint4B
      +0x000 VirtualSize      : Uint4B
   +0x00c VirtualAddress   : Uint4B
   +0x010 SizeOfRawData    : Uint4B
   +0x014 PointerToRawData : Uint4B
   +0x018 PointerToRelocations : Uint4B
   +0x01c PointerToLinenumbers : Uint4B
   +0x020 NumberOfRelocations : Uint2B
   +0x022 NumberOfLinenumbers : Uint2B
   +0x024 Characteristics  : Uint4B

That's 40 bytes of header (the last field sits at offset 0x24 and is 4 bytes wide) to describe a single section table entry for your uninitialized data. Can we do better if we just want a place to stuff our uninitialized data?
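As a quick sanity check on that size, the fields from the dt listing can be tallied with Python's struct module. This is only a sketch: the format string is my own transcription of the listing, not something taken from a Windows header.

```python
import struct

# One format code per field of IMAGE_SECTION_HEADER, in listing order.
# "<" means little-endian with no implicit padding.
SECTION_HEADER_FORMAT = (
    "<"
    "8s"  # Name
    "I"   # Misc (PhysicalAddress / VirtualSize share these 4 bytes)
    "I"   # VirtualAddress
    "I"   # SizeOfRawData
    "I"   # PointerToRawData
    "I"   # PointerToRelocations
    "I"   # PointerToLinenumbers
    "H"   # NumberOfRelocations
    "H"   # NumberOfLinenumbers
    "I"   # Characteristics
)

SECTION_HEADER_SIZE = struct.calcsize(SECTION_HEADER_FORMAT)
print(hex(SECTION_HEADER_SIZE))  # 0x28
```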

Solution

One solution would be to use the very top of the allocated stack region (upper limit) as a place to store your data. Since the Windows loader will allocate pages for the stack of your main thread when your process is initialized, we can tell the loader to allocate a little extra for us to use for general purpose data storage.

[Figure: Target stack layout]

To do this you have to take into consideration how Windows allocates memory for thread stacks.

Thread stacks and guard pages

The default stack size that most compilers and assemblers specify to be reserved in your PE headers is 1 MB. However, the entire 1 MB of stack space isn't usable all at once. That 1 MB portion of your process' address space is only reserved, meaning that no other allocation can overlap that contiguous region, but there is no backing page frame for it in physical memory. To actually use pages in the reserved region, you have to commit them. Committing a page charges it against the page file should it need to be paged out, and the memory manager fills in the corresponding page table entry with a free physical page the first time the page is touched. By default, the first page at the bottom of the stack (BOS) is committed and immediately usable before your entry point is called by the loader.

[Figure: Reserved stack memory]

How does the number of committed stack pages grow so that we can use the necessary memory as rsp moves in the direction of the upper limit? That's the job of the stack guard page. Guard pages are special pages that trigger an exception when accessed. You can register handler functions for these exceptions and do processing based on who touched the guarded page. In the case of a stack guard page, Windows catches this exception, commits the page on demand, and makes the next stack page the new guard page. This allows the OS to avoid using physical memory for stacks unless absolutely needed. Why would this be needed? Imagine a process with hundreds or even thousands of threads, each with a multi-megabyte stack. This gets expensive very quickly, and committing on demand helps save physical memory. Once a page has been committed, it will not be decommitted when the stack later shrinks back above it.

Implementing a guard page feature in an OS is fairly straightforward, as all pages in the guard region just need to be marked invalid in the page tables. On access, the kernel page fault handler can look in either the page tables or a data structure describing regions of the virtual address space to decide whether the page is really invalid, in which case it triggers an application exception, or whether a callback needs to be fired off.
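To make the commit-on-demand behavior concrete, here is a toy model in Python. It is a deliberate simplification and not how the Windows memory manager is implemented: the ToyStack class, its fields, and the returned strings are all made up for illustration, and it ignores details such as the uncommitted pages Windows keeps at the very end of the stack.

```python
PAGE_SIZE = 0x1000

class ToyStack:
    """Hypothetical model: a reserved region, a committed part, one guard page."""

    def __init__(self, base, reserve, commit=PAGE_SIZE):
        self.base = base                     # highest address (exclusive), the BOS end
        self.limit = base - reserve          # lowest reserved address
        self.committed_low = base - commit   # lowest committed address

    def guard_page(self):
        # The guard page sits directly below the committed pages.
        return self.committed_low - PAGE_SIZE

    def touch(self, addr):
        if not (self.limit <= addr < self.base):
            raise MemoryError("outside the reserved stack region")
        if addr >= self.committed_low:
            return "ok"
        if addr >= self.guard_page():
            # Guard page hit: commit it and move the guard one page down.
            self.committed_low -= PAGE_SIZE
            return "guard hit, page committed"
        raise MemoryError("access violation: reserved but not committed")

stack = ToyStack(base=0x200000, reserve=0x100000)
print(stack.touch(0x1FFFF8))  # ok
print(stack.touch(0x1FEFF8))  # guard hit, page committed
print(stack.touch(0x1FEFF8))  # ok
```

In this model, touching an address that is below the current guard page raises an error instead of growing the stack. That is the situation the next paragraphs deal with.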

An interesting caveat of having stack guard pages arises when one wants to allocate a buffer on the stack greater than the guard page region:

[Figure: Stack guard page]

As you can see, this would skip the guard page if the required stack region hasn't been committed before. For this reason, MSVC will implicitly call _chkstk for large stack allocations behind the scenes to probe the stack and touch as many guard pages as necessary to make sure there is enough committed memory for the buffer.
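The arithmetic behind that probing can be sketched in a few lines of Python. This is not the actual _chkstk implementation, only the idea of touching one address per page on the way down; both function names and the example addresses are hypothetical.

```python
PAGE_SIZE = 0x1000

def skips_guard_page(rsp, alloc_size, guard_page):
    """True if the lowest byte of the new buffer lies below the guard page."""
    return rsp - alloc_size < guard_page

def probe_addresses(rsp, alloc_size):
    """One address per page, walking down from the page below rsp to the
    page that holds the lowest byte of the allocation."""
    lowest_page = (rsp - alloc_size) & ~(PAGE_SIZE - 1)
    page = (rsp & ~(PAGE_SIZE - 1)) - PAGE_SIZE
    touched = []
    while page >= lowest_page:
        touched.append(page)
        page -= PAGE_SIZE
    return touched

# Hypothetical state: rsp in the only committed page, guard page right below it.
rsp, guard = 0x1FFF00, 0x1FE000
print(skips_guard_page(rsp, 0x3000, guard))              # True
print([hex(a) for a in probe_addresses(rsp, 0x3000)])
# ['0x1fe000', '0x1fd000', '0x1fc000']
```

Touching the pages in that order hits the guard page each time it moves, so every page the buffer needs is committed before the buffer is used.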

All this to say, if we blindly tried to use the extra stack memory we requested at the upper limit, it wouldn't be committed, and touching it would crash our process with an access violation. Now that we understand the difference between reserved and committed memory, this is a non-issue: we can tell the loader how much of our stack to commit before our entry point is called.

Updating PE header stack sizes

We want to change the amount of memory that is both reserved and committed for our stack in our binary's PE headers. To do this we need to alter the two fields in IMAGE_OPTIONAL_HEADER shown below:

> dt IMAGE_OPTIONAL_HEADER 
uxtheme!IMAGE_OPTIONAL_HEADER
...
   +0x048 SizeOfStackReserve : Uint8B
   +0x050 SizeOfStackCommit : Uint8B
...
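For an already built file, the two offsets from the listing are enough to read or patch both fields. The Python sketch below assumes a PE32+ image held in a bytearray; the helper names are my own, and the only layout facts it relies on are e_lfanew at 0x3C, the 4-byte PE signature, the 20-byte file header, and the 0x48/0x50 offsets shown above.

```python
import struct

PE32_PLUS_MAGIC = 0x20B
OFF_STACK_RESERVE = 0x48   # SizeOfStackReserve, from the dt listing
OFF_STACK_COMMIT = 0x50    # SizeOfStackCommit, directly after the reserve field

def optional_header_offset(image):
    """File offset of the optional header in a PE32+ image."""
    (e_lfanew,) = struct.unpack_from("<I", image, 0x3C)
    if image[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    opt = e_lfanew + 4 + 20            # skip the signature and the file header
    (magic,) = struct.unpack_from("<H", image, opt)
    if magic != PE32_PLUS_MAGIC:
        raise ValueError("not a PE32+ image")
    return opt

def get_stack_sizes(image):
    """Return (SizeOfStackReserve, SizeOfStackCommit)."""
    opt = optional_header_offset(image)
    return struct.unpack_from("<QQ", image, opt + OFF_STACK_RESERVE)

def set_stack_sizes(image, reserve, commit):
    """Overwrite both fields in place; image must be a bytearray."""
    opt = optional_header_offset(image)
    struct.pack_into("<QQ", image, opt + OFF_STACK_RESERVE, reserve, commit)
```

A usage sketch would read the file into a bytearray, call set_stack_sizes(image, 0x200000, 0x200000), and write the bytes back out.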

If you are trying to squeeze every last byte out of your files you are probably building your PE headers manually. Take note of how the SizeOfStackReserve and SizeOfStackCommit fields are assigned to the same value:

; Commit 2 MB of virtual address space for the stack instead of the
; normal 1 MB default. We'll use the upper meg for data storage
STACK_COMMIT equ 0x200000
; ...
IMAGE_OPTIONAL_HEADER:
    IMAGE_NT_OPTIONAL_HDR64_MAGIC = 20Bh
    IMAGE_SUBSYSTEM_WINDOWS_GUI   = 2 

    .Magic                      dw IMAGE_NT_OPTIONAL_HDR64_MAGIC
    .MajorLinkerVersion         db 0
    .MinorLinkerVersion         db 0
    .SizeOfCode                 dd 0
    .SizeOfInitializedData      dd 0C0h
    .SizeOfUninitializedData    dd 0
    .AddressOfEntryPoint        dd begin
    .BaseOfCode                 dd IMAGE_NT_HEADERS
    .ImageBase                  dq 140000000h
    .SectionAlignment           dd 10h
    .FileAlignment              dd 10h
    .MajorOperatingSystemVer    dw 5
    .MinorOperatingSystemVer    dw 2
    .MajorImageVersion          dw 0
    .MinorImageVersion          dw 0
    .MajorSubsystemVersion      dw 5
    .MinorSubsystemVersion      dw 2
    .Win32VersionValue          dd 0
    .SizeOfImage                dd end_file
    .SizeOfHeaders              dd begin
    .CheckSum                   dd 0
    .Subsystem                  dw IMAGE_SUBSYSTEM_WINDOWS_GUI
    .DllCharacteristics         dw 0
    .SizeOfStackReserve         dq STACK_COMMIT     ; ATTENTION
    .SizeOfStackCommit          dq STACK_COMMIT     ; ATTENTION
    .SizeOfHeapReserve          dq 100000h
    .SizeOfHeapCommit           dq 1000h
    .LoaderFlags                dd 0
    .NumberOfRvaAndSizes        dd 0
; ...

Note that the above header definition can be made even smaller with various methods that will be discussed in a later article. If you aren't building your headers manually or are using something like a packer or compressing linker, then you can use the trusty editbin and its /STACK:reserve,commit option (for example, /STACK:0x200000,0x200000 followed by the name of your executable) to change SizeOfStackReserve and SizeOfStackCommit.

This will make our stack look like the following:

[Figure: Fully committed stack]

The stack is now larger than before, and we can start storing data in the newly committed upper area. If there is concern about the stack growing into our uninitialized data block, we can always bump up the reserved/committed stack size, though deep stacks can be an indicator of poor program design.

Locating the upper limit

Now that our stack is correctly initialized at load time it's not difficult to locate the upper limit to start placing our data in the newly committed space:

PAGE_SIZE = 1000h

; We first want to copy the current stack pointer into rax so that we can 
; change it
mov rax, rsp

; We then want to align down to the nearest page boundary. This is normally
; done like: (rax & ~(PAGE_SIZE - 1)). Because pages are normally 4 KB in
; size, their hex representation is 1000h. To align to a 4 KB boundary we
; have to clear the bottom 3 nybbles, which is 4*3 = 12 least significant
; bits. We could also accomplish the alignment with a shr immediately
; followed by a shl to clear out the bottom 12 bits, but that pair encodes
; to 8 bytes versus 6 for the single and instruction.
and rax, not (PAGE_SIZE - 1)

; Now we have the base address of the last page in our committed stack
; region in our process address space. We want to get to the start of 
; our committed region which we know is of STACK_COMMIT size. All we 
; have to do then is (rax - STACK_COMMIT) right? No, since we're not
; at the very end of the region and at the base of the last page, we have
; to roll rax back (STACK_COMMIT - PAGE_SIZE) to not overshoot our destination :)
sub rax, STACK_COMMIT - PAGE_SIZE

And there we have it, the upper limit is now in rax. We first need to align to a page boundary because we aren't sure exactly where in the BOS page we are when the loader calls us: part of process initialization on Windows runs in user mode on the main thread and uses some of that stack. We can now store as much data as we feel comfortable with at the top of the stack.
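The same three steps can be checked with plain integer arithmetic. The Python sketch below mirrors the assembly; the stack base and rsp values are invented for the example, and it assumes, as the text does, that rsp is still inside the BOS page at entry.

```python
PAGE_SIZE = 0x1000
STACK_COMMIT = 0x200000

def stack_upper_limit(rsp):
    rax = rsp                           # mov rax, rsp
    rax &= ~(PAGE_SIZE - 1)             # and rax, not (PAGE_SIZE - 1)
    rax -= STACK_COMMIT - PAGE_SIZE     # sub rax, STACK_COMMIT - PAGE_SIZE
    return rax

# Hypothetical layout: the stack occupies [0x800000, 0xA00000).
stack_base = 0xA00000
rsp_at_entry = stack_base - 0x28        # somewhere inside the BOS page

print(hex(stack_upper_limit(rsp_at_entry)))  # 0x800000

# Subtracting the full STACK_COMMIT from the aligned value overshoots by a page.
naive = (rsp_at_entry & ~(PAGE_SIZE - 1)) - STACK_COMMIT
print(hex(naive))  # 0x7ff000
```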

Conclusion

That's it! 15 bytes of machine code to get the upper limit in rax. You could have allocated from the heap to do something similar, but it would have ended up costing more bytes depending on your use case. From here, if you want to have multiple uninitialized values in fasm, you can create a struct using the struct macro from the Windows includes:

struct BSS
    Var1 dd ?
    Var2 dw ?
    Var3 dw ?
ends

; ...

; rax = base of uninitialized data area
mov dword [rax + BSS.Var1], 123
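The field names in that struct are just offsets from the base in rax. As a rough cross-check, the same layout can be mirrored with Python's ctypes; this assumes dd maps to a 4-byte field and dw to a 2-byte field, and it is only an illustration, not fasm output.

```python
import ctypes

class BSS(ctypes.Structure):
    # Mirrors the fasm struct: dd = 4 bytes, dw = 2 bytes.
    _fields_ = [
        ("Var1", ctypes.c_uint32),
        ("Var2", ctypes.c_uint16),
        ("Var3", ctypes.c_uint16),
    ]

offsets = {name: getattr(BSS, name).offset for name, _ in BSS._fields_}
print(offsets)             # {'Var1': 0, 'Var2': 4, 'Var3': 6}
print(ctypes.sizeof(BSS))  # 8
```

So [rax + BSS.Var3] would address the word 6 bytes past the upper limit, and the whole block uses 8 bytes of the committed area.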

If you have a method to get the stack limit in fewer bytes please contact me!