.bss (Byte Saving Section 😉) sections are extremely useful for saving on size when dealing with uninitialized data. However, in tiny executables where every byte counts, they can take up more space in the binary than you might want.
Consider an entry in your PE's section table:
    > dt -r IMAGE_SECTION_HEADER
    uxtheme!IMAGE_SECTION_HEADER
       +0x000 Name                 : [8] UChar
       +0x008 Misc                 : _IMAGE_SECTION_HEADER::<unnamed-type-Misc>
          +0x000 PhysicalAddress      : Uint4B
          +0x000 VirtualSize          : Uint4B
       +0x00c VirtualAddress       : Uint4B
       +0x010 SizeOfRawData        : Uint4B
       +0x014 PointerToRawData     : Uint4B
       +0x018 PointerToRelocations : Uint4B
       +0x01c PointerToLinenumbers : Uint4B
       +0x020 NumberOfRelocations  : Uint2B
       +0x022 NumberOfLinenumbers  : Uint2B
       +0x024 Characteristics      : Uint4B
That's 40 bytes (0x28) of header to describe a section table entry for your uninitialized data: the last field, Characteristics, sits at +0x024 and is 4 bytes wide. Can we do better if we just want a place to stuff our uninitialized data?
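The size of a section table entry can be double-checked from the offsets in the dump above. The following is a minimal C sketch, not the real Windows header: the type name `SectionHeader` is hypothetical, and the `Misc` union is flattened to its `VirtualSize` member.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for IMAGE_SECTION_HEADER, laid out from the
   offsets shown in the dt output. */
typedef struct SectionHeader {
    uint8_t  Name[8];              /* +0x000 */
    uint32_t VirtualSize;          /* +0x008 (shares storage with PhysicalAddress) */
    uint32_t VirtualAddress;       /* +0x00c */
    uint32_t SizeOfRawData;        /* +0x010 */
    uint32_t PointerToRawData;     /* +0x014 */
    uint32_t PointerToRelocations; /* +0x018 */
    uint32_t PointerToLinenumbers; /* +0x01c */
    uint16_t NumberOfRelocations;  /* +0x020 */
    uint16_t NumberOfLinenumbers;  /* +0x022 */
    uint32_t Characteristics;      /* +0x024 */
} SectionHeader;

/* Compile-time checks: the last field starts at 0x24 and the whole
   entry is 0x28 = 40 bytes. */
static_assert(offsetof(SectionHeader, Characteristics) == 0x24,
              "Characteristics should sit at +0x024");
static_assert(sizeof(SectionHeader) == 0x28,
              "a section table entry should be 40 bytes");
```

If either assertion failed, the file would not compile, so a clean build is the whole check.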
One solution would be to use the very top of the allocated stack region (upper limit) as a place to store your data. Since the Windows loader will allocate pages for the stack of your main thread when your process is initialized, we can tell the loader to allocate a little extra for us to use for general purpose data storage.
To do this you have to take into consideration how Windows allocates memory for thread stacks.
Edit: A reader has notified me that the current thread's stack limit is actually in the TIB (a struct which is the first element of the TEB)! This is convenient and will save a few bytes as it's a single instruction. To do this one only has to do:
mov rax, gs:[0x10]
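As a sanity check on the 0x10 offset, the first three pointer-sized fields of the 64-bit TIB can be mirrored in C. This is only a sketch: the field names follow `NT_TIB64` from the Windows headers, while the type name `Tib64Head` is hypothetical and the remaining TIB fields are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the start of the 64-bit TIB. On x64 the gs
   segment points at the TEB, whose first member is the TIB. */
typedef struct Tib64Head {
    uint64_t ExceptionList; /* gs:[0x00] */
    uint64_t StackBase;     /* gs:[0x08], highest address of the stack */
    uint64_t StackLimit;    /* gs:[0x10], lowest committed stack address */
} Tib64Head;

/* The stack limit is the third 8-byte field, hence gs:[0x10]. */
static_assert(offsetof(Tib64Head, StackBase) == 0x08,
              "StackBase should be read from gs:[0x08]");
static_assert(offsetof(Tib64Head, StackLimit) == 0x10,
              "StackLimit should be read from gs:[0x10]");
```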
Now that that detour is over, let's get on to my original technique (slightly inferior but with some background) of getting the base of the allocated stack region:
Thread stacks and guard pages
The default stack size that most compilers and assemblers will specify to be reserved in your PE headers is going to be 1 MB. However, the entire 1 MB of stack space isn't going to be usable all at once. The 1 MB portion of your process' address space is reserved, meaning that no other allocation can overlap that contiguous region, but there is no backing page frame for the region in physical memory. To actually use the pages in that reserved region you have to commit the pages you want to use. This will locate free pages in physical memory to fill into the corresponding page table entries and reserve space in the page file should the pages need to be paged out. By default, the first page at the bottom of the stack (BOS) is committed and immediately usable before your entry point is called by the loader.
How does the number of committed stack pages grow so that we can use the necessary memory as rsp grows in the direction of the upper limit? That's the job of the stack guard page. Guard pages are special pages that will trigger an exception when accessed. You can register handler functions for these exceptions and do processing based on who touched the guarded page. In the case of a stack guard page, Windows will catch this exception, commit the page on demand, and make the subsequent stack page the next guard page. This allows the OS to not use physical memory for stacks unless absolutely needed. Why would this be needed? Imagine a process which has hundreds or even thousands of threads which all have stacks that are multiple megabytes. This can get expensive very quickly, and committing on demand can help in saving physical memory. Once a page has been committed and the stack shrinks above it, it will not be decommitted.
Implementing a guard page feature in an OS is fairly straightforward, as all that is required is for the pages in the guard region to be set as invalid in the page tables. On access, the kernel page fault handler can look in either the page tables or a data structure describing regions of the virtual address space to see whether the page is really invalid, in which case it triggers an application exception, or whether a callback needs to be fired off.
An interesting caveat of having stack guard pages arises when one wants to allocate a buffer on the stack that is larger than the guard page region. A single large subtraction from rsp would skip the guard page entirely if the required stack region hasn't been committed before. For this reason, MSVC will implicitly call _chkstk for large stack allocations behind the scenes to probe the stack and touch as many guard pages as necessary to make sure there is enough committed memory for the buffer.
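The probing idea can be sketched in portable C. This is not MSVC's actual routine, only an illustration of the access pattern under stated assumptions: `probe_pages` is a hypothetical helper, `base + size` plays the role of the current rsp (assumed to sit in an already committed page), and the page size is assumed to be 4 KB.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE ((size_t)0x1000)

/* Walk downwards from base + size to base in page-sized steps, writing
   one byte per step so that no step ever jumps over a whole page.
   Returns the number of probes performed. */
static size_t probe_pages(volatile unsigned char *base, size_t size)
{
    assert(base != NULL || size == 0);

    size_t probes = 0;
    size_t offset = size;

    while (offset > PAGE_SIZE) {
        offset -= PAGE_SIZE;
        base[offset] = 0;   /* on a real stack this would trip the guard page */
        probes++;
    }
    if (size > 0) {
        base[0] = 0;        /* final probe at the lowest address of the buffer */
        probes++;
    }
    return probes;
}
```

Because each write lands at most one page below the previous one, every guard page in the range gets hit in order instead of being stepped over.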
All this to say, if we tried to blindly use the extra stack memory at the upper limit we requested, it wouldn't be committed and touching it would crash our process. Now that we understand the difference between memory being reserved and committed, this is a non-issue, as we are able to specify to the loader how much of our stack we want to commit before our entry point is called.
Updating PE header stack sizes
We want to change the amount of memory that is both reserved and committed for our stack in our binary's PE headers. To do this we need to alter the two fields in IMAGE_OPTIONAL_HEADER shown below:
    > dt IMAGE_OPTIONAL_HEADER
    uxtheme!IMAGE_OPTIONAL_HEADER
       ...
       +0x048 SizeOfStackReserve : Uint8B
       +0x050 SizeOfStackCommit  : Uint8B
       ...
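To illustrate what altering those two fields amounts to, here is a minimal C sketch that patches them in a raw copy of a 64-bit optional header. The helper names (`set_stack_sizes`, `store_u64_le`, `load_u64_le`) are hypothetical, the offsets are taken from the dump above, and the buffer is assumed to start at the first byte of IMAGE_OPTIONAL_HEADER and to be at least 0x58 bytes long.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    OPT_MAGIC_PE32PLUS        = 0x20B, /* IMAGE_NT_OPTIONAL_HDR64_MAGIC */
    OFF_SIZE_OF_STACK_RESERVE = 0x48,
    OFF_SIZE_OF_STACK_COMMIT  = 0x50
};

/* Write a 64-bit value as little-endian bytes, independent of host order. */
static void store_u64_le(uint8_t *p, uint64_t value)
{
    for (int i = 0; i < 8; i++)
        p[i] = (uint8_t)(value >> (8 * i));
}

/* Read a 64-bit little-endian value back out of the buffer. */
static uint64_t load_u64_le(const uint8_t *p)
{
    uint64_t value = 0;
    for (int i = 0; i < 8; i++)
        value |= (uint64_t)p[i] << (8 * i);
    return value;
}

/* Patch both stack fields. Returns 0 on success, or -1 if the header
   does not carry the 64-bit optional header magic. */
static int set_stack_sizes(uint8_t *opt_hdr, uint64_t reserve, uint64_t commit)
{
    assert(opt_hdr != NULL);

    uint16_t magic = (uint16_t)(opt_hdr[0] | (opt_hdr[1] << 8));
    if (magic != OPT_MAGIC_PE32PLUS)
        return -1;

    store_u64_le(opt_hdr + OFF_SIZE_OF_STACK_RESERVE, reserve);
    store_u64_le(opt_hdr + OFF_SIZE_OF_STACK_COMMIT, commit);
    return 0;
}
```

Calling `set_stack_sizes(hdr, 0x200000, 0x200000)` would reserve and commit the same 2 MB, which is the configuration this article relies on.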
If you are trying to squeeze every last byte out of your files you are probably building your PE headers manually. Take note of how the SizeOfStackReserve and SizeOfStackCommit fields are assigned to the same value:
    ; Commit 2 MB of virtual address space for the stack instead of the
    ; normal 1 MB default. We'll use the upper meg for data storage
    STACK_COMMIT equ 0x200000

    ; ...

    IMAGE_OPTIONAL_HEADER:
        IMAGE_NT_OPTIONAL_HDR64_MAGIC = 20Bh
        IMAGE_SUBSYSTEM_WINDOWS_GUI   = 2

        .Magic                   dw IMAGE_NT_OPTIONAL_HDR64_MAGIC
        .MajorLinkerVersion      db 0
        .MinorLinkerVersion      db 0
        .SizeOfCode              dd 0
        .SizeOfInitializedData   dd 0C0h
        .SizeOfUninitializedData dd 0
        .AddressOfEntryPoint     dd begin
        .BaseOfCode              dd IMAGE_NT_HEADERS
        .ImageBase               dq 140000000h
        .SectionAlignment        dd 10h
        .FileAlignment           dd 10h
        .MajorOperatingSystemVer dw 5
        .MinorOperatingSystemVer dw 2
        .MajorImageVersion       dw 0
        .MinorImageVersion       dw 0
        .MajorSubsystemVersion   dw 5
        .MinorSubsystemVersion   dw 2
        .Win32VersionValue       dd 0
        .SizeOfImage             dd end_file
        .SizeOfHeaders           dd begin
        .CheckSum                dd 0
        .Subsystem               dw IMAGE_SUBSYSTEM_WINDOWS_GUI
        .DllCharacteristics      dw 0
        .SizeOfStackReserve      dq STACK_COMMIT ; ATTENTION
        .SizeOfStackCommit       dq STACK_COMMIT ; ATTENTION
        .SizeOfHeapReserve       dq 100000h
        .SizeOfHeapCommit        dq 1000h
        .LoaderFlags             dd 0
        .NumberOfRvaAndSizes     dd 0

    ; ...
Note that the above header definition can be made even smaller with various methods that will be discussed in a later article. If you aren't building your headers manually, or are using something like a packer or compressing linker, then you can use the trusty editbin to change the reserved and committed stack sizes after the fact (its /STACK option takes a reserve value and an optional commit value).
With both fields set to the same value, the whole 2 MB stack region is committed before our entry point runs. The stack is larger than before and we can start storing data in the newly committed upper area. If there is concern about the stack growing into our uninitialized data block, we can always bump up the reserved/committed stack size, though deep stacks can be indicators of poor program design.
Locating the upper limit
Now that our stack is correctly initialized at load time it's not difficult to locate the upper limit to start placing our data in the newly committed space:
    PAGE_SIZE = 1000h

    ; We first want to copy the current stack pointer into rax so that we can
    ; change it
    mov rax, rsp

    ; We then want to align to the nearest page boundary. This is normally
    ; done like: (rax & ~(PAGE_SIZE - 1)). Because pages are normally 4 KB in
    ; size their hex representation is 1000h. To align to 4 KB boundaries we
    ; have to clear the bottom 3 nybbles which is 4*3 = 12 least significant
    ; bits to zero. We could also accomplish the alignment with shr and then
    ; immediately doing a shl to clear out the bottom 12 bits but the two
    ; shifts come to 8 bytes versus 6 for the single and instruction.
    and rax, not (PAGE_SIZE - 1)

    ; Now we have the base address of the last page in our committed stack
    ; region in our process address space. We want to get to the start of
    ; our committed region which we know is of STACK_COMMIT size. All we
    ; have to do then is (rax - STACK_COMMIT) right? No, since we're not
    ; at the very end of the region and at the base of the last page, we have
    ; to roll rax back (STACK_COMMIT - PAGE_SIZE) to not overshoot our destination :)
    sub rax, STACK_COMMIT - PAGE_SIZE
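The same arithmetic can be worked through in C with concrete numbers. This is a sketch only: `stack_upper_limit` is a hypothetical helper, the constants match the listing above, and the example addresses are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE    UINT64_C(0x1000)
#define STACK_COMMIT UINT64_C(0x200000)

/* The mask trick below only works for power-of-two page sizes. */
static_assert((PAGE_SIZE & (PAGE_SIZE - 1)) == 0,
              "PAGE_SIZE must be a power of two");

/* Given a stack pointer somewhere inside the highest stack page,
   return the lowest address of the committed stack region. */
static uint64_t stack_upper_limit(uint64_t rsp)
{
    /* and rax, not (PAGE_SIZE - 1): base of the page rsp lives in */
    uint64_t page_base = rsp & ~(PAGE_SIZE - 1);

    /* sub rax, STACK_COMMIT - PAGE_SIZE: step back over every
       committed page except the one we are already at the base of */
    return page_base - (STACK_COMMIT - PAGE_SIZE);
}
```

For a made-up 2 MB stack spanning 0x400000 to 0x600000, an entry rsp of 0x5FFF58 aligns down to 0x5FF000, and subtracting 0x1FF000 lands on 0x400000, the start of the region.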
And there we have it, the upper limit is now going to be in rax. We need to first align to the nearest page because we aren't sure exactly where in the BOS page we are when the loader calls us: part of process initialization on Windows runs in user mode on the main thread, so some of the stack has already been used. We can now store as much data as we feel comfortable with at the top of the stack.
That's it! 15 bytes of machine code to get the upper limit in rax. You could have allocated from the heap to do something similar, but it would have ended up costing more bytes depending on your use case. From here, in fasm, if you want to have multiple uninitialized values you can create a struct using the windows includes:
    struct BSS
        Var1 dd ?
        Var2 dw ?
        Var3 dw ?
    ends

    ; ...

    ; rax = base of uninitialized data area
    mov [rax + BSS.Var1], 123
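For readers more at home in C, the overlay idea looks roughly like this. It is a sketch rather than a translation of the fasm macro: `bss_at` is a hypothetical helper, and the base pointer is assumed to be suitably aligned, which a page-aligned stack limit is.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same layout as the fasm struct: dd is 4 bytes, dw is 2 bytes. */
typedef struct BSS {
    uint32_t Var1; /* dd ? */
    uint16_t Var2; /* dw ? */
    uint16_t Var3; /* dw ? */
} BSS;

static_assert(offsetof(BSS, Var2) == 4, "Var2 follows the 4-byte Var1");
static_assert(sizeof(BSS) == 8, "the block occupies 8 bytes");

/* Treat the memory at base as our uninitialized data block. */
static BSS *bss_at(void *base)
{
    return (BSS *)base;
}
```

With `base` holding the upper limit computed earlier, `bss_at(base)->Var1 = 123;` corresponds to the store in the fasm listing.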
If you have a method to get the stack limit in fewer bytes please contact me!
© 2018 emsea