The goal of this post is to write a small chunk of x64 assembly which performs a subset of the functionality provided by the Windows PE loader and locates function addresses in DLLs. It will store the hashes of the names of the functions we want to import from DLLs and will populate a table of function pointers. We will be using fasm as our assembler of choice due to its powerful macro capabilities. To explore our process' memory and look up the offsets of structure fields, we will be using WinDbg.

The primary motivation for writing our own code to import functions from DLLs is to create small binaries. The import tables that compilers and assemblers normally produce can be quite large and storing the hashes of function names is a lot more compact than storing the equivalent ASCII strings. It also is a great learning device as it's placed at the intersection of multiple topics including x64 assembly and executable file formats.

The import table that we will be creating will take the form of the following diagram:

Hash-based import table

The above table shows three hypothetical entries for functions in kernel32.dll. Each function that we want to import will be initially represented by eight bytes where the first four bytes will be the hash of the ASCII function name and the rest will be filler bytes. After searching through the export table of the correct DLL and matching one of the hashes in our import table, we will populate the entire eight byte entry with the virtual address of the function. All calls to the imported function will be made through the function pointer stored in our table. The default import table that is produced by compilers and assemblers is similar to ours in function but is much larger as the functions to import are stored as ASCII strings (among other requirements).
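
As a rough model of the entry layout described above, the following Python sketch packs one entry in its initial and resolved states. Python is used here only to model the bytes; the importer itself is written in fasm. The function names, the sample hash, and the sample address are made up for illustration.

```python
import struct

ENTRY_SIZE = 8  # every table entry occupies one qword


def make_entry(name_hash: int) -> bytes:
    """Initial state: 4-byte little-endian hash followed by 4 filler bytes."""
    return struct.pack("<II", name_hash & 0xFFFFFFFF, 0)


def resolve_entry(function_address: int) -> bytes:
    """Resolved state: the whole qword is overwritten with the function address."""
    return struct.pack("<Q", function_address)


# Hypothetical hash and address, for illustration only
initial = make_entry(0x12345678)
resolved = resolve_entry(0x00007FFB29D31FD0)
print(initial.hex(), "->", resolved.hex())
```

Because both states are exactly eight bytes, the resolved pointer can reuse the space that held the hash.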

The code samples are heavily commented and written with people who don't have a lot of experience with Windows or assembly language in mind. This could help you or could annoy you. If you fall into the latter category the full code is produced at the end of the post and isn't verbosely annotated. Additionally, the WinDbg examples should be understandable even if you've never used the debugger before as each command is annotated. I have minimally stripped some output to make things more clear.

If you want to learn more about assembly language, links to interesting resources are provided at the end of the post.

The PE file format

The Portable Executable file format is used for files with extensions such as .exe, .dll and others. It has various headers containing metadata which describe the file as a whole as well as the various sections of data contained within the binary (section A contains machine code while section B consists of data used for global variables referenced by the code, etc). These headers are used by the Windows loader to map binaries into a process' virtual address space and set the header-specified permissions (read, write, execute) on the mapped pages containing the sections. While .exe files are almost always used for programs which get their own address space before they are loaded, .dll files are used as shared libraries which are loaded into the address space of an existing process. Since DLL files are mapped into address spaces to use the code and data resources contained within their sections, the PE headers of a DLL describe where those resources are. We won't be covering the PE file format extensively and many things will be omitted. If you want to learn more this is a great resource. A general diagram of a PE file for the purpose of this post (after it has been mapped into memory) is as follows:

PE file

In this diagram, only the parts of the headers that we will be focusing on are shown. This doesn't show the section headers which describe locations and metadata about each section, the sub-structures that are contained within the PE header, and more. We are primarily interested in the export directory of a DLL as it contains all the references to functions within the code section(s) of the DLL that we want to search for.

All data references (pointers) in the PE headers are stored as RVAs or Relative Virtual Addresses. RVAs are offsets from the base address at which the image is loaded in memory (and in rare cases in object files, from the beginning of a section but you don't have to worry about those). Pointers are stored this way because PE files can be loaded/relocated at different places in memory and the headers can't assume a preferred base address. To find the address in our address space of an RVA in a DLL we have to have a pointer to the base address of the module that we add to the RVA. Data references in headers stored as RVAs will always be correct no matter where the module is loaded in memory as long as they are added to the correct base address. If our DLL is mapped into memory beginning at 0x400000 and a field in our module's header refers to a piece of data at RVA 0xFFF then the data is located at 0x400000 + 0xFFF = 0x400FFF in our address space. A .reloc section could be used as well but the PE header should be valid without .reloc, hence the need for RVAs.
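
The RVA arithmetic above is a single addition. A minimal Python sketch (the helper name rva_to_va is made up for this example) shows the same RVA resolving correctly at two different load addresses:

```python
def rva_to_va(module_base: int, rva: int) -> int:
    """Turn a Relative Virtual Address into a virtual address."""
    return module_base + rva


# The example from the text: module mapped at 0x400000, data at RVA 0xFFF
print(hex(rva_to_va(0x400000, 0xFFF)))

# The same RVA stays valid if the module is mapped somewhere else
print(hex(rva_to_va(0x7FFB29D20000, 0xFFF)))
```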

Creating a PE file in fasm

The first thing we're going to do is tell the assembler exactly what kind of binary we want. Because we want to use x64 instructions we are going to be using a 64 bit PE. This is almost exactly the same as a 32 bit PE but some header fields have been widened to 8 bytes from 4. There are different types of PE files and telling fasm that we want a GUI PE means that a console won't be created for you when your process is initialized. The PE type is specified by a field in the PE headers and read by the Windows loader. Additionally, we tell the assembler that the entry point address to be put in the PE header, which is initially called by the loader, is going to be that of find_kernel32. We will write find_kernel32 shortly. Let's place these directives at the top of our file:

format PE64 GUI
entry find_kernel32

The last thing to note is that we won't be explicitly specifying any sections in our binary. This causes fasm to create a single section named .flat which has read, write, and execute permissions where all code and data is placed. Since we want to produce small binaries, this eliminates padding bytes that are inserted between multiple sections for purposes of alignment. However, be warned that binaries with sections that are both writeable and executable can sometimes set off antivirus software.

Now that that's out of the way we can start writing code!

Import by hash

Find the base address of kernel32.dll

The first thing we need to do after our entry point is called is find the base address (also called the module handle) of the kernel32.dll module in our address space. kernel32.dll is important as it contains the LoadLibraryA function which can be used to map other DLLs into our address space. Every process on Windows has kernel32.dll mapped into its address space (along with ntdll.dll) as it contains code used in the user mode portion of process initialization. The base address can be found a few ways, but the usual approach is to go through the TEB (using one of the fs/gs segment registers depending on what processor mode you're executing in) to the PEB and walk the module list pointed to by the InInitializationOrderModuleList field. This can take up a fair number of instructions (even though kernel32.dll sits at a predictable position near the start of the list, removing the need for looping logic) and we're trying to take up as little space as possible.

Luckily, a trick which relies on the fact that all memory allocations for user mode processes in Windows start on 64K boundaries can be used to quickly obtain kernel32's base with only a few instructions. All allocations performed by the kernel for a process in user mode, such as those performed on behalf of VirtualAllocEx and memory mappings, are going to start on a multiple of 64K. We can see this in action by viewing a process with the VMMap tool from Sysinternals which shows how a process' virtual address space is partitioned. Every DLL image mapped into our process (as well as most regions of virtual memory shown by the tool) begins on a multiple of 64K as the bottom two bytes of the base address are zero'd out. We can also see the locations of kernel32.dll's header and sections, whose names all begin with a period.

Virtual memory map

What this means is that if we have any pointer inside kernel32.dll we can align that pointer down to a 64K boundary and check for the byte signature at the beginning of the DLL's DOS header (4d 5a which are the ASCII characters MZ... you'll see these bytes later). If it isn't found, check for the signature at each previous multiple of 64K until it is located. It also just so happens that we can find a suitable pointer at the top of the stack: the function which calls your binary's entry point is BaseThreadInitThunk, located within kernel32.dll, so the return address it leaves on the stack points into that function.

The machine code for KERNEL32!BaseThreadInitThunk is (in all versions of Windows since 7 and probably XP) placed by the linker used to link kernel32.dll at an offset in the file that ends up being loaded at an offset between 64K and 128K bytes from the start of the module base address in memory. This means that we don't necessarily need to iterate backwards checking every multiple of 64K as we know exactly which particular region our pointer falls under. This does depend on the particular build of kernel32.dll but has proven effective for many versions of Windows. We're having fun, not writing bullet-proof production code. We can see how this makes sense in the VMMap screenshot as the .text section (which is the typical name for code sections and contains BaseThreadInitThunk) is adjacent to the header of kernel32.dll. The easiest way to align a pointer to 64K is to zero out the bottom two bytes of the address it points to. Let's inspect the environment that is set up for us right before our entry point is called in WinDbg:

(3fac.10dc): Break instruction exception - code 80000003 (first chance)
00007ffb`2a802e9c cc              int     3
> bp $exentry  ;* Set a BreakPoint on the program entry point
> g            ;* Go run the program to initialize the process and stop at our breakpoint
Breakpoint 0 hit
00000000`00401000 488b1c24        mov     rbx,qword ptr [rsp]

> k ;* Print the state of the stacK after the process init code has run
 # Child-SP          RetAddr           Call Site
00 00000000`0008ff58 00007ffb`29d31fe4 import+0x1000
01 00000000`0008ff60 00007ffb`2a79ef91 KERNEL32!BaseThreadInitThunk+0x14
02 00000000`0008ff90 00000000`00000000 ntdll!RtlUserThreadStart+0x21

> ln 00007ffb`29d31fe4 ;* List the Nearest preceding symbol that this address could belong to
(00007ffb`29d31fd0) KERNEL32!BaseThreadInitThunk+0x14
> * ^ This is exactly what we were expecting especially considering that the 01
> *   stack entry's call site is to BaseThreadInitThunk

> lm ;* List loaded Modules
start             end                 module name
00000000`00400000 00000000`00402000   import     (no symbols)
00007ffb`27540000 00007ffb`277a6000   KERNELBASE   (deferred)
00007ffb`29d20000 00007ffb`29dce000   KERNEL32   (pdb symbols)
00007ffb`2a730000 00007ffb`2a910000   ntdll      (pdb symbols)

When our breakpoint was hit we looked at the stack and found the return address back to the function that called our entry point to be 00007ffb`29d31fe4. If we list the nearest symbol to that address we see that we will return back to KERNEL32!BaseThreadInitThunk+0x14. Let's align this address down to 64K and then subtract 64K as previously discussed. Dumping the memory after we have computed our new address yields:

> db 00007ffb`29d30000 - 0x10000 ;* Dump Bytes at the computed address
00007ffb`29d20000  4d 5a 90 00 03 00 00 00-04 00 00 00 ff ff 00 00  MZ..............
00007ffb`29d20010  b8 00 00 00 00 00 00 00-40 00 00 00 00 00 00 00  ........@.......
00007ffb`29d20020  00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00  ................
00007ffb`29d20030  00 00 00 00 00 00 00 00-00 00 00 00 e8 00 00 00  ................
00007ffb`29d20040  0e 1f ba 0e 00 b4 09 cd-21 b8 01 4c cd 21 54 68  ........!..L.!Th
00007ffb`29d20050  69 73 20 70 72 6f 67 72-61 6d 20 63 61 6e 6e 6f  is program canno
00007ffb`29d20060  74 20 62 65 20 72 75 6e-20 69 6e 20 44 4f 53 20  t be run in DOS 
00007ffb`29d20070  6d 6f 64 65 2e 0d 0d 0a-24 00 00 00 00 00 00 00  mode....$.......

Notice how we aligned the address we found at the top of the stack for the dump computation as the bottom two bytes are zero'd out. What you're seeing is the bytes of the DOS header signaling the beginning of kernel32.dll in our address space! Now let's write some code. The following snippet finds the base of kernel32.dll assuming that the top of the stack contains the return address to KERNEL32!BaseThreadInitThunk:

; Obtain base of kernel32.dll in our address space in rbx
find_kernel32:
    ; In fasm a label or symbolic constant beginning with a period is local
    ; to the most recent non-local label thus .alloc_granularity "belongs"
    ; to find_kernel32
    .alloc_granularity = 10000h

    ; rbx = return address to kernel32 loader function BaseThreadInitThunk
    mov rbx, [rsp]

    ; Align address to 64k boundary by clearing bottom two bytes
    ; of pointer stored in rbx (0xFFFF = 65535 = 64*1024 - 1)
    ; xoring a register with itself produces 0 and is the idiomatic way
    ; of zeroing a register on x86 processors
    xor bx, bx

    ; Go back 1 multiple of 64k to get the base of the kernel32.dll module
    sub rbx, .alloc_granularity

    ; ** Don't worry about these. We'll come back to them later
    lea rbp, [__imp_tab_start]
    sub rsp, 8*5
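
The address arithmetic in the snippet above can be checked against the WinDbg session with a few lines of Python. This is only a model of the computation: the function names and the read_u16 callback are invented for this sketch, and the fake memory dictionary stands in for the process address space. The second function shows the more general fallback described earlier, which steps back one 64K region at a time until the MZ signature is found.

```python
ALLOC_GRANULARITY = 0x10000  # 64K, mirrors .alloc_granularity
MZ_SIGNATURE = 0x5A4D        # bytes 4d 5a read as a little-endian word


def base_from_return_address(ret_addr: int) -> int:
    """Mirror of the fasm snippet: clear the low 16 bits, then go back 64K."""
    aligned = ret_addr & ~(ALLOC_GRANULARITY - 1)  # same effect as xor bx, bx
    return aligned - ALLOC_GRANULARITY             # same effect as sub rbx, 10000h


def scan_for_mz(ptr: int, read_u16, max_steps: int = 16):
    """General fallback: test each preceding 64K boundary for the MZ signature."""
    addr = ptr & ~(ALLOC_GRANULARITY - 1)
    for _ in range(max_steps):
        if read_u16(addr) == MZ_SIGNATURE:
            return addr
        addr -= ALLOC_GRANULARITY
    return None


# Return address taken from the `k` output in the WinDbg session
ret = 0x00007FFB29D31FE4
print(hex(base_from_return_address(ret)))

# Fake memory for the scan: only kernel32's base holds the signature
fake_memory = {0x00007FFB29D20000: MZ_SIGNATURE}
print(hex(scan_for_mz(ret, lambda a: fake_memory.get(a, 0))))
```

Both calls land on the start address that `lm` reported for KERNEL32.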

There are calling conventions that assembly programmers have to follow when invoking functions provided by APIs and from other DLLs. This greatly affects how you structure your code. However, because we are writing a stub of code which executes before the real meat of the program, we can elide some of the normal conventions that well-structured assembly code usually employs (in the name of size of course). One of our conventions is going to be storing the base address of the current module we're importing from into rbx to readily be able to turn RVAs into virtual addresses. We will minimize the use of the stack as much as possible.

Get the export directory

Now that we have the base address of kernel32.dll in rbx, we need to parse the headers of the DLL in memory to locate the export directory. All PE files begin with the DOS header, which is followed by a small valid DOS program (the stub) that normally just prints an error message saying the program cannot be run in DOS mode. To locate the export directory we first need to obtain the RVA of the PE header which is provided by the e_lfanew field in the DOS header structure. The problem is that we aren't writing our importer in C and don't have the convenience of a compiler figuring out the offset of that field from the struct definition. We can quickly remedy this by using WinDbg to display the type of the debug symbols representing the DOS header struct IMAGE_DOS_HEADER:

   +0x000 e_magic          : Uint2B
   +0x002 e_cblp           : Uint2B
   +0x004 e_cp             : Uint2B
   +0x006 e_crlc           : Uint2B
   +0x008 e_cparhdr        : Uint2B
   +0x00a e_minalloc       : Uint2B
   +0x00c e_maxalloc       : Uint2B
   +0x00e e_ss             : Uint2B
   +0x010 e_sp             : Uint2B
   +0x012 e_csum           : Uint2B
   +0x014 e_ip             : Uint2B
   +0x016 e_cs             : Uint2B
   +0x018 e_lfarlc         : Uint2B
   +0x01a e_ovno           : Uint2B
   +0x01c e_res            : [4] Uint2B
   +0x024 e_oemid          : Uint2B
   +0x026 e_oeminfo        : Uint2B
   +0x028 e_res2           : [10] Uint2B
   +0x03c e_lfanew         : Int4B

As we can see, the offset of the e_lfanew field is 0x3C bytes from the start of IMAGE_DOS_HEADER (and also from the base address of our kernel32 module). Now we can use that to get the RVA of the PE header:

; Find the export directory given the base address of kernel32.dll in rbx
    .e_lfanew = 3Ch

    ; eax = the RVA of the PE header. We use eax instead of rax because
    ; e_lfanew is a 4 byte (dword) field in the structure. If rax were used we would
    ; be reading 8 bytes (qword) from the addressing-mode computed pointer. This pointer 
    ; size deduction is performed by the assembler to generate the correct opcode
    mov eax, [rbx + .e_lfanew]
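
To see what that load produces, the first 0x40 bytes of the earlier kernel32.dll memory dump can be decoded in Python. The byte string below is copied from the `db` output; the variable names are just for this sketch.

```python
import struct

# First 0x40 bytes of kernel32.dll as shown in the `db` dump
dos_header = bytes.fromhex(
    "4d5a900003000000" "04000000ffff0000"
    "b800000000000000" "4000000000000000"
    "0000000000000000" "0000000000000000"
    "0000000000000000" "00000000e8000000"
)

E_LFANEW = 0x3C  # offset of e_lfanew in IMAGE_DOS_HEADER

(e_magic,) = struct.unpack_from("<H", dos_header, 0)          # Uint2B
(e_lfanew,) = struct.unpack_from("<i", dos_header, E_LFANEW)  # Int4B

kernel32_base = 0x00007FFB29D20000
print(hex(e_magic), hex(e_lfanew), hex(kernel32_base + e_lfanew))
```

For this particular build the PE header starts 0xE8 bytes after the module base; other builds of kernel32.dll may place it elsewhere, which is why the value is read from the header rather than hard-coded.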

Note that whenever we write to the lower 32 bit half of a 64 bit register, the upper 32 bits of the 64 bit register are zero'd out. Thus, even if rax had a bunch of junk in its upper 32 bits, moving a value into eax would clear them, making rax hold the same value as eax. (Writes to the 16 bit and 8 bit sub-registers don't do this, which is why xor bx, bx earlier left the upper bits of rbx intact.) This turns out to be extremely useful when pulling dword sized RVAs out of a dereferenced pointer and then adding that RVA to the module base address contained in rbx. Remember: operand sizes have to match in valid instructions so you can only add 64 bit values to other 64 bit values, etc.
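
The difference between 32 bit and 16 bit register writes can be modeled with plain integers. This is a toy model of the x64 rules, not an emulator, and the function names are invented for the example.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def write_32(reg64: int, value: int) -> int:
    """Writing the 32 bit sub-register (e.g. eax) clears the upper 32 bits."""
    return value & 0xFFFFFFFF


def write_16(reg64: int, value: int) -> int:
    """Writing the 16 bit sub-register (e.g. bx) keeps the upper 48 bits."""
    return (reg64 & ~0xFFFF & MASK64) | (value & 0xFFFF)


# mov eax, 0E8h with junk in the upper half of rax
print(hex(write_32(0xDEADBEEF12345678, 0xE8)))

# xor bx, bx on the return address from the WinDbg session
print(hex(write_16(0x00007FFB29D31FE4, 0)))
```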

Now that we have the RVA of the PE header all we have to do is add it to the base address in rbx and we'll have the virtual address. However, what are we going to do after we have the virtual address of the PE header? First let's examine the PE header (IMAGE_NT_HEADERS) symbols in WinDbg:

> dt -r _IMAGE_NT_HEADERS64 ;* The -r switch recursively displays nested structures
   +0x000 Signature        : Uint4B
   +0x004 FileHeader       : _IMAGE_FILE_HEADER
      +0x000 Machine          : Uint2B
      +0x002 NumberOfSections : Uint2B
      +0x004 TimeDateStamp    : Uint4B
      +0x008 PointerToSymbolTable : Uint4B
      +0x00c NumberOfSymbols  : Uint4B
      +0x010 SizeOfOptionalHeader : Uint2B
      +0x012 Characteristics  : Uint2B
   +0x018 OptionalHeader   : _IMAGE_OPTIONAL_HEADER64
      +0x000 Magic            : Uint2B
      +0x002 MajorLinkerVersion : UChar
      +0x003 MinorLinkerVersion : UChar
      +0x004 SizeOfCode       : Uint4B
      +0x008 SizeOfInitializedData : Uint4B
      +0x00c SizeOfUninitializedData : Uint4B
      +0x010 AddressOfEntryPoint : Uint4B
      +0x014 BaseOfCode       : Uint4B
      +0x018 ImageBase        : Uint8B
      +0x020 SectionAlignment : Uint4B
      +0x024 FileAlignment    : Uint4B
      +0x028 MajorOperatingSystemVersion : Uint2B
      +0x02a MinorOperatingSystemVersion : Uint2B
      +0x02c MajorImageVersion : Uint2B
      +0x02e MinorImageVersion : Uint2B
      +0x030 MajorSubsystemVersion : Uint2B
      +0x032 MinorSubsystemVersion : Uint2B
      +0x034 Win32VersionValue : Uint4B
      +0x038 SizeOfImage      : Uint4B
      +0x03c SizeOfHeaders    : Uint4B
      +0x040 CheckSum         : Uint4B
      +0x044 Subsystem        : Uint2B
      +0x046 DllCharacteristics : Uint2B
      +0x048 SizeOfStackReserve : Uint8B
      +0x050 SizeOfStackCommit : Uint8B
      +0x058 SizeOfHeapReserve : Uint8B
      +0x060 SizeOfHeapCommit : Uint8B
      +0x068 LoaderFlags      : Uint4B
      +0x06c NumberOfRvaAndSizes : Uint4B
      +0x070 DataDirectory    : [16] _IMAGE_DATA_DIRECTORY
         +0x000 VirtualAddress   : Uint4B
         +0x004 Size             : Uint4B

At the bottom of this listing we can see the DataDirectory array field of the OptionalHeader. This array describes the various directories of metadata for the DLL. The first entry describes the export directory, which is what we're looking for. As such, the RVA of the export directory is contained in OptionalHeader.DataDirectory[0].VirtualAddress. Because we want VirtualAddress (the first field in the IMAGE_DATA_DIRECTORY structure) of the zeroth element of the DataDirectory array, the RVA of the export directory is located at an offset of simply 0x70 from the start of OptionalHeader. However, what we have in rax is the RVA of the PE header struct IMAGE_NT_HEADERS64. If the offset of the OptionalHeader is 0x18 and the offset from the start of the OptionalHeader to the export directory entry (the zeroth data directory) is 0x70, then the offset of that entry from the start of the PE header is 0x18 + 0x70 = 0x88. Using this new offset we can obtain the virtual address of the export directory as follows:

    .data_dir_0 = 88h

    ; Before this instruction rax is the RVA of the start of the PE header. By adding it 
    ; to rbx we obtain the virtual address of the PE header in our address space. We can 
    ; stuff the offset of the export directory into the addressing mode calculation here
    ; as well. This saves bytes as it removes the need for an additional add instruction
    ; to add .data_dir_0 to the base of the PE header. After this instruction eax
    ; contains the RVA of the export directory
    mov eax, [rbx + rax + .data_dir_0]

    ; rax is the RVA to the export directory for kernel32 so we need to make it a real virtual 
    ; address to access the export directory structure.
    add rax, rbx
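
The 0x88 offset can be double-checked by letting Python's ctypes lay out the same structures, much as a C compiler would. This is a sketch for verification only: the field lists are transcribed from the WinDbg listing above, and data_directory_offset is a helper invented for this example.

```python
import ctypes as C


class IMAGE_DATA_DIRECTORY(C.Structure):
    _fields_ = [("VirtualAddress", C.c_uint32), ("Size", C.c_uint32)]


class IMAGE_FILE_HEADER(C.Structure):
    _fields_ = [
        ("Machine", C.c_uint16), ("NumberOfSections", C.c_uint16),
        ("TimeDateStamp", C.c_uint32), ("PointerToSymbolTable", C.c_uint32),
        ("NumberOfSymbols", C.c_uint32), ("SizeOfOptionalHeader", C.c_uint16),
        ("Characteristics", C.c_uint16),
    ]


class IMAGE_OPTIONAL_HEADER64(C.Structure):
    _fields_ = [
        ("Magic", C.c_uint16),
        ("MajorLinkerVersion", C.c_uint8), ("MinorLinkerVersion", C.c_uint8),
        ("SizeOfCode", C.c_uint32), ("SizeOfInitializedData", C.c_uint32),
        ("SizeOfUninitializedData", C.c_uint32), ("AddressOfEntryPoint", C.c_uint32),
        ("BaseOfCode", C.c_uint32), ("ImageBase", C.c_uint64),
        ("SectionAlignment", C.c_uint32), ("FileAlignment", C.c_uint32),
        ("MajorOperatingSystemVersion", C.c_uint16), ("MinorOperatingSystemVersion", C.c_uint16),
        ("MajorImageVersion", C.c_uint16), ("MinorImageVersion", C.c_uint16),
        ("MajorSubsystemVersion", C.c_uint16), ("MinorSubsystemVersion", C.c_uint16),
        ("Win32VersionValue", C.c_uint32), ("SizeOfImage", C.c_uint32),
        ("SizeOfHeaders", C.c_uint32), ("CheckSum", C.c_uint32),
        ("Subsystem", C.c_uint16), ("DllCharacteristics", C.c_uint16),
        ("SizeOfStackReserve", C.c_uint64), ("SizeOfStackCommit", C.c_uint64),
        ("SizeOfHeapReserve", C.c_uint64), ("SizeOfHeapCommit", C.c_uint64),
        ("LoaderFlags", C.c_uint32), ("NumberOfRvaAndSizes", C.c_uint32),
        ("DataDirectory", IMAGE_DATA_DIRECTORY * 16),
    ]


class IMAGE_NT_HEADERS64(C.Structure):
    _fields_ = [
        ("Signature", C.c_uint32),
        ("FileHeader", IMAGE_FILE_HEADER),
        ("OptionalHeader", IMAGE_OPTIONAL_HEADER64),
    ]


def data_directory_offset(index: int) -> int:
    """Offset of DataDirectory[index].VirtualAddress from the start of the PE header."""
    return (IMAGE_NT_HEADERS64.OptionalHeader.offset
            + IMAGE_OPTIONAL_HEADER64.DataDirectory.offset
            + index * C.sizeof(IMAGE_DATA_DIRECTORY))


print(hex(data_directory_offset(0)))  # the export directory entry
```

Since each IMAGE_DATA_DIRECTORY is eight bytes, the same helper gives the offset of any other directory entry by changing the index.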

Start reading the export directory

After the previous snippets have executed rax holds the virtual address of the export directory. Now we have to rip the data out of this structure to perform the main importing process. Let's take a peek at the export directory struct to find the fields that we need:

   +0x000 Characteristics  : Uint4B
   +0x004 TimeDateStamp    : Uint4B
   +0x008 MajorVersion     : Uint2B
   +0x00a MinorVersion     : Uint2B
   +0x00c Name             : Uint4B
   +0x010 Base             : Uint4B
   +0x014 NumberOfFunctions : Uint4B
   +0x018 NumberOfNames    : Uint4B
   +0x01c AddressOfFunctions : Uint4B
   +0x020 AddressOfNames   : Uint4B
   +0x024 AddressOfNameOrdinals : Uint4B

The fields of interest are NumberOfNames, AddressOfFunctions, AddressOfNames, and AddressOfNameOrdinals. Let's examine each of them in detail:


AddressOfNames

This is an array of RVAs pointing to zero-terminated ASCII strings of function names. All of the functions that a PE exports by name are going to be in this array. We will be walking through each of these names and computing their hash. This will be compared with the hash that we store in our own import table. The number of name entries in AddressOfNames is held in NumberOfNames. The index of a particular function name in this array can be used to index into AddressOfNameOrdinals to obtain the name ordinal of the corresponding function.


AddressOfNameOrdinals

This field is an array of 16 bit values called ordinals which are used as indexes into AddressOfFunctions. There is more to ordinals than their use with AddressOfNames but that is outside the scope of this article. The index of a particular function name RVA in AddressOfNames is used to index into AddressOfNameOrdinals and get the correct index into AddressOfFunctions.


AddressOfFunctions

Finally, this field is an array of RVAs which point to exported functions located within the various sections of the PE. It is indexed by the values obtained from AddressOfNameOrdinals.
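
The way the three arrays chain together can be sketched in Python. The table contents below are hypothetical (they are not the real ordinals or RVAs of kernel32.dll), and the lookup compares names directly for readability, whereas the importer in this post compares hashes of the names.

```python
def find_export_rva(wanted, names, name_ordinals, function_rvas):
    """Walk AddressOfNames; a match at index i selects name_ordinals[i],
    which in turn indexes AddressOfFunctions."""
    for i in range(len(names)):  # len(names) plays the role of NumberOfNames
        if names[i] == wanted:
            return function_rvas[name_ordinals[i]]
    return None


# Hypothetical export data, for illustration only
names = ["AddAtomA", "Beep", "CloseHandle"]   # AddressOfNames (already dereferenced)
name_ordinals = [2, 0, 1]                     # AddressOfNameOrdinals
function_rvas = [0x1A2B0, 0x3C4D0, 0x5E6F0]   # AddressOfFunctions

print(hex(find_export_rva("Beep", names, name_ordinals, function_rvas)))
```

Note that the name index and the function index are generally different numbers, which is why the ordinal array sits between them.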

Parsing the export directory

Below is a diagram of the three parallel arrays and the data which they refer to:

Export table

This diagram is a bit misleading in that the ASCII function names are from kernel32.dll but refer to ordinals that aren't actually what they are in the real DLL. Hopefully the structure of the arrays is now clear. A feature of the export table that we will not be supporting is export forwarding.

Now that we have our desired fields and their offsets, we can write a few instructions to parse out the data into registers that we'll use in the remaining sections of the importer:

; rax contains the virtual address of IMAGE_EXPORT_DIRECTORY
    .export_names_num  = 18h
    .export_funcs_addr = 1Ch
    .export_names_addr = 20h
    .export_ords_addr  = 24h

    ; Obtain useful RVAs from IMAGE_EXPORT_DIRECTORY
    mov r13d, [rax + .export_funcs_addr] ; AddressOfFunctions
    mov r14d, [rax + .export_names_addr] ; AddressOfNames
    mov r15d, [rax + .export_ords_addr]  ; AddressOfNameOrdinals

    ; Why isn't this mov above the others? This is intentional as it can be 
    ; good practice to stuff independent instructions between the data fetching
    ; instructions that load data from memory and instructions that deal with
    ; the data after it has been fetched. Can you think of why? 
    mov r12d, [rax + .export_names_num]  ; NumberOfNames (not an RVA)

    ; Turn all RVAs into valid 64 bit virtual addresses
    add r13, rbx
    add r14, rbx
    add r15, rbx
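
The same field extraction can be mimicked in Python to make the offsets concrete. This is a sketch over a fabricated buffer: the image bytes, the export directory RVA of 0x40, and the array RVAs are all invented, and the buffer is assumed to be laid out like a mapped module so that a buffer index equals an RVA.

```python
import struct

EXPORT_NAMES_NUM = 0x18  # NumberOfNames; the three Address* fields follow it


def parse_export_directory(image: bytes, export_dir_rva: int, base: int) -> dict:
    """Read the four dwords at 0x18..0x27 and rebase the three RVAs."""
    num_names, funcs_rva, names_rva, ords_rva = struct.unpack_from(
        "<4I", image, export_dir_rva + EXPORT_NAMES_NUM
    )
    return {
        "NumberOfNames": num_names,                  # a count, not an RVA
        "AddressOfFunctions": base + funcs_rva,      # like add r13, rbx
        "AddressOfNames": base + names_rva,          # like add r14, rbx
        "AddressOfNameOrdinals": base + ords_rva,    # like add r15, rbx
    }


# Fabricated module image with an export directory at RVA 0x40
image = bytearray(0x100)
struct.pack_into("<4I", image, 0x40 + EXPORT_NAMES_NUM, 3, 0x1000, 0x2000, 0x3000)

fields = parse_export_directory(bytes(image), 0x40, 0x00007FFB29D20000)
for name, value in fields.items():
    print(name, hex(value))
```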

We now have all of the information we need to start the importing process. Before we do that though, let's take a detour and build the table where we'll be placing the addresses of the functions we want to import.

Building the import table

Before we iterate over the parallel arrays AddressOfNames, AddressOfNameOrdinals, and AddressOfFunctions let's build the table where we'll be storing the function pointers we want to import. By default, compilers and assemblers build a standard import table which is parsed by the Windows loader. In short, this table stores the ASCII name of each DLL along with a list of entries, one per imported function. Each entry pairs the ASCII name of the function with an 8 byte slot (4 bytes for 32 bit binaries) that the Windows loader overwrites with the address of the corresponding function. All calls to imported functions are indirect calls through their corresponding import table entry which is fixed up when the module is loaded.

We will be building a similar table except we will store the 4 byte hash of the ASCII function name rather than the ASCII string itself. This will save a lot of bytes if the import table is large enough because Windows API function names can be quite long. To build our table we will be using fasm's powerful macro language to make a macro which takes a DLL and a list of functions to import from it and outputs the bytes of the corresponding table entries. We'll be using these table entries in our importing code. One of the joys of using an assembler is having full control over the layout of data in your binary. While we could define the bytes of our custom import table entries manually, it would neither look nice nor be easily maintainable. As such, we will be using the full power of the assembler and creating a metaprogram which does all the work for us. The syntax of fasm's macro language is vaguely analogous to AWK. You don't have to worry too much about the macro itself, just understand the format of the bytes that are produced. An example of the output will follow after the macro listing. If you want to understand fasm's macro syntax more you should look at the fasm Programmer's Manual and the fasm Preprocessor Guide.
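
Before reading the macro, it may help to see the bytes of a single table entry computed in Python. This sketch mirrors the hashing loop in the listing that follows as it is written there: the hash starts at 5381, each step computes h*33 + c truncated to 32 bits, and the loop bound includes the terminating null byte of the name. The function names are invented for this example.

```python
import struct


def djb2(name: str) -> int:
    """djb2 over the ASCII name plus its null terminator, truncated to 32 bits."""
    h = 5381
    for c in name.encode("ascii") + b"\x00":
        h = ((h << 5) + h + c) % 0x100000000
    return h


def table_entry(name: str) -> bytes:
    """Equivalent of `dd h, 0`: the 4-byte hash followed by a zero dword."""
    return struct.pack("<II", djb2(name), 0)


for name in ("LoadLibraryA", "ExitProcess"):
    print(name, hex(djb2(name)), table_entry(name).hex())
```

Whatever hashing variant is used at assembly time, the run-time hashing code in the importer has to compute exactly the same function over the exported names, or no entry will ever match.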

; Our macro will be used as follows:
;   use 'kernel32', LoadLibraryA, ExitProcess
;   use 'user32',\
;       MessageBoxA,\
;       DestroyWindow
; The first argument to each use macro is the name of the DLL that we want to import
; functions from. To save bytes we won't be including the .dll extension in the file name.
; Notice that we can continue the arguments to the macro on a new line using a backslash.
; Because we will be using LoadLibraryA to load DLLs into our address space, we will need 
; to store the ASCII string name of the DLL. This string is going to be bound to the first 
; argument of use simply called dll. The rest of the arguments are the list of functions
; we want to import from the DLL and are a part of the [imp_name] list. Invoking a macro in
; fasm is done by simply using its name. Its argument list will be everything up to the 
; first non-escaped newline.

macro use dll, [imp_name] {
    ; For every invocation of use, common blocks will only be evaluated a single 
    ; time and then the next block will be evaluated
    common
        ; This will define the current import number for the table
        ; it's used further down in the macro
        imp_num = 0

        ; The next 3 lines are a bit tricky so let's break them down.
        ; The first thing to know is that @@ is an anonymous label. This
        ; means that an offset in the file is recorded where the @@: is placed
        ; but it doesn't have a name to be able to directly reference. To refer
        ; to these types of labels @f and @b evaluate to the offsets designated
        ; by the next or preceding @@ label respectively. $ is the current offset
        ; in the file. We will cover the `@f - $` expression shortly.
        ; db defines bytes in the file exactly where you specify. By passing a
        ; string to db (such as the one in dll) it will define the ASCII string
        ; bytes sequentially. After that we immediately define a null byte to
        ; terminate the DLL string. The align directive will continue to insert
        ; padding bytes at the current location until the offset is a multiple of
        ; the number you pass to it. Therefore the @@: label will refer to a file
        ; offset that is a multiple of 8 bytes. This will be important later.
        ; After all of that, we can understand the purpose of `@f - $`: it
        ; defines a byte whose value is the size of the DLL string including 
        ; the null byte and the padding bytes that are used to align whatever 
        ; the next data definition is. This is used by the import code to jump
        ; over the file name and alignment bytes to the first entry of the table.
        db @f - $, dll, 0
        align 8
        @@:

    ; Forward blocks are implicit looping constructs. When you have a macro
    ; argument surrounded by brackets like [imp_name] (denoting a list) the 
    ; body of forward will be evaluated for every item in [imp_name]. For every
    ; iteration, the current item will be bound to the imp_name symbol
    forward
        ; Local variables to the current forward block
        local imp, imp_len, i, h, c

        ; In fasm, virtual is how you "allocate" memory in macros. It creates a 
        ; space where you can freely define bytes into to read out later. These 
        ; don't get automatically put in your binary. We create the virtual space
        ; at 0 because all of the offsets in the space (such as those obtained
        ; using $) will start from this number. 
        virtual at 0
            ; Double colon to allow code outside the virtual to access this label
            ; and pull data out of our virtual space. Read the fasm Programmer's
            ; Manual for more details
            imp::
                ; Remember how the function names we want to import aren't quoted
                ; strings? We'll use ` to stringify them and define their bytes 
                db `imp_name
                db 0

            ; The virtual space starts at 0 so $ is the number of bytes defined so
            ; far (the string plus its null terminator). Subtracting 1 leaves the
            ; length of the string itself, which we assign to imp_len
            imp_len = $ - 1
        end virtual

        ; Now that we have our import name as a string of bytes defined in a
        ; virtual block, we will iterate over each byte in the block and
        ; hash the entire string. The details of this particular hashing
        ; algorithm (djb2) are discussed later. All you have to know now is that
        ; after the while loop the h variable contains the hash of the function
        ; name we want to import.
        i = 0
        h = 5381
        while i <= imp_len
            ; Load the i'th byte out of our virtual block into c
            load c byte from imp:i

            ; This hashing algorithm purposefully overflows a 4 byte integer.
            ; fasm numbers aren't bounded that way so we must manually allow
            ; the overflow to happen using a modulus operation
            h = ((h shl 5) + h + c) mod 0x100000000
            i = i + 1
        end while

        ; We will always import from kernel32.dll first so it needs to always be
        ; the first DLL in our import table. Additionally, if the current 
        ; import entry number is 0 then we are at the start of the table.
        ; `#__imp_tab_start` defines a global label which refers to this first
        ; entry that we will want to populate into.
        if dll eq 'kernel32' & imp_num = 0
        end if

        ; Remember the `align 8` directive above? The following label refers to an
        ; offset which is a multiple of 8 bytes. This is true for all table entries 
        ; because the following data definition defines 8 bytes which means the next 
        ; label will fall on an 8 byte boundary as well.
        ; The following label definition will create a program-global label that can 
        ; be used to refer to the current import table entry. After our found function
        ; address has been filled into our current entry, this label with a name of
        ; imp_name will be used to refer to the function pointer of the import that
        ; we will call through. For example, if we imported ExitProcess, a global
        ; label will be defined called ExitProcess which we would use to refer to the 
        ; function address. To perform the call we would use the
        ; `call [ExitProcess]` instruction. We specify it as a qword to let fasm
        ; know the size of the label so we don't have to specify how big it is
        ; every time we want to use the label.
        label imp_name:qword

        ; Since h is the computed hash of the imported function we will define it as
        ; a 4 byte value (dd stands for define dword) followed by a zero. Both of
        ; these dwords combined are 8 bytes in size (a qword). After the hash of the
        ; function is found by the import code and then the corresponding entry is
        ; located, the entire 8 bytes is replaced with the fixed-up address of the
        ; import. This saves 4 bytes because we can re-use the space taken up by the
        ; hash itself after we don't have a use for it anymore.
        dd h, 0

        ; Increment the import number to prevent the previous if conditions from
        ; evaluating to true
        imp_num = imp_num + 1

    ; Here we have another common block. This will be evaluated after the previous
    ; common and forward blocks have run and defined their bytes in the file. We
    ; will define 8 bytes of zeros to signal to our importing code that there are
    ; no more functions left to import from this particular DLL and to move on to the
    ; next DLL
    common
        dq 0  ; Terminate table entry
}

; Finally, this macro will define a 0 dword which will signal to our importing code
; that there are no more DLLs that we want to import from. We will sandwich the
; invocations of our use macro with import_start and import_end
macro import_end { dd 0 }

; This doesn't do anything but if we need an import_end, then we should make our
; importing syntax look nice by having an import_start in our DSL :)
macro import_start {}

Is alignment all that important on modern x86 hardware? It used to be. We could certainly shave some bytes off our table if we didn't care about aligning the entries, and some people say that modern processors don't care most of the time. Whatever the reality of the situation is, it's good practice to align your data, and we want to be on our best behavior if the function addresses contained in the entries are going to be frequently called through. If you want to know more about alignment in C then check out this page by ESR.

Those macros aren't the most appealing code in the world but they work pretty well. Note that we will start importing at __imp_tab_start which points to the first hash of kernel32's import table. This is because the first DLL that we are going to get the base of is kernel32 and we won't have to load it into our address space. For reasons that we'll learn of later, kernel32 will always need to be the first DLL that we import from with the use macro. You don't have to worry too much about the semantics of the code, just understand the following explanation of the bytes the macros build. Let's say that we invoked our various macros to build an import table:

    use 'kernel32',\

    use 'user32',\

After this table is placed in the binary, the binary is then loaded into memory. Let's assume that we know that it is located at address 00000000`004010de. Let's dump that location to look at the bytes that we've built:

> db 00000000`004010de L 66  ;* display a Dump of Bytes 0x66 bytes Long
00000000`004010de  0a 6b 65 72 6e 65 6c 33-32 00 0c bf e7 b7 00 00  .kernel32.......
00000000`004010ee  00 00 4a 9d 70 74 00 00-00 00 5b 10 be 57 00 00  ..J.pt....[..W..
00000000`004010fe  00 00 ff 1e 69 b5 00 00-00 00 5e a7 8f a4 00 00  ....i.....^.....
00000000`0040110e  00 00 00 00 00 00 00 00-00 00 08 75 73 65 72 33  ...........user3
00000000`0040111e  32 00 bb 6b ac 1f 00 00-00 00 34 ab 31 42 00 00  2..k......4.1B..
00000000`0040112e  00 00 67 ef 07 a5 00 00-00 00 00 00 00 00 00 00  ..g.............
00000000`0040113e  00 00 00 00 00 00    

The purpose of the size byte we calculated with @f - $ becomes a bit more clear: if you go 0x0a bytes ahead of 00000000`004010de the address we get is 00000000`004010e8 which is divisible by 8 and therefore aligned. Looking at that address in the dump we can see the byte string 0c bf e7 b7 which is the little-endian ordered dword of the hash of the first import in our kernel32 table, AcquireSRWLockExclusive. After the dword hash there are four bytes of zeros before the next import's hash. When the 8 byte virtual address of AcquireSRWLockExclusive is located from the export directory of kernel32.dll, it is stored into 00000000`004010e8 and overwrites the bytes of the hash as well as the four bytes of zeros we defined. The global label that our macro defined for programmers to use to refer to the imported function is going to point to 00000000`004010e8. After the 4 bytes of zeros, we see the next hash corresponding to GetProcessAffinityMask. If we keep going through all of our imports we will see that they are all defined in this way and all aligned to qword boundaries.

How do we signal the end of the imports for kernel32? Look at address 00000000`00401110 which is right after the qword entry associated with ExitProcess. It's 8 bytes of zeros signaling that there are no more imports for the kernel32 module. Our importing code will check to see if we have a hash that is zero and if so, will stop looking for imports associated with the kernel32 module. Immediately after this null entry, at 00000000`00401118, we can see the size byte of the user32 import table.

The final piece of the puzzle is located at 00000000`00401140, which is immediately after the terminating qword of the user32 table (at 00000000`00401138) and is the 0 dword defined by import_end. This will be used to stop importing altogether if it is found and will cause the main body of the code to be jumped to as the importing process is finished. We'll see the code for parsing this table shortly.
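
If the layout is still fuzzy, it can help to rebuild it outside of the assembler. The following Python sketch mimics what the macros emit for a given load address. It is only a model: build_table and hash_name are names made up for this illustration, the base address is the one from the dump above, the padding byte assumes fasm's align fills gaps with NOPs (neither DLL in our dump needed padding, so the dump can't confirm this), and the hash runs over the name plus its null terminator the way the macro's while loop does.

```python
import struct

def hash_name(name: str) -> int:
    """djb2 over the ASCII name and its null terminator, like the macro's loop."""
    h = 5381
    for c in name.encode("ascii") + b"\x00":
        h = (h * 33 + c) & 0xFFFFFFFF   # same as `mod 0x100000000` in the macro
    return h

def build_table(base: int, dlls) -> bytes:
    """Model of import_start / use ... / import_end placed at address `base`.

    `dlls` is a list of (dll_name, [function names]) pairs.
    """
    out = bytearray()
    for dll, names in dlls:
        start = len(out)
        out += b"\x00"                         # placeholder for the size byte
        out += dll.encode("ascii") + b"\x00"   # DLL name handed to LoadLibraryA
        while (base + len(out)) % 8:           # align 8 (assumed NOP fill)
            out += b"\x90"
        out[start] = len(out) - start          # size byte: @f - $
        for name in names:
            out += struct.pack("<II", hash_name(name), 0)   # dd h, 0
        out += struct.pack("<Q", 0)            # dq 0 terminates this DLL's entries
    out += struct.pack("<I", 0)                # import_end: dd 0
    return bytes(out)

table = build_table(0x4010DE, [("kernel32", ["ExitProcess"])])
print(table.hex())
```

With only ExitProcess imported, the 30 bytes that come out are the size byte and DLL name, one entry whose first dword is 5e a7 8f a4 (the same bytes that sit at 00000000`00401108 in the dump), the qword terminator, and the import_end dword.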

Populate our import directory

Let's do a quick recap of the fields that registers r12-r15 hold from the export table:

Register  Field                  Description
r12       NumberOfNames          The number of named functions exported
r13       AddressOfFunctions     The start of the array of export RVAs
r14       AddressOfNames         The start of the array of export ASCII names
r15       AddressOfNameOrdinals  The array of ordinals mapping AddressOfNames to AddressOfFunctions

We have all the information we need for locating imports as well as a place to put them and it's time to iterate through the export data. We'll be using rcx as the index into the AddressOfNames array pointed to by r14. It holds the index of the current export name; each element of the array is a 4 byte RVA, which is why the code below scales rcx by 4.

    ; rcx is used as a counter in the find_exports loop so zero it out before the first 
    ; iteration 
    xor ecx, ecx

; This is the start of a loop that will be jumped to for every export of the current DLL
find_exports:
    ; Use rcx to index AddressOfNames (stored in r14) and get the current ASCII export name.
    ; The dword-sized RVA will be stored in rsi in particular because there is a useful
    ; instruction we'll be using which loads bytes pointed to by the rsi register
    mov esi, [r14 + rcx*4]

    ; Turn the RVA into a VA so that rsi points to the string
    add rsi, rbx

Now we've set up our find_exports loop and pointed rsi to the ASCII string of the current export function. What are we going to do with this string? We'll need to check if its hash is stored in the list of functions we want to import in our import table. To do that let's compute its hash using Daniel Bernstein's classic djb2 algorithm:

#include <stdint.h>

// Return hash of str using the djb2 algorithm
uint32_t djb2(const uint8_t *str) {
    uint32_t hash = 5381;
    uint32_t c;

    // Make each character in str contribute to the
    // resulting hash
    while ((c = *str++))
        hash = ((hash << 5) + hash) + c;  // hash * 33 + c

    return hash;
}

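This is the algorithm that our macro above used to calculate the hashes stored in our import table, with one small difference: the macro (and the assembly below) also runs the terminating null byte through the hash, so its result is the C result multiplied by 33 one more time. That's harmless as long as both sides agree. The djb2 algorithm resembles an LCG and is easy to implement with very few instructions. Now that we have seen the C code it is straightforward to translate it to assembly: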

    ; Hash current export name to see if there are any matching entries in our
    ; import table entry. rsi contains the address of the current 
    ; export string in the export directory

    ; edx contains our hash as it's being computed (initialized with 5381)
    ; You can think of this constant sort of in the same way as a CRC's
    ; initial value
    mov edx, 5381
.djb2:
    mov eax, edx  ; eax = hash before shl (shift left)
    shl edx, 5    ; hash * 32 (2^5 = 32)
    add edx, eax  ; hash * 32 + hash = hash * 33

    ; clear out the now-junk bytes in eax before lodsb
    ; ... you'll see why in a second
    xor eax, eax

    ; lodsb loads one byte from the memory location pointed to
    ; by the rsi register into al (least significant byte of rax) and 
    ; then increments rsi by one. This incrementing is quite useful 
    ; as it allows us to easily iterate through the ASCII import name
    lodsb

    ; Add our character to the hash just like the C version of
    ; the algorithm. Remember how argument sizes must match. We
    ; zeroed out eax because we can't add al to edx
    add edx, eax

    ; This test instruction is doing a bitwise and operation between
    ; its arguments to set the flags for the following jnz (jump if
    ; zero flag is not set). &'ing a number with itself results in zero
    ; and sets the zero flag only when the number is zero. This
    ; instruction is checking if we have hit the NULL byte of the ASCII
    ; function name string
    test eax, eax

    ; Jump back to add the next character in the export string to our hash
    ; only if there is more string data to process
    jnz .djb2

    ; Utilize the dword hash in edx in the following snippet

Notice that we use the lodsb instruction to load string bytes from rsi. This instruction is part of a set collectively known as "string instructions" which make it easier to iterate over ranges of bytes. While they might not always be as fast as equivalent sequences of instructions which load bytes and increment pointers manually (this is because they are microcoded), their opcodes don't take up a lot of space. In this case the opcode for lodsb only takes up a single byte. Writing your code around using conventional registers like rsi in this way can add up to a sizeable amount of saved bytes!

Now that our hash function is implemented, the only other thing to consider is hash collisions between imports. This could happen but as we'll see later, the collision would need to happen between names of exported functions located within the same DLL. This should be unlikely enough for our purposes. If someone is going to be using this method of importing it wouldn't be too much to ask to mess around with the hashing algorithm or add more data to the hash like the name of the DLL itself to eliminate the collision should one arise.
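
To experiment with the hash without assembling anything, here is a small Python sketch. The function names are made up for this illustration. djb2 is the textbook version from the C code above, while table_hash also feeds the terminating null byte through the hash, which is what the macro's `while i <= imp_len` loop and the lodsb loop both appear to do. find_collisions is a hypothetical helper for checking a list of export names from a single DLL; "aa" and "b@" are not real exports, just a pair constructed to collide.

```python
def djb2(name: bytes) -> int:
    """Textbook djb2: stops before the null terminator, like the C version."""
    h = 5381
    for c in name:
        h = (h * 33 + c) & 0xFFFFFFFF   # keep the hash in 32 bits
    return h

def table_hash(name: bytes) -> int:
    """The value stored in the import table: the terminator is hashed too."""
    return djb2(name + b"\x00")

def find_collisions(names):
    """Group names by table_hash and keep only groups with more than one name."""
    groups = {}
    for name in names:
        groups.setdefault(table_hash(name), []).append(name)
    return {h: g for h, g in groups.items() if len(g) > 1}

print(hex(djb2(b"ExitProcess")))        # 0xb769339e
print(hex(table_hash(b"ExitProcess")))  # 0xa48fa75e
print(find_collisions([b"aa", b"b@", b"ExitProcess"]))
```

The second value printed matches the 5e a7 8f a4 entry for ExitProcess in the table dump from earlier, which suggests the terminator really is part of the stored hash.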

Quick aside

We have the hash in edx so now we need to compare it with the list of hashes of functions we want to import in our import table. However, how do we know where our import table is located? Remember those instructions at the bottom of the find_kernel32 snippet I said to temporarily ignore? Specifically lea rbp, [__imp_tab_start] (we will cover the other instruction later). This loads the start of the import table into rbp (the base pointer register points to the base of our import table) right before find_export_dir. As we'll see, after we are done importing everything we want from kernel32.dll, we will jump back to find_export_dir with the base address of a new DLL in rbx and with rbp pointing to that DLL's table, and start the importing process all over again. To kick off the first iteration though, we have to initialize rbp to the import table entries for kernel32.dll. That is why we defined __imp_tab_start in our macro. kernel32 will always be the first DLL that we import from.

Before we move on to the snippet of code that searches our import table, let's consider why we used the lea instruction for loading __imp_tab_start. Why didn't we just use mov? It turns out that there's a subtle gotcha with loading absolute addresses in x64. What happens if our binary gets loaded at an unexpected base address? What if we wanted to use this code with some modifications in a DLL? If we were to use mov rbp, __imp_tab_start fasm would generate an instruction which loads the value of __imp_tab_start computed from the preferred module load address. This would only be valid if we were loaded at our preferred address every time. While being loaded elsewhere is somewhat rare for exe files, sometimes life isn't fair, and to deal with data loads like this in the face of different base addresses PE files can contain a .reloc section. The data in this section is used by the Windows loader to patch up all instances of absolute addresses like this to be correct in the event of a different base address (using one of the obscure relocation entry types). However, we don't like lots of sections because we're trying to keep a low profile.

The solution to this is to use lea instead, as fasm will generate an instruction which performs a RIP-relative address calculation: it computes the address of __imp_tab_start as an offset from the current instruction pointer rather than as an absolute address. This will always yield the correct address no matter where we're loaded.

Using the excellent x64dbg debugger, we can see how RIP-relative loads are encoded. The opcode that fasm generated for our lea consists of the bytes in the screenshot below:

RIP-relative instruction

x64dbg displays the table address that is loaded into rbp as being 0x4010e8. Looking at the bytes of the opcode, we don't see this constant in the instruction stream. Instead we see the 32 bit constant 0x000000d3 which is part of the instruction opcode. How is our table address calculated? The answer is quite simple: 0x401015 + 0xd3 = 0x4010e8 because rip always points to the instruction that is after the one currently being executed. What about if we had used mov?

Absolute addressing

The difference is immediately obvious as the bytes of the address are absolutely encoded and now dependent on the module's load address.

Finally, how did we figure out the offset of the import table when we dumped it in WinDbg above?

> u $exentry ;* Unassemble a few instructions from the entry point
00000000`00401000 488b1c24        mov     rbx,qword ptr [rsp]
00000000`00401004 6631db          xor     bx,bx
00000000`00401007 4881eb00000100  sub     rbx,10000h
00000000`0040100e 488d2dd3000000  lea     rbp,[import+0x10e8 (00000000`004010e8)]
00000000`00401015 4883ec28        sub     rsp,28h
00000000`00401019 8b433c          mov     eax,dword ptr [rbx+3Ch]
00000000`0040101c 8b840388000000  mov     eax,dword ptr [rbx+rax+88h]
00000000`00401023 4801d8          add     rax,rbx

We can see our lea instruction and the address of the first table entry. However, we initialize kernel32's rbp to point at the first entry of the table, which contains the hash we want, rather than at the size byte, string, and padding that come before it. If we wanted, we could modify the macro to not generate these for kernel32 to save a few bytes as they aren't needed. To find the true starting address of the table we need to subtract the size byte's value (0x0a for kernel32) from the address lea'd into rbp, which gives the 00000000`004010de that we dumped earlier.
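
Both bits of arithmetic in this aside are easy to double check. Here is a Python sketch that decodes the displacement from the seven bytes of the lea shown in the disassembly and then backs up from the first entry to the start of the table. rip_relative_target is a name invented for this example and it only understands this one instruction form.

```python
import struct

def rip_relative_target(insn_addr: int, insn_bytes: bytes) -> int:
    """Address computed by a 7-byte `lea r64, [rip + disp32]` located at insn_addr."""
    assert len(insn_bytes) == 7 and insn_bytes[1] == 0x8D   # REX prefix, then the lea opcode
    assert insn_bytes[2] & 0xC7 == 0x05                     # ModRM mod=00, rm=101: RIP-relative
    (disp32,) = struct.unpack_from("<i", insn_bytes, 3)     # signed little-endian displacement
    next_insn = insn_addr + len(insn_bytes)                 # rip already points past the lea
    return next_insn + disp32

lea = bytes.fromhex("488d2dd3000000")    # bytes of the lea in the disassembly above
first_entry = rip_relative_target(0x40100E, lea)
size_byte = 0x0A                         # first byte of the kernel32 table in the dump

print(hex(first_entry))                  # 0x4010e8
print(hex(first_entry - size_byte))      # 0x4010de
```

Feeding the same seven bytes a different instruction address yields a correspondingly shifted target, which is the whole point: nothing in the encoding depends on where the module was loaded.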

Back into the fray

Now let's get back to comparing the hash of the current export with the hashes in our import table. To get to this portion of the code you should note that eax would be zero due to the null byte of the current function export string causing the jnz .djb2 to fall through.

    ; edx contains the computed hash of the current exported function name.

    ; Load the import table base for the current module into rsi
    mov rsi, rbp

; Iterate over every entry in the import table to see if edx is something we're
; interested in
.find_table_entry:
    ; lodsq loads 8 bytes into rax from rsi and adds 8 to rsi (pointing to the next hash
    ; in our import table). There is no need to zero rax because it's already zero'd.
    ; What do you think lodsd does?
    lodsq

    ; Have we hit the last entry of the import table for our current module?
    ; If so that means that the hash in edx isn't of any interest to us
    test eax, eax 
    jz .next_export  ; Jump over populating the current import directory with edx

    ; If we haven't hit the last entry of our module's import list, does the hash of
    ; the current entry in our import table match the one in edx? 
    cmp eax, edx
    jne .find_table_entry

; The current import table entry matches the hash in edx computed from our DLL's 
; export directory. We need to get the address of the function corresponding 
; to the name of the current export string
    ; We don't want any leftover junk in the upper bits of eax. We can eliminate
    ; this extraneous instruction and save a byte using the movzx instruction...
    ; can you figure out how? ;)
    xor eax, eax

    ; Get the index into AddressOfNameOrdinals from rcx which is the current
    ; index into AddressOfNames. Note that AddressOfNameOrdinals is an array 
    ; of 16 bit values
    mov ax, [r15 + rcx*2]

    ; Get the RVA of the function we want to import from AddressOfFunctions
    ; using the correct value from AddressOfNameOrdinals
    mov eax, [r13 + rax*4]

    ; Turn the RVA into a VA. rax now contains the exact function pointer
    ; we want to put into our table
    add rax, rbx

    ; The lodsq instruction over-incremented rsi by 8 after we loaded the 
    ; hash of the import entry we want. That's easy to fix by just
    ; subtracting eight from rsi before we put the function address
    ; in rax into the memory location. This populates the import table
    ; entry with the qword function address and overwrites the unneeded hash.
    mov [rsi - 8], rax

; Go to next export
.next_export:
    ; Increment the index into the array of exported function names
    inc rcx

    ; If we haven't exceeded NumberOfNames then hash and compare the next
    ; export
    cmp rcx, r12
    jl find_exports

    ; If we don't jump back to find_exports we'll fall through to load the next DLL in our
    ; import table...

At this point our code will completely fill up our import table for kernel32 with the addresses of the requested exports. If the code falls through the jl find_exports (because we've looked through every export in the current DLL) we have to check to see if there is another DLL that we need to import from and if so, load that DLL and switch rbp to point to the correct table of hashes to search for/populate into.
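
The whole lookup boils down to two array indirections. The sketch below models it in Python with made-up export data (the names, ordinals, RVAs and base address are all hypothetical), so treat it as an illustration of the data flow rather than a translation of the assembly. As in the real table, each entry starts out as a hash and is overwritten in place with the function's address.

```python
def hash_name(name: str) -> int:
    """djb2 over the ASCII name and its null terminator, as the import code does."""
    h = 5381
    for c in name.encode("ascii") + b"\x00":
        h = (h * 33 + c) & 0xFFFFFFFF
    return h

def populate(table, base, names, name_ordinals, functions):
    """Overwrite every matching hash in `table` with base + RVA of the export."""
    for i, name in enumerate(names):              # rcx indexes AddressOfNames
        h = hash_name(name)                       # edx
        for slot, entry in enumerate(table):      # lodsq over our import table
            if (entry & 0xFFFFFFFF) == h:         # cmp eax, edx
                ordinal = name_ordinals[i]        # mov ax, [r15 + rcx*2]
                rva = functions[ordinal]          # mov eax, [r13 + rax*4]
                table[slot] = base + rva          # add rax, rbx / mov [rsi - 8], rax
                break                             # fall through to the next export
    return table

# Hypothetical DLL: note that the ordinals are deliberately not 0, 1, 2
base = 0x7FF800000000
names = ["Alpha", "Beta", "Gamma"]
name_ordinals = [2, 0, 1]
functions = [0x1000, 0x2000, 0x3000]

table = [hash_name("Gamma"), hash_name("Beta")]
populate(table, base, names, name_ordinals, functions)
print([hex(entry) for entry in table])
```

Indexing AddressOfFunctions directly with the name index would hand back the wrong function for every name in this example, which is why the detour through AddressOfNameOrdinals matters.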

When we encounter the string of a new DLL to import (such as "user32") we will pass it to the LoadLibraryA function which takes a DLL name and loads it into our process' address space. Notice how we didn't include the DLL file extension in our strings to make them a bit smaller. This is optional and works because of the following passage from the LoadLibraryA documentation:

If no file name extension is specified in the lpFileName parameter, the default library extension .dll is appended. However, the file name string can include a trailing point character (.) to indicate that the module name has no extension. When no path is specified, the function searches for loaded modules whose base name matches the base name of the module to be loaded. If the name matches, the load succeeds. Otherwise, the function searches for the file.

Before we get to the code, we have to learn about the sub rsp, 8*5 instruction in find_kernel32. Because we are going to be calling the LoadLibraryA Windows API function in our import code, we have to follow the conventions of the Windows x64 ABI. This dictates that the stack be aligned to a 16 byte boundary at the point of a call, and that the caller reserve 32 bytes of stack space for the callee to use. After KERNEL32!BaseThreadInitThunk calls our entry point, the return address it pushed leaves the stack misaligned by 8 bytes. Subtracting 8*5 = 40 bytes covers the 32 reserved bytes plus the 8 bytes needed to make the stack pointer a multiple of 16 again. If we didn't align the stack then our call to LoadLibraryA could fail.
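
The arithmetic is quick to check. In this Python sketch the entry value of rsp is hypothetical; the only property that matters is that it is 8 modulo 16, which is what a call into our entry point from an aligned caller leaves behind.

```python
SHADOW_SPACE = 32   # bytes the Windows x64 convention has the caller reserve

def rsp_after_reserve(rsp_at_entry: int, reserve: int) -> int:
    """Stack pointer after `sub rsp, reserve` (the stack grows downwards)."""
    return rsp_at_entry - reserve

entry_rsp = 0x14FF58    # hypothetical: the pushed return address leaves rsp at 8 mod 16

only_shadow = rsp_after_reserve(entry_rsp, SHADOW_SPACE)
ours = rsp_after_reserve(entry_rsp, 8 * 5)

print(entry_rsp % 16)    # 8
print(only_shadow % 16)  # 8
print(ours % 16)         # 0
```

Reserving just the 32 bytes would leave the stack exactly as misaligned as it was on entry, so the fifth qword is there purely for alignment.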

Because we load (if the module doesn't exist in our address space already) DLLs with LoadLibraryA it is necessary to include that import in the list of imports from kernel32. This is a requirement because we refer to the LoadLibraryA label in the importing code itself so fasm has to know where the address to that function is. This also means that kernel32 needs to be the first DLL we import from in our series of invocations of the use macro. LoadLibraryA is only invoked after the import entries for kernel32 have been populated so this causes no issues. If we only wanted to import things from kernel32 then we wouldn't need to use LoadLibraryA in the first place. If we wanted we could have the use macro insert this hash and define a label into every kernel32 table instead of imposing this requirement.

Now let's get to the final piece of code that we need to set up the importing process for the next DLL we want to import:

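    ; rcx is the repeat counter for the repnz below. It still holds NumberOfNames
    ; from the export loop above, which is more than enough iterations to reach
    ; the table terminator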
    ; Move the current import table base into rdi to use with the following
    ; scasq instruction 
    mov rdi, rbp

    ; Zero rax as it's the value that scasq compares against
    xor eax, eax

    ; scasq will load a qword from the address in rdi and compare it to
    ; the value in rax. rdi will be incremented by 8. If the qword is
    ; equal to rax then the zero flag will be set. repnz will keep executing
    ; the scasq instruction until a qword equal to rax (0) is found and the
    ; zero flag is set (repeat while not zero)
    ; This is why we defined the table terminator qword in our macro. We are
    ; looping through all the fully imported import table entries (function
    ; addresses) until we hit our terminator. This is so that we can switch
    ; to the table of import hashes for the next DLL.
    repnz scasq

    ; The final scasq incremented rdi over the qword table terminator for our previous 
    ; import table. What does it point to now? It's either one of two things: the 
    ; byte before the ASCII name of the next DLL to import from, or a null byte.
    ; Where did this null byte come from? The import_end macro we defined above.
    ; We have to move the pointer in rdi that we used for scasq into rsi because we
    ; want to use the lodsb instruction
    mov rsi, rdi

    ; Loads al with the number of bytes to the first import table entry for the next
    ; DLL and increments rsi over size byte OR loads al with zero signalling the 
    ; end of our imports. Remember `db @f - $, dll, 0` defines the number of bytes
    ; to jump forward to the first hash in our import table for this DLL
    lodsb

    ; Are we done importing (null terminator defined by import_end)?
    test al, al
    jz done_importing

    ; We aren't done importing so lets make the next table base pointer (rbp) 
    ; equal to the current location in rsi added to the size byte that we loaded
    ; into rax with lodsb. We put that size byte in there because it's so easy to
    ; jump over the name of the DLL we want to import. The DLL name could be
    ; a large amount of characters and it takes up much less space to just have a
    ; byte telling us how much to jump over to the first table entry than writing
    ; instructions for scanning until the null byte of the DLL name string
    lea rbp, [rsi + rax - 1]  ; rax contains size byte

    ; rsi points to the ASCII DLL name we want to load because lodsb incremented
    ; the pointer over the size byte. We are moving it into rcx because it's the 
    ; calling convention to place the first argument (in this case the only
    ; argument) into rcx
    mov rcx, rsi

    ; This is why we always need to include LoadLibraryA in our list of imports
    ; from kernel32 -- we use it to load the rest of the DLLs that we want
    ; to import from. Then we just walk their headers to their export directory
    ; just like we did with kernel32.dll. Note how we are calling the address
    ; of the function stored in the LoadLibraryA import table entry that we
    ; just populated
    call [LoadLibraryA]

    ; The loaded module base address is returned in rax so we want to move it into
    ; rbx because we want to walk the headers in find_export_dir and compute VAs
    ; from RVAs
    mov rbx, rax

    ; Let's do it all over again with the base address of the DLL we just loaded
    jmp find_export_dir

; Wow! We're done importing everything. Now we can call main/start the real program
done_importing:
    ; Do something interesting...

And that's it! We will jump back to find_export_dir and start the process all over again with another DLL. When we look for another DLL and hit the zero of import_end we jump to done_importing where we can start writing our actual code. When we want to call any of our imports we just call through the import table address like call [ImportFromDll] and everything will work if the import is valid. The final code listing below contains a small example of using some imports from different DLLs.
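
As a sanity check on the table-walking logic, here is a Python sketch that walks the 0x66 bytes we dumped earlier the same way the code moves from one DLL's table to the next. walk_table is a name made up for this example, and it reads an unpopulated table, so every entry is still a hash.

```python
import struct

# The 0x66 bytes from the `db 00000000`004010de L 66` dump shown earlier
DUMP = bytes.fromhex(
    "0a 6b 65 72 6e 65 6c 33 32 00 0c bf e7 b7 00 00 "
    "00 00 4a 9d 70 74 00 00 00 00 5b 10 be 57 00 00 "
    "00 00 ff 1e 69 b5 00 00 00 00 5e a7 8f a4 00 00 "
    "00 00 00 00 00 00 00 00 00 00 08 75 73 65 72 33 "
    "32 00 bb 6b ac 1f 00 00 00 00 34 ab 31 42 00 00 "
    "00 00 67 ef 07 a5 00 00 00 00 00 00 00 00 00 00 "
    "00 00 00 00 00 00"
)

def walk_table(data: bytes):
    """Return [(dll_name, [hashes])] for an unpopulated import table."""
    result = []
    pos = 0
    while data[pos] != 0:                       # lodsb / test al, al / jz done_importing
        size = data[pos]                        # bytes from the size byte to the first entry
        name_end = data.index(0, pos + 1)       # the DLL name is null terminated
        name = data[pos + 1:name_end].decode("ascii")
        entry = pos + size                      # lea rbp, [rsi + rax - 1]
        hashes = []
        while True:
            (qword,) = struct.unpack_from("<Q", data, entry)
            entry += 8                          # scasq always steps past what it compared
            if qword == 0:                      # the `dq 0` terminator for this DLL
                break
            hashes.append(qword & 0xFFFFFFFF)   # low dword is the hash
        result.append((name, hashes))
        pos = entry                             # now at the next size byte or import_end
    return result

for dll, hashes in walk_table(DUMP):
    print(dll, [hex(h) for h in hashes])
```

It finds five hashes for kernel32 and three for user32, then stops on the zero byte written by import_end.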

Wrapping up

If you made it this far, congratulations! Building tiny binaries is a lot of fun and there are many other techniques that we can employ. Our importing code ended up being 171 bytes of machine code — not bad for doing everything it does.

> db $exentry L AB
00000000`00401000  48 8b 1c 24 66 31 db 48-81 eb 00 00 01 00 48 8d  H..$f1.H......H.
00000000`00401010  2d d3 00 00 00 48 83 ec-28 8b 43 3c 8b 84 03 88  -....H..(.C<....
00000000`00401020  00 00 00 48 01 d8 44 8b-68 1c 44 8b 70 20 44 8b  ...H..D.h.D.p D.
00000000`00401030  78 24 44 8b 60 18 49 01-dd 49 01 de 49 01 df 31  x$D.`.I..I..I..1
00000000`00401040  c9 41 8b 34 8e 48 01 de-ba 05 15 00 00 89 d0 c1  .A.4.H..........
00000000`00401050  e2 05 01 c2 31 c0 ac 01-c2 85 c0 75 f0 48 89 ee  ....1......u.H..
00000000`00401060  48 ad 85 c0 74 17 39 d0-75 f6 31 c0 66 41 8b 04  H...t.9.u.1.fA..
00000000`00401070  4f 41 8b 44 85 00 48 01-d8 48 89 46 f8 48 ff c1  OA.D..H..H.F.H..
00000000`00401080  4c 39 e1 7c bc 48 89 ef-31 c0 f2 48 af 48 89 fe  L9.|.H..1..H.H..
00000000`00401090  ac 84 c0 74 16 48 8d 6c-06 ff 48 89 f1 ff 15 55  ...t.H.l..H....U
00000000`004010a0  00 00 00 48 89 c3 e9 6e-ff ff ff                 ...H...n...

There are definitely ways of optimizing for size to shave some bytes off in the code as well as how we lay out our data in the import table but this post has gone on for too long already.

Hope you enjoyed it! If you have any comments or suggestions I would love to hear from you.

Final code

You should now be able to understand the listing of the final code below:

format PE64 GUI
entry find_kernel32

find_kernel32:
    .alloc_granularity = 10000h

    ; Obtain base of kernel32.dll in rbx
    mov rbx, [rsp]  ; return address to kernel32 loader function
    xor bx, bx      ; align address to 64k boundaries (allocation granularity)
    sub rbx, .alloc_granularity

    lea rbp, [__imp_tab_start]
    sub rsp, 8*5    ; reserve stack for API use and make stack dqword aligned

find_export_dir:
    .e_lfanew   = 3Ch
    .data_dir_0 = 88h

    ; Obtain IMAGE_EXPORT_DIRECTORY address in rax
    mov eax, [rbx + .e_lfanew]
    mov eax, [rbx + rax + .data_dir_0]
    add rax, rbx

    .export_names_num  = 18h
    .export_funcs_addr = 1Ch
    .export_names_addr = 20h
    .export_ords_addr  = 24h

    ; Obtain info from IMAGE_EXPORT_DIRECTORY
    mov r13d, [rax + .export_funcs_addr] ; AddressOfFunctions
    mov r14d, [rax + .export_names_addr] ; AddressOfNames
    mov r15d, [rax + .export_ords_addr]  ; AddressOfNameOrdinals
    mov r12d, [rax + .export_names_num]  ; NumberOfNames
    add r13, rbx
    add r14, rbx
    add r15, rbx

    xor ecx, ecx
  find_exports:
    mov esi, [r14 + rcx*4] ; use rcx to index AddressOfNames getting cur func name RVA
    add rsi, rbx

  ; hash export name to compare to imp tbl entry
    mov edx, 5381 ; edx contains hash
  .djb2:
    mov eax, edx
    shl edx, 5
    add edx, eax
    xor eax, eax
    lodsb
    add edx, eax
    test eax, eax
    jnz .djb2

  ; locate imp table entry
    mov rsi, rbp
  .find_table_entry:
    lodsq
    test eax, eax
    jz .next_export        ; we have hit the table's null terminator
    cmp eax, edx
    jne .find_table_entry

    xor eax, eax           ; not needed if eax is already 0
    mov ax, [r15 + rcx*2]  ; get export's index into AddressOfFunctions
    mov eax, [r13 + rax*4] ; get export address rva from AddressOfFunctions
    add rax, rbx
    mov [rsi - 8], rax     ; populate imp table entry with address

  ; go to next export
  .next_export:
    inc rcx
    cmp rcx, r12
    jl find_exports

  ; rcx big enough because size of # exports
    mov rdi, rbp
    xor eax, eax
    repnz scasq      ; find next dll table
    mov rsi, rdi
    lodsb            ; increments rsi over size byte 
    test al, al      ; the increment is why `cmp byte [rdi], 0` (3 bytes too) isnt used
    jz done_importing
    lea rbp, [rsi + rax - 1]  ; rax contains size byte
    mov rcx, rsi
    call [LoadLibraryA]
    mov rbx, rax
    jmp find_export_dir

done_importing:
    ; call main

    xor r9d, r9d
    lea r8, [_msg]
    lea rdx, [_msg]
    xor ecx, ecx
    call [MessageBoxA]

    xor ecx, ecx
    call [ExitProcess]

  _msg db 'hello tiny world!', 0

  macro import_start {}
  macro import_end { dd 0 }

  macro use dll, [imp_name] {
      common
        imp_num = 0

        db @f - $, dll, 0
        align 8
        @@:

      forward
          local imp, imp_len, i, h, c
          virtual at 0
              imp::
                  db `imp_name
                  db 0
              imp_len = $ - 1
          end virtual

          ; hash using djb2
          i = 0
          h = 5381
          while i <= imp_len
              load c byte from imp:i
              h = ((h shl 5) + h + c) mod 0x100000000
              i = i + 1
          end while

          if dll eq 'kernel32' & imp_num = 0
          end if

          label imp_name:qword
            dd h, 0

          imp_num = imp_num + 1

      common
          dq 0  ; terminate table entry
  }

  ; 115 bytes table, 172 bytes import asm
  ; equivalent default imp table is 358 bytes saving 71 bytes already
  ; with this small table
    use 'kernel32',\

    use 'user32',\