Upon reading the title of this article, one might pose the initial question: what would an ARM-based operating system do with an x86 instruction? Or a chunk of x86 instructions? Or an entire x86 binary? Windows 10 on ARM, for example, handles this by taking a set of x86 instructions such as the one below:
push ebp
mov ebp,esp
pop ebp
nop
jmp ntdll_775d0000!LdrInitializeThunk
and translating them to the following ARM64 instructions:
str wfp,[x28,#-4]! // push ebp
mov wfp,w28 // mov ebp,esp
ldr wfp,[x28],#4 // pop ebp
add w9,w9,#0x83,lsl #0xC
add w9,w9,#0x1FE
bl 00000000`03109aa8 // (get jump function address)
br xip1 // jmp ntdll_775d0000!LdrInitializeThunk
First off, ARM and x86 are completely different architectures. An ARM processor is incapable of executing x86 code and the hardware provides no means to do so. This leaves the task up to software developers to facilitate it themselves.
Microsoft's recent version of Windows 10 for ARM-based processors assumes such a task, by simulating an x86 processor entirely in userland. An emulator module (xtajit.dll) employs a form of just-in-time (JIT) translation to convert x86 code to ARM (shown above) within a loop, as the x86 process is executing. On each pass, a chunk of x86 code is translated to ARM, and the translation is executed.
All of this, as you might have guessed, can make running x86 programs comparatively slow. However, a cache of already-translated code (located in C:\Windows\XtaCache) eliminates much of the overhead. A compiler (xtac.exe) and background caching service (XtaCache) handle full binary translation and caching. Hybrid binaries (located in C:\Windows\SyChpe32) containing x86-to-ARM stubs also help to reduce overhead.
In this article, I present what I believe are five key features of x86 emulation, concluding with an example of the raw opcode translation procedure.
CHPE DLLs
A peculiar system directory exists on Windows 10 for ARM. Located at C:\Windows\SyChpe32, this folder holds a small set of DLL files with the same names as the most frequently used libraries on Windows: ntdll.dll, kernel32.dll, advapi32.dll, etc. Several interesting characteristics are observable in these types of files, including a mixture of x86 and ARM functions and the presence of a strange section name in the header.
.hexpthk
When we first begin investigating a CHPE file (ntdll.dll), we discover a new section type in the header:
000001e0: 00 00 00 00 00 00 00 00 2e 68 65 78 70 74 68 6b .........hexpthk
000001f0: e0 53 00 00 00 10 00 00 00 54 00 00 00 04 00 00 .S.......T......
00000200: 00 00 00 00 00 00 00 00 00 00 00 00 20 00 00 60 ............ ..`
00000210: 2e 74 65 78 74 00 00 00 c2 5d 17 00 00 70 00 00 .text....]...p..
.hexpthk possibly stands for Hybrid Executable Push Thunk. The primary purpose of the 1300+ functions in this section is to provide a set of x86 jump stubs for the library's native ARM APIs. This eliminates the need for JIT translation or XTA cache file access, thus reducing a bit of overhead.
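To make the header bytes above easier to read, here is a minimal C sketch that decodes them as a standard 40-byte PE section header. The struct and function names are illustrative and not taken from any Windows header; the byte array is copied from the dump, starting at file offset 0x1e8.

```c
#include <stdint.h>
#include <string.h>

/* The section header fields of interest (names are illustrative). */
typedef struct {
    char     name[9];              /* 8-byte name plus a terminating NUL */
    uint32_t virtual_size;
    uint32_t virtual_address;
    uint32_t size_of_raw_data;
    uint32_t pointer_to_raw_data;
    uint32_t characteristics;
} SectionInfo;

/* Read a little-endian 32-bit value byte by byte (host-endian independent). */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decode one 40-byte PE section header. */
static SectionInfo parse_section_header(const uint8_t *hdr)
{
    SectionInfo s;
    memcpy(s.name, hdr, 8);
    s.name[8] = '\0';
    s.virtual_size        = read_le32(hdr + 8);
    s.virtual_address     = read_le32(hdr + 12);
    s.size_of_raw_data    = read_le32(hdr + 16);
    s.pointer_to_raw_data = read_le32(hdr + 20);
    s.characteristics     = read_le32(hdr + 36);
    return s;
}

/* The 40 bytes at file offset 0x1e8 in the dump above. */
static const uint8_t hexpthk_header[40] = {
    0x2e, 0x68, 0x65, 0x78, 0x70, 0x74, 0x68, 0x6b,   /* ".hexpthk" */
    0xe0, 0x53, 0x00, 0x00, 0x00, 0x10, 0x00, 0x00,
    0x00, 0x54, 0x00, 0x00, 0x00, 0x04, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x20, 0x00, 0x00, 0x60
};
```

Calling parse_section_header(hexpthk_header) gives a virtual size of 0x53e0, an RVA of 0x1000, and characteristics of 0x60000020, which is the usual combination of the code, execute, and read flags. In other words, apart from its name, .hexpthk is described to the loader like an ordinary code section.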
One of the stubs in this section:
ntdll_773e0000!EXP+#RtlCreateUserProcess:
773e2520 8bff mov edi,edi
773e2522 55 push ebp
773e2523 8bec mov ebp,esp
773e2525 5d pop ebp
773e2526 90 nop
773e2527 e9a4620d00 jmp ntdll_773e0000!#RtlCreateUserProcess (774b87d0)
Which jumps to:
ntdll_773e0000!#RtlCreateUserProcess:
00000000`774b87d0 29ba7bfd stp wfp,wlr,[sp,#-0x30]!
00000000`774b87d4 910003fd mov fp,sp
00000000`774b87d8 52800028 mov w8,#1
...
00000000`774b8810 94000004 bl ntdll_773e0000!#RtlCreateUserProcessEx (00000000`774b8820)
00000000`774b8814 28c67bfd ldp wfp,wlr,[sp],#0x30
00000000`774b8818 d65f03c0 ret
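The jump from the x86 stub into the ARM64 function can be checked by hand. The short C sketch below (the function name is my own) applies the standard rule for an x86 "jmp rel32": the 32-bit displacement is added to the address of the instruction that follows the jmp.

```c
#include <stdint.h>

/* Target of an x86 "jmp rel32" (opcode 0xE9). The displacement is relative
 * to the address of the NEXT instruction, i.e. the jmp address plus 5. */
static uint32_t jmp_rel32_target(uint32_t jmp_address, const uint8_t insn[5])
{
    uint32_t rel32 = (uint32_t)insn[1] | ((uint32_t)insn[2] << 8) |
                     ((uint32_t)insn[3] << 16) | ((uint32_t)insn[4] << 24);
    return jmp_address + 5u + rel32;   /* wraps modulo 2^32, as EIP does */
}
```

For the stub above, the bytes e9 a4 62 0d 00 at 0x773e2527 give 0x773e252c + 0x000d62a4 = 0x774b87d0, which is the address of #RtlCreateUserProcess in the ARM64 listing.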
push_thunk and pop_thunk
This leads us to another collection of special CHPE functions (in the .text section): the push_thunk and the pop_thunk. These attempt to fetch the translated function from the JIT cache buffer or XTA cache file. Ntdll.dll contains a total of 282 push_thunk and 1829 pop_thunk functions. Below is a disassembly snippet of #NtAllocateVirtualMemory$push_thunk:
ntdll_773e0000!#NtAllocateVirtualMemory$push_thunk:
00000000`77525630 a9bb53f3 stp x19,x20,[sp,#-0x50]!
...
00000000`77525674 900002aa adrp x10,ntdll_773e0000!_os_wowa64_dispatch_call (00000000`77579000)
00000000`77525678 b9400150 ldr wip0,[x10]
00000000`7752567c b0000188 adrp x8,ntdll_773e0000!NtAccessCheck (00000000`77556000)
00000000`77525680 11060109 add w9,w8,#0x180
00000000`77525684 d63f0200 blr xip0
00000000`77525688 d503201f nop
00000000`7752568c 37f80085 tbnz x5,#0x1F,ntdll_773e0000!#NtAllocateVirtualMemory$push_thunk+0x6c (00000000`7752569c)
00000000`77525690 900002b0 adrp xip0,ntdll_773e0000!_os_wowa64_dispatch_call (00000000`77579000)
00000000`77525694 b9402210 ldr wip0,[xip0,#0x20]
00000000`77525698 d63f0200 blr xip0
00000000`7752569c b9400ba8 ldr w8,[fp,#8]
...
00000000`775256c4 d65f03c0 ret
XTA cache (.jc) files
We notice a number of XTA cache (.jc) files in the C:\Windows\XtaCache directory whose filenames correspond to several common DLLs:
ACWINRT.DLL.8CB3A2AB47A53C8A2B0154CD9DCFBAB3.323EB21400CAFAA0582C5009A75869C1.mp.1.jc
APPHELP.DLL.95D0BAD69AB6222384AF242317B56149.DF78C05157184445FAFBB6CE538964A5.mp.1.jc
BCRYPTPRIMITIVES.DLL.013B0F82E47EF5A5FCF5A8526296185B.EF8F27FC52B74ACB9AC3BC89FA928CE1.mp.1.jc
CFGMGR32.DLL.5D4A7278FAF3DBBDBF2DDB915FCAFAB5.F945F9A53B0D60BD9E78255821B9CF3C.mp.1.jc
CMD.EXE.82E4EB063821519C52FE2943A99BF400.1D875ED0DCA07279C17651C0E6C07D68.mp.1.jc
COMBASE.DLL.4F0AC5980CD35BD10454EEE975995AC5.27FF208B6378EEC51D8E744BAF092F94.mp.1.jc
CRYPT32.DLL.C5DAC220D689F508F0E17E2A66CA3F1A.320EA93B91430A11C07BCF035279C389.mp.1.jc
CRYPTBASE.DLL.16CA22D6D04B8B1996C66338057F9FBF.385D1574AF589DD7DBE99097ADFA13A7.mp.1.jc
DEVOBJ.DLL.C27E6011E5F5F63CCBE4C1EDDE39AB82.8F5C03EA8E04FE31C38209807883CEFF.mp.1.jc
KERNEL32.DLL.834794B203AF374A9362DB4ED6A773E1.DDEEC7202859318755E42D071DBE493B.mp.1.jc
KERNELBASE.DLL.FD274E0B2908F4427DD5E105D6793B46.DB966B70C90268F5B3A22AF2FFD62FB9.mp.1.jc
MMDEVAPI.DLL.4095F18EA4B2E4980A31F9CFF10ACB18.F3FFB7D241CFAF41DBF7A8EB1FA2F14C.mp.1.jc
MSASN1.DLL.592DF648596867992875D9F422985BE2.C91F12FC8B8289618C00A6A0B4CB63F6.mp.1.jc
MSVCP_WIN.DLL.7A9A3D8939BA215908F2D277F6036868.059202EDB6FBBF9C5D3CEA601F657783.mp.1.jc
MSVCRT.DLL.D751099CF1900AB3B0B21169E3ACAEA1.D32B8BAE88BDFA4E405C93DE74A58C58.mp.1.jc
NTDLL.DLL.25E64FDB4C4531A2DA3649AF45082708.A8B2D871AD511568138A61A746F3477E.mp.1.jc
SECHOST.DLL.B4519BA08D878093477A68086F2C73E9.6B8F8A4E1C74BF41ECB65F5BC09C99AE.mp.1.jc
SSPICLI.DLL.8A3767BF366F5B126EEE0A97F09F3821.177E02DDE1B9F225A2E156445C8C26C6.mp.1.jc
UCRTBASE.DLL.FDEE79B96ED912C4BECA94A2153A8DA6.2B41C9D2592D756F6DA93B24EBBAB8BB.mp.1.jc
WINBRAND.DLL.0B7D029DE53AA3517BAAE21FFF2958A2.080F7128AEA24497BE12FC799A471BCA.mp.1.jc
WINTRUST.DLL.36F3317E5E83FB1D21A49E8598B82079.1A4137576B1F56F2356651AA77E6C162.mp.1.jc
WLDP.DLL.52F5CBD679E57BDD90519D96701B1362.42321841F03DF5F906568D1916ABD9FC.mp.1.jc
.jc files have a header:
0:000:x86> db 07640000
07640000 58 54 41 43 13 00 00 00-00 00 00 00 48 0f 01 00 XTAC........H...
07640010 1f 01 00 00 38 00 00 00-12 00 00 00 40 18 01 00 ....8.......@...
07640020 64 00 00 00 50 00 00 00-50 00 00 00 50 a1 00 00 d...P...P...P...
07640030 d4 a1 00 00 ec 0e 01 00-4e 00 54 00 44 00 4c 00 ........N.T.D.L.
07640040 4c 00 2e 00 44 00 4c 00-4c 00 00 00 00 00 00 00 L...D.L.L.......
07640050 42 4c 43 4b e4 0e 01 00-00 00 00 00 00 00 00 00 BLCK............
07640060 bf 39 03 d5 c0 03 5f d6-00 00 40 79 fd ff ff 17 .9...._...@y....
07640070 20 00 40 79 fb ff ff 17-c0 00 40 79 f9 ff ff 17 .@y......@y....
.jc header format
Below is the format for the cache file header:
XTAC header
0x00 DWORD 'XTAC'
0x04 DWORD Always 0x13
0x08 DWORD BOOL? Most files set to 0, some are 1
0x0c DWORD Offset of JC address pair table
0x10 DWORD Number of JC address pairs (two DWORDs = 8 bytes per pair)
0x14 DWORD Offset of module name
0x18 DWORD Length of module name (in bytes)
0x1c DWORD Offset of module NT path name (usually at very end of file)
0x20 DWORD Length of module NT path name (in bytes)
0x24 DWORD Offset of BLCK stubs
0x28 DWORD ?
0x2c DWORD Size of BLCK stubs (always 0xa150)
0x30 DWORD ?
0x34 DWORD ?
0x38 WCHAR[] usually start of module name
XTA cache (.jc) files are composed of a JC address pair table and translated x86-to-ARM code. Entries in this pointer table each contain two relative virtual addresses: the RVA of the original function in the x86 binary and the RVA of the translated function in the cache file, in that order.
Pointer table entries do not always point to the start of a function. They often point to return addresses within a function, i.e., the address of the instruction immediately following a call instruction. Hence multiple address pairs often exist for a single function.
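The header layout and the address pair table can be sketched in C as follows. This is only an illustration under the assumptions in the table above: the struct, field, and function names are mine, the unknown fields are skipped, and the lookup is a plain linear scan because I have not established whether the table is sorted.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Header fields from the table above; the field names are illustrative. */
typedef struct {
    uint32_t version;         /* 0x04: 0x13 in the files examined */
    uint32_t pair_table_off;  /* 0x0c: offset of JC address pair table */
    uint32_t pair_count;      /* 0x10: number of address pairs */
    uint32_t name_off;        /* 0x14: offset of module name */
    uint32_t name_len;        /* 0x18: length of module name in bytes */
    uint32_t nt_path_off;     /* 0x1c: offset of module NT path name */
    uint32_t nt_path_len;     /* 0x20: length of NT path name in bytes */
    uint32_t blck_off;        /* 0x24: offset of BLCK stubs */
} XtacHeader;

static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 0 on success, -1 if the buffer is too short or lacks the magic. */
static int parse_xtac_header(const uint8_t *buf, size_t len, XtacHeader *h)
{
    if (len < 0x28 || memcmp(buf, "XTAC", 4) != 0)
        return -1;
    h->version        = read_le32(buf + 0x04);
    h->pair_table_off = read_le32(buf + 0x0c);
    h->pair_count     = read_le32(buf + 0x10);
    h->name_off       = read_le32(buf + 0x14);
    h->name_len       = read_le32(buf + 0x18);
    h->nt_path_off    = read_le32(buf + 0x1c);
    h->nt_path_len    = read_le32(buf + 0x20);
    h->blck_off       = read_le32(buf + 0x24);
    return 0;
}

/* Linear scan of the pair table: each entry is {x86 RVA, translated RVA}.
 * Returns 0 and stores the translated RVA on a match, -1 otherwise. */
static int find_translated_rva(const uint8_t *table, uint32_t pair_count,
                               uint32_t x86_rva, uint32_t *translated_rva)
{
    for (uint32_t i = 0; i < pair_count; i++) {
        const uint8_t *entry = table + (size_t)i * 8;
        if (read_le32(entry) == x86_rva) {
            *translated_rva = read_le32(entry + 4);
            return 0;
        }
    }
    return -1;
}

/* First 0x28 bytes of the NTDLL.DLL cache file dumped above. */
static const uint8_t ntdll_jc_header[0x28] = {
    0x58, 0x54, 0x41, 0x43, 0x13, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x48, 0x0f, 0x01, 0x00,
    0x1f, 0x01, 0x00, 0x00, 0x38, 0x00, 0x00, 0x00,
    0x12, 0x00, 0x00, 0x00, 0x40, 0x18, 0x01, 0x00,
    0x64, 0x00, 0x00, 0x00, 0x50, 0x00, 0x00, 0x00
};
```

Applied to the dumped header, this yields a pair table at offset 0x10f48 holding 0x11f pairs. At 8 bytes per pair the table ends at 0x10f48 + 0x8f8 = 0x11840, which is exactly the offset of the NT path name, consistent with the path sitting at the very end of the file.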
Below is a translation of __mainCRTStartup from CMD.EXE.82E4EB063821519C52FE2943A99BF400.1D875ED0DCA07279C17651C0E6C07D68.mp.1.jc:
// __mainCRTStartup - translated ARM code from XTA cache file for C:\Windows\SysWOW64\cmd.exe
1bebc: 528b49bb mov w27, #0x5a4d // 'MZ'
1bec0: 51405926 sub w6, w9, #0x16, lsl #12
1bec4: 512374c6 sub w6, w6, #0x8dd
1bec8: 48dffcc7 ldarh w7, [x6]
1becc: 6b1b00e2 subs w2, w7, w27
1bed0: 54000761 b.ne 0x1bfbc // b.any
(...snip...)
Original (x86) __mainCRTStartup at 0x4168dd (base address 0x400000 + 0x168dd):
0x004168dd b84d5a0000 mov eax, 0x5a4d // 'MZ'
0x004168e2 663905000040. cmp word [0x400000], ax // [0x400000:2]=0xffff
0x004168e9 7555 jne 0x416940
(...snip...)
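Notice the equivalent "mov eax, 0x5a4d" and "mov w27, #0x5a4d" instructions. The memory operand of the cmp instruction ends up being split across four ARM instructions: two sub instructions compute the address, ldarh loads the 16-bit value, and subs performs the comparison. The two sub immediates add up to 0x16000 + 0x8dd = 0x168dd, the RVA of __mainCRTStartup, which suggests that w9 holds the x86 address of the function and that the result is the image base, 0x400000. The relative branch jne instruction, by contrast, is easily translated into a b.ne. With the cache file accessible, the emulator can then fetch a translated function if needed.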
So, the final question remains: how does the emulator (xtajit.dll) actually perform the translation?
XTA JIT (xtajit.dll)
It all begins in BTCpuSimulate, the beating heart of the x86 emulator. However, emulation doesn't start right away when an x86 process begins. The emulation module, xtajit.dll, has not yet been loaded. Windows automatically loads the native ARM ntdll.dll (C:\Windows\System32\ntdll.dll), the CHPE ntdll.dll (C:\Windows\SyChpe32\ntdll.dll), and the wow64 DLLs required for emulation. Before emulation begins, the xtajit function BTCpuProcessInit is called:
NTSTATUS BTCpuProcessInit(PWCSTR wImageName, PVOID pCpuThreadSize)
{
NTSTATUS status = GetProcessorPowerInformation();
if (!NT_SUCCESS(status))
return status; // Assumed: the failing status is passed back to the caller
InitializeCacheDatabase(); // sub_18000E770
/*...*/
// Gets Wow64InfoPtr from TEB
PTEB teb = GetCurrentTeb();
PTEB32 teb32 = (PTEB32)((PBYTE)teb + teb->Peb32Offset);
if (teb == teb32->SubSystemTib)
wow64InfoPtr = teb32->TlsSlots[10]; // Wow64InfoPtr
else
wow64InfoPtr = teb->TlsSlots[10]; // Wow64InfoPtr
wow64InfoPtr->CpuFlags |= 2; // WOW64_CPUFLAGS_SOFTWARE
/*...*/
// Done
return STATUS_SUCCESS;
}
Then the emulation starts in the BTCpuSimulate loop:
void BTCpuSimulate()
{
// Calls RtlWow64GetCurrentCpuArea (sub_180015530)
PWOW64_CPU_AREA cpuArea = GetWow64ArmCpuArea();
while (1)
{
/* This function is in the wow64 module. Cross-process items might
include calls to BTCpuNotifyMemory functions. Among those are:
BTCpuNotifyMemoryAlloc, BTCpuNotifyMemoryFree,
BTCpuNotifyMemoryProtect, BTCpuNotifyMemoryDirty */
Wow64ProcessPendingCrossProcessItems();
// Inside the CPU emulator
GetCurrentTeb()->TlsSlots[2] = TRUE;
if (gWow64CpuInfo->ProcessInitComplete)
dmb_ish(); // Wait for all memory accesses to finish
// Updates x86 registers in WOW64 ARM context structure (sub_180015038)
CpupSwitchToX86(
cpuArea->Wow64ArmContext, // Destination
cpuArea->Wow64Context // Source
);
// Emulates/executes x86 instruction (sub_1800215d0)
EmulateX86Function(
gWow64CpuInfo->BTProperties, // Initialized in BTCpuProcessInit
cpuArea->Wow64ArmContext // Same as 1st argument to CpupSwitchToX86
);
}
}
Registers for x86 are updated in CpupSwitchToX86, where a WOW64_CONTEXT structure inside a WOW64_CPU_AREA structure is updated with the newest values of each register from a source WOW64_CONTEXT. But the function perhaps deserving the most attention in this article is a function I have named EmulateX86Function.
EmulateX86Function
EmulateX86Function, not surprisingly, is a very large and very complex function, from which flow several more very large and very complex functions. Not surprisingly, because one would generally assume that converting between two completely different CPU architectures involves quite a bit of work. X86 instructions must be taken apart and analyzed, deconstructed, and interpreted as ARM instructions. This is all done through the cooperation of several routines, all stemming from a root function: EmulateX86Function. Below is a (truncated) reverse engineering of this function:
// sub_1800215d0
void EmulateX86Function(PVOID btProperties, PWOW64_ARM_CONTEXT wow64ArmContext)
{
PVOID jitCacheInfo = wow64ArmContext->JitCacheInfo;
do
{
DWORD eip = wow64ArmContext->Eip;
DWORD bitShift = jitCacheInfo->BitShift;
DWORD mask = jitCacheInfo->Mask;
PVOID funcTbl = jitCacheInfo->FunctionTable;
DWORD offset = (eip >> bitShift) & mask;
PVOID funcEntry = funcTbl + offset;
PVOID funcEntryRes = NULL;
int counter = 0;
do
{
DWORD btFuncOff = funcEntry->BinaryTranslatedFunc;
if (btFuncOff != 0)
{
PVOID origFunc = funcEntry->OriginalFunc;
funcEntry++;
if (origFunc == eip)
{
funcEntryRes = funcEntry - 1;
break;
}
}
else
{
funcEntryRes = NULL;
break;
}
} while (++counter < 32);
if (funcEntryRes != NULL)
{
PVOID dispatchInfo = btProperties->DispatchInfo;
DWORD wordSize = dispatchInfo->WordSize;
PVOID pJitCacheInfo = dispatchInfo->JitCacheInfo;
// Offset of the dispatch routine in the JIT-cache buffer
DWORD dispatchOffset = dispatchInfo->Offset;
// Pointer to the JIT-cache buffer in memory
PVOID pJitCache = pJitCacheInfo->pJitCache; // Offset 0x20 (32-bit)
PVOID dispatcher = pJitCache + dispatchOffset;
/* Example: EXP+LdrInitializeThunk (x86) is the first function.
* Once BTDispatchRoutine returns, wow64ArmContext->Eip becomes
* LdrInitializeThunk (ARM64). */
BTDispatchRoutine(wow64ArmContext->Context, wow64ArmContext,
dispatcher);
}
} while (wow64ArmContext->ExecutionFlag != 1);
BTCpuSimulateComplete(btProperties, wow64ArmContext, bUnknown1, bUnknown2,
bUnknown3, bUnknown4, bUnknown5);
}
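The inner loop of EmulateX86Function is essentially a hash table lookup with bounded linear probing: the x86 address is shifted and masked into a slot index, and up to 32 consecutive entries are examined until a match or an empty entry is found. The self-contained C sketch below illustrates that technique only. The type and function names are hypothetical, the insertion routine is not shown in the reverse-engineered code at all, and wrapping the probe index with the mask is my own simplification.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t original;     /* x86 address of the translated code */
    uint32_t translated;   /* offset of the ARM64 code; 0 marks an empty slot */
} CacheEntry;

typedef struct {
    CacheEntry *entries;
    uint32_t    shift;     /* low address bits discarded before masking */
    uint32_t    mask;      /* slot count minus one (slot count is a power of two) */
} CacheTable;

#define MAX_PROBES 32      /* same bound as the do/while loop above */

/* Returns the entry for eip, or NULL if it has not been translated yet. */
static const CacheEntry *cache_lookup(const CacheTable *t, uint32_t eip)
{
    uint32_t slot = (eip >> t->shift) & t->mask;
    for (int i = 0; i < MAX_PROBES; i++) {
        const CacheEntry *e = &t->entries[(slot + (uint32_t)i) & t->mask];
        if (e->translated == 0)
            return NULL;           /* empty slot ends the probe sequence */
        if (e->original == eip)
            return e;
    }
    return NULL;
}

/* Hypothetical counterpart: records a translation (translated must be
 * nonzero). Returns 0 on success, -1 if all probed slots are occupied. */
static int cache_insert(CacheTable *t, uint32_t eip, uint32_t translated)
{
    uint32_t slot = (eip >> t->shift) & t->mask;
    for (int i = 0; i < MAX_PROBES; i++) {
        CacheEntry *e = &t->entries[(slot + (uint32_t)i) & t->mask];
        if (e->translated == 0 || e->original == eip) {
            e->original = eip;
            e->translated = translated;
            return 0;
        }
    }
    return -1;
}
```

A miss in this sketch corresponds to the case where funcEntryRes stays NULL, and a hit to the branch that hands control to the dispatcher.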
So the emulation begins, but which function or functions does it hand off the translation to and where?
Binary JIT Translation
Before I answer this question, let’s take a look at a translation of LdrInitializeThunk (the very first x86 function executed in an x86 process):
// Original x86 function
ntdll_775d0000!EXP+#LdrInitializeThunk:
775d1760 8bff mov edi,edi
775d1762 55 push ebp
775d1763 8bec mov ebp,esp
775d1765 5d pop ebp
775d1766 90 nop
775d1767 e9f4310800 jmp ntdll_775d0000!LdrInitializeThunk (77654960)
// JIT translation
00000000`0310a168 b81fcf9d str wfp,[x28,#-4]! // push ebp
00000000`0310a16c 2a1c03fd mov wfp,w28 // mov ebp,esp
00000000`0310a170 b840479d ldr wfp,[x28],#4 // pop ebp
00000000`0310a174 11420d29 add w9,w9,#0x83,lsl #0xC
00000000`0310a178 1107f929 add w9,w9,#0x1FE
00000000`0310a17c 97fffe4b bl 00000000`03109aa8 // (get jump function address)
00000000`0310a180 d61f0220 br xip1 // jmp ntdll_775d0000!LdrInitializeThunk
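The two add instructions in the JIT translation carry the jump displacement. The C sketch below (function name is mine) decodes the immediate of a 32-bit A64 "ADD (immediate)" instruction word, using the standard encoding: a 12-bit immediate in bits 21..10 and a shift flag in bit 22 that selects LSL #12.

```c
#include <stdint.h>

/* Decode the immediate of a 32-bit A64 "ADD (immediate)" instruction.
 * Returns 0 and stores the effective immediate, or -1 if the word is
 * some other instruction. */
static int decode_add_imm32(uint32_t insn, uint32_t *imm)
{
    /* sf=0, op=0, S=0, then the fixed pattern 100010 in bits 28..23. */
    if ((insn & 0xff800000u) != 0x11000000u)
        return -1;
    uint32_t imm12 = (insn >> 10) & 0xfffu;   /* bits 21..10 */
    uint32_t sh    = (insn >> 22) & 1u;       /* 1 means LSL #12 */
    *imm = sh ? (imm12 << 12) : imm12;
    return 0;
}
```

Decoding 0x11420d29 and 0x1107f929 gives 0x83000 and 0x1fe, for a total of 0x831fe. Subtracting that from the jump target 0x77654960 gives 0x775d1762, the address of the push ebp instruction in the original stub. This suggests, although I have not confirmed it, that w9 tracks an x86 address inside the function being translated and that the two adds advance it to the jump target.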
To answer the question, observe the following call flow which illustrates the process behind this translation:
BTCpuSimulate
└─ EmulateX86Function
├─ sub_180022c60
│ └─ sub_180022d20
│ │
│ ├─ sub_1800255b8
│ │ └─ BTXtaDeconstructOpcode
│ │
│ └─ sub_180023e10
│ ├─ BTXtaCreatePushPop // writes "str wfp,[x28,#-4]!" (@180042e04) push ebp
│ ├─ sub_180041fc0
│ │ └─ sub_180043e30 // writes "mov wfp,w28" (@180043eb4) mov ebp,esp
│ ├─ BTXtaCreatePushPop // writes "ldr wfp,[x28],#4" (@180042d1c) pop ebp
│ ├─ sub_180040070
│ │ └─ sub_18003fd30
│ │ └─ sub_1800473f8 // writes "add w9,w9,#0x83,lsl #0xC" (@18004756c) jmp ...
│ │ // writes "add w9,w9,#0x200" (@18004757c) jmp contd...
│ └─ sub_18003fd30
│ └─ sub_180040070
│ └─ sub_180046550 // writes "bl 00000000`03109aa8" (@180046580) jmp end...
└─ sub_180020928
└─ sub_180021008
└─ memcpy
Translation begins at BTCpuSimulate, and ends at a call to memcpy in sub_180021008. Many, many different functions are provided for translation. Sets of functions are designated for groups of similar x86 instructions. Together, they all perform the same general task:
1. Deconstruct an x86 instruction's bytes into a struct (BTXtaDeconstructOpcode)
2. Create an ARM instruction from the deconstructed data (BTXtaCreatePushPop)
3. Save the converted ARM instruction to a buffer (sub_180021008)
4. Repeat the above three steps for each instruction in a function
5. Save the converted function to the JIT-cache buffer
As I mentioned earlier, x86 to ARM translation is quite complex. The relatively large size of xtajit.dll is partly due to the overall size of the x86 instruction set and the need to emulate it in an ARM-based environment. Instructions are organized into groups: push and pop, for example, use some of the same functions during translation. In the end, the resulting instructions are saved to the JIT-cache buffer.
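The five steps above can be sketched as a toy translator in C. This is only an illustration of the decode, generate, buffer, and copy structure, not a reconstruction of xtajit.dll: all names are hypothetical, only three one-byte opcodes are handled, and the two ARM64 instruction words are copied from the LdrInitializeThunk translation shown earlier.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct Insn Insn;
typedef size_t (*EmitFn)(const Insn *insn, uint32_t *out);

struct Insn {
    uint8_t opcode;
    uint8_t length;   /* bytes consumed from the x86 stream */
    EmitFn  emit;     /* generator chosen by the decoder (step 2) */
};

/* Generators write ARM64 words to out and return how many were written. */
static size_t emit_push_ebp(const Insn *insn, uint32_t *out)
{
    (void)insn;
    out[0] = 0xb81fcf9du;   /* str wfp,[x28,#-4]! */
    return 1;
}

static size_t emit_pop_ebp(const Insn *insn, uint32_t *out)
{
    (void)insn;
    out[0] = 0xb840479du;   /* ldr wfp,[x28],#4 */
    return 1;
}

static size_t emit_nothing(const Insn *insn, uint32_t *out)
{
    (void)insn;
    (void)out;
    return 0;               /* nop has no counterpart in the listing */
}

/* Step 1: deconstruct one instruction. Returns 0, or -1 if unsupported. */
static int decode(const uint8_t *code, size_t len, Insn *insn)
{
    if (len == 0)
        return -1;
    insn->opcode = code[0];
    insn->length = 1;
    switch (code[0]) {
    case 0x55: insn->emit = emit_push_ebp; return 0;
    case 0x5d: insn->emit = emit_pop_ebp;  return 0;
    case 0x90: insn->emit = emit_nothing;  return 0;
    default:   return -1;
    }
}

#define SCRATCH_WORDS 64

/* Steps 2 to 5: generate into a scratch buffer, then copy the result into
 * the cache. Returns the number of ARM64 words stored (0 if they don't fit). */
static size_t translate(const uint8_t *code, size_t len,
                        uint32_t *cache, size_t cache_words)
{
    uint32_t scratch[SCRATCH_WORDS];
    size_t n = 0, pos = 0;
    Insn insn;

    while (pos < len && n < SCRATCH_WORDS &&
           decode(code + pos, len - pos, &insn) == 0) {
        n += insn.emit(&insn, scratch + n);
        pos += insn.length;
    }
    if (n > cache_words)
        return 0;
    memcpy(cache, scratch, n * sizeof scratch[0]);
    return n;
}
```

Storing a generator function pointer in the decoded instruction mirrors the way the real decoder records a routine such as BTXtaCreatePushPop for later use, which keeps the decode step independent of the code that emits ARM64.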
An x86 instruction's bytes are deconstructed in the following function:
// sub_180025b80
INT BTXtaDeconstructOpcode(PXTA_INFO xtaInfo, PXTA_INSTRUCTION xtaInstruction)
{
if (xtaInstruction->dword_12 >= 15)
goto loc_18005523c;
offset = xtaInfo->InstructionOffset; // From start of function (0xA4)
address = xtaInfo->InstructionAddr; // Absolute address of instruction
BYTE opcode = *((PBYTE)(address + offset)); // Instruction opcode
xtaInfo->InstructionOffset += 1; // Move to next byte
if (opcode > 0xFF)
goto loc_180025e18;
switch (opcode)
{
// push opcodes
case 0x50:
case 0x51:
case 0x52:
case 0x53:
case 0x54:
case 0x55: // push ebp
case 0x56:
case 0x57:
x8 = xtaInstruction->dword_1D;
x12 = opcode & 7;
if ((xtaInstruction->dword_1D & 1) == 0)
goto loc_18002db6c;
xtaInstruction->dword_19 = 4;
xtaInstruction->dword_1C = 0x14;
xtaInstruction->TransferRegister = opcode & 7;
xtaInstruction->dword_24 |= (1 << (opcode & 7)) | 0x10;
xtaInstruction->GenerateArm64Func = BTXtaCreatePushPop;
xtaInstruction->dword_28 |= 0x10;
xtaInstruction->OtherFunc = sub_1800399b0;
xtaInstruction->dword_8 |= 1;
break;
// mov r32,r/m32
case 0x8b:
// ...
break;
}
return 0;
}
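The "opcode & 7" expression in the push cases relies on a property of the x86 encoding: the one-byte push r32 (0x50 to 0x57) and pop r32 (0x58 to 0x5f) opcodes carry the register number in their low three bits. A minimal C sketch of that decoding step, with helper names of my own choosing:

```c
#include <stddef.h>
#include <stdint.h>

/* x86 32-bit general-purpose registers in encoding order. */
static const char *const x86_reg32[8] = {
    "eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"
};

/* Register name for a one-byte "push r32" opcode, or NULL otherwise. */
static const char *decode_push_r32(uint8_t opcode)
{
    if ((opcode & 0xf8u) != 0x50u)
        return NULL;
    return x86_reg32[opcode & 7u];
}

/* Register name for a one-byte "pop r32" opcode, or NULL otherwise. */
static const char *decode_pop_r32(uint8_t opcode)
{
    if ((opcode & 0xf8u) != 0x58u)
        return NULL;
    return x86_reg32[opcode & 7u];
}
```

For opcode 0x55, the low three bits are 5, which selects ebp. This is the value stored in TransferRegister above.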
The deconstructed instruction is then translated by its generator function, in this case BTXtaCreatePushPop:
#define XTAR_X28 28
#define XTAI_PUSH_POP 0xb81fcc00
// sub_180042bb0
VOID BTXtaCreatePushPop(PXTA_DATA xtaData, PXTA_INSTRUCTION xtaInstruction,
PJIT_CACHE_FUNCTION jitCacheFunc)
{
// ...
if (/* do some checks... */)
{
// ...
// loc_180042de8
jitCacheFunc->Buffer[0] = XTAI_PUSH_POP | (XTAR_X28 << 5) |
gXtaContextRegs[xtaInstruction->TransferRegister + 0x6e];
}
// ...
}
The final bytes are then written to a buffer. This buffer is later copied to the JIT-cache buffer with memcpy (see call graph above) and finally executed.
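The constant 0xb81fcc00 is not arbitrary: it is the A64 encoding of a 32-bit "STR (immediate, pre-index)" with an offset of -4 and both register fields left at zero. The C sketch below rebuilds the push and pop instruction words from the standard A64 field layout. The macro and function names are mine, and the pop template is derived from the instruction encoding rather than taken from xtajit.dll.

```c
#include <stdint.h>

#define A64_STR_W_PRE   0xb8000c00u   /* STR Wt,[Xn,#imm9]!  (pre-index)  */
#define A64_LDR_W_POST  0xb8400400u   /* LDR Wt,[Xn],#imm9   (post-index) */

/* Fill in the signed 9-bit offset (bits 20..12), the base register
 * (bits 9..5) and the transfer register (bits 4..0). */
static uint32_t a64_ldst_imm9(uint32_t base, int32_t imm9,
                              unsigned rn, unsigned rt)
{
    return base | (((uint32_t)imm9 & 0x1ffu) << 12) |
           ((rn & 31u) << 5) | (rt & 31u);
}

/* push: decrement the emulated stack pointer by 4, then store. */
static uint32_t emit_push32(unsigned sp_reg, unsigned rt)
{
    return a64_ldst_imm9(A64_STR_W_PRE, -4, sp_reg, rt);
}

/* pop: load, then increment the emulated stack pointer by 4. */
static uint32_t emit_pop32(unsigned sp_reg, unsigned rt)
{
    return a64_ldst_imm9(A64_LDR_W_POST, 4, sp_reg, rt);
}
```

With x28 as the emulated stack pointer and register 29 (fp) as ebp, emit_push32(28, 29) produces 0xb81fcf9d and emit_pop32(28, 29) produces 0xb840479d, matching the first and third instructions of the JIT translation of LdrInitializeThunk.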
So, we finally witness the actual translation, and very little of what’s involved is surprising. Given that the constituent parts of an instruction are organized into bytes and bit fields, translation simply becomes a matter of deconstructing and reconstructing the parts for a different architecture. But if we again recall this flowchart:
BTCpuSimulate
└─ EmulateX86Function
├─ sub_180022c60
│ └─ sub_180022d20
│ │
│ ├─ sub_1800255b8
│ │ └─ BTXtaDeconstructOpcode
│ │
│ └─ sub_180023e10
│ ├─ BTXtaCreatePushPop // writes "str wfp,[x28,#-4]!" (@180042e04) push ebp
│ ├─ sub_180041fc0
│ │ └─ sub_180043e30 // writes "mov wfp,w28" (@180043eb4) mov ebp,esp
│ ├─ BTXtaCreatePushPop // writes "ldr wfp,[x28],#4" (@180042d1c) pop ebp
│ ├─ sub_180040070
│ │ └─ sub_18003fd30
│ │ └─ sub_1800473f8 // writes "add w9,w9,#0x83,lsl #0xC" (@18004756c) jmp ...
│ │ // writes "add w9,w9,#0x200" (@18004757c) jmp contd...
│ └─ sub_18003fd30
│ └─ sub_180040070
│ └─ sub_180046550 // writes "bl 00000000`03109aa8" (@180046580) jmp end...
└─ sub_180020928
└─ sub_180021008
└─ memcpy
This entire procedure is all for merely four instructions (push ebp; mov ebp, esp; pop ebp; jmp). Imagine the overhead if this were to be performed for a full x86 binary.
Which brings us back to the use of XTA cache files: how are they created in the first place? This is done through independent background caching and compilation with the help of the caching service.
XTA Caching Service (XtaCache)
XtaCache is fairly small. Its purpose is to listen for module load notifications and to start the compiler if needed. Notifications are sent by the emulator (xtajit.dll) when a module is loaded. If necessary, the service will create a new XTA cache file in C:\Windows\XtaCache and start the compiler (xtac.exe). The compiler then performs a partial or full translation of the module's executable code. Once finished, it notifies the service, whereupon XtaCache saves the changes to the XTA cache file and closes all open file handles.
xtajit → XtaCache
Communication between xtajit and XtaCache is achieved using NtAlpcSendWaitReceivePort. Below is the call flow showing inter-process communication between the emulator and caching service:
BTCpuNotifyMapViewOfSection → sub_180014A80 → sub_180015AB0 → sub_180015C28 → sub_180015D80 → sub_180019010 → NtAlpcSendWaitReceivePort
BTCpuNotifyMapViewOfSection is called every time a module is loaded (since NtMapViewOfSection is called every time a module is loaded). Eventually it passes a module file handle to NtAlpcSendWaitReceivePort, which sends the message to the caching service, XtaCache.
XtaCache → xtac
Once XtaCache receives a module file handle from the emulation module, it must determine whether or not the x86 module file needs to be compiled and written to a cache file. The following is the call flow leading to the execution of the compiler (xtac.exe):
1. NtAlpcSendWaitReceivePort receives module file handle from xtajit.dll
2. NtCreateFile opens module, creates a SHA256 hash from file name and PE header data
3. TpAllocWork creates callback function which will start compiler (xtac.exe)
4. Callback function formats XTA cache file name, and creates XTA cache file (NtCreateFile)
5. NtCreateSection creates section handle to be passed to compiler in a command line
6. CreateProcessAsUserW starts compiler with command line: xtac.exe -p [section handle]
The following function is responsible for receiving and responding to an ALPC message:
NTSTATUS ProcessIncomingAlpcMessage(PALPC_CONTEXT pAlpcContext,
PPORT_MESSAGE msg, PALPC_MESSAGE_ATTRIBUTES msgAttr)
{
/*...*/
if (msg->TotalLength >= 80 && msg->MsgType == 2)
{
if ((msgAttr->ValidAttributes & 0x10000000) != 0)
{
xtaCacheMsgAttr =
(PXTACACHE_MESSAGE_ATTRIBUTES)AlpcGetMessageAttribute(
msgAttr, 0x10000000);
if (xtaCacheMsgAttr->MsgType == 0xA1)
{
/* This will create the XTA cache file and start the compiler.
* hModuleNew is assigned a duplicate handle of hModule. */
result = PrepareJitCacheFileCompilation(
pAlpcContext->qword_0, unknown_0->qword_28,
xtaCacheMsgAttr->hModule, unknown_2, &hModuleNew);
if (NT_SUCCESS(result))
{
/*...*/
if (hModuleNew != 0)
{
xtaCacheMsgAttr =
(PXTACACHE_MESSAGE_ATTRIBUTES)AlpcGetMessageAttribute(
&newMsgAttr, 0x10000000);
xtaCacheMsgAttr->hModule = hModuleNew;
xtaCacheMsgAttr->MsgType = 12;
}
xtaCacheMsg->Result = STATUS_SUCCESS;
}
}
}
}
/*...*/
// Send response to xtajit emulator
NTSTATUS result = NtAlpcSendWaitReceivePort(
pAlpcContext->hAlpcPort, 0x10000, msg, newMsgAttr, NULL, NULL, NULL, 0);
// Close the module file handle
xtaCacheMsgAttr = (PXTACACHE_MESSAGE_ATTRIBUTES)AlpcGetMessageAttribute(
msgAttr, 0x10000000);
NtClose(xtaCacheMsgAttr->hModule);
return result;
}
Eventually a response is sent to the emulator (xtajit.dll) notifying it of the result, including any errors or anomalies in the message and message attributes.
Conclusion
It is worth noting that x86 emulation on ARM is a relatively new feature and is still in its infancy. Microsoft is one of the first to incorporate such a feature in its latest operating system. Supporting x86 emulation on ARM is a topic that is of increasing concern, as ARM processors become more and more ubiquitous and the concomitant demand for x86 emulation grows.
As traditionally x86- and x64-based operating systems attempt to migrate over to an ARM-based architecture, however, the drawbacks to emulation become more noticeable. Whereas x86 on x64 or ARM32 on ARM64 is a comparatively simple matter of switching contexts and whatnot, x86 on ARM requires full-on CPU simulation in user mode, all in a single, infinite loop. The overhead encountered during such a process could be immense, depending on the size of the executing binary and the availability of cache files.
A gradually diminishing amount of free disk space could also become a concern, since the cache file directory grows as more x86 binaries are executed over time. Much remains to be seen as to what direction Microsoft and other software publishers will take with regard to x86 emulation. No doubt there are plenty of opportunities for innovation and improvement upon current techniques, and plenty to watch out for in future updates.
Resources
Below is a list of the materials used during research:
- Raspberry Pi 3 Model B+ (https://www.raspberrypi.org/products/raspberry-pi-3-model-b/)
- WOA Deployer for Raspberry Pi 3 (https://github.com/WOA-Project/WOA-Deployer-Rpi)
- Debugging Tools for Windows (https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/debugger-download-tools)
- IDA Pro 7.2 (https://www.hex-rays.com/products/ida/support/download.shtml)
- PuTTY 0.71 and PSFTP (https://www.chiark.greenend.org.uk/~sgtatham/putty/)
- Sysinternals Suite for ARM64 (https://live.sysinternals.com/ARM64/)
- WoW64 internals ...re-discovering Heaven's Gate on ARM (https://wbenny.github.io/2018/11/04/wow64-internals.html)
- ARM Instruction Set Reference Guide (https://static.docs.arm.com/100076/0100/arm_instruction_set_reference_guide_100076_0100_00_en.pdf)
- A64 Instruction Set Architecture (https://static.docs.arm.com/ddi0596/a/DDI_0596_ARM_a64_instruction_set_architecture.pdf)