Abstract Interpretation under Speculative Execution

Meng Wu
Virginia Tech
Blacksburg, VA, USA

Chao Wang
University of Southern California
Los Angeles, CA, USA

Abstract
Analyzing the behavior of a program running on a processor that supports speculative execution is crucial for applications such as execution time estimation and side channel detection. Unfortunately, existing static analysis techniques based on abstract interpretation do not model speculative execution since they focus on functional properties of a program while speculative execution does not change the functionality. To fill the gap, we propose a method to make abstract interpretation sound under speculative execution. There are two contributions. First, we introduce the notion of virtual control flow to augment instructions that may be speculatively executed and thus affect subsequent instructions. Second, to make the analysis efficient, we propose optimizations to handle merges and loops and to safely bound the speculative execution depth. We have implemented and evaluated the proposed method in a static cache analysis for execution time estimation and side channel detection. Our experiments show that the new method, while guaranteed to be sound under speculative execution, outperforms state-of-the-art abstract interpretation techniques that may be unsound.

CCS Concepts • Software and its engineering → Formal software verification; Compilers; • Security and privacy → Cryptanalysis and other attacks.

Keywords Static analysis, speculative execution, abstract interpretation, timing side channel, WCET, cache

ACM Reference Format:

1 Introduction
Speculative execution [55] is a feature that has been implemented by many modern processors. It allows a processor to increase the execution speed by exploring certain program paths ahead of time instead of waiting for the path conditions to be satisfied. This is to prevent slower instructions, e.g., memory accesses, from blocking faster instructions. For example, when a program reaches a branching instruction, e.g., if(x>5){...}else{...} where the condition depends on an uncached value of x stored in memory, a non-speculative execution will force the processor to wait, often for tens or hundreds of clock cycles, until x is loaded from memory, whereas speculative execution allows the processor to make a prediction of the branching target and then proceed to execute the predicted branch. During speculative execution, the processor maintains a checkpoint of the CPU’s register state, which will be used to rollback the changes if the prediction turns out to be incorrect, i.e., after the value of x is fetched from memory. However, if the prediction turns out to be correct, speculative execution will save time and thus outperform non-speculative execution.

Speculative execution is designed to be transparent to the program running on the processor; that is, it does not affect the program semantics, as the rollback ensures that functional properties are preserved. This is the reason why, in the past, static analysis techniques do not model speculative execution. However, recent vulnerabilities such as Meltdown [37], Spectre [29] and ForeShadow [58] force the community to take another look because, although speculative execution preserves the CPU’s register state, for performance reasons, it does not preserve the states of many other components such as the cache and the pipeline [23, 24].

To see why this may be a problem, consider the cache state that may be altered by speculative execution and thus affect the timing behavior of the subsequent non-speculative execution, e.g., cache hits may become misses, or vice versa. This is important because an instruction may take only 1–3 clock cycles when there is a cache hit, but tens or even hundreds of clock cycles when there is a cache miss. Static analysis is useful in examining the cache related properties of a program, e.g., to detect information leaks through timing side channels [7, 16, 30, 53, 62] or prove that a computation task always meets the deadline [18, 20].

For side channel detection, in particular, one may want to know if the program’s execution time depends on secret data, e.g., the cryptographic key, security token, or password. For deadline estimation, one may want to know the maximum number of cache misses along program paths, since it corresponds to the execution time in the worst case. In both applications, static analysis must be sound to be useful. By sound, we mean all possible behaviors must be considered. The reason is because, if the analysis fails to take into consideration a certain behavior, e.g., a specific execution, it may...
Figure 1. Our static program analysis framework based on sound abstract interpretation of speculative executions.

The accuracy of abstract interpretation is often affected by when abstract states from speculative and non-speculative executions are merged, and we develop a strategy named “just-in-time merging” to minimize the loss of accuracy.

We have implemented our method in LLVM [33], where the speculative CFG is constructed by an LLVM pass before it is used by abstract interpretation. We evaluated it on two types of benchmarks: cryptographic software and real-time software, where the goal is to detect timing side channels and to estimate the execution time, respectively. In both cases, the instruction set architecture is Alpha 21264, with 32-KB fully-associative data cache, 64 bytes per line, and the LRU replacement policy. Our experiments show that, compared to existing non-speculative methods, our method is able to detect significantly more timing related behaviors, i.e., cache misses and side-channel leaks. Furthermore, our optimizations are effective in reducing the runtime overhead while maintaining the accuracy.

To sum up, this paper makes the following contributions:

- We show why existing abstract interpretation techniques are unsound for speculative execution.
- We propose a method for lifting existing algorithms to make them sound for speculative execution.
- We develop optimizations to safely reduce the runtime overhead while maintaining the accuracy.
- We implement the method and demonstrate its effectiveness on a set of C programs.

The paper is structured as follows. First, we illustrate the problem and our solution in Section 2. Then, we provide the technical background in Section 3, before presenting our algorithms in Sections 4, 5 and 6. We present our experimental results in Section 7. We review the related work in Section 8. Finally, we give our conclusions in Section 9.

2 Motivation

We illustrate some scenarios in which speculative execution affects the cache behaviors associated with a program, and explain why such behaviors are crucial for execution time estimation and side channel detection.

2.1 Execution Time Estimation

Figure 2 shows a program that illustrates divergent cache behaviors under normal and speculative executions as observed in practice [2–4, 12, 26, 27]. Here, we have four variables: $ph$, $l1$, $l2$, and $p$, which are mapped to different cache lines. Suppose the register value $k$ is 0, the load at line 8 will access $ph[0]$. We assume the cache has 512 lines in total and 64 bytes per line. We also assume the cache is fully associative, meaning any variable may be mapped to a different line. The place holder variable $ph$ is mapped to the first 510 lines (line 3); in practice, $ph$ may correspond to an assorted set of program variables. Each of the remaining variables, $l1$, $l2$ and $p$, may be mapped to a cache line.

Depending on the branching condition, either $l1$ or $l2$ may be loaded to the cache, but both will result in 512 cache
time estimation example, speculative execution may execute one of the two branches first, and then roll back to execute the other branch. Since the memory locations associated with both branches must be accessed, which add up to more than 512 cache lines, some of the cache lines associated with \( ph \) will be evicted. Therefore, the subsequent load (at line 8) may be a cache miss. The difference in execution time may be observed by the attacker and used to deduce information of the secret \( k \): whether the last statement leads to a cache miss depends on the value of \( k \).

### 2.3 Technical Challenges

The above two examples illustrate the need to soundly model speculative execution. However, there are several challenges. The first one is to model the cache state of a program during speculative execution without drastically altering the abstract interpretation algorithm. The second challenge is to judiciously merge abstract states computed from normal and speculative executions, since when and how to merge them drastically affect the accuracy of the fixed-point computation. Furthermore, since a speculative execution may be rolled back at any moment, the number of scenarios is exponential in the number of speculatively executed instructions. If we have to enumerate, the analysis time will be prohibitively long. Therefore, we group scenarios into equivalence classes, based on which we perform reduction to balance the performance and accuracy.

In the remainder of this paper, we will present our solutions in detail.

### 3 Preliminaries

We review the basics of abstract interpretation, as well as the cache, branch prediction, and speculative execution.

#### 3.1 Abstract Interpretation

Abstract interpretation [14] is a static analysis framework that considers all paths and inputs to obtain a sound over-approximation of the state at every program location [31, 32, 52]. For efficiency reasons, the state is kept abstract and often represented by a set of constraints in a certain abstract domain. For example, in the interval domain, each constraint is of the form \( lb \leq x \leq ub \), where \( x \) is a variable and \( lb, ub \) are the lower and upper bounds. The join of two states, \( s_1 = [lb_1 \leq x \leq ub_1] \) and \( s_2 = [lb_2 \leq x \leq ub_2] \), is defined as \( s_1 \cup s_2 = \min(lb_1, lb_2) \leq x \leq \max(ub_1, ub_2) \). Here, \( \cup \) denotes the join operator, which returns an over-approximation of the set union. If, for example, the polyhedral abstract domain is used, a constraint will be a linear equation and the join operator may be the convex hull.

The purpose of restricting the representation of states to an abstract domain is to reduce the computational overhead. Although various abstract domains may be plugged in, the underlying fixed-point computation remains the same. The fixed-point of states are computed on the program’s control flow graph (CFG). Without loss of generality, we assume
Algorithm 1: Abstract interpretation based static analysis.

1: Initialize $S[n]$ to $\top$ if $n = \text{ENTRY(CFG)}$, and to $\bot$ otherwise.
2: $WL \leftarrow \text{ENTRY(CFG)}$
3: while $\exists n \in WL$ do
4:     $WL \leftarrow WL \setminus \{n\}$
5:     $s' \leftarrow \text{TRANSFER}[S[n], \text{inst}_n]$
6:     for all $n' \in \text{Successors}(CFG, n)$ do
7:         if $s' \not\subseteq S[n']$ then
8:             $S[n'] \leftarrow s'[n'] \cup s'$
9:         $WL \leftarrow WL \cup \{n'\}$
10:     end if
11: end for
12: end while

Figure 3. Pipelined execution trace for program in Figure 2

the CFG has a unique entry node and a unique exit node. Inside the CFG, nodes are associated with instructions or basic blocks of instructions, whereas edges represent the control flows, guarded by conditional expressions.

Let $\text{TRANSFER} : S \times \text{INST} \rightarrow S$ be the transfer function, which takes a state $s \in S$ and an instruction $\text{inst} \in \text{INST}$ as input, and returns the new state $s' = \text{TRANSFER}(s, \text{inst})$ as output. $s'$ is the result of executing $\text{inst}$ in state $s$.

Algorithm 1 shows a generic procedure that returns, for each CFG node $n$, an abstract state $S[n]$ as output. $S[n]$ is supposed to be a sound over-approximation of all the possible states at $n$, regardless of the input values or paths taken to reach $n$. Initially, $S[n]$ is $\top$ (tautology) for the entry node but $\bot$ (empty) for all other CFG nodes. The remaining part of the procedure is a standard worklist-based algorithm for computing the fixed point [44]: starting from the entry node, it computes the states of the successor nodes ($n'$) based on the transfer function. To ensure convergence, e.g., when the program has loops or is otherwise non-terminating, a widening operator ($\lor$) is needed in addition to join ($\cup$). However, for brevity, we omit the details; for a complete introduction, refer to [14, 41].

The actual definitions of abstract state $S$ and transfer function $\text{TRANSFER}$ depend on the application. In this work, we are concerned with the cache state corresponding to a program. We will present our definitions in Section 4.

3.2 Cache and Speculative Execution

Cache is a type of small but fast storage to hold frequently used data so that they do not need to be fetched from or stored to the large but slow memory every time. Although this work focuses on the data cache, which is more relevant to our applications, the underlying technique can be extended to the instruction cache as well.

In a typical CPU, e.g., an Intel processor [1], instructions are fetched from memory and decoded continuously before they are sent to the scheduler for execution. Executing an instruction involves multiple units; speculative execution [55] is an optimization that efficiently utilizes these execution units. During speculative execution, instructions are scheduled in a pipeline as soon as the required execution units are available; for example, while an instruction is waiting for data to be fetched from memory, subsequent instructions may be executed, as long as the program semantics remains the same to observers from outside of the CPU.

Things become complicated when there are branches, however, since the branch prediction unit must make a guess on which branch target to execute. Instructions in the predicted branch will be executed while the branch condition is being evaluated, and will be committed only after the prediction is confirmed to be correct. Upon misprediction, however, the result of speculative execution will be discarded and the execution will be redirected to the correct branch.

The reorder buffer inside the execution unit, among others, is responsible for this rollback: upon a branch mis-prediction, it will not perform register retiring as in a normal execution; instead, it will flush out the affected registers, before restoring the CPU to a previously saved state.

The branch predictor also plays an important role in speculative execution since its accuracy is directly related to the performance of the CPU. However, regardless of the underlying strategies [28, 59, 63], when a branch prediction turns out to be incorrect, the speculatively executed instructions may leave side-effects on the states of other system components, including the cache. In this work, we are concerned with modeling of such side-effects in abstract interpretation.

4 Static Cache Analysis

In this section, we present our instantiation of the baseline abstract interpretation algorithm. The goal is to decide, at each program location, whether a memory access always results in a cache hit. Previously, such must-hit analyses were used in execution time estimation [19, 20] and side
channel mitigation [7, 16, 62]; however, they did not model speculative execution.

4.1 The Abstract State
Let $V = \{v_1, ..., v_n\}$ be the set of program variables stored in memory. Each variable $v \in V$ may be mapped to a cache line. Let the cache be fully associative with the LRU replacement policy, which means a variable $v \in V$ may be mapped to any cache line and, if there is not enough space, the least recently used (LRU) variable will be evicted from the cache. Assume that $N$ is the total number of cache lines, we can define the age of each variable $v \in V$, denoted $\text{Age}(v)$, which is an integer ranging from 1 to $N + 1$. Here, $\text{Age}(v) = 1$ means $v$ resides in the most recently used line, $\text{Age}(v) = N$ means $v$ resides in the least recently used cache line, and $\text{Age}(v) = N + 1$ means $v$ is outside of the cache.

The cache state $S$ associated with the entire program is defined as $S = \langle \text{Age}(v_1), ..., \text{Age}(v_n) \rangle$. In this context, a Must-Hit analysis needs to compute, at each program location, an upper bound of $\text{Age}(v)$. If the upper bound is less than or equal to $N$, then $v$ must be in the cache. Otherwise, it is possible that $v$ may be outside of the cache.

4.2 The Transfer Function
Let $\text{Transfer}(S, \text{inst})$ be the transfer function that models the impact of executing $\text{inst}$ in the cache state $S$: given the current state $S = \langle \text{Age}(v_1), ..., \text{Age}(v_n) \rangle$, it returns a new state $S' = \langle \text{Age}'(v_1), ..., \text{Age}'(v_n) \rangle$. If $\text{inst}$ does not access memory at all, then $S' = S$. Otherwise, assume that $v \in V$ is the variable being accessed in $\text{inst}$, and we compute the new state $S'$ as follows:

- For the accessed variable $v$, set $\text{Age}'(v) = 1$ in $S'$.
- For variable $u \in V$ whose age may be younger than $v$ in $S$, increment the age of $u$; that is, $\text{Age}(u) < \text{Age}(v) \Rightarrow \text{Age}'(u) = \text{Age}(u) + 1$.
- For any other variable $w \in V$, set $\text{Age}'(w) = \text{Age}(w)$.

Given the function $\text{Transfer}$ for an instruction, we define it for a sequence of instructions $\text{Insts} = \{\text{inst}_0, \text{inst}_1, ..., \text{inst}_n\}$ as follows: $\text{Transfer}(S, \text{Insts}) = \text{Transfer}(\ldots (\text{Transfer}(S, \text{inst}_0), \text{inst}_1), \ldots, \text{inst}_n)$.

Figure 4 shows two examples. The left-hand-side example illustrates the access of $v$, which is not yet loaded into the cache. After the access, $\text{Age}(v) = 1$, meaning $v$ is loaded to the youngest cache line. Furthermore, the ages of all other lines increase by 1. Since $\text{Age}(u_4) > 4$, the variable $u_4$ is evicted from the cache.

4.3 The Join Operator
For efficiency reasons, states computed along two program paths are joined together at the control-flow merge point, to avoid creating an exponential number of states. In the BASELINE abstract interpretation algorithm, the join operator ($\sqcup$) always maintains a single cache state in the result, regardless of how many states are joined.

Formally, given two states $S = \langle \text{Age}(v_1), ..., \text{Age}(v_n) \rangle$ and $S' = \langle \text{Age}'(v_1), ..., \text{Age}'(v_n) \rangle$, we define $S'' = S \sqcup S'$ as follows: $S'' = \langle \max(\text{Age}(v_1), \text{Age}'(v_1)), ..., \max(\text{Age}(v_n), \text{Age}'(v_n)) \rangle$.

5 Modeling the Speculative Execution
In this section, we lift the baseline abstract interpretation algorithm so that it can soundly model speculative execution.

5.1 Augmented CFG with Virtual Control Flow
Given the CFG of a program, we first augment it by adding special nodes and edges, to model all possible control flows produced by speculative executions. These implicit control flows, which will be made explicit in our augmented CFG, are called the virtual control flows.

A virtual control flow occurs at every if-else statement where the branching condition depends on some variables stored in memory. In a normal execution, a branch guarded
by a condition (c) is explored only when c is satisfied. However, under speculative execution, the branch will be explored (speculatively) by our algorithm even if c is unsatisfiable. Furthermore, upon mis-prediction, the rollback will re-direct the control to the other branch.

To model all these behaviors, we add the following special nodes and edges to the CFG for every branch that may be explored speculatively:

- \( vn_{\text{start}} \), which is a special CFG node that denotes the start of a virtual control flow;
- \( vn_{\text{stop}} \), which is a special CFG node that denotes the end of a virtual control flow.

The edges connecting such nodes, which represent the virtual control flows, fall into five categories: (1) \( n\rightarrow vn_{\text{start}} \); (2) \( vn_{\text{start}} \rightarrow n \); (3) \( n\rightarrow n \); (4) \( n\rightarrow vn_{\text{stop}} \), and (5) \( vn_{\text{stop}} \rightarrow n \), where \( n \) is a normal CFG node.

The edge \( n\rightarrow vn_{\text{start}} \) represents the start of a speculative execution: it feeds the state \( S[n] \) to \( vn_{\text{start}} \), which in turn generates a speculative state \( SS[ vn_{\text{start}} ] = S[n] \). Then, the newly created speculative state is propagated through the edge \( vn_{\text{start}} \rightarrow n \). Next, it is propagated through the edges \( n\rightarrow n \) and \( n\rightarrow vn_{\text{stop}} \) until reaching \( vn_{\text{stop}} \rightarrow n \). The special node \( vn_{\text{stop}} \) converts the speculative state \( SS[ vn_{\text{stop}} ] \) back to the normal state \( S[n] = SS[ vn_{\text{stop}} ] \). Afterward, the state is joined with other states from the normal execution.

One way to add the special nodes and edges is illustrated in Figure 6a. Specifically, for each if-else statement, we add virtual control flow edges from instructions in one of the branch to the entry node of the other branch under the same branching condition.

Here, the blue solid lines represent normal executions, whereas the red dashed lines represent virtual control flows associated with speculative executions of the else-branch. Virtual control flows associated with the then-branch are similar, but omitted in the figure for clarity. The reason why there are more than one dashed lines is because the roll-back point (i.e., location where roll-back occurs) is non-deterministic; to be conservative, we assume it may occur at any moment within the maximum speculation depth.

In practice, the speculation depth is platform-dependent and bounded by a few factors [21, 47], e.g., the size of the reorder buffer; the maximum number of unresolved branches that the CPU can handle before it stalls; whether there are division-by-zero or floating-point errors in the program; and the number of clock cycles taken to access memory and resolve a branching condition. For simplicity, for example, we assume that the maximum speculative execution depth is provided by the user. In Figure 6a, we assume that \( \text{inst}_B \) is the boundary within which roll-back occurs.

5.2 Merging the Speculative Flows

Since we use abstract interpretation to over-approximate the cache states, multiple executions must be merged to reduce the computational overhead. In the baseline algorithm, for example, states from two different paths are joined whenever the program paths are merged in the CFG. In the speculative analysis, we also need to decide when to join the normal and the speculative states.

Figure 6 shows three merging strategies in addition to the original no-merging strategy in Figure 6a. Consider Figure 6b, for example, since the executions before the branch entry node are identical, they are merged without losing accuracy; in addition, the speculative executions are merged right before the exit point of the other branch. Recall that the join operator (\( \sqcup \)) used to handle merging is over-approximated, we know that the strategy outlined in Figure 6b is a sound over-approximation of Figure 6a.

To over-approximate even more, consider Figure 6c, which merges all speculative states of the else-branch before reaching the then-branch. However, the merged speculative state is propagated through the then-branch before it is merged with the normal state. In contrast, Figure 6d is a more aggressive over-approximation, which merges the speculative states with the non-speculative state at the entry node of the then-branch.
Regardless of the merging strategy, however, our method ensures that the result is a sound over-approximation. Since every time state merging occurs, it may lose information, in general, the later that merging occurs, the more accurate the result is, but there is no guarantee. Furthermore, late merging may lead to a more expensive analysis. Our experimental comparisons of these four strategies show that the one outlined in Figure 6c is the best: it not only obtains significantly more accurate results than the one in Figure 5.3 Just-in-Time Merging: An Example

Consider the CFG of a branch shown on the left-hand side of Figure 7, where each basic block refers to a variable (from a to e). The initial cache state, at the top of the figure on the right-hand side, is the state after executing the first basic block, where variables a, b and c are loaded into the cache. Here, the solid arrows represent the normal execution, where either d or e is mapped to the youngest cache line. Since we are concerned with a Must-Hit analysis, after merging at basic block 4, only a, b and c are left in the cache.

Under speculative execution, we may execute the else-branch before rolling back to execute the then-branch. If we choose to merge the speculative state right after the rollback, the merging would be between d, c, b and a on the one hand, and e, d, c and b on the other hand. The merged state will not contain e anymore, thus losing the important information of speculative execution.

However, if we propagate the speculative state computed from the else-branch through the then-branch and then merge with the non-speculative state, the cache state at basic block 4 will be more accurate. As shown by the dotted arrow $T_s$, variable e is loaded to the cache before d is loaded to the cache; similarly, for $F_s$, variable d is loaded before e is loaded. Finally, when the four states are merged, the result is that only c and b are guaranteed to result in cache hits. Thus, the cache state on the bottom-right of Figure 7, which corresponds to Just-in-Time merging illustrated in Figure 6c, captures the side effect of speculative execution.

6 Generalization and Optimization

In this section, we present the generalized algorithm before discussing several optimizations, which help increase accuracy as well as decrease runtime overhead.

Algorithm 2 shows the static analysis procedure that is sound under speculative execution. Given the original CFG of a program, it first constructs an augmented CFG by adding the virtual control flows. Then, it initializes the abstract states for each program location n, including both the default state, denoted $S[n]$, and the speculative state, denoted $SS[n]$. Next, it starts the fixed-point computation using a worklist based procedure that is similar to that of Algorithm 1.

**Algorithm 2 Abstract interpretation under speculation.**

1. $VCFG ← AUGMENTCFGWITHTERMINALFLOW(CFG)$
2. Initialize $S[n]$ to $T$ if $n ∈ ENTRY(VCFG)$, else to ⊥
3. Initialize $SS[n]$ to ⊥ for all $n ∈ VCFG$
4. $WL ← ENTRY(VCFG)$
5. while $∃ n ∈ WL$ do
6. $WL ← WL \setminus \{n\}$
7. if $n$ is a normal CFG node then
8. $s′ ← TRANSFER(\langle S[n], n \rangle)$
9. $ss′ ← TRANSFER(\langle SS[n], n \rangle)$
10. else
11. Set $ss′$ to $S[n]$ if $n$ is a special $n_{start}$ node, else to ⊥
12. Set $ss′$ to $SS[n]$ if $n$ is a special $n_{stop}$ node, else to ⊥
13. end if
14. for each $n′ ∈$ SUCCESSORS(VCFG, $n$) do
15. if $s′ \not∈ S[n']$ or $ss′ \not∈ SS[n']$ then
16. $S[n'] ← S[n'] ∪ s′$
17. $SS[n'] ← SS[n'] ∪ ss′$
18. $WL ← W L ∪ \{n′\}$
19. end if
20. end for
21. end while

However, when the special CFG node $vn_{start}$ is encountered (Line 11), the default state $S[n]$, which is from the incoming edge, is used to create a speculative state $ss′ ← S[n]$; this is to model the side effects caused by the failed speculative execution upon rollback. From then on, both the default state $S[n]$ and the speculative state $SS[n]$ will be propagated through subsequent nodes in the VCFG; at each node n, the transfer function has to be applied to both of them (Lines 8-9). This continues until the other special node $vn_{stop}$ is encountered, which transforms the speculative state $SS[n]$ back to $s′$ (Line 12) before $s′$ is merged into the normal flow.

6.1 The Running Example

To illustrate how the algorithm works, consider the example program in Figure 8, which is a real-time DSP program written in C [26]. The corresponding CFG is shown in Figure 9, where the red (solid and dashed) edges represent the two virtual control flows.
/* table is 31-byte long to make quantl look-up easier, last entry is for mil=30 when wd is max */
int quant26bt_pos[31] = { 61, 60, 59, 58, 57, 56, 55, 54, 53, 52, 51, 50, 49, 48, 47, 46, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0};
/* table is 31-byte long to make quantl look-up easier, last entry is for mil=30 when wd is max */
int quant26bt_neg[31] = { 63, 62, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0};

int ril, mil;
ref ril, el, decis, wd, detl, decis_lev[2*], decis_lev[1*];
if (mil = 0 ; mil < 30 ; mil++) {
    decis = (decis_levl[mil]*((long)detl)) >> 15L;
    if (wd <= decis) break;
}
/* if mil=30, wd is less than all decision levels*/
if (el >> 0) ril = quant26bt_pos[mil];
else ril = quant26bt_neg[mil];
return(ril);

Figure 8. Code snippet from a real-time DSP program [26].

Result from Non-speculative Executions Table 1 shows the cache state computed for each location (basic block) based on only normal executions (black edges in Figure 9); this is analogous to running the baseline procedure in Algorithm 1. In Column 2, the variables are arranged according to their ages: the younger variable appears on the left.

Initially, the cache is empty. From basic block 1 to 5, we apply the transfer functions: decis_lev takes two cache lines, but since we do not unwind the loop, we do not know its index statically. Thus, we nondeterministically pick one for the first time, decis_lev[1*]. Following the back edge from basic block 4, when decis_lev is accessed again, we conservatively choose the second cache line for decis_lev[2*] to ensure that the cache state remains an over-approximation. Our analysis iterates through the loop three times before it reaches a fixed-point (light gray row) and terminates.

Result from Speculative Executions Table 2 shows the cache state computed under speculative execution. For clarity, we only focus on the cache states relevant to the speculative executions starting from basic block 5. We use two different colors, blue and red, to differentiate the cache states computed from non-speculative (blue) and speculative (red) executions. By considering speculative executions, it is possible for us to access both quant26bt_pos and quant26bt_neg in a single execution.

Execution Time Estimation The last row of Table 2, which differs from the last row of Table 1, shows that most of the program variables have older ages than before. This is dangerous because, if the cache is only large enough to hold the first eight variables, there will be an additional cache miss, which may force the program to miss its deadline.

Side Channel Detection The additional cache miss may also lead to side-channel leaks. Figure 10 shows a client program that uses the program in Figure 8. The application first accepts some input from the user, then processes it using quantl as a subroutine, and finally encrypts the result using a cipher such as AES. Before calling quantl, a look-up table named sbox is loaded; the lookup table will be used by the cipher while it encrypts the data, during which time a secret key is used as the index to access sbox.

Table 1. Cache states during the fixed-point computation.

<table>
<thead>
<tr>
<th>BBlk</th>
<th>Cache State</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>()</td>
</tr>
<tr>
<td>1</td>
<td>[wd, el]</td>
</tr>
<tr>
<td>2</td>
<td>[mil, wd, el]</td>
</tr>
<tr>
<td>3</td>
<td>[decis, wd, detl, decis_lev[*], mil, el]</td>
</tr>
<tr>
<td>4</td>
<td>[mil, decis, wd, detl, decis_lev[*], el]</td>
</tr>
<tr>
<td>5</td>
<td>[mil, decis, wd, detl, decis_lev[*], el]</td>
</tr>
<tr>
<td>6</td>
<td>[mil, detl, decis_lev[2*], mil, decis_lev[*], el]</td>
</tr>
<tr>
<td>7</td>
<td>[mil, decis, wd, detl, decis_lev[2*], mil, decis_lev[*], el]</td>
</tr>
<tr>
<td>8</td>
<td>[mil, el, decis, wd, detl, decis_lev[2*], mil, decis_lev[*]]</td>
</tr>
</tbody>
</table>

Figure 9. Augmented CFG with virtual control flows.
By controlling the input size, a malicious user can force part of the sbox to be evicted from the cache. As a result, for some key values, accessing sbox results in a cache hit, but for other key values, it results in a cache miss. Although timing side channels have been investigated before [7, 16, 25, 62], these prior works never considered speculative execution. Our contribution, in this context, is to show that even if a program is leak-free under normal execution, it may still be leaky under speculative execution.

### 6.2 Dynamically Bounding Speculation Depth

Although the maximum number of speculatively executed instructions is used to construct the augmented CFG, in practice, the number of speculatively executed instructions can be smaller. For example, when all variables needed to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened. Since our cache analysis aims to resolve a branching condition are in the cache, speculative execution may be shortened.
We have implemented our method in LLVM [33] and experimentally compared it with a state-of-the-art, non-speculative static cache analysis technique [62]. In our experiments, we used a set-associative cache with the LRU replacement policy.

7 Experiments

We have implemented our method in LLVM [33] and experimentally compared it with a state-of-the-art, non-speculative static cache analysis technique [62]. In our experiments, we used a set-associative cache with the LRU replacement policy.
Tables 3 and 4 show the statistics of our benchmarks, collected from various sources including the Malardalen real-time software benchmark [26], a commercially representative embedded software suite named MiBench [27], a high performance patch for SSH (hpn-ssh) [12], a cryptographic toolkit named LibTomCrypt [2], the openssh source code [3], and a Linux kernel for tegra [4] used on Tesla automobiles. These benchmarks are divided into two sets: execution time estimation and side channel detection. The benchmarks for execution time estimation (Table 3) are used as is, whereas the benchmarks for side channel detection (Table 4) are used together with a client program that we wrote, to invoke the benchmark program in a way similar to Figure 10.

### 7.1 Benchmarks

Tables 3 and 4 show the statistics of our benchmarks, collected from various sources including the Malardalen real-time software benchmark [26], a commercially representative embedded software suite named MiBench [27], a high performance patch for SSH (hpn-ssh) [12], a cryptographic toolkit named LibTomCrypt [2], the openssh source code [3], and a Linux kernel for tegra [4] used on Tesla automobiles. These benchmarks are divided into two sets: execution time estimation and side channel detection. The benchmarks for execution time estimation (Table 3) are used as is, whereas the benchmarks for side channel detection (Table 4) are used together with a client program that we wrote, to invoke the benchmark program in a way similar to Figure 10.

### 7.2 Effectiveness: Execution Time Estimation

We first compare our method with the state-of-the-art, non-speculative method [62]. The results are shown in Table 5. For our method, we also report the number of speculative cache misses (#SpMiss), which are not observable from outside of the CPU, the number of conditional branches that can be speculatively executed, and the total number of iterations of our method on loops.

The results show that our method detected more cache misses, thus highlighting the unsoundness of the existing method and the importance of modeling speculative execution during execution time estimation.

As for the analysis time, our method completed all the benchmarks, although it took a longer time than the non-speculative analysis due to its focus on being always sound. The reason why it took significantly longer for our method on loops is that the analysis time and the number of cache misses.

Table 3. Execution time estimation: benchmark statistics.

<table>
<thead>
<tr>
<th>Name</th>
<th>Source</th>
<th>Description</th>
<th>Loc</th>
</tr>
</thead>
<tbody>
<tr>
<td>adpcm</td>
<td>WCET@mdh</td>
<td>motor control</td>
<td>910</td>
</tr>
<tr>
<td>susan</td>
<td>MiBench</td>
<td>image process algorithm</td>
<td>2,140</td>
</tr>
<tr>
<td>layer3</td>
<td>MiBench</td>
<td>mp3 audio lib</td>
<td>2,233</td>
</tr>
<tr>
<td>jcmarker</td>
<td>MiBench</td>
<td>jpeg compose algorithm</td>
<td>1,444</td>
</tr>
<tr>
<td>jdmarker</td>
<td>MiBench</td>
<td>jpeg decompose algorithm</td>
<td>2,068</td>
</tr>
<tr>
<td>jch20</td>
<td>MiBench</td>
<td>jpeg Huffman encoding</td>
<td>694</td>
</tr>
<tr>
<td>gtk</td>
<td>MiBench</td>
<td>GTK plotting routines</td>
<td>949</td>
</tr>
<tr>
<td>g72</td>
<td>mediaBench</td>
<td>routines for G.721 and G.723 conversions</td>
<td>608</td>
</tr>
<tr>
<td>vga</td>
<td>mediaBench</td>
<td>Driver for Borld Graphics Interface</td>
<td>386</td>
</tr>
<tr>
<td>stc</td>
<td>mediaBench</td>
<td>pson Stylus-Color Printer-Drive</td>
<td>492</td>
</tr>
</tbody>
</table>

Table 4. Side channel detection: benchmark statistics.

<table>
<thead>
<tr>
<th>Name</th>
<th>Source</th>
<th>Description</th>
<th>Loc</th>
</tr>
</thead>
<tbody>
<tr>
<td>hash</td>
<td>hpn-ssh</td>
<td>hash function</td>
<td>320</td>
</tr>
<tr>
<td>encoder</td>
<td>LibTomCrypt</td>
<td>hex encode a string</td>
<td>134</td>
</tr>
<tr>
<td>chacha20</td>
<td>LibTomCrypt</td>
<td>chacha20poly1305 cipher</td>
<td>377</td>
</tr>
<tr>
<td>aes</td>
<td>LibTomCrypt</td>
<td>AES implementation</td>
<td>1,838</td>
</tr>
<tr>
<td>str2key</td>
<td>openssl</td>
<td>key prepare for des</td>
<td>385</td>
</tr>
<tr>
<td>des</td>
<td>openssl</td>
<td>des cipher</td>
<td>1,051</td>
</tr>
<tr>
<td>seed</td>
<td>linux-tegra</td>
<td>seed cipher</td>
<td>487</td>
</tr>
<tr>
<td>camellia</td>
<td>linux-tegra</td>
<td>camellia cipher</td>
<td>1,324</td>
</tr>
<tr>
<td>salsa</td>
<td>linux-tegra</td>
<td>Salsa20 stream cipher</td>
<td>279</td>
</tr>
</tbody>
</table>

Table 5. Execution time estimation: comparisons in terms of the analysis time and the number of cache misses.

<table>
<thead>
<tr>
<th>Name</th>
<th>Non-speculative</th>
<th>Speculative</th>
</tr>
</thead>
<tbody>
<tr>
<td>Time(s)</td>
<td>#Miss</td>
<td>Time(s) #Miss</td>
</tr>
<tr>
<td>adpcm</td>
<td>14.40 31</td>
<td>12.70 32</td>
</tr>
<tr>
<td>susan</td>
<td>40570 30</td>
<td>24840 26</td>
</tr>
<tr>
<td>layer3</td>
<td>8464 94</td>
<td>53471 6554</td>
</tr>
<tr>
<td>jcmarker</td>
<td>4.80 27</td>
<td>199</td>
</tr>
<tr>
<td>jdmarker</td>
<td>16.11 35</td>
<td>59777 1518</td>
</tr>
<tr>
<td>jch20</td>
<td>0.48 12</td>
<td>1036 0.44 12</td>
</tr>
<tr>
<td>g72</td>
<td>128 7 1</td>
<td>122 0.94 9</td>
</tr>
<tr>
<td>vga</td>
<td>0.07 4</td>
<td>35 0.06 4</td>
</tr>
<tr>
<td>stc</td>
<td>1.86 31</td>
<td>35222 0.96 23</td>
</tr>
</tbody>
</table>

Table 6. Execution time estimation: comparisons of two strategies for merging speculative executions.

<table>
<thead>
<tr>
<th>Name</th>
<th>Merging at rollback point</th>
<th>Just-in-time merging</th>
</tr>
</thead>
<tbody>
<tr>
<td>Time(s)</td>
<td>#Miss</td>
<td>Time(s) #Miss</td>
</tr>
<tr>
<td>adpcm</td>
<td>14.40 31</td>
<td>12.70 32</td>
</tr>
<tr>
<td>susan</td>
<td>40570 30</td>
<td>24840 26</td>
</tr>
<tr>
<td>layer3</td>
<td>8464 94</td>
<td>53471 6554</td>
</tr>
<tr>
<td>jcmarker</td>
<td>4.80 27</td>
<td>199</td>
</tr>
<tr>
<td>jdmarker</td>
<td>16.11 35</td>
<td>59777 1518</td>
</tr>
<tr>
<td>jch20</td>
<td>0.48 12</td>
<td>1036 0.44 12</td>
</tr>
<tr>
<td>g72</td>
<td>128 7 1</td>
<td>122 0.94 9</td>
</tr>
<tr>
<td>vga</td>
<td>0.07 4</td>
<td>35 0.06 4</td>
</tr>
<tr>
<td>stc</td>
<td>1.86 31</td>
<td>35222 0.96 23</td>
</tr>
</tbody>
</table>
Table 7 shows the results for side channel detection, including comparisons of the two methods in terms of the analysis time and whether leaks are detected.

<table>
<thead>
<tr>
<th>Name</th>
<th>Buffer (byte)</th>
<th>Non-speculative</th>
<th>Speculative</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Time (s)</td>
<td>Leak Detected</td>
<td>Time (s)</td>
</tr>
<tr>
<td>hash</td>
<td>32,908</td>
<td>0.07 No</td>
<td>0.12 Yes</td>
</tr>
<tr>
<td>encoder</td>
<td>32,512</td>
<td>0.03 No</td>
<td>0.10 Yes</td>
</tr>
<tr>
<td>chacha20</td>
<td>26,304</td>
<td>1.18 No</td>
<td>9.24 Yes</td>
</tr>
<tr>
<td>aeb</td>
<td>31,616</td>
<td>0.10 No</td>
<td>0.68 Yes</td>
</tr>
<tr>
<td>strkey</td>
<td>32,768</td>
<td>0.08 No</td>
<td>2.13 No</td>
</tr>
<tr>
<td>des</td>
<td>0</td>
<td>0.01 No</td>
<td>0.01 No</td>
</tr>
<tr>
<td>seed</td>
<td>32,768</td>
<td>0.01 No</td>
<td>0.07 No</td>
</tr>
<tr>
<td>camellia</td>
<td>32,768</td>
<td>0.35 No</td>
<td>6.35 No</td>
</tr>
<tr>
<td>salsa</td>
<td>32,768</td>
<td>0.02 No</td>
<td>0.06 No</td>
</tr>
</tbody>
</table>

7.3 Effectiveness: Side Channel Detection

Table 7 shows the results for side channel detection, including comparisons of the two methods in terms of the analysis time and whether leaks are detected. In this context, a leak refers to the dependency between the cache behavioral difference and sensitive data; furthermore, whether there is a leak or not often depends on the input buffer size controlled by the (potentially malicious) user. Thus, during experiments, we set the buffer size to various values from 32K bytes (the size of cache we use) down to 0 byte.

Generally speaking, the larger the buffer size, the easier that the client program triggers the behavioral difference. Thus we first set the buffer size to 32KB, and starting from there we gradually reduce the buffer size and keep track of the impact of speculative execution on cache state, until the two methods return different results.

Since the benchmarks are mostly cryptographic algorithms, which are relatively small in terms of the number of lines of code, the analysis time is short. Furthermore, our method successfully detected leaks in half of the benchmarks, whereas the existing (unsound) method did not detect leaks in any of them. This highlights the importance of having a sound static cache analysis for speculative execution, e.g., to detect more leaks and avoid bogus proofs (that there is no leak). On one of the benchmarks, des, leaks are detected even if the buffer size is set to 0 because, even without the client program, the benchmark program itself has a user controlled buffer, which can be set to sizes that induce timing side-channel leaks under speculative execution.

As a static analysis procedure, our method may generate false positives. In addition to abstraction, the other source of false positives is modeling of the speculative execution. Therefore, for each of the new leaks detected by our method in Table 7, we manually inspected the software code and the execution trace. Our inspection confirmed that all of them are indeed real leaks; that is, there exist specific memory/cache layouts and execution traces that induce the leaks.

8 Related Work

Abstract interpretation [14] is a framework for conducting static analysis and proving properties. Ferdinand and Wilhelm [18, 20] pioneered the use of abstract interpretation in may- and must-hit cache analyses [61]. Others also used similar techniques to detect timing side channels [7, 16, 62]. However, prior works focused primarily on improving abstract interpretation without considering speculative execution.

There are some techniques that consider the impact of speculative execution [35], but only for the instruction pipeline. In a commercial tool named AIT, speculations are also considered during execution time estimation by leveraging a standalone pipeline analysis as a driver [61]. Since the tool is proprietary, details of this analysis have not been made public; therefore, it is not clear how speculative execution is modeled during abstract interpretation.

Our method differs from the large body of work on statistically estimating the worst-case execution time of real-time software [34, 36, 42] using either CPU simulators or characteristics of prior simulation results [54]. These techniques, while useful, are not designed to be sound, and hence may not be suitable for the applications that we have in mind, such as detecting side-channel leaks or proving that leaks do not exist. The reason is because, if the analysis is not sound, the proof may not be valid and as a consequence, leaks may be left undetected.

For timing side channels, many analysis and verification techniques [6, 9, 10, 25, 45, 46, 53, 62] have been developed, including the one proposed by Chen et al. [11], which uses Cartesian Hoare Logic [51] to prove that timing leaks of a program are bounded. Antonopoulos et al. [5] also developed a method for proving the absence of timing channels. However, these methods only consider instruction-induced timing variance while ignoring the cache.

There are also techniques for improving the accuracy of cache analysis, e.g., by using symbolic execution or model checking to refine the cache analysis results [13, 40, 56] and by extending the analysis from single-core to multi-core CPUs [38]. However, none of these techniques considered speculative execution, which is the focus of our work.

9 Conclusions

We have presented a new abstract interpretation technique that can soundly analyze a program under speculative execution. The goal is to lift existing static analyzers, which were geared toward analyzing only non-speculative executions, so that they become sound also for speculative executions. We have implemented the technique in a static cache analysis tool and evaluated it on two sets of benchmarks, for execution time estimation and side channel detection. Our experimental results show that the method can detect many cache misses and side-channel leaks overlooked by a state-of-the-art non-speculative analysis technique.

Acknowledgments

This work was partially funded by the U.S. National Science Foundation (NSF) under the grants CNS-1617203 and CNS-1702824 and the Office of Naval Research (ONR) under the grant N00014-17-1-2896.
References

[18] Christian Ferdinand, Florian Martin, Reinhard Wilhelm, and Martin
[6] Lucas Bang, Abdullabki Aydin, Quoc-Sang Phan, Corina S. Pasare-
anu, and Tevfik Bultan. 2016. String analysis for side channels with
[54] Abstract Interpretation under Speculative Execution PLDI ’19, June 22–26, 2019, Phoenix, AZ, USA

[22] Pengfei Gao, Jun Zhang, Fu Song, and Chao Wang. 2019. Verifying and Quantifying Side-channel Resistance of Masked Software Implement-
[17] Hassan Eldib, Chao Wang, and Patrick Schaumont. 2014. Formal Veri-
Appendices

We review the original cache analysis in Appendix A and then present the optimized analysis in Appendix B. Finally, we compare them in Appendix C.

A The Original Cache Analysis

A.1 The Cache State

Recall that, without the shadow variables, the abstract cache state is defined as \( S = \langle \text{Age}(v_1), \ldots, \text{Age}(v_n) \rangle \), where each regular variable \( v_i \) (1 ≤ i ≤ n) has an age \( \text{Age}(v_i) \). For our must-hit analysis, \( \text{Age}(v_i) \) is an upper bound of the oldest cache line age for \( v_i \) along all program paths.

A.1.1 The Transfer Function

Given a state \( S \) and an instruction \( \text{inst} \), \( \text{Transfer}(S, \text{inst}) \) computes the new state \( S' = \langle \text{Age}'(v_1), \ldots, \text{Age}'(v_n) \rangle \) as follows:

- If \( \text{inst} \) does not access memory at all, then \( S' = S \).
- Otherwise, let \( v \in V \) be the variable accessed; \( S' \) is computed as follows:

  - For the accessed variable \( v \), set \( \text{Age}'(v) = 1 \) in \( S' \).
  - For variable \( u \in V \) whose age may be younger than \( v \) in \( S \), increment the age of \( u \); that is, \( \text{Age}(u) < \text{Age}(v) \) \( \rightarrow \) \( \text{Age}'(u) = \text{Age}(u) + 1 \).
  - For any other variable \( w \in V \), set \( \text{Age}'(w) = \text{Age}(w) \).

Example A.1. On the left-hand side of Figure 4, \( v \) is the variable accessed, while \( u_1 \) and \( u_4 \) are younger than \( v \) originally. Thus, executing \( \text{inst} \) increases the ages of \( u_1 \) and \( u_4 \) by 1. On the right-hand side, \( u \) is younger than \( v \) originally, while \( w_1 \) and \( w_3 \) are older than \( v \). Thus, executing \( \text{inst} \) increases the age of \( u \) by 1, while the ages of \( w_1 \) and \( w_3 \) remain unchanged.

A.1.2 The Join Operation

A variable must be in the cache only if it is in the cache according to both cache states before the join. Therefore, given two cache states \( S = \langle \text{Age}(v_1), \ldots, \text{Age}(v_n) \rangle \) and \( S' = \langle \text{Age}'(v_1), \ldots, \text{Age}'(v_n) \rangle \), we define \( S'' = S \cup S' \) as follows: \( S'' = \langle \max(\text{Age}(v_1), \text{Age}'(v_1)), \ldots, \max(\text{Age}(v_n), \text{Age}'(v_n)) \rangle \).

Example A.2. For the example in Figure 5, the merged state does not have \( y \) (or \( t \)) because, in at least one of the two states, \( y \) (or \( t \)) was already outside of the cache. Furthermore, \( x, z \) and \( k \) appear in their oldest possible cache lines.

B Cache Analysis with Shadow Variables

B.1 The Cache State

To improve the accuracy, for each program variable \( v_i \) (1 ≤ i ≤ n), we add a shadow variable \( \exists v_i \) to represent the youngest cache line that holds \( v_i \) along some program path. With the shadow variables, the new cache state is defined as \( S = \langle \text{Age}(v_1), \ldots, \text{Age}(v_n), \text{Age}(\exists v_1), \ldots, \text{Age}(\exists v_n) \rangle \).

Example B.1. For the example in Figure 5, the original state is \( \langle \{ \}, \{ x \}, \{ x, z \}, k \rangle \), but the modified state with shadow variables is \( \langle \{ \exists x, \exists y, \exists z \}, \{ x, z \}, \{ x, z \}, k \rangle \). Here, \( \exists x \) is in the youngest cache line for \( x \) along some path whereas \( x \) is in the oldest possible cache line along all paths.

The modified state can be viewed as two separate states: \( S^{\text{MUST}} = \langle \{ \}, \{ x, z \}, \{ k \} \rangle \), the oldest cache lines along all paths, and \( S^{\text{MAY}} = \langle \{ \exists x, \exists y, \exists z \}, \{ x, z \}, \{ \exists k \} \rangle \), the youngest cache lines along some paths.

B.1.1 The Transfer Function

Given the state \( S \) and an instruction \( \text{inst} \), the new state \( S' \) is computed using the modified \( \text{Transfer}(S, \text{inst}) \) as follows:

- If \( \text{inst} \) does not access memory at all, then \( S' = S \).
- Otherwise, let \( v \in V \) be the variable accessed in \( S; S' \) is computed as follows:

  - For the accessed variable \( v \), set \( \text{Age}'(v) = 1 \) in \( S' \) and then update the shadow variables (in \( S^{\text{MAY}} \)):
    - Set \( \text{Age}(\exists v) = 1 \).
    - If \( \text{Age}(\exists v) \leq \text{Age}(v) \), set \( \text{Age}'(\exists v) = \text{Age}(v) + 1 \).
    - Otherwise, let \( u \in V \) whose age may be younger than \( v \) in \( S \), increment the age of \( u \); that is, \( \text{Age}(u) < \text{Age}(v) \) \( \rightarrow \) \( \text{Age}'(u) = \text{Age}(u) + 1 \).
    - For any other variable \( w \in V \), set \( \text{Age}'(w) = \text{Age}(w) \).

There are two steps: we first compute the \( S^{\text{MAY}} \) set of shadow variables and then compute the \( S^{\text{MUST}} \) set of regular variables. \( N_{\text{young}}(u) \) is computed based on \( S^{\text{MAY}} \) and used to refine the aging rule for \( u \) when computing \( S^{\text{MUST}} \).

The condition for aging \( \exists u \), i.e., \( \text{Age}(\exists u) \leq \text{Age}(3v) \), also differs from the one for aging \( u \), which is \( \text{Age}(u) < \text{Age}(v) \).

Example B.2. Below is the comparison of the original and modified transfer functions for the example in Figure 5.
B.1.2 The Join Operation

A regular variable $v \in V$ must be in the cache only if it is in the cache according to both states. Thus, after the join, its age is the maximum of the two prior ages.

A shadow variable $\exists v$ may be in the cache if it is in the cache according to either of the two states. Thus, after the join, its age is the minimum of the two prior ages.

Formally, the joined state $S' = S \cup S''$ is defined as follows: $S'' = \langle \max(\text{Age}(v_1), \text{Age}(v_1')) \ldots \max(\text{Age}(v_n), \text{Age}(v_n')) \rangle, \ldots \min(\text{Age}(\exists v_1), \text{Age}(\exists v_1')) \ldots \min(\text{Age}(\exists v_n), \text{Age}(\exists v_n')) \rangle$.

Example B.3. For the example in Figure 5, the two states prior to the join are $S = \{\{\exists x, y\}, \{\exists y, z\}, \{\exists z, k\}\}$ and $S' = \{\{\exists t, i\}, \{\exists x, z\}, \{\exists x, k\}\}$. The joined state is $S'' = \{\{\exists x, \exists y\}, \{\exists y, \exists z\}, \{x, z\}, \{\exists k, k\}\}$. Again, this can be viewed as two separate sets: the must set $\{\}, \{x, z\}, \{k\}$ and may set $\{\exists x, \exists y\}, \{\exists y, \exists z\}, \{\}, \{\exists k\}$.

C Comparison

We apply both the original analysis and the modified analysis to the example in Figure 11 after loop unrolling, and compare the results. In the table below, Column 2 shows the result of the original analysis and Column 3 shows the result of the modified analysis.

For ease of comprehension, here, we split the modified cache state $S$ into two parts: $SMAY$, which has the shadow variables, and $SMUST$, which has the regular variables. Figure 13 shows the same states as in this table, but in the combined form.

<table>
<thead>
<tr>
<th>State</th>
<th>Cache state (original)</th>
<th>Cache state (w/ shadow variables)</th>
</tr>
</thead>
<tbody>
<tr>
<td>$S_0$</td>
<td>$[a, a, a, a]$</td>
<td>$[a, a, a, a]$</td>
</tr>
<tr>
<td>$S_1 = \text{Trans}(S_0, \text{ref} a)$</td>
<td>$[a, a, a, a]$</td>
<td>$[a, a, a, a]$</td>
</tr>
<tr>
<td>$S_2 = \text{Trans}(S_1, \text{ref} b)$</td>
<td>$[b, a, a, a]$</td>
<td>$[a, a, a, a]$</td>
</tr>
<tr>
<td>$S_3 = \text{Trans}(S_2, \text{ref} c)$</td>
<td>$[c, a, a, a]$</td>
<td>$[a, a, a, a]$</td>
</tr>
<tr>
<td>$S_4 = \text{Join}(S_2, S_3)$</td>
<td>$[{}, a, a, a]$</td>
<td>$[{}, a, a, a]$</td>
</tr>
<tr>
<td>$S_5 = \text{Trans}(S_4, \text{ref} b)$</td>
<td>$[a, a, a, a]$</td>
<td>$[a, a, a, a]$</td>
</tr>
<tr>
<td>$S_6 = \text{Trans}(S_5, \text{ref} c)$</td>
<td>$[c, c, a, a]$</td>
<td>$[c, c, a, a]$</td>
</tr>
<tr>
<td>$S_7 = \text{Join}(S_5, S_6)$</td>
<td>$[{}, a, a, a]$</td>
<td>$[{}, a, a, a]$</td>
</tr>
<tr>
<td>$S_8 = \text{Trans}(S_7, \text{ref} b)$</td>
<td>$[b, a, a, a]$</td>
<td>$[a, a, a, a]$</td>
</tr>
<tr>
<td>$S_9 = \text{Trans}(S_8, \text{ref} c)$</td>
<td>$[c, c, a, a]$</td>
<td>$[c, c, a, a]$</td>
</tr>
<tr>
<td>$S_{10} = \text{Join}(S_8, S_9)$</td>
<td>$[{}, a, a, a]$</td>
<td>$[{}, a, a, a]$</td>
</tr>
</tbody>
</table>

Initially, both $SMAY$ and $SMUST$ are $[a, a, a, a]$.

Executing $\text{ref} a$ will load $a$ into the first cache line. Therefore, $SMAY = \{a, a, a, a\}$ and $SMUST = \{a, a, a, a\}$. The second part remains the same as the result of the original analysis for this particular case, but in general, it can be more accurate.

This is because, at each step, we first compute $SMAY$ and then use $N_{young}(u)$ computed from $SMAY$ to more accurately compute $SMUST$.

C.1 Same Results

Sometimes, $SMUST$ remain the same as the result of the original analysis.

For instance, the original result for $S_5 = \text{Trans}(S_4, \text{ref} b)$ is $[b, \{\}, a, \bot]$. In the new analysis, we first compute $SMAY = \{\{\exists b\}, \{\exists a, \exists c\}, \bot, \bot\}$ and then compute $SMUST$ as follows:

- $\text{Age}(a) = 2$ and $\text{Age}(b) = 5$ (outside of the cache).
- $\text{Age}(\exists b) = 1$, $\text{Age}(\exists a) = 2$, and $\text{Age}(\exists c) = 2$.
- Thus, $N_{young}(a) = \{|\exists b, \exists c\}| = 2$.
- Since $\text{Age}(a) < \text{Age}(b)$ and $N_{young}(a) \neq \text{Age}(a)$, we set $\text{Age}(a) = \text{Age}(a) + 1$.

Therefore, $SMUST = [b, \{\}, a, \bot]$.

C.2 Better Results

Sometimes, $SMUST$ is more accurate than the result of the original analysis.

For instance, the original result for $S_8 = \text{Trans}(S_7, \text{ref} b)$ is $[b, \{\}, \{\}, a]$. In the new analysis, we first compute $SMAY = \{\{\exists b\}, \{\exists a, \exists c\}, \bot, \bot\}$ and then compute $SMUST$ as follows:

- $\text{Age}(a) = 3$ and $\text{Age}(b) = 5$ (outside of the cache).
- $\text{Age}(\exists b) = 1$, $\text{Age}(\exists a) = 2$, and $\text{Age}(\exists c) = 2$.
- Thus, $N_{young}(a) = \{|\exists b, \exists c\}| = 2$.
- Since $\text{Age}(a) < \text{Age}(b)$ but $N_{young}(a) \neq \text{Age}(a)$, we set $\text{Age}(a) = \text{Age}(a)$.

Therefore, $SMUST = [b, \{\}, a, \bot]$, which is more accurate than $[b, \{\}, \{\}, a]$.

C.3 Correctness

In general, $SMUST$ is either the same as, or more accurate than, the result of the original analysis – due to the following modified rule for aging the variable $u$.

Whenever $\text{Age}(u) < \text{Age}(v)$ but $N_{young}(u) < \text{Age}(u)$, we set $\text{Age}(u) = \text{Age}(u)$.

This optimization is safe because, if the number of shadow variables that are younger than or equal to $u$, denoted $N_{young}(u)$, is less than the current age of $u$, there must be younger cache lines to hold all of them. In such a case, we should not (unnecessarily) increase the age of $u$. 