《软件分析》笔记

8月 22, 2020

课程

南京大学《软件分析》Yue Li and Tian Tan
课程主页, 视频

1. Course Introduction 课程介绍

PL 的分支
语言的分类：命令式 (C, Java)、函数式 (Haskell)、逻辑式/声明式
挑战：语言没有变，但是程序变复杂了，如何保证可靠
静态分析的用途：程序可靠性、程序安全、编译优化、程序理解

Static Analysis 静态分析

静态分析：Static analysis analyzes a program P to reason about its behaviors and determines whether it satisfies some properties before running P.
一句话概括静态分析：Static Analysis: ensure (or get close to) soundness, while making good trade-offs between analysis precision and analysis speed. 在保证正确性的前提下，在精度和速度上做平衡。
两个词概括静态分析：
- Abstraction: 具体值->符号值
- Over-approximation
  - Transfer functions: 抽象值上的操作
  - Control flows

sound & complete

Rice’s Theorem: Any non-trivial property of the behavior of programs in a r.e. language is undecidable.
- non-trivial properties ~= interesting properties ~= the properties related with run-time behaviors of programs
- r.e. (recursively enumerable) 递归可枚举语言: recognizable by a Turing-machine
- There is no such approach to determine whether P satisfies such non-trivial properties, i.e., giving exact answer: Yes or No
- 故不存在 perfect (sound & complete) static analysis
sound & complete:
- Over- and under-approximations are both for safety of analysis
- sound: 报出所有问题 may analysis: outputs information that may be true (over-approximation) (safe=over)
- complete: 报出的问题都是对的 must analysis: outputs information that must be true (under-approximation) (safe=under)
useful static analysis
- 妥协 soundness (false negatives 可能漏报)
- 妥协 completeness (false positives 可能误报) (大多数情况的分析，因为 soundness 很重要)

2. Intermediate Representation 中间表示

Compiler
AST vs. IR
- AST
  - high-level and closed to grammar structure
  - usually language dependent
  - suitable for fast type checking
  - lack of control flow information
- IR
  - low-level and closed to machine code
  - usually language independent
  - compact and uniform
  - contains control flow information
  - usually considered as the basis for static analysis
- 所以 IR 更适合静态分析
- IR 没有固定的定义，常用三地址码
3-Address Code 三地址码 (3AC)
- 定义: Each 3AC contains at most 3 addresses (name, constant, temporary)
- 形式
  - x = y bop z
  - x = uop y
  - x = y
  - goto L
  - if x goto L
  - if x rop y goto L
Soot (static analysis framework for Java) & Jimple (Soot's IR)
- Do-While Loop
- Method Call
  - JVM调用指令
    - invokespecial: call constructor 构造函数, superclass methods, private methods
    - invokevirutal: instance methods call (virtual dispatch)
    - invokeinterface: 类似于 invokevirtual, 区别: cannot optimization, check interface implementation
    - invokestatic: call static methods
    - invokedynamic: Java 是静态语言，方便其它动态语言在JVM上跑
  - 尖括号内部是 method signature, 包括 class name, return type, method name, parameter types
- Class
  - clinit: 静态初始化
Static Single Assignment (SSA)

Control Flow Graphs (CFG)

构造
- 点: basic blocks (BB)
- 边
  - There is a conditional or unconditional jump from the end of A to the beginning of B 跳转
  - B immediately follows A in the original order of instructions and A does not end in an unconditional jump 相邻且不是直接 goto
- 把 goto 指令 改成 goto basic block
- 加上 Entry 和 Exit

3&4. Data Flow Analysis —Applications

Data Flow Analysis: How Data Flows on CFG?
Data-flow analysis is to find a solution to a set of safe-approximation directed constraints on the IN[s]’s and OUT[s]’s, for all statements s.
Input and Output States
In each data-flow analysis application, we associate with every program point a data-flow value that represents an abstraction of the set of all possible program states that can be observed for that point.
constraints
- Transfer Function Constraints: 根据 semantics of statements
- Control Flow’s Constraints
Dataflow Analysis 不涉及
- Method Calls 方法调用
  - 只处理 Intra-procedural CFG
  - Inter-procedural Analysis (第七讲) 再学
- Aliases 别名
  - 不存在两个变量指向内存中同一个地址
  - 指针分析再学

Reaching Definitions 到达定值

定义：A definition d at program point p reaches a point q if there is a path from p to q such that d is not “killed” along that path
应用：检测未初始化变量：introduce a dummy definition for each variable v at the entry of CFG, and if the dummy definition of v reaches a point p where v is used, then v may be used before definition (as undefined reaches v)
特点：may analysis / over-approximation, forward
D: v = x op y This statement “generates” a definition D of variable v and “kills” all the other definitions in the program that define variable v, while leaving the remaining incoming definitions unaffected.
gen = {d | 生成且没被后面的语句 kill}, kill = {程序中其它对 v 的定值}
算法
为什么会终止：OUT[S] never shrinks 且状态是有限的（到达不动点）

Live Variables 活跃变量

定义：Live variables analysis tells whether the value of variable v at program point p could be used along some path in CFG starting at p. If so, v is live at p; otherwise, v is dead at p.
应用：替换寄存器时，找到 dead value
特点：may analysis / over-approximation, backward
use：先use（可能后def），def：先def（可能后use）
算法：

Available Expressions 可用表达式

定义：An expression x op y is available at program point p if
- 1. all paths from the entry to p must pass through the evaluation of x op y, and
- 1. after the last evaluation of x op y, there is no redefinition of x or y
应用：找公共子表达式
特点：must analysis / under-approximation, forward
gen：基本块求值 x op y，且之后没有对 x 或 y 赋值，kill：基本块对 x 或 y 赋值，且没有重新计算 x op y（如果重新计算，即使值不一样，也属于 gen）
算法

5&6. Data Flow Analysis - Foundations

Iterative Algorithm, Another View
1. 抽象：\(F: V^k \to V^k\)，其中\(V^k=(V_1 × V_2 … × V_k)=(OUT[n_1], OUT[n_2], …, OUT[n_k])\)
2. X is a fixed point of function F if \(X = F(X)\)
问题
1. 一定会有不动点吗？
2. 只有一个不动点吗？如果不是，我们求出来的是最好的吗？
3. 最多多少步能到达不动点？
upper / lower bound
1. Given a poset (P, ⊑) and its subset S that S ⊆ P, we say that u ∈ P is an upper bound of S, if ∀x ∈ S, x ⊑ u.
2. l ∈ P is an lower bound of S, if ∀x ∈ S, l ⊑ x.
3. least upper bound (lub, or join) of S, written ⊔S: if for every upper bound of S, say u, ⊔S ⊑ u
4. greatest lower bound (glb, or meet) of S, written ⊓S: if for every lower bound of S, say l, l ⊑ ⊓S.
5. lub 和 glb 不一定存在，但如果存在则唯一
lattice 格
1. lattice: Given a poset (P, ⊑), ∀a, b ∈ P, if a ⊔ b and a ⊓ b exist, then (P, ⊑) is called a lattice
  1. poset + every pair of its elements has lub and glb
2. complete lattice: All subsets of a lattice have lub and glb
  1. Every complete lattice (P, ⊑) has a greatest element \(\top\)= ⊔P called top and a least element ⊥ = ⊓P called bottom
3. Product Lattice: \(L^n = (P, ⊑)\) that is defined by:
  1. P = P1 ×… × Pn
  2. (x1, …, xn) ⊑ (y1, …, yn)⟺(x1 ⊑ y1) ∧…∧ (xn ⊑ yn)
  3. (x1, …, xn) ⊔ (y1, …, yn) = (x1 ⊔1 y1, …, xn ⊔n yn)
  4. (x1, …, xn) ⊓ (y1, …, yn) = (x1 ⊓1 y1, …, xn⊓n yn)
Data Flow Analysis Framework via Lattice
1. A data flow analysis framework (D, L, F) consists of:
  1. D: a direction of data flow: forwards or backwards
  2. L: a lattice including domain of the values V and a meet ⊓ or join ⊔ operator
  3. F: a family of transfer functions from V to V
2. Data flow analysis can be seen as iteratively applying transfer functions and meet/join operations on the values of a lattice
Monotonicity 单调性
1. A function f: L → L (L is a lattice) is monotonic if ∀x, y ∈ L, x ⊑ y⟹f(x) ⊑ f(y)
Fixed Point Theorem
1. Given a complete lattice (L, ⊑), if (1)f: L → L is monotonic and (2) L is finite, then the least fixed point of f can be found by iterating f(⊥), f(f(⊥)), …, f^k(⊥) until a fixed point is reached, the greatest fixed point of f can be found by iterating f(\(\top\)), f(f(\(\top\))), …, f^k(\(\top\)) until a fixed point is reached.
Relate Iterative Algorithm to Fixed Point Theorem
1. 每个node(BB)关联一个lattice，因此k-tuple对应product lattice，F对应f
2. 执行的有限性：Assume the lattice height is h and the number of nodes in CFG is k. We need at most i = h×k iterations.
3. F的单调性
May/Must Analysis, A Lattice View
MOP and Distributivity
Meet-Over-All-Paths Solution (MOP)
MOP比Iterative Algorithm更准，当满足分配律(distributive)时一样
Constant Propagation (不distributive)
1. 定义：Given a variable x at program point p, determine whether x is guaranteed to hold a constant value at p.
2. forward must analysis
3. Lattice：
4. Transfer Function：
Worklist Algorithm: Iterative Algorithm的优化
1. Iterative Algorithm
2. Worklist Algorithm

7. Interprocedural Analysis

Call Graph Construction (CHA)

Call Graph: 函数调用关系
构造调用图的方法：Class hierarchy analysis (CHA), Pointer analysis (k-CFA)
Java 方法调用类型
Method Dispatch of Virtual Calls
- During run time, a virtual call is resolved based on
  - type of the receiver object (pointed by o): 𝒄
  - method signature at the call site: 𝒎
- signature 方法签名
- Dispatch(c, m)
Class Hierarchy 层次结构 Analysis (CHA)
- 用CHA解目标方法（单个调用点）
- 例子
CHA的特点：快、不准
用CHA解构造Call Graph
- 算法：从main开始，对可达方法m的每个call site cs解目标方法(Resolve(cs))
- 例子

Interprocedural Data Flow Analysis 过程间数据流分析

Interprocedural Control Flow Graph 过程间控制流图：An ICFG of a program consists of CFGs of the methods in the program, plus two kinds of additional edges （ICFG = CFGs + call & return edges）
- Call edges : from call sites to the entry nodes of their callees
- Return edges : from return statements of the callees to the statements following their call sites (i.e., return sites
- 例子
- Node transfer : same as intra procedural constant propagation, plus that
  - For each call node, kill data flow value for the LHS variable. Its value will flow to return site along the return edges
- Edge transfer
  - Call edge transfer : transfer data flow from call node to the entry node of callee (along call edges) （传参数）
  - Return edge transfer : transfer data flow from return node of the callee to the return site (along return edges) （传返回值）
例子 Interprocedural Constant Propagation

8. Pointer Analysis

Pointer analysis 指针分析
- Computes which memory locations / objects a pointer (variable or field) can point to
- over approximation: a pointer may point to which objects?
- 例子
指针分析的用途：
- Fundamental information
  - Call graph, aliases (can two pointers point to the same object?), …
- Compiler optimization
  - Virtual call inlining, ...
- Bug detection
  - Null pointer detection, …
- Security analysis
  - Information flow analysis, …
指针分析的要素
- Heap abstraction 堆抽象: How to model heap memory?
  - √ Allocation-site 对每个创建点(new)做抽象（循环只算一个）
  - ~~Storeless~~
- Context sensitivity 上下文敏感 How to model calling contexts?
  - Context sensitive 区分不同上下文
  - √ Context insensitive
  - 先学上下文不敏感，再学敏感
- Flow sensitivity 控制流敏感 How to model control flow?
  - ~~Flow sensitive 执行顺序相关~~
  - √ Flow insensitive 把程序当作无序语句的集合，只维护一份结果
- Analysis scope 分析范围 Which parts of program should be analyzed?
  - √ Whole program 全程序
  - ~~Demand driven 需求驱动~~
Java 中的指针
- √ Local variable: x
- ~~Static field: C.f (当作 global variable)~~
- √ Instance field: x.f
- ~~Array element: array[i] 忽略下标~~
指针分析只关注 pointer affecting statements
- New: x = new T()
- Assign: x = y
- Store: x.f = y
- Load: y = x.f
- Call: r = x.k(a, ...)
  - ~~Static call: C.foo()~~
  - ~~Special call: super.foo()/x.<init>()/this.privateFoo()~~
  - √ Virtual call: x.foo() (只关注最难的一种)

9&10. Pointer Analysis - Foundations

先关注 pointer affecting statements 的前四种，之后再引入方法调用
Domains and Notations \(O\)也是创建点的集合
Rules
Pointer Flow Graph (PFG)
- Pointer flow graph of a program is a directed graph that expresses how objects flow among the pointers in the program.
- Nodes: Pointer = V ⋃ (O ×F)
- Edges: Pointer × Pointer （x -> y means x may flow to y）
- 例子
- 建图和传播互相依赖
Pointer Analysis: Algorithms
- worklist: 𝑊𝐿 ⊆ <Pointer, 𝒫(O)>∗
  - Each worklist entry <𝑛, 𝑝𝑡𝑠> is a pair of pointer 𝑛 and points to set 𝑝𝑡𝑠, which means that 𝑝𝑡𝑠 should be propagated to 𝑝𝑡(𝑛)
加入method call
- 过程间分析需要call graph
- 用指针分析来建call graph
- 故一边做指针分析一边建call graph（互相依赖）
- rule
  - \(l \to m\in CG\)
  - 四件事情：dispatch、传this、传参数、传返回值
  - 为什么x到this不建边：容易出现假的指向关系
- 算法

11&12. Pointer Analysis - Context Sensitivity

Context Sensitivity (C.S.) = Context Insensitivity (C.I.) + calling contexts (call-site的序列在哪个点被调用)
Cloning-Based Context Sensitivity
- each method is qualified by one or more contexts
- variables are also qualified by contexts (inherited from the method they are declared in)
Context-Sensitive Heap
- 对象分配在堆区，所以给堆加上下文
- 例子
Domains and Notations
- c: list of call site
rules
- call: dispatch、选择上下文
- select() 例子
Context Sensitive Pointer Analysis
- Pointer Flow Graph with C.S.
  - Nodes 带上下文
- 算法
Context Sensitivity Variants 类型
- 即 select 函数的算法 \(Select(𝑐, 𝑙, 𝑐′: 𝑜_𝑖)\)
  - c: caller context
  - l: call site
  - c': heap context
  - oi: receiver object
- k-Limiting Context Abstraction：限制上下文长度<=k
  - e.g., k=2 for method context, k=1 for heap contexts
- Call site sensitivity (call string sensitivity, or k-CFA)
  - \(Select(c,l,\_)=[𝑙′,…,𝑙′′,𝑙]\) where \(𝑐=[𝑙′,…,𝑙′′]\)
  - 例子
  - k-Call Site Sensitivity / k-CFA
    - 1-call site/1-CFA: \(Select(\_,l,\_) =[𝑙]\)
    - 2-call site/2-CFA: \(Select(c,l,\_)=[𝑙′′,𝑙]\) where \(𝑐=[𝑙′,𝑙′′]\)
- Object sensitivity
  - \(Select(\_,\_,c':o_i)=[𝑜_𝑗,…,𝑜_𝑘,𝑜_𝑖]\) where \(𝑐′=[𝑜_𝑗,…,𝑜_𝑘]\)
- Type sensitivity
  - \(Select(\_,\_,c':o_i=[𝑡′,…,𝑡′′,InType(𝑜_𝑖)]\) where \(𝑐′=[𝑡′,…,𝑡]\), InType(o)=o的创建点所在类的类型（不是o的类型）
- 总结
  - Precision: object > type > call site
  - Efficiency: type > object > call site

13. Static Analysis for Security

目标：在有坏人的情况下保证运行、数据安全
两种技术
- 访问控制：限制权限，无法保证获得权限之后
- Information Flow Security 信息流安全 (end-to-end)：跟踪数据流动，关心数据的传播
  - Information Flow 信息流：如果变量x中的信息传递到变量y，则有信息流x->y
  - Security Levels / Classes：给变量分级
    - 分级方法举例：H (high security, secret), L (low security, public)
  - Noninterference policy 非干涉策略
    - H不能影响（流向）L
两种性质
- Confidentiality 保密性：防止秘密信息泄露 (H --×-> L, read protection)
- Integrity 完整性：防止不可信的信息污染秘密信息 (L --×-> H, write protection)
  - 注入错误/攻击
  - 包括 Correctness 准确性、Completeness 完全性、Consistency 一致性
- 对偶关系
两种信息流动方式
- Explicit Flows 显式流: 通过 direct copy 流动信息
- implicit flows 隐式流: 通过副作用泄露信息
  - Covert/Hidden Channels 隐藏信道：一个机制本来不是用来传递信息的，但是把信息传递出去了
    - channel：传递信息的机制
- 与显式流相比，隐藏信道泄露的信息有限
Taint Analysis 污点分析
- 污点分析关心污点数据从source会不会流到特定的位置 (sinks)
- 应用：
  - Confidentiality
    - Source: source of secret data
    - Sink: leakage
    - Information leaks
  - Integrity
    - Source: source of untrusted data
    - Sink: critical computation
    - Injection errors
- 借助指针分析实现污点分析
  - Treats tainted data as (artificial) objects
  - Treats sources as allocation sites (of tainted data)
  - Domains and Notations
  - Output: 𝑇𝑎𝑖𝑛𝑡𝐹𝑙𝑜𝑤𝑠: a set of pairs of tainted data and sink methods
  - Rules: (其余规则同指针分析)
  - 例子

14. Datalog-Based Program Analysis

Motivation
- Imperative 命令式语言: how to do (~implementation)
- Declarative 声明式语言: what to do (~specification)
Datalog = Data + Logic
- a declarative logic programming language that is a subset of Prolog
- Predicates 谓词 / Data: a set of statements (table of data)
- Fact 事实: 在 table 里的
- Atoms 原子: P(X1, X2, ..., Xn)
  - P: name of predicate
  - Xi: arguments (terms)，包括变量和常量
  - 例子：Age(person,18)
  - 关系原子 P(X1,X2,…, Xn) 为真: tuple X1,X2,…, Xn 在 P 中
  - 算术原子，如 age >= 18
- Rules / Logic: H <- B1, B2, …, Bn ; C1, ..., !Cm. 表示如果body为true，那么head为true
  - H: Head, is an atom
  - B: Body, Bi (an atom) is called subgoal
  - 例子：Adult(person) <- Age(person,age), age >= 18.
  - ,表示与
  - ;表示或
  - !表示非
- Rule Interpretation 解释/推导: 枚举
- Datalog program = Facts + Rules
- EDB and IDB Predicates
  - EDB (extensional data base) 运行前给出的，输入，不可变
  - IDB (intensional data base) 根据rules推导出的，输出
- Recursion: 自己推导自己
  - 例子：Reach(from, to) <- Edge(from, to); Reach(from, node), Edge(node, to).
- Rule Safety：不能处理 unsafe 的无限的情况，不允许递归和取反同时出现
- 特征
  - Monotonicity 单调性：只能产生不能删除
  - Termination 终止性：单调+有限
Pointer Analysis via Datalog
- 建模
- 例子
- Rules (flow insensitive)
Taint Analysis via Datalog
- 建模
  - EDB predicates
    - Source(m : M ) // source methods
    - Sink(m : M ) // sink methods
    - Taint(l : S , t : T ) // associates each call site to the tainted data from the call site
  - IDB predicate
    - TaintFlow (t : T , m : M ) // detected taint flows, tainted data 𝑡 may flow to sink method 𝑚
- rules
总结
- 优点：可读性强、容易实现、engine已有优化
- 缺点：表达能力弱、不能完全控制性能（engine是黑盒）

15. CFL-Reachability and IFDS

Infeasible Paths: Paths in CFG that do not correspond to actual executions，我们希望尽量避免，但是静态分析很难判断
Realizable Paths: The paths in which “returns” are matched with corresponding “calls” 调用和返回要匹配上
- 例子
- Our goal is to recognize realizable paths so that we could avoid polluting analysis results along unrealizable paths.
- 方法：括号匹配
- 系统方法：CFL-Reachability

CFL-Reachability

CFL-Reachability: A path is considered to connect two nodes A and B, or B is reachable from A, only if the concatenation of the labels on the edges of the path is a word in a specified context-free language.
Partially Balanced-Parenthesis Problem via CFL
- Every right parenthesis “)i” is balanced by a preceding left parenthesis “(i”, but the converse needs not hold（括号序列的前缀）
- For each call site i, label its call edge “(i” and return edge “)i”
- Label all other edges with symbol “e”
A path is a realizable path iff the path’ word is in the language L(realizable)

IFDS

程序分析->图可达性问题
Meet-Over-All-Realizable-Paths (MRP)
IFDS (Interprocedural, Finite, Distributive, Subset Problem)
- 前提条件：IFDS is for interprocedural data flow analysis with distributive flow functions over finite domains.
- 过程：Given a program P, and a dataflow-analysis problem Q
  - Build a supergraph G* for P and define flow functions for edges in G* based on Q
  - Build exploded supergraph G# for P by transforming flow functions to representation relations (graphs)
  - Q can be solved as graph reachability problems (find out MRP solutions) via applying Tabulation algorithm on G#
- Let n be a program point, data fact d ∈ MRP_n, iff there is a realizable path in G# from <s_main, 0> to <n, d>. (then d’s white circle turns to blue)
supergraph G* = (N*, E*)
- 包括一些flow graphs Gi
- 每个Gi加上start node si和exit node ei
- 过程调用表示为Call_j和Ret_j
- 例子
- G*上的intraprocedural 边
  - call-to-return-site edge from Call_p to Ret_p
  - call-to-start edge from Call_p to s_p of the called procedure
  - exit-to-return-site edge from e_p of the called procedure to Ret_p
  - 例子(紫色、绿色、蓝色)
Design Flow Functions
- 问题“Possibly-uninitialized variables”: for each node n ∈ N*, determine the set of variables that may be uninitialized before execution reaches n. 可能没有被初始化的变量
- 例子
Build Exploded Supergraph
- Build exploded supergraph G# for a program by transforming flow functions to representation relations (graphs)
- Each flow function can be represented as a graph with 2(D+1) nodes (at most (D+1)^2 edges), where D is a finite set of dataflow facts
whether data fact a holds at program point p
- traditional: if a is in OUT[n4] after the analysis finishes
- IFDS: if there is a realizable path from <s_main, 0> to <n4,a>
Tabulation Algorithm
Understanding the Distributivity of IFDS
- distributivity: F(x ∧ y) = F(x) ∧ F(y)
- Given a statement S, besides S itself, if we need to consider multiple input data facts to create correct outputs, then the analysis is not distributive and should not be expressed in IFDS. In IFDS, each data fact (circle) and its propagation (edges) could be handled independently, and doing so will not affect the correctness of the final results.
  - constant propagation不行：存在z=x+y
  - pointer analysis不行：无法分析别名

16.Soundiness

Soundness: Conservative 保守 approximation: the analysis captures all program behaviors, or the analysis result models all possible executions of the program
Hard-to-analyze features: an aggressively conservative treatment to these features will likely make the analysis too imprecise to scale, rendering the analysis useless
Soundiness: A soundy analysis typically means that the analysis is mostly sound, with well-identified unsound treatments to hard/specific language features 把不sound的部分讲一下怎么做的
Soundness, Soundiness and Unsoundness
- A sound analysis requires to capture all dynamic behaviors
- A soundy analysis aims to capture all dynamic behaviors with certain hard language features unsoundly handled within reason 合理
- An unsound analysis deliberately ignores 故意忽略 certain behaviors in its design for better efficiency, precision or accessibility

Hard Language Feature

Java Reflection
Native Code
- Java Native Interface (JNI): A function module of JVM which allows interoperation between Java and Native code (C/C++)
  - native code: platform dependent features, existing libraries
- 解决方法: Manually models the critical native code