《软件分析》笔记

  1. 1. Course Introduction 课程介绍
    1. Static Analysis 静态分析
    2. sound & complete
  2. 2. Intermediate Representation 中间表示
    1. Control Flow Graphs (CFG)
  3. 3&4. Data Flow Analysis —Applications
    1. Reaching Definitions 到达定值
    2. Live Variables 活跃变量
    3. Available Expressions 可用表达式
  4. 5&6. Data Flow Analysis - Foundations
  5. 7. Interprocedural Analysis
    1. Call Graph Construction (CHA)
    2. Interprocedural Data Flow Analysis 过程间数据流分析
  6. 8. Pointer Analysis
  7. 9&10. Pointer Analysis - Foundations
  8. 11&12. Pointer Analysis - Context Sensitivity
  9. 13. Static Analysis for Security
  10. 14. Datalog-Based Program Analysis
  11. 15. CFL-Reachability and IFDS
    1. CFL-Reachability
    2. IFDS
  12. 16.Soundiness
    1. Hard Language Feature

南京大学《软件分析》Yue Li and Tian Tan
课程主页, 视频

1. Course Introduction 课程介绍

  • PL 的分支 image-20200819203756931
  • 语言的分类:命令式 (C, Java)、函数式 (Haskell)、逻辑式/声明式
  • 挑战:语言没有变,但是程序变复杂了,如何保证可靠
  • 静态分析的用途:程序可靠性、程序安全、编译优化、程序理解

Static Analysis 静态分析

  • 静态分析:Static analysis analyzes a program P to reason about its behaviors and determines whether it satisfies some properties before running P.
  • 一句话概括静态分析:Static Analysis: ensure (or get close to) soundness, while making good trade-offs between analysis precision and analysis speed. 在保证正确性的前提下,在精度和速度上做平衡。
  • 两个词概括静态分析:
    • Abstraction: 具体值->符号值 image-20200819211636037
    • Over-approximation
      • Transfer functions: 抽象值上的操作 image-20200819211755847
      • Control flows image-20200819212044350

sound & complete

  • Rice’s Theorem: Any non-trivial property of the behavior of programs in a r.e. language is undecidable.
    • non-trivial properties ~= interesting properties ~= the properties related with run-time behaviors of programs
    • r.e. (recursively enumerable) 递归可枚举语言: recognizable by a Turing-machine
    • There is no such approach to determine whether P satisfies such non-trivial properties, i.e., giving exact answer: Yes or No
    • 故不存在 perfect (sound & complete) static analysis
  • sound & complete: image-20200819205225292
    • Over- and under-approximations are both for safety of analysis
    • sound: 报出所有问题 may analysis: outputs information that may be true (over-approximation) (safe=over)
    • complete: 报出的问题都是对的 must analysis: outputs information that must be true (under-approximation) (safe=under)
  • useful static analysis image-20200819210356458
    • 妥协 soundness (false negatives 可能漏报)
    • 妥协 completeness (false positives 可能误报) (大多数情况的分析,因为 soundness 很重要)

2. Intermediate Representation 中间表示

  • Compiler image-20200820001723221
  • AST vs. IR
    • AST
      • high-level and closed to grammar structure
      • usually language dependent
      • suitable for fast type checking
      • lack of control flow information
    • IR
      • low-level and closed to machine code
      • usually language independent
      • compact and uniform
      • contains control flow information
      • usually considered as the basis for static analysis
    • 所以 IR 更适合静态分析
    • IR 没有固定的定义,常用三地址码
  • 3-Address Code 三地址码 (3AC)
    • 定义: Each 3AC contains at most 3 addresses (name, constant, temporary)
    • 形式
      • x = y bop z
      • x = uop y
      • x = y
      • goto L
      • if x goto L
      • if x rop y goto L
  • Soot (static analysis framework for Java) & Jimple (Soot's IR)
    • Do-While Loop image-20200820003518273 image-20200820003531200
    • Method Call image-20200820003548997 image-20200820003622923 image-20200820004326286
      • JVM调用指令
        • invokespecial: call constructor 构造函数, superclass methods, private methods
        • invokevirutal: instance methods call (virtual dispatch)
        • invokeinterface: 类似于 invokevirtual, 区别: cannot optimization, check interface implementation
        • invokestatic: call static methods
        • invokedynamic: Java 是静态语言,方便其它动态语言在JVM上跑
      • 尖括号内部是 method signature, 包括 class name, return type, method name, parameter types
    • Class image-20200820004459524image-20200820004504772
      • clinit: 静态初始化
  • Static Single Assignment (SSA)

Control Flow Graphs (CFG)

  • 构造
    • 点: basic blocks (BB) image-20200820010335630
      • There is a conditional or unconditional jump from the end of A to the beginning of B 跳转
      • B immediately follows A in the original order of instructions and A does not end in an unconditional jump 相邻且不是直接 goto
    • goto 指令 改成 goto basic block
    • 加上 Entry 和 Exit

3&4. Data Flow Analysis —Applications

  • Data Flow Analysis: How Data Flows on CFG? image-20200820203959779
  • Data-flow analysis is to find a solution to a set of safe-approximation directed constraints on the IN[s]’s and OUT[s]’s, for all statements s.
  • Input and Output States image-20200820204224457
  • In each data-flow analysis application, we associate with every program point a data-flow value that represents an abstraction of the set of all possible program states that can be observed for that point.
  • constraints
    • Transfer Function Constraints: 根据 semantics of statements image-20200820204659406
    • Control Flow’s Constraints image-20200820205145158
  • Dataflow Analysis 不涉及
    • Method Calls 方法调用
      • 只处理 Intra-procedural CFG
      • Inter-procedural Analysis (第七讲) 再学
    • Aliases 别名
      • 不存在两个变量指向内存中同一个地址
      • 指针分析再学

Reaching Definitions 到达定值

  • 定义:A definition d at program point p reaches a point q if there is a path from p to q such that d is not “killed” along that path image-20200820210533180
  • 应用:检测未初始化变量:introduce a dummy definition for each variable v at the entry of CFG, and if the dummy definition of v reaches a point p where v is used, then v may be used before definition (as undefined reaches v)
  • 特点:may analysis / over-approximation, forward
  • D: v = x op y This statement “generates” a definition D of variable v and “kills” all the other definitions in the program that define variable v, while leaving the remaining incoming definitions unaffected.
  • gen = {d | 生成且没被后面的语句 kill}, kill = {程序中其它对 v 的定值}
  • 算法image-20200820211248220
  • 为什么会终止:OUT[S] never shrinks 且状态是有限的(到达不动点)

Live Variables 活跃变量

  • 定义:Live variables analysis tells whether the value of variable v at program point p could be used along some path in CFG starting at p. If so, v is live at p; otherwise, v is dead at p.image-20200820234556982
  • 应用:替换寄存器时,找到 dead value
  • 特点:may analysis / over-approximation, backward
  • use:先use(可能后def),def:先def(可能后use)
  • 算法:image-20200821003347915

Available Expressions 可用表达式

  • 定义:An expression x op y is available at program point p if
      1. all paths from the entry to p must pass through the evaluation of x op y, and
      1. after the last evaluation of x op y, there is no redefinition of x or y
  • 应用:找公共子表达式
  • 特点:must analysis / under-approximation, forward
  • gen:基本块求值 x op y,且之后没有对 x 或 y 赋值,kill:基本块对 x 或 y 赋值,且没有重新计算 x op y(如果重新计算,即使值不一样,也属于 gen)
  • 算法image-20200821005738463 image-20200821012643222image-20200821012746128

5&6. Data Flow Analysis - Foundations

  1. Iterative Algorithm, Another View
    1. 抽象:\(F: V^k \to V^k\),其中\(V^k=(V_1 × V_2 … × V_k)=(OUT[n_1], OUT[n_2], …, OUT[n_k])\)
    2. X is a fixed point of function F if \(X = F(X)\)
  2. 问题
    1. 一定会有不动点吗?
    2. 只有一个不动点吗?如果不是,我们求出来的是最好的吗?
    3. 最多多少步能到达不动点?
  3. upper / lower bound
    1. Given a poset (P, ⊑) and its subset S that S ⊆ P, we say that u ∈ P is an upper bound of S, if ∀x ∈ S, x ⊑ u.
    2. l ∈ P is an lower bound of S, if ∀x ∈ S, l ⊑ x.
    3. least upper bound (lub, or join) of S, written ⊔S: if for every upper bound of S, say u, ⊔S ⊑ u
    4. greatest lower bound (glb, or meet) of S, written ⊓S: if for every lower bound of S, say l, l ⊑ ⊓S.
    5. lub 和 glb 不一定存在,但如果存在则唯一
  4. lattice 格
    1. lattice: Given a poset (P, ⊑), ∀a, b ∈ P, if a ⊔ b and a ⊓ b exist, then (P, ⊑) is called a lattice
      1. poset + every pair of its elements has lub and glb
    2. complete lattice: All subsets of a lattice have lub and glb
      1. Every complete lattice (P, ⊑) has a greatest element \(\top\)= ⊔P called top and a least element ⊥ = ⊓P called bottom
    3. Product Lattice: \(L^n = (P, ⊑)\) that is defined by:
      1. P = P1 ×… × Pn
      2. (x1, …, xn) ⊑ (y1, …, yn)⟺(x1 ⊑ y1) ∧…∧ (xn ⊑ yn)
      3. (x1, …, xn) ⊔ (y1, …, yn) = (x1 ⊔1 y1, …, xn ⊔n yn)
      4. (x1, …, xn) ⊓ (y1, …, yn) = (x1 ⊓1 y1, …, xn⊓n yn)
  5. Data Flow Analysis Framework via Lattice
    1. A data flow analysis framework (D, L, F) consists of:
      1. D: a direction of data flow: forwards or backwards
      2. L: a lattice including domain of the values V and a meet ⊓ or join ⊔ operator
      3. F: a family of transfer functions from V to V
    2. Data flow analysis can be seen as iteratively applying transfer functions and meet/join operations on the values of a lattice
  6. Monotonicity 单调性
    1. A function f: L → L (L is a lattice) is monotonic if ∀x, y ∈ L, x ⊑ y⟹f(x) ⊑ f(y)
  7. Fixed Point Theorem
    1. Given a complete lattice (L, ⊑), if (1)f: L → L is monotonic and (2) L is finite, then the least fixed point of f can be found by iterating f(⊥), f(f(⊥)), …, f^k(⊥) until a fixed point is reached, the greatest fixed point of f can be found by iterating f(\(\top\)), f(f(\(\top\))), …, f^k(\(\top\)) until a fixed point is reached.
  8. Relate Iterative Algorithm to Fixed Point Theorem
    1. 每个node(BB)关联一个lattice,因此k-tuple对应product lattice,F对应f
    2. 执行的有限性:Assume the lattice height is h and the number of nodes in CFG is k. We need at most i = h×k iterations.
    3. F的单调性
  9. May/Must Analysis, A Lattice Viewimage-20200821221521013
  10. MOP and Distributivity
  11. Meet-Over-All-Paths Solution (MOP)image-20200821222601911
  12. MOP比Iterative Algorithm更准,当满足分配律(distributive)时一样
  13. Constant Propagation (不distributive)
    1. 定义:Given a variable x at program point p, determine whether x is guaranteed to hold a constant value at p.
    2. forward must analysis
    3. Lattice:image-20200821224036118
    4. Transfer Function:image-20200821224124039
  14. Worklist Algorithm: Iterative Algorithm的优化
    1. Iterative Algorithmimage-20200820211248220
    2. Worklist Algorithmimage-20200821224714562

7. Interprocedural Analysis

Call Graph Construction (CHA)

  • Call Graph: 函数调用关系
  • 构造调用图的方法:Class hierarchy analysis (CHA), Pointer analysis (k-CFA)
  • Java 方法调用类型image-20200822014440110
  • Method Dispatch of Virtual Calls
    • During run time, a virtual call is resolved based on
      • type of the receiver object (pointed by o): 𝒄
      • method signature at the call site: 𝒎
    • signature 方法签名image-20200822014611042
    • Dispatch(c, m)image-20200822014750835
  • Class Hierarchy 层次结构 Analysis (CHA)
    • 用CHA解目标方法(单个调用点)image-20200822020205229
    • 例子image-20200822020927029
  • CHA的特点:快、不准
  • 用CHA解构造Call Graph
    • 算法:从main开始,对可达方法m的每个call site cs解目标方法(Resolve(cs))image-20200822021406611
    • 例子image-20200822023512257

Interprocedural Data Flow Analysis 过程间数据流分析

  • Interprocedural Control Flow Graph 过程间控制流图:An ICFG of a program consists of CFGs of the methods in the program, plus two kinds of additional edges (ICFG = CFGs + call & return edges)
    • Call edges : from call sites to the entry nodes of their callees
    • Return edges : from return statements of the callees to the statements following their call sites (i.e., return sites
    • 例子image-20200822023926631
  • image-20200822024124626
    • Node transfer : same as intra procedural constant propagation, plus that
      • For each call node, kill data flow value for the LHS variable. Its value will flow to return site along the return edges
    • Edge transfer
      • Call edge transfer : transfer data flow from call node to the entry node of callee (along call edges) (传参数)
      • Return edge transfer : transfer data flow from return node of the callee to the return site (along return edges) (传返回值)
  • 例子 Interprocedural Constant Propagationimage-20200822024950646

8. Pointer Analysis

  • Pointer analysis 指针分析
    • Computes which memory locations / objects a pointer (variable or field) can point to
    • over approximation: a pointer may point to which objects?
    • 例子image-20200822031215627
  • 指针分析的用途:
    • Fundamental information
      • Call graph, aliases (can two pointers point to the same object?), …
    • Compiler optimization
      • Virtual call inlining, ...
    • Bug detection
      • Null pointer detection, …
    • Security analysis
      • Information flow analysis, …
  • 指针分析的要素
    • Heap abstraction 堆抽象: How to model heap memory?
      • √ Allocation-site 对每个创建点(new)做抽象(循环只算一个)
      • Storeless
    • Context sensitivity 上下文敏感 How to model calling contexts?
      • Context sensitive 区分不同上下文
      • √ Context insensitive
      • 先学上下文不敏感,再学敏感
    • Flow sensitivity 控制流敏感 How to model control flow?
      • Flow sensitive 执行顺序相关
      • √ Flow insensitive 把程序当作无序语句的集合,只维护一份结果
    • Analysis scope 分析范围 Which parts of program should be analyzed?
      • √ Whole program 全程序
      • Demand driven 需求驱动
  • Java 中的指针
    • √ Local variable: x
    • Static field: C.f (当作 global variable)
    • √ Instance field: x.f
    • Array element: array[i] 忽略下标
  • 指针分析只关注 pointer affecting statements
    • New: x = new T()
    • Assign: x = y
    • Store: x.f = y
    • Load: y = x.f
    • Call: r = x.k(a, ...)
      • Static call: C.foo()
      • Special call: super.foo()/x.<init>()/this.privateFoo()
      • √ Virtual call: x.foo() (只关注最难的一种)

9&10. Pointer Analysis - Foundations

  • 先关注 pointer affecting statements 的前四种,之后再引入方法调用
  • Domains and Notations image-20200822034437058 \(O\)也是创建点的集合
  • Rulesimage-20200822035406048
  • Pointer Flow Graph (PFG)
    • Pointer flow graph of a program is a directed graph that expresses how objects flow among the pointers in the program.
    • Nodes: Pointer = V ⋃ (O ×F)
    • Edges: Pointer × Pointer (x -> y means x may flow to y)
    • 例子image-20200822155955342
    • 建图和传播互相依赖
  • Pointer Analysis: Algorithmsimage-20200822160311900
    • worklist: 𝑊𝐿 ⊆ <Pointer, 𝒫(O)>∗
      • Each worklist entry <𝑛, 𝑝𝑡𝑠> is a pair of pointer 𝑛 and points to set 𝑝𝑡𝑠, which means that 𝑝𝑡𝑠 should be propagated to 𝑝𝑡(𝑛)
  • 加入method call
    • 过程间分析需要call graph
    • 用指针分析来建call graph
    • 故一边做指针分析一边建call graph(互相依赖)
    • ruleimage-20200822162318898
      • \(l \to m\in CG\)
      • 四件事情:dispatch、传this、传参数、传返回值
      • 为什么x到this不建边:容易出现假的指向关系
    • 算法image-20200822163135747

11&12. Pointer Analysis - Context Sensitivity

  • Context Sensitivity (C.S.) = Context Insensitivity (C.I.) + calling contexts (call-site的序列 在哪个点被调用)
  • Cloning-Based Context Sensitivity
    • each method is qualified by one or more contexts
    • variables are also qualified by contexts (inherited from the method they are declared in)
  • Context-Sensitive Heap
    • 对象分配在堆区,所以给堆加上下文
    • image-20200822183827146
    • 例子image-20200822185341778
  • Domains and Notationsimage-20200822185933099
    • c: list of call site
  • rulesimage-20200822200631899
    • call: dispatch、选择上下文
    • select() 例子image-20200822194450176
  • Context Sensitive Pointer Analysisimage-20200822195536862
    • Pointer Flow Graph with C.S.
      • Nodes 带上下文
    • 算法image-20200822201056978
  • Context Sensitivity Variants 类型
    • 即 select 函数的算法 \(Select(𝑐, 𝑙, 𝑐′: 𝑜_𝑖)\)
      • c: caller context
      • l: call site
      • c': heap context
      • oi: receiver object
    • k-Limiting Context Abstraction:限制上下文长度<=k
      • e.g., k=2 for method context, k=1 for heap contexts
    • Call site sensitivity (call string sensitivity, or k-CFA)
      • \(Select(c,l,\_)=[𝑙′,…,𝑙′′,𝑙]\) where \(𝑐=[𝑙′,…,𝑙′′]\)
      • 例子image-20200822202451428
      • k-Call Site Sensitivity / k-CFA
        • 1-call site/1-CFA: \(Select(\_,l,\_) =[𝑙]\)
        • 2-call site/2-CFA: \(Select(c,l,\_)=[𝑙′′,𝑙]\) where \(𝑐=[𝑙′,𝑙′′]\)
    • Object sensitivity
      • \(Select(\_,\_,c':o_i)=[𝑜_𝑗,…,𝑜_𝑘,𝑜_𝑖]\) where \(𝑐′=[𝑜_𝑗,…,𝑜_𝑘]\)
    • Type sensitivity
      • \(Select(\_,\_,c':o_i=[𝑡′,…,𝑡′′,InType(𝑜_𝑖)]\) where \(𝑐′=[𝑡′,…,𝑡]\), InType(o)=o的创建点所在类的类型(不是o的类型)
    • 总结
      • Precision: object > type > call site
      • Efficiency: type > object > call site

13. Static Analysis for Security

  • 目标:在有坏人的情况下保证运行、数据安全
  • 两种技术
    • 访问控制:限制权限,无法保证获得权限之后
    • Information Flow Security 信息流安全 (end-to-end):跟踪数据流动,关心数据的传播
      • Information Flow 信息流:如果变量x中的信息传递到变量y,则有信息流x->y
      • Security Levels / Classes:给变量分级
        • 分级方法举例:H (high security, secret), L (low security, public)
      • Noninterference policy 非干涉策略
        • H不能影响(流向)L
  • 两种性质
    • Confidentiality 保密性:防止秘密信息泄露 (H --×-> L, read protection)
    • Integrity 完整性:防止不可信的信息污染秘密信息 (L --×-> H, write protection)
      • 注入错误/攻击
      • 包括 Correctness 准确性、Completeness 完全性、Consistency 一致性
    • 对偶关系
  • 两种信息流动方式
    • Explicit Flows 显式流: 通过 direct copy 流动信息
    • implicit flows 隐式流: 通过副作用泄露信息
      • Covert/Hidden Channels 隐藏信道:一个机制本来不是用来传递信息的,但是把信息传递出去了
        • channel:传递信息的机制
        • image-20200822215720739image-20200822214809808
    • 与显式流相比,隐藏信道泄露的信息有限
  • Taint Analysis 污点分析
    • 污点分析关心污点数据从source会不会流到特定的位置 (sinks)image-20200822220225360
    • 应用:
      • Confidentiality
        • Source: source of secret data
        • Sink: leakage
        • Information leaks
      • Integrity
        • Source: source of untrusted data
        • Sink: critical computation
        • Injection errors
    • 借助指针分析实现污点分析
      • Treats tainted data as (artificial) objects
      • Treats sources as allocation sites (of tainted data)
      • Domains and Notationsimage-20200822220811597
      • Output: 𝑇𝑎𝑖𝑛𝑡𝐹𝑙𝑜𝑤𝑠: a set of pairs of tainted data and sink methods
      • Rules: (其余规则同指针分析)image-20200822221200716
      • 例子image-20200822221801362

14. Datalog-Based Program Analysis

  • Motivation
    • Imperative 命令式语言: how to do (~implementation)
    • Declarative 声明式语言: what to do (~specification)
  • Datalog = Data + Logic
    • a declarative logic programming language that is a subset of Prolog
    • Predicates 谓词 / Data: a set of statements (table of data)
    • Fact 事实: 在 table 里的
    • Atoms 原子: P(X1, X2, ..., Xn)
      • P: name of predicate
      • Xi: arguments (terms),包括变量和常量
      • 例子:Age(person,18)
      • 关系原子 P(X1,X2,…, Xn) 为真: tuple X1,X2,…, Xn 在 P 中
      • 算术原子,如 age >= 18
    • Rules / Logic: H <- B1, B2, …, Bn ; C1, ..., !Cm. 表示如果body为true,那么head为true
      • H: Head, is an atom
      • B: Body, Bi (an atom) is called subgoal
      • 例子:Adult(person) <- Age(person,age), age >= 18.
      • ,表示与
      • ;表示或
      • !表示非
    • Rule Interpretation 解释/推导: 枚举
    • Datalog program = Facts + Rules
    • EDB and IDB Predicates
      • EDB (extensional data base) 运行前给出的,输入,不可变
      • IDB (intensional data base) 根据rules推导出的,输出
      • image-20200822231015020
    • Recursion: 自己推导自己
      • 例子:Reach(from, to) <- Edge(from, to); Reach(from, node), Edge(node, to).
    • Rule Safety:不能处理 unsafe 的无限的情况,不允许递归和取反同时出现
    • 特征
      • Monotonicity 单调性:只能产生不能删除
      • Termination 终止性:单调+有限
  • Pointer Analysis via Datalog
    • 建模image-20200822231647047
    • 例子image-20200822231821174
    • Rules (flow insensitive)image-20200822231846093image-20200822233427784image-20200822233450012image-20200822233616750image-20200822233741838image-20200822233908727
  • Taint Analysis via Datalog
    • 建模
      • EDB predicates
        • Source(m : M ) // source methods
        • Sink(m : M ) // sink methods
        • Taint(l : S , t : T ) // associates each call site to the tainted data from the call site
      • IDB predicate
        • TaintFlow (t : T , m : M ) // detected taint flows, tainted data 𝑡 may flow to sink method 𝑚
    • rulesimage-20200822234516384
  • 总结
    • 优点:可读性强、容易实现、engine已有优化
    • 缺点:表达能力弱、不能完全控制性能(engine是黑盒)

15. CFL-Reachability and IFDS

  • Infeasible Paths: Paths in CFG that do not correspond to actual executions,我们希望尽量避免,但是静态分析很难判断
  • Realizable Paths: The paths in which “returns” are matched with corresponding “calls” 调用和返回要匹配上
    • 例子image-20200823001257952
    • Our goal is to recognize realizable paths so that we could avoid polluting analysis results along unrealizable paths.
    • 方法:括号匹配
    • 系统方法:CFL-Reachability

CFL-Reachability

  • CFL-Reachability: A path is considered to connect two nodes A and B, or B is reachable from A, only if the concatenation of the labels on the edges of the path is a word in a specified context-free language.
  • Partially Balanced-Parenthesis Problem via CFL
    • Every right parenthesis “)i” is balanced by a preceding left parenthesis “(i”, but the converse needs not hold(括号序列的前缀)
    • For each call site i, label its call edge “(i” and return edge “)i”
    • Label all other edges with symbol “e”
  • A path is a realizable path iff the path’ word is in the language L(realizable)image-20200823002740228

IFDS

  • 程序分析->图可达性问题
  • Meet-Over-All-Realizable-Paths (MRP)
    • image-20200823004614738
  • IFDS (Interprocedural, Finite, Distributive, Subset Problem)
    • 前提条件:IFDS is for interprocedural data flow analysis with distributive flow functions over finite domains.
    • 过程:Given a program P, and a dataflow-analysis problem Q
      • Build a supergraph G* for P and define flow functions for edges in G* based on Q
      • Build exploded supergraph G# for P by transforming flow functions to representation relations (graphs)
      • Q can be solved as graph reachability problems (find out MRP solutions) via applying Tabulation algorithm on G#
    • Let n be a program point, data fact d ∈ MRP_n, iff there is a realizable path in G# from <s_main, 0> to <n, d>. (then d’s white circle turns to blue)
  • supergraph G* = (N*, E*)
    • 包括一些flow graphs Gi
    • 每个Gi加上start node si和exit node ei
    • 过程调用表示为Call_j和Ret_j
    • 例子image-20200823010859688
    • G*上的intraprocedural 边
      • call-to-return-site edge from Call_p to Ret_p
      • call-to-start edge from Call_p to s_p of the called procedure
      • exit-to-return-site edge from e_p of the called procedure to Ret_p
      • 例子(紫色、绿色、蓝色)image-20200823011128488
  • Design Flow Functions
    • 问题“Possibly-uninitialized variables”: for each node n ∈ N*, determine the set of variables that may be uninitialized before execution reaches n. 可能没有被初始化的变量
    • 例子image-20200823012605743
  • Build Exploded Supergraph
    • Build exploded supergraph G# for a program by transforming flow functions to representation relations (graphs)
    • Each flow function can be represented as a graph with 2(D+1) nodes (at most (D+1)^2 edges), where D is a finite set of dataflow facts
    • image-20200823014338851
  • whether data fact a holds at program point p
    • traditional: if a is in OUT[n4] after the analysis finishes
    • IFDS: if there is a realizable path from <s_main, 0> to <n4,a>
    • image-20200823015717043
  • Tabulation Algorithm
  • Understanding the Distributivity of IFDS
    • distributivity: F(x ∧ y) = F(x) ∧ F(y)
    • Given a statement S, besides S itself, if we need to consider multiple input data facts to create correct outputs, then the analysis is not distributive and should not be expressed in IFDS. In IFDS, each data fact (circle) and its propagation (edges) could be handled independently, and doing so will not affect the correctness of the final results.
      • constant propagation不行:存在z=x+y
      • pointer analysis不行:无法分析别名

16.Soundiness

  • Soundness: Conservative 保守 approximation: the analysis captures all program behaviors, or the analysis result models all possible executions of the program
  • Hard-to-analyze features: an aggressively conservative treatment to these features will likely make the analysis too imprecise to scale, rendering the analysis useless
  • Soundiness: A soundy analysis typically means that the analysis is mostly sound, with well-identified unsound treatments to hard/specific language features 把不sound的部分讲一下怎么做的
  • Soundness, Soundiness and Unsoundness
    • A sound analysis requires to capture all dynamic behaviors
    • A soundy analysis aims to capture all dynamic behaviors with certain hard language features unsoundly handled within reason 合理
    • An unsound analysis deliberately ignores 故意忽略 certain behaviors in its design for better efficiency, precision or accessibility

Hard Language Feature

  • Java Reflection
    • image-20200823025537612
  • Native Code
    • Java Native Interface (JNI): A function module of JVM which allows interoperation between Java and Native code (C/C++)
      • native code: platform dependent features, existing libraries
    • 解决方法: Manually models the critical native code