ART中的IR优化

发布于 2020-03-23  37 次阅读


以Android7.0 源码树为例

除去对特定CPU架构的优化,ART为SSA形式化后的CFG设计了13种不同的优化方法。(有的优化方法被执行了不止一次)

IR优化的入口函数
art\compiler\optimizing\optimizing_compiler.cc -> RunOptimizations

static void RunOptimizations(HGraph* graph,
                             CodeGenerator* codegen,
                             CompilerDriver* driver,
                             OptimizingCompilerStats* stats,
                             const DexCompilationUnit& dex_compilation_unit,
                             PassObserver* pass_observer,
                             StackHandleScopeCollection* handles) {
  ArenaAllocator* arena = graph->GetArena();
  HDeadCodeElimination* dce1 = new (arena) HDeadCodeElimination(
      graph, stats, HDeadCodeElimination::kInitialDeadCodeEliminationPassName);
  HDeadCodeElimination* dce2 = new (arena) HDeadCodeElimination(
      graph, stats, HDeadCodeElimination::kFinalDeadCodeEliminationPassName);
  HConstantFolding* fold1 = new (arena) HConstantFolding(graph);
  InstructionSimplifier* simplify1 = new (arena) InstructionSimplifier(graph, stats);
  HSelectGenerator* select_generator = new (arena) HSelectGenerator(graph, stats);
  HConstantFolding* fold2 = new (arena) HConstantFolding(graph, "constant_folding_after_inlining");
  HConstantFolding* fold3 = new (arena) HConstantFolding(graph, "constant_folding_after_bce");
  SideEffectsAnalysis* side_effects = new (arena) SideEffectsAnalysis(graph);
  GVNOptimization* gvn = new (arena) GVNOptimization(graph, *side_effects);
  LICM* licm = new (arena) LICM(graph, *side_effects, stats);
  LoadStoreElimination* lse = new (arena) LoadStoreElimination(graph, *side_effects);
  HInductionVarAnalysis* induction = new (arena) HInductionVarAnalysis(graph);
  BoundsCheckElimination* bce = new (arena) BoundsCheckElimination(graph, *side_effects, induction);
  HSharpening* sharpening = new (arena) HSharpening(graph, codegen, dex_compilation_unit, driver);
  InstructionSimplifier* simplify2 = new (arena) InstructionSimplifier(
      graph, stats, "instruction_simplifier_after_bce");
  InstructionSimplifier* simplify3 = new (arena) InstructionSimplifier(
      graph, stats, "instruction_simplifier_before_codegen");
  IntrinsicsRecognizer* intrinsics = new (arena) IntrinsicsRecognizer(graph, driver, stats);

  HOptimization* optimizations1[] = {//第一轮优化,总共5遍
    intrinsics,
    sharpening,
    fold1,
    simplify1,
    dce1,
  };
  RunOptimizations(optimizations1, arraysize(optimizations1), pass_observer);

  MaybeRunInliner(graph, codegen, driver, stats, dex_compilation_unit, pass_observer, handles);

  HOptimization* optimizations2[] = {
//第二轮优化,共12遍
    // SelectGenerator depends on the InstructionSimplifier removing
    // redundant suspend checks to recognize empty blocks.
    select_generator,
    fold2,  // TODO: if we don't inline we can also skip fold2.
    side_effects,
    gvn,
    licm,
    induction,
    bce,
    fold3,  // evaluates code generated by dynamic bce
    simplify2,
    lse,
    dce2,
    // The codegen has a few assumptions that only the instruction simplifier
    // can satisfy. For example, the code generator does not expect to see a
    // HTypeConversion from a type to the same type.
    simplify3,
  };
  RunOptimizations(optimizations2, arraysize(optimizations2), pass_observer);

  RunArchOptimizations(driver->GetInstructionSet(), graph, codegen, stats, pass_observer);
  AllocateRegisters(graph, codegen, pass_observer);
}

下面举几个优化的例子

IntrinsicsRecognizer

art\compiler\optimizing\intrinsics.cc => IntrinsicsRecognizer::Run()

IntrinsicsRecognizer识别基本块中可以被intrinsic化的函数调用指令,然后把intrinsic信息设置到该指令中。在后续翻译为机器指令的时候可以实现intrinsic优化。

DexFileMethodInliner::kIntrinsicMethods数组中定义了ART中支持的Intrinsics方法和Inline方法。
art\compiler\dex\quick\dex_file_method_inliner.cc

const DexFileMethodInliner::IntrinsicDef DexFileMethodInliner::kIntrinsicMethods[] = {
#define INTRINSIC(c, n, p, o, d) \
    { { kClassCache ## c, kNameCache ## n, kProtoCache ## p }, { o, kInlineIntrinsic, { d } } }

    INTRINSIC(JavaLangDouble, DoubleToRawLongBits, D_J, kIntrinsicDoubleCvt, 0),
    INTRINSIC(JavaLangDouble, LongBitsToDouble, J_D, kIntrinsicDoubleCvt, kIntrinsicFlagToFloatingPoint),
 ...
    INTRINSIC(JavaLangInteger, ReverseBytes, I_I, kIntrinsicReverseBytes, k32),
    INTRINSIC(JavaLangLong, ReverseBytes, J_J, kIntrinsicReverseBytes, k64),
    INTRINSIC(JavaLangShort, ReverseBytes, S_S, kIntrinsicReverseBytes, kSignedHalf),
    INTRINSIC(JavaLangInteger, Reverse, I_I, kIntrinsicReverseBits, k32),
    INTRINSIC(JavaLangLong, Reverse, J_J, kIntrinsicReverseBits, k64),

    INTRINSIC(JavaLangInteger, BitCount, I_I, kIntrinsicBitCount, k32),
    INTRINSIC(JavaLangLong, BitCount, J_I, kIntrinsicBitCount, k64),
    INTRINSIC(JavaLangInteger, Compare, II_I, kIntrinsicCompare, k32),
    INTRINSIC(JavaLangLong, Compare, JJ_I, kIntrinsicCompare, k64),
    INTRINSIC(JavaLangInteger, HighestOneBit, I_I, kIntrinsicHighestOneBit, k32),
    INTRINSIC(JavaLangLong, HighestOneBit, J_J, kIntrinsicHighestOneBit, k64),
    INTRINSIC(JavaLangInteger, LowestOneBit, I_I, kIntrinsicLowestOneBit, k32),
    INTRINSIC(JavaLangLong, LowestOneBit, J_J, kIntrinsicLowestOneBit, k64),
    INTRINSIC(JavaLangInteger, NumberOfLeadingZeros, I_I, kIntrinsicNumberOfLeadingZeros, k32),
    INTRINSIC(JavaLangLong, NumberOfLeadingZeros, J_I, kIntrinsicNumberOfLeadingZeros, k64),
 ...

#define UNSAFE_GET_PUT(type, code, type_flags) \
    INTRINSIC(SunMiscUnsafe, Get ## type, ObjectJ_ ## code, kIntrinsicUnsafeGet, \
              type_flags), \
    INTRINSIC(SunMiscUnsafe, Get ## type ## Volatile, ObjectJ_ ## code, kIntrinsicUnsafeGet, \
              type_flags | kIntrinsicFlagIsVolatile), \
    INTRINSIC(SunMiscUnsafe, Put ## type, ObjectJ ## code ## _V, kIntrinsicUnsafePut, \
              type_flags), \
    INTRINSIC(SunMiscUnsafe, Put ## type ## Volatile, ObjectJ ## code ## _V, kIntrinsicUnsafePut, \
              type_flags | kIntrinsicFlagIsVolatile), \
    INTRINSIC(SunMiscUnsafe, PutOrdered ## type, ObjectJ ## code ## _V, kIntrinsicUnsafePut, \
              type_flags | kIntrinsicFlagIsOrdered)

    UNSAFE_GET_PUT(Int, I, kIntrinsicFlagNone),
    UNSAFE_GET_PUT(Long, J, kIntrinsicFlagIsLong),
    UNSAFE_GET_PUT(Object, Object, kIntrinsicFlagIsObject),
#undef UNSAFE_GET_PUT

    // 1.8
    INTRINSIC(SunMiscUnsafe, GetAndAddInt, ObjectJI_I, kIntrinsicUnsafeGetAndAddInt, 0),
    INTRINSIC(SunMiscUnsafe, GetAndAddLong, ObjectJJ_J, kIntrinsicUnsafeGetAndAddLong, 0),
    INTRINSIC(SunMiscUnsafe, GetAndSetInt, ObjectJI_I, kIntrinsicUnsafeGetAndSetInt, 0),
    INTRINSIC(SunMiscUnsafe, GetAndSetLong, ObjectJJ_J, kIntrinsicUnsafeGetAndSetLong, 0),
    INTRINSIC(SunMiscUnsafe, GetAndSetObject, ObjectJObject_Object, kIntrinsicUnsafeGetAndSetObject, 0),
    INTRINSIC(SunMiscUnsafe, LoadFence, _V, kIntrinsicUnsafeLoadFence, 0),
    INTRINSIC(SunMiscUnsafe, StoreFence, _V, kIntrinsicUnsafeStoreFence, 0),
    INTRINSIC(SunMiscUnsafe, FullFence, _V, kIntrinsicUnsafeFullFence, 0),

    INTRINSIC(JavaLangSystem, ArrayCopy, CharArrayICharArrayII_V , kIntrinsicSystemArrayCopyCharArray,
              0),
    INTRINSIC(JavaLangSystem, ArrayCopy, ObjectIObjectII_V , kIntrinsicSystemArrayCopy,
              0),

    INTRINSIC(JavaLangInteger, RotateRight, II_I, kIntrinsicRotateRight, k32),
    INTRINSIC(JavaLangLong, RotateRight, JI_J, kIntrinsicRotateRight, k64),
    INTRINSIC(JavaLangInteger, RotateLeft, II_I, kIntrinsicRotateLeft, k32),
    INTRINSIC(JavaLangLong, RotateLeft, JI_J, kIntrinsicRotateLeft, k64),

#undef INTRINSIC

//SPECIAL宏用于定义Inline函数
#define SPECIAL(c, n, p, o, d) \
    { { kClassCache ## c, kNameCache ## n, kProtoCache ## p }, { o, kInlineSpecial, { d } } }

    SPECIAL(JavaLangString, Init, _V, kInlineStringInit, 0),
    SPECIAL(JavaLangString, Init, ByteArray_V, kInlineStringInit, 1),
    SPECIAL(JavaLangString, Init, ByteArrayI_V, kInlineStringInit, 2),
    SPECIAL(JavaLangString, Init, ByteArrayII_V, kInlineStringInit, 3),
    SPECIAL(JavaLangString, Init, ByteArrayIII_V, kInlineStringInit, 4),
    SPECIAL(JavaLangString, Init, ByteArrayIIString_V, kInlineStringInit, 5),
...
#undef SPECIAL
};

这其中,
Intrinsics方法支持了

  • java.lang包中的Float,Double,Integer,Math等类的很多函数
  • sun.misc.unSafe类中的一些函数

Inline函数支持了

  • java.lang.String的构造函数(被标记为kInlineSpecial)
  • 部分用户定义的方法

HConstantFolding & InstrutionSimplifier

HConstantFolding 包括常量折叠和常量传播的优化。
InstrutionSimplifier暂略

HDeadCodeElimination

HDeadCodeElimination用于消除CFG中无效的基本块和无用的HInstruction
art\compiler\optimizing\dead_code_elimination.cc

HSelectGenerator

select有些类似c语言的?:三目运算符。
例如下面的程序

public class OatSelectTest{
    int x=0, y=0, z=0, w=0;
    public void test() {
        int ret = 666;
        if( x<0 ){
            ret=-ret;
        }else{
            ret+=123;
        }
        x=ret;
    }
}

经过select优化后会变成

public class OatSelectTest{
    int x=0, y=0, z=0, w=0;
    public void test() {
        int ret = 666;
        ret1 = -666;
        ret2 = 789;
        x=select(ret1,ret2,x<0?);
    }
}

而在机器指令生成层面,以x86为例,可以省略一次jmp指令跳转,且可以借助x86的cmov(conditional move)指令来简化。

SideEffectsAnalysis

SideEffectsAnalysis其实不是一种优化,它只是收集合并CFG中各个基本块包含指令的SideEffects信息。SideEffects用于描述ART IR指令的副作用(SideEffects)。例如 HNewInstance(new一个对象)就有Trigger GC的副作用,即HNewInstance指令的执行可能会出发垃圾回收。

这也和后面所说的(见总结),ART编译完本地机器码后,程序的运行仍然不能脱离具体虚拟机的实现。
就好比此处的SideEffects,当然也有很多必要的特殊指令在ART编译dex为机器码的时候会添加上,使得得到的机器码在运行的过程中其实离不开虚拟机的管控和支持。
例如

  • Java虚拟机的垃圾回收器做对象标记前,往往会设置一个标志,表示自己要做对象标记了。
  • 其他线程运行的时候要经常对这个标志进行检查,发现这个标志有效时,这些线程就需要等待,让垃圾回收器能安全地做对象标记。否则,GC一边做对象标记,线程又同时创建对象,修改引用,这会导致对象标记不准确,影响垃圾回收。
    这些检查标记地添加时编译器主动完成的。它会在两个地方添加标记检查指令,一个是在Entry基本块中,另一个是在Loop header基本块里。
    art\compiler\optimizing\instruction_builder.cc => HInstructionBuilder::Build
bool HInstructionBuilder::Build() {
  locals_for_.resize(graph_->GetBlocks().size(),
                     ArenaVector<HInstruction*>(arena_->Adapter(kArenaAllocGraphBuilder)));
...
  for (HReversePostOrderIterator block_it(*graph_); !block_it.Done(); block_it.Advance()) {
    current_block_ = block_it.Current();
    uint32_t block_dex_pc = current_block_->GetDexPc();

    InitializeBlockLocals();

    if (current_block_->IsEntryBlock()) {
      InitializeParameters();
      AppendInstruction(new (arena_) HSuspendCheck(0u));
      AppendInstruction(new (arena_) HGoto(0u));
      continue;
    } else if (current_block_->IsExitBlock()) {
...
    } else if (current_block_->IsLoopHeader()) {
      HSuspendCheck* suspend_check = new (arena_) HSuspendCheck(current_block_->GetDexPc());
...
    }
...
}

这其中就设置了HSuspendCheckHGoto标志。
SuspendCheck而言,每次循环都会执行SuspendCheck检查。因为如果一个线程在循环里执行时间过长,会导致对象标记不能开展,进而影响GC。

GVNOptimization

Global Value Numbering(全局值编号)优化。
值编号有全局和局部之分。

  • 局部(Local) VN是针对单个基本块内的值编号。
  • 全局(Global) VN可以跨基本块,但要求CFG先SSA形式化。在ART中,因为CFG已经SSA了,此处使用GVN

值编号的意思就是给表达式Ex设置一个编号,如果代码中有其他表达式Ey能计算出和Ex相同值的话,那么Ey的值可以用Ex替代。
VN的效果简而言之就是消除/替换冗余代码
例如

w = 3; // 这是一个表达式,由于值编号系统还没有这个表达式,所以这个表达式会被添加到值编号系统中。
x =  3; // 这也是一个表达式,但计算出来的值在值编号系统中存在,所以可以改写为 y= w

y = x + 4 // 这个表达式可以替换x为w, 得到 y = w + 4这样一个新表达式,并添加到值编号系统
z = w + 4 //这个表达式和上个(y=w+4)值一样,所以z=y

上述代码经过值编号之后, x=3, z=w+4可以被其他代码替换,这两个表达式可以消除。
实现值编号的时候,有两个问题

  • 如何给表达式编号? ART中,用ART IR(HInstruction对象)的hash code作为该IR的编号
  • 如何判断两个表达式等价? ART中,通过HInstructionEquals函数来判断。其实现包括对Constant的比较,对数据类型的比较,对各输入参数的个数与值的比较。
bool HInstruction::Equals(HInstruction* other) const {
//比较指令类型
  if (!InstructionTypeEquals(other)) return false;
  DCHECK_EQ(GetKind(), other->GetKind());
// 如果有值,则比较
  if (!InstructionDataEquals(other)) return false;
// 指令执行结果的数据类型的比较
  if (GetType() != other->GetType()) return false;
// 输入参数个数要一样
  if (InputCount() != other->InputCount()) return false;
// 比较各个输入参数
  for (size_t i = 0, e = InputCount(); i < e; ++i) {
    if (InputAt(i) != other->InputAt(i)) return false;
  }
  DCHECK_EQ(ComputeHashCode(), other->ComputeHashCode());
  return true;
}

GVN的具体实现在art\compiler\optimizing\gvn.cc

void GlobalValueNumberer::Run() {
  DCHECK(side_effects_.HasRun());
  sets_[graph_->GetEntryBlock()->GetBlockId()] = new (allocator_) ValueSet(allocator_);

  // Use the reverse post order to ensure the non back-edge predecessors of a block are
  // visited before the block itself.
  for (HReversePostOrderIterator it(*graph_); !it.Done(); it.Advance()) {
    VisitBasicBlock(it.Current());
  }
}

VisitBasicBlock函数进行关键的PRO遍历CFG,并进行值系统中的判断,消除操作。
我们可以在Lookup函数中看出之前的hash比较操作.

  // If in the set, returns an equivalent instruction to the given instruction.
  // Returns null otherwise.
  HInstruction* Lookup(HInstruction* instruction) const {
    size_t hash_code = HashCode(instruction);
    size_t index = BucketIndex(hash_code);

// buckets_[index] 链表保存了所有相同hash_code的HInstruction
    for (Node* node = buckets_[index]; node != nullptr; node = node->GetNext()) {
      if (node->GetHashCode() == hash_code) {
        HInstruction* existing = node->GetInstruction();
        if (existing->Equals(instruction)) {
          return existing;
        }
      }
    }
    return nullptr;
  }

LICM

Loop Invariant Code Motion. 即 循环不变量外提。顾名思义。

HInductionAnalysis

Induction Variable Analysis 归纳变量分析。
归纳变量可分为基本归纳变量和派生归纳变量。
通过一个例子来了解它们

    i = 0
L1: if i>n goto L2
    j = i + 4
    k = j + a
    i = i+ 1
    goto L1
L2: ......

上述代码有i,j,k三个变量。

  • i变量取值满足i=i+c这样的公式。其中c是循环不变量/表达式 (c的值不受循环影响)。此处c为.这样的变量称为基本归纳变量。基本归纳变量可以用一个三元组i=<i,1,0>表达,即i=i*1+0
  • 围绕基本归纳变量可以得到和它相关的派生变量。例如这里的j满足j=a*i+b(a,b也是循环不变量)公式,k也是如此。此处,j直接依赖于i,k通过j间接依赖i,所以j,k都是由i派生的派生归纳变量。派生归纳变量可以用三元组x=<i,a,b>表达,即x=a*i+b

识别归纳变量后,存在多种情况优化我们的代码

  • 强度削减优化。例如用加法替代减法。
  • 强度削减优化后,可以消除一些循环中的归纳变量。

例如对下面代码进行归纳变量分析优化

while (i<10){
    j=i*9+3;
    a[j]=a[j]+5;
    i+=3;
}

强度削减后(将乘法变为加法)

s=9*i+3;
while (i<10){
    j=s;
    a[j]=a[j]+5;
    i+=3;
    s+=27;
}

而后消除冗余的归纳变量

s=9*i+3;
while (s<93/*i<10*/){
    //j=s;
    a[s]=a[s]+5;
    //i+=3;
    s+=27;
}

其他优化方法

HBoundsCheckElimination

ART在翻译aget,aput(数组圆度读/取指令)为IR的HArrayGet或者HArraySet的时候,同时还会生成一条HBoundsCheck IR,用于检查是否索引越界。在一个循环中,这个检查开销很大。HBoundsCheckElimination可以消除这种不必要的边界检查。例如循环索引最大值本身不超过数组元素个数的时候,那么就能去掉HBoundsCheck.

LoadStoreElimination

例如某个地方已经获取了一个类成员的变量值,后续再通过HInstanceFieldGet IR获取同一个成员变量的时候就可以消除这些重复指令了。

总结

  • ART IR中的编号是以偶数倍递增的,这是为了方便在IR之间插入一些指令(检查或者标志),而后这些插入指令的编号就是奇数了。
  • 虽然ART把dex字节码编译为了本地机器码,但是其运行仍然不像其他例如c++编译为字节码直接在系统或者运行时库中运行。ART编译完本地机器码后,程序的运行仍然不能脱离具体虚拟机的实现。因为有一些检查标志需要虚拟机进行interpreter,且需要虚拟机的管控。(前文SideEffectsAnalysis的分析中有介绍)