Android 8.0 ART improvements

The Android runtime (ART) has been improved significantly in the Android 8.0 release. The list below summarizes enhancements device manufacturers can expect in ART.

Concurrent compacting garbage collector

As announced at Google I/O, ART features a new concurrent compacting garbage collector (GC) in Android 8.0. This collector compacts the heap every time GC runs and while the app is running, with only one short pause for processing thread roots. Here are its benefits:

GC always compacts the heap: 32% smaller heap sizes on average compared to Android 7.0.
Compaction enables thread local bump pointer object allocation: Allocations are 70% faster than in Android 7.0.
Offers 85% smaller pause times for the H2 benchmark compared to the Android 7.0 GC.
Pause times no longer scale with heap size; apps should be able to use large heaps without worrying about jank.
GC implementation detail - Read barriers:
- Read barriers are a small amount of work done for each object field read.
- These are optimized in the compiler, but might slow down some use cases.
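Because compaction leaves free space contiguous, allocation can reduce to advancing a pointer inside a thread-local region. The toy model below illustrates that idea in plain Java. It is a sketch only, not ART's allocator (which is native code); the class name BumpRegion, its methods, and the 8-byte alignment are assumptions made for illustration.

```java
// Toy model of bump-pointer allocation inside one thread-local region.
// Hypothetical illustration; names and alignment are assumptions, not ART code.
final class BumpRegion {
    private final int capacity; // region size in bytes
    private int top;            // offset of the next free byte

    BumpRegion(int capacity) {
        this.capacity = capacity;
    }

    // Returns the start offset of the new object, or -1 if the region is full.
    int allocate(int size) {
        int aligned = (size + 7) & ~7;  // round up to a multiple of 8
        if (aligned > capacity - top) {
            return -1;                  // a real allocator would fetch a new region
        }
        int start = top;
        top += aligned;                 // the "bump": a single addition
        return start;
    }

    int used() {
        return top;
    }

    public static void main(String[] args) {
        BumpRegion region = new BumpRegion(64);
        System.out.println(region.allocate(12)); // 0
        System.out.println(region.allocate(20)); // 16
        System.out.println(region.allocate(40)); // -1 (only 24 bytes left)
    }
}
```

No lock or free-list search is needed on the fast path, because each thread bumps a pointer in its own region.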

Loop optimizations

A wide variety of loop optimizations are employed by ART in the Android 8.0 release:

Bounds check eliminations
- Static: ranges are proven to be within bounds at compile-time
- Dynamic: run-time tests ensure loops stay within bounds (deopt otherwise)
Induction variable eliminations
- Remove dead induction
- Replace induction that is used only after the loop by closed-form expressions
Dead code elimination inside the loop body, removal of whole loops that become dead
Strength reduction
Loop transformations: reversal, interchanging, splitting, unrolling, unimodular, etc.
SIMDization (also called vectorization)

The loop optimizer resides in its own optimization pass in the ART compiler. Most loop optimizations are similar to optimizations and simplifications elsewhere. Challenges arise with some optimizations that rewrite the control flow graph (CFG) in an unusually elaborate way, because most CFG utilities (see nodes.h) focus on building a CFG, not rewriting one.
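The Java loops below show the kind of source patterns that some of the listed optimizations target. This is an illustrative sketch: the class and method names are hypothetical, and whether ART applies a given optimization to a given loop depends on the compiler's own analysis.

```java
// Source-level shapes that loop optimizations can target (hypothetical examples).
final class LoopExamples {
    // Static bounds check elimination candidate:
    // i is provably within [0, a.length) on every iteration.
    static int sum(int[] a) {
        int total = 0;
        for (int i = 0; i < a.length; i++) {
            total += a[i];
        }
        return total;
    }

    // Dynamic bounds check elimination candidate: n is unknown at compile time,
    // so one run-time test before the loop could stand in for per-iteration checks.
    static int sumFirst(int[] a, int n) {
        int total = 0;
        for (int i = 0; i < n; i++) {
            total += a[i];
        }
        return total;
    }

    // Induction variable k is used only after the loop.
    static int tripleByLoop(int n) {
        int k = 0;
        for (int i = 0; i < n; i++) {
            k += 3;
        }
        return k;
    }

    // Hand-written closed form with the same result as tripleByLoop.
    static int tripleClosedForm(int n) {
        return n > 0 ? 3 * n : 0;
    }

    public static void main(String[] args) {
        System.out.println(sum(new int[] {1, 2, 3}));         // 6
        System.out.println(sumFirst(new int[] {1, 2, 3}, 2)); // 3
        System.out.println(tripleByLoop(5));                  // 15
        System.out.println(tripleClosedForm(5));              // 15
    }
}
```

Once the induction variable is replaced by its closed form, the loop in tripleByLoop has no remaining effect and becomes a candidate for whole-loop removal.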

Class hierarchy analysis

ART in Android 8.0 uses Class Hierarchy Analysis (CHA), a compiler optimization that devirtualizes virtual calls into direct calls based on the information generated by analyzing class hierarchies. Virtual calls are expensive because they are implemented around a vtable lookup, which takes a couple of dependent loads. Virtual calls also cannot be inlined.

Here is a summary of related enhancements:

Dynamic single-implementation method status updating - At the end of class linking time, when the vtable has been populated, ART conducts an entry-by-entry comparison to the vtable of the superclass.
Compiler optimization - The compiler takes advantage of the single-implementation info of a method. If a method A.foo has the single-implementation flag set, the compiler devirtualizes the virtual call into a direct call and then tries to inline the direct call.
Compiled code invalidation - Also at the end of class linking time, when single-implementation info is updated, if method A.foo previously had a single implementation but that status is now invalidated, all compiled code that depends on the assumption that A.foo has a single implementation must be invalidated.
Deoptimization - For live compiled code that is on the stack, deoptimization is initiated to force the invalidated compiled code into interpreter mode to guarantee correctness. A new deoptimization mechanism, a hybrid of synchronous and asynchronous deoptimization, is used.
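A small Java example makes the single-implementation idea concrete. The classes below are hypothetical and the comments describe what a CHA-based compiler may do in principle; the example itself only demonstrates the language-level behavior that must be preserved.

```java
// Hypothetical classes illustrating the single-implementation assumption.
final class ChaExample {
    static class A {
        int foo() {
            return 1;
        }
    }

    // Until a class like B is linked, A.foo has a single implementation.
    static class B extends A {
        @Override
        int foo() {
            return 2;
        }
    }

    // Virtual call site. While A.foo has a single implementation, a compiler
    // using CHA may treat a.foo() as a direct call and inline it.
    static int call(A a) {
        return a.foo();
    }

    public static void main(String[] args) {
        System.out.println(call(new A())); // 1
        // Once B is linked, A.foo has two implementations, so code compiled
        // under the single-implementation assumption can no longer be used.
        System.out.println(call(new B())); // 2
    }
}
```

The second call must return 2 regardless of how the first was compiled, which is why invalidation and deoptimization are part of the design.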

Inline caches in .oat files

ART now employs inline caches and optimizes the call sites for which enough data exists. The inline caches feature records additional runtime information into profiles and uses it to add dynamic optimizations to ahead-of-time compilation.
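The sketch below models the idea in plain Java: a profiling step records which receiver classes reach a call site, and a specialized version guards on the observed class before taking an inlined fast path. All names here are hypothetical, and the code only mimics the shape of the optimization; it does not reflect ART's profile format or generated code.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a profiled call site and its specialized form (hypothetical names).
final class InlineCacheSketch {
    interface Shape {
        double area();
    }

    static final class Square implements Shape {
        final double side;

        Square(double side) {
            this.side = side;
        }

        public double area() {
            return side * side;
        }
    }

    static final class Circle implements Shape {
        final double radius;

        Circle(double radius) {
            this.radius = radius;
        }

        public double area() {
            return Math.PI * radius * radius;
        }
    }

    // Profile for one call site: how often each receiver class was seen.
    static final Map<Class<?>, Integer> receiverCounts = new HashMap<>();

    // Profiling step: record the receiver class, then dispatch normally.
    static double areaProfiled(Shape s) {
        receiverCounts.merge(s.getClass(), 1, Integer::sum);
        return s.area();
    }

    // Shape of the code a compiler could emit if the profile showed only Square.
    static double areaSpecialized(Shape s) {
        if (s.getClass() == Square.class) {
            Square sq = (Square) s;
            return sq.side * sq.side; // inlined body of Square.area()
        }
        return s.area();              // guard failed: regular dispatch
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) {
            areaProfiled(new Square(2));
        }
        System.out.println(receiverCounts.get(Square.class)); // 3
        System.out.println(areaSpecialized(new Square(2)));   // 4.0
        System.out.println(areaSpecialized(new Circle(1)));
    }
}
```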

Dexlayout

Dexlayout is a library introduced in Android 8.0 to analyze dex files and reorder them according to a profile. Dexlayout aims to use runtime profiling information to reorder sections of the dex file during idle maintenance compilation on device. By grouping together parts of the dex file that are often accessed together, programs can have better memory access patterns from improved locality, saving RAM and shortening startup time.

Since profile information is currently available only after apps have been run, dexlayout is integrated in dex2oat's on-device compilation during idle maintenance.
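The grouping idea can be sketched with a toy reordering in Java: items that a profile marks as used are moved to the front, keeping their relative order. This is only an analogy under simplifying assumptions; the class LayoutSketch, the string item names, and the single "hot" set are hypothetical and say nothing about how dexlayout actually represents or orders dex sections.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy profile-guided reordering (hypothetical; not dexlayout's algorithm).
final class LayoutSketch {
    // Places profiled ("hot") items first, then the rest, preserving the
    // original relative order within each group.
    static List<String> reorder(List<String> items, Set<String> hot) {
        List<String> out = new ArrayList<>();
        for (String item : items) {
            if (hot.contains(item)) {
                out.add(item);
            }
        }
        for (String item : items) {
            if (!hot.contains(item)) {
                out.add(item);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> items = Arrays.asList("a", "b", "c", "d");
        Set<String> hot = new HashSet<>(Arrays.asList("c", "a"));
        System.out.println(reorder(items, hot)); // [a, c, b, d]
    }
}
```

After reordering, the hot items sit next to each other, so reading them touches fewer distinct memory pages than the original interleaved layout would.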

Dex cache removal

Up to Android 7.0, the DexCache object owned four large arrays, proportional to the number of certain elements in the DexFile, namely:

strings (one reference per DexFile::StringId),
types (one reference per DexFile::TypeId),
methods (one native pointer per DexFile::MethodId),
fields (one native pointer per DexFile::FieldId).

These arrays were used for fast retrieval of previously resolved objects. In Android 8.0, all arrays have been removed except the methods array.

Interpreter performance

Interpreter performance significantly improved in the Android 7.0 release with the introduction of "mterp" - an interpreter featuring a core fetch/decode/interpret mechanism written in assembly language. Mterp is modelled after the fast Dalvik interpreter, and supports arm, arm64, x86, x86_64, mips, and mips64. For computational code, ART's mterp is roughly comparable to Dalvik's fast interpreter. However, in some situations it can be significantly - and even dramatically - slower:

Invoke performance.
String manipulation, and other heavy users of methods recognized as intrinsics in Dalvik.
Higher stack memory usage.

Android 8.0 addresses these issues.

More inlining

Since Android 6.0, ART has been able to inline any call within the same dex file, but could inline only leaf methods from different dex files. There were two reasons for this limitation:

Inlining from another dex file requires using the dex cache of that other dex file, unlike same-dex-file inlining, which can simply reuse the dex cache of the caller. The dex cache is needed in compiled code for a couple of instructions, such as static calls, string loads, or class loads.
Stack maps encode only a method index within the current dex file.

To address these limitations, Android 8.0:

Removes dex cache access from compiled code (see also the "Dex cache removal" section).
Extends stack map encoding.

Synchronization improvements

The ART team tuned the MonitorEnter/MonitorExit code paths and reduced reliance on traditional memory barriers on ARMv8, replacing them with newer (acquire/release) instructions where possible.
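For context, the monitor paths are what every Java synchronized block exercises: entering the block acquires the object's monitor and leaving it releases the monitor. The example below is a generic illustration with hypothetical names, not a benchmark of the tuned paths.

```java
// Contended counter guarded by a synchronized block (hypothetical example).
final class MonitorExample {
    private final Object lock = new Object();
    private int count;

    void increment() {
        synchronized (lock) { // monitor acquired on entry
            count++;
        }                     // monitor released on exit
    }

    int get() {
        synchronized (lock) {
            return count;
        }
    }

    // Runs the given number of threads, each incrementing perThread times.
    static int run(int threads, int perThread) {
        MonitorExample counter = new MonitorExample();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    counter.increment();
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(run(4, 10000)); // 40000
    }
}
```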

Faster native methods

Faster native calls to the Java Native Interface (JNI) are available using the @FastNative and @CriticalNative annotations. These built-in ART runtime optimizations speed up JNI transitions and replace the now-deprecated !bang JNI notation. The annotations have no effect on non-native methods and are only available to platform Java language code on the bootclasspath (no Play Store updates).

The @FastNative annotation supports non-static methods. Use it if a method accesses a jobject as a parameter or return value.

The @CriticalNative annotation provides an even faster way to run native methods, with the following restrictions:

Methods must be static—no objects for parameters, return values, or an implicit this.
Only primitive types are passed to the native method.
The native method does not use the JNIEnv and jclass parameters in its function definition.
The method must be registered with RegisterNatives instead of relying on dynamic JNI linking.

@FastNative can improve native method performance up to 3x, and @CriticalNative up to 5x. For example, the following table shows JNI transition times measured on a Nexus 6P device:

Java Native Interface (JNI) invocation	Execution time (in nanoseconds)
Regular JNI	115
!bang JNI	60
@FastNative	35
@CriticalNative	25
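The declarations below sketch how the two annotations map onto method shapes. The nested FastNative and CriticalNative annotation types are local stand-ins so that the example compiles on a plain JDK; platform code would use the annotations from the dalvik.annotation.optimization package instead. The class name, the native method names, and the hasCriticalShape helper are hypothetical, and the helper checks only the Java-side restrictions (static, primitives only), not the native-side ones.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

// Hypothetical declarations; the annotations here are local stand-ins.
final class NativeDeclarations {
    @interface FastNative {}

    @interface CriticalNative {}

    // Non-static and takes an object parameter: fits @FastNative only.
    @FastNative
    native int length(String s);

    // Static with primitive parameters and return type: fits @CriticalNative.
    // The native function would omit JNIEnv and jclass and would be
    // registered with RegisterNatives.
    @CriticalNative
    static native int add(int a, int b);

    // Checks the Java-side @CriticalNative restrictions for a declared method.
    static boolean hasCriticalShape(String name, Class<?>... params) {
        Method m;
        try {
            m = NativeDeclarations.class.getDeclaredMethod(name, params);
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException(e);
        }
        int mods = m.getModifiers();
        if (!Modifier.isNative(mods) || !Modifier.isStatic(mods)) {
            return false;
        }
        if (!m.getReturnType().isPrimitive()) { // void counts as primitive here
            return false;
        }
        for (Class<?> p : m.getParameterTypes()) {
            if (!p.isPrimitive()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(hasCriticalShape("add", int.class, int.class)); // true
        System.out.println(hasCriticalShape("length", String.class));      // false
    }
}
```

The native methods are never invoked here, so no native library is needed to run the check.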