the Chromium logo

The Chromium Projects

Aligned bundling support in LLVM

This document describes the proposal to add aligned instruction bundle support (in short - "bundling") in LLVM and its implementation in the MC module.

Specification

For the purpose of supporting the Software Fault Isolation (SFI) mechanisms required by Native Client, the following directives are added to the LLVM assembler:

.bundle_align_mode <num>
.bundle_lock <option>
.bundle_unlock

With the following semantics:

When aligned instruction bundle mode ("bundling" in short) is enabled (.bundle_align_mode was encountered with an argument > 0, which is the power of 2 to which the bundle size is equal), single instructions and groups of instructions between .bundle_lock and .bundle_unlock directives cannot cross a bundle boundary.

Furthermore, the .bundle_lock directive supports the align_to_end option, which means that the group has to end at a bundle boundary.

For example, consider the following:

.bundle_align_mode 4
mov1
mov2
mov3

Assuming that each of the mov instructions is 7 bytes long and mov1 is aligned to a 16-byte boundary, two bytes of NOP padding will be inserted between mov2 and mov3 to make sure that mov3 does not cross a 16-byte bundle boundary.

A slightly modified example:

.bundle_align_mode 4
mov1
.bundle_lock
mov2
mov3
.bundle_unlock

Here, since the bundle-locked sequence mov2 mov3 cannot cross a bundle boundary, 9 bytes of NOP padding will be inserted between mov1 and mov2. An example to demonstrate the align_to_end option:

.bundle_align_mode 4
mov1
mov2
.bundle_lock align_to_end
mov3
mov4
.bundle_unlock

Normally, only two bytes of NOP padding would be required between mov2 and mov3 to ensure that bundle-locked sequence does not cross a bundle boundary. However, since align_to_end was provided, an additional two bytes of NOP padding will be inserted so that the sequence ends at a boundary.

For information on how this ability is used for software fault isolation by Native Client, see the following resources:

Implementation in LLVM MC

As proposed, bundling is a feature of the assembler. Therefore, it is implemented in the MC module of LLVM. Specifically, the following parts are affected:

The following description will focus on the path: Text assembly -> ELF object streamer -> Assembly -> Object file emission. This path can be roughly divided to three stages:

  1. Parse bundling directives and emit a sequence of sections (MCSectionData) and fragments (MCFragment derivatives) that represent the input.
  2. Perform layout of the fragments to be valid w.r.t. relaxation and bundling.
  3. Write the output object.

Parsing and emitting sections and fragments

In order to implement bundling, we use the existing assembly parsing facilities in MC, adding support for the new directives. The existing section and fragment abstractions are used, with some flags added to keep the state of the bundling directives encountered. Specifically:

When bundling mode is turned on in the assembler (BundleAlignSize > 0), the following rules apply to emitting fragments:

Laying out fragments

Bundling blends well into the existing layout mechanism in the MC assembler, since its effects are somewhat similar to relaxation. Some fragments may need to grow due to padding, which may require re-layout of subsequent fragments and recomputation of fixups. Therefore, the MC assembler employs an iterative layout algorithm. The following diagram will help explain the layout of fragments w.r.t. bundling:

image

Note that since we create a single data fragment for a bundle-locked group, the above applies to such groups as well as single instructions.

Writing fragments to the object file

Note: the write target of the assembler is abstracted into a stream object, which can also write into memory. "Object file" here implies this stream.

In the last step, the assembler writes the list of fragments into sections according to their layout order. In this step, the BundlePadding field created during layout is used to add NOP padding (by calling writeNopData on the target-specific assembler backend) of appropriate size before fragments that require it.

Performance impacts

It's interesting to study the performance impact of adding the bundling feature on MC's assembler normal operation (without actual bundling directives).

Memory consumption of fragment objects

Amount of bytes needed for the various MCFragment objects on x86-64:

Fragment type Without bundling With bundling
`MCFragment` ` 64` ` 64`
`MCRelaxableFragment` ` 320` ` 320`
`MCDataFragment` ` 240` ` 240`

Explanation: the single-byte BundlePadding field in MCEncodedFragment was placed in space reserved for alignment earlier. The same was done for the boolean HasInstructions in MCDataFragment.

Assembler runtime

llvm-mc was run on a large assembly file (produced by compiling gcc 3.5 into a single assembly file) as follows:

sudo nice -n -20 perf stat -r 10 llvm-mc -filetype=obj gcc.s -o gcc.o

There were no noticeable difference in the runtime of llvm-mc with and without the bundling patch.

Alternative implementation

Bundling is similar to relaxation in many aspects which simplifies the implementation. While the performance impact of bundling is negligible as shown, the interaction of these two mechanisms can have a negative effect on memory consumption. When bundling is used, most instructions need to be put in their own fragment. That's because before we have decided the sizes of jumps, we don’t know the relative positions of other instructions, as we might need to insert bundle padding NOPs between them. The only exception are bundle-locked superinstruction sequences which can be stored in a single fragment.

Since there's a memory overhead for storing a fragment, the use of bundling with relaxation significantly increases memory overhead. When translating pexe, pnacl-llc uses K * nexe size memory. The experimental results show that the value of K is ~17 with the fixed cost of ~50MB. Given that, with the maximum pexe size limited currently to 64MB, pnacl-llc would need over 1.1GB of address space which is significantly more than 768MB that will be available when pnacl-llc starts using IRT.

Therefore, one way to reduce the memory consumption is to disable jump relaxation. In that case, we know that size of all instructions from the beginning and we can write the bundle padding NOPs directly into the fragments while emitting the instructions, thus reducing the number of fragments needed.

Emitting instructions

We have implemented the alternative bundle padding scheme in LLVM MC. Currently, this implementation is only being used when the -mc-relax-all flag is used. The large bulk of implementation is in the MCELFStreamer::mergeFragment method. We reuse the existing code and emit instructions into their own fragment. We also reuse the existing logic for calculating bundle boundaries and necessary bundle padding. However, when the jump relaxation is disabled, instead of adding each fragment into the list of fragments held by MCSectionData, we merge it with the current fragment.