Evolving Port Mappings with PMEvo

This website accompanies our PLDI'20 paper. Have a look at the research artifact to see our prototype implementation.

Get an overview with our PLDI'20 Video Abstract:

Or watch the entire PLDI'20 Talk:

What is a port mapping and why is it interesting?

Processor Overview

Achieving peak performance in a computer system requires optimizations in every layer of the system, be it hardware or software. A detailed understanding of the underlying hardware, and especially the processor, is crucial to optimize software. One key criterion for the performance of a processor is its ability to exploit instruction-level parallelism.

Modern processors commonly enable instruction-level parallelism by providing a set of independent execution units that are grouped behind execution ports. When executing an instruction sequence, the processor splits the instructions into micro-operations (μops for short) and distributes them among the available execution ports.

As not every port has all kinds of execution units, each μop can only be executed on a specific group of ports. This relationship is described by the port mapping: it captures how the processor divides instructions into μops and which execution ports can execute these μops:

Example Port Mapping
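To make the notion concrete, a port mapping can be written down as a small data structure. The following is a minimal sketch for a made-up three-port machine. The instruction names, port names, and the helper functions `uop_count` and `usable_ports` are illustrative assumptions, not data from a real processor or part of PMEvo.

```python
# Hypothetical port mapping for an invented machine with ports P0, P1, P2.
# Each instruction maps to its μops; a μop is identified by the set of
# ports that can execute it, and the value is how many such μops occur.
PORT_MAPPING = {
    "add":   {frozenset({"P0", "P1"}): 1},   # one μop, runs on P0 or P1
    "mul":   {frozenset({"P0"}): 1},         # one μop, runs only on P0
    "store": {frozenset({"P0", "P1"}): 1,    # two μops: address computation
              frozenset({"P2"}): 1},         # and the store itself
}


def uop_count(instruction):
    """Number of μops the instruction is split into."""
    return sum(PORT_MAPPING[instruction].values())


def usable_ports(instruction):
    """All ports that can execute at least one μop of the instruction."""
    ports = set()
    for port_set in PORT_MAPPING[instruction]:
        ports |= port_set
    return ports


print(uop_count("store"), sorted(usable_ports("store")))
```

In this representation, two instructions compete for resources whenever the port sets of their μops overlap, which is exactly the effect that throughput measurements can expose.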

Instruction-level parallelism plays a significant role in the performance of modern processors. To understand a processor's performance, e.g. to predict running times or to guide compiler optimizations, knowledge of the port mapping is essential.

How to get a port mapping?

Processor manufacturers usually do not share the port mappings of their microarchitectures. We therefore need to infer port mappings from experiments to gain the necessary insights into the processor.

Several existing approaches use experiments to automatically infer a processor's port mapping, for example uops.info and EXEgesis. They have in common that their experiments rely on the use of hardware performance counters. While these counters provide precise information about how instructions are executed on the processor, they are only fully available on Intel processors. These approaches are therefore not suitable for processors that do not provide such counters, including a large group of ARM processors and x86 processors by AMD.

The PMEvo Framework

Our framework, PMEvo, automatically infers port mappings based solely on measurements of the execution time of short instruction sequences. While these measurements are harder to interpret than data from hardware performance counters, they can be performed on a much wider range of processors.

PMEvo is organized in four phases:

PMEvo Overview

Based on a provided description of the instruction set architecture (ISA) under test, PMEvo first generates a set of experiments. These experiments are instruction sequences that are specifically designed such that their execution time depends on the presence of conflicting resource requirements of instructions.
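As a rough illustration of what such an experiment set might look like, the sketch below enumerates every single instruction and every unordered pair of instructions. This is an assumption made for illustration only: the function `generate_experiments` and the instruction names are hypothetical, and PMEvo's actual experiment generation is more elaborate than this.

```python
from itertools import combinations


def generate_experiments(instructions):
    """Return experiments as dicts mapping instruction -> number of copies.

    Singleton experiments expose the throughput of an instruction on its
    own; pair experiments expose whether two instructions compete for the
    same ports.
    """
    experiments = [{instr: 1} for instr in instructions]
    experiments += [{a: 1, b: 1} for a, b in combinations(instructions, 2)]
    return experiments


for experiment in generate_experiments(["add", "mul", "store"]):
    print(experiment)
```

If two instructions share no ports, the pair should run about as fast as the slower of the two alone; if they conflict, the pair takes noticeably longer.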

As the next step, PMEvo determines the throughputs of the generated experiments by executing them in a steady state on the processor under test and measuring the execution time.

From these measurements, PMEvo identifies groups of instructions that behave identically in every experiment including them. By reducing these groups of congruent instructions to one representative, we obtain a simplified set of experiments.
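The grouping step can be sketched as follows, assuming measurements are available for every single instruction and every pair of instructions. The data layout (`single`, `pair`), the relative tolerance `eps`, the function names, and the measured values are all assumptions chosen for this example, not PMEvo's actual interface or data.

```python
def close(x, y, eps=0.05):
    """True if x and y agree up to a relative tolerance eps."""
    return abs(x - y) <= eps * max(abs(x), abs(y))


def congruent(a, b, single, pair, instructions, eps=0.05):
    """a and b behave alike alone and when paired with any third instruction."""
    if not close(single[a], single[b], eps):
        return False
    for c in instructions:
        if c in (a, b):
            continue
        if not close(pair[frozenset((a, c))], pair[frozenset((b, c))], eps):
            return False
    return True


def group_congruent(instructions, single, pair, eps=0.05):
    """Greedily group instructions; the first member of each group is its representative."""
    groups = []
    for instr in instructions:
        for group in groups:
            if congruent(group[0], instr, single, pair, instructions, eps):
                group.append(instr)
                break
        else:
            groups.append([instr])
    return groups


# Invented measurements in cycles per experiment iteration.
instructions = ["add", "sub", "mul"]
single = {"add": 0.5, "sub": 0.5, "mul": 1.0}
pair = {
    frozenset(("add", "sub")): 1.0,
    frozenset(("add", "mul")): 1.0,
    frozenset(("sub", "mul")): 1.0,
}
groups = group_congruent(instructions, single, pair)
print(groups)
```

Here `add` and `sub` end up in one group, so only one of them needs to take part in the later search; the inferred μops of the representative are then reused for the whole group.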

In the last phase, PMEvo employs an evolutionary algorithm to evolve a port mapping that explains the throughputs for the simplified experiment set as accurately as possible. The fitness of a candidate port mapping is evaluated using an efficient implementation of an analytical throughput model based on a linear program.
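The throughput model can be illustrated without an LP solver. The sketch below assumes that the linear program distributes the μops of a dependency-free experiment over their admissible ports so that the load of the busiest port is minimal. For a program of this form, the optimum equals the largest ratio, over all non-empty subsets of ports, between the number of μops that can only run inside that subset and the size of the subset. The function names, the port mapping, and the simple error measure are assumptions for illustration, not PMEvo's implementation or its actual fitness function.

```python
from itertools import combinations


def predicted_cycles(experiment, mapping, ports):
    """Predicted cycles per iteration of a dependency-free experiment.

    experiment: dict instruction -> number of copies
    mapping:    dict instruction -> {frozenset of ports: number of μops}
    ports:      list of all port names
    """
    # Accumulate how many μops are restricted to each set of ports.
    mass = {}
    for instr, copies in experiment.items():
        for port_set, uops in mapping[instr].items():
            mass[port_set] = mass.get(port_set, 0) + copies * uops

    # The bottleneck is the port subset with the most trapped μops per port.
    # Note: this enumerates all subsets, so it is exponential in len(ports).
    worst = 0.0
    for size in range(1, len(ports) + 1):
        for subset in combinations(ports, size):
            chosen = frozenset(subset)
            trapped = sum(m for ps, m in mass.items() if ps <= chosen)
            worst = max(worst, trapped / size)
    return worst


def mean_relative_error(measurements, mapping, ports):
    """Hypothetical fitness: average relative prediction error (lower is better)."""
    errors = [abs(predicted_cycles(exp, mapping, ports) - cycles) / cycles
              for exp, cycles in measurements]
    return sum(errors) / len(errors)


# Invented candidate port mapping and measurements.
PORTS = ["P0", "P1", "P2"]
MAPPING = {
    "add": {frozenset({"P0", "P1"}): 1},
    "mul": {frozenset({"P0"}): 1},
}
print(predicted_cycles({"add": 2, "mul": 1}, MAPPING, PORTS))  # 1.5
print(mean_relative_error([({"add": 2, "mul": 1}, 1.5)], MAPPING, PORTS))  # 0.0
```

In the example, the three μops of two `add` and one `mul` can only use P0 and P1, so the experiment needs at least 3 / 2 = 1.5 cycles per iteration. An evolutionary search would evaluate such an error for every candidate mapping and prefer candidates with lower values.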

Results

Our prototype implementation infers a port mapping for Intel's Skylake architecture that is competitive with existing work. Furthermore, it finds port mappings for AMD's Zen+ architecture and the ARM Cortex-A72 architecture, which are out of scope for existing automatic techniques.

We evaluate PMEvo's inferred port mappings by their ability to accurately predict instruction throughput for dependency-free instruction sequences. Our evaluation shows that our inferred port mapping for Intel's Skylake architecture allows for throughput predictions that are about as accurate as those of uops.info, llvm-mca, and IACA. The comparison with the neural-network-based throughput predictor Ithemal indicates that PMEvo's port mappings capture performance characteristics that are not well supported by Ithemal's approach. For the non-Intel architectures in our evaluation setup, PMEvo outclasses the prediction accuracy of llvm-mca, the only one of the above tools that has performance models for these architectures.

The heat maps below visualize these results: in each heat map, experiments are placed into buckets based on a tool's predicted instruction throughput and the actual measured instruction throughput. The closer the experiments are to the green diagonal line, the better the tool's prediction accuracy for these experiments.

Prediction Accuracy Results

Publication

Conferences

People