AnySL: Efficient and Portable Multi-Language Shading

RenderMan Glass Starball, Procedural Parquet | RTfact/AnySL

RenderMan Granite, Checker, Parquet, Mirror | Manta/AnySL

About

In cooperation with the Computer Graphics Group, we are developing a unified shading system that is independent of source language, target architecture, and rendering engine, without sacrificing runtime performance.

Our goal is to eventually provide a shading system that uses a portable shader format, so that it can be integrated into any kind of rendering engine (e.g. ray tracing, rasterization, global illumination). In addition, integrating an existing shading language requires only minimal effort, while the compiler technology of AnySL still enables maximum performance.

Shaders are program fragments that extend the functionality of a rendering system for specific tasks such as computing emission, light-material interaction, or geometry processing, similar to plug-ins used elsewhere. The key difference from such function-call and library-based plug-ins is that shading code usually needs to be transformed to meet the target application's requirements regarding performance or program structure, while remaining convenient for the programmer. To support a given shading language, however, the renderer has to provide a compiler framework for it. Hence, the renderer's implementor ends up investing a large part of their time in creating compilers, something they did not want to do in the first place.

AnySL is a novel approach to ease the integration of a shading language into a renderer. We compile shaders into a program representation that is independent of the shading language, the renderer, and the target hardware platform. The renderer has to provide the implementation of the basic constructs of the shading language. By augmenting the renderer with a just-in-time compiler library, the shaders are loaded and "glued" to the renderer's interface at runtime. Afterwards, the shader is mapped to the underlying hardware platform. With this approach, all performance obstacles incurred by common programming abstraction mechanisms are optimized away, resulting in high performance while keeping the maximum flexibility.
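As a rough illustration of this idea, the following C++ sketch models a shader whose operators are all calls into a set of operations supplied by the renderer. It is only an analogy: AnySL performs this "gluing" on compiled shader code at runtime with a just-in-time compiler, whereas the sketch uses compile-time templates. All names (`ScalarRenderer`, `diffuseShader`, `shadeWithScalarRenderer`) are hypothetical and not part of AnySL's interface.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical renderer-side vector type (an assumption for this sketch).
struct Vec3 { float x, y, z; };

// Hypothetical set of basic operations that one renderer provides.
struct ScalarRenderer {
    using Float = float;
    using Vector = Vec3;
    static Float dot(const Vector& a, const Vector& b) {
        return a.x * b.x + a.y * b.y + a.z * b.z;
    }
    static Float max(Float a, Float b) { return std::max(a, b); }
    static Vector scale(const Vector& v, Float s) {
        return {v.x * s, v.y * s, v.z * s};
    }
};

// Generic shader: every operator is a call into the renderer's operation set R,
// so the shader itself does not depend on the renderer's types.
template <typename R>
typename R::Vector diffuseShader(const typename R::Vector& normal,
                                 const typename R::Vector& lightDir,
                                 const typename R::Vector& albedo) {
    using Float = typename R::Float;
    Float cosTheta = R::max(R::dot(normal, lightDir), Float(0));
    return R::scale(albedo, cosTheta);
}

// "Gluing": instantiate the generic shader with one renderer's operations.
// After inlining, no call overhead from the abstraction needs to remain.
inline Vec3 shadeWithScalarRenderer(const Vec3& n, const Vec3& l, const Vec3& albedo) {
    return diffuseShader<ScalarRenderer>(n, l, albedo);
}

// Usage example: a surface facing the light keeps its full albedo.
inline void exampleUsage() {
    Vec3 c = shadeWithScalarRenderer({0.0f, 0.0f, 1.0f}, {0.0f, 0.0f, 1.0f},
                                     {0.5f, 0.25f, 1.0f});
    assert(c.x == 0.5f && c.y == 0.25f && c.z == 1.0f);
}
```

A second renderer would supply its own operation set with the same member names, and the shader source would stay unchanged.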

The AnySL Shading System uses an embedded just-in-time compiler, the "Low-Level Virtual Machine" (LLVM), to load, specialize and optimize shaders at runtime. This allows us to recompile on the fly, e.g. after modifications of shader parameters, without sacrificing performance.

Whole-Function Vectorization

For ray tracing engines that employ packet tracing, the scalar shader code is automatically transformed into packet code that operates on packets of data, sized according to the target architecture's SIMD width. This makes it possible to exploit the SIMD instruction sets of CPUs (e.g. SSE, AltiVec) without burdening the shader programmer with writing such complex and error-prone code. The only alternative is to shade all rays of a packet sequentially, which incurs considerable overhead if the ray tracer operates on SIMD data types, because packets have to be split before execution and the results merged again afterwards.
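The difference between the two strategies can be sketched in plain C++. This is a hand-written illustration, not output of the actual transformation: it assumes a packet width of 4, models a packet as a `std::array` instead of a SIMD register, and uses a made-up shading formula. The packet version replaces the branch of the scalar code with a per-lane mask and a blend, which is the usual way to express divergent control flow on SIMD hardware.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kWidth = 4;  // assumed SIMD width (e.g. SSE, 32-bit floats)
using Packet = std::array<float, kWidth>;
using Mask = std::array<bool, kWidth>;

// Scalar shader code as the programmer writes it, including a branch.
float shadeScalar(float cosTheta) {
    if (cosTheta > 0.0f) {
        return 0.25f + 0.5f * cosTheta;  // lit: ambient plus diffuse term
    }
    return 0.25f;                        // unlit: ambient term only
}

// Sequential shading: split the packet, shade each ray, merge the results.
Packet shadeSequential(const Packet& cosTheta) {
    Packet out{};
    for (std::size_t lane = 0; lane < kWidth; ++lane) {
        out[lane] = shadeScalar(cosTheta[lane]);
    }
    return out;
}

// Packet version: both sides of the branch are computed for all lanes,
// and a mask selects the valid result per lane (a "blend").
Packet shadePacket(const Packet& cosTheta) {
    Mask lit{};
    Packet thenValue{}, elseValue{}, out{};
    for (std::size_t lane = 0; lane < kWidth; ++lane) {
        lit[lane] = cosTheta[lane] > 0.0f;               // packet compare
        thenValue[lane] = 0.25f + 0.5f * cosTheta[lane]; // "then" side, all lanes
        elseValue[lane] = 0.25f;                         // "else" side, all lanes
        out[lane] = lit[lane] ? thenValue[lane] : elseValue[lane];  // blend
    }
    return out;
}

// Usage example: both strategies must agree lane by lane.
inline void examplePacketUsage() {
    const Packet in = {1.0f, 0.5f, -1.0f, 0.0f};
    assert(shadeSequential(in) == shadePacket(in));
}
```

On real hardware, each statement of the loop body in `shadePacket` would correspond to a single SIMD instruction on a whole packet, so no per-ray splitting and merging is needed.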

Compared to sequential shading, we obtain an average speedup factor of 3.9 for the entire rendering process in RTfact. At the same time, we reach over 90% of the performance of hand-written, native shaders.

See the project page for more details: Whole-Function Vectorization.

LLVM PTX Backend

As part of the AnySL system we implemented an LLVM backend for NVIDIA's "Parallel Thread Execution" (PTX) assembly language. PTX is the low-level representation fed to NVIDIA GPGPU graphics drivers and is usually generated by compilers for the "Compute Unified Device Architecture" (CUDA).

The backend is similar to LLVM's C-backend and generates .ptx files directly from LLVM's intermediate representation (IR).

The backend already supports most of the PTX features:

simple arithmetic (add, mul, ...)
control flow
structs and arrays
simple function calls (no recursion, no struct returns)
global, shared, constant, and texture memory access
mathematical functions (e.g. sin, cos, sqrt, pow, ...)
special registers (e.g. thread_id)

There are no intrinsics for PTX-specific functionality such as texture fetches; these are currently only accessible via external functions. Atomic and synchronization instructions are not yet implemented but should work the same way.

Performance has not yet been optimized to any significant degree. In particular, optimizations that reduce register pressure are necessary to generate faster code.

The backend was written as part of the bachelor's thesis of Helge Rhodin. The source code is released under the University of Illinois/NCSA Open Source License (BSD-style) and is hosted at SourceForge.

Code contributions to the backend are very welcome! :)

Download LLVM PTX Backend.

Publications

Conferences

Whole Function Vectorization
Karrenberg, R. and Hack, S.
International Symposium on Code Generation and Optimization, 2011. [doi] [url] [slides] [bib]

@CONFERENCE{KH:2011:cgo,
 	author = {Ralf Karrenberg and Sebastian Hack},
 	title = {{W}hole {F}unction {V}ectorization},
 	booktitle = {International Symposium on Code Generation and Optimization},
 	series = {CGO},
 	year = {2011},
 	doi = {10.1109/CGO.2011.5764682},
 	abstract = {
 		Data-parallel programming languages are an important component
 		in today's parallel computing landscape. Among those are
 		domain-specific languages like shading languages in graphics (HLSL,
 		GLSL, RenderMan, etc.) and "general-purpose" languages like CUDA or
 		OpenCL. Current implementations of those languages on CPUs solely
 		rely on multi-threading to implement parallelism and ignore the
 		additional intra-core
 		parallelism provided by the SIMD instruction set of those processors
 		(like Intel's SSE and the upcoming AVX or Larrabee instruction sets).
 		In this paper, we discuss several aspects of implementing data-parallel
 		languages on machines with SIMD instruction sets. Our main contribution
 		is a language- and platform-independent code transformation that
 		performs whole-function vectorization on low-level intermediate code
 		given by a control flow graph in SSA form.
 		We evaluate our technique in two scenarios: First, incorporated in a
 		compiler for a domain-specific language used in real-time ray tracing.
 		Second, in a stand-alone OpenCL driver. We observe average speedup
 		factors of 3.9 for the ray tracer and factors between 0.6 and 5.2 for
 		different OpenCL kernels.
 	},
 	webslides = {http://www.cdl.uni-saarland.de/projects/wfv/wfv_cgo11_slides.pdf},
 	url = {http://www.cdl.uni-saarland.de/papers/karrenberg_wfv.pdf},
 	acc_rate = {26.7},
 	accepted = {28},
 	submitted = {105},
 }

AnySL: Efficient and Portable Shading for Ray Tracing - HPG 2010
Karrenberg, R., Rubinstein, D., Slusallek, P. and Hack, S.
Proceedings of the Conference on High Performance Graphics, pages 97–105, Eurographics Association, 2010. [url] [slides] [bib]

@CONFERENCE{KRSH:2010:hpg,
 	author = {Ralf Karrenberg and Dmitri Rubinstein and Philipp Slusallek and Sebastian Hack},
 	title = {{AnySL: Efficient and Portable Shading for Ray Tracing}},
 	booktitle = {Proceedings of the Conference on High Performance Graphics},
 	series = {HPG '10},
 	year = {2010},
 	location = {Saarbr{\"u}cken, Germany},
 	pages = {97--105},
 	numpages = {9},
 	url = {http://portal.acm.org/citation.cfm?id=1921479.1921495},
 	acmid = {1921495},
 	publisher = {Eurographics Association},
 	address = {Aire-la-Ville, Switzerland},
 	booktitle_short = {HPG},
 	abstract = {
 		While a number of different shading languages have been developed,
 		their efficient integration into an existing renderer is notoriously
 		difficult, often boiling down to implementing an entire compiler
 		toolchain for each language. Furthermore, no shading language is
 		broadly supported across the variety of rendering systems.
 		AnySL attacks this issue from multiple directions: We compile shaders
 		from different languages into a common, portable representation, which
 		uses subroutine threaded code: Every language operator is translated to
 		a function call. Thus, the compiled shader is generic with respect to
 		the used types and operators.
 		The key component of our system is an embedded compiler that
 		instantiates this generic code in terms of the renderer's native types
 		and operations. It allows for flexible code transformations to match
 		the internal structure of the renderer and eliminates all overhead due
 		to the subroutine threaded code. For SIMD architectures we
 		automatically perform vectorization of scalar shaders which speeds up
 		rendering by a factor of 3.9 on average on SSE. The results are highly
 		optimized, parallel shaders that operate directly on the internal data
 		structures of a renderer. We show that both traditional shading
 		languages such as RenderMan, but also C/C++-based shading languages,
 		can be fully supported and deliver high performance across different
 		CPU renderers.
 	},
 	webslides = {http://www.cdl.uni-saarland.de/projects/anysl/anysl_hpg10_slides.pdf}
 }

MSc Thesis

Automatic Packetization
Karrenberg, R.
M.Sc. Thesis, Saarland University, 2009. [pdf] [bib]

@MASTERSTHESIS{Karrenberg:2009:MSc,
     author  = {Ralf Karrenberg},
     title   = {{Automatic Packetization}},
     school  = {Saarland University},
     year    = {2009},
     month   = {July},
     webpdf  = {http://www.cdl.uni-saarland.de/publications/theses/karrenberg_msc.pdf},
 	abstract = {
 		Modern processor architectures provide the possibility to execute an
 		instruction on multiple values at once. So-called SIMD (Single
 		Instruction, Multiple Data) instructions work on packets (or vectors)
 		of data instead of scalar values. They offer a significant performance
 		boost for data-parallel algorithms that perform the same operations on
 		large amounts of data, e.g. data encoding and decoding, image
 		processing, or ray tracing.
 		However, the performance gain comes at a price: programming languages
 		provide no elegant means to exploit SIMD instruction sets. Packet
 		operations have to be coded by hand, which is complicated, unintuitive,
 		and error prone.  Thus, packetization - the transformation of scalar
 		code to packet form - is mostly applied automatically by local compiler
 		optimizations (e.g. during loop vectorization) or with a lot of manual
 		effort at performance-critical parts of a system.
 		This thesis describes an algorithm for automatic packetization that
 		allows a programmer to write scalar functions but use them on packets
 		of data. A compiler pass automatically transforms those functions to
 		work on packets of the target-architecture's SIMD width. The resulting
 		packetized function computes the same results as multiple executions of
 		the scalar code.
 		The algorithm is implemented in a source-language and
 		target-architecture independent intermediate representation (the Low
 		Level Virtual Machine (LLVM)), which enables its use in many
 		different environments. The performance of the generated code is
 		shown in a real-world case study in the context of real-time ray
 		tracing: serial shader
 		code written in C++ is automatically specialized, optimized, and
 		packetized at runtime. The packetized shaders outperform their scalar
 		counterparts by an average factor of 3.6 on a standard SSE architecture
 		of SIMD width 4.
 	}
 }

BSc Thesis

Decompilation of LLVM IR
Moll, S.
B.Sc. Thesis, Saarland University, 2011. [pdf] [bib]

@BACHELORSTHESIS{Moll:2011:BSc,
     author = { Simon Moll },
     title = { {D}ecompilation of {LLVM} {IR} },
     school = {Saarland University},
     year = { 2011 },
     month = { February },
     webpdf  = {http://www.cdl.uni-saarland.de/publications/theses/moll_bsc.pdf},
     abstract = { },
 }

A PTX Code Generator for LLVM
Rhodin, H.
B.Sc. Thesis, Saarland University, 2010. [pdf] [bib]

@BACHELORSTHESIS{Rhodin:2010:BSc,
     author = { Helge Rhodin },
     title = { {A} {PTX} {C}ode {G}enerator for {LLVM} },
     school = {Saarland University},
     year = { 2010 },
     month = { October },
     webpdf  = {http://www.cdl.uni-saarland.de/publications/theses/rhodin_bsc.pdf},
     abstract = { },
 }

Compiler Design Lab