Understanding OpenGL ES: Multi-thread and multi-window rendering
As the CPUs and GPUs in mobile devices have become more powerful and devices with one or more high-resolution screens have become ubiquitous, the demand for complex interactions with the graphics driver has increased. In this blog post, I’ll discuss what multi-thread and multi-window rendering mean to developers, and I’ll describe if and when these techniques should be used in your apps.
What is multi-thread rendering?
Traditionally, OpenGL ES applications only render to one surface from one thread. However, as the complexity of 3D rendering engines has increased, the CPU overhead of graphics API operations has become a common bottleneck – particularly when loading assets. This is where multi-threaded rendering becomes interesting.
A rendering thread is one CPU thread associated with one graphics context. By default, each graphics context will not be able to access the resources (textures, shaders and vertex buffers) of another context. For this reason, shared contexts are used so one or more background loading threads can access the resources of a primary thread. There are two reasons why this rendering model is extremely useful:
- The primary thread won’t block
By their nature, graphics API calls that upload data have to block until the transfer between application and driver memory has completed. Additionally, shader compilation is a blocking operation in many graphics drivers. This blocking introduces a costly overhead that results in the GPU being starved of work. By moving all upload operations to a background thread, the primary thread can maintain a consistent frame rate
- Parallel work distribution on multi-core CPUs
As the graphics driver is processed on the CPU, splitting the work into multiple rendering threads enables the OS to issue work to multiple CPU cores in parallel. This results in the driver’s workload being processed faster than a single rendering thread would be capable of
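To make this concrete, here is a minimal sketch of how a background loader thread with a shared context might be set up through EGL. The helper names (`LoaderArgs`, `loader_thread`, `start_loader`), the ES 2.0 context version and the 1x1 pbuffer surface are illustrative assumptions, not part of any particular engine; only the EGL entry points themselves are standard:

```c
#include <EGL/egl.h>
#include <GLES2/gl2.h>
#include <pthread.h>

typedef struct {
    EGLDisplay dpy;
    EGLContext ctx;      /* share context, created in start_loader() */
    EGLSurface pbuffer;  /* tiny off-screen surface for the loader thread */
} LoaderArgs;

static void *loader_thread(void *p)
{
    LoaderArgs *a = (LoaderArgs *)p;
    /* Bind the share context to this thread before making any GL calls */
    eglMakeCurrent(a->dpy, a->pbuffer, a->pbuffer, a->ctx);

    /* Uploads done here (glTexImage2D, glBufferData, glCompileShader, ...)
       create objects that are visible to the primary context because the
       contexts share resources. glFinish() (or a fence sync on ES 3.0)
       guarantees the uploads are complete before the primary thread uses
       them. */
    glFinish();

    eglMakeCurrent(a->dpy, EGL_NO_SURFACE, EGL_NO_SURFACE, EGL_NO_CONTEXT);
    return NULL;
}

/* Called once at start-up, on the primary thread */
void start_loader(EGLDisplay dpy, EGLConfig cfg, EGLContext mainCtx,
                  LoaderArgs *a)
{
    static const EGLint ctxAttr[]  = { EGL_CONTEXT_CLIENT_VERSION, 2, EGL_NONE };
    static const EGLint pbufAttr[] = { EGL_WIDTH, 1, EGL_HEIGHT, 1, EGL_NONE };

    a->dpy = dpy;
    /* Passing mainCtx as the share_context argument makes textures, buffers
       and shader programs created on either context visible to both */
    a->ctx     = eglCreateContext(dpy, cfg, mainCtx, ctxAttr);
    a->pbuffer = eglCreatePbufferSurface(dpy, cfg, pbufAttr);

    pthread_t tid;
    pthread_create(&tid, NULL, loader_thread, a);
}
```

The key detail is the third argument to eglCreateContext(): passing the primary context there, rather than EGL_NO_CONTEXT, is what places the two contexts in the same share group.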
When should I use multi-threaded rendering?
OpenGL ES Data Upload – Unoptimized
OpenGL ES Data Upload – Optimized
Multi-threaded rendering is best suited to applications that are CPU limited when compiling shaders or uploading data to the graphics driver. Multi-threaded rendering (when done sensibly) enables better distribution of driver work and allows applications to maintain consistent frame rates.
In the simple example above, the transition from Level 1 to Level 2 in a game requires additional textures, VBOs and shader programs to be uploaded. Assuming a seamless transition between the two is required (i.e. splash screens, videos etc. can’t be used to hide the upload cost), the game must upload the new resources to the driver while Level 1 is still being rendered.
In the unoptimized case, the time spent issuing calls to the driver each frame is erratic due to the additional overhead of upload/compilation operations. The increased frame submission time may cause V-syncs to be missed, which could cause the game to feel jerky as the frame rate stutters.
In the optimized case, a secondary thread is used to upload resources in the background. This allows the primary thread to retain a consistent call submission time and, in turn, a smooth frame rate.
For the best possible performance, rendering threads should be created at start-up. A primary thread should be used for all rendering. Additional threads (created with a shared context) should only be used for shader compilation and buffer data upload. The number of background threads should be kept to a minimum (e.g. one thread per CPU core). Creating threads in excess will lead to unmaintainable, hard-to-debug code.
Calls to eglMakeCurrent() should be kept to a minimum due to its cost (the EGL specification states that all outstanding operations must be flushed before the context is bound to a new thread).
When shouldn’t I use multi-threaded rendering?
When you’re not CPU limited or load times are not a concern
If you’re not CPU limited by the graphics driver, you should avoid multi-threaded rendering. It will increase the complexity of your rendering engine and may even reduce performance if it’s implemented badly.
When trying to “simplify” your rendering engine
The worst use case is to frequently bind a single graphics context to different threads (using eglMakeCurrent()). This is bad for two reasons:
- The cost of context binding
As discussed above, calling eglMakeCurrent() forces the driver to flush all outstanding operations
- API calls are serialized
As a graphics context can only be bound to a single CPU thread at any point in time, all API calls will be submitted serially
So, the API calls have the same cost as in a single-threaded renderer (as call submission is serialized), but with the additional overhead of the context switches, which means performance will be worse than that of a single-threaded renderer.
It may seem like a good design decision, but rendering in this way always results in complex, messy code, where submission order is very difficult to understand (and even more difficult to debug!).
Don’t do this!
What is multi-window rendering?
Multi-window rendering is when an application renders into more than one window surface. These surfaces are then composited into a single surface by the OS’s window compositor (for example, SurfaceFlinger on Android or X11 on many Linux distros) that can be presented to the device’s screen.
In a multi-window application, there is a one-to-one mapping between CPU threads and graphics contexts. Each graphics context is used to render into its own window surface.
When should I use multi-window rendering?
Multi-window rendering is best suited to use cases where an application needs to render to more than one screen, for example when a TV is used as a second screen.
When shouldn’t I use multi-window rendering?
To compose layers
Layer Composition – Unoptimized
Layer Composition – Optimized
In the unoptimized example above, the game scene, touch controls and mini map are rendered to individual surfaces. The application then relies on the OS’s compositor to combine them into a surface that can be displayed. This approach is wasteful: memory has to be allocated for a number of surfaces, the compositor has to process transparent pixels, and the GPU’s Hidden Surface Removal (HSR) isn’t being used to its full potential (i.e. fragments that are occluded by opaque UI elements will be redundantly coloured).
In the optimized case, the game scene is rendered first, and then the touch controls and mini map are rendered directly into the same surface. In cases where this approach isn’t suitable, FBOs can be used to perform the composition within the application. For example, the game scene could be rendered to a lower resolution FBO, blitted into the app’s window surface, and the UI elements could be drawn on top at the native resolution (this technique is commonly used to increase the performance per-pixel when rendering game scenes).
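As a sketch of that FBO-based approach on OpenGL ES 3.0 (the function names drawScene() and drawUi(), the half-resolution choice, and the assumption that the FBO is already set up with a colour attachment are all illustrative, not prescriptive), a frame might be composed like this:

```c
#include <GLES3/gl3.h>

extern void drawScene(void);  /* app-defined 3D scene pass */
extern void drawUi(void);     /* app-defined touch controls + mini map */

void render_frame(GLuint sceneFbo, int winW, int winH)
{
    int loW = winW / 2, loH = winH / 2;

    /* 1. Render the 3D scene into the low-resolution FBO */
    glBindFramebuffer(GL_FRAMEBUFFER, sceneFbo);
    glViewport(0, 0, loW, loH);
    drawScene();

    /* 2. Upscale-blit the scene into the window's default framebuffer */
    glBindFramebuffer(GL_READ_FRAMEBUFFER, sceneFbo);
    glBindFramebuffer(GL_DRAW_FRAMEBUFFER, 0);
    glBlitFramebuffer(0, 0, loW, loH, 0, 0, winW, winH,
                      GL_COLOR_BUFFER_BIT, GL_LINEAR);

    /* 3. Draw the UI directly on top, at native resolution */
    glBindFramebuffer(GL_FRAMEBUFFER, 0);
    glViewport(0, 0, winW, winH);
    drawUi();
}
```

Because everything ends up in a single window surface, the OS compositor has only one layer to handle, and opaque UI elements drawn last can still benefit from the GPU’s Hidden Surface Removal.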
Multi-thread, multi-window support in PVRTrace
As of our PowerVR Graphics 3.2 SDK, PVRTrace (our OpenGL ES capture and analysis tool) supports applications that rely on these complex graphics driver interactions. This includes per-thread render state inspectors, per-thread filtering in the Call View and Frame Selector, and a thread usage timeline graph. The combination of all of these features makes multi-threaded OpenGL ES much easier for you (and us!) to debug. Additionally, the multi-thread support in our PVRVFrame OpenGL ES emulator has been significantly improved.
Multi-thread, multi-window rendering makes it very easy to shoot yourself in the foot by creating complex, hard to debug rendering engines. However, it also provides a lot of power and flexibility when used correctly. If you stick to the guidelines outlined in this post, you can improve the performance of your resource loading without introducing unnecessary headaches.