Why GPUs don’t like to share – a guide to improve your renderer on PowerVR-based platforms

Most modern GPUs don’t like to share data they own – daring to poke the bear will invariably incur a cost. This includes the PowerVR Tile Based Deferred Rendering (TBDR) architecture, which keeps a number of frames in flight so that vertex, fragment and GPU compute tasks can be dynamically scheduled for execution. For an application to perform optimally, its renderer must be written with this parallelism in mind.

The key question is – what happens when an application updates data that is still being used by the GPU? The following sections detail the common pitfalls when dealing with dynamic texture and vertex data on platforms with PowerVR Series5 and Series5XT GPUs, and describe how these pitfalls can be avoided.

Texture updates

When an application updates a texture, the driver will check if there are any unresolved renders that require the existing data.

Texture ghosting

In most cases where the texture is still needed, the driver will take an optimized path and cache the modified data until outstanding renders complete. If the driver is unable to take this path (for example, because the cache is already full), a copy will be created so the original texture can be updated while the duplicate is used by incomplete renders, allowing the GPU to continue without interruption. This process of duplicating textures is referred to as “ghosting”.

How does ghosting affect me?

The aim of ghosting is to avoid the expensive stalls that would occur if a render had to be kicked every time a texture is updated. However, a significant amount of texture ghosting can cause the driver to run out of memory. Applications that hit this issue tend to update texture atlases regularly, for example atlases of map data in navigation software or of glyphs, particularly for languages with large character sets (such as Mandarin).

In a worst case scenario, an application may be updating a texture, using it, updating it and using it repeatedly within a single frame. When an application does this, the driver will most likely create a ghost for every version of that texture that is required by a render, which means a single texture has the potential to cause a GL_OUT_OF_MEMORY error by itself!

When ghosting occurs, the entire texture will be duplicated. This is done for performance reasons, as tracking modifications and sampling textures correctly would have a significant overhead.
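
To make the worst case concrete, here is a minimal sketch of that pattern, assuming a hypothetical GL_ALPHA glyph atlas (the function and parameter names are ours, purely for illustration). Every glTexSubImage2D call touches a texture that the preceding glDrawArrays may still reference, so each iteration can leave another ghost behind.

```c
#include <GLES2/gl2.h>

/* Illustrative anti-pattern: update the atlas, draw with it, update it
 * again, all within a single frame. Each update that follows a draw may
 * force the driver to duplicate ("ghost") the entire atlas. */
void draw_glyph_run(GLuint atlasTex, int glyphCount,
                    const void *glyphPixels[], const int glyphRect[][4])
{
    glBindTexture(GL_TEXTURE_2D, atlasTex);

    for (int i = 0; i < glyphCount; ++i)
    {
        /* Upload one glyph cell into the atlas... */
        glTexSubImage2D(GL_TEXTURE_2D, 0,
                        glyphRect[i][0], glyphRect[i][1],  /* x, y */
                        glyphRect[i][2], glyphRect[i][3],  /* w, h */
                        GL_ALPHA, GL_UNSIGNED_BYTE, glyphPixels[i]);

        /* ...then immediately draw a quad that samples it. */
        glDrawArrays(GL_TRIANGLE_STRIP, i * 4, 4);
    }
}
```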

So…what can I do about this?

Rule of thumb: Avoid texture updates

To retain as much parallelization as possible, textures uploaded to the GPU should be seen as read-only blocks of data. With this in mind, it’s possible to refactor many common use cases to avoid the cost of ghosting.

Optimizing a sprite renderer

A common case that can be optimized is a sprite renderer in a 2D game.

Un-optimized texture atlas

Many developers will use a large texture atlas that contains all of the sprites they currently need (as in the figure above). This seems like a sensible idea at first, as using the smallest possible number of texture atlases enables efficient batching to keep the number of draw calls to a minimum. However, when a region of the texture atlas is updated, the driver will have to ghost the entire texture. If this happens frequently within a small number of frames, the memory required for ghosted textures may become a problem.

Optimized texture atlases

To combat this, large atlases can be broken down into a number of smaller atlases (see above). The contents of each small atlas can then be grouped by association in such a way that the atlases will be touched as infrequently as possible and, when they are updated, less memory will be required for ghosting. As an example, a 2D game could place persistent characters into one atlas, level-specific sprites into another and all other sprites into a miscellaneous atlas.

In all cases, sprite updates should be batched so each atlas is touched as little as possible.
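
As a rough sketch of this batching, the code below (the SpriteAtlas structure and helper names are ours) collects pending sub-region updates per small atlas and flushes them in a single pass before any drawing, so each atlas is touched at most once per frame.

```c
#include <GLES2/gl2.h>

/* Hypothetical per-atlas bookkeeping: queued sub-region updates are
 * flushed together, before any draw call samples the atlas. */
typedef struct
{
    GLuint      tex;              /* the atlas texture object       */
    int         dirtyCount;       /* number of queued updates       */
    int         dirtyRect[64][4]; /* x, y, width, height per update */
    const void *dirtyPixels[64];  /* pixel data per update          */
} SpriteAtlas;

static void flush_atlas_updates(SpriteAtlas *atlas)
{
    if (atlas->dirtyCount == 0)
        return;                   /* untouched atlases are never ghosted */

    glBindTexture(GL_TEXTURE_2D, atlas->tex);
    for (int i = 0; i < atlas->dirtyCount; ++i)
    {
        glTexSubImage2D(GL_TEXTURE_2D, 0,
                        atlas->dirtyRect[i][0], atlas->dirtyRect[i][1],
                        atlas->dirtyRect[i][2], atlas->dirtyRect[i][3],
                        GL_RGBA, GL_UNSIGNED_BYTE, atlas->dirtyPixels[i]);
    }
    atlas->dirtyCount = 0;
}

/* Upload everything at the start of the frame, draw afterwards: at most
 * one "touch" (and so at most one potential ghost) per small atlas. */
void begin_frame(SpriteAtlas *characters, SpriteAtlas *level, SpriteAtlas *misc)
{
    flush_atlas_updates(characters);
    flush_atlas_updates(level);
    flush_atlas_updates(misc);
}
```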

If an application is still running out of memory, the frame-rate can be limited. Doing so will reduce the likelihood of ghosting, as there is less chance of an unresolved render needing the data.

If this does not solve the problem, an application can force the render to serialize. This will remove ghosting entirely as there will be no outstanding renders, but this approach is likely to cause noticeable performance degradation. For this reason, serialization should only be considered as a last resort.
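
One way to serialize, sketched below, is to call glFinish() before modifying the texture; this blocks the GL thread until every outstanding render has completed, so the update cannot trigger ghosting, at the cost of the stall described above.

```c
#include <GLES2/gl2.h>

/* Last-resort sketch: glFinish() waits for all in-flight renders to
 * resolve, so nothing on the GPU still references the texture and no
 * ghost copy is needed. The wait itself is the serialization cost. */
void update_texture_serialized(GLuint tex, GLsizei width, GLsizei height,
                               const void *pixels)
{
    glFinish();

    glBindTexture(GL_TEXTURE_2D, tex);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                    GL_RGBA, GL_UNSIGNED_BYTE, pixels);
}
```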

VBO updates

When VBOs are updated, the driver will kick any outstanding vertex processing (TA) tasks that rely on the existing data. When the task is kicked, the driver will block the application’s GL thread until the vertex processing has completed and it’s safe to update the contents of the VBO.

VBO update blocking

The reason a different approach is taken for VBOs is that the cost of kicking a TA task and waiting for it to complete is much cheaper than an equivalent approach for textures. An additional benefit of this solution is that the GPU does not get interrupted, so it can continue to process its workload as fast as possible.
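
To show where that block happens, here is a minimal sketch (names are ours) that modifies a VBO a previous frame's draw call may still be reading; glBufferSubData is the point at which the GL thread would wait.

```c
#include <GLES2/gl2.h>

/* Illustrative anti-pattern: if an outstanding render still reads 'vbo',
 * the driver kicks the pending vertex processing and blocks the calling
 * thread inside glBufferSubData until it is safe to write. */
void update_and_draw(GLuint vbo, const void *vertexData, GLsizeiptr sizeBytes,
                     GLsizei vertexCount)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferSubData(GL_ARRAY_BUFFER, 0, sizeBytes, vertexData); /* possible stall */

    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, (const void *)0);
    glEnableVertexAttribArray(0);
    glDrawArrays(GL_TRIANGLES, 0, vertexCount);
}
```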

How can I avoid the stall?

Rule of thumb: Avoid VBO updates

Similar to the advice for textures, it’s best to think of VBOs as read-only blocks of data. For this reason, VBOs should only be used for static attribute data. Dynamic attributes that change on a per-frame basis should be submitted directly to GL as client-side vertex arrays instead of by modifying VBOs. Doing so will avoid the stall.
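
A minimal sketch of this for OpenGL ES 2.0 is shown below; the attribute locations and the dynamicColours array are assumptions for illustration. Static positions stay in a VBO that is never modified, while the per-frame attribute is sourced straight from client memory.

```c
#include <GLES2/gl2.h>

/* Static data lives in a VBO that is never updated; the dynamic
 * attribute is passed as a client-side vertex array, so no VBO is
 * modified and no stall can occur. */
void draw_dynamic(GLuint staticPositionVbo, const GLfloat *dynamicColours,
                  GLsizei vertexCount)
{
    /* Attribute 0: static positions from the (read-only) VBO. */
    glBindBuffer(GL_ARRAY_BUFFER, staticPositionVbo);
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, (const void *)0);
    glEnableVertexAttribArray(0);

    /* Attribute 1: per-frame colours from client memory. Unbinding the
     * array buffer makes the pointer a client-side address, not an offset. */
    glBindBuffer(GL_ARRAY_BUFFER, 0);
    glVertexAttribPointer(1, 4, GL_FLOAT, GL_FALSE, 0, dynamicColours);
    glEnableVertexAttribArray(1);

    glDrawArrays(GL_TRIANGLES, 0, vertexCount);
}
```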

VBO circular buffer

In situations where attributes need to be updated but may be reused for a number of frames, a circular buffer of VBOs can be used (see figure above). A circular buffer consisting of n VBOs (where n is the number of frames in flight) should be sufficient to pair a VBO with each in-flight frame, and thus avoid blocking (as VBOs will only be updated when they are not being accessed by the GPU). Although this approach avoids the stall, it will increase the memory requirements of your application. If you are already approaching GL_OUT_OF_MEMORY territory, the slight overhead of the stall may be a more efficient option than an out-of-memory fallback.
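
The sketch below shows one way to build such a circular buffer, assuming three frames in flight (NUM_BUFFERS should match your platform's actual figure; the helper names are ours).

```c
#include <GLES2/gl2.h>

#define NUM_BUFFERS 3  /* assumption: matches the number of frames in flight */

static GLuint s_vbos[NUM_BUFFERS];
static int    s_frame = 0;

/* Allocate the ring of VBOs once, up front. */
void create_vbo_ring(GLsizeiptr sizeBytes)
{
    glGenBuffers(NUM_BUFFERS, s_vbos);
    for (int i = 0; i < NUM_BUFFERS; ++i)
    {
        glBindBuffer(GL_ARRAY_BUFFER, s_vbos[i]);
        glBufferData(GL_ARRAY_BUFFER, sizeBytes, NULL, GL_DYNAMIC_DRAW);
    }
}

/* Each frame, write into the VBO that was last used NUM_BUFFERS frames
 * ago; any render that referenced it should have resolved by now, so
 * glBufferSubData does not block. Returns the VBO to draw with. */
GLuint update_vbo_ring(const void *vertexData, GLsizeiptr sizeBytes)
{
    GLuint vbo = s_vbos[s_frame % NUM_BUFFERS];
    s_frame++;

    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferSubData(GL_ARRAY_BUFFER, 0, sizeBytes, vertexData);
    return vbo;
}
```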

Conclusion

To ensure your application’s 3D graphics are as efficient as possible, rendering code should be designed in such a way that the GPU will not be disturbed. This will give great performance, and it will also lead to a well-designed solution that is easier to maintain and port to new platforms.

If you’d like to learn more about the PowerVR architecture and best practices when writing graphics applications, check out our Performance Recommendations and PowerVR Series5 Architecture Guide for Developers documents. If you have any questions about this tutorial, you can contact our DevTech support team on Imagination’s PowerVR Insider dedicated forum. Remember to follow us on Twitter (@ImaginationPR and @PowerVRInsider) and subscribe to our blog.
