PowerVR ‘Rogue’: Designing an optimal architecture for graphics and GPU compute

When designing our PowerVR ‘Rogue’ architecture, we reworked and optimised every component for efficiency (more on this in a future article). Part of that effort came from a deeper consideration of GPU compute. In this article, I will therefore focus on two key features of the PowerVR Series6 GPU design that are linked to this effort.

PowerVR ‘Rogue’ GPUs feature scalar ALUs for high compute resource utilisation and easy programming

The PowerVR ‘Rogue’ architecture is built around scalar processing engines rather than the vector engines used in older GPU designs. There are numerous benefits in moving to a scalar processing architecture – most notably, easier optimal software development. This ease of development benefits our compiler teams, as there is no need for complex and expensive vectorisation efforts at the compiler level. It is equally far easier for developers, since it no longer matters whether they vectorise their algorithms or not. This benefit is illustrated in the graph below:

GPU compute on PowerVR 'Rogue': ALU utilization scalar vs vector architecture

As can be seen in the graph, the scalar architecture does not care whether the algorithm is written with scalar ops (R), vec2 ops (RG), vec3 ops (RGB) or full vec4 ops (RGBA), whereas the vector-based architecture is highly sensitive to vectorisation. With vector-based architectures, the efficiency problem is shifted onto the software developer, rather than being tackled directly through a modern scalar design.
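The gap shown in the graph follows from simple lane arithmetic. As a hedged illustration (the 4-wide vector ALU is an assumption for the sketch, not a statement about any specific GPU), utilisation on a vector machine is just the fraction of lanes an operation fills, while a scalar machine keeps its ALUs fully busy at any width:

```python
# Illustrative sketch only: ALU utilisation vs operation width,
# assuming a hypothetical 4-wide vector ALU.

VECTOR_LANES = 4

def vector_alu_utilisation(op_width: int) -> float:
    """A vec-N op fills N of the 4 lanes; the unused lanes are wasted."""
    return op_width / VECTOR_LANES

def scalar_alu_utilisation(op_width: int) -> float:
    """A scalar architecture issues one component per ALU, so any
    operation width keeps the hardware fully busy."""
    return 1.0

for width, name in [(1, "R"), (2, "RG"), (3, "RGB"), (4, "RGBA")]:
    print(f"{name:4s} vector: {vector_alu_utilisation(width):.0%}  "
          f"scalar: {scalar_alu_utilisation(width):.0%}")
```

The scalar column is flat at 100%, while the vector column climbs from 25% to 100% only as the code approaches full vec4 usage.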

This architectural efficiency is essential for optimising image processing algorithms. This is one of the most popular and sensible usages of GPU compute in the mobile segment, where many algorithms reject colour information as a first step, and limit processing to intensity information only. Such an approach on a scalar architecture is no problem at all.
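That first step – rejecting colour and keeping only intensity – typically looks like the sketch below, using the common Rec. 601 luma weights (the pixel values are made up for illustration):

```python
# Sketch: discard colour as a first step, keep intensity (luma) only.
# Weights are the standard Rec. 601 coefficients; pixels are examples.

def to_intensity(r: float, g: float, b: float) -> float:
    """Convert one RGB pixel to a single intensity value."""
    return 0.299 * r + 0.587 * g + 0.114 * b

rgb_image = [(255, 0, 0), (0, 255, 0), (128, 128, 128)]
intensity = [to_intensity(r, g, b) for r, g, b in rgb_image]
# Each pixel is now one value instead of three: a purely scalar
# workload, which a scalar ALU executes at full utilisation.
```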

On a poorly-implemented vector architecture however, the developer is faced with 25% of peak performance or the expensive option of vectorising the entire algorithm to process multiple intensity values in one go (e.g. 4 pixels).
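The rewrite being described can be sketched as follows: the same brightness adjustment written naturally (one intensity value per operation), and then manually packed four pixels at a time to fill a hypothetical 4-wide vector unit. Function names and the operation itself are made up for illustration:

```python
# Hypothetical sketch of manual vectorisation for a 4-wide vector ALU.

def brighten_scalar(pixels, gain):
    """Natural formulation: one intensity value per operation."""
    return [min(255, p * gain) for p in pixels]

def brighten_vec4(pixels, gain):
    """Manually packed: four intensity values per step, mimicking
    one vec4 multiply on a 4-wide vector unit."""
    out = []
    for i in range(0, len(pixels), 4):
        quad = pixels[i:i + 4]              # pack 4 pixels together
        out.extend(min(255, p * gain) for p in quad)
    return out

pixels = [10, 20, 30, 40, 50, 60, 70, 80]
assert brighten_scalar(pixels, 2) == brighten_vec4(pixels, 2)
```

Even in this toy case the packed version carries extra bookkeeping; once an algorithm mixes widths, that bookkeeping is exactly the complexity the paragraph above describes.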

Vectorising may be manageable for simple algorithms, but it quickly becomes far more complex when algorithms mix vector widths, as they commonly do. Typically, developers focus on optimising for the dominant architecture and, given current market share ratios, it seems extremely likely that scalar-based architectures such as PowerVR ‘Rogue’ will be dominant in the mobile market (no surprise given the gain in efficiency). Further strengthening this developer focus is the PC market, where compute architectures also use scalar pipelines. Algorithms ported from this market to mobile will already have been optimised for scalar rather than vector-based GPU designs.

PowerVR ‘Rogue’ GPUs have improved support for local memory

Compute APIs can expose different memory types which, depending on the implementation, may provide different performance levels. Typically these are referred to as local memory (fast, on-chip) and global memory.

When writing algorithms using just global memory, you address data as you normally would, and every access goes to system memory through a standard cache infrastructure. With local memory, however, algorithms can be rewritten to pre-load data into the local memory (a kind of software-managed on-chip cache). The algorithm then accesses this fast local store during its compute processing and, at the end, burst-writes the results back to system memory.
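This preload-process-writeback pattern can be sketched in plain Python by counting "global memory" traffic for a simple 3-tap 1D filter. The tile size and the filter are illustrative assumptions, not PowerVR specifics – the point is only the shape of the access pattern:

```python
# Sketch: global-only vs local-memory (tiled) access pattern for a
# 3-tap 1D filter. The read counters stand in for memory traffic;
# tile size and filter are illustrative assumptions.

def filter_global_only(data):
    """Every tap goes out to 'global memory': three reads per output."""
    reads = 0
    out = []
    for i in range(1, len(data) - 1):
        reads += 3
        out.append(data[i - 1] + data[i] + data[i + 1])
    return out, reads

def filter_with_local(data, tile=8):
    """Preload each tile (plus a 1-element halo) once, compute from the
    on-chip copy, then write the results back in one burst."""
    reads = 0
    out = []
    for start in range(1, len(data) - 1, tile):
        end = min(start + tile, len(data) - 1)
        local = data[start - 1:end + 1]     # one burst load incl. halo
        reads += len(local)
        for j in range(1, end - start + 1):
            out.append(local[j - 1] + local[j] + local[j + 1])
    return out, reads

data = list(range(34))
o1, r1 = filter_global_only(data)
o2, r2 = filter_with_local(data)
assert o1 == o2
print(r1, r2)   # the tiled version touches global memory far less
```

The two versions compute identical results, but the tiled one fetches each input element roughly once (plus small halo overlaps) instead of three times – the bandwidth and power saving described below.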

It should already be clear that the latter approach is far more bandwidth- and power-efficient: data is fetched into local memory once, after which all accesses stay on-chip. This is unlike the first approach, where everything is left to chance – with any cache, if you are lucky the data is still resident; if you are unlucky it has already been evicted by other accesses and must be re-fetched.

If you remember our graphics approach of Tile Based Deferred Rendering (TBDR) from other posts, you will recall that our tile sorting ensures that caches become 100% effective. It comes as no surprise, then, that Imagination has implemented the equivalent concept for compute, using the efficiency of fast on-chip memory.

GPU compute on PowerVR 'Rogue': memory hierarchy in OpenCL

Within the PowerVR ‘Rogue’ architecture, there are numerous optimisations linked to compute usage scenarios. We also continue to make our architecture more efficient and effective by studying actual practical mobile compute use cases as they come to market from third parties.

If you have any questions or feedback about Imagination’s graphics IP, please use the comments box below. To keep up to date with the latest developments on PowerVR, follow us on Twitter (@GPUCompute, @PowerVRInsider and @ImaginationPR) and subscribe to our blog feed.

