GPU compute – what is it and why do we care?

GPU compute is one of the hottest topics today – but what exactly is it, and why should we all care about it for mobile devices?

Answering the first question is relatively easy, as the name already gives it away: GPU compute is about using the GPU for something other than graphics, namely generalized compute (usually in the form of number crunching). This usage scenario makes sense, as GPUs have very large numbers of Arithmetic Logic Units (ALUs), and doing shader calculations (for vertex and pixel processing) is, at a basic level, no different from other, more generic types of calculation.

GPU compute means highly parallel, independent processing

So does that mean the GPU will replace the CPU, which has historically been the resource for number crunching? Not quite, as another key characteristic comes into play with GPU compute: parallelism. GPUs are number-crunching beasts, but they have been designed for graphics processing, and in that usage scenario the data elements are inherently independent. Independent means that the calculation for one vertex of a 3D object does not impact the calculation for any other vertex being processed. The same is true for pixel processing: the calculations for one pixel on the screen are not directly influenced by the calculations for any other pixel on the same screen.

This independence allows for massively parallel processing, where many vertices and/or pixels can be processed in parallel. GPUs are very much designed with this parallelism in mind, and a GPU typically consists of tens or even hundreds of parallel compute engines which all process data independently. This architectural design means that GPUs are fast and highly efficient at processing any data set whose elements are as independent as those of the graphics use case.
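To make this concrete, here is a minimal OpenCL C sketch of such an independent, data-parallel workload; the kernel and parameter names are illustrative, not taken from any particular PowerVR example. Each work-item computes exactly one output element and never touches any other, which is what lets the GPU schedule as many of them in parallel as its hardware allows.

```c
/* Minimal, illustrative OpenCL C kernel: each work-item processes exactly
 * one element of the input, with no dependency on any other element, so
 * the GPU is free to run as many work-items in parallel as it has ALUs. */
__kernel void scale_and_add(__global const float *a,
                            __global const float *b,
                            __global float *out,
                            const float scale)
{
    size_t i = get_global_id(0);   /* unique index of this work-item */
    out[i] = scale * a[i] + b[i];  /* independent of every other out[j] */
}
```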

This is exactly what makes GPU compute different from CPU compute. CPU compute has historically been dependent compute: the CPU acts as a decision engine, where what is currently being processed depends strongly on what was processed before. This is known as serial processing. GPUs, on the other hand, have many parallel compute units, and using them for serial compute is typically inefficient, as the inherent architectural parallelism goes to waste and the resulting performance is lower. CPUs have been designed with serial workloads in mind and thus excel, as far as is possible, at executing serial tasks.
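For contrast, here is a short illustrative C snippet (our own example, not from the article) of a serially dependent computation: a running sum in which every iteration needs the result of the previous one, so the work cannot simply be spread across parallel ALUs.

```c
/* Illustrative example of serial, dependent compute: a running sum where
 * every iteration depends on the result produced by the previous one.
 * The dependency chain forces serial execution - the kind of work a CPU
 * core is built to run quickly and a wide SIMD GPU is not. */
float running_sum(const float *x, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        sum += x[i];   /* iteration i needs the value produced by i-1 */
    }
    return sum;
}
```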

GPU parallelism comes with another characteristic related to the handling of branching. Branching means that, during execution, a per-element test operation decides which set of instructions to run next. This breaks the parallel behaviour, as we get divergence between the executed tasks. Again, this is simply linked to how GPUs are designed with many parallel engines matching graphics usage scenarios: when you have many parallel ALU engines, you need to control them, and that requires logic which decodes and tracks the instructions to be executed.

To allow for efficient designs, GPUs build on the assumption that many ALUs will work on independent, parallel data, but that they will all execute the same instruction. This is known as SIMD (Single Instruction, Multiple Data). The approach allows GPU designs to lay down many ALU engines while limiting the control logic overhead (in terms of silicon area and power). But it has an impact when branching occurs, because execution diverges: different data elements may end up on different branches of the program's execution path. Branching occurs even in graphics processing, as more complex pixel and vertex shading effects contain branches. While branching may reduce efficiency, architectures are still typically designed to handle these branches as cleverly as possible.
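The sketch below (again with illustrative names) shows what divergence looks like at the kernel level: work-items in the same SIMD group that take different sides of the branch can no longer execute the same instruction together, so the hardware typically runs both paths and masks off the inactive lanes.

```c
/* Illustrative OpenCL C kernel with a per-element branch. Work-items whose
 * data takes different sides of the 'if' diverge: within a SIMD group both
 * paths are usually executed, with inactive lanes masked off, which lowers
 * efficiency compared with a branch-free kernel. */
__kernel void threshold(__global const float *in,
                        __global float *out,
                        const float limit)
{
    size_t i = get_global_id(0);
    float v = in[i];
    if (v > limit) {
        /* some lanes take the costly path... */
        out[i] = native_sqrt(v) * native_log(v + 1.0f);
    } else {
        /* ...while their neighbours take the cheap one */
        out[i] = v * 0.5f;
    }
}
```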

In summary, GPUs are very good at massive parallelism: lots of independent data elements, all executing the same program with minimal and ideally no divergence due to code branches. CPUs, on the other hand, are designed to handle serial processing tasks with constant dependencies, and they handle branching effectively.

CPUs and GPUs should be considered as complementary designs, optimized for different types of processing. This impacts the hardware architecture where the CPU is optimized for flow control and low memory latency by dedicating transistors to control and data caches. The GPU is optimized for data-parallel computations that have a high ratio of arithmetic operations to memory operations, dedicating transistors to data processing rather than data caches and flow control, as illustrated below:

A typical representation of CPU and GPU architectures

The right task running on the appropriate processor

This level of hardware optimization for specific tasks is what brings us to the “why” part of this section, as hardware architectural optimization brings with it power efficiency – the most critical element in the battery-powered mobile market segment.

Executing massively parallel tasks on a CPU works, but the throughput is fairly low because the architecture is designed to handle serial, divergent tasks, and hence the power efficiency of running such a task is low. On the other hand, running a highly parallel, non-divergent task on a GPU is extremely fast, as this is what GPUs were designed for; here the GPU delivers enormous throughput and also very high power efficiency.

The reverse is also true: running highly divergent, serial tasks on a GPU results in lower utilization, where many ALUs may sit idle. In this scenario, throughput and power efficiency will be lower. On the other hand, running a serial, divergent task on a CPU is exactly what the CPU was designed for, and hence hardware utilization and throughput will be high and the power efficiency will be better.

This power and performance benefit is demonstrated in the images below. They are both taken from this video showing a highly parallel compute task – image processing – where each image pixel is independent of all other pixels processed, using the same program and instructions with zero divergence: an optimal scenario for a GPU. Running the algorithm on our PowerVR GPU results in a much higher framerate than a multicore CPU implementation running at a clock frequency many times higher:

Buffer sharing between OpenCL and OpenGL ES (top left) is supported on PowerVR GPUs; it provides better performance and lower CPU overhead

The screenshots above demonstrate both the resulting high performance and the efficiency benefit, which leads to optimized power consumption. The GPU is not only fast, it also uses far less power to achieve that higher performance, and this matters most for mobile devices, where low power consumption is critical.
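As a rough illustration of the kind of per-pixel work being benchmarked above, here is a hedged sketch of a simple OpenCL image kernel (a greyscale filter); the kernel name and the particular filter are our own choices, not the exact code used in the video. Every work-item reads and writes one pixel, all pixels run the same instructions, and there is no divergence: the ideal GPU compute workload.

```c
/* Sketch of a per-pixel OpenCL image filter (simple greyscale conversion).
 * One work-item per pixel, no dependencies between pixels, no branches. */
__constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP_TO_EDGE   |
                           CLK_FILTER_NEAREST;

__kernel void greyscale(__read_only image2d_t src,
                        __write_only image2d_t dst)
{
    int2 pos = (int2)((int)get_global_id(0), (int)get_global_id(1));
    float4 rgba = read_imagef(src, smp, pos);

    /* standard luma weighting of the colour channels */
    float luma = 0.299f * rgba.x + 0.587f * rgba.y + 0.114f * rgba.z;
    write_imagef(dst, pos, (float4)(luma, luma, luma, rgba.w));
}
```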

PowerVR GPU compute, for parallel, non-divergent processing, has been proven to deliver higher performance and better power efficiency than using the CPU for these tasks. For other types of processing, the roles may be reversed, making the CPU and GPU complementary processing solutions. Recognizing this split is critical going forward. There is no sense in making CPUs more parallel-capable or GPUs more serial-capable – it is far better to ensure both types of processing resource are available, optimized for their respective optimal usage scenarios, and to assign each workload to the most suitable processor available within an SoC.

If you have any questions or feedback about Imagination’s graphics IP, please use the comments box below. To keep up to date with the latest developments on PowerVR, follow us on Twitter (@GPUCompute, @PowerVRInsider and @ImaginationPR) and subscribe to our blog feed.


  • Mazinger-z

    The Rogue GPU looks like a good jump forward compared to Series 5.

    I’m very curious what you guys are going to achieve with the MIPS architecture.

    Any idea when we are going to see the new MIPS processors in smartphones and tablets?

  • http://withimagination.imgtec.com/index.php/author/alexvoica Alexandru Voica

    Hi,

    Thanks for the comment. Indeed, PowerVR Series6 offers multiple architectural improvements over SGX (Series5/Series5XT) GPUs. This is the first article in a series of blog posts focused on the ‘Rogue’ architecture so stay tuned for more.

    We have seen increased interest in the MIPS Series5 ‘Warrior’ CPUs and received positive feedback from key partners. I can’t give you an exact timeframe for when you will see these CPUs in mobile devices, but we remain committed to expanding the existing market share for MIPS cores.

    Best regards,
    Alex.

  • TWatcher

    I note that AnandTech discovered that the camera app in the Galaxy S4 was one of the very few first-party apps to run the PowerVR GPU at the full 532MHz, and they said it was apparent during the application of visual filters.

    Might this suggest that the filters are being run on the GPU ?

  • Matt

    We need Android support for OpenCL! Hope you can push for this over the awful RenderScript compute. The recent actions by Google to prevent OpenCL are just sad for the industry and for developers alike.

  • http://withimagination.imgtec.com/index.php/author/alexvoica Alexandru Voica

    Hi Matt,

    We support all relevant mobile GPU compute standards, including Renderscript/Filterscript and OpenCL. It is up to our ecosystem to decide which one to promote, but we see both as complementary standards. There’s more to come on how each was designed and their characteristics in a future blog article.

    Best regards,
    Alex.

  • http://withimagination.imgtec.com/index.php/author/alexvoica Alexandru Voica

    Hi,

    There are indeed certain scenarios in which you can use the GPU for image processing. One of them is to use the camera input as a texture and apply certain filters on it using either OpenGL or OpenCL.

    See an example of this here:
    https://www.youtube.com/watch?v=lfuEpjgzvCE

    Best regards,
    Alex.

  • Matt

    I appreciate your good faith; however, Google’s recent removal of the Clang front-end compiler on their devices, specifically to prevent OpenCL developers from compiling code, speaks volumes about their unwillingness to let developers choose what is best. I hope you won’t support these draconian-style blocks by Google. It is a sad, sad day to see this from a company claiming to be open. Either way, it is really frustrating to see a company go out of their way to specifically block an open standard such as OpenCL. It’s one thing to say you aren’t supporting OpenCL and another to spend time and resources preventing OpenCL code from compiling.

    I have tried RenderScript compute; however, it is very limiting, and not being able to force code to run on a specific processor is a deal breaker. You have no way to optimize otherwise. Obviously your CPU version would be different from your GPU version, but there is no way to say “run this one if you are going to run on the GPU, or this one if you are running on the CPU”. It’s not like we are dealing with MASSIVE performance on these platforms, and we need to capture what we can.