GPU compute on PowerVR with Android’s Filterscript API

Android is Google’s mobile and embedded operating system, targeting consumer electronics products ranging from smartphones and tablets to handheld gaming consoles, smart TVs and even wall ovens. Yes, you read that correctly: an oven capable of running OpenGL ES apps. But graphics APIs aside, the latest version of the Android operating system (Android 4.2 Jelly Bean) also includes two compute APIs that will open up new worlds for app developers: Renderscript Compute and Filterscript.

Renderscript is not new in itself. Android 4.0-based devices could already use it for graphics, until Android 4.1 was introduced and the Renderscript graphics engine was deprecated. The compute part of the API, however, had certain limitations (for example, it could only run on the CPU). With Android 4.2, things have moved in a good direction for GPU compute enthusiasts: not only has Google enabled Renderscript to execute on the GPU, it has also introduced a second API, called Filterscript, that targets specific programming use cases.

Filterscript’s relevance for mobile devices

Filterscript is essentially a carefully chosen subset of the Renderscript APIs that allows developers to run code on a potentially wider variety of processors (CPUs, GPUs, and DSPs). For example, a script can include pragmas that tell the Renderscript runtime it does not require strict IEEE 754-2008 floating point semantics. Filterscript files also use a different extension from Renderscript files (.fs instead of .rs). The changes from Renderscript Compute were designed explicitly to optimise parallel processing workloads that predominantly run on the GPU, offering developers a new set of tools dedicated to pixel processing (similar to Khronos APIs).
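
To make the per-pixel model concrete, here is a minimal sketch in plain C rather than actual Filterscript, so that it compiles on any desktop. It assumes the kernel style that Android documents for Filterscript, where a function receives one input element plus its coordinates and returns one output element. The names pixel_t, invert_kernel and for_each_pixel are our own illustrative choices and are not part of any Android API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One RGBA pixel, standing in for Renderscript's uchar4 vector type. */
typedef struct { uint8_t r, g, b, a; } pixel_t;

/* Hypothetical per-pixel kernel: takes one input pixel and its coordinates,
 * returns one output pixel. It touches no pointers and no global state,
 * which is the property that lets a runtime map it onto a GPU or DSP. */
pixel_t invert_kernel(pixel_t in, uint32_t x, uint32_t y)
{
    (void)x; /* coordinates are unused by this particular filter */
    (void)y;
    pixel_t out = {
        (uint8_t)(255 - in.r),
        (uint8_t)(255 - in.g),
        (uint8_t)(255 - in.b),
        in.a /* alpha is passed through unchanged */
    };
    return out;
}

/* Stand-in for the runtime: applies the kernel to every pixel of a
 * row-major image. This loop is an assumption of the sketch only. */
void for_each_pixel(const pixel_t *in, pixel_t *out,
                    uint32_t width, uint32_t height)
{
    assert(in != NULL && out != NULL);
    for (uint32_t y = 0; y < height; ++y) {
        for (uint32_t x = 0; x < width; ++x) {
            size_t i = (size_t)y * width + x;
            out[i] = invert_kernel(in[i], x, y);
        }
    }
}
```

In a real .fs file the explicit loop would not exist: the Renderscript runtime invokes the kernel once per element and decides which processor executes it.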

The main benefit of Filterscript over Renderscript is portability across processors: it reduces instruction set lock-in and offers heterogeneous platforms a better opportunity to benefit from GPU acceleration.

This new compute API is particularly suited to image processing operations and aims to replace kernels that one would typically write in GLSL. Since we have a long history of helping companies develop specific use cases for GPU compute on PowerVR Series5 and Series5XT cores, we’ve developed a Filterscript demo, run a selection of the most popular Android image processing scripts on three computing platforms integrating a PowerVR SGX544 GPU, and observed the results.

Image adjustment example running on a PowerVR SGX544SC-based platform with all filters turned on

Image adjustment example running on a PowerVR SGX544SC-based platform with one filter turned on

PowerVR is the best mobile GPU compute architecture

Before delving into our findings, a few things need to be mentioned. First and foremost, Imagination’s PowerVR architecture has been designed for efficiency from day one and continues to provide the most power efficient family of GPUs available on the market today. Key advantages such as the TBDR (Tile-Based Deferred Rendering) approach to rendering graphics, the PVRTC technology for texture compression, and a unified architecture focused on both fillrate and GPU compute efficiency provide our partners with the tools they require to succeed in a dynamic market.

Image adjustment example running on the dual-core CPU (left) and the PowerVR SGX544SC GPU (right)

Our dedicated engineering team has put a lot of effort into providing the mobile and embedded market with the best solution, one that is optimized for low power but is also able to deliver unmatched performance points. Both partners and industry analysts have confirmed the low power characteristics of PowerVR SGX and ‘Rogue’ cores.

Another important point is that, for mobile, power is critically important. While clock frequency is no longer a problem and silicon area is less of a concern, all designs become power limited, so efficiency determines your performance. This is where our carefully balanced design and robust architecture come into play, offering a scalable roadmap that can be integrated with a wide range of CPU and bus interconnect architectures and supports all major compute APIs such as OpenCL, Renderscript Compute, Filterscript and, for future PowerVR generations, standards promoted by groups like the HSA Foundation.

Filterscript examples and final words

Developers have found Filterscript very efficient at handling simple scripts that may otherwise have been written in GLSL. Many apps also have large amounts of C or C++ pre-processing code that runs before the final operations in Filterscript. Therefore, we’ve ported existing OpenCL applications as well as developed new Filterscript code to test whether PowerVR GPUs are able to handle GPU compute code, and so far the results have been very promising.

But before we give you the results, here is a comparison of the peak GFLOPS performance of each platform, when looking at the total processing power of the respective CPUs and GPUs.

Multicore CPUs and PowerVR Series5XT GPUs GFLOPS
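
For readers unfamiliar with how such peak figures are derived, theoretical peak GFLOPS is commonly estimated as the number of execution units, multiplied by the floating point operations each unit can issue per clock cycle, multiplied by the clock frequency in GHz. The sketch below shows only that arithmetic. The function name peak_gflops and every value in the usage note are hypothetical and are not the figures behind the chart above.

```c
#include <assert.h>

/* Theoretical peak throughput in GFLOPS:
 *   units * floating point operations per cycle per unit * clock in GHz.
 * This is an upper bound; sustained throughput on real code is lower. */
double peak_gflops(unsigned units, unsigned flops_per_cycle, double clock_ghz)
{
    assert(clock_ghz >= 0.0);
    return (double)units * (double)flops_per_cycle * clock_ghz;
}
```

For instance, a hypothetical two-core CPU issuing 4 FLOPs per cycle per core at 1 GHz and a hypothetical four-unit GPU issuing 8 FLOPs per cycle per unit at 0.25 GHz both evaluate to 8 GFLOPS, which illustrates how a wider processor can match a narrower one at a quarter of the clock frequency.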

The performance comparison charts below show how some well-known image processing filters implemented in Filterscript and Renderscript ran on platforms with a PowerVR Series5XT GPU.

Renderscript and Filterscript performance on a PowerVR SGX544MP3-based platform

Renderscript and Filterscript performance on a PowerVR SGX544MP2-based platform

Renderscript and Filterscript performance on a PowerVR SGX544SC-based platform

Notice that even though some scripts have roughly the same performance on both multicore CPUs and GPUs, it is important to remember each processor’s running frequency and peak power consumption. The key takeaway for Filterscript and Renderscript running on the PowerVR GPU is that you get similar or better performance (up to three to seven times better in some cases) at a much lower frequency, and therefore much lower system power consumption. Furthermore, by offloading parts of your application to the PowerVR GPU, the CPU is free to handle other tasks and the overall system efficiency increases as well.

We’ve seen GPU compute examples where moving the code to the GPU achieved massive savings of up to 1.5 W at the system level, so taking the time to optimize your code to run on the appropriate processor for each task can definitely improve the overall user experience in a meaningful way.

Interested in Imagination’s PowerVR GPUs and Android’s APIs for GPU Compute? Then follow us on Twitter (@GPUCompute, @PowerVRInsider and @ImaginationPR) and keep coming back to our blog. We even have a dedicated tag where you can find all you need to know about GPU Compute, HPC, heterogeneous processing and other similar topics.

  • Mos

    Hi, Alexandru
    How can I get the Image adjustment example code?

  • Alexandru Voica (http://withimagination.imgtec.com/index.php/author/alexvoica)

    Hi,

    The demo is a Filterscript application that applies per-pixel filtering. The kernels were ported from OpenCL, but the code is part of our demonstration kit, which we only release to our partners.

    It should be very straightforward to build something similar based on the Filterscript/Renderscript documentation in the latest Android SDK.

    Best regards,
    Alex.

  • Sean Lumly

    How do you find the performance of filterscript compared to OpenCL on your reference platform?

  • Alexandru Voica

    It obviously depends on the kind of code you’re writing. Because of the differences between the two APIs (outlined earlier by Kristof), we’ve noticed differences in performance for use cases that make sense in mobile.

    This is because Renderscript/Filterscript was designed to make scheduling easier and abstract between processing elements (CPU, GPU, DSP, etc.) in an SoC which might lead to a slight decrease in peak performance. However, Android devs are constantly pushing the boundaries in terms of writing faster code.

    Regards,
    Alex.

  • Sean Lumly

    Thanks for the response, though I’m not sure that the question was answered — I understand that performance will be different given the platforms, but how they compare with similar/equal kernels on a GPU (say Rogue) is still ambiguous (to me).

  • Alexandru Voica

    I thought you were asking me to compare Renderscript/Filterscript vs. OpenCL performance on the same family (mainly, SGX). Obviously, Series6 has much better compute performance so all compute APIs see increased peak performance.

    For example, we’ve run our OpenCL image processing demos on the LG H13 platform (lower clocked PowerVR G6200 GPU) and have seen performance increases of several orders of magnitude compared to multicore SGX platforms.

    I hope that answers your question.

    Regards,
    Alex.

  • Sean Lumly

    Thanks again, though I perhaps should have been clearer.

    I was wondering about the comparison between the API performance using a single GPU (preferably a Rogue GPU) as a reference and similar kernels for each API. I guess I want to know how RS/FS generally compares to CLES in terms of performance.

    I’ve read that FS is still at a 2-3x speed disadvantage, though the code may not have been properly optimized, and the GPU was different than the mighty Rogue.

  • T

    You will likely not get an answer from ImgTech on that, as Google is pushing RSC like crazy, and who do they sell GPUs to? My experience on Renderscript compute on other platforms (T604) can be MUCH faster even without optimizations and much faster 3-8x if you take time to optimize fully. More importantly, RSC offers no guarantee your code will run on the GPU. Normally you might run one alg on the GPU but another on a CPU. RSC offers no method to do this and you are just left guessing what happened. RSC was not well thought out and is being shoved down mobile GPU vendors’ throats.

  • Sean Lumly

    Thanks! I’m not opposed to Renderscript as a compute API. I like the platform agnostic aspect of the API and the ‘future proofing’ of existing code on new and different platforms — at the expense of a bit of speed. Thankfully it seems that RS and FS both are seeing some major performance improvements version over version, though details of these cases and benchmarks are still light. And as you mention, it is still very necessary to optimize code, even for this high-level API.

    I understand why Google is pushing RS/FS over something a bit more low-level like OpenCL given the deluge of hardware combinations that Android is distributed on. I think the major fault with RS/FS at the moment is lack of documentation.