Imagination’s Meta CPU: advanced simultaneous multithreading

In today’s increasingly competitive landscape, edging out other solutions both in benchmarks and in real-life usage scenarios is a matter of efficiently combining the resources at hand rather than relying on a single factor or performance metric. Therefore, after returning from a successful 2012 Linley Tech 6th Annual Processor Conference, we’ve decided to expand on Paul Evans and Jim Whittaker’s presentation on Meta’s innovative latency-tolerant CPU architecture, which focused on what simultaneous multithreading can bring to embedded computing.

Memory latency and its impact on computing architectures

For most platforms, system latency is quickly becoming the bottleneck. The traditional CPU-centric approach was to rely on dedicated memory with fixed, predictable access times (tightly coupled memory, for example) and hope that Out-of-Order (OoO) execution and branch prediction would mask latency issues. This proved to be a tolerable “good enough” solution as long as the cost and performance metrics were reasonable.

Meta simultaneous multithreading CPU - CPU centric systems

CPU-centric embedded system

Recent mobile and embedded SoC architectures have followed a different design paradigm: using a larger common memory pool (to reduce cost) and introducing a multilevel cache hierarchy to maintain performance. This works well until the inevitable cache miss, at which point we enter a world of variable and potentially very long latencies. Right-sizing these caches can be a complicated matter (cost vs. performance vs. application behavior) – imagine an application dereferencing its way through a large table: suddenly a technique like OoO execution is of little use. Unpredictable memory behavior, limited operating system support (Google’s Android did not introduce support for multicore processors until v3.0 Honeycomb, while Windows Phone 7 still relies on a single core) and increased power consumption also make some of the techniques we see today impractical or costly.
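To make the table-dereferencing case concrete, here is a minimal C sketch (generic code with an arbitrary table size, not taken from any Meta workload) in which every load depends on the result of the previous one; when those loads miss in the cache, an out-of-order core has no independent work to reorder around the miss and simply stalls:

    #include <stdio.h>
    #include <stdlib.h>

    #define TABLE_SIZE (1u << 20)   /* ~1M entries: far larger than typical L1/L2 caches */

    /* Dependent (pointer-chasing) walk: each index comes from the previous load,
     * so the next load cannot start until the current one returns. When an entry
     * misses in the cache, OoO execution has no independent instructions to
     * reorder around the miss and the pipeline stalls. */
    static unsigned chase(const unsigned *table, unsigned idx, unsigned steps)
    {
        while (steps--)
            idx = table[idx];        /* serial chain of potentially cache-missing loads */
        return idx;
    }

    int main(void)
    {
        unsigned *table = malloc(TABLE_SIZE * sizeof *table);
        if (!table)
            return 1;

        /* Fill the table with a pseudo-random permutation so consecutive
         * lookups land on unrelated cache lines. */
        for (unsigned i = 0; i < TABLE_SIZE; i++)
            table[i] = (i * 2654435761u) & (TABLE_SIZE - 1);

        printf("%u\n", chase(table, 0, 10u * TABLE_SIZE));
        free(table);
        return 0;
    }

Multithreading attacks exactly this situation: while one thread waits on such a miss, another thread’s instructions can keep the pipeline busy.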

Meta simultaneous multithreading CPU - traditional SoC architectures

Traditional SoC architecture

It therefore becomes clear that, as embedded technologies move forward, common solutions such as increasing clock frequencies, adding more cache or increasing the number of CPU cores on a chip will only make matters more complicated rather than deliver leaps in performance. As processors compete for control of system resources, the bus fabric becomes a convoluted matrix and large latencies inevitably start to appear. These in turn erode overall system performance and waste power, which is why going to a dual-core CPU might only deliver a 30% to 70% improvement.

Meta simultaneous multithreading CPU - current SoC architectures

Current SoC architecture

A dual-core central processor can be power hungry, expensive and complex: it typically delivers only around a 0.6x increase in performance yet requires a 2.2x increase in area due to the connecting and maintenance logic. The scaling can be even worse with quad core, which is why we are currently seeing an industry trend to keep the CPU core count at two and increase the number of GPU cores instead. Because GPUs are multi-threaded processing units, this approach ultimately boosts the compute resources of a typical mobile platform into the hundreds of GFLOPS to several TFLOPS range.
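As a back-of-the-envelope check, the short C snippet below uses only the figures quoted above (illustrative numbers, not measured silicon data) to compare performance per unit of area:

    #include <stdio.h>

    /* Back-of-the-envelope comparison using the figures quoted above: a dual-core
     * cluster delivering ~1.6x the performance of one core (a 0.6x uplift) while
     * occupying ~2.2x the area. Values are illustrative, not measured data. */
    int main(void)
    {
        const double single_perf = 1.0, single_area = 1.0;
        const double dual_perf   = 1.6, dual_area   = 2.2;

        printf("single core: %.2f performance per unit area\n", single_perf / single_area);
        printf("dual core:   %.2f performance per unit area\n", dual_perf / dual_area);
        /* Roughly 1.00 vs 0.73: more than double the silicon for well under
         * double the performance. */
        return 0;
    }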

Meta CPUs: Simultaneous Multithreading with Automatic MIPS Allocation

Jumping straight to multicore is clearly not the way to go. Doubling or quadrupling the silicon area is not necessarily cost-effective, especially as OoO schemes become more prohibitive when gluing multiple cores together. It’s therefore time to revisit simultaneous hardware multithreading to increase the efficiency of the available resources before leaping to multicore.

Imagination’s Meta family of 32-bit general-purpose multi-threaded processor IP cores is the ideal CPU simultaneous multithreading solution. Meta is Imagination’s synthesizable applications processor, built on a common instruction-set architecture (ISA) for the entire family (HTP, MTP and LTP), that delivers high performance via multithreading combined with low power consumption and reduced system cost.

Meta simultaneous multithreading CPU - Meta CPU block diagram

Meta CPU block diagram

This CPU provides a latency-tolerant solution, maximizing the efficiency of any SoC before going to multicore. Multi-threaded CPUs allow for more fine-grained execution at the instruction level and make better use of the CPU’s capabilities by issuing instructions from multiple threads. They strike the right balance between private and shared resources, with parallel execution offering natural memory coherency instead of relying on complex logic to implement synchronization barriers for instruction and data cache maintenance operations. The unique multithreading features of the Meta architecture enable SoCs to make far better use of every CPU cycle, getting more done in fewer cycles than conventional embedded processors. The Meta architecture also includes advanced scheduling features such as AMA (Automatic MIPS Allocation), which enables the system designer to adjust what percentage of total MIPS is made available to each thread. Since AMA is implemented in hardware, final system performance can be tuned while minimizing software rebuilds.

Single-threaded vs. Multi-threaded

Comparing single-threaded and multi-threaded CPUs shows a clear advantage for executing in parallel. In a single-threaded scenario, events occur serially and the processor must wait during cache misses.

Meta simultaneous multithreading CPU - Single-threaded single core CPU

Single-threaded single core CPU

Meta simultaneous multithreading CPU - Single-threaded dual core CPU with SMP

Single-threaded dual core CPU with SMP

With regular multithreading, when the first virtual processor experiences a cache miss, the next thread starts executing, which requires a context switch. Multicore CPUs can also over-complicate context switching by relying on the OS to decide when it should happen. If the cores have a number of small differences in the programmer’s model, the process can become inefficient and consume active cycles that could otherwise be spent on useful instructions.

Meta simultaneous multithreading CPU - Multi-threaded CPU

Multi-threaded CPU

With simultaneous multithreading these problems disappear, as there is no context switch. At the start of each clock cycle, intelligent scheduling is performed across all available threads; if the resources a thread needs are free, its instructions are issued, so multiple overlapping tasks can run at the same time.

Meta simultaneous multithreading CPU - Simultaneous multi-threaded CPU

Meta CPU – Simultaneous multi-threaded CPU
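That per-cycle decision can be modelled with a deliberately simplified C sketch (a conceptual illustration with made-up thread state, not Meta’s actual issue logic): every cycle the scheduler visits all hardware threads, skips any thread stalled on memory, and issues from the rest, with nothing saved or restored:

    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_THREADS 4

    /* Conceptual model of simultaneous multithreading: every hardware thread keeps
     * its own architectural state (here just a program counter) plus a flag saying
     * whether it is stalled on memory. This is an illustration, not Meta's issue logic. */
    struct hw_thread {
        unsigned pc;          /* private program counter */
        bool waiting_on_mem;  /* e.g. an outstanding cache miss */
    };

    /* At the start of each cycle, issue from every thread that is ready. A stalled
     * thread is simply skipped: nothing is saved or restored, so there is no
     * context-switch overhead and its cycles go to the other threads. */
    static void issue_cycle(struct hw_thread t[], unsigned cycle)
    {
        for (int i = 0; i < NUM_THREADS; i++) {
            if (t[i].waiting_on_mem)
                continue;
            printf("cycle %u: issue from thread %d (pc=0x%x)\n", cycle, i, t[i].pc);
            t[i].pc += 4;   /* pretend one instruction was executed */
        }
    }

    int main(void)
    {
        struct hw_thread t[NUM_THREADS] = {
            { 0x1000, false }, { 0x2000, true },   /* thread 1 is stalled on a miss */
            { 0x3000, false }, { 0x4000, false },
        };
        for (unsigned c = 0; c < 3; c++)
            issue_cycle(t, c);
        return 0;
    }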

Meta supports the latest Linux and Android operating systems, but can also run a low-level RTOS or even native DSP code simultaneously – all sharing SoC resources highly efficiently, which keeps clock speeds down, optimizes memory utilization and gets the most performance out of an SoC. SMP Linux regards Meta as four virtual processors, allowing the operating system to assign tasks dynamically. As far as it’s concerned, it’s a quad core, but without the overhead of actually having four physical cores.
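From software’s point of view this is easy to observe: a standard Linux call reports the number of online logical processors, and on a four-threaded Meta running SMP Linux it would return four (a generic Linux sketch, not Meta-specific code):

    #include <stdio.h>
    #include <unistd.h>

    /* Generic Linux sketch: ask the OS how many logical processors are online.
     * On a four-threaded Meta core running SMP Linux this would report 4, even
     * though there is only one physical core. */
    int main(void)
    {
        long cpus = sysconf(_SC_NPROCESSORS_ONLN);
        printf("online logical processors: %ld\n", cpus);
        return 0;
    }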

Another breakthrough for Meta was its implementation of AMA (Automatic MIPS Allocation), a patented multithreading technology that enables the CPU to achieve real-time performance through intelligent scheduling of multiple threads. The major advantage of AMA lies in the ability to adjust priorities dynamically, so programmers can control thread execution rates. The figure below illustrates the concept behind AMA using two real-time threads and a regular task. The first two threads are set to a higher priority than the third; in addition, one of the real-time threads is also assigned a certain allocation of MIPS. The non-real-time thread cannot run at its desired rate and builds up a deficit, but will eventually catch up using spare cycles.

Meta CPU simultaneous multithreading: AMA (Automatic MIPS Allocation) example

AMA (Automatic MIPS Allocation) example

By using AMA, Meta is able to use spare resources to run the non-real-time tasks while always maintaining the required resources for the real-time processes. It’s just like having your own hardware task manager that controls every thread in a much more advanced way.
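The deficit-and-catch-up behaviour described above can be thought of as a credit-based scheduler. The C sketch below is purely conceptual (hypothetical thread names, priorities and shares; not the AMA hardware): real-time threads are served first, while a lower-priority background thread banks a deficit during a busy period and repays it from spare cycles afterwards:

    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_THREADS 3
    #define CYCLES      100

    /* Conceptual model of AMA-style scheduling (illustrative only, with
     * hypothetical thread names and shares; not the actual AMA hardware).
     * Each thread has a priority and a target share of cycles; a thread that
     * cannot run at its desired rate accumulates a deficit. */
    struct ama_thread {
        const char *name;
        int priority;      /* higher value is served first */
        double share;      /* target fraction of total cycles */
        double deficit;    /* positive = behind its target rate */
        bool ready;        /* wants to run this cycle */
    };

    /* Pick the ready thread with the highest priority; break ties in favour of
     * the thread that is furthest behind its target rate. */
    static int pick_thread(struct ama_thread t[])
    {
        int best = -1;
        for (int i = 0; i < NUM_THREADS; i++) {
            if (!t[i].ready)
                continue;
            if (best < 0 || t[i].priority > t[best].priority ||
                (t[i].priority == t[best].priority && t[i].deficit > t[best].deficit))
                best = i;
        }
        return best;
    }

    int main(void)
    {
        struct ama_thread t[NUM_THREADS] = {
            { "rt-0 (audio)", 2, 0.40, 0.0, true },
            { "rt-1 (modem)", 2, 0.30, 0.0, true },
            { "background",   1, 0.30, 0.0, true },
        };

        for (int c = 0; c < CYCLES; c++) {
            /* Every thread accrues its configured share of each cycle. */
            for (int i = 0; i < NUM_THREADS; i++)
                t[i].deficit += t[i].share;

            /* Busy phase: the real-time threads demand every cycle, so the
             * background thread falls behind. Quiet phase: they go idle and
             * the background thread catches up on the spare cycles. */
            bool busy = (c < 60);
            t[0].ready = busy;
            t[1].ready = busy;

            int w = pick_thread(t);
            if (w >= 0)
                t[w].deficit -= 1.0;   /* the winner repays one cycle of its debt */

            if (c == 59 || c == CYCLES - 1)
                printf("after cycle %2d: %s deficit = %+.1f cycles\n",
                       c, t[2].name, t[2].deficit);
        }
        return 0;
    }

In this toy run the background thread ends the busy phase 18 cycles behind its target rate and has more than caught up by the final cycle, mirroring the deficit and catch-up shown in the example above.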

Meta simultaneous multithreading CPU - Meta vs Traditional single core

Meta multi-threaded CPU vs Traditional single core CPUs

The Meta family of 32-bit CPU cores is a unique range of IP processors that employ hardware simultaneous multithreading to provide exceptional tolerance to SoC system latencies while also delivering new levels of real-time response, making them ideal for a wide range of applications.

For the latest news on Meta CPUs, our Flow technologies for the Internet of Things and other exciting announcements, make sure you follow us on Twitter (@ImaginationPR), add us on +Imagination and subscribe to this blog.


  • MagnetMan1

    I’m extremely interested in this processor. What class of core processor is the single-core Meta 4-thread comparable to in die-space? Is the ISA new, or is it instruction compatible with an existing set?

  • alexvoica

    @MagnetMan1
    Hi,

    The 4-threaded Meta HTP core is an application level processor. In terms of die area, the exact figures will depend on configuration. At the same time, this is an in-order core, so it does not have the logic overhead for super-scalar or Out-of-Order execution, which can make it smaller than similar multicore configurations.
    On the flip side, the additional resources for things like unique program counters for each thread can grow the logic a little (10-15%) compared to a single-threaded in-order core. Overall I’d suggest contacting the support team (meta@imgtec.com) with an exact configuration; you may be pleasantly surprised.

    The ISA is common across all Meta products from micro-controllers to application level processors but it is separate from other vendor-specific ISAs (PowerPC, MIPS, ARM, x86 etc).

    Best regards,
    Alex Voica.

  • MagnetMan1

    @alexvoica  @MagnetMan1 Many thanks for the in-depth response, Alex! Although I am just an enthusiast, I find this technology very innovative and am convinced that it will find a very nice segment of the market. In fact, I think it will do one better and shake up the industry, giving a moment of pause to other CPU designers that have stuck with paradigms such as OoO execution for generations, and those growing into them in the increasing transition to mobile.

    I also really appreciate Imagination’s diligence in filling the software gap for this product to ease its introduction to market (which I assume includes a complete compiler toolchain). I just hope that chip designers strongly consider this technology even in the wake of the distinct ISA.

    But the most exciting aspect of this technology is Imagination’s bold foray into CPU architecture. Imagination is a company to bet on.

    Together with your outstanding RPU, Imagination seems to be leading the charge with ever more efficient performance and space-conscious designs. It’s easy to see why I’m a huge fan! Bravo!
