iApplianceWeb.com

First Look:

IBM, Sony, Toshiba say Cell Processor makes computing more connected

By Bernard Cole
iApplianceWeb
(11/30/04, 09:46:52 PM GMT)

San Francisco, Calif. – More details are emerging about the revolutionary “cell processor” that IBM, Sony, and Toshiba have been hinting at for months.

At a press conference here on Monday, Nov. 29th, abstracts of the papers the companies will be presenting at the International Solid State Circuits Conference in February of next year were released, focusing on the hardware features of the radically new computer architecture.  

As enlightening as they are about the hardware, additional documentation that is also available makes it clear that what the companies are after is not just a new CPU that can be used in a number of different net-centric computing appliance applications.

Their aim is a fundamental reordering of existing computer hardware and software architecture to reflect the realities of the new pervasively connected computing environment.

The ISSCC abstracts reveal a multicore 64-bit Power CPU architecture with embedded streaming processors, high speed I/O, SRAM and the use of a dynamic multiplier.  

Currently, target applications (Figure 1, below) for the Cell architecture depend on who is doing the talking. In gaming circles it is viewed as the muscular gaming engine for Sony’s new PlayStation 3. But it has also been promoted for use in set-top boxes, mobile devices and workstations. A version of the Cell processor is already being used by Sony in workstations for game developers.


Figure 1: Scalable Cell Processor Applications


Each processing element consists of an IBM Power-architecture 64-bit RISC CPU, a highly sophisticated direct-memory access controller and up to eight identical streaming processors, all of which reside on a very fast local bus.  

Each processing element is also connected to others over parallel bundles of high speed serial I/O links which are capable of throughputs of about 6.4 GHz per link.

It seems to be conceptually similar to the multi-plane architecture used in network processing units: the Power processors handle supervisory, I/O, interface and traditional computational tasks, while in the data plane the streaming processors -- self-contained SIMD units that operate autonomously once they are launched -- focus on data movement.

Data and instructions are moved about via (1) a 128-kbyte local pipelined SRAM that is located between each stream processor and the local bus, (2) a bank of one hundred twenty-eight 128-bit registers and (3) a bank of four floating-point and four integer execution units. All operate in single-instruction, multiple-data mode from one instruction stream.
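
To make the single-instruction, multiple-data idea concrete, the following minimal sketch applies one add instruction across every lane of a 128-bit register. The register count comes from the description above; the split into four 32-bit lanes and all names in the code are illustrative assumptions, not details taken from the ISSCC abstracts.

```python
# Illustrative SIMD sketch. Lane width and all names are assumptions.
LANES = 4             # assumed: one 128-bit register holds four 32-bit lanes
NUM_REGISTERS = 128   # from the article: 128 registers of 128 bits each
MASK32 = 0xFFFFFFFF   # keeps each lane within 32 bits


def simd_add(a, b):
    """Apply a single integer add to every lane, wrapping at 32 bits."""
    if len(a) != LANES or len(b) != LANES:
        raise ValueError("each register must hold exactly LANES values")
    return [(x + y) & MASK32 for x, y in zip(a, b)]


# Register file modeled as 128 registers of four lanes each.
registers = [[0] * LANES for _ in range(NUM_REGISTERS)]
registers[1] = [1, 2, 3, 4]
registers[2] = [10, 20, 30, 0xFFFFFFFF]

# One instruction produces four results at once.
registers[3] = simd_add(registers[1], registers[2])
print(registers[3])
```

The last lane wraps around, which is how a fixed-width integer lane would be expected to behave.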

To make all processing resources appear in a single pool under control of the system software and operate as a tightly coupled multiprocessor, the hardware includes a new DMA controller design that allows any processor in the system to access any bank of DRAM in a particular cell module through a bank-switching arrangement.

Traditional CPU architectures obsolete 

As impressive as the hardware design and performance are, they are driven by, and reflect, a fundamental rethinking of the common programming model and software architecture from which most modern standalone RISC architectures are derived.

According to the principal developers, the cell processor architecture represents a fundamental shift to a new architectural paradigm that reflects the new connected computing environment.  

According to them, the RISC processors and controllers in current use were all conceived in the era before the Internet and World Wide Web became a mainstream phenomenon and are designed principally for stand-alone computing.  

The sharing of data and application programs over a computer network was not a principal design goal of these CPUs. And while they all have a common RISC heritage, the processor environment on the Internet is heterogeneous.

Each CPU has its own particular instruction set and instruction set architecture (ISA), its own particular set of assembly language instructions and structure for the principal computational and memory elements that execute these instructions.  

Not only does this make a programmer’s life more complicated, it also increases the cost of application development, since identical applications have to be rewritten to reflect not only each processor’s ISA but also the physical constraints they must operate within and the specific requirements of the devices in which they are used, which in the new connected computing environment are extensive.

In addition to personal computers (PCs) and servers, they point out, a diversity of computing devices have emerged, including cellular telephones, mobile computers, personal digital assistants (PDAs), set top boxes, digital televisions and many others. The sharing of data and applications among this assortment of computers and computing devices presents substantial problems.  

Java is not enough 

According to the inventors of the new architecture, a number of techniques in the past have been employed to overcome these problems, including sophisticated interfaces and complicated programming techniques, all of which require substantial increases in processing power to implement. The result has been a substantial increase in the time required to process applications and to transmit data over networks.  

One way around this that is commonly employed is to transmit the data and the applications code separately over the Internet. While this approach minimizes the amount of bandwidth needed, it also often causes frustration among users.

The correct application, or the most current application, for the transmitted data may not be available on the client's computer. This approach also requires the writing of multiple versions of each application for each CPU ISA used on the network.  

The Java Virtual Machine “write once, run everywhere” model, they point out -- which relies on a platform-independent virtual machine that interprets programs, rather than compiling them to make maximum use of each target processor’s resources -- is a partial and increasingly unsuccessful attempt to solve this problem.

And it will become more inadequate as real-time, multimedia network applications become more pervasive, they point out. Such net-centric applications will require many thousands of megabits of data per second, and the Java programming model makes reaching such processing speeds extremely difficult.

Therefore, a new network-optimized computer architecture and a new programming model are required, they believe, to overcome the problems of sharing data and applications among the various members of a network without imposing added computational burdens. This new computer architecture and programming model also should overcome the security problems inherent in sharing applications and data among the members of a network.  

“Software cells” turn Java Upside Down 

At the core of the new connected computing architecture the companies have developed is a new “software cell”-based programming model for transmitting data and applications over a network and among the network's members. In one sense, it turns the Java model on its head.  

While it can operate in the Java mode, which downloads a platform-independent application to run on a node, it can also be described as a "write once, reside anywhere and participate everywhere" programming model.

Another way to look at the software cell model is as the Web Services paradigm writ small, in that an application does not have to depend only on the resources resident on the hardware where it resides but can incorporate services from external resources to accomplish its task.

It also differs from the traditional approach in that it combines application and data in the same deliverable "software cell," or apulet, designed for transmission over the network for processing by any processor on the network.  

The code for the applications preferably is based upon the same common instruction set and ISA. Each software cell preferably contains a global identification (global ID) and information describing the amount of computing resources required for the cell's processing.

Since all computing resources have the same basic structure and employ the same ISA, the particular resource performing this processing can be located anywhere on the network and dynamically assigned.  
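
As a rough illustration of the idea, the sketch below models a software cell that carries its code, its data, a global ID and a resource requirement, and assigns it to whichever network member has capacity. The field names, the choice of "APUs required" as the resource unit and the first-fit policy are assumptions made for the example; the article does not describe the actual cell format or scheduling rules.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SoftwareCell:
    global_id: int       # network-wide identifier, as described in the article
    apus_required: int   # resource requirement (the unit is an assumption)
    program: bytes       # application code, written for the common ISA
    data: bytes          # data that travels with the code


@dataclass
class Node:
    name: str
    free_apus: int


def assign(cell: SoftwareCell, nodes: List[Node]) -> Optional[Node]:
    """First-fit placement: reserve APUs on the first node with room."""
    for node in nodes:
        if node.free_apus >= cell.apus_required:
            node.free_apus -= cell.apus_required
            return node
    return None  # no member of the network can take the cell right now


nodes = [Node("pda", 8), Node("server", 32)]
cell = SoftwareCell(global_id=1, apus_required=12, program=b"", data=b"")
chosen = assign(cell, nodes)
print(chosen.name if chosen else "no capacity")
```

Because every node exposes the same kind of resource, the placement function never needs to ask what type of processor a node contains, only how much of it is free.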

Identical, scalable hardware resources 

To make the “software cell” approach work, however, requires a modular hardware architecture from which all members of the network (Figure 1, above) -- clients, servers, PCs, mobile computers, game machines, PDAs, set top boxes, appliances, digital televisions -- can be constructed.

This common computing module requires a consistent structure and preferably the same ISA. In this approach, the only difference between a PDA or mobile phone and a server is the number of resources available locally in the hardware module for execution of the software cell.

And even though the resources are not available locally, that does not mean that a PDA could not execute the application. Since the hardware modules and software cells are identical in structure, if the network bandwidth and the application requirements locally allowed it, a software cell could be executed remotely and delivered locally to provide the functionality the PDA requires.  

The consistent modular structure, the developers point out, also enables efficient, high speed processing of applications and data by the network's members and the rapid transmission of applications and data over the network. It also simplifies the building of members of the network of various sizes and processing power and the preparation of applications for processing by these members.  



The basic processing module (Figure 2, above) includes a processor element (PE), which consists of a processing unit (PU), a direct memory access controller (DMAC) and a number of attached processing units (APUs). In the hardware implementation to be described at the ISSCC, each PU is an IBM Power CPU core, and the APUs are dedicated stream processors.

Typically a single PE would consist of one PU and up to eight APUs, which interact with a shared dynamic random access memory (DRAM) through a crossbar architecture. The PU schedules and orchestrates the processing of data and applications by the APUs. The APUs perform this processing in a parallel and independent manner. The DMAC controls accesses by the PU and the APUs to the data and applications stored in the shared DRAM.

The number of PEs used in any particular network-connected appliance depends on the processing power required locally. A server may use four PEs, while a workstation may employ two PEs and a PDA may require only one PE. The number of APUs of a PE assigned to processing a particular software cell depends upon the complexity and magnitude of the programs and data within the cell (Figure 3, below).
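
The scaling arithmetic can be worked through in a few lines. The PE counts below are the ones quoted above; treating every PE as fully populated with eight APUs is an assumption, since the article says only "up to eight".

```python
# Assumed: every PE carries the maximum of eight APUs ("up to eight" in the article).
APUS_PER_PE = 8

# PE counts per device class, as quoted in the article.
PES_PER_DEVICE = {"server": 4, "workstation": 2, "pda": 1}


def total_apus(device: str) -> int:
    """Number of APUs available locally on a given class of device."""
    return PES_PER_DEVICE[device] * APUS_PER_PE


for device in PES_PER_DEVICE:
    print(device, total_apus(device))
```

Under that assumption a server would offer 32 APUs locally, a workstation 16 and a PDA 8.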


Figure 3: Server Cell Processor Configuration


 

New hardware building blocks 

To make this architecture work, radical new approaches have had to be developed for almost every aspect of a computer system: DRAM, DMAC, synchronization, bus and I/O architecture, security, remote procedure command sequencing, and timing.  

Currently the companies have applied for and/or been granted nine patents covering almost every aspect of the hardware design, details of which will be described in more depth in February at the ISSCC. 

Typically, however, the shared DRAM is configured into sixty-four memory banks, each of which has one megabyte of storage capacity. Each section of the DRAM is controlled by a bank controller, and each DMAC has equal access to each bank controller. What this allows, the developers say, is access by the DMAC to any portion of the shared DRAM.  
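
With sixty-four banks of one megabyte each, the shared DRAM totals 64 megabytes. The sketch below shows one way an address could be resolved to a bank and an offset within it. The linear, non-interleaved layout is an assumption for illustration; the article does not say how addresses are distributed across banks.

```python
NUM_BANKS = 64             # from the article
BANK_SIZE = 1 << 20        # one megabyte per bank, from the article
TOTAL_BYTES = NUM_BANKS * BANK_SIZE


def locate(address: int):
    """Return (bank, offset), assuming banks cover consecutive 1 MB ranges."""
    if not 0 <= address < TOTAL_BYTES:
        raise ValueError("address lies outside the shared DRAM")
    return divmod(address, BANK_SIZE)


print(TOTAL_BYTES // (1 << 20), "MB in total")
print(locate(3 * BANK_SIZE + 5))
```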

The synchronization system developed by the companies to allow an APU to read data from, and write data to, the shared DRAM is designed to avoid conflicts among the multiple APUs and multiple PEs sharing the DRAM.

This is done by setting aside an area of DRAM for storing full-empty bits, each of which corresponds to a designated area of the DRAM. Because it is integrated into the DRAM, the synchronization system avoids the computational overhead of a data synchronization scheme implemented in software.
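
The behavior of full-empty bits can be modeled in software, even though the point of the Cell design is that the hardware does this work. In the sketch below, a write succeeds only into an empty block and a read succeeds only from a full one. The class name, the block granularity and the rule that a read empties the block are assumptions made for the example.

```python
class SyncMemory:
    """Software model of memory blocks guarded by full-empty bits."""

    def __init__(self, num_blocks: int):
        self.data = [None] * num_blocks
        self.full = [False] * num_blocks   # one full-empty bit per block

    def write(self, block: int, value) -> bool:
        """Store into an empty block; return False if the writer must retry."""
        if self.full[block]:
            return False
        self.data[block] = value
        self.full[block] = True
        return True

    def read(self, block: int):
        """Return (True, value) and empty the block, or (False, None) if not ready."""
        if not self.full[block]:
            return (False, None)
        value = self.data[block]
        self.full[block] = False
        return (True, value)


memory = SyncMemory(num_blocks=4)
print(memory.read(0))        # nothing has been written yet
print(memory.write(0, 42))   # producer fills the block
print(memory.write(0, 43))   # second write is refused until the block is read
print(memory.read(0))        # consumer drains the block
```

A producer and a consumer that follow these two rules never overwrite unread data and never read stale data, which is the conflict the scheme is meant to prevent.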

Cell has on-chip security "sandboxes"

To deal with security issues, “sandboxes” are incorporated into the DRAM to protect against the corruption of data for a program being processed by one APU from data for a program being processed by another APU. Each sandbox defines an area of the shared DRAM beyond which a particular APU, or set of APUs, cannot read or write data.  
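
A sandbox of this kind amounts to a bounds check on every memory access. The following sketch expresses that check in software; the base-and-size representation and the function names are hypothetical, and the real mechanism is described as being built into the hardware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sandbox:
    base: int   # first address the APU may touch
    size: int   # length of the permitted region, in bytes

    def allows(self, address: int, length: int = 1) -> bool:
        """True if [address, address + length) lies wholly inside the sandbox."""
        return self.base <= address and address + length <= self.base + self.size


def checked_access(sandbox: Sandbox, address: int, length: int = 1) -> int:
    """Return the address if the access is permitted, otherwise raise."""
    if not sandbox.allows(address, length):
        raise PermissionError(f"access at {address:#x} is outside the sandbox")
    return address


# Two APUs with adjacent, non-overlapping one-megabyte sandboxes (assumed sizes).
apu0 = Sandbox(base=0x000000, size=0x100000)
apu1 = Sandbox(base=0x100000, size=0x100000)

print(apu0.allows(0x0FFFFF))   # last byte of APU 0's region
print(apu0.allows(0x100000))   # first byte of APU 1's region
```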

The new hardware module architecture also handles remote procedure calls in a different way. They are issued by a main PU to the APUs to initiate processing of applications and data. These commands, called APU remote procedure calls (ARPCs), enable the PUs to orchestrate and coordinate the APUs' parallel processing of applications and data without the APUs performing the role of co-processors.  

Considerable new work has gone into the development of a dedicated pipeline structure for the processing of streaming data. With this structure, a coordinated group of APUs, and a coordinated group of memory sandboxes associated with these APUs, are established by a PU for the processing of data. The pipeline's APUs and memory sandboxes remain dedicated to the pipeline even when no data is being processed, and are placed in a reserved state during these periods.

Timing is of the essence in this new approach to connected computing. So the companies have developed -- and patented -- a new absolute timer design that is independent of the frequency of the clocks employed by the APUs for the processing of applications and data.

Applications are written based upon the time period for tasks defined by the absolute timer. If the frequency of the APU clocks increases because of enhancements to the APUs, the time period for a given task as defined by the absolute timer remains the same.

What this scheme allows, said the developers, is the use of faster processing times by newer versions of the APUs without disabling these newer APUs from processing older applications written for the slower processing times of older APUs.
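
One way to picture the absolute timer is as a fixed time budget per task: a faster APU finishes its work sooner but the task still ends when the budget expires. The sketch below assumes that the leftover time is spent idle, and the cycle counts, clock rates and budget are invented for illustration.

```python
def task_timeline(cycles: int, clock_hz: float, budget_s: float):
    """Split a fixed time budget into (busy, standby) seconds for a given APU clock.

    Assumes the APU idles for whatever remains of the budget after its work.
    """
    busy = cycles / clock_hz
    if busy > budget_s:
        raise ValueError("APU is too slow to finish within the time budget")
    return busy, budget_s - busy


BUDGET = 2e-6   # hypothetical task budget defined by the absolute timer
WORK = 4000     # hypothetical cycle count for the task

for clock in (4e9, 8e9):   # an older and a newer APU, both hypothetical
    busy, standby = task_timeline(WORK, clock, BUDGET)
    print(f"{clock / 1e9:.0f} GHz: busy {busy * 1e6:.2f} us, standby {standby * 1e6:.2f} us")
```

In both cases busy time plus standby time equals the same budget, so an application written against the timer sees identical task boundaries on either APU.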

The new architecture also required the development of an alternate scheme for allowing newer, faster APUs to process older applications written for the slower processing speeds of older APUs.

The approach the developers of the architecture have taken is to analyze, in real time, the particular instructions or microcode employed by the APUs in processing these older applications for problems in the coordination of the APUs' parallel processing created by the enhanced speeds.

"No operation" ("NOOP") instructions are then inserted into the instructions executed by some of these APUs to maintain the sequential completion of processing by the APUs expected by the program. By inserting these NOOPs into these instructions, the developers point out, the correct timing for the APUs' execution of all instructions is maintained.  

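A heavily simplified sketch of NOOP padding follows. It assumes the newer APU is faster by a whole-number factor and that a NOOP takes as long as any other instruction, so padding every instruction restores the original pace. The scheme described above is more selective, analyzing the instructions in real time and padding only some APUs; the function name and instruction mnemonics are invented for the example.

```python
def pad_with_noops(program, speedup: int):
    """Insert (speedup - 1) NOOPs after each instruction.

    Simplifying assumptions: the new APU is exactly `speedup` times faster,
    and every instruction, including NOOP, takes one time slot.
    """
    if speedup < 1:
        raise ValueError("speedup must be at least 1")
    padded = []
    for instruction in program:
        padded.append(instruction)
        padded.extend(["NOOP"] * (speedup - 1))
    return padded


old_program = ["LOAD", "ADD", "STORE"]
print(pad_with_noops(old_program, 2))
```

With a speedup of 2 the padded stream is twice as long, so on the faster APU it completes in the same wall-clock time the original stream took on the older one.
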
Moving ahead with Cell

In addition to Sony's PlayStation and workstation plans for Cell, IBM plans to begin pilot production of Cell-based microprocessor circuits during the first half of next year, and Toshiba is planning to launch a Cell-based high-definition TV next year.

While the abstracts do not go into much detail on throughput, the operating speed of the streaming-processor/SRAM block has been estimated at about 4.8 GHz, while a Cell module with four Power CPU elements would have a performance of about one teraflop.

The companies will be presenting five papers at the ISSCC. Focusing on key concepts of Cell architecture is "The Design and Implementation of a First-Generation Cell Processor" (session 10.2). Other papers are "A Streaming Processing Unit for a Cell Processor" (session 7.4) and "A 4.8GHz Fully Pipelined Embedded SRAM in the Streaming Processor of a Cell Processor" (session 26.7).

Two additional papers on the Cell design are "A Double-Precision Multiplier with Fine-Grained Clock-Gating Support for a First-Generation Cell Processor" (session 20.3) by IBM, and "Clocking and Circuit Design for a Parallel I/O on a First-Generation Cell Processor" (session 28.9) by Rambus Inc. and Stanford University.


 



Copyright © 2004 Appliance-Lab