Sony Playstation 3
Sony PlayStation 3 Slim Game console - Charcoal black, HDD 120 GB
PlayStation 3 delivers an experience beyond anything you know today. With a built in Blu-ray Disc drive, PlayStation 3 invites you to a whole new generation in high-definition graphics and media capabilities. Whether it's high-definition gaming, Blu-ray movies, music or online services, PlayStation 3 takes you where you've never dreamed possible - a place where you can play beyond.
Manual
(English) Sony Playstation 3 - Quick Reference, size: 5.9 MB
Sony Playstation 3
User reviews and opinions
pentium-M, 7:04pm on Monday, September 27th, 2010
Sleek, quiet and packed with multimedia potential. Expensive, online service lacking,,. Awesome Graphics!! NONE YET!! great product uses cost

lrsnider, 7:01pm on Wednesday, July 7th, 2010
Owned it for about 1.5 months so far. Used it mostly for movies and Guitar Hero World Tour, more games later. Can play Blu-rays and play games. Great entertainment center. Fun games ; built in Wifi and No monthly fee to play online(as opposed to xbox) ; Blueray! Need I say more No problems. This is our second PS3 and it lacks some of the features of our first one. Glad we have one of the older ones, too. Plays blu-ray disks.

glock, 7:56am on Friday, July 2nd, 2010
"The PS3 is amazing i loved it since i first laid eyes on it. "PS3 blows xbox360 alway. It does it all=blueray movies, wifi, surf the web, free PSN, download music, movies, and games. "I just got the PS3 for christmas 2009 and I absolutely love it. I wish I got the PS3 first or else I would have never wasted my money on a 360.

elihusmails, 4:02pm on Friday, June 18th, 2010
"Bought the PS3 yesterday and set up was easy. Played the combo pack GOD OF WAR 1 and 2 game. The next day i turn it on, the PS3 starts the update...

pb00145, 9:26am on Tuesday, April 13th, 2010
"This is the ultimate gaming console. It completely understands and utilizes the ultimate HD experience, Blu-Ray. "i was disappointed that sony stopped making the ps3 backwards compatible my friend bought the early 80 gig and gets to play his ps2 games but i got s... "the ps3 is a great gaming system. i have the 160gb ps3 system. i wish they should of keep the 4 usb feature and the ps1 and ps2 feature.

MichaelAstor, 2:45am on Sunday, March 14th, 2010
Sony is fanatical about controlling how you use the system. Since my old system died. I bought a NYKO BluWave remote for the IR dongle to use with my Onkyo TX-SR608 remote for BluRay. Buy a computer and get a middle of the road graphics card and a 21" 1080p computer monitor and you will run circles around a console.
Comments posted on www.ps2netdrivers.net are solely the views and opinions of the people posting them and do not necessarily reflect the views or opinions of us.
Documents

Sony PLAYSTATION 3 and PlayStation Portable (PSP)
1. In the main menu, using the left stick or directional pad, go to Settings, then select Security Settings by pressing the X button. Options for restricting games are listed under Parental Control. A number system indicates the relative level of restriction: the lower the number, the tighter the restrictions.
2. Each number below corresponds with an ESRB rating category:
   2 EC (Early Childhood 3+)
   3 E (Everyone 6+)
   4 E10+ (Everyone 10+)
   5 T (Teen 13+)
   9 M (Mature 17+)
   10 AO (Adults Only 18+)
3. To set parental controls for the Web browser, in Security Settings, select Internet Browser Start Control. Your options are On or Off. Selecting On will block access to the Internet.
4. The PLAYSTATION 3 and PSP parental controls are enforced by a four-digit password. The default password is 0000 (four zeros). It is recommended that you reset the password. In the Security Settings menu, select Change Password. Enter the default password, and then select a new password.
You can also use Parental Control to:
- Block access to DVD and Blu-ray (high-definition) movies by MPAA rating
Tips about PLAYSTATION Network:
- The default settings block content based on registered user age and restrict chat with other players
- Be sure to set up sub accounts for each child
For more on PLAYSTATION 3, PSP and PLAYSTATION Network, visit: www.us.playstation.com/support

A Rough Guide to Scientific Computing On the PlayStation 3
Technical Report UT-CS-07-595 Version 1.0
by Alfredo Buttari Piotr Luszczek Jakub Kurzak Jack Dongarra George Bosilca Innovative Computing Laboratory University of Tennessee Knoxville May 11, 2007
Contents
Introduction
Hardware
2.1 CELL Processor. 2.1.1 POWER Processing Element (PPE). 2.1.2 Synergistic Processing Element (SPE). 2.1.3 Element Interconnection Bus (EIB). 2.1.4 Memory System.
2.2 PlayStation 3. 2.2.1 Network Card. 2.2.2 Graphics Card.
2.3 GigaBit Ethernet Switch. 2.4 Power Consumption.
Software
3.1 Virtualization Layer: Game OS. 3.2 Linux Kernel. 3.3 Compilers.
3.4 TCP/IP Stack. 3.5 MPI.
Cluster Setup
4.1 Basic Linux Installation. 4.2 Linux Kernel Recompilation.
4.3 IBM CELL SDK Installation. 4.4 Network Configuration. 4.5 MPI Installation. 4.5.1 MPICH1. 4.5.2 MPICH2. 4.5.3 Open MPI.
Development Environment
5.1 CELL Processor. 5.2 PlayStation 3 Cluster.
Programming Techniques
6.1 CELL Processor. 6.1.1 Short Vector SIMDization.
6.1.2 Intra-Chip Communication. 6.1.3 Basic Steps of CELL Code Development. 6.1.4 Quick Tips.
6.2 PlayStation 3 Cluster.
Programming Models
7.1 CorePy. 7.2 Octopiler. 7.3 RapidMind. 7.4 PeakStream.
7.5 MPI Microtask. 7.6 Cell Superscalar. 7.7 The Sequoia Language.
7.8 Mercury Multi-Core Framework
7.9 IBM Accelerated Library Framework.
Application Examples
8.1 CELL Processor. 8.1.1 Dense Linear Algebra. 8.1.2 Sparse Linear Algebra. 8.1.3 Fast Fourier Transform. 8.2 PlayStation 3 Cluster. 8.2.1 The SUMMA Algorithm. 8.3 Distributed Computing. 8.3.1 Folding@Home.
Summary
9.1 Limitations of the PS 3 for Scientific Computing. 9.2 CELL Processor Resources. 9.3 Future.
Acronyms
Acknowledgements
We would like to thank Gary Rancourt and Kirk Jordan at IBM for taking care of our hardware needs and arranging for financial support. We are thankful to numerous IBM researchers for generously sharing with us their CELL expertise, in particular Sidney Manning, Daniel Brokenshire, Mike Kistler, Gordon Fossum, Thomas Chen, Jason Dale and Michael Perrone. Our thanks also go to Robert Cooper and John Brickman at Mercury Computer Systems for providing access to their hardware and software. We are also thankful to the Mercury research crew for sharing their CELL experience, in particular John Greene, Michael Pepe and Luke Cico. We are especially grateful to the following people for devoting their time to carefully reviewing the work and helping us improve it: Robert Cooper, Sidney Manning, Jason Dale and Joseph Czechowski (GE Research). We thank Chris Mueller from Indiana University for contributing section 7.1 about synthetic programming on the CELL in Python using CorePy. Thanks to Adelajda Zareba for the photography artwork for this guide.
CHAPTER 1
Introduction
As much as the Sony PlayStation 3 (PS3) has a range of interesting features, its heart, the CELL processor, is what the fuss is all about. CELL, a shorthand for CELL Broadband Engine Architecture, also abbreviated as CELL BE Architecture or CBEA, is a microprocessor jointly developed by the alliance of Sony, Toshiba and IBM, known as STI. The work started in 2000 at the STI Design Center in Austin, Texas, and for more than four years involved around 400 engineers and consumed close to half a billion dollars. The initial goal was to outperform desktop systems, available at the time of completion of the design, by an order of magnitude, through a dramatic increase in performance per chip area and per unit of power consumption. A quantum leap in performance would be achieved by abandoning the obsolete architectural model where performance relied on mechanisms like cache hierarchies and speculative execution, which in those days brought diminishing returns in performance gains. Instead, the new architecture would rely on a heterogeneous multi-core design, with highly efficient data processors being at the heart. Their architecture would be stripped of costly and inefficient features like address translation, instruction reordering, register renaming and branch prediction. Instead they would be given powerful short vector SIMD capabilities and a massive register file. Cache hierarchies would be replaced by small and fast local memories and powerful DMA engines. This design approach resulted in a 200-million-transistor chip, which today delivers performance barely approachable by its billion-transistor counterparts and is available to the broad computing community in a truly off-the-shelf manner via a $600 gaming console.
As exciting as it may sound, using the PS3 for scientific computing is a bumpy ride. Parallel programming models for multi-core processors are in their infancy, and standardized APIs are not even on the horizon. As a result, presently, only hand-written code fully exploits the hardware capabilities of the CELL processor. Ultimately, the suitability of the PS3 platform for scientific computing is most heavily impaired by the devastating disproportion between the processing power of the processor and the crippling slowness of the interconnect, explained in detail in section 9.1. Nevertheless, the CELL processor is a revolutionary chip, delivering ground-breaking performance and now available in an affordable package. We hope that this rough guide will make the ride slightly less bumpy.
CHAPTER 2
Hardware
CELL Processor
Figure 2.1 shows the overall structure of the CELL processor. In the following sections we briefly summarize the main features of its most important elements: the Power Processing Element (PPE), the Synergistic Processing Elements (SPEs), the Element Interconnection Bus (EIB) and the memory system.
POWER Processing Element (PPE)
The Power Processing Element (PPE) is a representative of the POWER Architecture, which includes the heavy-iron high-end POWER line of processors as well as the family of PowerPC desktop processors. The PPE consists of the Power Processing Unit (PPU) and a unified (instruction and data) 512 KB 8-way set associative write-back L2 cache. The PPU includes a 32 KB 2-way set associative reload-on-error instruction cache and a 32 KB 4-way set associative write-through data cache. The L1 caches are parity protected; the L2 cache is protected with error-correction code (ECC). Cache line size is 128 bytes for all caches. Besides the standard floating point unit (FPU), the PPU also includes a short vector SIMD engine, VMX, an incarnation of the PowerPC Velocity Engine or AltiVec. The PPE's register file comprises 32 64-bit general purpose registers, 32 64-bit floating-point registers and 32 128-bit vector registers.
Figure 2.1: CELL Broadband Engine architecture [1].
The PPE is a 64-bit, 2-way simultaneous multithreading (SMT) processor, binary compliant with the PowerPC 970 architecture. Although it uses the PowerPC 970 instruction set, its design is substantially different. It has a relatively simple architecture with in-order execution, which results in a considerably smaller amount of circuitry than its out-of-order execution counterparts and in lower energy consumption. This can potentially translate to lower performance, especially for applications heavy in branches. However, the high clock rate, high memory bandwidth and dual threading capabilities may make up for the potential performance deficiencies. Especially important is the SMT feature, which to an extent corresponds to Intel's HyperThreading technology. The PPE appears to the software layer as two independent execution units. In practice the execution resources are shared, but each thread has its own copy of the architectural state, such as general-purpose registers. The technology comes at a 5% increase in the cost of the hardware and can potentially deliver a 10% to 30% increase in performance [2].
Figure 2.2: Synergistic Processing Element architecture [1].
6 × 25.6 = 153.6 Gflop/s for the PlayStation 3. It has to be pointed out, however, that in single precision the SPE only implements truncation rounding, flushes denormal numbers to zero, and handles NaNs as normal numbers. Unfortunately for scientific computing, equal emphasis was not put on double precision performance. Unlike the PPE's VMX, the SPEs support double precision arithmetic, but the double precision instructions are not fully pipelined. In particular, the FMA operation has a latency of seven cycles. As a result, with two double precision elements per vector and two floating point operations per FMA, the double precision peak of a single SPE equals 3.2 × 2 × 2 / 7 ≈ 1.8 Gflop/s, which adds up to a peak of almost 11 Gflop/s for the PlayStation 3. This is still not bad compared to other common processors. SPEs have two pipelines, one devoted to arithmetic and the other to data motion. Both issue instructions in order and, if certain rules are followed, two instructions can be issued in each clock cycle, one to each pipeline.
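The peak figures quoted above can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: it assumes a 3.2 GHz clock, six SPEs usable on the PlayStation 3, one vector FMA counted as two floating point operations per element, and four single precision or two double precision elements per 128-bit vector (the single precision vector width is an assumption, since that part of the text is not reproduced here). The function name is hypothetical.

```python
CLOCK_GHZ = 3.2      # SPE clock rate assumed from the figures in the text
SPES_ON_PS3 = 6      # SPEs counted in the PlayStation 3 totals quoted in the text
FLOPS_PER_FMA = 2    # a fused multiply-add counts as a multiply plus an add


def spe_peak_gflops(lanes, cycles_per_issue):
    """Peak Gflop/s of one SPE issuing one vector FMA every `cycles_per_issue` cycles."""
    return CLOCK_GHZ * lanes * FLOPS_PER_FMA / cycles_per_issue


# Single precision: 4 lanes per 128-bit vector (assumption), fully pipelined.
single = spe_peak_gflops(lanes=4, cycles_per_issue=1)
# Double precision: 2 lanes per vector, one FMA every seven cycles.
double = spe_peak_gflops(lanes=2, cycles_per_issue=7)

print(f"single precision: {single:.1f} Gflop/s per SPE, {SPES_ON_PS3 * single:.1f} total")
print(f"double precision: {double:.2f} Gflop/s per SPE, {SPES_ON_PS3 * double:.2f} total")
```

Under these assumptions the single precision total matches the 153.6 Gflop/s figure, and the double precision total comes out just under 11 Gflop/s.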
Element Interconnection Bus (EIB)
All components of the CELL processor, including the PPE, the SPEs, the main memory and I/O, are interconnected with the Element Interconnection Bus (EIB). The EIB is built of four unidirectional rings, two in each direction, and a token-based arbitration mechanism playing the role of traffic lights. Channel width and clock rates aside, each participant is hooked up to the bus with a bandwidth of 25.6 GB/s, and for all practical purposes you can assume that the bus cannot be saturated: you will not run out of internal bandwidth for any realistic workload.
Memory System
The memory system is built of dual-channel Rambus Extreme Data Rate (XDR) memory. The PlayStation 3 provides a modest amount of memory of 256 MB, out of which approximately 200 MB is accessible to Linux OS and applications. The memory is organized in 16 banks. Real addresses are interleaved across the 16 banks on a naturally aligned 128-byte (cache line) basis. Addresses 2 KB apart generate accesses to the same bank. For all practical purposes the memory can provide the bandwidth of 25.6 GB/s to the SPEs through the EIB, provided that accesses are distributed evenly across all the 16 banks.
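To see why the access pattern matters, the interleaving described above can be modeled in a few lines. This is a sketch that assumes the simplest mapping consistent with the text (bank index = line number modulo 16); the real address decoding may differ in detail, and the function names are made up for illustration.

```python
LINE_BYTES = 128   # interleaving granularity: one naturally aligned cache line
NUM_BANKS = 16     # number of memory banks


def bank_of(addr):
    """Bank index of a real address under a plain modulo interleaving (assumed)."""
    return (addr // LINE_BYTES) % NUM_BANKS


def banks_touched(start, stride, count):
    """Set of banks hit by `count` accesses starting at `start`, `stride` bytes apart."""
    return {bank_of(start + i * stride) for i in range(count)}


# Consecutive 128-byte lines spread over all 16 banks ...
print(sorted(banks_touched(0, 128, 16)))
# ... while a 2 KB stride (16 * 128 bytes) keeps hitting the same bank.
print(sorted(banks_touched(0, 2048, 16)))
```

A strided access pattern whose stride is a multiple of 2 KB therefore concentrates traffic on one bank, which is the case the "distributed evenly" condition warns about.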
PlayStation 3
Network Card
The PlayStation 3 has a built-in GigaBit Ethernet network card. However, unlike Ethernet controllers in standard PCs, it is not attached to the PCI bus. It is directly connected to a companion chip. Linux, as a guest OS, has to use a dedicated hypervisor call to access or set up the chip. This is done by a Linux driver called gelic_net. The network card has a dedicated DMA unit, which allows data transfers without the PPE's intervention. To help with this, there is a dedicated hypervisor call to set up a DMAC. One of the many advantages of GigaBit Ethernet is the possibility of an increased frame size, so-called Jumbo Frames. Many pieces of standard-compliant equipment allow you to increase the frame size from 1500 to 9000 bytes. This can increase available bandwidth by 20% in some cases and significantly decreases processor load when handling network traffic. At this point, the PS3's built-in GigaBit NIC is limited by the kernel driver: the frame size cannot be larger than 2308 bytes (see file
* Recompile the kernel with support for huge TLB pages:
  - Take the kernel source from the Fedora kernel source CD (linux-20061110.tar.bz2).
  - Unpack the archive to the directory /usr/src.
  - Create the symbolic link to the directory containing the kernel source:
      $ ln -s /usr/src/linux-20061110 /usr/src/linux
  - Clean the source tree. Do this before copying the config file, because make mrproper also deletes any existing .config:
      $ make mrproper
  - Copy the kernel config file that comes with the Fedora installation to the directory /usr/src/linux:
      $ cp /boot/config-2.6.16 /usr/src/linux/.config
  - Prepare for kernel configuration:
      $ make oldconfig
    In this step, the old configuration file is analyzed and you are prompted whenever an option is encountered that is not present in the old kernel. In this case the old and the new kernel are exactly the same, and no prompts should appear.
  - Enable huge TLB pages in the kernel configuration:
      $ make menuconfig
    Go to File systems > Pseudo filesystems and enable huge TLB pages by pressing the space bar on the HugeTLB file system support option. Select exit repeatedly and answer yes when asked to save the new kernel configuration.
  - Compile the kernel and the modules, and install the modules (this takes around 20 minutes):
      $ make all
      $ make modules_install
* Install the new kernel:
    $ cp /usr/src/linux/vmlinux /boot/vmlinux-2.6.16_HTLB
* Create a ramdisk image for the new kernel:
    $ mkinitrd /boot/initrd-2.6.16_HTLB.img 2.6.16
* Tell the bootloader (kboot) where the new kernel is located:
    $ vim /etc/kboot.conf
  Add the following line:
    linux_htlb=/boot/vmlinux-2.6.16_HTLB initrd=/boot/initrd-2.6.16_HTLB.img
  If you want this kernel to be loaded by default, change the default line into:
    default=linux_htlb
* Instrument the boot process to include huge TLB pages allocation:
    $ vim /etc/rc.local
  Add the following lines:
    mkdir -p /huge
    echo 20 > /proc/sys/vm/nr_hugepages
    mount -t hugetlbfs nodev /huge
    chown root:root /huge
    chmod 755 /huge
  Be sure to change the chown line according to your system settings.
* Reboot. During the boot process, when presented with the kboot: prompt, you will be able to choose your kernel using the tab key.
* All the commands added to the rc.local file are executed at the end of the boot sequence. This means that the allocation of the huge TLB pages is performed when plenty of system memory has already been allocated to other processes. This results in allocation of only six or seven huge pages. In order to obtain a few more huge pages (eight or nine), we have to move the huge TLB pages allocation to an earlier stage in the boot sequence (i.e. to runlevel-1). In order to do that, create the /etc/init.d/htlb script with the content shown in Figure 4.1 and add the service to runlevel-1:
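Whichever stage of the boot sequence performs the allocation, it is worth confirming afterwards how many huge pages the kernel actually granted, since the request for 20 pages is only partly satisfied. The following is a minimal Python sketch for that check, separate from the boot script itself. It assumes the kernel reports the usual HugePages_Total, HugePages_Free and Hugepagesize fields in /proc/meminfo; the function names and the SAMPLE text (including its values) are made up for illustration.

```python
def parse_hugepage_info(meminfo_text):
    """Extract the HugePages_* counters and Hugepagesize from /proc/meminfo text."""
    info = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and (key.startswith("HugePages_") or key == "Hugepagesize"):
            info[key] = int(rest.split()[0])   # first token is the number
    return info


def read_hugepage_info(path="/proc/meminfo"):
    """Read and parse the huge page counters of the running system (Linux only)."""
    with open(path) as f:
        return parse_hugepage_info(f.read())


# Hypothetical excerpt, used so the sketch also runs away from the PS3.
SAMPLE = """MemTotal:       200000 kB
HugePages_Total:     7
HugePages_Free:      7
Hugepagesize:    16384 kB
"""

info = parse_hugepage_info(SAMPLE)
print("huge pages allocated:", info["HugePages_Total"])
```

On the target system, calling read_hugepage_info() after boot would show whether moving the allocation earlier yielded the extra pages.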
memory transfers, in practice they are not difficult to implement at all. Given that the CELL processor implements a global addressing scheme, in which each local store can be accessed by its effective address, local store to local store communication can be implemented as follows:
* The PPE retrieves the effective address of each local store by calling the spe_get_ls() function.
* The PPE passes the list of addresses of all local stores to all SPEs, through a DMA transfer.
* On the SPE side, a communication buffer is declared as a global variable and as a result has the same physical address within the local store on all SPEs.
* An SPE sums the physical buffer address with the effective address of the local store of another SPE to get the address of the remote buffer. It uses this address as the source address to pull data from the other SPE, or as a destination address to push data to the other SPE.

Local store to local store communication may prove invaluable not only for bulk data transfers, but also for synchronization between SPEs. One thing to remember here is the subvector alignment of source and destination for subvector length transfers.

Mailboxes are a convenient mechanism for sending short, 32-bit messages from the PPE to the SPEs and between the SPEs. The mailboxes are First-In-First-Out (FIFO) queues, meaning the messages are processed in the order of their issue. Each SPE has a four-entry mailbox for receiving incoming messages from the PPE and other SPEs, and two one-entry mailboxes for sending outgoing messages to the PPE and other SPEs, one of which serves the purpose of raising an interrupt on the receiving device.

Mailbox operations are blocking on the SPE. An attempt to write to a full outbound mailbox will stall until the mailbox is cleared by a PPE read. Similarly, an attempt to read from an empty inbound mailbox will stall until the PPE writes to the mailbox. The same does not apply to the PPE. Neither an attempt to write to a full mailbox nor an attempt to read an empty mailbox will stall the PPE.

Mailboxes are useful to communicate short messages, such as completion flags or progress status. They can also serve the purpose of communicating short data, such as storage addresses and function parameters. The blocking nature of the mailboxes on the SPE side makes them perfect for the PPE to initiate actions on the SPEs. However, for two reasons they should not be used by the SPEs to acknowledge completion of operations to the PPE. First, DMA completion has a local meaning on the SPE. In other words, completion of a DMA on the SPE means that the local buffers are available for reuse, but not that the data made it to the memory. If a DMA transfer is immediately followed by an acknowledgment mailbox message, the message can make it to the PPE before the data. Second, the PPE continuously reading the SPE's outbound mailbox will flood the bus, causing loss of bandwidth.

A better way of acknowledging completion of an operation or a data transfer from an SPE to the PPE is to use an acknowledgment DMA protected by a fence with respect to the data transfer DMA. The PPE can periodically test the memory location (variable) written to by the SPE, or even spin (busy wait) on the
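Returning to the local store to local store scheme listed earlier in this section, the address computation itself is a single addition. The sketch below models it in Python purely for illustration. The base addresses are invented stand-ins for what spe_get_ls() would return, the 256 KB local store size is stated here as an assumption, and remote_buffer_ea is a hypothetical helper, not an SDK function.

```python
LS_SIZE = 256 * 1024   # local store size in bytes (assumed: 256 KB per SPE)


def remote_buffer_ea(ls_base_eas, target_spe, buffer_ls_offset):
    """Effective address of the communication buffer inside another SPE's local store.

    ls_base_eas      -- effective address of each local store, indexed by SPE number
    target_spe       -- index of the SPE whose buffer is to be read or written
    buffer_ls_offset -- address of the buffer within a local store (same on all SPEs)
    """
    if not 0 <= buffer_ls_offset < LS_SIZE:
        raise ValueError("buffer offset lies outside the local store")
    return ls_base_eas[target_spe] + buffer_ls_offset


# Invented base addresses for six SPEs, standing in for spe_get_ls() results.
LS_BASE_EAS = [0xF0000000 + i * 0x100000 for i in range(6)]
BUFFER_OFFSET = 0x8000   # hypothetical location of the global buffer in each local store

ea = remote_buffer_ea(LS_BASE_EAS, target_spe=2, buffer_ls_offset=BUFFER_OFFSET)
print(hex(ea))
```

Because the buffer is a global variable at the same local store address on every SPE, each SPE needs only the table of base addresses to reach any of its peers.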
* Perform all your initializations and setup on the PPE. Prepare your data for processing on the SPE in a contiguous memory buffer. There is a caveat to this approach, which is that initializations performed by the PPE will move the data to the L2 cache, which slows down the access by the SPEs. Ignore this fact when learning the ropes of CELL programming.
* Write a simple DMA. Implement a copy operation. Make the SPE copy a chunk of data from one place in main memory to another. Verify the operation for correctness on the PPE.
* Introduce a processing stage on the SPE between the read and write operation. Do not vectorize yet, but use standard, scalar, C code. It will compile and run out of the box. Copy-paste the same code into the PPE correctness check code. Do not expect bit-wise correctness. SPEs do not implement IEEE compliant floating point arithmetic. In most cases, expect errors on the order of the precision used (machine epsilon).
* Start vectorizing the SPE code. Keep checking for correctness against the scalar code on the PPE. Learn the process gradually. Introduce more and more vectorized code, while keeping some scalar code.
* Start measuring your execution time on the SPE by using the decrementer. Start looking at the output of the spu-timing tool. Your ultimate goal is to eliminate loop overhead via unrolling and to maximize the dual issue rate by mixing arithmetic and data manipulation instructions.
* Try to introduce double-buffering to hide the communication. It is analogous to the similar technique in computer graphics, where the contents of one frame are being displayed while another is being produced in memory. Fetch the data for the upcoming loop iteration N + 1 while processing the data in the current loop iteration N.
* Devise a data or work partitioning strategy, and distribute your task to many SPEs. At the beginning you can use the PPE to synchronize the SPEs. Later on you can try to make the SPEs synchronize with each other. The main synchronization mechanisms are mailboxes, DMAs and signals. One straightforward method of synchronization is a DMA transfer of a single variable, which can be polled by the PPE or the SPE in order to determine if an operation completed. Remember to declare such synchronization variables as volatile to prevent the compiler from optimizing them out.
Figure 6.4: Double buffering [2].
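The control flow of double buffering can be sketched independently of the CELL SDK. The Python model below is an illustration only: a single worker thread stands in for the asynchronous DMA engine, and fetch, process and double_buffered are hypothetical names. On the SPE the same structure would use two local store buffers and tagged DMA transfers instead of a thread.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(data, i, chunk):
    """Stand-in for a DMA get: returns a copy of chunk number i."""
    return data[i * chunk:(i + 1) * chunk]


def process(block):
    """Stand-in for the compute kernel applied to one buffer."""
    return [2 * x for x in block]


def double_buffered(data, chunk):
    """Process `data` chunk by chunk, fetching chunk N + 1 while chunk N is processed."""
    n_chunks = (len(data) + chunk - 1) // chunk
    out = []
    if n_chunks == 0:
        return out
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(fetch, data, 0, chunk)       # prime the first buffer
        for i in range(n_chunks):
            current = pending.result()                    # wait for buffer N
            if i + 1 < n_chunks:
                # Start the transfer for iteration N + 1 before computing on N.
                pending = dma.submit(fetch, data, i + 1, chunk)
            out.extend(process(current))                  # overlaps with the fetch
    return out


result = double_buffered(list(range(10)), chunk=4)
print(result)
```

The key point is the ordering inside the loop: the next transfer is issued before the current buffer is processed, so communication and computation overlap.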
Quick Tips
In this section we give a few more programming tips that you may find useful:
* Tweak performance using static analysis and the decrementer. Traditionally, programmers rely on performance counters to identify performance problems. The values typically measured include cache misses (L1, L2, ...), TLB misses, and the number of floating point operations. From the SPE standpoint, there is little use in trying to measure these values. Cache misses are nonexistent, TLB misses can be eliminated by using huge TLB pages, and since the SPE's pipelines implement in-order execution, code behavior is precisely defined by the object code. By the same token, the performance of the SPE code operating on the local store can be analyzed by looking at the object/assembly code. The spu-timing tool comes in handy here. If you desire to measure the actual execution time, or measure the time of operations that exhibit variable performance, like DMA data transfers, the indispensable tool is the SPE decrementer, a hardware register that ticks with a fixed frequency and can be read and written by the user.
* Use Unholy Practices. Forget what they taught you in freshman programming courses, and throw political correctness out the window when getting up to speed on the CELL. You want to write code that is well structured, readable, and maintainable, but do yourself a favor and free yourself from the burden of thinking about portability at this time. Portability is a feature of well established programming models defined by standardized APIs, which are in their infancy on the CELL processor. In particular:
  - Use explicit type casts. Save yourself the effort of creating unions to operate on array elements in a vector as well as scalar way. Explicitly cast vector and scalar types to each other. Do not be intimidated by casting addresses (pointers) in 32-bit mode to unsigned integers. They are both just numbers. That is why you can pass them around in mailboxes. The CELL architecture keeps you conscious about type sizes anyway by enforcing alignment and size restrictions on DMA communication.
  - Use pointer arithmetic. Pointer arithmetic is a common practice when programming for performance and indispensable with aggressive loop unrolling. Pay attention to the type when incrementing a pointer. Incrementing a vector pointer by one moves it forward by a number of scalar elements (e.g., four floats).
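The effect of the pointer type on pointer arithmetic can be demonstrated without CELL hardware. The sketch below uses Python's ctypes module as a stand-in for C: Vec4f is a hypothetical model of a 128-bit vector of four floats, not an SDK type, and the same reasoning applies to real vector pointers in SPE code.

```python
import ctypes

# Hypothetical stand-in for a 128-bit vector type holding four floats.
Vec4f = ctypes.c_float * 4

# Eight scalar floats: 0.0, 1.0, ..., 7.0.
scalars = (ctypes.c_float * 8)(*[float(i) for i in range(8)])

# Explicit cast: view the same memory through a pointer to vectors.
vec_ptr = ctypes.cast(scalars, ctypes.POINTER(Vec4f))

# Advancing the vector pointer by one skips four scalar elements.
second_vector = list(vec_ptr[1])

# Addresses are just numbers: the step between consecutive vectors in bytes.
byte_step = ctypes.addressof(vec_ptr[1]) - ctypes.addressof(scalars)
elements_per_step = byte_step // ctypes.sizeof(ctypes.c_float)

print(second_vector, byte_step, elements_per_step)
```

The same memory is reachable through both the scalar and the vector view, which is what makes explicit casts a practical replacement for unions.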
PlayStation 3 Cluster
Distributed memory programming on a cluster of CELL/PS3 nodes is not conceptually different from programming for any other cluster. MPI remains the method of choice for the SPMD programming paradigm. The most important difference comes from the fact that, locally, computations are offloaded to the SPEs. In order to take advantage of this situation, it is worth making a distinction between two cases:
1. The PPE is not involved in local computations at all: this is usually the case where elementary operations are submitted to the SPEs.
2. The PPE is involved in local computations: this is the case where, for example, complex local computations have to be performed, and the PPE has to schedule the elementary tasks that compose the operation.
In both cases, it is possible to overlap MPI communications with local computations, but different approaches have to be used. Let us start with the easier case where the PPE is not involved. In this case the PPE is idle while the SPEs carry on the local computations and thus can take care of performing the MPI communications. Take as an example the simple code in Figure 6.6. This code contains a simple loop where some data is copied into a buffer by means of the copy_into_buf routine. Then a communication is performed (the MPI_comm routine denotes a general MPI communication routine). Once the data has been sent/received, the local computations are started on the SPEs with the start_SPEs routine; the last step of the loop consists of waiting until the SPE computations are completed in the
* Streaming model is one where the SPEs are arranged in a pipeline, where each of them applies a particular computational kernel to the data that passes through it. The model is very attractive due to the fact that the internal bandwidth greatly exceeds the main memory bandwidth. Load balancing may become an issue if the pipeline stages do not have a near equal amount of work.
* Shared memory multiprocessor model can be utilized thanks to the DMA cache coherency capabilities. A conventional shared memory store is replaced by a combination of a store to the local store and a DMA to shared memory, with the PPE and all SPEs assigned to the same address space. Atomic update primitives can be used by utilizing the DMA lock line commands.
* Asymmetric thread runtime model is extremely flexible and widespread on conventional SMPs. On the SPEs, however, it would be very costly to implement full preemptive task switching, and some other model would have to be implemented, e.g., FIFO run-to-completion.

Aside from this taxonomy, it is important to notice that the advent of a multi-core processor brings different communities together. In particular, the CELL processor seems to ignite similar enthusiasm in the HPC/scientific community, the DSP/embedded community, and the GPU/graphics community. By the same token, the world of programming techniques proposed for the CELL is as diverse as the involved communities and includes shared-memory models, distributed memory models, and stream processing models, to name the most prominent ones. In this chapter, we present a brief overview of a few emerging frameworks for programming the CELL processor.
7.1 CorePy
CorePy is a research project at Indiana University, freely available for evaluation (http://www.corepy.org). CorePy is a library for rapid application development on the CELL processor that lets developers create SPU and PPU programs using the Python programming language. At its heart, CorePy is a complete replacement for assembly-level programming on the CELL. It provides an API that includes Python functions for every PowerPC, VMX, and SPU instruction. These functions can be used to build highly optimized sub-programs, called synthetic programs, at run time. Once created, synthetic programs can be executed directly from Python on an SPU or PPU, synchronously or asynchronously. By combining very low-level code with a high-productivity language, CorePy enables new approaches to developing high-performance applications.

In addition to the instruction-level APIs, CorePy includes libraries of components that abstract common operations. The Variable library provides objects for common data types with semantics similar to C data types. Instead of writing out the SPU instructions to add two vector registers, the Variable library lets developers use Python expressions to generate the instructions as the expression is evaluated. In the same vein, the Iterator library is a collection of Python iterators that generate optimized loops using Python syntax. The Iterator library contains iterators for simple array iteration (scalar and vector), double-buffering between main memory and SPU local store, loop unrolling, and automatic block decomposition for executing loops across multiple SPUs.

Synthetic programs can interact with any data available to the Python interpreter, making it possible to use synthetic programs in conjunction with other Python libraries, such as NumPy, for high-
the length of the critical path, and the technique of bundling is utilized to satisfy constraints to enable dual issue. Special attention is devoted to specific architectural constraints of the SPEs. One example is the prevention of instruction fetch starvation due to the fact that a single local memory port is shared between memory instructions and instruction fetch. Another example is efficient branch hinting with constraints on the distance between the hint and the branch. The compiler relies on the existing infrastructure of the IBM XL compiler and includes a high-level optimizer called the Toronto Portable Optimizer (TPO), which applies both intraprocedural optimizations and interprocedural optimizations.

Automatic parallelization is based on the OpenMP programming model, which provides the programmer with the abstraction of a single shared-memory address space. OpenMP directives can be used to identify parallel code regions, and the compiler takes care of producing code sections for the PPE and the SPEs. The PPE executes the OpenMP master thread, which uses the runtime library to distribute the work to the SPEs. The SPEs use DMAs to fetch tasks from a queue. The compiler-controlled software cache mechanism is utilized to improve performance of the single shared-memory abstraction. Automatic code partitioning and the mechanism of overlays are employed collaboratively by the compiler and the runtime environment to reduce the impact of local store limitations.
RapidMind
RapidMind [11] is a commercial product (http://www.rapidmind.net/), which fully supports the CELL processor, with clients that actually use it in shipping products, e.g., RTT RealTrace (http://www.rtt.ag). The RapidMind development environment provides extensions to common programming languages like C and C++ for the development of high performance applications that can be run on multi-core processors like the CELL Broadband Engine, GPUs, etc. The source code is translated by means of a Just-in-Time compiler into a parallel program that is capable of exploiting the multiple execution units present on the target host. RapidMind provides a software development platform that allows the developer to use standard C++ programming to create high performance and massively parallel applications that run on GPUs, CELL processors and other multi-core CPUs. The RapidMind platform acts as an embedded programming language inside C++. It is built around a small set of types that can be used to capture and specify arbitrary computations. Arbitrary functions, including control flow, can be specified dynamically. Parallel execution is primarily invoked by applying these functions to arrays, which generates new arrays. Access patterns on arrays allow data to be collected and redistributed. Collective operations, such as scatter, gather, and programmable reduction, support other standard parallel communication patterns and complete the programming model. The RapidMind platform includes an extensive runtime component as well as interface and dynamic compilation components. The runtime component automates common tasks such as task
queuing, data streaming, data transfer, synchronization, and load-balancing. It asynchronously manages tasks executing on remote processors and manages data transfers to and from distributed memory. This runtime component provides a framework for efficient parallel execution of the computation specified by the main program. The way program objects are built is similar to the way OpenGL display lists are built, with the important addition that there is also a mechanism for binding input and output parameters. Of course, in this case program objects store computations, not geometry. Although you can capture and freeze any control flow used when the program is built, which is handy for explicitly unrolling loops and compiling out overhead, RapidMind also supports dynamic control flow on the target with FOR/IF/WHILE etc. keywords.
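The capture-then-apply idea described above can be illustrated with a toy embedded language. The sketch below is written in Python rather than C++, and every name in it (`Expr`, `Program`, `wrap`, `horner`) is hypothetical; it does not reproduce the RapidMind API. It only shows how operations can be recorded into a program object while a builder function runs, how a host-language loop is frozen (unrolled) at capture time, and how the recorded program is then applied to arrays to produce a new array.

```python
class Expr:
    """Node of a recorded computation (hypothetical, not the RapidMind API)."""

    def __init__(self, op, args):
        self.op, self.args = op, args

    def __add__(self, other):
        return Expr("add", (self, wrap(other)))

    def __mul__(self, other):
        return Expr("mul", (self, wrap(other)))

    # Addition and multiplication commute, so the reflected forms can reuse them.
    __radd__ = __add__
    __rmul__ = __mul__

    def evaluate(self, env):
        if self.op == "param":
            return env[self.args[0]]
        if self.op == "const":
            return self.args[0]
        a, b = (arg.evaluate(env) for arg in self.args)
        return a + b if self.op == "add" else a * b


def wrap(value):
    """Turn a plain number into a constant node; leave nodes untouched."""
    return value if isinstance(value, Expr) else Expr("const", (value,))


class Program:
    """Stores a computation captured once, then applies it elementwise."""

    def __init__(self, builder, *param_names):
        self.param_names = param_names
        params = [Expr("param", (name,)) for name in param_names]
        self.body = builder(*params)  # the builder runs once; ops are recorded

    def __call__(self, *arrays):
        # Applying the program to input arrays yields a new array.
        return [self.body.evaluate(dict(zip(self.param_names, elems)))
                for elems in zip(*arrays)]


def horner(x):
    # This Python loop executes at capture time, so the recorded program
    # contains the fully unrolled polynomial 1 + 2x + 3x^2.
    coeffs = [1.0, 2.0, 3.0]
    acc = wrap(coeffs[-1])
    for c in reversed(coeffs[:-1]):
        acc = acc * x + c
    return acc


poly = Program(horner, "x")
values = poly([0.0, 1.0, 2.0])  # [1.0, 6.0, 17.0]
```

A real platform would compile the recorded graph for the target processor instead of interpreting it, and would add constructs for control flow that must remain dynamic on the target.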
PeakStream
PeakStream [12] is a comprehensive application development platform (http://www.peakstreaminc.com/) designed to maximize the floating point power of multi-core processors, such as the x86 processor family and GPUs. Although the CELL processor is not currently supported, the model is quite applicable to programming the CELL. PeakStream consists of four major components: the PeakStream APIs, the PeakStream VM, the PeakStream Profiler and the PeakStream Debugger. The PeakStream Platform supports use of the C and C++ languages for application development. Language bindings to platform operations are provided by a set of header files and shared libraries. The application is coded to use the PeakStream APIs and linked against the PeakStream Virtual Machine (VM) libraries. The libraries handle all of the interaction details with the processor. When the application uses the PeakStream APIs to perform mathematical operations, e.g., addition or use of math library functions like exp, those API calls are processed by the Virtual Machine. The Virtual Machine creates optimized parallel kernels that are executed on the processor on a Just-in-Time (JIT) basis. The application must use explicit I/O calls (read and write) to move data into and out of the VM. Use of arrays as the fundamental data type in the PeakStream Platform, coupled with dynamic translation of programs, has the effect of decoupling the application programming model from the programming model of the processor being used. The VM has detailed knowledge of the specific processor being used (GPU/CPU). It performs the optimizations necessary to make the application's array-based code perform well and deliver results with adequate precision and accuracy.
[Figure 8.1 plots: axis Gflop/s; legend entries include DSGESV and DP peak]
Figure 8.1: Mixed-precision solver based on LU factorization [21] (left), mixed-precision solver based on Cholesky factorization [23] (right).
achieve high performance even for BLAS-2 dense operations (though they are relatively much easier to implement). The phenomenon that makes the dense operations run close to the peak floating point performance is the surface-to-volume effect. Typically, in the case of a dense problem of size $n$, $n^3$ operations are performed on $n^2$ data. If the problem size is large enough, the communication will take less time than the computation and can be completely hidden. Unfortunately, the surface-to-volume effect does not apply to sparse operations. Since, for sparse operations, communication is more expensive than computation, a perfectly optimized code will only run, at most, at the speed of the memory bus. In a sparse matrix-vector product operation, $2 \times nnz$ floating-point operations are performed (where $nnz$ is the number of nonzeroes in the matrix) and $nnz + O(n)$ values are transferred from memory to the processor (the $O(n)$ term comes from the movement of the source and destination vectors). Thus, assuming that the CELL bus speed is 25.6 GB/s and that a single precision floating-point value is four bytes, the upper-bound speed for this operation is:

$$\mathit{perf} = \frac{2 \times nnz}{(nnz + O(n)) \times 4 / 25.6} < 12.8\ \text{Gflop/s}$$
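The bound can be checked numerically. The short Python sketch below evaluates the same expression for concrete values; the function name `spmv_upper_bound_gflops` is made up for illustration, and modelling the $O(n)$ term as `vector_factor * n` (source plus destination vector) is an assumption, as is the example matrix size. Like the text, it counts only floating-point values and ignores index traffic.

```python
def spmv_upper_bound_gflops(nnz, n, bandwidth_gb_s=25.6, bytes_per_value=4,
                            vector_factor=2):
    """Bandwidth-limited bound (in Gflop/s) for a sparse matrix-vector product.

    Flops performed: 2 * nnz. Values moved: nnz + vector_factor * n, where
    vector_factor * n is an assumed stand-in for the O(n) term.
    """
    flops = 2.0 * nnz
    bytes_moved = (nnz + vector_factor * n) * bytes_per_value
    seconds = bytes_moved / (bandwidth_gb_s * 1e9)
    return flops / seconds / 1e9


# With a negligible O(n) term the bound approaches 12.8 Gflop/s.
limit = spmv_upper_bound_gflops(nnz=10**9, n=0)

# Hypothetical matrix with five nonzeroes per row: the O(n) term now matters.
typical = spmv_upper_bound_gflops(nnz=5_000_000, n=1_000_000)
```

For the hypothetical matrix the bound drops noticeably below the limiting value, since the vector traffic is no longer small compared with the matrix traffic.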
This upper bound means that, at best, it is possible to achieve an efficiency of 12.8/204.8 ≈ 0.06 = 6%. For other standard architectures (like the x86 family), the efficiency can be 30% or more. It is, of course, a higher fraction of a much lower peak. Even though 12.8 Gflop/s is a remarkable speed (higher than conventional processors), in practice the maximum possible performance can be much lower because, in general, the O(n) term is not negligible at all. Moreover, it is very difficult to transfer and crunch the data at full speed because it can be very hard to apply the programming rules described in section 6.1. The main problems come
Vectorization: It is a pretty hard task to vectorize operations on sparse data. Conventional approaches are the usage of block storage formats (like BCSR), diagonal storage formats (like JAD), or the use of techniques like the segmented scan. In the first case, there is substantial overhead due to the presence of fill-in elements that have to be introduced when the matrix does not have an intrinsic block structure. The diagonal storage formats and the segmented scan technique rely on the availability of gather operations in the ISA that are not included in the CELL instruction set.
Algorithm 2 SUMMA
for i = 1 to n/nb do
    if I own Ai then
        Copy Ai in buf1
        Bcast buf1 to my proc row
    end if
    if I own Bi then
        Copy Bi in buf2
        Bcast buf2 to my proc column
    end if
    C = C + buf1 × buf2
end for
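To make the data movement in Algorithm 2 concrete, the following pure-Python sketch simulates SUMMA serially on a p × p process grid. It is a simplified illustration, not the implementation discussed in the text: there is no MPI, the "broadcasts" are plain lookups of the owner's block, and the panel width nb is assumed to equal the local block size so that the loop runs for exactly p steps. The helper names (`matmul`, `add_into`, `block`, `summa`) are hypothetical.

```python
def matmul(a, b):
    """Plain triple-loop product of two dense matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add_into(c, d):
    """Accumulate matrix d into matrix c in place."""
    for i, row in enumerate(d):
        for j, v in enumerate(row):
            c[i][j] += v


def block(m, bi, bj, nb):
    """Extract the nb x nb block at block coordinates (bi, bj)."""
    return [row[bj * nb:(bj + 1) * nb] for row in m[bi * nb:(bi + 1) * nb]]


def summa(a, b, p):
    """Simulate SUMMA on a p x p grid; len(a) must be divisible by p."""
    n = len(a)
    nb = n // p
    grid = [(r, c) for r in range(p) for c in range(p)]
    # Each simulated process (r, c) owns one block of A, B and C.
    A = {rc: block(a, rc[0], rc[1], nb) for rc in grid}
    B = {rc: block(b, rc[0], rc[1], nb) for rc in grid}
    C = {rc: [[0] * nb for _ in range(nb)] for rc in grid}
    for k in range(p):
        for r, c in grid:
            buf1 = A[(r, k)]  # block broadcast along process row r by its owner
            buf2 = B[(k, c)]  # block broadcast along process column c by its owner
            add_into(C[(r, c)], matmul(buf1, buf2))  # local GEMM update
    # Reassemble the distributed result into one matrix for checking.
    return [[C[(i // nb, j // nb)][i % nb][j % nb] for j in range(n)]
            for i in range(n)]


a = [[i * 4 + j for j in range(4)] for i in range(4)]
b = [[(i + j) % 3 for j in range(4)] for i in range(4)]
c = summa(a, b, p=2)
```

The result `c` can be compared against `matmul(a, b)` to confirm that the blockwise accumulation reproduces the ordinary product.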
achieves more than 97% of the peak of all six SPEs. While the local _GEMMs for step k of Algorithm 2 are executed on the SPEs, the PPE handles the communication, broadcasting the data needed for step k + 1. This yields an almost perfect overlapping of communications and computations, which makes it possible to hide the less expensive of the two phases. Figure 8.3 shows part of the execution log for the SUMMA algorithm on a 2x2 processor grid for problem size n = 6144. The blue blocks show the MPI data exchange steps while the yellow blocks are the local _GEMM operations; arrows show the message flows in the processor grid. From this figure it is possible to see that, thanks to the overlapping technique implemented, the less expensive of the communication and computation phases can be completely hidden. For the problem size represented in Figure 8.3, the local computation cost (i.e. the yellow blocks) can be hidden since communications are almost three times more expensive (see Figure 8.4 (right)). Since the algorithm is very simple, it is possible to easily produce a very accurate performance model under the assumption that, if many decoupled communications are performed at the same time, they do not affect each other performance-wise. Figure 8.4 shows the experimental results obtained running our SUMMA implementation on a PlayStation 3 processor grid. A comparison with the model that we developed is presented in the case where one SPE is used for the local computations (left) and in the case where all six available SPEs are used (right). The cost of the local computations is very small when all six SPEs are used; this means that the surface-to-volume effect only comes into play for very big problem sizes (see Figure 8.4 (right)). Due to the system limitations in memory (only 256 MB available on each node), it is not possible to have matrices bigger than 6144, which is pretty far from the point where linear speedup can be achieved (around 16000).
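The effect of overlapping can be illustrated with a very small timing model. The sketch below is not the performance model used for Figure 8.4, whose details are not given here; it is a generic pipeline estimate under the assumption that every step has the same communication time `t_comm` and computation time `t_comp`, and that the broadcast for step k + 1 runs concurrently with the local computation of step k. The function name and the example numbers are hypothetical.

```python
def summa_time_model(steps, t_comm, t_comp, overlap=True):
    """Estimate total time for `steps` SUMMA iterations.

    Without overlap every step pays communication plus computation. With
    overlap, only the first broadcast and the last local computation are
    exposed; in between, each step costs the more expensive of the two phases.
    """
    if not overlap:
        return steps * (t_comm + t_comp)
    return t_comm + (steps - 1) * max(t_comm, t_comp) + t_comp


# Illustrative numbers only: communication three times as costly as computation.
serial = summa_time_model(steps=8, t_comm=3.0, t_comp=1.0, overlap=False)
pipelined = summa_time_model(steps=8, t_comm=3.0, t_comp=1.0)
```

With these inputs the pipelined estimate is dominated by the communication time, which mirrors the situation described for all six SPEs: the cheaper phase disappears from the total, but the more expensive one still sets the pace.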
When only one SPE is used on each node, the cost of the local computations is relatively much higher (a factor of six), and thus the surface-to-volume effect is reached at problem sizes within the limits of the system,
of the Barcelona Supercomputing Center (http://www.bsc.es/), which hosts plenty of documentation and also serves as a software repository. Finally, the CellPerformance website (http://www.cellperformance.com/) contains useful software installation guidelines and practical performance tweaking tips.
Future
One of the major shortcomings of the current CELL processor for numerical applications is the relatively slow speed of its double precision arithmetic. The next incarnation of the CELL processor is going to include a fully-pipelined double precision unit, which will deliver 12.8 Gflop/s from a single SPE clocked at 3.2 GHz, and 102.4 Gflop/s from an 8-SPE system, which is going to make the chip a very strong competitor in the world of scientific and engineering computing. Although in agony, Moore's Law is still alive, and we are entering the era of billion-transistor processors. Against that backdrop, the current CELL processor employs a rather modest 234 million transistors. It is not hard to envision a CELL processor with more than one PPE and many more SPEs, perhaps reaching a performance of one Tflop/s on a single chip. That is still speculation though.
Bibliography
[1] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy. Introduction to the Cell multiprocessor. IBM J. Res. & Dev., 49(4/5):589-604, 2005. http://www.research.ibm.com/journal/rd/494/kahle.pdf.
[2] IBM. Cell Broadband Engine Programming Handbook, Version 1.0, April 2006.
[3] IBM. IBM Full-System Simulator User's Guide, Modeling Systems based on the Cell Broadband Engine Processor, Version 2.0, November 2006.
[4] IBM. Performance Analysis with the IBM Full-System Simulator, Modeling the Performance of the Cell Broadband Engine Processor, Version 2.0, November 2006.
[5] IBM. IBM Full-System Simulator Command Reference, Understanding and Applying Commands in the IBM Full-System Simulator Environment, Version 0.01, October 2005.
[6] IBM. Software Development Kit 2.0 Installation Guide, Version 2.0, December 2006.
[7] A. J. Eichenberger et al. Using advanced compiler technology to exploit the performance of the Cell Broadband Engine architecture. IBM Sys. J., 45(1):59-84, 2006. http://www.research.ibm.com/journal/sj/451/eichenberger.pdf.
[8] IBM. Cell Broadband Engine Programming Tutorial, Version 2.0, December 2006.
[9] IBM. C/C++ Language Extensions for Cell Broadband Engine Architecture, Version 2.2.1, November 2006.
[10] IBM. SPU Assembly Language Specification, Version 1.4, October 2006.
[11] M. McCool. Data-parallel programming on the Cell BE and the GPU using the RapidMind development platform. In GSPx Multicore Applications Conference, 2006. http://www.rapidmind.