Algorithms to Take Advantage of Hardware Prefetching

Shen Pan, Cary Cherng, Kevin Dick, Richard E. Ladner

Abstract

Cache-oblivious and cache-aware algorithms have been developed to minimize cache misses. Some of the newest processors have hardware prefetching, where cache misses are avoided by predicting ahead of time what memory will be needed in the future and bringing that memory into the cache before it is used. It is shown that hardware prefetching permits the standard Floyd-Warshall algorithm for all-pairs shortest paths to outperform cache-oblivious and cache-aware algorithms. A simple improvement to the standard simple dynamic programming algorithm yields an algorithm that takes advantage of prefetching and outperforms cache-oblivious and cache-aware algorithms. Finally, it is shown that variants of standard FFT algorithms exhibit good prefetching performance.

Amazon, 605 5th Ave. South, Seattle, WA 98104. shenpan@amazon.com
Google Kirkland, 720 4th Ave, Suite 400, Kirkland WA 98003. cary@google.com
Computer Science Department, California Institute of Technology, 1200 E. California Boulevard, MC 256-80, Pasadena, CA 91125. kdick@caltech.edu
Department of Computer Science and Engineering, Box 352350, University of Washington, Seattle, WA 98195. ladner@cs.washington.edu
Copyright by SIAM. Unauthorized reproduction of this article is prohibited.

1 Introduction.

The memory subsystem on modern computers is ubiquitously structured in a hierarchy with registers in the lowest level followed by the L1 cache, L2 cache, main memory and external memory such as hard disks, with memory access time increasing quickly from lower levels to higher levels. For the sake of discussion we consider a two-level model that consists of a cache of size M and an arbitrarily large main memory partitioned into blocks of size B. If a requested byte is not stored in the cache, the entire memory block where it resides is brought into the cache, and we call this a cache miss. The I/O complexity of an algorithm therefore becomes the number of blocks transferred upon cache misses between these two levels.

Cache Oblivious algorithms are algorithms that do not use knowledge of M and B, yet still have good cache performance. On the other hand, Cache Aware algorithms do use knowledge of M and B of the host machine to optimize their cache performance. Together, they are called Cache Efficient algorithms. Many cache efficient algorithms have been developed that have superior performance to standard algorithms for the same problems. However, a recent development in processor design, hardware prefetching, raises the question as to whether some of these custom cache efficient algorithms are always needed to reduce cache misses. With hardware prefetching, cache misses are avoided by predicting ahead of time what data in memory will be needed in the future and bringing that data into the cache before it is used. In this paper we explore this question and discover that hardware prefetching can be exploited to yield fast algorithms for the all-pairs shortest paths problem, simple dynamic programming, and the Fast Fourier Transform (FFT).

2 Hardware Prefetcher in the Pentium 4.

In the Pentium 4 processor, associated with the L2 cache is a hardware prefetcher [7] that monitors data access patterns and prefetches data automatically into the L2 cache. It attempts to stay 256 bytes ahead of the current data access locations. This prefetcher remembers the history of cache misses to detect concurrent, independent streams of data that it tries to prefetch ahead of use in the program. It follows one stream per 4 KB page (either load or store) and can prefetch up to 8 simultaneous independent streams from eight different 4 KB regions.

The hardware prefetcher also has a few weaknesses. First of all, it requires rather regular memory access patterns. Moreover, a start-up penalty applies before the hardware prefetcher triggers, and there might be unnecessary fetches after the end of an array is reached. For short arrays this overhead can reduce the effectiveness of the hardware prefetcher.

To understand the range and efficiency of the prefetcher, we timed sequences of array accesses with and without the prefetcher enabled. Prefetcher activation is controlled by setting bits 9 and 19 of the IA32_MISC_ENABLE model-specific register. More information can be found in Appendix B of Volume 3B of the Intel 64 and IA-32 Architectures Software Developer's Manual [6]. This study was performed on a machine running Linux 2.6.16-16 using a 3.4 GHz Pentium 4 processor with an 8 KB L1 cache (8-way set associative with 64 B lines) and a 2 MB L2 cache (8-way set associative with 64 B lines). It had 1 GB of main memory and used g++ version 4.1 (Red Hat 4.1.1-1, with optimization -o).

The program first allocates a large array and then traverses it ten times, each time reading from and then writing to every n-th byte, where n is the stride length. The results, shown in Figure 1, use an array of forty million bytes with a stride varying from one to five hundred. The normalized time is reported, meaning the stated values are proportional to the time needed per array access. The given measurements are the medians of seven trials. Running this experiment for other large array sizes gave similar results.

Figure 1: Normalized running time in seconds of sequential array accesses. [Plot: normalized running time versus stride length, up to 500; series: Prefetcher Enabled, Prefetcher Disabled.]

With the prefetcher disabled, we expect the normalized time to depend heavily on the number of cache misses. The L1 and L2 caches use blocks of 64 bytes, so for strides of 63 and under, accesses to elements already brought into the cache by previous operations come at a low cost. When the stride length is at least 64, we are effectively measuring the time taken (without normalization) for $10 \cdot (4 \times 10^7)/n$ cache misses. When the prefetcher is enabled and $n \le 256$, elements of the array will be brought into the L2 cache ahead of use. This gives us some improvement, since many elements that would have been drawn from the main memory are instead pulled from the L2 cache. However, the hardware prefetcher requires a few initial misses before it can start prefetching, and it only prefetches from main memory into the L2 cache [7]. Furthermore, the overhead required by the prefetcher actually slows down the array accesses for large strides which are out of range of the prefetcher. These results suggest that hardware prefetching can give significant speedup with sequential accesses to memory that are close together, but that prefetching can actually slow down accesses that are spaced far apart.

3 Cache Efficient Algorithms.

Expert programmers have known for many years that reducing the number of cache misses can significantly improve the running time of programs. Much effort has been put into designing cache efficient versions of various dynamic programming algorithms. These algorithms work by reducing the constant factor in the complexity incurred by the cache misses.

One major approach to improving the performance of the cache is to design cache-oblivious algorithms. The cache-oblivious approach is explored by Frigo et al. in [4], which discusses the cache performance of cache-oblivious algorithms for matrix transpose, FFT and sorting. Park et al. [10] presented a cache-oblivious implementation of the Floyd-Warshall algorithm for the fundamental graph problem of all-pairs shortest paths by relaxing some dependencies in the iterative version. The cache-oblivious algorithm runs roughly 7 times faster than the Floyd-Warshall algorithm on a Pentium 3 machine. Chowdhury et al. [2] gave a new cache-oblivious framework called the Gaussian Elimination Paradigm (GEP) for Gaussian elimination without pivoting that also gives cache-oblivious algorithms for Floyd-Warshall all-pairs shortest paths in graphs and matrix multiplication, among other problems. New cache-oblivious and cache-aware algorithms for simple dynamic programming based on Valiant's context-free language recognition algorithm are designed, implemented, analyzed and empirically evaluated with timing studies and cache simulations by Cherng et al. [1].

A major technique in designing cache aware algorithms is blocking, that is, partitioning the problem into cache-size subproblems and solving each subproblem while its data is in the cache. In the cache-oblivious techniques, quite often the problem partitions itself naturally into smaller subproblems in a recursive way. Unfortunately, if the recursion is continued all the way to the bottom, then there is a lot of overhead from the recursion. A blocking technique that stops the recursion when the subproblem size reaches the size of the cache (say the L2 cache), and then solves the subproblem using a standard iterative approach, often yields a significantly faster program. On the negative side, the blocking technique only works if the cache size can be communicated to the program.
4 All-Pairs Shortest Paths.

In the all-pairs shortest paths problem we are given a directed graph G with vertices indexed $\{1, 2, \ldots, n\}$ and for each directed edge (i, j) an associated non-negative cost c(i, j). For each i and j we wish to find the lowest cost of all paths from i to j, where the cost of a path is the sum of the costs of the edges on the path. The Floyd-Warshall algorithm (shown in Algorithm 1) is the standard iterative dynamic programming solution to the all-pairs shortest paths problem [3]. It runs in $O(n^3)$ time and works by looking at paths with successively more and more possible interior vertices until all vertices are exhausted. The work is divided into n iterations. Initially, $X[i, j] = c(i, j)$ if (i, j) is an edge, $X[i, j] = \infty$ if (i, j) is not an edge, and $X[i, i] = 0$.

for k = 1 to n do
    for i = 1 to n do
        for j = 1 to n do
            $X[i, j] := \min(X[i, j],\ X[i, k] + X[k, j])$
        end for
    end for
end for
Algorithm 1: The Floyd-Warshall Algorithm
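Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation that was timed: the names `cost_matrix` and `floyd_warshall` are hypothetical, vertices are indexed from 0 rather than 1, and the matrix is kept in a flat row-major list so that the two row accesses of the inner loop are visible as consecutive indices.

```python
INF = float("inf")

def cost_matrix(n, edges):
    """Build the initial n x n matrix X as a flat row-major list.

    edges maps (i, j) to a non-negative cost; missing edges get INF.
    """
    x = [INF] * (n * n)
    for i in range(n):
        x[i * n + i] = 0
    for (i, j), c in edges.items():
        x[i * n + j] = c
    return x

def floyd_warshall(x, n):
    """Run Algorithm 1 in place on the flat row-major matrix x."""
    for k in range(n):
        row_k = k * n
        for i in range(n):
            row_i = i * n
            for j in range(n):
                # Row i and row k are both walked with consecutive indices.
                cand = x[row_i + k] + x[row_k + j]
                if cand < x[row_i + j]:
                    x[row_i + j] = cand
    return x

if __name__ == "__main__":
    x = cost_matrix(3, {(0, 1): 3, (1, 2): 1, (0, 2): 10})
    floyd_warshall(x, 3)
    print(x[0 * 3 + 2])  # cost of the cheapest path from vertex 0 to vertex 2
```

A Python list does not expose cache behaviour the way a C++ array does, so the sketch only shows the access pattern, not the timing effect.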

4.1 Cache Efficient Algorithms for All-Pairs Shortest Paths.

The two cache-oblivious algorithms described below are only defined for problems of size $2^k$ for some k. To solve problems whose size is not a power of two, the array can be padded appropriately.

The Gaussian Elimination Paradigm or GEP, introduced in [2], is a general cache-oblivious framework for problems. When specialized to the all-pairs shortest paths problem we obtain the recursive formulation described in Algorithm 2. If the input array X has size larger than $1 \times 1$ then it is subdivided into four equal size matrices:

$X = \begin{pmatrix} X_{11} & X_{12} \\ X_{21} & X_{22} \end{pmatrix}$

In the algorithm, when the base case is called, $k_1 = k_2$, the array X is $1 \times 1$, and $(i_1, j_1)$ is the index of the one element in the array. All-pairs shortest paths is then solved by calling $F(X, 1, n)$.

Procedure $F(X, k_1, k_2)$
if $k_1 = k_2$ then
    $X[i_1, j_1] := \min(X[i_1, j_1],\ X[i_1, k_1] + X[k_1, j_1])$
else
    $k_m := \lfloor (k_1 + k_2)/2 \rfloor$
    $F(X_{11}, k_1, k_m)$
    $F(X_{12}, k_1, k_m)$
    $F(X_{21}, k_1, k_m)$
    $F(X_{22}, k_1, k_m)$
    $F(X_{22}, k_m + 1, k_2)$
    $F(X_{21}, k_m + 1, k_2)$
    $F(X_{12}, k_m + 1, k_2)$
    $F(X_{11}, k_m + 1, k_2)$
end if
Algorithm 2: GEP Algorithm

Another cache-oblivious technique is derived from the reduction of path problems to matrix multiplication [5, 9]. For this formulation of matrix multiplication, addition is the min operation and multiplication is the + operation. The recursive step of the Matrix Multiply Paradigm or MMP algorithm can be described elegantly by Algorithm 3. If X is not $1 \times 1$ then the result $X^{*}$ is computed recursively by dividing X into submatrices of half the dimension, as in the case of the GEP algorithm. The matrix multiply and accumulate operation is also done recursively using divide-and-conquer.

$X_{11} := X_{11}^{*}$
$X_{12} := X_{11} X_{12}$
$X_{21} := X_{21} X_{11}$
$X_{22} := X_{22} + X_{21} X_{12}$
$X_{22} := X_{22}^{*}$
$X_{12} := X_{12} X_{22}$
$X_{21} := X_{22} X_{21}$
$X_{11} := X_{11} + X_{12} X_{21}$
Algorithm 3: MMP Algorithm

Cache-aware algorithms for the all-pairs shortest paths problem can be defined using blocked versions of the cache-oblivious algorithms. The blocked GEP algorithm has a parameter S such that if the subproblem size $n \le S$, then the cost submatrix is computed using the Floyd-Warshall algorithm directly. The blocked MMP algorithm has two parameters S and M. The parameter S is such that if the subproblem size $n \le S$ then X is computed using the standard iterative dynamic program (the Floyd-Warshall algorithm). The parameter M is such that if $n \le M$ then the matrix multiply and accumulate operations are done in the standard way, not with recursive divide-and-conquer.

4.2 Experimental Results for All-Pairs Shortest Paths.

We implemented the Floyd-Warshall algorithm, the GEP algorithm, the MMP algorithm, the Blocked GEP algorithm and the Blocked MMP algorithm and conducted various running time experiments. In our implementation we chose to store the matrix, which used 4-byte integers chosen randomly, in row-major order, that is, the rows of the matrix are stored in linear memory by storing row 1, then row 2, and so on. These experiments were run under Red Hat Fedora Core 4 on a 2.8 GHz Pentium 4 with an 8 KB L1 data cache (4-way associative with 64 B lines) and a 512 KB L2 data cache (8-way associative with 64 B lines). The machine in which the processor resides has 4 GB of main memory. All algorithms were implemented in C++. The compiler used was g++ 4.0 (Red Hat 4.0.2-8, with optimization -O3). In our studies of the all-pairs shortest paths algorithms the normalized time is the average of ten experiments divided by $n^3$.

Figure 2 shows the running time results for the five algorithms on the Pentium 4. The best block sizes for the Blocked GEP algorithm (S = 64) and the Blocked MMP algorithm (S = 64, M = 32) were determined experimentally on a problem size of 2048. In contrast to results from prior studies [10], the Floyd-Warshall algorithm clearly outperforms all the other algorithms. This is certainly an unexpected result because the Floyd-Warshall algorithm does not have the strong temporal locality exhibited by the cache-oblivious and cache-aware algorithms. Therefore, the Pentium 4 must have something that dramatically changes the performance characteristics of the Floyd-Warshall algorithm. Indeed, the Pentium 4 has hardware prefetching, which appears to obviate the need for special algorithms to help cache performance.

Figure 2: Normalized running time in nanoseconds of the five algorithms for the All-Pairs Shortest Paths problem on Pentium 4. [Plot: normalized running time versus problem size; series: Floyd-Warshall, GEP, MMP, Blocked GEP 64, Blocked MMP 6432.]

Figure 3 shows the running time for the five algorithms on the same Pentium 4 machine, only with the hardware prefetcher turned off. Without hardware prefetching, the Floyd-Warshall algorithm is less than half as fast as the best cache-aware algorithm, the blocked GEP 64. This is a result more consistent with the prior results [10].

Figure 3: Normalized running time in nanoseconds of the five algorithms for the All-Pairs Shortest Paths problem on Pentium 4 with the hardware prefetcher turned off. [Plot: normalized running time versus problem size; series: Floyd-Warshall, GEP, MMP, Blocked GEP 64, Blocked MMP 6432.]

Examining the Floyd-Warshall Algorithm (Algorithm 1) closely shows that the inner loop accesses two rows, the i-th and k-th, simultaneously. Thus, we have two access streams with a stride of 4 bytes each, which is very amenable to hardware prefetching.
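The recursive GEP formulation (Algorithm 2) and its blocked variant with cutoff S can be sketched in Python as below. This is an illustrative sketch under stated assumptions rather than the code that was measured: the function names, the 0-based indexing, and the way a submatrix is passed as a top-left corner plus a size are all hypothetical choices, and the matrix dimension is assumed to be a power of two.

```python
import random

INF = float("inf")

def floyd_warshall(d):
    """Reference iterative algorithm (Algorithm 1), in place."""
    n = len(d)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

def gep(d, i0, j0, k0, size, cutoff=1):
    """Procedure F on the size x size block with top-left corner (i0, j0),
    using intermediate vertices k0 .. k0 + size - 1.

    cutoff plays the role of the blocked GEP parameter S: blocks of size
    <= cutoff are handled by the plain triple loop (cutoff=1 is Algorithm 2).
    """
    if size <= cutoff:
        for k in range(k0, k0 + size):
            for i in range(i0, i0 + size):
                for j in range(j0, j0 + size):
                    if d[i][k] + d[k][j] < d[i][j]:
                        d[i][j] = d[i][k] + d[k][j]
        return
    h = size // 2
    # Forward pass: first half of the k range, quadrants 11, 12, 21, 22.
    gep(d, i0, j0, k0, h, cutoff)
    gep(d, i0, j0 + h, k0, h, cutoff)
    gep(d, i0 + h, j0, k0, h, cutoff)
    gep(d, i0 + h, j0 + h, k0, h, cutoff)
    # Backward pass: second half of the k range, quadrants 22, 21, 12, 11.
    gep(d, i0 + h, j0 + h, k0 + h, h, cutoff)
    gep(d, i0 + h, j0, k0 + h, h, cutoff)
    gep(d, i0, j0 + h, k0 + h, h, cutoff)
    gep(d, i0, j0, k0 + h, h, cutoff)

def random_costs(n, seed=0):
    """Hypothetical test input: random non-negative costs, some edges absent."""
    rng = random.Random(seed)
    d = [[INF] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                d[i][j] = 0
            elif rng.random() < 0.4:
                d[i][j] = rng.randint(1, 20)
    return d

if __name__ == "__main__":
    base = random_costs(8)
    expected = floyd_warshall([row[:] for row in base])
    got = [row[:] for row in base]
    gep(got, 0, 0, 0, 8, cutoff=2)
    print(got == expected)
```

Passing index offsets instead of copying quadrants keeps every update in the one shared matrix, which is what the base case of Algorithm 2 relies on when it reads $X[i_1, k_1]$ and $X[k_1, j_1]$ from outside the current block.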

5 Simple Dynamic Programming.

Another form of dynamic programming is called Simple Dynamic Programming in [1]. Input elements $x_1, \ldots, x_n$ of a simple dynamic program of size n come from a set X which is the domain of a non-associative semi-ring $(U, +, \cdot, 0)$, where $+$ is an associative ($x + (y + z) = (x + y) + z$), commutative ($x + y = y + x$), idempotent ($x + x = x$) binary operator and $\cdot$ is a nonassociative, noncommutative binary operator. The value 0 is the $+$-identity ($x + 0 = x$) and the $\cdot$-annihilator ($x \cdot 0 = 0 \cdot x = 0$). Finally, the operators satisfy the distributive laws ($x \cdot (y + z) = x \cdot y + x \cdot z$ and $(y + z) \cdot x = y \cdot x + z \cdot x$). The objective is to compute the sum ($+$) of all ways to generate the product ($\cdot$) of $x_1, \ldots, x_n$ in this order and under all possible groupings for the product.

The simple dynamic programming problem can be solved in $O(n^3)$ time by Algorithm 4, which is the Cocke-Kasami-Younger (CKY) algorithm [8, 13], the standard iterative algorithm to solve this problem. Initialize the $n \times n$ array D by $D[i, i] = x_i$ and $D[i, j] = 0$ if $j \ne i$. The algorithm proceeds to fill the upper-right of the matrix D one column at a time to yield the final result in $D[1, n]$. We call this algorithm the Vertical algorithm [1]. Two other iterative algorithms, the Horizontal Algorithm and Diagonal Algorithm, have the subproblems solved row by row and diagonal by diagonal, respectively.

for j = 2 to n do
    for i = j − 1 to 1 do
        for k = i to j − 1 do
            $D[i, j] := D[i, j] + D[i, k] \cdot D[k + 1, j]$
        end for
    end for
end for
Algorithm 4: Vertical Algorithm for Simple Dynamic Programming

5.1 Cache Efficient Algorithms for Simple Dynamic Programming.

The cache-oblivious algorithm for solving simple dynamic programming presented in [1] is based on Valiant's algorithm for general context-free recognition [12] and runs in $O(n^3)$ time. The algorithm, summarized as in [1], has two recursive routines, the plus ($+$) and the star ($\star$) algorithms.

Let X be a square matrix of size $n = 2^k$. Unlike the previous algorithms, the input is placed just above the diagonal in X, that is, $X[i, i + 1] = x_i$ for $1 \le i < n$ with the remainder of the array zero. This means the algorithm naturally handles input lengths of a power of two minus one. Arbitrary lengths can be handled by appropriate padding. If n = 2, then $X^{+} = X$. Otherwise, partition X into sixteen submatrices of size $n/4 \times n/4$ (nine of which are zero):

$X := \begin{pmatrix} X_{11} & X_{12} & & \\ & X_{22} & X_{23} & \\ & & X_{33} & X_{34} \\ & & & X_{44} \end{pmatrix}$

Then $X^{+}$, the DP-Closure of X, is shown in Algorithm 5, and $X^{\star}$ is computed using Valiant's Star algorithm (Algorithm 6).

$\begin{pmatrix} X_{11} & X_{12} \\ & X_{22} \end{pmatrix} := \begin{pmatrix} X_{11} & X_{12} \\ & X_{22} \end{pmatrix}^{+}$
$\begin{pmatrix} X_{33} & X_{34} \\ & X_{44} \end{pmatrix} := \begin{pmatrix} X_{33} & X_{34} \\ & X_{44} \end{pmatrix}^{+}$
$X := X^{\star}$
Algorithm 5: Valiant's DP-Closure ($+$) Algorithm

Let Y be a square matrix of size $n = 2^k$ whose upper left and lower right hand quarters are already DP-closed. If n = 2, then $Y^{\star} = Y$. Otherwise, partition Y into sixteen submatrices of size $n/4 \times n/4$ (six of which are zero).

$\begin{pmatrix} Y_{22} & Y_{23} \\ & Y_{33} \end{pmatrix} := \begin{pmatrix} Y_{22} & Y_{23} \\ & Y_{33} \end{pmatrix}^{\star}$
$Y_{13} := Y_{13} + Y_{12} Y_{23}$
$\begin{pmatrix} Y_{11} & Y_{13} \\ & Y_{33} \end{pmatrix} := \begin{pmatrix} Y_{11} & Y_{13} \\ & Y_{33} \end{pmatrix}^{\star}$
$Y_{24} := Y_{24} + Y_{23} Y_{34}$
$\begin{pmatrix} Y_{22} & Y_{24} \\ & Y_{44} \end{pmatrix} := \begin{pmatrix} Y_{22} & Y_{24} \\ & Y_{44} \end{pmatrix}^{\star}$
$Y_{14} := Y_{14} + Y_{12} Y_{24}$
$Y_{14} := Y_{14} + Y_{13} Y_{34}$
$\begin{pmatrix} Y_{11} & Y_{14} \\ & Y_{44} \end{pmatrix} := \begin{pmatrix} Y_{11} & Y_{14} \\ & Y_{44} \end{pmatrix}^{\star}$
Algorithm 6: Valiant's Star ($\star$) Algorithm

All operations performed can be done in place using matrix multiply and accumulate. It can be seen that Valiant's Algorithm has some of the flavor of the MMP algorithm, in that both are recursive reductions to matrix multiplication. An alternative is a cache-aware algorithm called the Blocked Valiant's Algorithm, which chooses two parameters S and M, the first for when to cut off recursive calls to Valiant's Star Algorithm and the second for when to cut off recursive calls to the DP-Closure and matrix multiply and accumulate operations.

5.2 Experimental Results for Simple Dynamic Programming.

Using the same experimental setup as for the all-pairs shortest paths study (cf. Section 4.2) we implemented the Vertical, Horizontal, and Diagonal Algorithms, and the cache-oblivious and cache-aware algorithms for simple dynamic programming. Figure 4 shows the results of the running time experiments for the five algorithms (S = 64 and M = 32 were the optimal parameters chosen experimentally). Figure 5 shows the same experiments except with the hardware prefetcher turned off. Unfortunately, the standard algorithm did not benefit as much from the hardware prefetcher as it did in the all-pairs shortest paths problem. The cache efficient algorithms still outperform the standard algorithms. Even though the hardware prefetcher definitely improves the running time of the standard algorithms, it is clearly not enough to counter the impact of the cache misses.

Figure 4: Normalized running time in nanoseconds of the five algorithms for Simple Dynamic Programming on Pentium 4. [Plot: normalized running time versus problem size; series: Vertical, Horizontal, Diagonal, Valiant, Blocked Valiant 6432.]

Figure 5: Normalized running time in nanoseconds of the five algorithms for Simple Dynamic Programming with the hardware prefetcher turned off on Pentium 4. [Plot: normalized running time versus problem size; series: Vertical, Horizontal, Diagonal, Valiant, Blocked Valiant 6432.]

5.3 Improving Simple Dynamic Programming.

On careful examination of the Vertical Algorithm (Algorithm 4) it can be seen that in the inner loop, the i-th row and j-th column are accessed simultaneously. For large matrices the accesses in the j-th column have a large stride because of the row-major order layout of the array. Hence, the prefetching hardware of the Pentium 4 is not effective for this algorithm. The same is true of the horizontal and diagonal algorithms. To eliminate the column access stream of the CKY algorithms we use a form of data redundancy: for $i < j$ the value $D[i, j]$ is also stored in $D[j, i]$. This enables the CKY algorithm to be implemented with two row streams rather than a row and a column stream. The Data Redundant Vertical Algorithm is described in Algorithm 7.
for j = 2 to n do
    for i = j − 1 to 1 do
        for k = i to j − 1 do
            $D[i, j] := D[i, j] + D[i, k] \cdot D[j, k + 1]$
            $D[j, i] := D[i, j]$
        end for
    end for
end for
Algorithm 7: Data Redundant Vertical Algorithm for Simple Dynamic Programming

Figure 6: Normalized running time in nanoseconds of the Data Redundancy algorithms, the standard algorithms and the cache efficient algorithms for simple dynamic programming on Pentium 4. [Plot: normalized running time versus problem size; series: Vertical, Horizontal, Diagonal, Valiant, Blocked Valiant 6432, DR Vertical, DR Horizontal, DR Diagonal.]
Data redundant algorithms based on the horizontal and diagonal algorithms can be defined similarly. Figure 6 shows the results from implementing the data redundant algorithms. The bottom two curves are from the Data Redundant Vertical and Horizontal Algorithms. On the negative side, if memory is a constraint then the data redundant algorithm, which uses twice as much memory as the standard CKY algorithm, will suffer page faults.
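The Vertical Algorithm and its data redundant variant can be compared side by side in a small Python sketch. Everything concrete here is an assumption made for illustration: the semi-ring is instantiated as context-free recognition (sets of nonterminals, with union as $+$, the empty set as 0, and rule application as the product), the toy grammar for balanced parentheses is hypothetical, indices start at 0, and the mirror assignment is placed after the k loop, which leaves the same value in $D[j, i]$ as repeating it on every iteration.

```python
# Hypothetical toy grammar in Chomsky normal form for balanced parentheses:
#   S -> L R | L T | S S,   T -> S R,   L -> "(",   R -> ")"
RULES = {
    ("L", "R"): {"S"},
    ("L", "T"): {"S"},
    ("S", "S"): {"S"},
    ("S", "R"): {"T"},
}
TERMINALS = {"(": {"L"}, ")": {"R"}}

def times(a, b):
    """Semi-ring product: nonterminals deriving some Y Z with Y in a, Z in b."""
    out = set()
    for y in a:
        for z in b:
            out |= RULES.get((y, z), set())
    return out

def init_table(word):
    """D[i][i] = x_i, every other entry is the semi-ring zero (empty set)."""
    n = len(word)
    d = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(word):
        d[i][i] = set(TERMINALS[ch])
    return d

def vertical(word):
    """Algorithm 4: reads row i and column j in the inner loop."""
    n = len(word)
    d = init_table(word)
    for j in range(1, n):
        for i in range(j - 1, -1, -1):
            for k in range(i, j):
                d[i][j] |= times(d[i][k], d[k + 1][j])
    return d[0][n - 1]

def dr_vertical(word):
    """Algorithm 7: reads row i and row j, using the mirrored lower triangle."""
    n = len(word)
    d = init_table(word)
    for j in range(1, n):
        for i in range(j - 1, -1, -1):
            for k in range(i, j):
                d[i][j] |= times(d[i][k], d[j][k + 1])
            d[j][i] = set(d[i][j])  # redundant copy below the diagonal
    return d[0][n - 1]

if __name__ == "__main__":
    for w in ["(())()", "())("]:
        print(w, "S" in vertical(w), "S" in dr_vertical(w))
```

The sketch assumes a non-empty input word. In `dr_vertical` the second operand `d[j][k + 1]` walks along row j, whereas `vertical` reads `d[k + 1][j]` down column j; that difference is the column stream the data redundancy removes.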
6 Fast Fourier Transform.

The sequential accesses needed in running the Fast Fourier Transform (FFT) make it an ideal candidate for analysis under hardware prefetching. The FFT is an algorithm used to compute the Discrete Fourier Transform (DFT) of an input array A of n complex numbers in $O(n \log n)$ time. This is given by the output array $D[0..n-1]$, where $D[k] = \sum_{j=0}^{n-1} A[j]\,\omega_n^{kj}$. Here $\omega_n$ denotes the n-th complex root of unity. We assume that n is a power of two.

The first step of the FFT is to rearrange the input array A by taking the bit-reversal permutation [3]. Whether or not the prefetcher was enabled had little effect on the speed of the bit-reversal permutation, so its details are omitted. The remainder of the FFT's computation involves several butterfly operations [3]. Each butterfly operation is just a few steps of complex arithmetic, and only the ordering of the butterfly operations is significant for study under prefetching.

m := 2
while $m \le n$ do
    for j = 0 to m/2 − 1 do
        for k = 0 to n − 1 by m do
            butterfly(A[k + j], A[k + j + m/2], j, m)
        end for
    end for
    m := 2m
end while
Algorithm 8: Downwards Fast Fourier Transform

m := 2
while $m \le n$ do
    for k = 0 to n − 1 by m do
        for j = 0 to m/2 − 1 do
            butterfly(A[k + j], A[k + j + m/2], j, m)
        end for
    end for
    m := 2m
end while
Algorithm 9: Across Fast Fourier Transform

Two standard FFT algorithms are presented, with their only difference being the ordering of the butterfly operations. The downwards method in Algorithm 8 is based on the FFT implementation of Numerical Recipes in C [11], while the across method in Algorithm 9 is based on that of Cormen et al. [3].

6.1 Prefetcher-Friendly FFT Algorithm.

Long sequences of array accesses are desirable for hardware prefetching. In the downwards method, a longer-lasting k loop gives longer sequences of accesses, while in the across method, a long j loop is preferred. This translates into small and large values of m, respectively. We can see some of the benefits of both by having the first few executions of the m loop use the downwards method and then having the remaining executions use the across method. This combination of the two standard approaches, described in Algorithm 10, requires a parameter s. This value is expected to be some power of two specifying how many iterations of the m loop will be done with the downwards method before switching.

m := 2
while $m \le s$ do
    for j = 0 to m/2 − 1 do
        for k = 0 to n − 1 by m do
            butterfly(A[k + j], A[k + j + m/2], j, m)
        end for
    end for
    m := 2m
end while
while $m \le n$ do
    for k = 0 to n − 1 by m do
        for j = 0 to m/2 − 1 do
            butterfly(A[k + j], A[k + j + m/2], j, m)
        end for
    end for
    m := 2m
end while
Algorithm 10: Prefetcher-Friendly Fast Fourier Transform

6.2 Cache-Efficient FFT Algorithm.

Another variant of this algorithm, described in Algorithm 11, is designed to be cache-aware. After applying bit-reversal, the downwards FFT is applied individually to the arrays $A[0..l-1], A[l..2l-1], \ldots, A[n-l..n-1]$. Here l is some power of two less than or equal to n. The remainder of the needed butterfly operations are done by the across method operating on the entire array. The appeal of this approach is that, for a well selected value of l, we can fill the cache (either the L1 or L2) with elements of the array and then perform much of our arithmetic without having cache misses.

for i = 0 to n − 1 by l do
    m := 2
    while $m \le l$ do
        for j = 0 to m/2 − 1 do
            for k = 0 to l − 1 by m do
                butterfly(A[i + k + j], A[i + k + j + m/2], j, m)
            end for
        end for
        m := 2m
    end while
end for
while $m \le n$ do
    for k = 0 to n − 1 by m do
        for j = 0 to m/2 − 1 do
            butterfly(A[k + j], A[k + j + m/2], j, m)
        end for
    end for
    m := 2m
end while
Algorithm 11: Cache-Efficient Fast Fourier Transform
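The prefetcher-friendly ordering of Algorithm 10 can be sketched in Python as follows. The sketch rests on assumptions that the text does not spell out: the butterfly is taken to be the usual radix-2 step with twiddle factor $\omega_m^j = e^{2\pi i j/m}$ (the sign convention of Cormen et al. [3]), the helper names are hypothetical, and n is assumed to be a power of two with n ≥ 2.

```python
import cmath

def bit_reverse_permute(a):
    """Return a copy of a with element i moved to the bit-reversed index of i."""
    n = len(a)
    bits = n.bit_length() - 1
    out = [0j] * n
    for i in range(n):
        rev = int(format(i, "0{}b".format(bits))[::-1], 2)
        out[rev] = a[i]
    return out

def butterfly(a, lo, hi, j, m):
    """Assumed radix-2 butterfly on a[lo], a[hi] with twiddle omega_m^j."""
    t = cmath.exp(2j * cmath.pi * j / m) * a[hi]
    a[lo], a[hi] = a[lo] + t, a[lo] - t

def prefetcher_friendly_fft(a, s):
    """Algorithm 10: downwards ordering while m <= s, across ordering after."""
    a = bit_reverse_permute(a)
    n = len(a)
    m = 2
    while m <= min(s, n):          # downwards: j outer, k inner
        for j in range(m // 2):
            for k in range(0, n, m):
                butterfly(a, k + j, k + j + m // 2, j, m)
        m *= 2
    while m <= n:                  # across: k outer, j inner
        for k in range(0, n, m):
            for j in range(m // 2):
                butterfly(a, k + j, k + j + m // 2, j, m)
        m *= 2
    return a

def naive_dft(a):
    """Direct O(n^2) evaluation of D[k] = sum_j A[j] * omega_n^(k j)."""
    n = len(a)
    return [sum(a[j] * cmath.exp(2j * cmath.pi * k * j / n) for j in range(n))
            for k in range(n)]

if __name__ == "__main__":
    data = [complex(x, -x) for x in range(8)]
    fast = prefetcher_friendly_fft(data, s=4)
    slow = naive_dft(data)
    print(max(abs(f - g) for f, g in zip(fast, slow)))
```

Setting s = 1 gives the pure across method (Algorithm 9) and s = n the pure downwards method (Algorithm 8); the result is the same in every case because only the order of independent butterflies within each value of m changes.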
6.3 Experimental Results for the FFT.

The FFT implementations of Algorithms 10 and 11 were timed with and without the hardware prefetcher enabled using the same experimental setup as in Section 4.2. The arrays had $2^{18}$ randomly generated pairs of floats, with each pair corresponding to real and imaginary parts. Resulting times (in seconds) are multiplied by $10^8/(2^{18} \log 2^{18})$ for normalization. The results are shown in Figure 7.

Figure 7: Normalized running time in seconds of different FFT implementations. [Plot: normalized running time versus switch location; series: Prefetcher-Friendly, Prefetcher Enabled; Cache-Efficient, Prefetcher Enabled; Prefetcher-Friendly, Prefetcher Disabled; Cache-Efficient, Prefetcher Disabled.]

For the prefetcher-friendly implementation, the x-axis denotes how many of the m loop iterations are done with the downwards method before switching to the across method. Our results suggest that a few iterations of the downwards method followed by iterations of the across method give the fastest times.

As expected, the cache-efficient approach is fastest when we can fill or nearly fill the cache with several elements to be used repeatedly, but is slower when we apply the downwards method to arrays larger than the cache. Here, the x-axis is the number of m loops performed by the downwards method on each subarray (whose sizes also depend on the value of the x coordinate) before the across method is employed. We see a substantial jump in time when we go from 16 m loop iterations using the downwards method to 17. Since 16 iterations use $2^{16}$ pairs of floats, the needed array takes up $2^{16} \cdot 2 \cdot 4$ bytes, which is precisely the size of the L2 cache (512 KB) used in the experiment.

Using either the prefetcher-friendly or the cache-efficient implementation would require tuning, that is, choosing s or l so as to minimize the running time. Without prefetching enabled, the cache-efficient method gives significant improvement over the prefetcher-friendly method. With the prefetcher enabled, however, both methods give nearly the same minimum running times. This suggests that hardware prefetching eliminates some of the need to design cache-efficient implementations.

7 Acknowledgments.

We would like to thank Jean-Loup Baer, Steve Swanson and Jan Sanislo for providing us with information and insight about hardware prefetchers.

References

[1] C. Cherng and R. E. Ladner, Cache efficient simple dynamic programming, Proceedings of the International Conference on the Analysis of Algorithms, 2005, pp. 49-58.
[2] R. A. Chowdhury and V. Ramachandran, Cache-oblivious dynamic programming, Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2006, pp. 591-600.
[3] T. Cormen, C. Leiserson, R. Rivest, and C. Stein, Introduction to Algorithms, MIT Press, 2nd ed., 2001.
[4] M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran, Cache-oblivious algorithms, Proceedings of the 40th Annual Symposium on Foundations of Computer Science (FOCS), 1999, pp. 17-18.
[5] M. E. Furman, Application of fast multiplication of matrices in the problem of finding the transitive closure of a graph, Dokl. Akad. Nauk SSSR, 194:524 (Russian), Soviet Math. Dokl., 11(5):1252, 1970.
[6] Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3B, http://www.intel.com/design/processor/manuals/253669.pdf.
[7] G. Hinton, D. Sager, M. Upton, D. Boggs, D. Carmean, A. Kyker, and P. Roussel, The Microarchitecture of the Pentium 4 Processor, http://www.intel.com/.
[8] T. Kasami, An efficient recognition and syntax algorithm for context-free languages, Scientific Report AFCRL-65-758, Air Force Cambridge Research Laboratory, Bedford, Mass., 1965.
[9] I. Munro, Efficient determination of the transitive closure of a directed graph, Information Processing Letters, 1(2):56-58, 1971.
[10] J.-S. Park, M. Penner and V. K. Prasanna, Optimizing graph algorithms for improved cache performance, IEEE Transactions on Parallel and Distributed Systems, vol. 15(9), pp. 769-782, 2004.
[11] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes in C: The Art of Scientific Computing, Second Edition, Cambridge University Press, Cambridge, UK, 1992.
[12] L. G. Valiant, General context-free recognition in less than cubic time, Journal of Computer and Systems Sciences, 10:308-315, 1975.
[13] D. H. Younger, Recognition of context-free languages in time $n^3$, Information and Control, 10(2):189-208, 1967.

 
