Question

Please provide a typed (computer-written) answer.

Question 1:

Suppose we wish to write a procedure that computes the inner product of two vectors u and v. An abstract version of the function has a CPE of 14–18 with x86-64 for different types of integer and floating-point data. By doing the same sort of transformations we did to transform the abstract program combine1 into the more efficient combine4, we get the following code:
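The listing referred to here did not come through with the post. As a rough sketch only, a combine4-style inner product might look like the following. It assumes plain C arrays in place of the textbook's vector abstraction; the names inner4 and data_t and the array-plus-length signature are assumptions, not taken from the post.

```c
#include <assert.h>

typedef double data_t;  /* assumed element type; an integer type works the same way */

/* Hypothetical combine4-style inner product: the length and data pointers
 * are read once, and the result is accumulated in a local variable
 * instead of being written through dest on every iteration. */
void inner4(const data_t *udata, const data_t *vdata, long length, data_t *dest)
{
    assert(length >= 0);
    data_t sum = (data_t) 0;
    for (long i = 0; i < length; i++) {
        sum = sum + udata[i] * vdata[i];   /* one multiply and one add per element */
    }
    *dest = sum;
}
```

The only value carried from one iteration to the next is sum, which is what makes the addition the loop-carried dependency in this form of the loop.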

Our measurements show that this function has CPEs of 1.50 for integer data and 3.00 for floating-point data. For data type double, the x86-64 assembly code for the inner loop is as follows:

Assume that the functional units have the characteristics listed in Figure 5.12.

(See the last page for figures.)

A. Diagram how this instruction sequence would be decoded into operations and show how the data dependencies between them would create a critical path of operations, in the style of textbook Figures 5.13 and 5.14.

1. vmovsd: Get udata(i)
2. vmovsd: Load vdata(i)
3. vmulsd: Multiply
4. vaddsd: Add to sum
5. Increment i
6. Compare limit

B. For data type double, what lower bound on the CPE is determined by the critical path?

C. Assuming similar instruction sequences for the integer code as well, what lower bound on the CPE is determined by the critical path for integer data?

D. Explain how the floating-point versions can have CPEs of 3.00, even though the multiplication operation requires 5 clock cycles.

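The multiplications have no data dependencies between iterations, so the pipelined multiplier can start a new multiplication every cycle even though each one takes 5 cycles to complete. The only operation that depends on the previous iteration is the addition to sum, so the critical path is the chain of floating-point additions. The CPE of 3.00 therefore matches the latency of floating-point addition (3 cycles), not the latency of multiplication. Having multiple functional units increases throughput, but it does not shorten the latency of the critical path.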
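The argument above can be checked with a toy timing model. This is a sketch under simplifying assumptions, not a description of the real processor: it assumes one multiplication can be issued per cycle, ignores loads and loop overhead, and takes the latencies as parameters. The function name modeled_cpe is hypothetical.

```c
#include <assert.h>

/* Toy model: iteration i issues its multiply at cycle i (fully pipelined),
 * and its add must wait for both the product and the previous sum. */
double modeled_cpe(long n, long mul_latency, long add_latency)
{
    assert(n > 0);
    long sum_ready = 0;  /* cycle at which the previous iteration's sum is available */
    for (long i = 0; i < n; i++) {
        long product_ready = i + mul_latency;
        long start = product_ready > sum_ready ? product_ready : sum_ready;
        sum_ready = start + add_latency;   /* adds form a serial chain */
    }
    return (double) sum_ready / (double) n;
}
```

With a multiply latency of 5 and an add latency of 3, the modeled CPE approaches 3 as n grows: the multiply latency is paid once at the start, while the add latency is paid on every element.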

Question 2:

Write a version of the inner product procedure described in Question 1 that uses 6 × 6 loop unrolling. Our measurements for this function with x86-64 give a CPE of 1.06 for integer data and 1.01 for floating-point data.

What factor limits the performance to a CPE of 1.00?
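One possible shape for a 6 × 6 version is sketched below: six elements per iteration, each feeding its own accumulator, so that six independent addition chains can overlap. As before, plain arrays stand in for the textbook's vector type, and the names inner6x6 and data_t are assumptions.

```c
#include <assert.h>

typedef double data_t;  /* assumed element type */

/* Hypothetical 6 x 6 unrolling: six elements per iteration, six accumulators. */
void inner6x6(const data_t *udata, const data_t *vdata, long length, data_t *dest)
{
    assert(length >= 0);
    long i;
    long limit = length - 5;  /* i < limit guarantees i+5 is a valid index */
    data_t sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0, sum4 = 0, sum5 = 0;

    for (i = 0; i < limit; i += 6) {
        sum0 = sum0 + udata[i]     * vdata[i];
        sum1 = sum1 + udata[i + 1] * vdata[i + 1];
        sum2 = sum2 + udata[i + 2] * vdata[i + 2];
        sum3 = sum3 + udata[i + 3] * vdata[i + 3];
        sum4 = sum4 + udata[i + 4] * vdata[i + 4];
        sum5 = sum5 + udata[i + 5] * vdata[i + 5];
    }
    /* Finish any remaining elements one at a time. */
    for (; i < length; i++) {
        sum0 = sum0 + udata[i] * vdata[i];
    }
    *dest = (sum0 + sum1) + (sum2 + sum3) + (sum4 + sum5);
}
```

Because the six accumulators do not depend on one another, the addition latency no longer bounds the CPE; what remains is a throughput limit of the hardware.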

Question 3:

Write a version of the inner product procedure described in Question 1 that uses 6 × 1a loop unrolling to enable greater parallelism. Our measurements for this function give a CPE of 1.10 for integer data and 1.05 for floating-point data.
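A 6 × 1a version keeps a single accumulator but reassociates the arithmetic, so that the six products of an iteration are summed together first and only one addition per iteration touches sum. The sketch below makes the same assumptions as before (plain arrays; inner6x1a and data_t are hypothetical names).

```c
#include <assert.h>

typedef double data_t;  /* assumed element type */

/* Hypothetical 6 x 1a unrolling: one accumulator, reassociated so that
 * only the final addition of each iteration depends on the previous sum. */
void inner6x1a(const data_t *udata, const data_t *vdata, long length, data_t *dest)
{
    assert(length >= 0);
    long i;
    long limit = length - 5;  /* i < limit guarantees i+5 is a valid index */
    data_t sum = (data_t) 0;

    for (i = 0; i < limit; i += 6) {
        sum = sum + ((udata[i]     * vdata[i]     + udata[i + 1] * vdata[i + 1]) +
                     (udata[i + 2] * vdata[i + 2] + udata[i + 3] * vdata[i + 3]) +
                     (udata[i + 4] * vdata[i + 4] + udata[i + 5] * vdata[i + 5]));
    }
    /* Finish any remaining elements one at a time. */
    for (; i < length; i++) {
        sum = sum + udata[i] * vdata[i];
    }
    *dest = sum;
}
```

Since floating-point addition is not associative, this version may round slightly differently from the straightforward loop; for integer data the result is the same.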
