Task 2: Secondary Index & Aggregation Query Processing Assume the relation CITIZEN(ID, Tax-Code, Salary, Age) storing information about UK citizens’ tax codes and annual salaries. There are r = 60,000,000 records. Each attribute has the same size: 128 bytes. The relation is stored in a file sorted by the salary attribute. The block size B = 1024 bytes and any pointer in the system has size = 128 bytes. The salary attribute is assumed to be uniformly distributed across tuples. There are 6,000 distinct salary values and there are 10,000 different tax-code values. The tax-code and age are assumed to be statistically independent of salary. A data scientist, who has built a Secondary Index over the Tax Code (non- ordering, non-key attribute), is interested in the specific Tax Code: ‘1234L’. The data scientist did analyse the distribution of the Tax Code attribute and noted the following: • Let X be the number of citizens with Tax Code 1234L per data block. That is, given a random block, there are X citizens with Tax Code 1234L. Objective: Experimentation with Secondary Index over non-ordering non-key attribute and Aggregate Query Execution Planning.• P(X ≥ 1) = 0.5, i.e., the probability that at least one citizen has tax code 1234L is 50% within a block. Therefore, when we pick up a block at random, the probability of finding therein at least one citizen with Tax Code 1234L is 0.5. If there are b data blocks in the file and we are asked to retrieve those citizens with Tax Code 1234L, then ideally, we expect to access b/2 blocks (ideal case). However, in reality, we do not know where these blocks are! If we use the naïve solution (scan the whole file) to retrieve all those citizens, then we need to access b blocks. Q2.1: The data scientist claims that the expected cost using the Secondary Index should be between b/2 and b. Which is the expected cost of retrieving the citizens of Tax Code 1234L using the Secondary Index? How much bigger is this cost compared to the ideal case? Q2.2 : The data scientist is asked for a query processing plan for the aggregation query: SELECT Salary, AVG(Age) FROM CITIZEN WHERE TaxCode = ‘1234L’ GROUP BY Salary If the data management system devotes 100,000 blocks of RAM (memory), i.e., approx. 103MB (each one of 1024 bytes) for executing the query, and 5000 blocks for storing the results of the query, help the scientist by providing a query execution plan. Describe your plan (e.g., steps, methods, ideas) and report on the corresponding expected number of block accesses of your proposed solution. How much memory would you need to store the result of the aggregate query?

Task 2: Secondary Index & Aggregation Query Processing Assume the relation CITIZEN(ID, Tax-Code, Salary, Age) storing information about UK citizens’ tax codes and annual salaries. There are r = 60,000,000 records. Each attribute has the same size: 128 bytes. The relation is stored in a file sorted by the salary attribute. The block size B = 1024 bytes and any pointer in the system has size = 128 bytes. The salary attribute is assumed to be uniformly distributed across tuples. There are 6,000 distinct salary values and there are 10,000 different tax-code values. The tax-code and age are assumed to be statistically independent of salary. A data scientist, who has built a Secondary Index over the Tax Code (non- ordering, non-key attribute), is interested in the specific Tax Code: ‘1234L’. The data scientist did analyse the distribution of the Tax Code attribute and noted the following: • Let X be the number of citizens with Tax Code 1234L per data block. That is, given a random block, there are X citizens with Tax Code 1234L. Objective: Experimentation with Secondary Index over non-ordering non-key attribute and Aggregate Query Execution Planning.• P(X ≥ 1) = 0.5, i.e., the probability that at least one citizen has tax code 1234L is 50% within a block. Therefore, when we pick up a block at random, the probability of finding therein at least one citizen with Tax Code 1234L is 0.5. If there are b data blocks in the file and we are asked to retrieve those citizens with Tax Code 1234L, then ideally, we expect to access b/2 blocks (ideal case). However, in reality, we do not know where these blocks are! If we use the naïve solution (scan the whole file) to retrieve all those citizens, then we need to access b blocks. Q2.1: The data scientist claims that the expected cost using the Secondary Index should be between b/2 and b. Which is the expected cost of retrieving the citizens of Tax Code 1234L using the Secondary Index? How much bigger is this cost compared to the ideal case? Q2.2 : The data scientist is asked for a query processing plan for the aggregation query: SELECT Salary, AVG(Age) FROM CITIZEN WHERE TaxCode = ‘1234L’ GROUP BY Salary If the data management system devotes 100,000 blocks of RAM (memory), i.e., approx. 103MB (each one of 1024 bytes) for executing the query, and 5000 blocks for storing the results of the query, help the scientist by providing a query execution plan. Describe your plan (e.g., steps, methods, ideas) and report on the corresponding expected number of block accesses of your proposed solution. How much memory would you need to store the result of the aggregate query?

Database System Concepts

7th Edition

ISBN:9780078022159

Author:Abraham Silberschatz Professor, Henry F. Korth, S. Sudarshan

Publisher:Abraham Silberschatz Professor, Henry F. Korth, S. Sudarshan

Chapter1: Introduction

Section: Chapter Questions

Problem 1PE

See similar textbooks

Related questions

Concept explainers

Question

Task 2: Secondary Index & Aggregation Query Processing

Assume the relation CITIZEN(ID, Tax-Code, Salary, Age) storing information about UK citizens’ tax codes and

annual salaries. There are r = 60,000,000 records. Each attribute has the same size: 128 bytes. The relation is

stored in a file sorted by the salary attribute. The block size B = 1024 bytes and any pointer in the system has

size = 128 bytes. The salary attribute is assumed to be uniformly distributed across tuples. There are 6,000

distinct salary values and there are 10,000 different tax-code values. The tax-code and age are assumed to be

statistically independent of salary. A data scientist, who has built a Secondary Index over the Tax Code (non-

ordering, non-key attribute), is interested in the specific Tax Code: ‘1234L’. The data scientist did analyse the

distribution of the Tax Code attribute and noted the following:

• Let X be the number of citizens with Tax Code 1234L per data block. That is, given a random block,

there are X citizens with Tax Code 1234L.

Objective: Experimentation with Secondary Index over non-ordering non-key attribute and

Aggregate Query Execution Planning.• P(X ≥ 1) = 0.5, i.e., the probability that at least one citizen has tax code 1234L is 50% within a block.

Therefore, when we pick up a block at random, the probability of finding therein at least one citizen

with Tax Code 1234L is 0.5.

If there are b data blocks in the file and we are asked to retrieve those citizens with Tax Code 1234L, then

ideally, we expect to access b/2 blocks (ideal case). However, in reality, we do not know where these blocks

are! If we use the naïve solution (scan the whole file) to retrieve all those citizens, then we need to access b

blocks.

Q2.1: The data scientist claims that the expected cost using the Secondary Index should be between

b/2 and b. Which is the expected cost of retrieving the citizens of Tax Code 1234L using the Secondary Index?

How much bigger is this cost compared to the ideal case?

Q2.2 : The data scientist is asked for a query processing plan for the aggregation query:

SELECT Salary, AVG(Age)

FROM CITIZEN

WHERE TaxCode = ‘1234L’

GROUP BY Salary

If the data management system devotes 100,000 blocks of RAM (memory), i.e., approx. 103MB (each one of

1024 bytes) for executing the query, and 5000 blocks for storing the results of the query, help the scientist by

providing a query execution plan. Describe your plan (e.g., steps, methods, ideas) and report on the

corresponding expected number of block accesses of your proposed solution. How much memory would you

need to store the result of the aggregate query?

AI-Generated Solution

AI-generated content may present inaccurate or offensive content that does not represent bartleby’s views.

Unlock instant AI solutions

Tap the button
to generate a solution

Click the button to generate
a solution

SEE SOLUTION

Knowledge Booster

Learn more about

Need a deep-dive on the concept behind this application? Look no further. Learn more about this topic, computer-science and related others by exploring similar questions and additional content below.