Task 2: Secondary Index & Aggregation Query Processing  Assume the relation CITIZEN(ID, Tax-Code, Salary, Age) storing information about UK citizens’ tax codes and annual salaries. There are r = 60,000,000 records. Each attribute has the same size: 128 bytes. The relation is stored in a file sorted by the salary attribute. The block size B = 1024 bytes and any pointer in the system has size = 128 bytes. The salary attribute is assumed to be uniformly distributed across tuples. There are 6,000 distinct salary values and there are 10,000 different tax-code values. The tax-code and age are assumed to be statistically independent of salary. A data scientist, who has built a Secondary Index over the Tax Code (non- ordering, non-key attribute), is interested in the specific Tax Code: ‘1234L’. The data scientist did analyse the distribution of the Tax Code attribute and noted the following: • Let X be the number of citizens with Tax Code 1234L per data block. That is, given a random block, there are X citizens with Tax Code 1234L. Objective: Experimentation with Secondary Index over non-ordering non-key attribute and Aggregate Query Execution Planning.• P(X ≥ 1) = 0.5, i.e., the probability that at least one citizen has tax code 1234L is 50% within a block. Therefore, when we pick up a block at random, the probability of finding therein at least one citizen with Tax Code 1234L is 0.5. If there are b data blocks in the file and we are asked to retrieve those citizens with Tax Code 1234L, then ideally, we expect to access b/2 blocks (ideal case). However, in reality, we do not know where these blocks are! If we use the naïve solution (scan the whole file) to retrieve all those citizens, then we need to access b blocks. Q2.1: The data scientist claims that the expected cost using the Secondary Index should be between b/2 and b. Which is the expected cost of retrieving the citizens of Tax Code 1234L using the Secondary Index? How much bigger is this cost compared to the ideal case? Q2.2 : The data scientist is asked for a query processing plan for the aggregation query: SELECT Salary, AVG(Age) FROM CITIZEN WHERE TaxCode = ‘1234L’ GROUP BY Salary If the data management system devotes 100,000 blocks of RAM (memory), i.e., approx. 103MB (each one of 1024 bytes) for executing the query, and 5000 blocks for storing the results of the query, help the scientist by providing a query execution plan. Describe your plan (e.g., steps, methods, ideas) and report on the corresponding expected number of block accesses of your proposed solution. How much memory would you need to store the result of the aggregate query?

Database System Concepts
7th Edition
ISBN:9780078022159
Author:Abraham Silberschatz Professor, Henry F. Korth, S. Sudarshan
Publisher:Abraham Silberschatz Professor, Henry F. Korth, S. Sudarshan
Chapter1: Introduction
Section: Chapter Questions
Problem 1PE
icon
Related questions
Question

Task 2: Secondary Index & Aggregation Query Processing 

Assume the relation CITIZEN(ID, Tax-Code, Salary, Age) storing information about UK citizens’ tax codes and

annual salaries. There are r = 60,000,000 records. Each attribute has the same size: 128 bytes. The relation is

stored in a file sorted by the salary attribute. The block size B = 1024 bytes and any pointer in the system has

size = 128 bytes. The salary attribute is assumed to be uniformly distributed across tuples. There are 6,000

distinct salary values and there are 10,000 different tax-code values. The tax-code and age are assumed to be

statistically independent of salary. A data scientist, who has built a Secondary Index over the Tax Code (non-

ordering, non-key attribute), is interested in the specific Tax Code: ‘1234L’. The data scientist did analyse the

distribution of the Tax Code attribute and noted the following:

• Let X be the number of citizens with Tax Code 1234L per data block. That is, given a random block,

there are X citizens with Tax Code 1234L.

Objective: Experimentation with Secondary Index over non-ordering non-key attribute and

Aggregate Query Execution Planning.• P(X ≥ 1) = 0.5, i.e., the probability that at least one citizen has tax code 1234L is 50% within a block.

Therefore, when we pick up a block at random, the probability of finding therein at least one citizen

with Tax Code 1234L is 0.5.

If there are b data blocks in the file and we are asked to retrieve those citizens with Tax Code 1234L, then

ideally, we expect to access b/2 blocks (ideal case). However, in reality, we do not know where these blocks

are! If we use the naïve solution (scan the whole file) to retrieve all those citizens, then we need to access b

blocks.

Q2.1: The data scientist claims that the expected cost using the Secondary Index should be between

b/2 and b. Which is the expected cost of retrieving the citizens of Tax Code 1234L using the Secondary Index?

How much bigger is this cost compared to the ideal case?

Q2.2 : The data scientist is asked for a query processing plan for the aggregation query:

SELECT Salary, AVG(Age)

FROM CITIZEN

WHERE TaxCode = ‘1234L’

GROUP BY Salary

If the data management system devotes 100,000 blocks of RAM (memory), i.e., approx. 103MB (each one of

1024 bytes) for executing the query, and 5000 blocks for storing the results of the query, help the scientist by

providing a query execution plan. Describe your plan (e.g., steps, methods, ideas) and report on the

corresponding expected number of block accesses of your proposed solution. How much memory would you

need to store the result of the aggregate query?

AI-Generated Solution
AI-generated content may present inaccurate or offensive content that does not represent bartleby’s views.
steps

Unlock instant AI solutions

Tap the button
to generate a solution

Knowledge Booster
Fundamentals of Datawarehouse
Learn more about
Need a deep-dive on the concept behind this application? Look no further. Learn more about this topic, computer-science and related others by exploring similar questions and additional content below.
Similar questions
  • SEE MORE QUESTIONS
Recommended textbooks for you
Database System Concepts
Database System Concepts
Computer Science
ISBN:
9780078022159
Author:
Abraham Silberschatz Professor, Henry F. Korth, S. Sudarshan
Publisher:
McGraw-Hill Education
Starting Out with Python (4th Edition)
Starting Out with Python (4th Edition)
Computer Science
ISBN:
9780134444321
Author:
Tony Gaddis
Publisher:
PEARSON
Digital Fundamentals (11th Edition)
Digital Fundamentals (11th Edition)
Computer Science
ISBN:
9780132737968
Author:
Thomas L. Floyd
Publisher:
PEARSON
C How to Program (8th Edition)
C How to Program (8th Edition)
Computer Science
ISBN:
9780133976892
Author:
Paul J. Deitel, Harvey Deitel
Publisher:
PEARSON
Database Systems: Design, Implementation, & Manag…
Database Systems: Design, Implementation, & Manag…
Computer Science
ISBN:
9781337627900
Author:
Carlos Coronel, Steven Morris
Publisher:
Cengage Learning
Programmable Logic Controllers
Programmable Logic Controllers
Computer Science
ISBN:
9780073373843
Author:
Frank D. Petruzella
Publisher:
McGraw-Hill Education