qm2-all-tutorial-answers-organized-by-week.pdf

School

University of Melbourne

Course

20003

Subject

Economics

Date

Apr 3, 2024

Type

pdf

Pages

71

QM2 All Tutorial Answers (Organized by Week)
Quantitative Methods 2 (University of Melbourne)
L. Kónya, 2020, Semester 2 ECON20003 - Solutions 1

ECON20003 – QUANTITATIVE METHODS 2
TUTORIAL 1
Solutions

Exercises for Assessment

Exercise 2

One of the major measures of the quality of service provided by any organisation is the speed with which the organisation responds to customer complaints. Last year the flooring department of a large family-owned department store received 50 complaints about carpet installation. The following data represent the number of days between the receipt and resolution of these complaints.

Days
 54   35   29    2    1   11  126    4   35   26
 12  165   27   26   74   13    5   29   22   26
 33  137   28  123   14    5  110   52   94   20
 19   32  152   25   27    4   27   61   36    5
 10   31   29   81   13   68  110   30   31   23

a) Is the variable Days qualitative or quantitative? If it is quantitative, is it discrete or continuous? In addition, determine its level of measurement. Explain your answers.

The observations are numbers of days resulting from a counting process and the possible values are non-negative integers. Therefore, Days is a quantitative variable and it is discrete (its possible values are countably infinite). The measurement scale is ratio since there is a unit of measurement (day) and a genuine zero point (0 days).

b) Launch RStudio and close the Script tab, if it is open at all. Create a new RStudio project and script, and name both t1e2.

Follow steps similar to those in Exercise 1.

c) Enter the observations from your keyboard to an RStudio spreadsheet and save them in an RData file. Quit RStudio. When prompted, save only the t1e2.R file.

Follow steps similar to those in Exercise 1.

d) Open your working directory. Capture your screen by taking a screenshot (Alt + Print Screen) and paste it with your answers for part (a) in a Word document.
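Parts (b) and (c) can also be done from a script instead of the spreadsheet view. The sketch below is one possible way and not necessarily the procedure of Exercise 1: it types the 50 observations into a vector and saves them in an RData file. The temporary folder (`tempdir()`) and the names `rdata_path` and `check` are assumptions made for illustration; in the tutorial the file would sit in the t1e2 project folder.

```r
# The 50 observations on Days, in the order they appear in the exercise.
Days <- c(54, 35, 29, 2, 1, 11, 126, 4, 35, 26,
          12, 165, 27, 26, 74, 13, 5, 29, 22, 26,
          33, 137, 28, 123, 14, 5, 110, 52, 94, 20,
          19, 32, 152, 25, 27, 4, 27, 61, 36, 5,
          10, 31, 29, 81, 13, 68, 110, 30, 31, 23)

# Quick check that no observation was skipped while typing.
length(Days)

# Hypothetical location: a temporary folder stands in for the project folder.
rdata_path <- file.path(tempdir(), "t1e2.RData")
save(Days, file = rdata_path)

# Reload the file into a separate environment to confirm the round trip.
check <- new.env()
load(rdata_path, envir = check)
identical(check$Days, Days)
```

Saving the vector under the name Days keeps later commands such as hist(Days) usable once the file has been loaded.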
L. Kónya, 2020, Semester 2 ECON20003 – Solutions 2

ECON20003 – QUANTITATIVE METHODS 2
TUTORIAL 2
Solutions

Exercises for Assessment

Exercise 4

In this exercise you are going to work on the data you saved in Exercise 2 last week.

a) Launch RStudio and close the Script tab, if it is open. Create a new RStudio project and script, and name both t2e4. Retrieve the t1e2 data set and save it as t2e4.RData.

You can complete these tasks by following steps similar to those in Exercise 2 of Tutorial 2.

The variable of interest, Days, is a discrete quantitative variable. The data set is cross-sectional and it can be displayed graphically with, for example, a histogram or a boxplot.

b) Use RStudio to illustrate the data on Days with a histogram. Customize your plot as you did in Exercise 3. Briefly describe what the graph tells you.

A basic histogram is generated by the following command:

    hist(Days)

In return, RStudio displays the first plot on the next page. It is black and white and looks a bit strange because the axes are too short. However, it can be easily improved by adding a few arguments:

    hist(Days, xlim = c(0, 200), ylim = c(0, 25), col = "yellow")

The new histogram is second on the next page.

These histograms show that the sample data on Days are heavily skewed to the right and that the second class interval, from 20 to 40, has the highest frequency, 21.
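The class frequencies read off the histogram can be cross-checked numerically. The sketch below assumes class intervals of width 20 from 0 to 180, which is consistent with the "20 to 40" class mentioned above but is stated here explicitly rather than left to R's default; the names `h` and `counts` are illustrative only.

```r
# The 50 observations on Days from Exercise 2 of Tutorial 1.
Days <- c(54, 35, 29, 2, 1, 11, 126, 4, 35, 26,
          12, 165, 27, 26, 74, 13, 5, 29, 22, 26,
          33, 137, 28, 123, 14, 5, 110, 52, 94, 20,
          19, 32, 152, 25, 27, 4, 27, 61, 36, 5,
          10, 31, 29, 81, 13, 68, 110, 30, 31, 23)

# Compute the histogram without drawing it.
# Assumed breaks: 0, 20, ..., 180; intervals are closed on the right,
# so an observation of exactly 20 falls in the first class.
h <- hist(Days, breaks = seq(0, 180, by = 20), plot = FALSE)

# Label each count with its class interval for easier reading.
counts <- setNames(h$counts,
                   paste0("(", head(h$breaks, -1), ", ", tail(h$breaks, -1), "]"))
print(counts)

# Position of the class with the highest frequency.
which.max(h$counts)
```

Most of the mass sits in the first two classes and the remaining counts thin out towards 180, which is the right skew described above.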
[Figures: basic histogram of Days (first) and customised histogram of Days with extended axes and yellow bars (second)]
c) Use RStudio to illustrate the data on Days with a boxplot and customize your plot. Briefly describe what the graph tells you. Use the boxplot(Days) command to develop a basic boxplot and then add a main title to it, add the Days label to the vertical axis, and colour the rectangle on the boxplot red.

A basic boxplot is generated by the command

    boxplot(Days)

To add the required customization, execute

    boxplot(Days, main = "Boxplot for Days", ylab = "Days", col = "red")

The new boxplot is on the next page. It shows that in the sample of Days, (i) the median ($Q_2$) is a bit above 25, (ii) the first quartile ($Q_1$) is about 14, (iii) the third quartile ($Q_3$) is a bit above 50, (iv) the lower whisker ends close to zero, because $Q_1 - 1.5(Q_3 - Q_1)$ is negative and no observation can fall below it, (v) the upper whisker ends at about 110, the largest observation below $Q_3 + 1.5(Q_3 - Q_1)$, and (vi) there are a few outliers at the upper end of the range.¹

¹ Observations that differ greatly from the majority of the data set, in the sense that they are either smaller than $Q_1 - 1.5(Q_3 - Q_1)$ or larger than $Q_3 + 1.5(Q_3 - Q_1)$, are considered to be outliers.
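The quantities that a boxplot draws can also be computed directly, which is a useful check on values read off the graph. The sketch below uses base R's `fivenum()` and `boxplot.stats()`; the variable names (`five`, `lower_fence`, `upper_fence`, `outliers`, `bs`) are illustrative choices, not part of the tutorial.

```r
# The 50 observations on Days from Exercise 2 of Tutorial 1.
Days <- c(54, 35, 29, 2, 1, 11, 126, 4, 35, 26,
          12, 165, 27, 26, 74, 13, 5, 29, 22, 26,
          33, 137, 28, 123, 14, 5, 110, 52, 94, 20,
          19, 32, 152, 25, 27, 4, 27, 61, 36, 5,
          10, 31, 29, 81, 13, 68, 110, 30, 31, 23)

# Tukey's five-number summary: minimum, lower hinge (Q1), median,
# upper hinge (Q3), maximum. These hinges are what boxplot() draws.
five <- fivenum(Days)
q1 <- five[2]
q3 <- five[4]

# Outlier limits from the footnote: Q1 - 1.5(Q3 - Q1) and Q3 + 1.5(Q3 - Q1).
lower_fence <- q1 - 1.5 * (q3 - q1)
upper_fence <- q3 + 1.5 * (q3 - q1)

# Observations outside the limits, sorted for readability.
outliers <- sort(Days[Days < lower_fence | Days > upper_fence])

# boxplot.stats() reports the whisker ends (first and last element of $stats).
bs <- boxplot.stats(Days)

print(five)
print(c(lower_fence = lower_fence, upper_fence = upper_fence))
print(outliers)
print(bs$stats)
```

Note that the whiskers stop at the most extreme observations inside the limits, so the whisker ends and the limits themselves need not coincide.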
Exercise 5

The table below details the number of international visitors (aged 15 years and over) to Australia from its top 10 markets during the 2018/19 financial year by country of residence (COR).²

Overseas arrivals ('000) by country of residence (COR)

COR            Visitors
China              1331
Hong Kong           284
India               364
Japan               455
Korea               250
Malaysia            344
New Zealand        1276
Singapore           417
UK                  670
US                  771

² Source: Estimates for the year ending June 2019 from the International Visitor Survey, Data, Table 1a, https://www.tra.gov.au/International/International-tourism-results/overview.
a) There are two variables: Market and Visitors. Are they qualitative or quantitative, discrete or continuous? Explain your answers.

COR is a qualitative variable as its possible values are names / labels.

Visitors, i.e. the number of international visitors aged 15 years and over to Australia, is a quantitative variable because the possible values are numbers resulting from a counting process. Originally this variable is discrete, and its possible values are non-negative integers, but the actual observations have been rounded to the nearest thousand.

b) Launch RStudio, create a new RStudio project and script (t2e5), enter the observations from your keyboard to an RStudio spreadsheet and save it as an RData file.

Follow steps similar to those in Exercise 1 and Exercise 2 of Tutorial 1.

c) Depict the number of visitors by country of residence market with a bar graph.³ Use the barplot(Visitors) command to develop a basic bar graph.

It returns the following plot:

[Figure: basic bar graph of Visitors, vertical axis from 0 to about 1000, no labels]

d) Annotate your bar graph with axes labels Country of Residence (x-axis), Visitors to Australia (y-axis) and with the Bar graph for Visitors to Australia title. Review the application of the main, ylab and xlab arguments in Exercise 3.

The following command

    barplot(Visitors, main = "Bar graph of Visitors to Australia",
            xlab = "Country of residence", ylab = "Number of visitors")

returns

[Figure: Bar graph of Visitors to Australia; horizontal axis: Country of residence, vertical axis: Number of visitors]

d) Increase the scale on the vertical axis to (0,1400) and colour the bars orange. Review the application of the ylim and col arguments in Exercise 3.

The following command

    barplot(Visitors, main = "Bar graph of Visitors to Australia",
            xlab = "Country of residence", ylab = "Number of visitors",
            ylim = c(0, 1400), col = "orange")

returns the bar graph shown on the next page.

³ Notice that this time a histogram would be inappropriate because the observations are classified by categories (countries of origin) rather than adjacent class intervals.
[Figure: Bar graph of Visitors to Australia with orange bars and the vertical axis extended to 1400; horizontal axis: Country of residence, vertical axis: Number of visitors]

e) To make the bar graph more informative, expand the barplot command with the names.arg = COR and cex.names = 0.5 arguments.

The expanded command is

    barplot(Visitors, main = "Bar graph of Visitors to Australia",
            xlab = "Country of residence", ylab = "Number of visitors",
            ylim = c(0, 1400), col = "orange",
            names.arg = COR, cex.names = 0.5)

It returns the bar graph shown on the next page.

f) Briefly describe what the bar graph in part (e) tells you.

This bar graph shows that in 2018/19 the most tourists to Australia arrived from China, followed by New Zealand, the US and the UK.
[Figure: Bar graph of Visitors to Australia from part (e), with country names under the bars; horizontal axis: Country of residence, vertical axis: Number of visitors]

Although it was not part of Exercise 5, there is one more thing worth mentioning. To make this bar graph even more informative, it is a good idea to display the bars in descending order of their heights. Let's do this in three steps.

First, we set up a data frame called original that consists of COR and Visitors by executing the command

    original = data.frame(COR, Visitors)

Second, we rearrange original in descending order of Visitors and call the new data frame ordered. The relevant command is

    ordered = original[order(-original$Visitors),]

Third, we run the barplot command like in part (e), but on the ordered data frame, i.e.

    barplot(ordered$Visitors, main = "Bar graph of Visitors to Australia",
            xlab = "Country of residence", ylab = "Number of visitors",
            ylim = c(0, 1400), col = "orange",
            names.arg = ordered$COR, cex.names = 0.5)

The new bar graph looks like this:

[Figure: Bar graph of Visitors to Australia with bars in descending order: China, New Zealand, US, UK, Japan, Singapore, India, Malaysia, Hong Kong, Korea; horizontal axis: Country of residence, vertical axis: Number of visitors]
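The reordering step can be tried without the saved t2e5 file by typing the table in directly. This is a minimal sketch: the data frame name `sorted` is an illustrative substitute for `ordered` (which is also the name of a base R function), and `order(..., decreasing = TRUE)` is used as an equivalent alternative to negating the sort key.

```r
# Table from Exercise 5: visitors ('000) by country of residence.
COR <- c("China", "Hong Kong", "India", "Japan", "Korea",
         "Malaysia", "New Zealand", "Singapore", "UK", "US")
Visitors <- c(1331, 284, 364, 455, 250, 344, 1276, 417, 670, 771)

# Step 1: combine the two variables in a data frame.
original <- data.frame(COR, Visitors, stringsAsFactors = FALSE)

# Step 2: sort the rows by Visitors, largest first.
sorted <- original[order(original$Visitors, decreasing = TRUE), ]

# The row order is the left-to-right order of the bars.
print(sorted, row.names = FALSE)
```

Passing `sorted$Visitors` as the heights and `sorted$COR` as `names.arg` to the barplot command of part (e) would then draw the bars from tallest to shortest.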
L. Kónya, 2020, Semester 2 ECON20003 – Solutions 3

ECON20003 – QUANTITATIVE METHODS 2
TUTORIAL 3
Solutions

Exercises for Assessment

Exercise 6 (Selvanathan, p. 397, ex. 10.43)

A parking officer is conducting an analysis of the amount of time left on parking meters. A quick survey of 15 cars that have just left their metered parking spaces produced the times (T, in minutes) saved in the t3e6 Excel file. Assuming that the population of T is normally distributed, estimate with 95% confidence the mean amount of time left for all the vacant meters. Do the calculations first manually and then with R.

Since the population of T is said to be normally distributed and the population standard deviation is unknown, the appropriate confidence interval estimator for the mean is

$$\bar{x} \pm t_{\alpha/2}\, s_{\bar{x}}$$

Using your hand calculator you can obtain the sample mean and the sample standard deviation:

$$\bar{x} = 18.133, \quad s = 9.753$$

From the sample standard deviation and the sample size, the estimate of the standard error of the sample mean is

$$s_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{9.753}{\sqrt{15}} = 2.518$$

From the t-table, the 97.5th percentile of the t distribution with df = n - 1 = 14 is 2.145. Putting all these together,

$$\bar{x} \pm t_{\alpha/2}\, s_{\bar{x}} = 18.133 \pm 2.145 \times 2.518 = (12.732;\ 23.534)$$

Therefore, with 95% confidence, the mean amount of time left for all the vacant meters is somewhere between 12.732 and 23.534 minutes.

To obtain this confidence interval in R, import the data to RStudio and execute the command

    t.test(T, mu = 0, conf.level = 0.95)
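As a cross-check that does not need the t3e6 file, the manual calculation can be replayed in R from the summary statistics quoted above. This is only a sketch of the arithmetic: the names `xbar`, `s`, `se`, `t_crit` and `ci` are illustrative, and the inputs are the rounded values from the text, so the last digit may differ slightly from a result computed on the raw data.

```r
# Summary statistics reported in the solution (rounded to three decimals).
n    <- 15
xbar <- 18.133
s    <- 9.753

# Estimated standard error of the sample mean.
se <- s / sqrt(n)

# 97.5th percentile of the t distribution with n - 1 degrees of freedom.
t_crit <- qt(0.975, df = n - 1)

# 95% confidence interval: xbar -/+ t_crit * se.
ci <- xbar + c(-1, 1) * t_crit * se

print(round(c(se = se, t_crit = t_crit), 3))
print(round(ci, 3))
```

Using `qt()` instead of the printed t-table avoids rounding the critical value, which is why the sketch keeps more decimals internally than the hand calculation.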
The 95% confidence interval on the resulting printout confirms our manual calculations.

Exercise 7 (Selvanathan, p. 499, ex. 12.41)

In this exercise do all calculations manually.

a) A random sample of eight observations was taken from a normal population. The sample mean and standard deviation are 75 and 50, respectively. Can we infer at the 10% significance level that the population mean is less than 100?

Just like in the previous exercise, we are interested in the mean of an allegedly normally distributed population whose standard deviation is unknown. This time, however, instead of developing a confidence interval to estimate the population mean, we need to perform a hypothesis test. Let's follow the six-step test procedure.

The hypotheses are¹

$$H_0: \mu = 100, \quad H_A: \mu < 100$$

The sample mean is normally distributed, but since its standard error must be estimated from the sample, the test statistic is

$$T = \frac{\bar{X} - \mu_0}{s_{\bar{x}}}$$

Under the null hypothesis this test statistic has a t distribution with df = n – 1.

The significance level is 10% and the critical value for this left-tail test is

$$-t_{\alpha, df} = -t_{0.10,7} = -1.415$$

and we reject the null hypothesis if the calculated test statistic happens to be smaller than this critical value.

The calculated or observed value of the test statistic is

$$t_{obs} = \frac{\bar{x} - \mu_0}{s_{\bar{x}}} = \frac{75 - 100}{50 / \sqrt{8}} = -1.414$$

Since the observed value of the test statistic is (slightly) larger than the critical value (-1.415), we maintain $H_0$ and conclude that at the 10% level there is not enough evidence to infer that the population mean is smaller than 100.

b) Repeat part (a) assuming that you know that the population standard deviation is 50.

If the population standard deviation is known and it is $\sigma$ = 50, then the test statistic is

$$Z = \frac{\bar{X} - \mu_0}{\sigma_{\bar{x}}}$$

The critical value is

$$-z_{\alpha} = -z_{0.10} = -1.282$$

Although the test statistic is different from the one in part (a), its calculated value is the same:

$$z_{obs} = \frac{\bar{x} - \mu_0}{\sigma_{\bar{x}}} = \frac{75 - 100}{50 / \sqrt{8}} = -1.414$$

Since the observed value of the test statistic is smaller than the critical value (-1.282), we reject $H_0$ and conclude that at the 10% level there is enough evidence to infer that the population mean is less than 100.

c) Review parts (a) and (b). Explain why the test statistics differ.

The tests in parts (a) and (b) led to different conclusions. This is due to the fact that in part (a) we had to use the t distribution, while in part (b) we could use the standard normal distribution. Both distributions are symmetric around zero, but the t distribution is more dispersed than the standard normal distribution and hence the critical value in part (a) is further from zero than in part (b).

¹ It is easier to start with the alternative hypothesis because it is implied by the question. This is usually the case, except when the implied statement takes the form of an equality that must be in the null hypothesis.

Exercise 8

Environmental engineers have found that the percentages of active bacteria in sewage specimens collected at a sewage treatment plant have a non-normal distribution with a median of 40% when the plant is running properly. If the median is larger than 40%, some adjustments must be made. The percentages of active bacteria (PAB) in a random sample of 10 specimens are saved in the t3e8 Excel file. Do the data provide enough evidence (at $\alpha$ = 0.05) to indicate that adjustments are needed?
a) What are the null and alternative hypotheses?

Unlike the previous exercise, this one is about a population median. Denoting the population median by $\eta$, the hypotheses are

$$H_0: \eta = 40, \quad H_A: \eta > 40$$

b) Which test(s) can be used to answer this question? What are the required conditions? Do you think that these conditions are likely satisfied this time? Explain your answer.

We learnt about two nonparametric tests that can be used this time, the one-sample sign test for the median and the one-sample Wilcoxon signed ranks test for the median.

The sign test assumes that (i) the data is a random sample, (ii) the variable of interest is qualitative or quantitative, and (iii) the measurement scale is at least ordinal. In this case we are told that the sample at hand is a random sample. The variable of interest, PAB, is a quantitative variable measured on a ratio scale. Hence, all three requirements are met.

The Wilcoxon signed ranks test assumes that (i) the data is a random sample, (ii) the variable of interest is quantitative and continuous, (iii) the measurement scale is interval or ratio, and (iv) the distribution of the sampled population is symmetric. The first three requirements are clearly met. As for the fourth one, it is difficult to verify due to the small sample size. Let's just assume at this stage that it is satisfied and see whether the Wilcoxon signed ranks test leads to the same conclusion as the sign test. If it does, then the issue of symmetry is irrelevant.

c) Perform the test(s) first manually and then with R. Explain your decision and conclusion.

Sign test:

PAB              41   33   43   52   46   37   44   49   53   30
DEV = PAB - 40    1   -7    3   12    6   -3    4    9   13  -10

There are three negative deviations and seven positive deviations, so

$$S^- = 3, \quad S^+ = 7, \quad n = S^- + S^+ = 10$$

The test is a right-tail test and the test statistic is $S = S^+ = 7$. From the binomial table of Selvanathan (Table 1, Appendix B, pp. 1068-1071, n = 10, k = 7, p = 0.5), the probability of observing at least 7 'successes' in 10 trials is

$$P(S \ge 7) = 1 - P(S \le 6) = 1 - 0.8281 = 0.1719$$

Since this p-value is above the selected significance level ($\alpha$ = 0.05), we maintain the null hypothesis at the 5% significance level and conclude that there is no need for adjustment.

To repeat this test with R, launch RStudio, create a new project and script (t3e8), import the t3e8 data from Excel to RStudio, and execute the following commands:

    attach(t3e8)
    library(DescTools)
    SignTest(PAB, mu = 40, alternative = "greater")
You should get the following printout:

The p-value is 0.1719, larger than 0.05, so at the 5% significance level there is not enough evidence against the null hypothesis.

Wilcoxon signed ranks test:

PAB              41   33   43   52   46   37   44   49   53   30
DEV = PAB - 40    1   -7    3   12    6   -3    4    9   13  -10
ABSDEV            1    7    3   12    6    3    4    9   13   10
RANK              1    6  2.5    9    5  2.5    4    7   10    8

$T^-$ = 16.5 and $T^+$ = 38.5. Their sum is 55 = (10)(11)/2.

The test is a right-tail test and the test statistic is $T = T^+ = 38.5$. From the Wilcoxon Signed Rank Sum Test table of Selvanathan (Table 9, Appendix B, p. 1089, Part (b), n = 10) the 5% one-tail critical values are $T_L$ = 11 and $T_U$ = 44. Since $T^+ = 38.5 < T_U = 44$, we maintain the null hypothesis at the 5% significance level and conclude that there is no need for adjustment.

To repeat this test with R, execute the following commands:

    library(exactRankTests)
    wilcox.exact(PAB, mu = 40, alternative = "greater")

You should get the following printout:

The p-value is 0.1436, larger than 0.05, so at the 5% significance level there is not enough evidence against the null hypothesis.

This time neither the sign test nor the Wilcoxon signed ranks test rejects the null hypothesis. Since they lead to the same conclusion, we do not need to worry about whether the sampled population is symmetric.

Quit RStudio and save your RData and R files.
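The manual parts of Exercise 8 can be retraced in base R without the DescTools and exactRankTests packages. The sketch below is an illustration of the hand calculation, not a replacement for SignTest or wilcox.exact: it counts the signs, obtains the binomial tail probability with `pbinom()`, and rebuilds the ranks and rank sums. All object names (`dev`, `s_plus`, `p_sign`, `ranks`, `t_plus`, `t_minus`) are illustrative.

```r
# Percentages of active bacteria from the exercise; hypothesised median 40.
PAB <- c(41, 33, 43, 52, 46, 37, 44, 49, 53, 30)
dev <- PAB - 40

# Deviations equal to zero would be dropped (there are none in this sample).
dev <- dev[dev != 0]
n <- length(dev)

# Sign test: number of positive and negative deviations.
s_plus  <- sum(dev > 0)
s_minus <- sum(dev < 0)

# Right-tail p-value: P(S >= s_plus) when S ~ Binomial(n, 0.5).
p_sign <- pbinom(s_plus - 1, size = n, prob = 0.5, lower.tail = FALSE)

# Wilcoxon signed ranks: rank the absolute deviations (ties get average ranks),
# then add up the ranks of the positive and of the negative deviations.
ranks   <- rank(abs(dev))
t_plus  <- sum(ranks[dev > 0])
t_minus <- sum(ranks[dev < 0])

print(data.frame(PAB = PAB, DEV = dev, ABSDEV = abs(dev), RANK = ranks))
print(c(s_minus = s_minus, s_plus = s_plus, p_sign = round(p_sign, 4)))
print(c(t_minus = t_minus, t_plus = t_plus, total = n * (n + 1) / 2))
```

The two rank sums add up to n(n + 1)/2, which is a quick way to catch ranking mistakes in the manual table.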
L. Kónya, 2020, Semester 2 ECON20003 – Solutions 4

ECON20003 – QUANTITATIVE METHODS 2
TUTORIAL 4
Solutions

Exercises for Assessment

Exercise 4 (Selvanathan et al., p. 887, ex. 20.9)

In recent years, insurance companies offering medical coverage have given discounts to companies that are committed to improving the health of their employees. To help determine whether this policy is reasonable, the general manager of one large insurance company in the US organised a study of a random sample of 30 workers who regularly participate in their company's lunchtime exercise program and 30 workers who do not. Over a two-year period, he observed the total dollar amount of medical expenses for each individual. The data are stored in the t4e4 (column 1: Expenses; column 2: Exercise, yes or no) Excel file. Do all calculations with R.

a) Can the manager conclude at the 5% significance level that companies that provide exercise programs should be given discounts? Perform the independent-samples t-test to answer the question.¹ Do not forget to specify the null and alternative hypotheses.

Let $X_Y$ and $X_N$ denote the medical expenses of those who regularly participate in their company's lunchtime exercise program and of those who do not, respectively. Companies that provide exercise programs should be given discounts if their employees have smaller average medical expenses than the employees of those companies which do not provide this facility. Therefore, the question implies the following null and alternative hypotheses:

$$H_0: \mu_Y - \mu_N = 0, \quad H_A: \mu_Y - \mu_N < 0$$

You might recall that this test can be performed under three different scenarios depending on what we know or assume about the population variances. Hence, like in Exercise 2 of Tutorial 4, you should first consider the sample variances.

The by(Expenses, Exercise, sd) command returns the sample standard deviations:

    Exercise: no
    [1] 271.6985
    Exercise: yes
    [1] 266.278

They are quite similar, so we can assume that the corresponding population variances are equal.²

The command

    t.test(Expenses ~ Exercise, alternative = "less",
           var.equal = TRUE, conf.level = 0.95)

returns

The t-test statistic is 0.85656, i.e. positive. This seems to contradict the alternative hypothesis. However, as the sample estimates part of these printouts shows, R considers the No exercise group as group 1 and the Yes exercise group as group 2. Hence, we need to re-write the hypotheses as

$$H_0: \mu_N - \mu_Y = 0, \quad H_A: \mu_N - \mu_Y > 0$$

This is a right-tail test, so execute

    t.test(Expenses ~ Exercise, alternative = "greater",
           var.equal = TRUE, conf.level = 0.95)

to obtain

¹ We use the independent-samples t-test because we have two unrelated (hence, independent) random samples of workers. However, if the sample consisted of pairs of workers where in each pair the two workers are employed by the same company and one of them regularly participates in the lunchtime exercise program while the other one does not, then we should perform the matched-pairs t-test.

² Recall that we should compare the sample variances to each other, but since this time the sample standard deviations are indeed close, the difference between their squares is not too large either.
The test statistic did not change³, but the p-value did. It is 0.1976, far too large to reject the null hypothesis of equal population means. Hence, we conclude that the average medical expenses of those employees who regularly participate in their company's lunchtime exercise program are not significantly smaller than the average medical expenses of those employees who do not exercise.

If we are not willing to assume equal population variances, then the appropriate R command is

    t.test(Expenses ~ Exercise, alternative = "greater", conf.level = 0.95)

and it returns

As you can see, this time the two t-tests have practically the same degrees of freedom and identical observed test statistics and p-values, so it makes no difference whether we assume equal or unequal population variances.

b) What assumptions must hold to ensure the validity of the hypothesis test in part (a) above? Does it appear that these conditions are satisfied?

The samples must be random and independent. Given that this is a textbook example, we have no reason to question these requirements.

The variable of interest should be quantitative and continuous. The total dollar amount of medical expenses is a quantitative variable. As it is given in dollars, the actual observations are discrete, but there are so many different possible values that we can treat this variable as continuous.

The sample sizes are 30, so the CLT holds. However, the population standard deviations are unknown, so the sampled populations should be normally distributed (or at least not extremely non-normal).

The histograms and the usual descriptive statistics can be obtained like in Exercise 2 of Tutorial 2, combined with the subset command you used in Exercise 2 of Tutorial 4. The commands

    hist(subset(Expenses, Exercise == "yes"), col = "blue")
    hist(subset(Expenses, Exercise == "no"), col = "green")

return the histograms shown on the next page.

Both histograms are skewed to the right, indicating that the sampled populations are unlikely to be normally distributed.

³ The test statistic value, in general, depends on the hypothesized parameter value, but is the same for two-tail, left-tail and right-tail tests.
[Figures: Histogram of subset(Expenses, Exercise == "yes"), horizontal axis from 0 to 1400; Histogram of subset(Expenses, Exercise == "no"), horizontal axis from 0 to 1000; vertical axis: Frequency, from 0 to 20]

The commands

    qqnorm(subset(Expenses, Exercise == "yes"),
           main = "Normal Q-Q Plot for Exercise = yes",
           xlab = "Theoretical Quantiles", ylab = "Sample Quantiles",
           col = "blue")
    qqline(subset(Expenses, Exercise == "yes"), col = "red")

    qqnorm(subset(Expenses, Exercise == "no"),
           main = "Normal Q-Q Plot for Exercise = no",
           xlab = "Theoretical Quantiles", ylab = "Sample Quantiles",
           col = "green")
    qqline(subset(Expenses, Exercise == "no"), col = "red")

produce the following normal QQ plots.

[Figures: Normal Q-Q Plot for Exercise = yes and Normal Q-Q Plot for Exercise = no; horizontal axis: Theoretical Quantiles, vertical axis: Sample Quantiles]
Only a few dots are close to the reference lines, so it seems unreasonable to assume that the sub-populations of Expenses are normally distributed.

The commands

    library(pastecs)
    round(stat.desc(subset(Expenses, Exercise == "yes"),
                    basic = FALSE, desc = TRUE, norm = TRUE), 3)

provide the following statistics for the Exercise = yes group, while

    round(stat.desc(subset(Expenses, Exercise == "no"),
                    basic = FALSE, desc = TRUE, norm = TRUE), 3)

returns the corresponding statistics for the Exercise = no group.

As you can see, for both samples the mean is far above the median and skewness appears to be significantly positive⁴, confirming that the samples are skewed to the right. In addition, the excess kurtosis statistics are also significantly positive⁵, so the distributions of the samples are more peaked and have heavier tails than the corresponding normal distributions. Finally, the reported p-values of the Shapiro-Wilk test are zero to three decimal places, rejecting the null hypothesis of normality. Consequently, it is very unlikely that these samples have been drawn from normally distributed populations.

c) Assuming that some of the assumption(s) mentioned above is (are) not satisfied, which nonparametric hypothesis-testing procedure could be used? Conduct this test and give the appropriate conclusion in the context of the problem. Compare your conclusions in parts (a) and (c).

Since the sampled populations are most likely non-normal, the independent-samples t-test in part (a) is inappropriate. However, the samples are independent, the variable of interest is continuous and is measured on a ratio scale, and the histograms suggest that the two distributions have similar shapes. Therefore, we can rely on the Wilcoxon rank-sum test.

⁴ Look at the skew.2SE statistics; they are both well above 1.

⁵ Look at the kurt.2SE statistics; they are both well above 1.
The hypotheses are

$$H_0: \eta_N = \eta_Y, \qquad H_A: \eta_N > \eta_Y$$

The

    library(exactRankTests)
    wilcox.exact(Expenses ~ Exercise, alternative = "greater")

commands return a p-value of 0.03681, so at the 5% level we reject $H_0$ and conclude that the median medical expenses of exercisers are significantly lower than the median medical expenses of non-exercisers. The insurance company should therefore give discounts to companies that provide exercise programs for their employees.

d) Compare your conclusions in parts (a) and (c).

In part (a) the parametric t-tests failed to reject the null hypothesis, but in part (c) the non-parametric Wilcoxon rank-sum test detected ample evidence against the null hypothesis. This illustrates that although parametric tests are more powerful than their non-parametric counterparts when their assumptions are satisfied, they can be powerless when some of their assumptions are violated.

Quit RStudio and save your RData and R files.

Exercise 5 (Selvanathan et al., p. 886, ex. 20.5)

In a taste test of a new beer, 25 people rated the new beer and another 25 rated the leading brand on the market. The possible ratings were Poor, Fair, Good, Very Good, and Excellent.

a) Suppose the responses for the new beer and the leading beer were stored using a 1-2-3-4-5 coding system (1 = Poor, …, 5 = Excellent). Based on the data saved in the t4e5a file, can we infer that the new beer is rated less highly than the leading brand?

The variable of interest is rating, and there are two populations: ratings of the new beer and ratings of the leading brand on the market. Notice that although the possible values of rating are numbers (1, 2, 3, 4 and 5), these numbers are just labels used to identify the categories (poor, fair, good, very good and excellent), which have a natural order. Therefore, rating is a qualitative variable measured on an ordinal scale. As such, it does not have a mean, and its central location is best captured by the median. It is also important to recognize that the data
set comprises two independent random samples on rating, one related to the new beer and another one related to the leading beer on the market.

For these reasons, we need to perform a nonparametric test, namely the Wilcoxon rank-sum test, for the comparison of the population medians. If we consider the population of ratings of the new beer as population 1 and the population of ratings of the leading beer on the market as population 2, then the relevant hypotheses are

$$H_0: \eta_1 = \eta_2, \qquad H_A: \eta_1 < \eta_2$$

Launch RStudio, create a new project and script (t4e5), import the t4e5a data from Excel to RStudio, load it to your project, and execute the following commands:

    library(exactRankTests)
    wilcox.exact(New_a, Leading_a, alternative = "less")

The printout should report a p-value of 0.3929, far too big to reject the null hypothesis. Thus, the conclusion is that the sample does not provide sufficient evidence to infer that the new beer is rated less highly than the leading brand.

b) Suppose the responses were recoded so that 3 = Poor, 8 = Fair, 22 = Good, 37 = Very Good, and 55 = Excellent. Based on the recoded data, saved in the t4e5b file, can we infer that the new beer is rated less highly than the leading brand?

Import the t4e5b data from Excel to RStudio, load it to your existing project, and execute the following commands:

    library(exactRankTests)
    wilcox.exact(New_b, Leading_b, alternative = "less")

Apart from the variable names, the printout you get is the same as in part (a).
c) What does this exercise tell you about ordinal data?

In the case of ordinal data, the actual numbers used to code the original names or labels of the categories are arbitrary. One could use any set of numbers, provided that there are as many different numbers as categories and that these numbers preserve the ranking of the categories.
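The point of part (c) can be checked numerically. The sketch below is only an illustration: the vectors `new_a` and `leading_a` are made-up ratings (not the t4e5a data), and it uses base R's `wilcox.test` with the normal approximation rather than `wilcox.exact`. It recodes the ratings with the 3-8-22-37-55 scheme of part (b) and compares the ranks and the test results before and after recoding.

```r
# Hypothetical ratings on the 1-5 scale (NOT the t4e5a data)
new_a     <- c(3, 2, 4, 1, 3, 5, 2, 3)
leading_a <- c(4, 3, 5, 2, 4, 3, 5, 4)

# Recode 1,2,3,4,5 -> 3,8,22,37,55 as in part (b); the order is preserved
codes     <- c(3, 8, 22, 37, 55)
new_b     <- codes[new_a]
leading_b <- codes[leading_a]

# The pooled ranks are the same under both coding systems
same_ranks <- identical(rank(c(new_a, leading_a)),
                        rank(c(new_b, leading_b)))

# Rank-sum tests (normal approximation because of the ties)
test_a <- wilcox.test(new_a, leading_a, alternative = "less", exact = FALSE)
test_b <- wilcox.test(new_b, leading_b, alternative = "less", exact = FALSE)

print(same_ranks)
print(c(p_a = test_a$p.value, p_b = test_b$p.value))
```

Because the rank-sum statistic depends on the data only through the ranks, any order-preserving recoding leaves the test result untouched.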
ECON20003 – QUANTITATIVE METHODS 2
TUTORIAL 5
Solutions

Exercises for Assessment

Exercise 5

In Exercise 2 of Tutorial 4, we first developed a confidence interval for the difference between the mean ages of purchasers and non-purchasers of a particular brand of toothpaste (part a), and then performed a t-test to see whether there was sufficient evidence to conclude that there was a difference in the mean age of purchasers and non-purchasers (part b). Based on the sample variances, in both cases we assumed that the two unknown population variances are different. Let's check now whether this assumption is supported by the data. Namely, using the same data,

a) Estimate the ratio of the two population variances with 95% confidence.

Last week we already checked the normality assumption and found no sign of extreme non-normality. Thus, we can develop the required confidence interval the same way as we did in Exercise 2.

The sample variances are $s_1^2 = 13.62119^2 = 185.54$ and $s_2^2 = 10.03992^2 = 100.80$. Both sample sizes are 20 and the confidence level is $(1-\alpha)\times 100\% = 95\%$. The required F percentiles from Table 6(b) of Selvanathan (Appendix B, pp. 1080-81) are

$$F_{\alpha/2,\,n_1-1,\,n_2-1} = F_{0.025,19,19} = 2.53$$

and

$$F_{1-\alpha/2,\,n_1-1,\,n_2-1} = F_{0.975,19,19} = \frac{1}{F_{0.025,19,19}} = \frac{1}{2.53} = 0.395$$

The confidence interval estimate is

$$\left(\frac{s_1^2/s_2^2}{F_{\alpha/2,\,n_1-1,\,n_2-1}},\ \frac{s_1^2/s_2^2}{F_{1-\alpha/2,\,n_1-1,\,n_2-1}}\right) = \left(\frac{185.54/100.80}{2.53},\ \frac{185.54/100.80}{0.395}\right) = (0.728,\ 4.660)$$

Therefore, with 95% confidence, the ratio of the variances of the populations of purchasers and non-purchasers is somewhere between 0.728 and 4.660.
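As a cross-check on the table look-up, the interval in part (a) can be sketched in base R from the summary statistics alone. The variable names below are illustrative, and `qf` replaces the printed F table, so the limits differ slightly from the hand calculation because the percentiles are not rounded to two decimals.

```r
# Sample standard deviations and sample sizes quoted in part (a)
s1 <- 13.62119
s2 <- 10.03992
n1 <- 20
n2 <- 20
alpha <- 0.05

ratio <- s1^2 / s2^2                            # point estimate of the variance ratio

# qf() takes lower-tail probabilities, so F_{0.025,19,19} in the
# upper-tail notation of the tutorial is qf(0.975, 19, 19)
F_upper <- qf(1 - alpha / 2, n1 - 1, n2 - 1)    # about 2.53
F_lower <- qf(alpha / 2, n1 - 1, n2 - 1)        # equals 1 / F_upper, since df1 = df2

ci <- c(lower = ratio / F_upper, upper = ratio / F_lower)
print(round(ci, 3))
```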
b) Can we conclude at the 5% significance level that the population variances differ? What do you conclude if the significance level is increased to 10%?

The hypotheses are

$$H_0: \frac{\sigma_1^2}{\sigma_2^2} = 1, \qquad H_A: \frac{\sigma_1^2}{\sigma_2^2} \neq 1$$

This is a two-tail test and the significance level is 5%, so we can rely on the 95% confidence interval developed in part (a). Since that interval includes 1, the hypothesized ratio of the two population variances, the null hypothesis of equal variances cannot be rejected at the 5% significance level.

The formal hypothesis test is as follows. The upper and lower critical values are the same F values as in part (a), i.e.

$$F_{\alpha/2,\,n_1-1,\,n_2-1} = F_{0.025,19,19} = 2.53 \quad\text{and}\quad F_{1-\alpha/2,\,n_1-1,\,n_2-1} = F_{0.975,19,19} = 0.395$$

The observed test statistic value is

$$F_{obs} = \frac{s_1^2}{s_2^2} = \frac{185.54}{100.80} = 1.841$$

Since it is between the lower and upper critical values, at the 5% significance level we cannot reject the null hypothesis and hence cannot conclude that the population variances differ.

At the 10% significance level the critical values are

$$F_{\alpha/2,\,n_1-1,\,n_2-1} = F_{0.05,19,19} = 2.17 \quad\text{and}\quad F_{1-\alpha/2,\,n_1-1,\,n_2-1} = F_{0.95,19,19} = \frac{1}{F_{0.05,19,19}} = \frac{1}{2.17} = 0.46$$

Since the observed test statistic value is still between the lower and upper critical values, our decision and conclusion do not change.

To complete parts (a) and (b) in R, create a new project and script (t5e5), import the data from the t4e2 Excel file and execute

    attach(t4e2)
    library(DescTools)
    VarTest(Age ~ Householder)

The printout is shown below. The 95% confidence interval is about (0.729, 4.650), almost the same as we got in part (a). The test statistic is about 1.841, the same as above. The p-value is 0.1927, so we would fail to reject the null hypothesis even at the 19% significance level.
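The two-tail p-value and the critical values used in part (b) can also be sketched directly from the F distribution in base R. This is a minimal illustration built on the summary statistics quoted above (the names `F_obs`, `crit_5` and `crit_10` are made up for the example); it does not need the t4e2 data.

```r
# Summary statistics from part (a)
s1 <- 13.62119
s2 <- 10.03992
df1 <- 19
df2 <- 19

F_obs <- s1^2 / s2^2                              # observed test statistic

# Two-tail p-value: twice the smaller tail probability
upper_tail <- pf(F_obs, df1, df2, lower.tail = FALSE)
p_value <- 2 * min(upper_tail, 1 - upper_tail)

# Lower and upper critical values at the 5% and 10% levels
crit_5  <- qf(c(0.025, 0.975), df1, df2)
crit_10 <- qf(c(0.05, 0.95), df1, df2)

reject_5  <- F_obs < crit_5[1]  || F_obs > crit_5[2]
reject_10 <- F_obs < crit_10[1] || F_obs > crit_10[2]

print(round(c(F_obs = F_obs, p_value = p_value), 4))
print(c(reject_5 = reject_5, reject_10 = reject_10))
```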
        F test to compare two variances

    data:  Age by Householder
    F = 1.8406, num df = 19, denom df = 19, p-value = 0.1927
    alternative hypothesis: true ratio of variances is not equal to 1
    95 percent confidence interval:
     0.728549 4.650295
    sample estimates:
    ratio of variances
              1.840643

Exercise 6 (Selvanathan, p. 558, ex. 13.58)

In a public opinion survey, 60 out of a sample of 100 high-income voters and 40 out of a sample of 75 low-income voters supported the introduction of a new national security tax. Can we conclude at the 5% level of significance that there is a difference between the proportions of high- and low-income voters favouring a new national security tax? Do the calculations both manually and with R.

Let $X_1$ be the population of high-income voters and $X_2$ be the population of low-income voters. The proportions of voters in these populations who are in favour of a new national security tax are $p_1$ and $p_2$, respectively. The hypotheses are

$$H_0: p_1 - p_2 = 0, \qquad H_A: p_1 - p_2 \neq 0$$

The sample proportions are

$$\hat{p}_1 = \frac{60}{100} = 0.6000, \qquad \hat{p}_2 = \frac{40}{75} = 0.5333$$

Using these sample proportions as estimates of the corresponding population proportions,

$$n_1\hat{p}_1 = 60, \quad n_1(1-\hat{p}_1) = 40, \quad n_2\hat{p}_2 = 40, \quad n_2(1-\hat{p}_2) = 35$$

They are all much bigger than 5, so we can rely on the normal approximation and perform a Z-test. The critical values are $\pm z_{\alpha/2} = \pm z_{0.025} = \pm 1.96$, and $H_0$ is to be rejected if the calculated test statistic is either smaller than -1.96 or larger than 1.96.

Under the null hypothesis the hypothesized difference between the two population proportions is $D_0 = 0$, so we can estimate the common population proportion from the pooled sample:
$$\hat{p} = \frac{f_1 + f_2}{n_1 + n_2} = \frac{60 + 40}{100 + 75} = 0.5714$$

The estimate of the standard error is

$$s_{\hat{p}_1 - \hat{p}_2} = \sqrt{\hat{p}\hat{q}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = \sqrt{0.5714 \times 0.4286 \times \left(\frac{1}{100} + \frac{1}{75}\right)} = 0.0756$$

and the test statistic is

$$z_{obs} = \frac{\hat{p}_1 - \hat{p}_2}{s_{\hat{p}_1 - \hat{p}_2}} = \frac{0.6000 - 0.5333}{0.0756} = 0.8823$$

Since it is between the lower and upper critical values, we cannot reject the null hypothesis. Therefore, at the 5% level there is no significant difference between the proportions of high- and low-income voters favouring a new national security tax.

To perform this test in R, create a new RStudio project and script (t5e6), and execute the following command: [1]

    prop.test(x = c(60,40), n = c(100,75), correct = FALSE)

You should get

        2-sample test for equality of proportions without continuity correction

    data:  c(60, 40) out of c(100, 75)
    X-squared = 0.77778, df = 1, p-value = 0.3778
    alternative hypothesis: two.sided
    95 percent confidence interval:
     -0.08154755  0.21488088
    sample estimates:
       prop 1    prop 2
    0.6000000 0.5333333

The p-value is 0.3778, so we maintain the null hypothesis. The chi-square test statistic is 0.77778 and its square root is about 0.8819, almost the same as the Z test statistic we obtained manually, 0.8823.

[1] The alternative and conf.level arguments can be omitted because this is a two-tail test at the 5% significance level.
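The manual Z-test can be reproduced step by step in base R. The sketch below uses only the counts given in the exercise; the variable names are illustrative. Working with unrounded intermediate values, it also suggests that the small gap between 0.8823 and 0.8819 is a rounding effect in the hand calculation.

```r
# Counts from the exercise
x <- c(60, 40)        # supporters among high- and low-income voters
n <- c(100, 75)       # sample sizes

p_hat  <- x / n                     # sample proportions
p_pool <- sum(x) / sum(n)           # pooled estimate under H0
se     <- sqrt(p_pool * (1 - p_pool) * sum(1 / n))

z_obs   <- (p_hat[1] - p_hat[2]) / se
p_value <- 2 * pnorm(abs(z_obs), lower.tail = FALSE)   # two-tail p-value

# The chi-square statistic reported by prop.test() is the square of z_obs
chi_sq <- unname(prop.test(x, n, correct = FALSE)$statistic)

print(round(c(z_obs = z_obs, z_squared = z_obs^2,
              chi_sq = chi_sq, p_value = p_value), 4))
```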
ECON20003 – QUANTITATIVE METHODS 2
TUTORIAL 6
Solutions

Exercises for Assessment

Exercise 5

A farmer wants to know if the weight of parsley plants is influenced by using a fertilizer. He selects 90 plants and randomly divides them into three groups of 30 plants each. He applies a biological fertilizer to the first group, a chemical fertilizer to the second group and no fertilizer at all to the third group. After a month he weighs all plants and saves the measurements in the t6e5 Excel file. Can we conclude from these data at the 5% significance level that fertilizer affects weight?

a) Obtain the basic descriptive statistics with R and then perform the ANOVA F-test manually.

This exercise is similar to Exercise 1, so we need to follow the same steps.

    library(pastecs)
    round(stat.desc(t6e5, basic = FALSE, desc = TRUE, norm = TRUE, p = 0.95), 3)

returns the following descriptive statistics:

                    None  Biological  Chemical
    median        50.000      53.500    57.500
    mean          51.200      53.633    56.967
    SE.mean        1.431       1.617     1.434
    CI.mean.0.95   2.926       3.307     2.933
    var           61.407      78.447    61.689
    std.dev        7.836       8.857     7.854
    coef.var       0.153       0.165     0.138
    skewness       0.488      -0.049    -0.173
    skew.2SE       0.571      -0.057    -0.203
    kurtosis      -0.695      -0.951    -0.694
    kurt.2SE      -0.417      -0.571    -0.417
    normtest.W     0.958       0.984     0.977
    normtest.p     0.271       0.922     0.741

$H_0: \mu_1 = \mu_2 = \mu_3$ and $H_A$: not all three population means are the same.

$k = 3$, $n = 3 \times 30 = 90$. The 5% critical value is $F_{\alpha,\,k-1,\,n-k} = F_{0.05,2,87} \approx F_{0.05,2,90} = 3.10$. Therefore, $H_0$ is to be rejected if $F_{obs} > 3.10$.

The calculations are as follows.
$$\bar{\bar{x}} = \frac{1}{k}\sum_{j=1}^{k}\bar{x}_j = \frac{51.20 + 53.63 + 56.97}{3} = 53.93$$

$$SST = \sum_{j=1}^{k} n_j(\bar{x}_j - \bar{\bar{x}})^2 = 30 \times [(51.20 - 53.93)^2 + (53.63 - 53.93)^2 + (56.97 - 53.93)^2] = 503.54$$

$$MST = \frac{SST}{k-1} = \frac{503.54}{2} = 251.77$$

$$SSE = \sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ij} - \bar{x}_j)^2 = \sum_{j=1}^{k}(n_j - 1)s_j^2 \approx 29 \times (7.84^2 + 8.86^2 + 7.85^2) = 5846.04$$

$$MSE = \frac{SSE}{n-k} = \frac{5846.04}{87} = 67.20$$

$$F_{obs} = \frac{MST}{MSE} = \frac{251.77}{67.20} = 3.747$$

Since $F_{obs} > 3.10$, we reject $H_0$ and conclude at the 5% significance level that fertilizer affects weight.

b) Repeat the ANOVA F-test with R.

You need to execute the following commands

    Weight = c(None, Biological, Chemical)
    Fertilizer = gl(3, 30, 90, c("None", "Biological", "Chemical"))
    summary(aov(Weight ~ Fertilizer))

to obtain

                Df Sum Sq Mean Sq F value Pr(>F)
    Fertilizer   2    503  251.43   3.743 0.0276 *
    Residuals   87   5845   67.18
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA F-test statistic is 3.743 and its p-value is 0.0276 < 0.05, so at the 5% significance level we reject $H_0$.

c) What are the required conditions for the tests in parts (a) and (b)? Do they seem to be satisfied?

The ANOVA F-test assumes that (i) the data set constitutes k independent random samples of independent observations drawn from k (sub-)populations; (ii) the variable of interest is quantitative and continuous; (iii) the measurement scale is interval or ratio; (iv) each (sub-)population is normally distributed and (v) has the same variance.
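The manual ANOVA calculation of part (a) can be sketched in base R from the group means and variances in the descriptive statistics table, without the raw t6e5 data. The names below are illustrative. Because the sketch uses the statistics rounded to three decimals rather than two, its result is closer to the `aov` output of part (b) than to the hand calculation.

```r
# Group means and variances taken from the stat.desc() table (3 decimals)
means <- c(None = 51.200, Biological = 53.633, Chemical = 56.967)
vars  <- c(None = 61.407, Biological = 78.447, Chemical = 61.689)

n_j <- 30                      # plants per group
k   <- length(means)           # number of groups
n   <- k * n_j                 # total sample size

# With equal group sizes the grand mean is the mean of the group means
grand_mean <- mean(means)

SST <- n_j * sum((means - grand_mean)^2)   # between-treatments sum of squares
SSE <- (n_j - 1) * sum(vars)               # within-treatments sum of squares
MST <- SST / (k - 1)
MSE <- SSE / (n - k)

F_obs   <- MST / MSE
F_crit  <- qf(0.95, k - 1, n - k)                       # exact 5% critical value, df = (2, 87)
p_value <- pf(F_obs, k - 1, n - k, lower.tail = FALSE)

print(round(c(SST = SST, SSE = SSE, F_obs = F_obs,
              F_crit = F_crit, p_value = p_value), 4))
```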
The first condition is not testable. The variable of interest is the weight of parsley plants, a quantitative and continuous variable measured on a ratio scale, so the second and third conditions are satisfied. The normality assumption is supported by the descriptive statistics and the Shapiro-Wilk tests in part (a). [1]

[1] For the sake of brevity, I skip the explanation this time. Please note, however, that this answer would not be appreciated in the assignment and on the exam. If you are asked to check normality, explain briefly what the four quick checks suggest.

As for the last requirement, it can be checked with the Levene test.

    library(car)
    leveneTest(Weight ~ Fertilizer)

returns

    Levene's Test for Homogeneity of Variance (center = median)
          Df F value Pr(>F)
    group  2  0.5377  0.586
          87

The test statistic value is 0.5377 and its p-value is 0.586, so we can safely maintain the null hypothesis of equal variances (i.e. homoskedasticity) at any reasonable significance level. Consequently, we can rely on the ANOVA F-test.

d) Perform the Welch F-test in R. Does it lead to the same conclusion as the ANOVA F-test?

    oneway.test(Weight ~ Fertilizer)

returns

        One-way analysis of means (not assuming equal variances)

    data:  weights and fertilizers
    F = 4.0328, num df = 2.000, denom df = 57.827, p-value = 0.02293

This time the Welch F-test statistic and p-value are very similar to the ANOVA F-test statistic and p-value, so the two tests lead to the same conclusion. We can safely conclude at the 5% significance level that fertilizer affects weight, irrespective of the (sub-)population variances.

e) Perform the Kruskal-Wallis test in R (use α = 0.05). Does it lead to a different conclusion than the parametric tests in parts (b) and (d)?

    kruskal.test(Weight, Fertilizer)

produces the following printout:

        Kruskal-Wallis rank sum test

    data:  weights and fertilizers
    Kruskal-Wallis chi-squared = 7.3443, df = 2, p-value = 0.02542
The test statistic is 7.344. Since each sample size is large enough, we can rely on the reported p-value based on the chi-square approximation. It is 0.02542 < 0.05, so at the 5% significance level it leads to the same conclusion as the F-tests in parts (b) and (d). Namely, fertilizer affects the weight of parsley plants.

Exercise 6 (Selvanathan, p. 644, ex. 15.36)

A randomised block experiment produced the data listed below.

                 Treatment
    Block    1    2    3    4
      1      6    5    4    4
      2      8    5    5    6
      3      7    6    5    6

a) Conduct F-tests at the 5% significance level to find out whether (i) the treatment means differ; (ii) the block means differ. Do the calculations first manually and then in R.

Obtain the required descriptive statistics with your Casio calculator:

$$\bar{x}_{T,1} = 7.000, \quad \bar{x}_{T,2} = 5.333, \quad \bar{x}_{T,3} = 4.667, \quad \bar{x}_{T,4} = 5.333$$

$$\bar{x}_{B,1} = 4.750, \quad \bar{x}_{B,2} = 6.000, \quad \bar{x}_{B,3} = 6.000$$

$$\bar{\bar{x}} = 5.583, \quad s = 1.165$$

From these statistics,

$$SS = (n-1)s^2 = 11 \times 1.165^2 = 14.929$$

$$SST = b\sum_{j=1}^{k}(\bar{x}_{T,j} - \bar{\bar{x}})^2 = 3 \times [(7.000 - 5.583)^2 + (5.333 - 5.583)^2 + (4.667 - 5.583)^2 + (5.333 - 5.583)^2] = 3 \times 2.972 = 8.916$$

$$MST = \frac{SST}{k-1} = \frac{8.916}{3} = 2.972$$

$$SSB = k\sum_{i=1}^{b}(\bar{x}_{B,i} - \bar{\bar{x}})^2 = 4 \times [(4.750 - 5.583)^2 + (6.000 - 5.583)^2 + (6.000 - 5.583)^2] = 4 \times 1.042 = 4.168$$
$$MSB = \frac{SSB}{b-1} = \frac{4.168}{2} = 2.084$$

$$SSE = SS - SST - SSB = 14.929 - 8.916 - 4.168 = 1.845$$

$$MSE = \frac{SSE}{n-k-b+1} = \frac{1.845}{6} = 0.307$$

ANOVA for the treatment means

$H_0: \mu_{T,1} = \mu_{T,2} = \mu_{T,3} = \mu_{T,4}$ and $H_A$: not all four $\mu_{T,j}$ are the same

$$F_T = \frac{MST}{MSE} = \frac{2.972}{0.307} = 9.681, \qquad F_{crit} = F_{0.05,3,6} = 4.76$$

Since the observed test statistic value is larger than the critical value, at the 5% significance level we reject $H_0$. Hence, there is enough evidence to conclude that the treatment means are not all equal.

ANOVA for the block means

$H_0: \mu_{B,1} = \mu_{B,2} = \mu_{B,3}$ and $H_A$: not all three $\mu_{B,i}$ are the same

$$F_B = \frac{MSB}{MSE} = \frac{2.084}{0.307} = 6.788, \qquad F_{crit} = F_{0.05,2,6} = 5.14$$

Since the observed test statistic value is larger than the critical value, at the 5% significance level we reject $H_0$ and conclude that the block means are significantly different.

To do these tests in R, you need to import and reshape the data as in Exercise 4 to be able to use the aov function. The

    Y = c(as.matrix(t6e6[-1,-1]))
    Treatment = gl(4, 3, 12)
    Block = gl(3, 1, 12)
    summary(aov(Y ~ Treatment + Block))

commands produce the following printout:
                Df Sum Sq Mean Sq F value Pr(>F)
    Treatment    3  8.917  2.9722   9.727 0.0101 *
    Block        2  4.167  2.0833   6.818 0.0285 *
    Residuals    6  1.833  0.3056
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Because of rounding errors, there are some small differences between the test statistics calculated manually and obtained in R, but the conclusions are the same.

(b) Conduct a Friedman test at the 5% significance level to determine whether the treatment medians (central locations) differ. Do the calculations first manually and then in R.

Following the same steps as in Exercise 4, you obtain:

                              Treatment
    Block    1  Ranks    2  Ranks    3  Ranks    4  Ranks
      1      6   4.0     5   3.0     4   1.5     4   1.5
      2      8   4.0     5   1.5     5   1.5     6   3.0
      3      7   4.0     6   2.5     5   1.0     6   2.5
    T_j         12.0         7.0         4.0         7.0

In this case, unlike in Exercise 4, there are ties. The uncorrected test statistic is

$$F_r = \frac{12}{b\,k(k+1)}\sum_{j=1}^{k}T_j^2 - 3b(k+1) = \frac{12}{3 \times 4 \times 5}(12^2 + 7^2 + 4^2 + 7^2) - 3 \times 3 \times 5 = 6.6$$

The correction factor is

$$C = 1 - \frac{\sum_{i=1}^{b}(t_i^3 - t_i)}{b(k^3 - k)} = 1 - \frac{(2^3 - 2) + (2^3 - 2) + (2^3 - 2)}{3(4^3 - 4)} = 0.9$$

and the corrected test statistic is

$$F_{rc} = \frac{F_r}{C} = \frac{6.6}{0.9} = 7.33$$

The small-sample critical value is 7.4, a bit larger than the observed value of the test statistic (7.33), so $H_0$ is to be maintained. Consequently, at the 5% significance level there is not enough evidence to conclude that the treatment medians are not all equal.

In R, you need to execute

    friedman.test(Y ~ Treatment | Block)
to get

        Friedman rank sum test

    data:  Y and Treatment and Block
    Friedman chi-squared = 7.3333, df = 3, p-value = 0.062

The reported test statistic is 7.3333, the same as $F_{rc}$, and the p-value is 0.062 > 0.05, implying that $H_0$ is maintained at the 5% significance level. Note, however, that this p-value has been derived from the chi-square distribution with df = 3, which is inaccurate because k = 4 and b = 3 are relatively small.

(c) Are the required conditions of the Friedman test valid this time?

The Friedman test assumes that (i) the data is a random sample of b independent blocks of k observations that are not independent of each other (i.e. the experimental design is a randomised block design), (ii) the variable of interest is quantitative and continuous, and (iii) the measurement scale is at least ordinal.

The first assumption was said to be satisfied. The second and third assumptions, however, cannot be verified because the variable of interest is not specified this time.

Also recall that the chi-square approximation to the sampling distribution of the Friedman test statistic is good enough only if k > 6 and/or b > 24. This time, however, b = 3 and k = 4, so the conclusion drawn in part (b) cannot be taken at face value. The appropriate 5% 'small-sample' critical value is 7.4. It is smaller than the chi-square critical value, but still larger than the observed test statistic value, so it does not alter our conclusion.
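Since the data of this exercise are given in full, both the randomised block ANOVA and the tie-corrected Friedman statistic can be sketched in base R without importing the t6e6 file. The matrix `y` and the other names below are illustrative; the data are typed in from the table in the exercise, with blocks in rows and treatments in columns.

```r
# Data from the exercise: rows = blocks 1-3, columns = treatments 1-4
y <- matrix(c(6, 5, 4, 4,
              8, 5, 5, 6,
              7, 6, 5, 6),
            nrow = 3, byrow = TRUE)
b <- nrow(y)    # number of blocks
k <- ncol(y)    # number of treatments

# Randomised block ANOVA; c(y) stacks the columns, i.e. treatment by treatment
long <- data.frame(Y = c(y),
                   Treatment = gl(k, b, b * k),
                   Block = gl(b, 1, b * k))
anova_table <- summary(aov(Y ~ Treatment + Block, data = long))[[1]]
F_values <- anova_table[["F value"]][1:2]     # treatment F, then block F

# Friedman statistic by hand: rank within each block (ties get average ranks)
ranks <- t(apply(y, 1, rank))
T_j   <- colSums(ranks)                       # rank sums per treatment
F_r   <- 12 / (b * k * (k + 1)) * sum(T_j^2) - 3 * b * (k + 1)

# Tie correction: t_i is the size of each group of tied values within a block
tie_sum <- sum(apply(y, 1, function(row) {
  t_i <- as.numeric(table(row))
  sum(t_i^3 - t_i)
}))
C    <- 1 - tie_sum / (b * (k^3 - k))
F_rc <- F_r / C

# Built-in test on the same matrix (blocks in rows, treatments in columns)
F_builtin <- unname(friedman.test(y)$statistic)

print(round(F_values, 3))
print(round(c(F_r = F_r, C = C, F_rc = F_rc, F_builtin = F_builtin), 4))
```

The hand-computed `F_rc` and the statistic returned by `friedman.test` agree because the built-in function applies the same correction for ties.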
ECON20003 – QUANTITATIVE METHODS 2
TUTORIAL 7
Solutions

Exercises for Assessment

Exercise 6 (Selvanathan et al., p. 678, ex. 16.1)

Consider a multinomial experiment involving n = 300 trials and k = 5 cells. The observed frequencies for cells 1 to 5 are 24, 64, 84, 72, 56, and the hypotheses to be tested are as follows:

$H_0: p_1 = 0.1,\ p_2 = 0.2,\ p_3 = 0.3,\ p_4 = 0.2,\ p_5 = 0.2$

$H_A$: at least one $p_i$ (i = 1, 2, 3, 4, 5) is not equal to its value specified in $H_0$.

Test the null hypothesis at the 1% significance level.

This is an example of the chi-square test of goodness of fit. The critical value is $\chi^2_{\alpha,\,k-1} = \chi^2_{0.01,4} = 13.3$ and $H_0$ is to be rejected if the calculated test statistic value is larger than this critical value.

The expected frequencies are equal to the number of trials times the probabilities under $H_0$. The details are shown in the following table:

    i      p_i,0     o_i    e_i    (o_i - e_i)^2 / e_i
    1      0.1000     24     30    1.200
    2      0.2000     64     60    0.267
    3      0.3000     84     90    0.400
    4      0.2000     72     60    2.400
    5      0.2000     56     60    0.267
    Sum    1.000     300    300    4.533

The expected frequencies are all large enough (i.e. ≥ 5), so the chi-square approximation is valid. Since $\chi^2_{obs} = 4.533 < 13.3$, $H_0$ cannot be rejected at the 1% level. Hence, there is not sufficient evidence to conclude that at least one $p_i$ is not equal to its value specified in $H_0$.

To perform this test in R, you just need to execute the following command:
    chisq.test(c(24, 64, 84, 72, 56), p = c(0.1, 0.2, 0.3, 0.2, 0.2))

The printout is

        Chi-squared test for given probabilities

    data:  c(24, 64, 84, 72, 56)
    X-squared = 4.5333, df = 4, p-value = 0.3386

The p-value is 0.3386, so $H_0$ cannot be rejected at any reasonable significance level.

Exercise 7

Return to the case study described in Exercise 2. Is it possible to infer at the 1% significance level that the preference for Australian made grocery products (Aussie) and the impact of brand name on product choice (Brand) are related to each other? Perform a chi-square test of independence with R.

The null hypothesis is that Aussie and Brand are independent of each other, while the alternative hypothesis is that they are related to each other.

Following similar steps as in part (a) of Exercise 2,

    chisq.test(Aussie, Brand)

returns the following printout:

        Pearson's Chi-squared test

    data:  Aussie and Brand
    X-squared = 10.839, df = 4, p-value = 0.02843

There is no warning message, so all expected frequencies are at least five. The (Pearson) chi-square test statistic value is 10.839 and the corresponding p-value is 0.0284. Consequently, at the 1% significance level we cannot reject $H_0$ and conclude that the preference for Australian made grocery products (Aussie) and the impact of brand name on product choice (Brand) might not be related to each other.

Exercise 8

A survey was conducted in five countries. The percentages of respondents whose household members own more than one personal computer, laptop, notebook or iPad are as follows:
    Australia   New Zealand   China   Japan   South Korea
    53%         48%           38%     54%     49%

Suppose that the survey was based on 500 respondents in each country.

(a) At the 0.05 level of significance, determine whether there is some significant difference in the proportion of households in these countries who own more than one computer (personal computer, laptop, notebook or iPad). Do the calculations first manually and then in R.

This is an example of the application of the chi-square test of homogeneity. The null hypothesis is that the proportion of households who own more than one computer is the same in these countries, while the alternative hypothesis is that there are differences.

The number of surveyed households who own more than one computer is 0.53 × 500 = 265 in Australia, 0.48 × 500 = 240 in New Zealand, 0.38 × 500 = 190 in China, 0.54 × 500 = 270 in Japan and 0.49 × 500 = 245 in South Korea. Accordingly, the number of surveyed households who have at most one computer is 235 in Australia, 260 in New Zealand, 310 in China, 230 in Japan and 255 in South Korea.

These observed frequencies can be summarised in a 2 × 5 contingency table and the calculations can be performed as in Exercise 3.

    o_ij                               Country
    Computer        Australia      NZ   China   Japan   Korea   Total
    More than one         265     240     190     270     245    1210
    None or one           235     260     310     230     255    1290
    Total                 500     500     500     500     500    2500

    e_ij                               Country
    Computer        Australia      NZ   China   Japan   Korea     Total
    More than one      242.00  242.00  242.00  242.00  242.00   1210.00
    None or one        258.00  258.00  258.00  258.00  258.00   1290.00
    Total              500.00  500.00  500.00  500.00  500.00   2500.00
    o_ij^2 / e_ij                      Country
    Computer        Australia      NZ   China   Japan   Korea     Total
    More than one      290.19  238.02  149.17  301.24  248.04   1226.65
    None or one        214.05  262.02  372.48  205.04  252.03   1305.62
    Total              504.24  500.03  521.65  506.28  500.07   2532.27

From the table above, the observed test statistic is

$$\chi^2_{obs} = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{o_{ij}^2}{e_{ij}} - n = 2532.27 - 2500 = 32.27$$

The degrees of freedom is df = (2 − 1)(5 − 1) = 4 and the 5% critical value is $\chi^2_{\alpha,\,df} = \chi^2_{0.05,4} = 9.49$. The observed test statistic value is larger than this critical value, so at the 5% significance level we reject $H_0$ and conclude that there is some significant difference in the proportion of households in these countries who own more than one computer.

You can either import the five columns of the observed frequencies into R from the t7e8 Excel file or enter them from the keyboard. Either way, you need to create a matrix of the five columns (I call this matrix Respondents) and perform the test on it.

    Respondents = cbind(Australia, NZ, China, Japan, Korea)
    chisq.test(Respondents)

You should get

        Pearson's Chi-squared test

    data:  Respondents
    X-squared = 32.273, df = 4, p-value = 1.682e-06

(b) Find the approximate p-value of the test in (a) from the relevant statistical table.

The p-value is equal to the probability that a chi-square random variable with 4 degrees of freedom takes on a number equal to or larger than the observed test statistic value. In the chi-square table the largest critical value for df = 4 is 14.9, which belongs to α = 0.005. Since it is still smaller than $\chi^2_{obs} = 32.27$, the p-value is smaller than 0.005.
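The manual calculation in part (a) can be sketched in base R with the observed frequencies typed in from the contingency table, so the t7e8 file is not needed. The matrix name `obs` and the other names are illustrative. The sketch builds the expected frequencies from the row and column totals, applies the shortcut formula used above, and compares the result with `chisq.test`.

```r
# Observed frequencies from the contingency table (rows: computers, columns: countries)
obs <- matrix(c(265, 240, 190, 270, 245,
                235, 260, 310, 230, 255),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("More than one", "None or one"),
                              c("Australia", "NZ", "China", "Japan", "Korea")))

n <- sum(obs)

# Expected frequencies under homogeneity: row total * column total / n
expected <- outer(rowSums(obs), colSums(obs)) / n

# Shortcut formula: sum of o^2 / e minus n
chi_sq  <- sum(obs^2 / expected) - n
df      <- (nrow(obs) - 1) * (ncol(obs) - 1)
crit    <- qchisq(0.95, df)                          # 5% critical value
p_value <- pchisq(chi_sq, df, lower.tail = FALSE)    # exact p-value for part (b)

builtin <- chisq.test(obs)

print(expected)
print(c(chi_sq = chi_sq, df = df, crit = crit, p_value = p_value))
```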
Exercise 9

In Exercise 4 you performed a t-test on the Pearson correlation coefficient between Price and Odometer and concluded at the 5% significance level that there is a significantly negative linear relationship between them. Later, however, you realised that this test might be misleading because Price is probably non-normal. To double check your conclusion, calculate and test the Spearman correlation coefficient with R.

You need to execute

    cor.test(Price, Odometer, alternative = "less", method = "spearman", exact = FALSE)

to obtain

        Spearman's rank correlation rho

    data:  Price and Odometer
    S = 300540, p-value < 2.2e-16
    alternative hypothesis: true rho is less than 0
    sample estimates:
           rho
    -0.8034326

As you can see, the Spearman sample correlation coefficient (-0.8034326) is very similar to the Pearson sample correlation coefficient (-0.8082646) and it is significantly negative. Hence, we conclude that there is a significantly negative relationship between Odometer and Price.
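To see what the Spearman coefficient measures, the sketch below checks on simulated data that it is simply the Pearson coefficient computed on the ranks. The `price` and `odometer` vectors are hypothetical (generated with a fixed seed, not the car data of Exercise 4), so the numbers it prints are not comparable to the printout above.

```r
# Hypothetical data, NOT the Price/Odometer data of Exercise 4
set.seed(1)
odometer <- round(runif(50, min = 10, max = 60), 1)              # thousand km
price    <- round(20 - 0.1 * odometer + rnorm(50, sd = 0.5), 2)  # thousand dollars

# Spearman's coefficient from the built-in method ...
rho_builtin <- cor(price, odometer, method = "spearman")

# ... and as the Pearson coefficient of the (average) ranks
rho_ranks <- cor(rank(price), rank(odometer))

# Left-tail test of H0: rho_s = 0 against HA: rho_s < 0
test <- cor.test(price, odometer, alternative = "less",
                 method = "spearman", exact = FALSE)

print(c(rho_builtin = rho_builtin, rho_ranks = rho_ranks))
print(test$p.value)
```

Because only the ranks enter the calculation, the coefficient is unaffected by non-normality of the original measurements, which is why it serves as the double check in this exercise.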
ECON20003 – QUANTITATIVE METHODS 2
TUTORIAL 8 Solutions

Exercises for Assessment

Exercise 3

Lotteries have become important sources of revenue for governments. Many people have criticised lotteries, however, referring to them as a tax on the poor and uneducated. In an examination of the issue, a random sample of 100 adults was asked how much they spend on lottery tickets as a percentage of total household income. They were also interviewed about various socioeconomic variables, such as number of years of education, age, number of children, and personal income (in thousands of dollars). The data are stored in file t8e3. Obtain and test appropriate correlation coefficients with R to study the following beliefs. Use $\alpha = 0.05$.

a) Relatively uneducated people spend a greater proportion of their income on lotteries than do relatively educated people.
b) Older people spend a greater proportion of their income on lottery tickets than do younger people.
c) People with more children spend a greater proportion of their income on lotteries than do people with fewer children.
d) Relatively poor people spend a greater proportion of their income on lotteries than do relatively rich people.

You learnt about two correlation coefficients: the Pearson correlation coefficient and its nonparametric counterpart, the Spearman correlation coefficient. The Pearson correlation coefficient is appropriate when the variables are quantitative and are measured on an interval or a ratio scale. The t-test for $H_0: \rho_{xy} = 0$, however, is based on the stronger assumption that both variables, X and Y, are normally distributed. If these requirements are met, it is better to use the Pearson correlation coefficient; if not, you should rely on the nonparametric Spearman correlation coefficient.

The five variables in this example are all quantitative. The actual measurements, however, are certainly not normally distributed because they are rounded to the nearest integer and hence are discrete. Still, if these variables assume relatively large numbers of different integers, it might be possible to approximate their distributions with normal distributions.
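One way to judge whether a normal approximation is plausible is to count the distinct values a variable takes and to run a Shapiro-Wilk test. The sketch below does both on a simulated vector `x`, which is a hypothetical stand-in and not the t8e3 data, so its printed numbers say nothing about the actual sample.

```r
# Hypothetical stand-in for an integer-rounded survey variable (NOT the t8e3 data)
set.seed(123)
x <- round(rnorm(100, mean = 12, sd = 3))

# Number of distinct integers the variable takes in the sample
n_distinct <- length(unique(x))

# Shapiro-Wilk test of H0: the sampled population is normally distributed
sw <- shapiro.test(x)

n_distinct   # few distinct values make a normal approximation doubtful
sw$p.value   # a small p-value is evidence against normality
```

With the real data, `x` would be replaced in turn by each of the five variables in t8e3.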
To decide whether this is the case, it is worth having a look at the usual descriptive statistics. According to the Minimum and Maximum values, in the sample at hand Children assumes only 7 different integers, Lottery and Education take on only 14 different integers, while Age and Income assume 62 and 85 different integers, respectively. Given the small numbers of different actual values, the distributions of Children, Lottery and Education clearly cannot be approximated with normal distributions. As regards the other two variables, the Shapiro-Wilk (SW) test rejects normality at any significance level for Income and at the 5% significance level for Age. For these reasons, for any pair of variables the strength of the relationship is best measured by the Spearman correlation coefficient.

a) The belief that relatively uneducated people spend a greater proportion of their income on lotteries than do relatively educated people implies a negative correlation between Lottery and Education. Accordingly, the hypotheses are $H_0: \rho_s = 0$ and $H_A: \rho_s < 0$. The appropriate R command (see Exercise 5 of Tutorial 7) is

    cor.test(Lottery, Education, method = "spearman", exact = TRUE, alternative = "less")

which returns the following results.
Spearman rho is about -0.603 < 0 and its reported p-value is practically zero.¹ Thus, $H_0$ can be rejected at any reasonable significance level, implying that there is a negative correlation between Lottery and Education.

b) The belief that older people spend a greater proportion of their income on lottery tickets than do younger people implies a positive correlation between Lottery and Age, so $H_0: \rho_s = 0$ and $H_A: \rho_s > 0$.

Spearman rho is about 0.141 > 0 and its p-value is about 0.0809. Thus, $H_0$ cannot be rejected at the 5% level, i.e. the positive correlation between Lottery and Age is statistically insignificant. Hence the data do not support the second belief.

c) The belief that people with more children spend a greater proportion of their income on lotteries than do people with fewer children implies a positive correlation between Lottery and Children, so $H_0: \rho_s = 0$ and $H_A: \rho_s > 0$.

Spearman rho is about -0.042 < 0 and its p-value is 0.6618, far too large to reject the null hypothesis of no correlation in favour of the alternative of a positive correlation between Lottery and Children.

¹ This reported p-value is just an approximation because, as R warns us, there are ties. Still, it is so small ($1.617 \times 10^{-11}$) that we have every reason to assume that the exact p-value is also smaller than any reasonable significance level.
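The direction of each belief maps onto the `alternative` argument of `cor.test`. The sketch below shows the call pattern for a right-tail test such as those in parts b) and c). The vectors `lottery` and `age` are simulated, hypothetical stand-ins (not the t8e3 data), so the resulting numbers are illustrative only.

```r
# Simulated stand-ins (NOT the t8e3 data); continuous values, so there are no ties
set.seed(42)
age     <- rnorm(100, mean = 45, sd = 12)
lottery <- rnorm(100, mean = 7,  sd = 3)

# H0: rho_s = 0 against HA: rho_s > 0 (right-tail test)
res <- cor.test(lottery, age, method = "spearman",
                exact = TRUE, alternative = "greater")

res$estimate          # sample Spearman rho
res$p.value           # one-sided p-value
res$p.value < 0.05    # TRUE would mean rejecting H0 at the 5% level
```

For a belief that implies a negative correlation, as in part a), `alternative = "less"` would be used instead.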
d) The belief that relatively poor people spend a greater proportion of their income on lotteries than do relatively rich people implies a negative correlation between Lottery and Income, so $H_0: \rho_s = 0$ and $H_A: \rho_s < 0$.

Spearman rho is about -0.532 < 0 and its p-value is practically zero. Thus, $H_0$ can be rejected at the 5% level, or at any reasonable significance level. This means that there is enough evidence to conclude that the belief is probably correct: there is a negative correlation between Lottery and Income.

Exercise 4 (Selvanathan et al., p. 765, ex. 17.74)

The head office of a life insurance company believed that regional managers should have weekly meetings with their salespeople, not only to keep them abreast of current market trends but also to provide them with important facts and figures that would help them in their sales. Furthermore, the company felt that these meetings should be used for pep talks. One of the points the management felt strongly about was the high value of new contact initiation and follow-up phone calls. To dramatize the importance of phone calls on prospective clients and (ultimately) on sales, the company undertook the following small study. Twenty randomly selected life insurance salespeople were surveyed to determine the number of weekly calls they made and the number of policy sales they concluded. The data (Calls and Sales) are saved in file t8e4. Perform the following tasks with R.

a) Do you expect Calls and Sales to be related to each other? If yes, do you expect the relationship between them to be positive or negative? Which variable is likely determining the other?

If management is right and new contact initiation and follow-up phone calls are indeed useful, then Calls and Sales are certainly related to each other and the relationship between them is positive. Moreover, everything else held constant, Calls can be expected to influence Sales, not the other way around.

b) Illustrate the data on a scattergram. What does this plot suggest about the relationship between the two variables?

Calls (independent variable) is measured on the horizontal axis and Sales (dependent variable) on the vertical axis.
The

    plot(Calls, Sales, main = "Scatterplot of Sales versus Calls", col = "green", pch = 19)

command generates the scatterplot on the next page. It shows that the two variables tend to move in the same direction, so in this sample there is indeed a positive, and seemingly strong, linear relationship between Calls and Sales, as expected.

c) Find the correlation coefficient between Calls and Sales. What do this coefficient and the corresponding t-test statistic and p-value tell you about the relationship between the two variables? Can we rely on this t-test?

Calls and Sales are both quantitative variables, so the strength of a linear relationship between them can be measured by the Pearson correlation coefficient.

The Pearson correlation coefficient is about 0.955, so in this sample there is indeed a strong linear relationship between Calls and Sales. The corresponding t-statistic is 13.615 > 0 and the p-value for a right-tail test is practically zero, thus $H_0: \rho_{xy} = 0$ can be safely rejected in favour of $H_A: \rho_{xy} > 0$. Hence, there is overwhelming evidence to infer that there is a positive linear relationship between Calls and Sales.
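The reported correlation coefficient and t-statistic can be checked against each other, because the test statistic is $t = r\sqrt{n-2}/\sqrt{1-r^2}$ with $n - 2$ degrees of freedom. A minimal sketch, using only the rounded figures quoted above (the variable names are illustrative):

```r
t_obs <- 13.615   # t-statistic reported above
n     <- 20       # number of salespeople surveyed
df    <- n - 2    # degrees of freedom of the t-test

# Invert t = r * sqrt(df) / sqrt(1 - r^2) to recover r from t
r <- t_obs / sqrt(t_obs^2 + df)

round(r, 3)       # should agree with the reported Pearson coefficient
```

Going the other way with the rounded value r = 0.955 gives a slightly different t-statistic, since the rounding error in r is magnified when $1 - r^2$ is small.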
This t-test for the Pearson population correlation coefficient assumes that both sampled populations are normally distributed. A quick look at the descriptive statistics and the SW test results (see next page) suggests that, despite the limited sample size, there is no reason to question normality.

d) Find the least squares regression line that expresses the number of Sales as a function of the number of Calls.

The

    summary(lm(Sales ~ Calls))

R command generates the following printout:

e) What do the coefficients tell you?

From the Estimate column the point estimates of the intercept and slope parameters are
$$\hat{\beta}_0 = -2.059, \qquad \hat{\beta}_1 = 0.345$$

The y-intercept estimate is negative. Since it refers to a number of sales, this point estimate is clearly meaningless. The slope estimate tells us that for every additional call per week the number of policies sold is expected to increase by 0.345.

f) What proportion of the variability in the number of sales can be attributed to the variability in the number of calls?

The answer to this question is provided by the coefficient of determination, $R^2$ (Multiple R-squared on the printout). It is about 0.911, meaning that about 91% of the total sample variation in Sales can be attributed to the variation in Calls, and thus can be explained by this simple linear regression model.

g) Is there enough evidence (with $\alpha = 0.05$) to indicate that the larger the number of calls, the larger the number of sales?

The question implies the following hypotheses about the slope coefficient:

$$H_0: \beta_1 = 0 \quad \text{vs.} \quad H_A: \beta_1 > 0$$

and the hypothetical parameter value is zero. The observed t-statistic is 13.616. It is positive, as implied by the alternative hypothesis, and its p-value, half of Pr(>|t|), is practically zero. Hence, the slope estimate is significantly positive, practically at any level, and more calls can be expected to generate more sales.
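The step from the two-tail Pr(>|t|) on the printout to the right-tail p-value in part g) can be sketched as follows. The sketch also uses the fact that, in a regression with a single regressor, $R^2 = t^2/(t^2 + df)$, which ties part g) back to part f). Only the rounded figures quoted above are used, and the variable names are illustrative.

```r
t_slope <- 13.616   # t-statistic of the slope quoted in part g)
df      <- 20 - 2   # n - k - 1 with n = 20 salespeople and k = 1 regressor

# Two-tail p-value, i.e. what the Pr(>|t|) column of the printout reports
p_two <- 2 * pt(abs(t_slope), df, lower.tail = FALSE)

# Right-tail p-value: half of the two-tail value, valid because t_slope > 0
p_right <- p_two / 2

# Coefficient of determination implied by the slope t-statistic
r2 <- t_slope^2 / (t_slope^2 + df)

p_right < 0.05   # TRUE means H0 is rejected at the 5% level
round(r2, 2)
```

Had the t-statistic been negative, the right-tail p-value would instead be one minus half of the two-tail value.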
ECON20003 – QUANTITATIVE METHODS 2
TUTORIAL 9 Solutions

Exercises for Assessment

Exercise 3 (Selvanathan et al., p. 827, Case 18.4)

A leader of the Workers Union in New Zealand would like to study the movement in the average hourly earnings of New Zealand workers. He collected and recorded data on average earnings (AE, $), labour cost¹ (LC, $) and rate of inflation (RI, %). His data are saved in the t9e3 file.

a) Set up a suitable regression model to investigate the impact of labour cost and the rate of inflation on the hourly earnings of an average New Zealand worker.

Given the objective of the research project, the population regression model is

$$AE_i = \beta_0 + \beta_1 LC_i + \beta_2 RI_i + \varepsilon_i$$

b) Do you expect the slope parameters to be positive or negative?

Labour cost and inflation are likely to put upward pressure on average earnings, so both slope parameters are expected to be positive.

c) Estimate the regression model.

¹ Labour cost is the sum of all wages paid to employees, as well as the cost of employee benefits and payroll taxes paid by an employer.
d) Do the estimated slope coefficients have the logical signs? Carefully explain the meanings of the estimated slope coefficients.

The slope estimate of LC is positive, as expected. However, the slope estimate of RI is negative and this does not seem to be reasonable. The slope estimates suggest that

(i) Given the rate of inflation, every additional dollar of labour cost is expected to raise the average hourly earnings by about 2.7 cents;
(ii) Given the labour cost, a 1 percentage point increase in the inflation rate is expected to bring down the average hourly earnings by about 1.7 cents.

e) What do the unadjusted and the adjusted coefficients of determination tell you about the quality of the fit?

In this case there is hardly any difference between $R^2$ and Adj. $R^2$. This is because

$$\bar{R}^2 = 1 - (1 - R^2)\,\frac{n-1}{n-k-1}$$

and

$$\frac{n-1}{n-k-1} = \frac{67-1}{67-2-1} = 1.03125$$

is very close to one, so $1 - \bar{R}^2 \approx 1 - R^2$, i.e. $\bar{R}^2 \approx R^2$. Both statistics are almost one, implying that this regression model can explain almost all the variation in the average hourly earnings.

Suppose now that you have not estimated the regression model yourself but received a hard copy of the R printout from a friend. However, your friend's printer was running out of ink and some details are not visible on your copy, which is shown on the top of the next page. Complete the remaining tasks using only this incomplete printout and the relevant statistical tables.

f) Try to recover the missing y-intercept estimate.

The t-statistic is the point estimate divided by the standard error, so

$$\hat{\beta}_0 = t_{\hat{\beta}_0} \times s_{\hat{\beta}_0} = 101.451 \times 0.089571 = 9.087$$
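The arithmetic in parts e) and f) can be reproduced with a few lines of R. This is only a check of the hand calculations; the value of $R^2$ is the one used in the F-test of part h), and the variable names are illustrative.

```r
n  <- 67          # sample size
k  <- 2           # number of independent variables (LC and RI)
r2 <- 0.999346    # R^2 as used in the F-test of part h)

# Part e): adjustment factor and adjusted R^2
ratio  <- (n - 1) / (n - k - 1)
adj_r2 <- 1 - (1 - r2) * ratio

# Part f): point estimate = t-statistic x standard error
t_b0  <- 101.451
se_b0 <- 0.089571
b0    <- t_b0 * se_b0

ratio
round(adj_r2, 4)
round(b0, 3)
```

Because the adjustment factor only scales the small quantity $1 - R^2$, the adjusted and unadjusted coefficients of determination differ very little here.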
g) What are the missing t-Statistic and Prob. value for RI?

$$t_{\hat{\beta}_2} = \frac{\hat{\beta}_2}{s_{\hat{\beta}_2}} = \frac{-0.017015}{0.020622} = -0.82509$$

On the R regression printout the Pr(>|t|) value of a t-test is the p-value for a two-tail test with zero hypothesized parameter value, i.e. twice the probability that the t-test statistic assumes a value that is at least as extreme as the observed test-statistic value. Therefore,

$$2P(t_{df=64} \le -0.82509) \approx 2P(t_{df=65} \le -0.82509) = 2P(t_{df=65} \ge 0.82509) > 2P(t_{df=65} \ge 1.295) = 2 \times 0.10 = 0.20$$

Note that the second last step was based on the table value $t_{0.1,\,65} = 1.295$.

h) Perform the F-test of overall significance at the 0.005 level. State the null and alternative hypotheses, show the calculation of the test statistic, make a statistical decision based on the critical value approach and on the p-value approach, respectively, and draw your conclusion.

The hypotheses are

$$H_0: \beta_1 = \beta_2 = 0 \quad \text{vs.} \quad H_A: \beta_1 \ne 0 \text{ and/or } \beta_2 \ne 0$$

or, equivalently,

$$H_0: R^2 = 0 \quad \text{vs.} \quad H_A: R^2 > 0$$

The critical value is
$$F_{\alpha,\,k,\,n-k-1} = F_{0.005,\,2,\,64} \approx F_{0.005,\,2,\,60} = 5.79$$

and $H_0$ is to be rejected if the observed test statistic value is larger than this critical value. From the formula of the F-test statistic based on $R^2$,

$$F_{obs} = \frac{R^2 / k}{(1 - R^2)/(n-k-1)} = \frac{0.999346 / 2}{(1 - 0.999346)/64} = 48897.66$$

Since the observed value of the test statistic is far bigger than the critical value, at the 0.005 significance level we reject $H_0$ and conclude that the model is useful because either LC or RI or both have some significant effect on AE.

From the F-table, we cannot obtain the p-value of this test, but it is certainly smaller than 0.005 because

$$P(F_{df_1=2,\,df_2=64} > 48897.66) \approx P(F_{df_1=2,\,df_2=70} > 48897.66) < P(F_{df_1=2,\,df_2=70} > 5.72) = 0.005$$

Exercise 4

In part (e) of Exercise 2 you performed a general F-test with R on the following hypotheses:

$$H_0: \beta_2 = 1.8,\ \beta_3 = 3.2, \qquad H_A: \beta_2 \ne 1.8 \text{ or } \beta_3 \ne 3.2 \text{ or } \beta_2 \ne 1.8,\ \beta_3 \ne 3.2$$

a) Derive and estimate the restricted regression implied by the null hypothesis.

By plugging the restrictions into the model, we obtain

$$time = \beta_0 + \beta_1\, depart + \beta_2\, reds + \beta_3\, trains + \varepsilon = \beta_0 + \beta_1\, depart + 1.8\, reds + 3.2\, trains + \varepsilon$$

so the restricted model is

$$time - 1.8\, reds - 3.2\, trains = \beta_0 + \beta_1\, depart + \varepsilon$$

The corresponding R regression printout is on the next page.

b) Using the sum of squares for errors from the unrestricted and restricted regressions, perform the general F-test manually at the 5% significance level. Did you manage to get the same results as in part (e) of Exercise 2? Would it be possible to calculate the test statistic from the coefficients of determination of the unrestricted and restricted regressions?
The null hypothesis comprises two linear restrictions, $\beta_2 = 1.8$ and $\beta_3 = 3.2$, so m = 2. From the restricted regression the sum of squares due to error is $SSE_r = 3952.042$, and from the unrestricted regression it is $SSE_{ur} = 3729.870$, so the observed general F-test statistic is

$$F_{obs} = \frac{(SSE_r - SSE_{ur})/m}{SSE_{ur}/(n-k-1)} = \frac{(3952.042 - 3729.870)/2}{3729.870/(231-3-1)} = 6.76$$

The critical value is $F_{\alpha,\,df_1,\,df_2} = F_{0.05,\,2,\,227} \approx F_{0.05,\,2,\,\infty} = 3.00$. It is smaller than the observed test statistic value, so at the 5% significance level we reject $H_0$. The test statistic and the statistical decision are the same as in part (e) of Exercise 2.

This time the test statistic could not be calculated from the coefficients of determination because the unrestricted and the restricted regressions have different dependent variables.

c) Using a 5% significance level, test the null hypothesis that Bill's expected delay from a train at the Murrumbeena level crossing is 3.5 minutes and the delay from a train at the Murrumbeena level crossing is double that from a red light. Perform the test with R only.

The hypotheses are

$$H_0: \beta_3 = 3.5,\ \beta_3 = 2\beta_2, \qquad H_A: \beta_3 \ne 3.5 \text{ or } \beta_3 \ne 2\beta_2 \text{ or } \beta_3 \ne 3.5,\ \beta_3 \ne 2\beta_2$$

Use the linearHypothesis function of the car package and specify the null hypothesis as

    c("trains = 3.5", "trains = 2*reds")

The relevant printout is on the next page. The F-test statistic is 7.6036 and its p-value is 0.0006. Hence, at the 5% significance level we reject $H_0$ and conclude that the expected delay from a train at the Murrumbeena level crossing is either different from 3.5 minutes, or is not double that from a red light, or both.
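The manual general F-test of part (b) can be checked in R from the two sums of squares alone. The sketch below also computes the critical value and the p-value directly with `qf` and `pf` instead of reading them from a table; the variable names are illustrative.

```r
sse_r  <- 3952.042   # SSE of the restricted regression
sse_ur <- 3729.870   # SSE of the unrestricted regression
n <- 231             # sample size
k <- 3               # regressors in the unrestricted model (depart, reds, trains)
m <- 2               # number of restrictions under H0

# General F-test statistic
f_obs <- ((sse_r - sse_ur) / m) / (sse_ur / (n - k - 1))

# 5% critical value and p-value with (m, n - k - 1) degrees of freedom
f_crit <- qf(0.95, df1 = m, df2 = n - k - 1)
p_val  <- pf(f_obs, df1 = m, df2 = n - k - 1, lower.tail = FALSE)

round(f_obs, 2)
f_obs > f_crit   # TRUE means H0 is rejected at the 5% level
p_val < 0.05
```

The exact critical value for 227 denominator degrees of freedom is slightly above the tabulated 3.00 for infinite degrees of freedom, which does not change the decision.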
ECON20003 – QUANTITATIVE METHODS 2
TUTORIAL 10 Solutions

Exercises for Assessment

Exercise 4 (Gujarati, pp. 370-374)

This exercise is based on a study published by James W. Longley in 1967 about the computational accuracy of least-squares estimates in several computer programs.¹ This study is clearly outdated by now, but the Longley data set has become the workhorse for illustrating several econometric problems, including multicollinearity. This data set is saved in the t10e4 Excel file. It contains U.S. time series data for the years 1947–1962 on the following seven variables:

emp = number of people employed, in thousands;
def = GNP implicit price deflator;
gnp = GNP, in millions of dollars;
unemp = number of people unemployed, in thousands;
arm = number of people in the armed forces;
pop = noninstitutional population over 14 years of age²;
year = year, equal to 1 in 1947, 2 in 1948, …, and 16 in 1962.

Assume that our objective is to model Y (emp) on the basis of the six X variables. Although in practice after having estimated a regression model we should always assess and interpret the results, this time we skip some of the usual steps and focus on multicollinearity.

a) Using R, estimate a multiple linear regression model.

The

    m = lm(emp ~ def + gnp + unemp + arm + pop + year)
    summary(m)

commands generate the printout on the top of the next page.

¹ Longley, J.W. (1967): An appraisal of least-squares programs from the point of view of the user. Journal of the American Statistical Association, vol. 62, pp. 819–841.
² In the United States, the civilian noninstitutional population refers to people residing in the 50 States and the District of Columbia who are not inmates of institutions (penal, mental facilities, homes for the aged), and who are not on active duty in the Armed Forces. (Wikipedia).
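For readers without the t10e4 file, base R includes a data frame named `longley`. We assume here that it contains the same 1947–1962 series under different column names and units (an assumption; it is not the t10e4 file). Since $R^2$ is unaffected by rescaling the variables, the fit can be sketched on that data frame as follows.

```r
# Built-in data frame; columns: GNP.deflator, GNP, Unemployed,
# Armed.Forces, Population, Year, Employed
data(longley)

# Same specification as in part a), with the built-in column names
m_builtin <- lm(Employed ~ GNP.deflator + GNP + Unemployed +
                  Armed.Forces + Population + Year, data = longley)

nrow(longley)                             # number of annual observations
round(summary(m_builtin)$r.squared, 4)    # coefficient of determination
```

The slope estimates from this sketch are not directly comparable with those for t10e4, because they depend on the units of measurement.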
b) Apply the three simple indicators or rules of thumb that can be used to detect imperfect multicollinearity. Based on them, does multicollinearity seem to be a problem in this regression model? Explain your opinion.

i. $R^2$ is very high (0.9955), but three of the six independent variables (def, gnp and pop) are statistically insignificant.³ This is a classic symptom of multicollinearity.

ii. Execute

    library(Hmisc)
    rcorr(as.matrix(t10e4), type = "pearson")

to obtain the correlation⁴ matrix for the variables in the model and the corresponding p-values. According to the results displayed on the next page, some of the independent variables are strongly correlated with each other (|r| > 0.8), namely gnp and def, pop and def, pop and gnp, year and def, year and gnp, and year and pop. This suggests that there may be a severe multicollinearity problem.

iii. To obtain the Variance Inflation Factors, execute

    library(car)
    round(vif(m), 4)

³ As for the other three independent variables, at any reasonable significance level, the slopes of unemp and arm are significantly negative, and the slope of year is significantly positive. The negative slopes of unemp and arm make sense as, ceteris paribus, more unemployed and more people in the armed forces are expected to decrease the number of people employed. The positive slope of year is also reasonable because this time variable can be considered as a proxy for some omitted variables whose combined effect on the number of people employed increases every year.
⁴ All variables in the model are quantitative, so we use the Pearson correlation coefficient.
The VIF values are shown in the printout. Clearly, every independent variable except arm has an extremely high VIF statistic (i.e. much larger than 5), suggesting that the Longley data are indeed plagued by the multicollinearity problem.

All things considered, multicollinearity appears to be severe this time.

c) You learnt in the lectures that the problem of severe multicollinearity in general might be mitigated by increasing the sample size, or transforming some of the multicollinear variables, or dropping all but one of the multicollinear variables. The first option is not available to us, but as for the second and third, one might argue as follows.

(i) Because of inflation, nominal GNP (gnp) and the GNP implicit price deflator (def) are likely strongly correlated, so instead of these variables it might be better to use real GNP, which is nominal GNP divided by the implicit price deflator, i.e. rgnp = gnp / def.
(ii) Noninstitutional population over 14 years of age tends to increase over time, so pop and year are also likely strongly correlated with each other. A possible solution is to keep pop in the model but drop year.
(iii) The number of unemployed (unemp) and noninstitutional population over 14 years of age (pop) can also be strongly correlated with each other, so it might be a good idea to keep pop but drop unemp.

Incorporate these changes in the model and estimate the new model.
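As a cross-check on the VIF reading in part (b), the Variance Inflation Factors can also be computed by hand from their definition, $VIF_j = 1/(1 - R_j^2)$, where $R_j^2$ is the coefficient of determination from regressing the j-th independent variable on the others. The sketch below does this on base R's built-in `longley` data frame, which we assume holds the same series as t10e4 under different column names and units (an assumption; VIFs do not depend on the units).

```r
data(longley)

# The six independent variables, under the built-in column names
X <- longley[, c("GNP.deflator", "GNP", "Unemployed",
                 "Armed.Forces", "Population", "Year")]

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing X_j on the other regressors
vif_manual <- sapply(names(X), function(v) {
  aux <- lm(reformulate(setdiff(names(X), v), response = v), data = X)
  1 / (1 - summary(aux)$r.squared)
})

round(vif_manual, 4)
names(vif_manual)[vif_manual < 5]   # regressors below the rule-of-thumb threshold
```

This uses only base R, so it does not require the car package.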
The new regression can be estimated with the

    rgnp = gnp/def
    m2 = lm(emp ~ rgnp + arm + pop)
    summary(m2)

R commands. The new regression printout is below.

d) Apply the three simple indicators or rules of thumb for the detection of imperfect multicollinearity to the regression you estimated in part (c). Does multicollinearity seem to be a problem in this new regression model? Explain your opinion.

i. $R^2$ is still very high (0.9814) and, at the 5% level, the first slope is significantly positive and the other two slopes are significantly negative. Hence, multicollinearity does not appear to be as severe in the new model as in the original model. However, the sign of the third slope estimate does not seem to be logical⁵, so even if we managed to mitigate multicollinearity, the new model is still not ideal.

⁵ Ceteris paribus, higher real GNP and fewer people in the armed forces are expected to be accompanied by higher employment, so the signs of the first and second slopes are logical. The negative sign of the third slope, however, is surprising: why would employment decrease when population increases?

ii. The new correlation matrix obtained by executing

    rcorr(cbind(emp, rgnp, arm, pop), type = "pearson")

is on the next page. As pop and rgnp are strongly correlated with each other, multicollinearity still appears to be an issue.

iii. The Variance Inflation Factors generated by the round(vif(m2), 4)
command are also on the next page.

One VIF value is smaller than 5, but the other two are very large, so the new model is also plagued by the multicollinearity problem.

Hence, multicollinearity appears to be severe in the new model as well.

Exercise 5 (Selvanathan et al., p. 825, ex. 18.48)

The Director of the Department of Education in Queensland was analysing last year's average mathematics test scores in the schools under his control. He noticed that there were dramatic differences in scores among the schools. In an attempt to improve the scores of all the schools, he attempted to determine the factors that account for the differences. Accordingly, he took a random sample of 40 schools across the state and, for each, determined the mean mathematics test score (score), the percentage of teachers in each school who have at least one university degree in mathematics (math), the mean age (age), and the mean annual income ($'000) (income) of the mathematics teachers. These data are saved in the t10e5 Excel file.

a) Perform a multiple regression analysis on these data with R. What is your sample regression equation?

The

    m = lm(score ~ math + age + income)
    summary(m)

commands generate the regression printout displayed on the top of the next page.
From the printout the sample regression equation is

$$\widehat{score}_i = 35.7 + 0.247\, math_i + 0.245\, age_i + 0.133\, income_i$$

b) Is the model useful in explaining the variation among schools? Explain.

To answer this question, one needs to evaluate the F-test of overall significance and to interpret the adjusted coefficient of determination.

As regards the F-test of overall significance, the null hypothesis is that none of the independent variables helps to explain the dependent variable, i.e. every slope parameter is zero, while the alternative hypothesis is that at least one independent variable is important and thus its slope parameter is different from zero. In symbols,

$$H_0: \beta_1 = \beta_2 = \beta_3 = 0, \qquad H_A: \text{at least one } \beta_i \ne 0$$

The test statistic is $F_{obs} = 6.663$ and the corresponding p-value is about 0.001, so the null hypothesis can be rejected even at the 0.5% significance level. Therefore, we conclude that the model is useful as at least one independent variable has a significant effect on the mean test score.

The adjusted coefficient of determination is 0.303. It means that after having taken the sample size and the number of independent variables into consideration, about 30% of the total sample variation of the mean test scores can be accounted for by the variations in the three independent variables, math, age and income.

c) Are the normality and homoskedasticity conditions satisfied? Explain.

Executing

    olsres = residuals(m)
    hist(olsres)
    qqnorm(olsres, pch = 1)
    qqline(olsres)
    library(pastecs)
    stat.desc(olsres, basic = FALSE, norm = TRUE)
    shapiro.test(olsres)
    yhat = fitted.values(m)
    plot(yhat, olsres, main = "OLS residuals versus yhat", col = "red", pch = 19, cex = 0.75)
    library(lmtest)
    bptest(m, ~ math + age + income + I(math^2) + I(age^2) + I(income^2) + I(math * age) + I(math * income) + I(age * income))

you can generate the following graphs and printouts:

[Figures: Normal Q-Q Plot (Theoretical Quantiles vs. Sample Quantiles) and Histogram of olsres (olsres vs. Frequency)]
[Figure: OLS residuals versus yhat (yhat vs. olsres)]

The histogram seems to have a longer left tail than right tail, the mean is smaller than the median and skewness is negative. These indicate that the distribution of the residuals is skewed to the left and thus, not being symmetrical, it is not normally distributed. However, skewness is close to zero, excess kurtosis is close to zero, skew.2SE and kurt.2SE are both very small in absolute value, and the SW test is insignificant. Hence, the random error variables, $\varepsilon_i$, might be normally distributed.

As for homoskedasticity, the residual plot does not reveal any discernible pattern that would suggest heteroskedasticity and the White test has a large p-value (0.3845), implying that the null hypothesis of homoskedasticity is maintained at any reasonable significance level. Hence, there is no reason to doubt the validity of the homoskedasticity assumption.

d) Is multicollinearity a problem? Explain.

Two of the three independent variables, age and income, are clearly insignificant (their p-values are 0.1945 and 0.3889, respectively), and the coefficient of determination is relatively small (0.357). Hence, there is no contradiction between the overall quality of the model (poor) and the individual significance/insignificance of the independent variables. This implies that imperfect multicollinearity is unlikely to be severe.

The

    library(Hmisc)
    rcorr(as.matrix(t10e5), type = "pearson")
    library(car)
    round(vif(m), 4)

commands return the following results:

The strongest correlation is between age and income (r = 0.57), i.e. between two independent variables. Still, even this correlation is only moderately strong, so multicollinearity might not be very severe. The three VIF values are all smaller than 1.5, far below the threshold value of 5. All things considered, multicollinearity does not appear to be severe in this model.

e) Interpret the coefficients. Do you find their signs reasonable? Why or why not? Based on your expectations, perform t-tests on the coefficients (use $\alpha = 0.05$).

In this case the y-intercept does not have a meaningful interpretation, as neither the age nor the income of mathematics teachers can be zero in real life.

The first slope estimate suggests that, keeping age and income constant, a one percentage point increase in the proportion of teachers who have at least one university degree in mathematics increases the mean mathematics test score by about 0.247.

The second slope estimate suggests that, keeping math and income constant, when the mean age of the mathematics teachers increases by one year, the mean mathematics test score increases by about 0.245.

The third slope estimate suggests that, keeping math and age constant, when the mean annual income of the mathematics teachers increases by 1000 dollars, the mean mathematics test score increases by about 0.133.
All three slope estimates are positive. One might argue that this makes sense, as better qualified (math), more experienced (age) and better paid (income) mathematics teachers are probably doing better jobs. Based on this argument, we perform right-tail t-tests on the slopes with

$H_0: \beta_i = 0$, $H_A: \beta_i > 0$ ($i = 1, 2, 3$)

The p-value, i.e. half of Pr(>|t|), is about 0.0005 < 0.05 for math, but it is 0.0973 > 0.05 for age and 0.1944 > 0.05 for income. Hence, at the 5% level, math has a significantly positive effect on score, but the individual effects of age and income on score are only insignificantly positive.

f) Test the null hypothesis that neither the teachers' mean age nor their mean annual income has a significant effect on the average mathematics test scores (use $\alpha = 0.05$).

This question requires a general F-test with

$H_0: \beta_2 = \beta_3 = 0$, $H_A: \beta_2 \neq 0$, or $\beta_3 \neq 0$, or $\beta_2 \neq 0$ and $\beta_3 \neq 0$

Execute

    linearHypothesis(model = m, c("age = 0", "income = 0"))

to obtain the following printout:

The F-statistic is 2.8093 and its p-value is 0.0735, so at the 5% significance level we maintain $H_0$ and conclude that age and income are jointly insignificant. [6]

[6] LK: In part (c) we saw that the random error variables might be normally distributed, so the F-test is appropriate.
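The joint test in part (f) can also be reproduced by comparing a restricted and an unrestricted regression. The sketch below does this on simulated data, because the t10e5 data set is not reproduced here; the data frame d, its values and all object names are assumptions made purely for illustration, so the numbers it produces will not match the printout discussed above.

```r
# Simulated stand-in for the tutorial data (hypothetical values)
set.seed(1)
n <- 50
d <- data.frame(math   = runif(n, 20, 80),
                age    = runif(n, 30, 55),
                income = runif(n, 40, 90))
d$score <- 45 + 0.25 * d$math + rnorm(n, sd = 7)

# Unrestricted model and the model restricted by H0: beta_2 = beta_3 = 0
unrestricted <- lm(score ~ math + age + income, data = d)
restricted   <- lm(score ~ math, data = d)

# General F statistic: ((SSE_R - SSE_U) / q) / (SSE_U / df_U)
sse_u <- sum(resid(unrestricted)^2)
sse_r <- sum(resid(restricted)^2)
q     <- 2                          # number of restrictions under H0
df_u  <- df.residual(unrestricted)  # n - k - 1 of the unrestricted model
F_manual <- ((sse_r - sse_u) / q) / (sse_u / df_u)
p_manual <- pf(F_manual, q, df_u, lower.tail = FALSE)

# The same statistic from base R's model comparison
F_anova <- anova(restricted, unrestricted)$F[2]
print(c(F_manual = F_manual, F_anova = F_anova, p_value = p_manual))
```

The manual statistic and the one from anova() should agree, which is also what linearHypothesis reports for this kind of exclusion restriction.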
ECON20003 – QUANTITATIVE METHODS 2

TUTORIAL 11

Solutions

Exercises for Assessment

Exercise 5

(Selvanathan et al., p. 846, ex. 19.7)

Create and identify indicator variables to represent the following nominal variables.

a) Religious affiliation (Catholic, Protestant and other).

This nominal/qualitative variable has three possible values/categories, which can be represented by two indicator/dummy variables. There are three equivalent options.

(i) $D_c = 1$ for Catholic and 0 otherwise (i.e. non-Catholic)
    $D_p = 1$ for Protestant and 0 otherwise (i.e. non-Protestant)

    In this case $D_c = 1$ and $D_p = 0$ imply Catholic, $D_c = 0$ and $D_p = 1$ imply Protestant, and $D_c = 0$ and $D_p = 0$ imply other (i.e. neither Catholic nor Protestant). [1]

(ii) $D_c = 1$ for Catholic and 0 otherwise (i.e. non-Catholic)
     $D_o = 1$ for other religious affiliation (i.e. neither Catholic nor Protestant) and 0 otherwise (i.e. either Catholic or Protestant)

     In this case $D_c = 1$ and $D_o = 0$ imply Catholic, $D_c = 0$ and $D_o = 0$ imply Protestant, and $D_c = 0$ and $D_o = 1$ imply other religious affiliation (i.e. neither Catholic nor Protestant).

(iii) $D_p = 1$ for Protestant and 0 otherwise (i.e. non-Protestant)
      $D_o = 1$ for other religious affiliation (i.e. neither Catholic nor Protestant) and 0 otherwise (i.e. either Catholic or Protestant)

      In this case $D_p = 1$ and $D_o = 0$ imply Protestant, $D_p = 0$ and $D_o = 0$ imply Catholic, and $D_p = 0$ and $D_o = 1$ imply other religious affiliation (i.e. neither Catholic nor Protestant).

[1] Note that $D_c = 1$ and $D_p = 1$ would mean Catholic and Protestant, which does not make sense. Similarly, in the two other options the two dummy variables cannot be equal to one simultaneously.
b) Working shift (9 a.m.–5 p.m., 5 p.m.–1 a.m., and 1 a.m.–9 a.m.).

Again, there are three possible categories (shift 1, shift 2, shift 3), which can be represented by two dummy variables. There are three options.

(i) $D_1 = 1$ for shift 1 and 0 otherwise (i.e. shift 2 or 3)
    $D_2 = 1$ for shift 2 and 0 otherwise (i.e. shift 1 or 3)

    In this case $D_1 = 1$ and $D_2 = 0$ imply shift 1, $D_1 = 0$ and $D_2 = 1$ imply shift 2, and $D_1 = 0$ and $D_2 = 0$ imply shift 3 (i.e. neither shift 1 nor shift 2).

(ii) $D_1 = 1$ for shift 1 and 0 otherwise (i.e. shift 2 or 3)
     $D_3 = 1$ for shift 3 and 0 otherwise (i.e. shift 1 or 2)

     In this case $D_1 = 1$ and $D_3 = 0$ imply shift 1, $D_1 = 0$ and $D_3 = 1$ imply shift 3, and $D_1 = 0$ and $D_3 = 0$ imply shift 2 (i.e. neither shift 1 nor shift 3).

(iii) $D_2 = 1$ for shift 2 and 0 otherwise (i.e. shift 1 or 3)
      $D_3 = 1$ for shift 3 and 0 otherwise (i.e. shift 1 or 2)

      In this case $D_2 = 1$ and $D_3 = 0$ imply shift 2, $D_2 = 0$ and $D_3 = 1$ imply shift 3, and $D_2 = 0$ and $D_3 = 0$ imply shift 1 (i.e. neither shift 2 nor shift 3).

c) Supervisor (David Jones, Mary Brown, Rex Ralph and Kathy Smith).

This nominal/qualitative variable has four possible values/categories, which can be represented by three dummy variables. There are four options. For example, defining the 'base' category as Kathy Smith:

$D_{DJ} = 1$ for David Jones and 0 otherwise (i.e. Mary Brown or Rex Ralph or Kathy Smith)
$D_{MB} = 1$ for Mary Brown and 0 otherwise (i.e. David Jones or Rex Ralph or Kathy Smith)
$D_{RR} = 1$ for Rex Ralph and 0 otherwise (i.e. David Jones or Mary Brown or Kathy Smith)

In this case $D_{DJ} = 1$, $D_{MB} = 0$ and $D_{RR} = 0$ imply David Jones; $D_{DJ} = 0$, $D_{MB} = 1$ and $D_{RR} = 0$ imply Mary Brown; $D_{DJ} = 0$, $D_{MB} = 0$ and $D_{RR} = 1$ imply Rex Ralph; and $D_{DJ} = 0$, $D_{MB} = 0$ and $D_{RR} = 0$ imply Kathy Smith.
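The dummy-variable coding described in Exercise 5 can be sketched in R as follows. The vector religion and its five values are hypothetical, chosen only to illustrate option (i) of part (a), in which 'other' is the base category; the object names Dc, Dp, f and X are likewise assumptions.

```r
# Hypothetical sample of religious affiliations (illustrative only)
religion <- c("Catholic", "Protestant", "Other", "Catholic", "Other")

# Option (i): two dummies, with 'Other' as the base category
Dc <- ifelse(religion == "Catholic", 1, 0)
Dp <- ifelse(religion == "Protestant", 1, 0)
print(data.frame(religion, Dc, Dp))

# The two dummies can never be equal to one simultaneously
stopifnot(all(Dc + Dp <= 1))

# Alternatively, let R build the same dummies from a factor
# whose reference (base) level is set to 'Other'
f <- relevel(factor(religion), ref = "Other")
X <- model.matrix(~ f)
print(X)
```

Changing the ref argument of relevel() switches to one of the other two equivalent options, and the same pattern extends to the three-shift and four-supervisor cases.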
Exercise 6

(Selvanathan et al., p. 846, ex. 19.9)

The director of a graduate school of business wanted to find a better way of deciding which students should be accepted into the MBA program. Currently, the records of the applicants are examined by the admissions committee, which looks at the undergraduate grade point average (UGPA) and the MBA admission score (MBAA). The director believed that the type of undergraduate degree also influenced the student's MBA grade point average (MBAGPA). The most common undergraduate degrees of students attending the graduate school of business are BCom, BEng, BSc and BA. Because the type of degree is a qualitative variable, the following three dummy variables were created:

$D_1 = 1$ if the degree is BCom and 0 if the degree is not BCom
$D_2 = 1$ if the degree is BEng and 0 if the degree is not BEng
$D_3 = 1$ if the degree is BSc and 0 if the degree is not BSc.

The director took a random sample of 100 students who entered the program two years ago, and recorded for each student the MBAGPA, UGPA and MBAA scores and the values of the $D_1$, $D_2$, $D_3$ dummy variables. These data are saved in the t11e6 Excel file.

a) Using these data, estimate the following model:

$MBAGPA = \beta_0 + \beta_1 UGPA + \beta_2 MBAA + \beta_3 D_1 + \beta_4 D_2 + \beta_5 D_3 + \varepsilon$

Does the model seem to perform satisfactorily? How do you interpret the slope coefficients?

In this model there are five independent variables. UGPA and MBAA are quantitative variables, while $D_1$, $D_2$ and $D_3$ are dummy variables. [2] The three dummy variables are used to represent the type of undergraduate degree, which is a qualitative variable. They are sufficient to distinguish the four undergraduate degrees, since
for BCom $D_1 = 1$, $D_2 = 0$ and $D_3 = 0$,
for BEng $D_1 = 0$, $D_2 = 1$ and $D_3 = 0$,
for BSc $D_1 = 0$, $D_2 = 0$ and $D_3 = 1$, and
for BA $D_1 = 0$, $D_2 = 0$ and $D_3 = 0$.

Apart from the fact that some of the independent variables are dummy variables, this regression model can be estimated with R in the same way as any multiple regression model. Hence, launch RStudio, create a new project and script, name them t11e6, import the data from the t11e6 Excel file and execute the following commands

    attach(t11e6)
    m = lm(MBAGPA ~ UGPA + MBAA + D1 + D2 + D3)
    summary(m)

to obtain

[2] In R we cannot use subscripts, so we are going to denote these dummy variables as D1, D2 and D3.
The adjusted coefficient of determination suggests that, taking the sample size and the number of independent variables into consideration, this model can account for about 45% of the total sample variation of the MBA grade point average. This means that the model does not fit the data particularly well. Yet, the overall F-test rejects the null hypothesis that all slope parameters are zero (the p-value is practically zero), so the model is significant overall.

The slope estimates of UGPA and MBAA are positive. This is acceptable since they imply that a student's expected MBA grade point average is an increasing function of her/his undergraduate grade point average and MBA admission score. The actual values of these slope estimates mean that, keeping the other independent variables, including the dummy variables, constant,

i) if the undergraduate grade point average increases by one, the MBA grade point average is expected to go up by 0.313, and
ii) for every additional point of the MBA admission score, the MBA grade point average is expected to go up by 0.009.

The slope estimates of the $D_1$, $D_2$, $D_3$ intercept dummy variables are also positive. Recall that the three dummy variables represent BCom, BEng and BSc, respectively, so BA is the base category. Therefore, the slope estimates of the dummy variables indicate that, keeping all other independent variables in the model constant, compared to the MBA grade point average (MBAGPA) of a student with a BA first degree,

iii) the MBAGPA of a student with a BCom first degree is expected to be 0.922 higher,
iv) the MBAGPA of a student with a BEng first degree is expected to be 1.501 higher, and
v) the MBAGPA of a student with a BSc first degree is expected to be 0.620 higher.
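Because the dummy variables enter the model additively, they shift only the intercept and leave the slopes of UGPA and MBAA unchanged. Writing the population regression function out for each degree type makes the base-category interpretation explicit. This is a sketch in the notation of the model in part (a), assuming the error term has zero mean:

```latex
\begin{aligned}
\text{BA } (D_1 = D_2 = D_3 = 0):\quad & E(MBAGPA) = \beta_0 + \beta_1 UGPA + \beta_2 MBAA \\
\text{BCom } (D_1 = 1):\quad & E(MBAGPA) = (\beta_0 + \beta_3) + \beta_1 UGPA + \beta_2 MBAA \\
\text{BEng } (D_2 = 1):\quad & E(MBAGPA) = (\beta_0 + \beta_4) + \beta_1 UGPA + \beta_2 MBAA \\
\text{BSc } (D_3 = 1):\quad & E(MBAGPA) = (\beta_0 + \beta_5) + \beta_1 UGPA + \beta_2 MBAA
\end{aligned}
```

Each dummy coefficient is therefore the difference between the expected MBAGPA of that degree group and that of BA graduates with the same UGPA and MBAA.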
b) Test to determine whether individually each of the independent variables is linearly related to MBAGPA.

The question implies the following hypotheses:

$H_0: \beta_i = 0$, $H_A: \beta_i \neq 0$ ($i = 1, \ldots, 5$)

The p-values (Pr(>|t|)) of the first four slope coefficients are practically zero, and the fifth one is about 0.021. Therefore, at the 2.5% level each independent variable has a significant linear relationship with MBAGPA.

c) Is every slope estimate significantly positive?

The question implies the following hypotheses:

$H_0: \beta_i = 0$, $H_A: \beta_i > 0$ ($i = 1, \ldots, 5$)

Since every slope estimate is positive and the p-value for a one-tail t-test is half of the reported Pr(>|t|) value, we can reject every null hypothesis at the 1.1% or higher level and conclude that each slope is significantly positive.

d) Can we conclude that, on average, a BCom graduate performs better than a BA graduate?

Given the three dummy variables, BA is the base category. If BCom graduates tend to perform better than BA graduates, then the coefficient of the BCom dummy variable (i.e. $D_1$) should be significantly positive. As we saw in part (c), it is significantly positive, so we can conclude that on average BCom graduates outperform BA graduates.

e) Predict the MBAGPA of a BEng graduate with 3.0 undergraduate GPA and 700 MBAA score, first manually and then with R.

For a BEng graduate $D_1 = 0$, $D_2 = 1$, $D_3 = 0$, and given an undergraduate GPA score of 3.0 and an MBAA score of 700, the predicted MBAGPA mark is

$\hat{y} = -0.437 + 0.313 \times 3 + 0.009 \times 700 + 0.922 \times 0 + 1.501 \times 1 + 0.620 \times 0 = 8.303$

To double-check this prediction, execute the following R commands:

    newdata1 = data.frame(UGPA = 3, MBAA = 700, D1 = 0, D2 = 1, D3 = 0)
    predict(m, newdata1, interval = "prediction")
    predict(m, newdata1, interval = "confidence")

You get the following printouts:
As you can see, the point prediction we calculated manually and the one reported by R (i.e. fit) are slightly different. This is because, unlike R, we used only 3 decimals in the calculation.

R also reports the 95% prediction and confidence intervals. The first implies that, with 95% confidence, the MBAGPA of a BEng graduate with 3.0 undergraduate GPA and 700 MBAA score is between 6.541 and 10.543, while the second implies that, with 95% confidence, the average MBAGPA of all BEng graduates with 3.0 undergraduate GPA and 700 MBAA score is between 7.500 and 9.584.

f) Repeat part (e) for a BA graduate with the same undergraduate GPA and MBAA score.

For a BA graduate $D_1 = 0$, $D_2 = 0$, $D_3 = 0$, so with the same undergraduate GPA and MBAA scores as in part (e), the predicted MBAGPA mark is

$\hat{y} = -0.437 + 0.313 \times 3 + 0.009 \times 700 + 0.922 \times 0 + 1.501 \times 0 + 0.620 \times 0 = 6.802$

Execute the following R commands:

    newdata2 = data.frame(UGPA = 3, MBAA = 700, D1 = 0, D2 = 0, D3 = 0)
    predict(m, newdata2, interval = "prediction")
    predict(m, newdata2, interval = "confidence")

You get the following printouts:

Hence, with 95% confidence, the MBAGPA of a BA graduate with 3.0 undergraduate GPA and 700 MBAA score is between 5.083 and 8.999, and the average MBAGPA of all BA graduates with 3.0 undergraduate GPA and 700 MBAA score is between 6.084 and 7.998.
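The manual calculations in parts (e) and (f) can be retraced in R without the data set, using only the rounded coefficient estimates quoted in the text. This is a minimal sketch: the object names are hypothetical, and the intercept is taken to be -0.437, an assumption based on the fact that only a negative intercept reproduces the stated totals of 8.303 and 6.802.

```r
# Rounded estimates in the order: intercept, UGPA, MBAA, D1, D2, D3.
# The negative sign of the intercept is an assumption (see the note above).
b <- c(-0.437, 0.313, 0.009, 0.922, 1.501, 0.620)

# Regressor values, with a leading 1 for the intercept
x_beng <- c(1, 3, 700, 0, 1, 0)  # BEng graduate: D2 = 1
x_ba   <- c(1, 3, 700, 0, 0, 0)  # BA graduate: base category

# Point predictions as the sum of coefficient-by-value products
yhat_beng <- sum(b * x_beng)
yhat_ba   <- sum(b * x_ba)
print(round(c(BEng = yhat_beng, BA = yhat_ba), 3))  # 8.303 and 6.802
```

The difference between the two predictions equals the BEng dummy coefficient, 1.501, in line with the interpretation of the dummy coefficients given earlier.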