Question 1 (20 marks)
The United States Federal Trade Commission annually rates varieties of domestic cigarettes according to their tar, nicotine, and carbon monoxide content. The US Surgeon General considers each of these substances hazardous to a smoker’s health. Past studies have shown that increases in the tar and nicotine content of a cigarette are accompanied by an increase in the carbon monoxide emitted from the cigarette smoke.
The file cigarettes.csv, available on Wattle, contains a summary of the data collected in one year, for 25 brands of cigarettes and includes the variables: Brand name (Brand); Tar content in milligrams (Tar); Nicotine content in milligrams (Nicotine); Weight in grams (Weight); and Carbon monoxide content in milligrams (CO). In this assignment, we would like to use all of the available variables to build a multiple regression model to examine the factors that affect the amount of carbon monoxide emitted by these cigarettes.
(a) Produce a pair-wise scatterplot matrix for the cigarettes data. Comment on the relationships shown in this scatterplot, assuming that CO is going to be the response variable for your multiple regression analysis. What type of variable is Brand name (Brand)? Can Brand be sensibly included as an explanatory variable in your multiple regression model? (3 marks)
(b) Using CO as the response variable and Tar, Nicotine and Weight as explanatory variables, fit a multiple regression model. Experiment with the order in which you fit the explanatory variables in the model and examine the ANOVA table for each model. Why do the models change so dramatically depending on the order in which you include the explanatory variables? Present an appropriate ANOVA table and conduct a nested F test to determine if Nicotine and Weight are a significant addition to a model that already includes Tar? (3 marks)
(c) Find a multiple regression model with CO as the response variable, which includes all three of Tar, Nicotine and Weight as predictors, but in an order where all three have significant sequential F-tests in the ANOVA table. Call this model A. Present the ANOVA table and the summary output for model A. Interpret the coefficients of the explanatory variables in model A and the various F-tests and t-tests shown in the summary output for model A. What does model A suggest is happening to the emitted carbon monoxide, as weight and nicotine increase? Does this make sense? (4 marks)
(d) For model A, present a plot of the internally Studentised residuals against the fitted values, a normal quantile plot and a bar plot of Cooks’ distances. Is there a problem with potential outliers with this model? What other diagnostics can you produce to investigate these potential outliers? (Note this is not an invitation to produce a large amount of R output – choose just one or two additional relevant plots or summaries and discuss any output that you do produce). (3 marks)
(e) Now apply a natural log transformation to all of the variables included in model A and re-fit model A to these log transformed variables. Present the same plots and other diagnostics you produced for model A in part (d), for this new transformed model. Discuss the differences between this new output and the output in part (d). Has the transformation solved the problems with potential outliers? (4 marks)
(f) Finally, without presenting a lot of additional R output, discuss which of the possible models (and including the model we fitted and discussed in Assignment 1) would you recommend for these data and why? What was probably the underlying research question when these data were collected (i.e. what do you think the researchers were really interested in)? Does your chosen model really address this underlying question? (3 marks)
Question 2 (20 marks)
The dataset stroke contains data from a pilot study of twenty patients selected from two large public hospitals in Brisbane. All twenty patients had recently suffered a cerebrovascular accident resulting in hemiplegia lasting at least 24 hours, had not previously been incapacitated from stroke or other disease and were currently receiving occupational therapy. The pilot study collected a number of variables which could be used as evaluation tools for assessing the recovery of patients who had recently suffered a stroke.
Following on from Assignment 1, the client is now interested in building a series of multiple regression models to assess recovery from stroke with Barthel as the response variable. The client believes that a patient’s age (variable Age, measured in years), sex and the side of the brain affected by the stroke are all important factors that potentially affect recovery. These variables must be included in any multiple regression model that you fit, to control for their possible effects and so the client can assess these effects.
The client is also interested in the areas measured by the Bobath Assessment Form, but (following your report on Assignment 1) would prefer to use the related Goteburg Assessment Form, which is divided into seven components (variables Arms, Legs, Hands, Balance, Sensation, JointPain and JointMotion; which are further described in the file stoke.pdf).
(a) Create two new indicator variables, Female (which equals 1 if Sex = “F” and 0 otherwise) and Right (which equals 1 if Side = “R” and 0 otherwise) and fit the multiple regression model of Barthel on Age, Female and Right. Do the residual plots for this model show any obvious problems? Check the plots and answer the question, but only present one of the plots if you want to argue that there is a problem. Present a bar plot of the leverage values for this model. Which observation stands out as having relatively high leverage? What is different about this observation? (3 marks)
(b) Present the ANOVA table and the summary output for the model in part (a). Interpret the various F-tests and t-tests shown in this output (you do not need to present formal hypothesis tests). What do the signs of the estimated regression coefficients suggest about the effects of age, sex and location of the stroke as predictors of Barthel as a measure of recovery from stroke? (3 marks)
(c) We could potentially include any of the variables in the stroke data as explanatory variables in a multiple regression model, but what is the obvious problem with including the variable Subject? Given the results of Assignment 1, is there a problem with including Kenny as a possible predictor for Barthel? If we are going to fit a multiple regression that already includes some demographic variables (Age and Female), some classification variables (Right) and some of the seven components of the Goteburg Assessment as predictors, is there a problem with also including Bobath as a predictor? (3 marks)
(d) The variable Lapse (the time since the occurrence of the stroke in weeks) might be another factor that affects recovery. Present an added variable plot for Lapse as a possible addition to the model in part (a). Also present the added variable plot for log(Lapse). Use these plots to comment on the inclusion of Lapse as a possible predictor. (3 marks)
Do you want your assignment written by the best essay experts? Then look no further. Our teams of experienced writers are on standby to deliver to you a quality written paper as per your specified instructions. Order now, and enjoy an amazing discount!!