pset1-description.pdf

Page 1 of 10
RMHI/ARMP Problem Set 1 2026

Word count total: 1200

Hello everyone! This is the description for the assignment, which is due on Canvas on Monday April 13, 2026 by 11:59pm Melbourne time. You’ll need to submit a Word-knitted version of the completed R Markdown file found in this zip file, according to the following instructions:

1. Rename the document called pset1.Rmd as studentID-pset1.Rmd. (Replace studentID with your student ID number). This is your R Markdown file, where you’ll be putting all your code and answers.

2. Replace “Your ID goes here” in the header of the R Markdown file with your student ID. (Keep the quotes or it won’t knit properly.) Do not additionally include your name in the header of the R Markdown file or the filename as we will be marking papers anonymously.

3. While we encourage collaboration in tutorials and learning in general, you should not be collaborating with anybody AT ALL for this assignment. That means no sharing code privately or publicly; even talking in the abstract about problems will effectively be collusion. You should be completing it independently, with no help from any other person in any capacity. Of course, as always, you are free to use any of the resources from the class to help you, and you're also free to google or look anything up that you like (as long as you aren't asking anybody, including discussion boards or AIs, questions related to this assignment). Note that we do look at places like Chegg and will follow up if anything from this problem set is posted there.

4. Plagiarism check is enabled and you can check the similarity report on your submission. In previous years we have found people who tried to cheat, so please don’t risk it! That said, understand that we will not be naively looking at the overall % figure: with this sort of assignment a certain amount of overlap is inevitable, so don’t worry if you get what looks like a high % score as long as you know you didn’t plagiarise or collude. With this sort of assessment, that % overlap is higher than for essays and lab reports. We will be using the plagiarism check for the parts of the assignment where we'd expect some variability, and to give a general sense of the overall gestalt.

5. Complete all of the problems below in the R Markdown document. Do not remove any of the arguments to the code chunks, like the names of the code chunks or if it says message=FALSE or whatever. If a problem asks you to display a tibble or variable so it shows up in the knitted version, make sure that you do, as the marker cannot evaluate it without seeing it, and if they can't see it then they won’t be able to award you points for it! Remember that to display a tibble (or any variable) you just type its name on a line of its own within the R chunk, or use head().

6. We've structured this so that, as much as possible, questions do not build on each other. That means that if, say, you can't get Q5a then you can still get Q5b or Q6. Try to do all of them.

7. Go for partial credit! Most of these questions have some form of partial credit possible. What that means is that if it is asking for some R code, break down the problem into pieces. Even if you can only do some of the pieces, or do them part of the way, that will be worth something. [Note that there is no question-by-question rubric available because designing one would mean giving away the answers. In general we will give full credit for responses that correctly address all of the parts of the question.] Short answer questions (SAQs) can also be given partial credit and are generally asking for some thoughtful interpretation. If it is based on a previous graph or test you've done, if you did the first part wrong but discuss it well, you can still get most or all points for the SAQ part.

   If your code does not run but you want to include it for possible partial credit, just comment it out (using the # sign) or type eval=FALSE in the R chunk so that it shows up in the knitted document but R does not try to run it. If you include a lot of commented-out code and some is correct and some isn’t, we will not give you credit for the commented-out code; put the thing in there that you think is the closest to the correct answer, don’t just include everything you tried.
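The mechanics mentioned in items 5 and 7 (displaying an object, and keeping non-running code visible) can be sketched roughly as follows. This is only an illustration: the data frame `demo` and its values are made up and are not part of the assignment data, a base-R data frame stands in for a tibble, and the chunk name `someChunkName` is hypothetical.

```r
# Hypothetical stand-in data; NOT from the problem set.
demo <- data.frame(id = 1:8, score = c(3, 5, 2, 8, 7, 1, 4, 6))

# Typing the name on a line of its own displays the object when knitted.
demo

# head() displays only the first six rows by default.
head(demo)

# Code that does not run can be kept visible by commenting it out with #:
# demo_broken <- demo[demo$score > , ]

# Alternatively, add eval=FALSE to the existing chunk header (keeping its
# name and other arguments), e.g. {r someChunkName, eval=FALSE}, so the
# code is shown in the knitted document but never executed.
```

Either approach keeps the attempt visible to the marker without stopping the document from knitting.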

8. We are not overly worried about what decimal place you round answers to, and you will not lose credit for this unless you round so much that your answer is impossible to discern (e.g., don’t round p-values to the nearest integer!), or unless the question specifically asks for a particular place value. Similarly, you will not lose points for trivial presentation things like the presence or absence of italics. That said, for those who want a guideline, we suggest that you round non-integer numbers to two decimal places.

9. Unless the question specifies otherwise you must only use R content that was taught in this subject. “Taught in this subject” refers to anything that was covered in any lecture videos (i.e., included on the slides), the lecture exercises, or tutorial exercises; if it came up incidentally in a Q&A or was linked to as optional extra information that does not count. Similarly, if your tutor incidentally mentioned a package or function that was not part of the tutorial exercise, that also does not count. While you can google around in order to get ideas or figure things out, it is up to you to double-check that the content was also presented in this subject (and if it wasn't, you will need to figure out how to do it with content from this subject). We do this partly to make it harder to use ChatGPT or simply cut and paste from the internet without understanding. If you want to use something that was in this subject but was obscure enough that you're worried your tutor will have missed it, then just indicate where you got it from in a comment.

10. Some questions specify a word count. In that case you need to either calculate it from the knitted document or type up your answer in Word¹ and then cut and paste it into the R Markdown file. (Please put your answer in between the word ANSWER and [Word count: N]; needless to say, those two bits do not count towards your word count.) We know that's annoying; sorry. Anything else we thought of, like specifying a number of sentences or having no limit, was worse in terms of equity across students. The word counts we've specified in each question are designed to give you a guideline about the maximum number of words you should need to answer completely and correctly. So don’t feel like you must use all of the words; if you can answer it fully with fewer, that’s fine. In fact, the total word count for the solution set I wrote up is around 830, so it’s possible to fully answer the questions while going well under the word limit. That said, it is okay to go over the word limit for individual questions as long as the total word count for all of the questions combined is fewer than 1320 words (i.e., fewer than 1200+10%, with the standard penalty if it is 1200+10% or over. See the student manual for details on word count penalties).

11. There is no word count for code chunks or SAQs that have no [Word Count: N] attached. Word count only applies to the short answer questions as indicated. Remember to report your total word count for the assignment as a whole at the top of the document and replace the N with the word count on each SAQ. Your total word count is the sum of the word counts for all the SAQs.

12. You'll be turning in the knitted output of your R Markdown file. We prefer that you knit to Word but if you can't get Word to knit then html is okay. In the worst case, you can turn in the completed Rmd file. We highly, highly recommend that you knit as you go: (a) knitting can identify problems in your code that you would have otherwise missed; and (b) you do not want to get close to the deadline and think you’re done only to find that you’re having trouble knitting. Save yourself the panic and knit often.

13. Similarly, you can turn in the assignment multiple times before the deadline, so we strongly encourage you to turn it in even before it’s perfectly polished. We will automatically mark the latest submitted assignment. Submitting often will save you last-minute panic or computer issues. Also, take a screenshot for proof of having turned it in just in case you need it. If you submit a corrupted file or the wrong assignment, that is not grounds for waiving any late penalties; it is your responsibility to make sure that the submission is correct. If you run into last-minute issues and can’t even succeed in uploading an Rmd, email us your assignment as soon as possible to demonstrate that it was done at that time (rmhi-armp@unimelb.edu.au). We cannot make promises about whether you will receive any late penalties if you do this, but if you don’t, you almost certainly will get penalised because we have no way to know if the problems were genuine.

¹ Different software calculates word count in slightly different ways, so we are using Word as the standard.
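The word-count rule in items 10 and 11 comes down to simple arithmetic: the total is the sum of the per-SAQ counts, and it must stay below 1200 plus 10%. A minimal sketch, assuming made-up per-question counts (the values in `saq_counts` are hypothetical placeholders, not recommendations):

```r
# Hypothetical word counts for a few SAQs; replace with your own
# [Word count: N] values. One entry deliberately exceeds its
# suggested per-question count, which is allowed.
saq_counts <- c(Q1b = 35, Q1d = 40, Q3c = 110)

limit  <- 1200
cutoff <- limit + limit / 10   # 1320: totals at or above this are penalised

# The total is the sum of the word counts over all SAQs.
total_words  <- sum(saq_counts)
within_limit <- total_words < cutoff

total_words
within_limit
```

Only the combined total is compared with the cutoff; individual questions may run over their suggested counts.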

Bunzobra Fest!

Our friends in Bunnyland are starting to get upset and angry at each other, but luckily it is time for the annual Bunzobra Fest, a delightful day of carnival fun and prizes which culminates in a ritual burning of a giant bunny effigy named Bunzobra. Every year they must first construct Bunzobra out of wood and paper, aiming to create an enormous masterpiece that, when set afire, becomes the backdrop to a night of revelry and fun.

Naturally, they need a lot of workers to contribute to the building of Bunzobra. Every year they track who works on this project, what their job is, how much they contribute, and how much they get paid. Some people worked only one year; in that case, most of the values of the variables for the other year are indicated with NA. This data can be found in the tibble d, which has been loaded for you in the R Markdown document. Each row is a person, and Table 1 below describes the columns.

For your convenience, the Markdown creates tibbles dc (which just contains data from the current year) and dp (which contains the data from the previous year). There are also a few other tibbles which you can ignore for now because they are relevant to later questions.

Q1 [14% of total mark]

(a) Use the table() function to show how many people are doing each job in the current year and assign the result to a variable called myTable. Make sure the table shows up in the knitted Markdown.

(b) Use the unique() and length() functions to determine how many distinct individuals are in the dataset in the current year. [Note: we have not taught at least one of these to you. Part of the purpose of this question is for you to figure out what they do and how to use them to answer this question.] Compare this to the total number of people in myTable. Why aren’t these the same? [Suggested word count: 40]

(c) The hrBin variable is calculated based on the quantile of hours spent per day in that year. Use R to calculate what each of the quantile thresholds was in each year (i.e., the actual number of hours per day separating the lowest from the second, and so forth). Enter the results in the XXX spaces in the Markdown.

(d) Based only on the quantile thresholds, which year would you say people worked more hours per day on average? What about the quantiles suggested that to you? [Suggested word count: 40]

(e) Using only content from Week 1 and Week 2 of this subject, extract the names of the people who make more than $10 per hour from d. You do not need to assign this to a variable; just make sure that it shows up in the knitted document. You do not need to report the names in the Markdown. Why do some names appear twice and others only appear once? Why are there two NA values? [Suggested word count: 60]

(f) Extract the same names as in (e) but this time ensure that it does not include the NA values in the list. You may use any content taught in this subject, regardless of what week it was taught in.

Q2 [9% of total mark]

A tibble dd has been loaded for you in the R Markdown. It is just like d but contains four additional variables, as defined in Table 2 below.

(a) Using function(s) that you were taught in Week 3, create a tibble called ddMine that contains the first three variables in Table 2 above, just like in dd. In other words, you will need to add totalAdded, totalPaid, and valuePerHour to your copy of d and save the result in a new tibble called ddMine. Make it so the top rows of ddMine and only the five columns name, year, totalAdded, totalPaid, and valuePerHour show up in the knitted Markdown, in alphabetical order by name.

(b) Add the valueBin variable to ddMine using the function case_when() in combination with function(s) you already know. We have not taught you this function so you will need to use your investigative skills to look it up and play around with it until you have figured it out. Make it so the top rows of ddMine and only the three columns name, year, and valueBin show up in the knitted Markdown, in descending alphabetical order by name.

(c) Using function(s) that you were taught in Week 3 in this subject, create a tibble called ddSum which contains two rows, one for each year. It should have five columns: fullTotalAdded (the complete amount added in that year, calculated by adding up all of the totalAdded values for each person); fullWagesPaid (the total amount spent paying everybody that year); mnHrsWorked (the average total number of hours each person worked that year); and mnAge (the mean age of the workers that year). Make it so ddSum shows up in the knitted Markdown.

The tibble ddSumAndy, which has been loaded for you, shows you what your tibble should look like (the order of your rows may be different or the values may round in different ways when knitted or displayed and that is fine if so, but the contents of each row should be the same).

Q3 [18% of total mark]

(a) Using function(s) that you were taught in Week 3 in this subject, create a tibble called dcSum based on dc (i.e., only the current year) which contains summary statistics on four rows, one for each hrBin. It should have five columns: hrBin, mnAdded (the mean amount added per day over all of the people who were in that quantile of hours per day), sdAdded (the standard deviation of the same calculation), nAdded (the number of people in that calculation), and sderrAdded (the standard error). The tibble dcSumAndy, which has been loaded for you, shows you what your tibble should look like (the order of your rows may be different and that is fine if so, but the contents of each row should be the same).

(b) Create the bar plot shown below using dcSum (or dcSumAndy if you couldn’t create dcSum) in combination with whatever tibble(s) you think appropriate. For full credit, your figure should have all the same components as this one (i.e., semi-transparent bars, dots, error bars corresponding to one standard error, title, subtitle, etc.). Your R code should only include material taught in this subject, with the exception that you will need to figure out how to make it so that the jitter of the dots is less than the width of each bar along the x-axis (i.e., as per the bar plot below) and there is no jitter at all along the y-axis.

Note: It is okay if your individual data points are not in exactly the same positions as this figure along the x-axis, since the geom may introduce randomness. It’s also fine if your colours aren’t exactly the same (you aren’t expected to guess what palette was used) as long as you use a sensible palette and theme, and the colours of your dots match your bars and vary for each panel. [That said, the background colour is “ivory” and the title/lettering is “grey25”]. Also, if your knitted figure has a slightly different aspect ratio that too is fine, as long as all of the elements are present and correct; different systems knit figures in slightly different ways.

(c) How would you interpret this figure? It reveals something extremely counterintuitive; what is that thing and why is it counterintuitive? This is not an R question but rather a thought question asking you to critically think about the overall pattern(s) in the data. You should not make claims about significance but you should link the operationalisation of our measure(s) to the underlying theoretical idea(s) they capture. [Suggested word count: 100]

(d) The figure below shows the relationship between hours per day worked and mean amount added per day in a different way. The x axis shows hrsPerDay and there are distinct panels for each different job title. This provides a partial explanation of the counterintuitive thing from part (c). How? In your answer, discuss both what is now explained and what is still counterintuitive. For each, be sure to refer to both the pattern in (c) and the relevant parts of the figure below, and explain the logic of how they are connected. [Suggested word count: 125]

Q4 [12% of total mark]

(a) Make a figure of your own exploring more about the relationship in Q3 between hrsPerDay and addedPerDay, this time focusing on the laborers only. This figure should illustrate something about how or whether this relationship changes based on at least one other factor: this can be any variable in dc, or any variable that you can calculate based on the existing variables. Your goal is to illustrate something new about the data that we have not seen so far, and to improve our understanding of why we see the patterns we do in Q3. It is therefore worth thinking about what kinds of research questions would be interesting to look at and exploring different possibilities.

Your figure can use any geom you like but it should not be a bar graph with error bars like in Q3b. It should also incorporate at least one element that you haven’t been taught in this subject; this can be anything from a new calculation, a new geom, a different palette package, a new argument for a known geom, changing the size or style of your fonts, putting text inside the figure, changing aesthetic properties, etc.; you can do whatever you want as long as it’s new and makes sense. (You can include more than one new thing if you want). The figure should have an informative title and axis labels, and a theme and colour palette other than the default or the one you used in Q3b. The aesthetic choices should add to its clarity rather than detract from it; part of what you are being marked on is if the figure illustrates the data in a clear and useful way.

(b) Explain what the new element does and how you made it. (If you have more than one new element, pick whichever one of them you want). Your explanation doesn't need to be extensive – for instance, if you hadn’t already been taught show.legend you might say “I got rid of the legend by adding show.legend=FALSE as an argument to the geom”. [Suggested word count: 40]

(c) Explain what your figure suggests about the data. In your explanation be sure to describe the variables on each axis (and panel, if you have multiple panels) as well as what the pattern is and what it suggests about what is going on. You will be evaluated primarily on how clear and appropriate your explanation is given the figure, and secondarily on whether you have identified something new that adds insight to our understanding of the relationship between hrsPerDay and addedPerDay that was not evident before. It is okay to speculate about what might be going on as long as you clearly indicate that it is speculation, and it is grounded in the figure and logic. [Suggested word count: 125]

Q5 [24% of total mark]

One thing we care about is whether there have been any interesting changes from the previous year to the current one. There are many ways to look at this but a common and useful one is to calculate a Change variable, which is calculated by subtracting the value of some variable in the previous year from the value in the current year (thus, a positive value means that it is higher in the current year). For instance, if you wanted to calculate ageChange then you would take your current age (say, 20) and subtract your previous age (say, 19) from it, so ageChange is 1. Of course, age always increases by one in a year, but other variables might be much more interesting!

(a) Using function(s) that you were taught in Week 3 in this subject, create a tibble called dv that contains 43 rows (one corresponding to each person) and four columns. The first column should be their name, the second (currValPerHr) should be their valuePerHour in the current year, the third (prevValPerHr) should be their valuePerHour in the previous year, and the final should be a valPerHrChange value, calculated in exactly the same way as described above (i.e., subtracting the previous valuePerHour from the current one). Make sure dv shows up in your knitted document. The tibble dvAndy, which has been loaded for you, shows you what this should look like.

Note: this involves multiple steps and multiple functions from Week 3 plus one new tidyverse one, called rename(), which you will need to figure out yourself. We suggest you break down the problem into pieces and make sure each step works before going on to the next. You will receive partial credit if you accomplish some but not all of the required steps.

The tibble dCh, which has been loaded for you, contains multiple Change variables calculated using some of our main variables in d. They are described in Table 3 below.
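The arithmetic behind a Change variable, as described in the Q5 introduction, can be sketched with the ageChange example from the text. This only restates that worked example with plain numbers; the variable names `currAge` and `prevAge` are hypothetical, and it is not the Week 3 tibble workflow that part (a) asks for.

```r
# Ages from the worked example in the text: 20 now, 19 the year before.
currAge <- 20
prevAge <- 19

# A Change variable is the current value minus the previous value,
# so a positive result means the value is higher in the current year.
ageChange <- currAge - prevAge
ageChange
```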
In case it is useful, we also provide the tibble dChLong, which is identical to dCh, but in long form (the four change variables are classified as ChangeType, and the values are in ChangeAmount).

(b) Create the figure on the next page showing the change over time for each of the four change variables. For full credit, your figure should have all the same components as this one (i.e., semi-transparent bars, four panels, no labels or ticks on the y axis, vertical dotted lines, different scales on the x axes, etc.). Your R code should only include material taught in this subject, with the exception that you will need to figure out how to make the dotted vertical lines and the lack of ticks and labels on the y axis.

Note: It is okay if your colours aren’t exactly the same (you aren’t expected to guess what palette was used) as long as you use a sensible palette and theme. [You can assume the number of bins used in the histogram was 10]. Also, if your knitted figure has a slightly different aspect ratio that too is fine, as long as all of the elements are present and correct; different systems knit figures in slightly different ways.

(c) For each of the variables, explain what the overall trend is (i.e., whether it is mostly negative or positive and what that means – for instance, for our hypothetical ageChange variable I might say that it is positive and always 1, indicating that everyone is one year older in the current year). This is not an R question but rather a thought question asking you to critically think about how to interpret the figure and what the measures mean. You do not need to speculate about reasons here. [Suggested word count: 60]

(d) Using any of the tibbles we have created in any part of the problem set, make a figure of your own exploring something new (e.g., relationship(s) involving at least one variable that has not previously been looked at, breaking down the data in a different way, etc). Be creative! There are no strong expectations here – your goal is to investigate something new and different. Use at least one geom that you haven’t used before on any of the previous questions.² Unlike Q4a, you are not required to incorporate any new elements, but you are welcome to if you wish. The figure should have an informative title and axis labels, a theme and colour palette other than the default, and colours for the lettering and figure background that are different from the default. The aesthetic choices should add to its clarity rather than detract from it; part of what you are being marked on is if it illustrates the data in a clear and useful way.

(e) Explain what your figure suggests about the data. In your explanation be sure to describe the relevant aspects of the variables and panels as well as what the pattern is and what it suggests about what is going on. Make sure you explain what it shows that is distinct from previous questions. (It is fine for you to find there is no pattern and it suggests that nothing much is happening if that is what you observe!) You won’t be evaluated on how interesting your result is, but on how clear and appropriate your explanation is given the figure, and whether it actually is different from previous questions. That said, it’s worth thinking about what kinds of research questions would be interesting to look at and to play around exploring different possibilities, since well-motivated questions are more likely to yield interesting patterns which are easier to discuss. [Suggested word count: 125]

² Note that geom_jitter and geom_point count as the same, as do geom_col and geom_bar(stat = "identity"); we want you to do something qualitatively different than you did before, so you can demonstrate that you are capable of it.

Q6 [11% of total mark]

(a) Bunny is grumpy even after spending the night at Bunzobra. This is partly because she was scared of the giant bonfire but also because she keeps thinking about statistics. “I still don’t quite understand something,” she says. “Since a p-value of 0.1 means the probability of the null being true given your data is 10%, and the probability of the alternative being true is 90%, we should just set alpha really low. Like if it was 0.0001 then we would know that anytime we get a significant result, the alternative hypothesis is 99.99% likely to be true. I don’t see any drawbacks to this.” There are some distinct problems with Bunny’s idea. Explain two of them to her. For each, be sure to be clear about what the problem is and why it is a problem. [Suggested word count: 140]

(b) The maximum possible number of days a laborer could work was 12. The modal number of days worked in the current year is 10 and in the previous year is 9. Assuming an underlying probability that 80% of the 12-day maximum tend to be worked, what is the probability of observing (i) 9 days worked out of 12; and (ii) 10 days worked out of 12? Given this, which year (current or previous) is the underlying 80% assumption most likely to be true for? (No need to explain why, just fill in the XXX with either the word “current” or “previous”). You should answer this question using the function(s) taught in the subject; you do not need to use the datasets. Report probabilities as percentages, rounded to one decimal place, filling in the spaces provided in the Markdown.

(c) If we were to be able to calculate the probability for either the current or previous year in part (b) for each person in the dataset, should we expect that all of the probabilities (for all of the people in a given year) would sum to one? Why or why not? This is not a question that requires any coding at all. In your answer make reference to what you know about what these probability calculations indicate and how that maps onto the situation being described here. [Suggested word count: 125]

Q7 [10% of total mark]

Gladly thinks that the true underlying distribution of every laborer’s total amount added per year is uniform between 35 and 350. In other words, he thinks that a random laborer is equally likely to add 35 units to Bunzobra one year as they are to add 350 (or 180, or 75, or any value between the two numbers).³ He has helpfully drawn you the picture on the right to illustrate what he thinks it looks like. For the purposes of this question let’s assume that he is correct and this is the true distribution.

(a) Suppose we were able to visit 1000 parallel universes during the current year, each of which employs 40 laborers to work on Bunzobra. Gladly is interested in understanding the mean amount added by each pool of 40 people. Consider the six panels U through Z on the next page. Give the letter of the panel that most accurately captures what you would expect the sampling distribution of this mean to be out of the 1000 universes. Explain your answer, making reference to the definition of sampling distribution. [Suggested word count: 80]

³ Let’s not worry about whether Gladly is correct about this. He probably isn’t (because it’s a very silly assumption), but for the purposes of this question we’re going to assume he is and see what we can figure out given that.

(b) Suppose that we consider the same 1000 universes, each with 40 laborers, but now Gladly is interested in understanding the minimum amount added in each pool. In other words, if he could line up all 40 laborers by the amount they added, what would be the amount added by the person who did the least? That is the minimum for that set of 40 laborers. Consider the same panels U through Z and give the letter of the panel that most accurately captures what you would expect the sampling distribution of the minimum to look like. Hint: begin by thinking about what you would expect the minimum value of a set of 40 laborers in a single universe to be, and extrapolate from there. Is this the same as in (a)? Why or why not? [Suggested word count: 140]

* Note: You do not need to do any calculations or code in this question! And if your intuitions about minimums are incorrect but your explanation of sampling distributions in general (and the sampling distribution of the mean as it applies here) is correct, you can still get most of the partial credit.

Q8 [2% of total mark]

These marks are free as long as you say anything! What is your current theory about why everyone in Bunnyland is going hungry? (No word limit here, say as much or as little as you want)