MAS8600/MAS8505 Project: Learning Analytics
James Bentham
Autumn 2025
Overview
This coursework is worth:
• 33% of the overall mark for the 30-credit MAS8600 Graduate Foundations of Statistics and Data Science
module.
• 100% of the overall mark for the 10-credit MAS8505 Graduate Foundations of Statistics and Data
Science (Applications) module.
The analysis report is worth 50% of the mark for the assessment, and the ProjectTemplate Directory is worth
50%.
You should submit your coursework to Canvas by 4pm on Friday 16th January 2026.
If you have any questions regarding the coursework, please ask during a practical session, during office hours,
or by emailing james.bentham@ncl.ac.uk.
Context
Learning Analytics, a rapidly growing application area in Data Science, is defined as “the measurement,
collection, analysis and reporting of data about learners and their contexts, for purposes of understanding
and optimising learning and the environment in which it occurs.”
Existing mechanisms to record student engagement such as attendance monitoring fail to capture the extent
and quality of engagement outside the classroom environment. Further complementary sources of data are
routinely collected about our learners, e.g., use of on-campus facilities, Virtual Learning Environment (VLE)
and ReCap access, as well as student wellbeing referrals. However, this information currently resides in a
number of silos.
Learning Analytics seeks to aggregate these sources of data to derive shared insights, and provide effective
measures of engagement. Insights may inform learning design, inform intervention processes for at-risk
students, and improve student attainment.
Task Description
We have data from 7 runs of a massive open online course (MOOC) entitled “Cyber Security: Safety at Home,
Online, and in Life” developed by Newcastle University and made available to the public by the online skills
provider FutureLearn.
We have raw data collected by FutureLearn on learners as they progressed through the course, along with
some characteristic information collected from their profiles. Learner IDs allow information on different data
sheets to be combined.
To complete the assessment, you will make use of all the tools we have seen in the module:
• ProjectTemplate: to organise your project structure and automate processes of data and package
loading, data cleaning, etc.
• dplyr: to clean and preprocess data programmatically to make them ready for analysis.
• ggplot: to produce exploratory data visualisations in order to achieve the desired data insights.
• R Markdown: to produce an analysis report describing the investigation and its findings.
• Git: to provide version control support for the project, enabling us to see the history of the project in
development, and to return the project to a prior version if necessary.
• renv: to aid project reproducibility by fixing the R packages (and versions) used in the analysis.
The task for this project is to investigate aspects of the FutureLearn data that you feel would be of interest to
a provider of the course. There are no restrictions on what you can investigate, and you will not be judged on
the “success” of your analysis (i.e., it does not matter whether your analysis actually produces the interesting
thing you were looking for). Assessment on this project is based on the approach to the analysis that you
take, and how well you observe best practice in your approach.
• The focus of this assignment is on the use of the tools and techniques from the module to support
efficient and reproducible data analysis, rather than the data analysis itself. Feedback on this assessment
will be based on the approach to the analysis that you take, and how well you observe best practice in
your approach. You should choose an area to investigate that you think would be interesting to the
business, but that is achievable within the timeframe of this project.
• We are primarily assessing how well you follow the best practice principles outlined in the module:
using CRISP-DM to structure the investigation, using ProjectTemplate to structure the analysis and
ensure that it is reproducible, using dplyr and ggplot to carry out the analysis, and R Markdown to
produce the analysis report.
• The data files you are given are the raw data files from FutureLearn, and it is entirely possible that you
will encounter issues around data quality during your analysis. How you choose to account for this is
entirely up to you; we want to see that you’ve checked for data quality issues and acknowledged any
that are there.
Submission Requirements
For the Analysis Report submission, you should submit a maximum 15-page report (made using R
Markdown) describing your analysis, with reference to the phases of CRISP-DM. Your report should describe
two consecutive cycles of CRISP-DM.
For the ProjectTemplate Directory submission, you should submit a single zip file containing:
• Your complete ProjectTemplate directory. This is the directory that is created when you run
create.project at the start of your work, and will contain your code and output from your research
into the FutureLearn data. Typically, the main subdirectories used in this project will be:
– Data: containing the FutureLearn data files.
– Munge: containing the R scripts which handle data preprocessing.
– Cache: containing the cached preprocessed data.
– Config: containing the settings for ProjectTemplate running the analysis.
– Report: containing the .Rmd file used to create your analysis report.
• renv lockfile: records the project-specific R packages (and versions) captured by renv.
• README file: a plaintext file (called README) which outlines the steps needed to run your analysis
(any extra packages required, etc.), together with the location of the deliverables in your submitted
directory.
• Git log: a plaintext file containing the commit history of your local Git repository for the project.
To produce your Git log, you can run git log > GitLog.txt in the terminal. The GitLog.txt file should
be included within your ProjectTemplate directory.
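As a minimal sketch of the Git log step, the commands below create a throwaway repository, make one commit, and write the commit history to GitLog.txt. The temporary directory, the user name and email, and the README contents are all hypothetical placeholders. In your own project you would only run the final git log command from inside your existing ProjectTemplate directory.

```shell
# Sketch only: a throwaway repository stands in for your ProjectTemplate directory.
set -eu

# Hypothetical scratch location; in practice you would already be in your project.
demo_dir="$(mktemp -d)"
cd "$demo_dir"

git init -q

# Placeholder identity, set for this repository only, so the commit
# succeeds on a machine with no global Git configuration.
git config user.name "Example Student"
git config user.email "student@example.com"

# One placeholder commit so that the history is not empty.
echo "Steps needed to run the analysis" > README
git add README
git commit -q -m "Add README"

# The command from the brief: redirect the commit history into GitLog.txt.
git log > GitLog.txt
```

Because the redirect overwrites GitLog.txt each time, running the command again just before you create the zip file keeps the log in step with your final commit.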
Learning Outcome Alignment
• You will be able to identify and apply best practices in data handling, exploratory data analysis, data
visualisation and reproducibility, ensuring robust and reliable research.
• You will apply the scientific method in framing research questions and interpreting results.
• You will be aware of the software and data lifecycles.
• You will use advanced techniques for statistical analysis, and will apply best practices in programming.
• You will know how to ensure that analyses can be repeated, modified and shared in a transparent and
collaborative manner.
AI Policy
The assessment requires careful consideration of a specific dataset, which you should carry out yourself rather
than using AI. You should only use AI in a very limited way, such as for debugging of code or spell-checking
of your report.
Marking Rubric
Detailed mark schemes are provided on Canvas, comprising marks for the code and report.
Feedback Expectations
You will receive feedback on the individual items in the mark schemes, as well as overall comments on your
work.