MATH 11205: Machine Learning in Python 2025-2026
Project Description
We will be using a subset of the data collected by UNICEF, called the Multiple Indicator Cluster
Survey (MICS). A simplified version of the data will be used for this project, provided in the file
unicef malawi.csv, after some initial cleaning steps (described below). This smaller data set
focuses only on data collected in Malawi in the years 2019-2020. The dataset collects a number
of indicators on the well-being of children, including childhood depression. Your goal is to build a
model to predict if children suffer from feelings of depression.
Assignment Goal
For the purpose of the project, consider yourself a Data Science Consultant who has been
hired by UNICEF to analyse childhood depression in low-income countries. The mental health of
the next generation, those aged under 18 years, is a societal priority. Identifying and treating
mental health problems early in life has lifelong impacts on physical health, education, earning potential,
relationships, identity formation and life satisfaction. Further, the burden of poor mental health
falls disproportionately on lower- and middle-income settings (LMICs), and on women and young
people in particular. Mental health is influenced by various factors, at the level of the child, parent,
and societal environment. An integrated model of mental health requires an approach that combines
these multiple factors. This is possible through UNICEF’s Multiple Indicator
Cluster Survey (MICS), which collects information on the child, parent, and household environment.
For further details on the data, please see the main webpage for the Multiple Indicator Cluster
Survey (MICS).
Towards this aim, you have been asked to use this data to build a classification model to predict if
a child suffers from depressive feelings. Depression is a heterogeneous condition, and to the question
“How often does the child seem very sad or depressed?”, survey responses vary from never, a few times
a year, monthly, weekly, and daily. You have been instructed to focus only on discriminating
between no depression and any form of depression, combining all severities of depression
into one category. In addition, the organization is interested in identifying important factors
that can have an impact on depression; this is useful not only to improve their understanding of
the condition and the impact of inter-generational experiences and society, but also to assist in
the development of effective social and health policies and interventions. In summary, you need to
develop a well-tuned and validated classification model for depression as the binary outcome of
interest and interpret or explain the model’s predictions. Your model may use as few or as many
of the provided features as you like, transforming and manipulating these features in any way that you see fit,
and you may extract additional information from the extended datasets to create new features.
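The binarization of the outcome described above can be sketched as follows. This is a minimal illustration only: it assumes the responses are stored as text labels matching the wording of the question (never, a few times a year, monthly, weekly, daily), which may not be how the provided file encodes them. The helper name binarize_depression and the label list are hypothetical; check the questionnaire and the actual values in the data before relying on it.

```python
import pandas as pd

# Assumption: responses are text labels and "never" is the only label
# meaning "no depression". The real file may use numeric codes or other
# spellings; inspect the column's value counts and the questionnaire first.
NO_DEPRESSION_LABELS = ["never"]


def binarize_depression(responses: pd.Series) -> pd.Series:
    """Map survey responses to 0 (no depression) or 1 (any depression).

    Missing responses stay missing so they can be handled explicitly
    rather than being silently counted as either class.
    """
    cleaned = responses.astype("string").str.strip().str.lower()
    target = (~cleaned.isin(NO_DEPRESSION_LABELS)).astype("Int64")
    return target.mask(cleaned.isna())


# Toy responses, not taken from the survey data.
toy = pd.Series(["Never", "a few times a year", "Monthly", "weekly", "DAILY", None])
binary_target = binarize_depression(toy)
print(binary_target)
```

Keeping missing responses as missing, instead of folding them into one of the two classes, is a design choice of this sketch; how to treat them in the project is a modelling decision to justify in the report.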
You should start from an interpretable baseline model of your choice, including as few or as
many of the provided variables as you wish. At this point, we have covered a number of models in lectures and
workshops, and you may explore a variety of different feature engineering and modelling approaches
for this particular task. However, your ultimate goal is to select and deliver a single final model.
Thus, your report should focus on describing and motivating your final model choice, along with
a comparison against the baseline model. It is important that any interpretations and conclusions
you draw from your model are well supported and sound and that you understand the limitations of
the model and the data.
Working as a team
This project may be completed by a team of up to 4 students (minimum of 1 student).
Feel free to create your own team during workshop hours, building on the pairs for the workshops.
Since we are not assigning teams, if you are a team that is looking for more members or someone
looking for a team, please use the pinned post on Piazza to find each other. You are strongly
encouraged to work in teams, as you can learn a lot from discussing together, but you may choose
to work individually if preferred. The marking scheme is the same, regardless of group
size.
After the assignment is completed, we will distribute a brief peer evaluation survey. Completion
of the survey is optional, but if you feel that some members contributed significantly less, this
provides an opportunity for feedback and for such members to potentially have their overall mark
penalized. This will only be done in extreme cases, after discussion with all team members.
Required Structure
A Jupyter notebook template called project.ipynb has been provided. It includes the required
sections along with brief instructions on what should be included in each section. Your completed
assignment must follow this structure: you should not add or remove any of these sections,
but you may add extra subsections to help organize the report. Please remove the
instructions for each section in the final document.
All of your work must be contained in the project.ipynb notebook; we will only mark what
is included in this file (both the write-up and relevant coding). You may work on the notebook in
whichever environment you prefer (Noteable, locally, Colab, Codespaces, ...).
There is an upper limit of 30 pages including all code and output. Your notebook must
include all of your work, but make sure that you are only retaining required components, e.g.
remove unused code and figures (if a figure is not explicitly discussed in the text it should not
be in the final document). Overall, your project will be partially assessed on your organization /
presentation of the document: it should be as polished and streamlined as possible. Try to be
as concise as possible while creating your write-up. We highly recommend that you
check the appearance of your rendered PDF before submitting, as its appearance can
differ significantly from the notebook.
Please submit the final PDF of your project report (generated from a Jupyter notebook) to the Project
assignment on Gradescope. Please ensure that you tag all group members on Gradescope, and
also add all group member names at the beginning of the file.
Getting Help
• Week 11 Workshop: We will focus on answering any project related questions.
• Piazza: This forum will be used as the central location for all course related discussions and
questions, and should be used over emailing course staff directly. The course lecturer will
monitor and respond to questions, but feel free to provide some constructive responses to
peers’ questions.
Generative AI Policy
Please refer to the University’s Generative AI Policy. Generative AI tools may be used for the project,
but always responsibly, as your helper or co-pilot and NOT your driver! For example,
GenAI may be used for generating explanations for error messages and debugging, providing hints
or suggestions to improve code, and enhancing visualizations and the quality of the report. The data
should absolutely NOT be copied and shared directly with GenAI tools, as this is in breach of legal
consent for this data. You must include a statement on how GenAI was used. If the project
is suspected to be heavily written by GenAI, the students may be required to give an oral
presentation.
Dataset Details
The data provided in unicef malawi.csv merges together a subset of variables from the
questionnaires on the child, mother, and household.
Child-level variables: from the original data (provided in ExtendedDataSources/fs.sav), a
subset of variables have been extracted:
• Child’s background:
– age, education, and health insurance coverage information are provided in CB3, CB4,
CB5A, CB5B, CB7, CB11, with detailed questions and response options in pg. 2 of
Questionnaires/MICS6 Questionnaire for Children Age 5 17.docx
– region of residence, gender, ethnicity, and combined wealth score are provided in HH6,
HH7, HL4, ethnicity, wscore.
• Child labour:
– child labour, hours of labour, household work, and hours of household work are provided
in CL2, CL3, CL12, CL13, with detailed questions and response options in pg. 4-6 of
Questionnaires/MICS6 Questionnaire for Children Age 5 17.docx
• Child discipline:
– discipline methods and physical punishment are provided in FCD2A, FCD2B, FCD2C,
FCD2D, FCD2E, FCD2F, FCD2G, FCD2H, FCD2I, FCD2J, FCD2K, FCD5, with detailed
questions and response options in pg. 7 of
Questionnaires/MICS6 Questionnaire for Children Age 5 17.docx
• Child functioning:
– depression, our target variable, is provided in FCF26, with detailed questions and
response options in pg. 10 of
Questionnaires/MICS6 Questionnaire for Children Age 5 17.docx
Maternal variables: from the original data (provided in ExtendedDataSources/wm.sav), a subset
of variables have been extracted:
• Mother’s background:
– age, education, and literacy are provided in WB4, WB5, WB6A, WB6B, WB14, with
detailed questions and response options in pg. 2-3 of
Questionnaires/MICS6 Questionnaire for Individual Women.docx
• Domestic violence:
– attitudes towards domestic violence are provided in DV1A, DV1B, DV1C, DV1D, DV1E,
with detailed questions and response options in pg. 29 of
Questionnaires/MICS6 Questionnaire for Individual Women.docx
• Victimization:
– information on attacks, harassment, and safety are provided in VT1, VT9, VT20, VT21,
VT22A, VT22B, VT22C, VT22D, VT22E, VT22F, VT22X, with detailed questions and
response options in pg. 30-32 of
Questionnaires/MICS6 Questionnaire for Individual Women.docx
• Marriage/Union:
– marital status is provided in MSTATUS.
– husband’s age and multiple wives are provided in MA2, MA3, with detailed questions and
response options in pg. 33 of
Questionnaires/MICS6 Questionnaire for Individual Women.docx
• Adult functioning:
– disability is provided in disability.
– difficulty in remembering, self-care, and communicating are provided in AF10, AF11,
AF12, with detailed questions and response options in pg. 34 of
Questionnaires/MICS6 Questionnaire for Individual Women.docx
• Tobacco and alcohol use:
– tobacco and alcohol use are provided in TA1, TA14, with detailed questions and
response options in pg. 43-44 of
Questionnaires/MICS6 Questionnaire for Individual Women.docx
• Life satisfaction:
– life satisfaction information is provided in LS1, LS2, LS3, LS4, with detailed
questions and response options in pg. 45 of
Questionnaires/MICS6 Questionnaire for Individual Women.docx
• Fertility:
– the number of surviving and deceased children is provided in CSURV, CDEAD.
Household variables: from the original data (provided in ExtendedDataSources/hh.sav), a
subset of variables have been extracted:
• Household characteristics:
– house material, electronic use, ownership (dwelling, land, and/or livestock), and bank
account possession information are provided in HC4, HC5, HC8, HC11, HC12,
HC13, HC14, HC15, HC17, HC19, with detailed questions and response
options in pg. 5-8 of
Questionnaires/MICS6 Household Questionnaire.docx
• Insecticide treated nets:
– mosquito net information is provided in TN1, with detailed questions and response
options in pg. 13 of
Questionnaires/MICS6 Household Questionnaire.docx
• Water and sanitation:
– source, location, and sufficiency of drinking water and toilet facilities information are
provided in WS1, WS3, WS4, WS7, WS11, WS14, WS15, with detailed
questions and response options in pg. 15-18 of
Questionnaires/MICS6 Household Questionnaire.docx
• Handwashing:
– availability of soap/detergent for handwashing is provided in HW5, with detailed
questions and response options in pg. 19 of
Questionnaires/MICS6 Household Questionnaire.docx
For further details on the variables, questions, and responses, please see the Questionnaires
folder provided in project materials. As mentioned, the target variable is depression in the child,
FCF26, which you need to binarize into the two categories ‘no depression’ and ‘depression’ (combining
all severities of depression into one category). To predict depression, you may choose to include
as few or as many of the variables as you like. You may also include additional variables from the extended
datasets provided in the folder ExtendedDataSources. In this case, please describe the additional
variables and your motivation for including them.
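If you do bring in variables from the extended datasets, a guarded merge helps avoid accidentally duplicating or dropping rows. The sketch below is an assumption-laden illustration: the helper add_extended_features, the key column household_id, and the toy tables are all hypothetical, and the actual identifier columns shared between the provided file and the .sav files need to be checked in the data. The .sav files can typically be loaded with pandas.read_spss, which requires the pyreadstat package.

```python
import pandas as pd


def add_extended_features(base, extended, keys, columns):
    """Left-join selected columns from an extended table onto the base table.

    validate="many_to_one" makes pandas raise an error if the keys are not
    unique in the extended table, which would otherwise duplicate base rows.
    """
    subset = extended[list(keys) + list(columns)]
    merged = base.merge(subset, on=list(keys), how="left", validate="many_to_one")
    # A left join on unique right-hand keys must preserve the row count.
    assert len(merged) == len(base)
    return merged


# Toy tables with made-up names and values, for illustration only.
base = pd.DataFrame({"household_id": [1, 1, 2], "child_row": [0, 1, 2]})
extended = pd.DataFrame({
    "household_id": [1, 2, 3],
    "extra_var": [10, 20, 30],
    "unused": [0, 0, 0],
})
merged = add_extended_features(
    base, extended, keys=["household_id"], columns=["extra_var"]
)
print(merged)
```

Selecting only the needed columns before merging keeps the feature set explicit, which makes it easier to describe the additional variables and the motivation for each one in the report.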
The data provided for this project is subject to data agreements and waivers that I have signed
on your behalf. Thus, the data must NOT be shared publicly (e.g. if your team is using
GitHub, keep the repo private and do NOT share the data with any GenAI tools).