- A virtual environment keeps each Python project's dependencies separate from those of other projects. Create the virtual environment (venv) once, then make sure it is activated every time you return to your workspace. Click the gear icon in the lower left-hand corner of the screen to open the Manage menu, then select Command Palette to open the VS Code command palette.
- In the command palette, type `create environment` and select **Python: Create Environment…**
- Choose **Venv** from the dropdown list.
- Choose the Python version you installed earlier. Currently, we recommend Python 3.12.8.
- DO NOT tick the box next to `requirements.txt`, as more steps are needed before you can install your dependencies. Click **OK**.
- A `.venv` folder will appear in the file explorer pane, confirming that the virtual environment has been created.
- Important: the `.venv` folder is listed in `.gitignore` so that Git won't track it.
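If you prefer the terminal, the same venv can be created and activated there instead; a quick sketch (assumes `python3` is on your PATH and the project root is the current directory):

```shell
# Create the virtual environment in a .venv folder (the same location VS Code uses).
python3 -m venv .venv

# Activate it on macOS/Linux:
source .venv/bin/activate

# Activate it on Windows (PowerShell) instead:
# .venv\Scripts\Activate.ps1
```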
- Return to the terminal by clicking the TERMINAL tab, or click the Terminal menu and choose New Terminal if no terminal is currently open.
- In the terminal, use the command below to install your dependencies. This may take several minutes.

  `pip3 install -r requirements.txt`
- Open the `jupyter_notebooks` directory and click on the notebook you want to open.
- Click the kernel button and choose **Python Environments**.
- Note that the kernel shows Python 3.12.8 because it inherits from the venv, so it will match whichever Python version is installed on your PC. To confirm this, run the command below in a notebook code cell.

  `! python --version`
- Set the Python version in `.python-version` to a currently supported Heroku-22 stack version that most closely matches the one used in this project.
- The project can be deployed to Heroku using the following steps.
- Log in to Heroku and create an App.
- On the Deploy tab, select GitHub as the deployment method.
- Select your repository name and click Search. Once it is found, click Connect.
- Select the branch you want to deploy, then click Deploy Branch.
- The deployment process should run smoothly if all deployment files are fully functional. Click the Open App button at the top of the page to access your App.
- If the slug size is too large, add large files not required by the app to the `.slugignore` file.
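As an example, a `.slugignore` for this project might exclude the notebooks and local data; the paths below are hypothetical, so adjust them to your repository layout:

```text
# .slugignore — files excluded from the Heroku slug (hypothetical paths)
jupyter_notebooks/
data_raw/
*.pbix
*.pptx
```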
This project analyses the Open University Learning Analytics Dataset (OULAD) to understand how student demographics, engagement behaviour, and assessment performance relate to final outcomes.
The goal is to produce clear, actionable insights supported by statistical analysis, reproducible Python workflows, and an executive-ready dashboard built in Power BI/Tableau.
This work was completed as part of a 5-day data analytics capstone, demonstrating end-to-end analytics capability: data cleaning, feature engineering, EDA, statistical testing, visualisation, and communication.
The Open University wishes to improve student retention and academic performance by understanding the drivers of success and what leads to students failing or withdrawing. This project analyses demographic, engagement, and assessment data to identify early indicators of risk (i.e. students who fail or withdraw) and provide actionable insights for early student intervention measures.
- What factors most strongly predict whether a student will succeed (pass or gain a distinction) or fail (fail or withdraw)?
- How can we identify at-risk students early?
The Open University Learning Analytics Dataset (OULAD) is a widely used, openly available dataset designed for research in learning analytics, student performance prediction, and educational data mining.
It was released by the Open University (UK) and contains anonymised data about students, courses, and their interactions with the university's Virtual Learning Environment (VLE).
About 20 years ago I undertook two statistics courses with the Open University, and my familiarity with the learning environment led me to choose this dataset.
The dataset can be downloaded from here: https://analyse.kmi.open.ac.uk/open-dataset
| Feature | Description |
|---|---|
| code_module | The module (course) code, e.g., “AAA”. |
| code_presentation | The presentation (term) of the module, e.g., “2013B” (February) or “2013J” (October). |
| module_presentation_length | Length of the module in days. |
| id_student | Unique anonymised student identifier. |
| gender | Student gender (“M”, “F”). |
| region | Geographic region where the student lives (UK regions). |
| highest_education | Highest prior education level (e.g., “A Level”, “HE Qualification”). |
| imd_band | Index of Multiple Deprivation band for socioeconomic status (e.g., “0–10%”). |
| age_band | Age group (e.g., “0–35”, “35–55”, “55+”). |
| disability | Whether the student declared a disability (“Y”, “N”). |
| num_of_prev_attempts | Number of times the student previously attempted this module. |
| studied_credits | Total number of credits for the modules the student is currently studying. |
| final_result | Final outcome: “Pass”, “Fail”, “Withdrawn”, “Distinction”. |
| date_registration | Day the student registered (relative to module start). |
| date_unregistration | Day the student withdrew (if applicable). |
| id_site | Identifier for a VLE activity/page (links to VLE table). |
| date (click on VLE activity) | Day student interacted with online content. |
| date (due date) | Date when assessment is due. |
| sum_click | Number of clicks the student made on that VLE activity on that day. |
| id_assessment | Unique assessment identifier. |
| date_submitted | Submission date (relative to module start). |
| is_banked | Whether the score was carried over from a previous attempt. |
| score | Student’s score (0–100) per assessment. |
| assessment_type | “TMA” (Tutor Marked), “CMA” (Computer Marked), “Exam”. |
| weight | Percentage weight of the assessment toward final grade. |
| activity_type | Type of VLE resource (e.g., “forum”, “quiz”, “resource”, “oucontent”). |
| week_from | First week the activity is available. |
| week_to | Last week the activity is available. |
| Feature Name | Type | Description | Category |
|---|---|---|---|
| id_student | Numeric | Unique ID used to join tables | — |
| num_assessments_total | Numeric | Total number of assessments the student attempted across all modules | Assessment-Related Behaviour |
| total_score_total | Numeric | Sum of all assessment scores | Assessment-Related Behaviour |
| weighted_score_total | Numeric | Sum of scores weighted by assessment weight | Assessment-Related Behaviour |
| avg_score_mean | Numeric | Mean score across all assessments | Assessment-Related Behaviour |
| total_weight_completed | Numeric | Total assessment weight completed (out of 100 per module) | Assessment-Related Behaviour |
| total_clicks_total | Numeric | Total number of VLE clicks across all activities | VLE Engagement |
| total_active_days | Numeric | Total unique days active on the VLE | VLE Engagement |
| avg_clicks_student | Numeric | Average number of student clicks per module/day | VLE Engagement |
| unique_vle_activities_total | Numeric | Number of distinct VLE activities participated in | VLE Engagement |
| avg_clicks_per_day | Numeric | Average clicks per active day across all modules | VLE Engagement |
| studied_credits | Numeric | Total credits the student is registered for (workload indicator) | History & Workload |
| num_of_prev_attempts | Numeric | Number of previous attempts of the module | History & Workload |
| pass_distinction_binary | Numeric | Target variable: Pass/Distinction = 1; Fail/Withdraw = 0 | Target |
| best_result | Categorical | Best final result across all modules | Performance |
| worst_result | Categorical | Worst final result across all modules | Performance |
| gender | One‑hot | Encoded gender (F, M) | Demographics |
| age_band | One‑hot | Encoded age bands (0–35, 35–55, 55+) | Demographics |
| highest_education | One‑hot | Encoded highest education level (5 bands) | Demographics |
| imd_band | One‑hot | Encoded IMD socio‑economic band (11 bands incl. Missing) | Socioeconomic |
| disability | One‑hot | Encoded disability status (Y/N) | Demographics |
| region | One‑hot | Encoded region (England regions, Scotland, Wales, Ireland) | Region |
| Feature Name | Type | Description | Category |
|---|---|---|---|
| id_student | Numeric | Unique student identifier used to join tables | — |
| code_module | Categorical | Module code (e.g., AAA, BBB) | Module Metadata |
| code_presentation | Categorical | Presentation code (e.g., 2013J, 2014B) | Module Metadata |
| num_assessments | Numeric | Number of assessments in this module | Assessment Behaviour |
| total_score | Numeric | Sum of all assessment scores for this module | Assessment Behaviour |
| avg_score | Numeric | Average score across assessments | Assessment Behaviour |
| total_weight | Numeric | Total assessment weight available in the module | Assessment Behaviour |
| weighted_score_total | Numeric | Sum of scores weighted by assessment weight | Assessment Behaviour |
| num_of_prev_attempts | Numeric | Number of previous attempts at this module | History & Workload |
| studied_credits | Numeric | Credits the student is registered for | History & Workload |
| total_clicks | Numeric | Total VLE clicks for this module | VLE Engagement |
| active_days | Numeric | Number of unique active VLE days | VLE Engagement |
| unique_vle_activities | Numeric | Number of distinct VLE activity types engaged with | VLE Engagement |
| avg_clicks | Numeric | Average clicks per active day | VLE Engagement |
| module_presentation_length | Numeric | Length of the module presentation in days | Module Metadata |
| pass_distinction_binary | Numeric | Target: Pass/Distinction = 1; Fail/Withdraw = 0 | Target |
| result_rank | Numeric | Ordinal ranking of final result (Distinction > Pass > Fail > Withdraw) | Performance |
| gender | One‑hot | Encoded gender (F, M) | Demographics |
| age_band | One‑hot | Encoded age bands (0–35, 35–55, 55+) | Demographics |
| highest_education | One‑hot | Encoded highest education level (5 bands) | Demographics |
| imd_band | One‑hot | Encoded IMD socio‑economic band (11 bands incl. Missing) | Socioeconomic |
| disability | One‑hot | Encoded disability status (Y/N) | Demographics |
| region | One‑hot | Encoded region (England regions, Scotland, Wales, Ireland) | Region |
| final_result | One‑hot | Encoded final result (Distinction, Pass, Fail, Withdrawn) | Performance |
| assessment_type | One‑hot | Encoded assessment type combinations (CMA, TMA, Exam, combos) | Assessment Type |
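The one-hot encodings listed in the feature tables above can be produced with `pandas.get_dummies`; a minimal sketch on invented toy rows (the column names follow the tables, but the values are made up):

```python
import pandas as pd

# Toy student rows with the raw categorical fields (invented values).
df = pd.DataFrame({
    "id_student": [11391, 28400, 30268],
    "gender": ["M", "F", "F"],
    "age_band": ["0-35", "35-55", "0-35"],
    "disability": ["N", "N", "Y"],
    "final_result": ["Pass", "Distinction", "Withdrawn"],
})

# Target variable: Pass/Distinction = 1, Fail/Withdrawn = 0.
df["pass_distinction_binary"] = df["final_result"].isin(["Pass", "Distinction"]).astype(int)

# One-hot encode the categoricals; numeric columns pass through unchanged.
encoded = pd.get_dummies(df, columns=["gender", "age_band", "disability", "final_result"])
print(encoded.columns.tolist())
```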
- Data loading: 7 CSV datasets were downloaded and included in this project.
- Data cleaning: standardised categorical fields, converted date offsets to numeric timelines, removed invalid or impossible values, and handled missing values across tables.
- Data merging: the 7 data files (StudentInfo, StudentVLE, VLE, Assessments, StudentAssessment, Courses, StudentRegistration) were merged into 2 files.
- Feature engineering: created high-impact features such as total clicks, clicks per week, engagement before assessments, submission lateness, weighted average score, and engagement drop-off rate.
- Exploratory data analysis: demographic comparisons and distribution analysis.
- Statistical analysis: statistical testing using the Mann–Whitney U test, linear regression, and logistic regression.
- Visualisation: generating bar charts and histograms in both Python and Power BI/Tableau.
- Export: saving analysis results in various formats including CSV, PPT, SQL, Word, text, and Jupyter source files.
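The merge and feature-engineering steps above can be sketched with pandas. The frames below are invented stand-ins for the OULAD tables, but the column names (`id_student`, `code_module`, `date`, `sum_click`) match the data dictionary:

```python
import pandas as pd

# Toy stand-ins for the OULAD studentInfo and studentVle tables (invented rows).
student_info = pd.DataFrame({
    "id_student": [1, 2],
    "code_module": ["AAA", "AAA"],
    "final_result": ["Pass", "Withdrawn"],
})
student_vle = pd.DataFrame({
    "id_student": [1, 1, 2],
    "code_module": ["AAA", "AAA", "AAA"],
    "date": [0, 1, 0],       # day relative to module start
    "sum_click": [5, 3, 1],  # clicks on that day
})

# Aggregate VLE clicks per student/module: total clicks, active days, avg clicks/day.
engagement = (
    student_vle.groupby(["id_student", "code_module"])
    .agg(total_clicks=("sum_click", "sum"),
         active_days=("date", "nunique"))
    .reset_index()
)
engagement["avg_clicks_per_day"] = engagement["total_clicks"] / engagement["active_days"]

# Left-join the engineered engagement features onto the student table.
merged = student_info.merge(engagement, on=["id_student", "code_module"], how="left")
print(merged[["id_student", "total_clicks", "active_days", "avg_clicks_per_day"]])
```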
- Python 3.7 or higher: pandas, numpy, seaborn, matplotlib, scikit-learn, statsmodels, ipykernel
- pip
- Trello: Project Management
- VS Code for development
- Power BI / Tableau for dashboarding
- GitHub for version control and portfolio presentation
- Data set: 7 CSV files
Engagement Hypothesis:
- H0: There is no difference in total VLE engagement between students who pass and those who fail/withdraw.
- H1: Students who pass have significantly higher engagement than those who fail/withdraw.

Region Hypothesis (ANOVA):
- H0: Average assessment scores do not differ across regions.
- H1: Average assessment scores differ by region.

Gender Hypothesis (Chi-Square):
- H0: Gender is independent of the final result.
- H1: Gender is associated with the final result.

Lateness Hypothesis:
- H0: Submission lateness is independent of the final result.
- H1: Submission lateness is associated with the final result.

Registration Timing Hypothesis:
- H0: Registration date does not impact the final result.
- H1: Late registration increases the likelihood of failing or withdrawing.

Final Hypothesis Framework:

| Cluster | Hypothesis | Test | Variables |
|---|---|---|---|
| Engagement Volume | Passers have higher total clicks | Mann–Whitney U | total_clicks × final_result |
| Engagement Consistency | Passers access the VLE on more days | Mann–Whitney U | vle_days_accessed × final_result |
| Engagement Intensity | Passers have higher avg clicks/day | Mann–Whitney U | avg_clicks × final_result |
| Demographics | Final result differs by highest education | Chi-square | highest_education × final_result |
| Prior Attempts | More previous attempts → lower pass rate | Mann–Whitney U | num_of_prev_attempts × final_result |
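The Mann–Whitney U and chi-square tests above can be run with SciPy; a minimal sketch on invented data (the real tests run on the merged OULAD features, not these simulated values):

```python
import numpy as np
from scipy.stats import mannwhitneyu, chi2_contingency

rng = np.random.default_rng(42)

# Invented click totals for the two outcome groups (placeholders for real data).
clicks_pass = rng.poisson(lam=900, size=200)
clicks_fail_withdraw = rng.poisson(lam=600, size=200)

# One-sided Mann-Whitney U: do passers have higher total clicks?
u_stat, p_value = mannwhitneyu(clicks_pass, clicks_fail_withdraw, alternative="greater")
print(f"Mann-Whitney U = {u_stat:.0f}, p = {p_value:.3g}")

# Chi-square test of independence on an invented gender x final_result table.
contingency = np.array([[120, 80],   # e.g. female: pass, fail/withdraw
                        [110, 90]])  # e.g. male:   pass, fail/withdraw
chi2, p_chi, dof, expected = chi2_contingency(contingency)
print(f"Chi-square = {chi2:.2f}, dof = {dof}, p = {p_chi:.3g}")
```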
xxxxx
OULAD is an openly available and anonymised dataset. Even so, there are potential ethical considerations, such as:

- Privacy and anonymisation limits: there is a risk of re-identification if external data sources are included, or where subgroups are so small that re-identification is feasible.
- Fairness and bias: the dataset might under- or over-represent certain demographics (e.g. socioeconomic status, disability, gender), leading to inequalities that are then amplified through analysis.
- Responsible use: any analysis must not harm or disadvantage students.
- Interpretability: complex models (e.g. random forests) can lack transparency, so any predictions may need to be verified using alternative algorithms.
- Data context: OULAD is a dataset from the Open University, and any conclusions or recommendations derived from it may not transfer to other contexts or institutions.
- Respect for students: the OULAD dataset represents real students, not simply data points. Any conclusions or recommendations derived from it should actively seek to offer solutions that support students, treating them ethically rather than penalising them.
The OULAD dataset is anonymised but still contains sensitive demographic and behavioural information, so it must be handled with care to avoid privacy risks.
Because some groups may be underrepresented and historical outcomes may reflect structural inequalities at the time of the creation of the dataset, models trained on this data may inherit or amplify bias unless fairness is explicitly monitored.
The OULAD dataset represents real educational practices within a specific institution (The Open University), so analyses based on it may unintentionally reinforce existing societal inequalities or misinform policy if interpreted without context.
Although the dataset is openly licensed, its use still carries ethical responsibilities: researchers should avoid overgeneralising findings to other institutions or populations and remain aware that modelling educational outcomes can influence perceptions of fairness, access, and student support at a societal level.
The key findings from this research are: xx xx xx xx
Overview: The OU student population shows a higher likelihood of students being male, residing in England, having a secondary-level education, belonging to lower socio-economic bands, being younger in age, and not reporting a disability.
- final result: (Pass, Distinction, Fail & Withdrawn)
- ongoing assessments: Computer Marked Assessments (CMA), Tutor Marked Assessments (TMA) and Exam
- engagement levels

Overview: The OU student population shows success is most likely for those with postgraduate qualifications, higher socio-economic groups, and older or middle-aged students.

Overview: CMAs have the lowest weighting relative to the final result but produce the highest average scores, whereas TMAs and Exams have higher weights but lower average scores.

Overview: Assessment success (like final-result success) is most likely for those with postgraduate qualifications, higher socio-economic groups, older or middle-aged students, and those in the South East.
See the Testing file.
Heroku

The App live link is: https://YOUR_APP_NAME.herokuapp.com/
- Clone this repository
- Install dependencies:

  `pip install -r requirements.txt`
- Place OULAD raw CSVs into /data_raw
- Run notebooks in numerical order
- Open the dashboard file in Power BI/Tableau
The requirements for this project are:
- Trello for project management
- Python (pandas, numpy, seaborn, matplotlib, scikit-learn, statsmodels)
- VS Code for development
- Power BI / Tableau for dashboarding
- GitHub for version control and portfolio presentation
- Data sets
- PostgreSQL
Teresa McGarry, Analytics Professional




