Authors: Sophia Moyen & Emanuel Iwanow
Date: 30.07.2024
The notebook provides a deep dive into the 2018 ENEM dataset as well as hypothesis testing analysis and a prediction model for the scores.
- 1. Introduction
- 2. Preprocessing
- 3. Exploratory Data Analysis
- 4. Feature Extraction
- 5. Hypothesis Testing
- 6. Modelling and Prediction
- 7. Conclusion
- References
ENEM (Exame Nacional do Ensino Médio) is a non-mandatory, standardized national exam held in Brazil for college admissions. Admission to federal universities, considered the best in the country, is based solely on the results of this test. A growing number of non-federal public universities and private universities also accept its results as a form of admission. As for its relevance: in 2016, 8.6 million people signed up to take it, which makes it the second-largest exam of its kind in the world after China's National Higher Education Entrance Examination.
The ENEM is a two-day exam held annually in November, simultaneously across the entire country. Each exam day lasts 5 hours. The exam consists of 180 multiple-choice questions and one essay. A special normalization system called TRI (Teoria de Resposta ao Item, i.e. item response theory) is applied. Under the TRI, each multiple-choice question has a different value, calculated after normalization across the performance of all candidates, to avoid rewarding candidates who merely guessed the right answer. A question that only a few candidates got right has a higher value, but only if the candidate who got it right also got "easier" questions right. The logic is that a candidate who can solve a difficult question should also be able to solve an "easy" one that many people got right. In practice, two candidates may both have answered exactly 144 of the 180 questions correctly and received the same essay grade, yet their final grades may be completely different because of the TRI.
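The effect described above can be sketched with a toy item response model. The snippet assumes a three-parameter logistic (3PL) curve, a common choice in item response theory. The item parameters `a`, `b` and `c`, the ten-item test, and the helper names `p_correct` and `estimate_ability` are all invented for illustration and are not the official ENEM calibration.

```python
import numpy as np

def p_correct(theta, a, b, c):
    """3PL probability of a correct answer for ability theta (assumed model)."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

def estimate_ability(responses, a, b, c, grid=np.linspace(-4, 4, 801)):
    """Grid-search maximum-likelihood estimate of theta for one response pattern."""
    # p has shape (len(grid), n_items): one row per candidate ability value.
    p = p_correct(grid[:, None], a, b, c)
    log_lik = (responses * np.log(p) + (1 - responses) * np.log(1 - p)).sum(axis=1)
    return grid[np.argmax(log_lik)]

# Hypothetical item parameters: ten items ordered from easy to hard.
a = np.full(10, 1.2)        # discrimination
b = np.linspace(-2, 2, 10)  # difficulty
c = np.full(10, 0.2)        # chance of guessing right among five options

coherent = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])    # the five easiest items right
incoherent = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])  # only the five hardest items right

theta_coherent = estimate_ability(coherent, a, b, c)
theta_incoherent = estimate_ability(incoherent, a, b, c)
print(theta_coherent, theta_incoherent)
```

Both patterns contain five correct answers, yet under these assumptions the coherent pattern receives the higher ability estimate: correct answers to hard items combined with wrong answers to easy ones are more plausibly explained by guessing.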
The TRI normalization also means that the maximum and minimum scores for each subject are only known once the results are out, as shown above for the year 2018. For example, the maximum grade in Maths could be 850 in one year and 950 in another, although the maximum generally falls somewhere between 800 and 1100. This does not apply to the essay, whose grade always ranges between 0 and 1000.
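One way to read off the observed range per subject is a simple aggregation. The column names below (`NU_NOTA_CN`, `NU_NOTA_CH`, `NU_NOTA_LC`, `NU_NOTA_MT`, `NU_NOTA_REDACAO`) are assumed to match the microdata dictionary, and the tiny DataFrame is made-up data standing in for the real file.

```python
import pandas as pd

# Assumed score columns; verify the names against the official dictionary.
SCORE_COLUMNS = ["NU_NOTA_CN", "NU_NOTA_CH", "NU_NOTA_LC", "NU_NOTA_MT", "NU_NOTA_REDACAO"]

# Made-up rows for illustration only, not actual ENEM results.
scores = pd.DataFrame({
    "NU_NOTA_CN": [410.5, 530.2, 620.0],
    "NU_NOTA_CH": [455.0, 580.3, 690.1],
    "NU_NOTA_LC": [430.7, 540.9, 610.4],
    "NU_NOTA_MT": [380.2, 600.6, 700.1],
    "NU_NOTA_REDACAO": [0.0, 560.0, 920.0],
})

# One row per subject, with the observed minimum and maximum as columns.
score_ranges = scores[SCORE_COLUMNS].agg(["min", "max"]).T
print(score_ranges)
```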
The implementation of this mathematical "trick" is an important part of the exam: it makes the standardized score a more credible measure of a candidate's knowledge and, in our case, a sounder basis for analysing relationships between candidates' grades and other socioeconomic features.
Beyond its role in selecting students for universities, the ENEM offers a rich dataset to investigate the interplay between socioeconomic factors and academic performance.
Understanding the socioeconomic determinants of ENEM performance is crucial for several reasons. First, it allows for a comprehensive assessment of the exam’s fairness and its ability to accurately measure students’ potential. Second, by identifying socioeconomic gaps in performance, policymakers and educators can develop targeted strategies to mitigate inequalities and provide more educational opportunities for disadvantaged students. Finally, exploring the predictive power of the ENEM in relation to a candidate's probability of success can inform the use of the exam as a selection tool for higher education.
The Brazilian Ministry of Education publishes the microdata for each year's exam on its website. We were especially interested in analysing the ENEM from 2018 because that was the year in which we took the exam ourselves.
All collected data is stored as either numerical or alphanumerical variables. A dictionary is provided to map the numbers and letters to their meanings. For example, the column TP_COR_RACA stores the candidate's ethnicity as numerals, while the column Q024, which corresponds to the 24th question of the socioeconomic questionnaire, categorizes the answers with letters from A to E:
| TP_COR_RACA | Q024 |
|---|---|
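Decoding such columns amounts to a lookup per code. The following is a minimal sketch with pandas; the label texts in `RACE_LABELS` and `COMPUTER_LABELS` (including the reading of Q024 as a question about computers at home) are assumptions for illustration and should be checked against the official dictionary, and the three rows are made up.

```python
import pandas as pd

# Assumed code-to-label mappings; verify against the official dictionary.
RACE_LABELS = {
    0: "Not declared",
    1: "White",
    2: "Black",
    3: "Brown (parda)",
    4: "Yellow (amarela)",
    5: "Indigenous",
}
COMPUTER_LABELS = {
    "A": "No",
    "B": "Yes, one",
    "C": "Yes, two",
    "D": "Yes, three",
    "E": "Yes, four or more",
}

# Made-up rows standing in for the microdata.
raw = pd.DataFrame({"TP_COR_RACA": [1, 3, 0], "Q024": ["A", "B", "E"]})

# Replace each code with its label; unmapped codes would become NaN.
decoded = raw.assign(
    TP_COR_RACA=raw["TP_COR_RACA"].map(RACE_LABELS),
    Q024=raw["Q024"].map(COMPUTER_LABELS),
)
print(decoded)
```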
The ENEM 2018 microdata provides many features for each candidate, which lets us investigate many trends regarding the exam. Our main goal is to understand which factors lead to a good grade in the exam and how socioeconomic factors influence a candidate's results. Our goals for this report can be summarized in the following points:
- Exploratory Data Analysis: Who are the ENEM 2018 candidates? How did they perform in the exam?
- Feature Extraction: What features are important to predict a candidate's grade?
- Statistical Hypothesis Testing: Based on the feature extraction, does feature X really influence a candidate's grade or is it just chance?
- Modeling and Prediction: Can we create a model that is able to predict a candidate's grade? How well does it perform?
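The question "does feature X really influence a candidate's grade or is it just chance?" can be illustrated with a permutation test on the difference in mean scores between two groups. This is only a sketch under stated assumptions: it is not necessarily the test used later in this notebook, the helper `permutation_p_value` is hypothetical, and the two groups are synthetic numbers rather than ENEM data.

```python
import numpy as np

def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in mean scores."""
    rng = np.random.default_rng(seed)
    observed = abs(group_a.mean() - group_b.mean())
    pooled = np.concatenate([group_a, group_b])
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle the group labels by shuffling the pooled scores.
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[:n_a].mean() - shuffled[n_a:].mean())
        extreme += int(diff >= observed)
    # Add-one correction keeps the estimate strictly positive.
    return (extreme + 1) / (n_permutations + 1)

# Synthetic scores for candidates with and without a hypothetical feature X.
rng = np.random.default_rng(42)
with_feature = rng.normal(540, 80, size=200)
without_feature = rng.normal(460, 80, size=200)

p_value = permutation_p_value(with_feature, without_feature)
print(f"p-value: {p_value:.4f}")
```

A small p-value means that a mean difference at least as large as the observed one rarely arises when the group labels are assigned at random.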
...
The content of the Hypothesis Testing section was taken from the tutorials of the lecture:
Debes, C. (2024, Summer). Data Science I - 18-zo-2110-vl. Lecture course. Technische Universität Darmstadt, Darmstadt, Germany.
We were inspired by the visual features used in the SmartPay notebook by Felipe Gomes, available on Kaggle:
Gomes, F. (2022). Smartpay - Calculadora de Salário Rio (GAMLSS.nb). Kaggle
Dataset: the public ENEM 2018 microdata published by the Brazilian Ministry of Education:
Instituto Nacional de Estudos e Pesquisas Educacionais Anísio Teixeira. Microdados do Enem 2018. [online]. Brasília: Inep, 2018: http://portal.inep.gov.br/web/guest/microdados
Python libraries
- Seaborn: Waskom, M. L., (2021). seaborn: statistical data visualization. Journal of Open Source Software, 6(60), 3021, https://doi.org/10.21105/joss.03021.
- Matplotlib: J. D. Hunter, "Matplotlib: A 2D Graphics Environment", Computing in Science & Engineering, vol. 9, no. 3, pp. 90-95, 2007.
- Scikit-learn: Pedregosa et al., Scikit-learn: Machine Learning in Python, JMLR 12, pp. 2825-2830, 2011.
- Numpy: Harris, C.R., Millman, K.J., van der Walt, S.J. et al. Array programming with NumPy. Nature 585, 357–362 (2020). DOI: 10.1038/s41586-020-2649-2.
- Pandas: McKinney, W., & others. (2010). Data structures for statistical computing in python. In Proceedings of the 9th Python in Science Conference (Vol. 445, pp. 51–56).