Implementation of machine learning models to predict student dropout at Tecsup, 2024
DOI:
https://doi.org/10.71701/n2tes416
Keywords:
Student dropout, data preprocessing, classification models, prediction, accuracy, machine learning, data mining
Abstract
The main objective of this study is to predict whether a student at Tecsup will drop out in 2024 by implementing several machine learning classification models and selecting the best one, while also identifying the variables most relevant to student dropout. The study is justified because, according to the literature review, dropout remains a persistent challenge for Peruvian educational institutions, so preventive measures are needed to keep students from abandoning their studies at Tecsup.
The study is descriptive in scope, with a non-experimental, cross-sectional design. The population consists of 38,835 student records from the 2019-2022 period, comprising personal, academic, and financial data. No sampling was performed, so that the entire dataset could be used to maximize prediction accuracy. The exploratory data analysis employed heat maps, histograms, distribution graphs, box plots, bar charts, double bar charts, and tables. Eight different classification models were implemented in Python using Jupyter Notebook.
Notably, a high correlation (0.92) was found between the variables "number of courses taken" and "number of courses passed." The former was therefore eliminated, since it is simply the sum of the number of courses passed and the number of courses failed. The variables "number of courses passed," "number of courses failed," "age," and "on-time tuition payment status" were discretized into 4, 4, 9, and 2 categories, respectively. Of the 50 numerical variables obtained after generating dummy variables, 36 were selected as the most relevant to dropout. Of the eight classification models evaluated (logistic regression, k-NN, decision tree, random forest, XGBoost, LightGBM, CatBoost, and multilayer perceptron), LightGBM was ultimately chosen, with an accuracy of 0.9512 on the training set and 0.8892 on the test set.
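The preprocessing steps described above (correlation check, removal of the redundant variable, discretization, and dummy-variable generation) can be illustrated with a minimal pandas sketch. It is not the author's code: the data are synthetic, and the column names, bin edges, and category labels are all assumptions chosen for illustration, so the category counts differ from those reported in the study.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; every column name here is hypothetical.
rng = np.random.default_rng(0)
n = 200
passed = rng.integers(0, 8, size=n)   # values 0..7
failed = rng.integers(0, 3, size=n)   # values 0..2
df = pd.DataFrame({
    "courses_passed": passed,
    "courses_failed": failed,
    "courses_taken": passed + failed,  # exact sum of the two columns above
    "age": rng.integers(16, 45, size=n),
    "paid_on_time": rng.integers(0, 2, size=n),
})

# 1. Correlation check: a column built from others shows a high pairwise value.
corr = df["courses_taken"].corr(df["courses_passed"])
print(f"corr(courses_taken, courses_passed) = {corr:.2f}")

# 2. Drop the column that is fully determined by the other two.
df = df.drop(columns=["courses_taken"])

# 3. Discretize numeric variables into ordered bins (edges are illustrative only).
df["courses_passed_bin"] = pd.cut(
    df["courses_passed"], bins=[-1, 1, 3, 5, 7],
    labels=["0-1", "2-3", "4-5", "6-7"],
)
df["age_bin"] = pd.cut(
    df["age"], bins=[15, 20, 25, 30, 45],
    labels=["16-20", "21-25", "26-30", "31-45"],
)

# 4. Replace the binned columns with 0/1 dummy variables.
features = df.drop(columns=["courses_passed", "age"])
encoded = pd.get_dummies(
    features, columns=["courses_passed_bin", "age_bin"], dtype=int
)
print(encoded.columns.tolist())
```

A feature-selection step, such as the univariate selectors documented by scikit-learn, would then operate on the `encoded` table; it is omitted here because the abstract does not state which selection method was used.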
Consequently, the LightGBM model can be considered suitable for predicting dropout: its high accuracy on the test set indicates good generalization capacity, and the small difference between the accuracy values on the training and test sets (0.0619) indicates an absence of overfitting. Furthermore, this model offers advantages such as faster training speed, lower memory usage, and higher accuracy compared to other classification models.
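The comparison of training and test accuracy used above as an overfitting check can be sketched as follows. This is a hedged illustration, not the study's pipeline: the data are generated with scikit-learn's `make_classification`, the helper `accuracy_gap` is hypothetical, the hyperparameters are arbitrary, and only four of the eight model families are included so that the example depends on scikit-learn alone.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def accuracy_gap(train_acc, test_acc):
    """Hypothetical helper: training minus test accuracy, rounded to 4 decimals."""
    return round(train_acc - test_acc, 4)


# Synthetic binary-classification data standing in for the student records.
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Four of the model families named in the abstract (settings are illustrative).
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)  # mean accuracy on training data
    test_acc = model.score(X_test, y_test)     # mean accuracy on held-out data
    results[name] = (train_acc, test_acc, accuracy_gap(train_acc, test_acc))

# Rank models by test accuracy and show the train/test gap next to each.
for name, (tr, te, gap) in sorted(
    results.items(), key=lambda kv: kv[1][1], reverse=True
):
    print(f"{name:20s} train={tr:.4f} test={te:.4f} gap={gap:.4f}")

# The same helper applied to the accuracies reported in the abstract.
reported_gap = accuracy_gap(0.9512, 0.8892)
print(reported_gap)
```

Applied to the rounded accuracies in the abstract, the helper returns 0.062; the reported 0.0619 presumably reflects unrounded accuracy values, though that is an assumption. Gradient-boosting classifiers such as `lightgbm.LGBMClassifier` expose the same `fit` and `score` interface and could be added to the `models` dictionary in the same way.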
Published
Versions
- 2025-12-23 (3)
- 2025-12-23 (2)
- 2025-12-23 (1)
License
Copyright (c) 2025 Mg. José Espinoza Melgarejo (Author)

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.