A lack of diversity and representativeness in training data introduces bias into the machine learning pipeline: many machine learning models end up favoring the majority of samples, which are most similar to one another. Diverse and representative training data is necessary, especially in application domains where people of varying demographics are affected by the outcomes the model produces. We therefore propose using Applications Quest (AQ), an algorithm originally used for increasing diversity within college admissions to mitigate sample bias, as an under-sampling technique to address non-diverse and non-representative training data. AQ uses both the class distribution and the features of each sample in the dataset during the sampling procedure. We compare AQ with common under-sampling techniques, namely random under-sampling, Edited Nearest Neighbor (ENN), Tomek Links, and Instance Hardness Threshold (IHT), on three imbalanced datasets: (1) Students’ Academic Performance; (2) Pima Indians Diabetes; and (3) Online Shoppers’ Purchasing Intention. Results indicate that applying AQ achieves comparable classification performance while also maintaining diversity and representativeness within the majority class of the datasets.
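
For readers unfamiliar with the baselines named in the abstract, the simplest of them, random under-sampling, can be sketched as follows. This is a minimal illustration only: it is not the AQ algorithm and not the authors' experimental code, and the function name `random_undersample`, the `seed` parameter, and the toy data are assumptions made for this example.

```python
import random
from collections import Counter


def random_undersample(samples, labels, seed=0):
    """Randomly drop samples from larger classes until every class
    has as many samples as the smallest class (hypothetical helper)."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    # Every class is reduced to the size of the smallest class.
    target = min(len(indices) for indices in by_class.values())

    kept = []
    for indices in by_class.values():
        if len(indices) > target:
            # Selection ignores the features entirely, so nothing here
            # protects diversity within the majority class.
            indices = rng.sample(indices, target)
        kept.extend(indices)

    kept.sort()  # preserve the original sample order
    return [samples[i] for i in kept], [labels[i] for i in kept]


# Toy imbalanced data (assumed for illustration): 8 majority, 2 minority.
samples = [[float(i)] for i in range(10)]
labels = [0] * 8 + [1] * 2

X_res, y_res = random_undersample(samples, labels)
print(Counter(y_res))
```

Because the majority-class samples are chosen without looking at their features, the retained subset may over- or under-represent particular subgroups. That is the gap a feature-aware sampler such as AQ is described as addressing.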



D. Prioleau, K. Alikhademi, A. Roberts, J. Peeples, A. Zare and J. Gilbert, "Application of Divisive Clustering for Reducing Bias in Imbalanced Data," in 2021 International Conference on Machine Learning and Data Mining (MLDM), P-ISSN 1864-9734, E-ISSN 2699-5220, ISBN 978-3-942952-81-1, pp. 115-129, 2021.
Title = {Application of Divisive Clustering for Reducing Bias in Imbalanced Data}, 
Author = {Diandra Prioleau and Kiana Alikhademi and Armisha Roberts and Joshua Peeples and Alina Zare and Juan Gilbert},  
Journal = {2021 International Conference on Machine Learning and Data Mining (MLDM)}, 
P-ISSN = {1864-9734},
E-ISSN = {2699-5220},
ISBN = {978-3-942952-81-1},
Pages = {115-129},
Year = {2021},