Keywords: Gender classification, Fairness, Model biases
Abstract: Gender bias in facial recognition systems is a critical issue that affects the accuracy and fairness of these technologies. This paper investigates the performance of gender classification models across diverse datasets, focusing on the CNN-based Face-Gender-Classification-PyTorch model. We evaluate the model on three key datasets: Kaggle Gender Classification, UTKFace, and All-Age-Faces. Initial experiments reveal that while the model performs well on the Kaggle dataset, its accuracy drops on Asian face data, with notable performance disparities between genders. To address this, we apply fine-tuning, among other strategies, using the FairFace dataset and a mixed-dataset approach. While FairFace alone improves overall accuracy, combining it with the mixed dataset produces more balanced results, reducing the gender gap from 30 to 6 percentage points while achieving near-optimal accuracy. The findings provide evidence of racial and gender bias and show that it can be mitigated through straightforward data-balancing techniques. We further analyze model behavior by evaluating performance across racial groups in UTKFace and applying Grad-CAM to interpret the model's decision-making. Finally, we test the best-performing model on Japanese TV data, demonstrating its potential for large-scale gender-fairness monitoring.
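For readers unfamiliar with the interpretation step mentioned above, the sketch below illustrates the general Grad-CAM technique on a PyTorch CNN. It is a minimal illustration only, assuming a ResNet-18 stand-in with a two-class head and a hypothetical input file ("face.jpg"); it is not the paper's actual model or implementation.

```python
# Minimal Grad-CAM sketch for a binary gender classifier.
# Model, layer choice, and file name are illustrative assumptions.
import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image

def grad_cam(model, target_layer, image_tensor, class_idx):
    activations, gradients = {}, {}

    def fwd_hook(_, __, output):
        activations["value"] = output.detach()

    def bwd_hook(_, grad_in, grad_out):
        gradients["value"] = grad_out[0].detach()

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)

    model.zero_grad()
    logits = model(image_tensor.unsqueeze(0))  # shape: (1, num_classes)
    logits[0, class_idx].backward()            # gradient of the target class score

    h1.remove(); h2.remove()

    # Channel weights: global-average-pool the gradients over spatial dims
    weights = gradients["value"].mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * activations["value"]).sum(dim=1))  # (1, H, W)
    cam = cam / (cam.max() + 1e-8)             # normalize to [0, 1]
    return cam.squeeze(0)

# Example with a ResNet-18 stand-in and a 2-class head (male/female)
model = models.resnet18(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, 2)
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
img = preprocess(Image.open("face.jpg").convert("RGB"))  # hypothetical input
heatmap = grad_cam(model, model.layer4, img, class_idx=0)
```

The resulting heatmap can be upsampled and overlaid on the input face to visualize which regions drive the gender prediction, which is how Grad-CAM is typically used for the kind of decision-making analysis the abstract describes.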
Submission Number: 2