Identifying Age and Gender and Analysing their Relation to Programming Behaviour of Scratch Users

  • J. Golsteijn

Student thesis: Master's Thesis


Informal online programming communities are a first introduction to computer science for many people around the world. One such community is Scratch, a popular block-based programming environment that enables its users to share the projects they create with fellow Scratchers. Most educational research on Scratch focuses on analysing the programming behaviour of its users, most of whom are children. The age and gender of these children are important factors, because understanding the capabilities and interests of children of different ages and genders makes it possible to further refine programming education practices and tools to their needs. This thesis presents a way of automatically eliciting age and gender information of Scratch users on a large scale using machine learning models. The proposed methods were used to enrich an existing dataset of Scratch users and projects with age and gender information. We then quantitatively analysed the programming behaviour of Scratch users in the enriched dataset.
To deploy our machine learning models, we first used the Scratch API to scrape user data such as profile texts and social network data. From these profile texts, we identified and manually verified more than 6,000 users who disclose their age and gender, and used them to construct a training set for our machine learning models. We then validated the performance of several models on the training data. This resulted in the selection of a network-based Node2Vec model for gender identification, and a text-based Transformer model with selective classification for age identification. Cross-validation results revealed that both of these models achieve an F1-score of around 0.80 on the training set. We used these models to automatically elicit age and gender information for the rest of the dataset. This allowed us to quantitatively analyse block type and programming concept usage in relation to age and gender.
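As an illustration of the selective classification idea mentioned above, the following minimal Python sketch shows a classifier that abstains when its confidence is too low. It is not the implementation used in the thesis: the age group labels, the threshold value, and the function names `selective_predict` and `coverage` are all hypothetical and chosen only for this example.

```python
from typing import Optional, Sequence

# Hypothetical age groups; the thesis does not list its classes here.
AGE_GROUPS = ("child", "teen", "adult")


def selective_predict(
    probs: Sequence[float],
    labels: Sequence[str] = AGE_GROUPS,
    threshold: float = 0.8,  # hypothetical confidence cut-off
) -> Optional[str]:
    """Return the most probable label, or None to abstain.

    The classifier only commits to a prediction when the highest class
    probability reaches the threshold; otherwise the user stays unlabelled.
    """
    if not probs or len(probs) != len(labels):
        raise ValueError("probs and labels must be non-empty and equally long")
    best = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best] if probs[best] >= threshold else None


def coverage(predictions: Sequence[Optional[str]]) -> float:
    """Fraction of inputs for which the classifier did not abstain."""
    if not predictions:
        return 0.0
    return sum(p is not None for p in predictions) / len(predictions)


if __name__ == "__main__":
    # Made-up probability vectors standing in for model outputs.
    outputs = [[0.10, 0.85, 0.05], [0.40, 0.35, 0.25]]
    predictions = [selective_predict(p) for p in outputs]
    print(predictions, coverage(predictions))
```

Raising the threshold in such a scheme generally trades coverage (the share of users that receive a label) for a higher expected accuracy on the users that are labelled.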
The use of our selected machine learning models resulted in gender information for 336,394 Scratch users, which is 82.64% of all users in the utilised dataset. Age information was elicited for 14,993 Scratch users, which is 3.68% of all users in the utilised dataset. Furthermore, our gender distribution was more similar to that of the entire Scratch population than our age distribution, which was skewed towards higher age groups. Our block type and programming concept analyses revealed some differences related to gender. Male Scratchers use 7 out of 11 block types and the programming concepts of conditionals, coordination, iteration, and variables in a larger percentage of their projects than female users. Looks, Control, and Events blocks are used more frequently in projects by female users. There were hardly any age-related differences regarding the usage of block types and programming concepts.
The proposed age and gender identification methods open up several directions for future work. These involve further exploration of the gender-related differences in programming behaviour that were observed in this study. This can be achieved by applying our machine learning methods to other datasets. More advanced analysis frameworks can also be used to deepen the understanding of gender- and age-related differences in programming behaviour.
Date of Award: 17 May 2022
Original language: English
Supervisor: Efthimia Aivaloglou (Examiner) & Stefano Bromuri (Supervisor)

Master's Degree

  • Master Software Engineering
