Abstract
Promoting engagement and participation is vital for online social networks such as community Question-Answering (cQA) sites. One way of increasing members' contributions is to connect their content with the right target audience. To achieve this goal, demographic analysis is pivotal in deciphering the interests of each community member. Indeed, demographic factors such as gender are fundamental to reducing gender disparity across distinct topics. This work assesses the classification rate of assorted state-of-the-art transformer-based models (e.g., BERT and FNet) on the task of gender identification across cQA members. For this purpose, it benefited from a massive text-oriented corpus encompassing 548,375 member profiles, including their respective full questions, answers and self-descriptions. This corpus made it possible to conduct large-scale experiments considering distinct combinations of encoders and sources. Contrary to our initial intuition, self-descriptions were, on average, detrimental due to their sparseness. In effect, the best transformer models (i.e., DeBERTa and MobileBERT) achieved an AUC of 0.92 by taking full questions and answers into account. Our qualitative results reveal that fine-tuning on user-generated content is adversely affected by pre-training on clean corpora, and that this effect can be mitigated by correcting the case of words.
Original language | English |
---|---|
Pages (from-to) | 3983-3995 |
Number of pages | 13 |
Journal | IEEE Access |
Volume | 11 |
Publication status | Published - 2023 |
Keywords
- Gender identification
- Community question-answering sites
- Engagement and participation in online communities
- Transformers
ASJC Scopus subject areas
- General Computer Science
- General Materials Science
- General Engineering
- Electrical and Electronic Engineering