TY - JOUR
T1 - Refining fine-tuned transformers with hand-crafted features for gender screening on question-answering communities
AU - Figueroa, Alejandro
N1 - Publisher Copyright:
© 2022 Elsevier B.V.
PY - 2023/4
Y1 - 2023/4
N2 - Machine learning and demographic analysis are cornerstones of making community Question Answering (cQA) platforms more egalitarian, vibrant, and safer. For instance, the two cooperate in detecting suspicious/malicious activity and in sparking community members' interest in learning by exploring new topics. In this sense, both research fields play a vital role in reducing gender disparity across categories when promoting unresolved questions to potential answerers. Current state-of-the-art artificial intelligence architectures, such as pre-trained transformers, are trained on complex objectives with millions of parameters as a means of inferring and encoding knowledge from massive corpora. Fine-tuning is the process that later allows this encoded information to be transferred to a downstream task (e.g., gender classification). However, these pre-trained encoders also suffer from several disadvantages. For example, they are sensitive to irrelevant and misleading words, which brings about overfitting, especially on small datasets. This work offers a fresh look at this kind of technique by introducing PTM-SFFS, a novel approach that effectively pairs frontier transformers with linguistic properties via traditional classifiers. Based on a feature wrapper (SFFS), PTM-SFFS refines the scores produced by a fine-tuned model by searching for an array of mostly linguistic features with which to build a conventional statistical classifier (e.g., Bayes and MaxEnt). As a result, this new discriminant function enhances the overall prediction rate by optimizing the synergy between both sorts of strategies. When applied to automatic gender recognition on cQA sites, PTM-SFFS increased the accuracy of seven fine-tuned state-of-the-art encoders by up to 10% (XLNet). Thanks to its interpretability, we discover that it capitalizes on dependency parsing and metadata to improve the transfer of lexicalized information to the target domain.
KW - Community question answering
KW - Gender recognition
KW - Natural language processing
KW - Pre-trained models
KW - Statistical classifiers
KW - User analysis
UR - http://www.scopus.com/inward/record.url?scp=85143860709&partnerID=8YFLogxK
U2 - 10.1016/j.inffus.2022.12.003
DO - 10.1016/j.inffus.2022.12.003
M3 - Article
AN - SCOPUS:85143860709
SN - 1566-2535
VL - 92
SP - 256
EP - 267
JO - Information Fusion
JF - Information Fusion
ER -