
Semantic similarity in community forum questions: case study on Quora dataset

Research output: Contribution to journal › Article › peer-review


Abstract

Duplicate questions on crowd-sourced question and answer websites such as Quora create redundancy and make information retrieval inefficient. This research conducts a systematic comparative analysis of machine learning and deep learning models for detecting semantic similarity in questions. Using the Quora Question Pairs dataset, we evaluate a spectrum of models: a classical TF-IDF baseline, feature-engineered Random Forest and XGBoost, a Siamese Manhattan LSTM (MaLSTM), and a fine-tuned BERT model. The study reveals a clear performance hierarchy. A key finding is that classical models with a limited set of hand-crafted linguistic features underperformed the simple TF-IDF baseline. While the MaLSTM network showed moderate improvement, the fine-tuned BERT model was unequivocally superior, achieving a statistically significant accuracy of 86.26%. This highlights the critical role of deep contextual embeddings for this task. However, BERT’s state-of-the-art performance comes at a significant computational cost, revealing a crucial trade-off between accuracy and resource efficiency. These findings provide a pragmatic guide for designing effective and scalable duplicate question detection systems.
Original language: English
Pages (from-to): 1719-1728
Number of pages: 10
Journal: Journal of Umm Al-Qura University for Engineering and Architecture
Volume: 16
Issue number: 4
Early online date: 1 Sept 2025
Publication status: Published - 1 Sept 2025
