The Performance Evaluation of Naïve Bayes, Logistic Regression, and Calibrated SVC for Cross-Lingual Ham and Spam Detection
Keywords:
Spam classification, TF-IDF, Logistic Regression, Naïve Bayes, Calibrated Linear SVC, Calibrated Classifier CV, machine translation, MarianMT, multilingual NLP, Streamlit.
Abstract
This paper presents a multilingual spam email classification framework that processes both German and English texts using classical machine learning models combined with a translation-enhanced pipeline. The system integrates TF-IDF vectorization, Logistic Regression, Multinomial Naïve Bayes, and a probability-calibrated Linear Support Vector Classifier (SVC). A key contribution is an automatic German-to-English translation layer based on MarianMT, which enables cross-lingual evaluation and robustness analysis. Multiple models are trained on a manually curated dataset of 3,790 emails, achieving up to 98.55% accuracy and 0.9860 AUC-ROC on the evaluated dataset. A Streamlit-based application is implemented for real-time inference, supporting automatic detection of the input language, dual-model evaluation for German emails, and calibrated probability scores. Experimental results show that Logistic Regression provides the most consistent overall performance, while Calibrated Linear SVC delivers the highest AUC-ROC and stable decision boundaries. The system is a practical, extensible multilingual spam detection pipeline suitable for lightweight deployment scenarios.
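
As a rough illustration of the three classifiers named in the abstract, the following minimal sketch pairs a TF-IDF vectorizer with Logistic Regression, Multinomial Naïve Bayes, and a LinearSVC wrapped in scikit-learn's CalibratedClassifierCV. It is not the authors' implementation: the toy corpus is invented for illustration, the hyperparameters are library defaults rather than the paper's settings, the helper names `build_models` and `spam_probabilities` are hypothetical, and the MarianMT translation step is omitted.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus invented for illustration only (1 = spam, 0 = ham).
# It is NOT the 3,790-email dataset described in the paper.
texts = [
    "win a free prize now", "claim your free cash reward",
    "cheap pills limited offer", "urgent offer click to win money",
    "free lottery winner claim now", "exclusive deal win cash today",
    "meeting moved to monday morning", "please review the attached report",
    "lunch tomorrow with the team", "project notes from the meeting",
    "can you send the report today", "see you at the team lunch",
]
labels = [1] * 6 + [0] * 6


def build_models():
    """Return one TF-IDF pipeline per classifier (hypothetical helper)."""
    return {
        "logreg": make_pipeline(
            TfidfVectorizer(), LogisticRegression(max_iter=1000)
        ),
        "nb": make_pipeline(TfidfVectorizer(), MultinomialNB()),
        # LinearSVC has no predict_proba, so it is wrapped in a calibrator
        # that fits a sigmoid on held-out folds (cv=3 is an assumption).
        "calibrated_svc": make_pipeline(
            TfidfVectorizer(), CalibratedClassifierCV(LinearSVC(), cv=3)
        ),
    }


models = build_models()
for model in models.values():
    model.fit(texts, labels)


def spam_probabilities(message):
    """Return each model's estimated probability that `message` is spam."""
    return {
        name: float(model.predict_proba([message])[0, 1])
        for name, model in models.items()
    }


if __name__ == "__main__":
    print(spam_probabilities("claim your free prize"))
```

In a cross-lingual setup like the one described, a German message would presumably be translated to English first and then passed to `spam_probabilities`; the calibration wrapper is what lets the SVC report probabilities comparable to those of the other two models.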