DETECTING CRIMINAL CONTENT ON SOCIAL MEDIA USING MACHINE LEARNING MODELS: A CASE STUDY ON KAZAKH-LANGUAGE CONTENT
Keywords:
cybersecurity, online social networks, crime detection, deep learning, neural networks
Abstract
With the increasing prevalence of harmful content on social media, there is a growing need for automated systems capable of detecting criminal discourse online. This study focuses on the detection of crime-related content in Kazakh-language social media posts using machine learning and natural language processing (NLP) techniques. A multilingual corpus was compiled from social networks and annotated with seven categories: Noncrime, Assault, Burglary, Drugs, Homicide, Sex Offense, and Extremist. Both classical machine learning classifiers (e.g., Logistic Regression, Naive Bayes, Random Forest) and deep learning models were trained and evaluated using several text vectorization methods (TF-IDF, Word2Vec, CountVectorizer). Among the classical models, Logistic Regression achieved the highest performance, with an F1-score of 0.9681. BERT, used as the primary deep learning model, demonstrated strong capability in identifying nuanced criminal content, especially in under-resourced languages such as Kazakh. The study underscores the effectiveness of modern NLP techniques for multilingual crime detection and contributes resources for future research on content moderation in low-resource linguistic environments.
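As a reading aid for the F1-score reported above, the following is a minimal sketch of how a per-class and macro-averaged F1 can be computed over the seven categories. The abstract does not state which averaging scheme was used, so macro averaging here is an assumption, and the function names `per_class_f1` and `macro_f1` are hypothetical, not taken from the study.

```python
from collections import Counter

# The seven annotation categories named in the abstract.
LABELS = ["Noncrime", "Assault", "Burglary", "Drugs",
          "Homicide", "Sex Offense", "Extremist"]


def per_class_f1(y_true, y_pred, labels=LABELS):
    """Return {label: F1} using F1 = 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative
    scores = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # A class with no true or predicted examples is scored 0.0 here.
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    scores = per_class_f1(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)
```

Macro averaging weights every category equally, which matters when rare classes are much smaller than Noncrime; a weighted or micro average would give a different number on the same predictions.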
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.