Forensic Identification of Biological Paper Mills: Utilizing Random Forest Classifiers to Detect Structural and Technical Fingerprints

Nurkalam Huseynli; Orkhan Mammadli

Authors

Nurkalam Huseynli, Azerbaijan State Oil and Industry University, Master’s degree graduate of the Department of Mechatronics and Robotics
Orkhan Mammadli, Azerbaijan State Oil and Industry University, Master’s degree graduate of the Department of Mechatronics and Robotics; SOCAR Downstream, ERP Department

Keywords:

Research integrity, Paper mills, Forensic scientometrics, Random Forest, Biopython, Life sciences, Automated fraud detection

Abstract

The industrialization of research misconduct through "paper mills" is a systemic threat to the integrity of the global scientific record. In this paper, we evaluate a forensic machine learning approach for distinguishing authentic biological abstracts from fraudulent, template-generated manuscripts. We quantify the accuracy of a Random Forest classifier while prioritizing high precision, in order to minimize the risk of false accusations against legitimate researchers. Experiments were conducted on a balanced dataset of 382 abstracts (195 authentic life-science papers retrieved via Biopython and 187 fraudulent papers verified against the Retraction Watch database) to identify the "invisible" structural and technical fingerprints of automated fraud.
The classification strategy transforms text into numerical vectors using TF-IDF (Term Frequency-Inverse Document Frequency) over an n-gram range of (1, 2), that is, unigrams and bigrams. The Random Forest model performed robustly, achieving a mean cross-validation accuracy of 92.07% with a stability margin of ± 1.37%. On unseen test data, the model achieved a precision of 98.0% for the fraudulent class, correctly identifying 159 paper-mill products while producing only 4 false positives. This emphasis on precision makes the model suitable as a conservative "gatekeeper" in editorial workflows, where the ethical cost of a false positive significantly outweighs the need for total recall.
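The vectorization and classification steps described above can be sketched with scikit-learn. This is a minimal illustration, not the authors' code: the choice of scikit-learn, the toy sentences, the number of trees, the fold count, and the `build_model` helper are all assumptions made for the example. Only the TF-IDF weighting, the (1, 2) n-gram range, and the Random Forest classifier come from the abstract.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Invented placeholder sentences, NOT taken from the study's dataset.
authentic = [
    "p53 binds the promoter and induces apoptosis in hepatocytes",
    "knockout mice showed reduced expression of the kinase gene",
    "the enzyme catalyses hydrolysis of the membrane phospholipid",
    "rna sequencing revealed upregulated cytokine transcripts",
    "mutant yeast strains accumulated misfolded protein aggregates",
    "calcium influx triggered phosphorylation of the receptor",
]
fraudulent = [
    "this article presents a paper on a decision method and conclusion",
    "in this paper the article proposes a fuzzy decision model",
    "the conclusion of this article shows the paper is effective",
    "this paper studies a fuzzy system and gives a conclusion",
    "the article discusses the decision and the paper conclusion",
    "a fuzzy decision approach is given in this article and paper",
]
texts = authentic + fraudulent
labels = [0] * len(authentic) + [1] * len(fraudulent)  # 1 = fraudulent


def build_model(seed=0):
    """Hypothetical helper: TF-IDF over unigrams and bigrams, then a forest."""
    return make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        RandomForestClassifier(n_estimators=100, random_state=seed),
    )


model = build_model()
# Stratified 3-fold cross-validation; the fold count is an assumption.
scores = cross_val_score(model, texts, labels, cv=3, scoring="accuracy")
print(f"mean accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```

Wrapping the vectorizer and the classifier in one pipeline means the TF-IDF vocabulary is refitted on each training fold, so no vocabulary statistics leak from the held-out fold into training.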
Feature importance analysis identified a fundamental stylistic divergence between the two classes. Fraudulent abstracts were characterized by document-centric templates and generic academic markers such as "article," "paper," and "conclusion." Furthermore, the model identified "domain bleeding," where computational terms like "fuzzy" and "decision" were misapplied within biological contexts. In contrast, authentic biological abstracts were distinguished by high-density biological nomenclature and the preservation of technical metadata, specifically "sup" and "sub" XML tags, which serve as a proxy for professional database integration and typesetting.
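One plausible way for "sup" and "sub" to become features is that word-level tokenization strips the angle brackets from XML tags and keeps the tag names as ordinary tokens. The sketch below assumes a tokenizer equivalent to scikit-learn's default `TfidfVectorizer` token pattern (lowercasing, then words of two or more characters). Whether the study used that default is an assumption, and the example strings are invented.

```python
import re

# Same regular expression as scikit-learn's default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and keep runs of two or more word characters."""
    return TOKEN_PATTERN.findall(text.lower())


# Invented example: the same sentence with and without typesetting markup.
record_with_markup = "Ca<sup>2+</sup> influx raised CO<sub>2</sub> output"
plain_text = "Ca2+ influx raised CO2 output"

print(tokenize(record_with_markup))
print(tokenize(plain_text))
```

Under this tokenization each opening and closing tag contributes its name once, so a single superscript yields two "sup" tokens, while the plain-text version of the same sentence yields none.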
Our findings point towards a balance between forensic sensitivity and editorial safety. While the model showed a 15% rate of false negatives (missed frauds), its near-perfect precision ensures that flagged manuscripts are identified with high certainty. These results provide insights into the industrialized footprints of paper mills and inform the development of automated screening tools that can be integrated into the initial stages of peer review. Future research can explore the application of this framework to full-text manuscripts to identify further inconsistencies in methodology and data reporting.
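The trade-off between precision and missed frauds can be checked with a short calculation from the counts given in the abstract. The sketch assumes the 159 correct detections are measured against all 187 fraudulent abstracts, the reading under which the reported miss rate of about 15% comes out. The variable names are illustrative.

```python
# Counts stated in the abstract.
true_positives = 159   # fraudulent abstracts correctly flagged
false_positives = 4    # authentic abstracts wrongly flagged

# Assumption: all 187 fraudulent abstracts were scored.
fraudulent_total = 187
false_negatives = fraudulent_total - true_positives

# Precision: share of flagged abstracts that are truly fraudulent.
precision = true_positives / (true_positives + false_positives)
# Recall: share of fraudulent abstracts that were flagged.
recall = true_positives / fraudulent_total
# Miss rate: share of fraudulent abstracts that slipped through.
miss_rate = false_negatives / fraudulent_total

print(f"precision: {precision:.3f}")
print(f"recall:    {recall:.3f}")
print(f"miss rate: {miss_rate:.3f}")
```

Under this assumption the counts give a precision of roughly 0.98 and a recall of about 0.85. This is the conservative profile the abstract describes: few false alarms at the cost of some missed cases.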
