Draft:Imbalanced datasets in malware detection
In cybersecurity, imbalanced datasets pose a major challenge for training machine learning and deep learning models to detect malware.[1] In real-world security environments, malicious samples make up a very small share of observed data compared to benign samples, ranging from 0.01% to 2%.[2] This imbalance can bias traditional classifiers towards the majority (benign) class, so that they achieve high overall accuracy while failing to correctly identify malicious samples.[1]
Problem
Traditional machine learning models trained on imbalanced datasets tend to exhibit bias towards the majority class, resulting in poor precision and recall for malware detection.[1][2]
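The gap between accuracy and the malware-class metrics can be illustrated with a small worked example. The sketch below is not taken from the cited sources: the 1,000-sample dataset, the 0.5% malware ratio, and the "always benign" classifier are assumptions chosen only to show the arithmetic.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for the positive (malware) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, fn, tn


def metrics(y_true, y_pred):
    """Return accuracy plus precision, recall and F1 for the malware class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical dataset: 1,000 samples, 0.5% malware (label 1), rest benign (label 0).
y_true = [1] * 5 + [0] * 995

# Hypothetical degenerate classifier that labels every sample as benign.
always_benign = [0] * 1000

print(metrics(y_true, always_benign))
```

In this toy setting the classifier reaches an accuracy of 0.995 while its recall and F1 score on the malware class are both 0, which is why accuracy alone is a misleading measure on imbalanced data.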
Approaches
Prior to transformer-based solutions, several methods were examined to address class imbalance in software samples. These include sequence-based long short-term memory (LSTM) models, as well as statistical approaches such as n-gram language models. These approaches work well when the dataset is balanced, but their performance drops quickly when malware samples appear in realistic proportions.[2]
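As an illustration of the statistical approach, the following is a minimal sketch of a bigram (2-gram) language model classifier with add-one smoothing. It is not the implementation evaluated in the cited study: the activity names, the toy training sequences, and the `malware_prior` parameter are hypothetical and serve only to show the idea of scoring a sequence under one model per class.

```python
import math
from collections import Counter


class BigramModel:
    """Bigram language model over activity tokens with add-one smoothing."""

    def __init__(self, sequences, vocab):
        self.vocab_size = len(vocab)
        self.bigrams = Counter()   # counts of (previous, current) pairs
        self.contexts = Counter()  # counts of each previous token
        for seq in sequences:
            padded = ["<s>"] + list(seq)  # "<s>" marks the sequence start
            for prev, cur in zip(padded, padded[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    def log_prob(self, seq):
        """Log-probability of a sequence under the smoothed bigram model."""
        padded = ["<s>"] + list(seq)
        total = 0.0
        for prev, cur in zip(padded, padded[1:]):
            numerator = self.bigrams[(prev, cur)] + 1
            denominator = self.contexts[prev] + self.vocab_size
            total += math.log(numerator / denominator)
        return total


def classify(seq, benign_model, malware_model, malware_prior):
    """Pick the class with the higher posterior score (likelihood times prior)."""
    benign_score = benign_model.log_prob(seq) + math.log(1 - malware_prior)
    malware_score = malware_model.log_prob(seq) + math.log(malware_prior)
    return "malware" if malware_score > benign_score else "benign"


# Hypothetical toy training data (activity names are made up for illustration).
benign_train = [["open_app", "read_file", "close_app"],
                ["open_app", "show_ui", "close_app"]]
malware_train = [["open_app", "read_contacts", "send_sms"],
                 ["open_app", "read_contacts", "upload"]]
vocab = {tok for seq in benign_train + malware_train for tok in seq}

benign_model = BigramModel(benign_train, vocab)
malware_model = BigramModel(malware_train, vocab)

sample = ["open_app", "read_contacts", "send_sms"]
print(classify(sample, benign_model, malware_model, malware_prior=0.5))
print(classify(sample, benign_model, malware_model, malware_prior=0.005))
```

With equal priors the toy sample is labelled as malware, but with a prior of 0.005 the same sequence is labelled as benign. This is only a toy effect of the class prior and should not be read as a reproduction of the reported results.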
BERT-Based Solution
Recent research has explored the use of BERT, a language model originally developed for natural language processing, to address highly imbalanced datasets in malware detection.[2][3] By treating application activity sequences as natural language data, BERT-based methods have reported improved performance. One study found that BERT achieved an F1 score of 0.919 on datasets with only 0.5% malware samples, significantly outperforming traditional approaches.[2]
This approach works by:
- Analyzing sequences of application activities rather than individual features
- Using BERT's pre-trained language model capabilities
- Fine-tuning on Android activity sequence data
This method addresses the fundamental class imbalance problem in cybersecurity data, which is otherwise commonly handled through oversampling and undersampling, in a setting where malicious samples are extremely rare.[2]
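Treating an activity sequence as a sentence means each activity becomes a token that is mapped to an integer ID before it reaches the model. The sketch below shows one possible preprocessing step using BERT-style special tokens. It is an assumption for illustration, not the pipeline of the cited study: the activity names, the `build_vocab` and `encode` helpers, and the custom vocabulary are hypothetical, and a real fine-tuning setup would use the tokenizer that matches the chosen pre-trained checkpoint.

```python
# BERT-style special tokens: padding, unknown token, sequence start, sequence end.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]"]


def build_vocab(sequences):
    """Assign an integer ID to every special token and every observed activity."""
    vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    for seq in sequences:
        for activity in seq:
            if activity not in vocab:
                vocab[activity] = len(vocab)
    return vocab


def encode(seq, vocab, max_len):
    """Convert one activity sequence into fixed-length token IDs and a mask."""
    # Truncate so that [CLS] and [SEP] still fit within max_len.
    tokens = ["[CLS]"] + list(seq)[: max_len - 2] + ["[SEP]"]
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    attention_mask = [1] * len(ids)  # 1 marks real tokens
    padding = max_len - len(ids)
    ids += [vocab["[PAD]"]] * padding
    attention_mask += [0] * padding  # 0 marks padding positions
    return ids, attention_mask


# Hypothetical activity sequences (names are made up for illustration).
train_sequences = [["MainActivity", "LoginActivity"],
                   ["MainActivity", "SettingsActivity"]]
vocab = build_vocab(train_sequences)

ids, mask = encode(["MainActivity", "UnknownActivity"], vocab, max_len=6)
print(ids)
print(mask)
```

Here the unseen `UnknownActivity` falls back to the `[UNK]` ID, and the attention mask tells the model which positions are padding. The resulting IDs and mask are the kind of inputs a sequence classifier would be fine-tuned on.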
References
1. Almajed, Hussain; Alsaqer, Abdulrahman; Frikha, Mounir (2025). "Imbalance Datasets in Malware Detection: A Review of Current Solutions and Future Directions". International Journal of Advanced Computer Science and Applications. 16 (1). doi:10.14569/IJACSA.2025.01601126.
2. Oak, Rajvardhan; Du, Min; Yan, David; Takawale, Harshvardhan; Amit, Idan (11 November 2019). "Malware Detection on Highly Imbalanced Data through Sequence Modeling". Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security. ACM. pp. 37–48. doi:10.1145/3338501.3357374. ISBN 978-1-4503-6833-9.
3. Demirkıran, Ferhat; Çayır, Aykut; Ünal, Uğur; Dağ, Hasan (22 June 2022). "An Ensemble of Pre-trained Transformer Models For Imbalanced Multiclass Malware Classification". arXiv:2112.13236.