GitHub - WusamX/DetectMaliciousURL: Applying text model to Detection Task

Name	Name	Last commit message	Last commit date
Latest commit History 27 Commits
data	data
model	model
server	server
LICENSE	LICENSE
README.md	README.md
REQUIREMENTS	REQUIREMENTS

Characterstic

Using Word2Vec+CNN to detect the Malicious URL and it's a really exquisite structure!
Finially result about 96.2% precision
High scalability supporting for Distributed System
Supporting for Online Learning

Requirements

Tensorflow 1.1.0
Numpy
Gensim 2.0.0

Training

python train.py --help
usage: train.py [-h] [--data_file DATA_FILE] [--num_labels NUM_LABELS]
 [--embedding_dim EMBEDDING_DIM] [--filter_sizes FILTER_SIZES]
 [--num_filters NUM_FILTERS]
 [--dropout_keep_prob DROPOUT_KEEP_PROB]
 [--l2_reg_lambda L2_REG_LAMBDA] [--batch_size BATCH_SIZE]
 [--num_epochs NUM_EPOCHS] [--evaluate_every EVALUATE_EVERY]
 [--checkpoint_every CHECKPOINT_EVERY]
 [--num_checkpoints NUM_CHECKPOINTS]
 [--allow_soft_placement [ALLOW_SOFT_PLACEMENT]]
 [--noallow_soft_placement]
 [--log_device_placement [LOG_DEVICE_PLACEMENT]]
 [--nolog_device_placement]
 [--noreplicas] [--is_sync [IS_SYNC]] [--nois_sync]
 [--ps_hosts PS_HOSTS] [--worker_hosts WORKER_HOSTS]
 [--job_name JOB_NAME] [--task_index TASK_INDEX]
 [--log_dir LOG_DIR]
 optional arguments:
 -h, --help show this help message and exit
 --data_file DATA_FILE
 Data source
 --num_labels NUM_LABELS
 Number of labels for data. (default: 2)
 --embedding_dim EMBEDDING_DIM
 Dimensionality of character embedding (default: 128)
 --filter_sizes FILTER_SIZES
 Comma-spearated filter sizes (default: '3,4,5')
 --num_filters NUM_FILTERS
 Number of filters per filter size (default: 128)
 --dropout_keep_prob DROPOUT_KEEP_PROB
 Dropout keep probability (default: 0.5)
 --l2_reg_lambda L2_REG_LAMBDA
 L2 regularization lambda (default: 0.0)
 --batch_size BATCH_SIZE
 Batch Size (default: 64)
 --num_epochs NUM_EPOCHS
 Number of training epochs (default: 200)
 --evaluate_every EVALUATE_EVERY
 Evalue model on dev set after this many steps
 (default: 100)
 --checkpoint_every CHECKPOINT_EVERY
 Save model after this many steps (defult: 100)
 --num_checkpoints NUM_CHECKPOINTS
 Number of checkpoints to store (default: 5)
 --allow_soft_placement [ALLOW_SOFT_PLACEMENT]
 Allow device soft device placement
 --noallow_soft_placement
 --log_device_placement [LOG_DEVICE_PLACEMENT]
 Log placement of ops on devices
 --nolog_device_placement
 --replicas [REPLICAS]
 Use the dirstribution mode
 --noreplicas
 --is_sync [IS_SYNC] Use the async or sync mode
 --nois_sync
 --ps_hosts PS_HOSTS comma-separated lst of hostname:port pairs
 --worker_hosts WORKER_HOSTS
 comma-separated lst of hostname:port pairs
 --job_name JOB_NAME job name:worker or ps
 --task_index TASK_INDEX
 Worker task index,should be >=0, task=0 is the master
 worker task the performs the variable initialization
 --log_dir LOG_DIR parameter and log info

Distribution

Let's take 192.168.0.107 as ps server , 10.211.55.13 and 10.211.55.14 as training server.
Make every machine has a copy of the code.

Async-parallelism mode:

image

 On 192.168.0.107:
 python train.py --replicas=True --job_name=ps --task_index=0 --ps_hosts=192.168.0.107:2222\
 --worker_hosts=10.211.55.13:2222,10.211.55.14:2222
 On 10.211.55.13:
 python train.py --replicas=True --job_name=worker --task_index=0 --ps_hosts=192.168.0.107:2222\
 --worker_hosts=10.211.55.13:2222,10.211.55.14:2222 
 On 10.211.55.14:
 python train.py --replicas=True --job_name=worker --task_index=1 --ps_hosts=192.168.0.107:2222\
 --worker_hosts=10.211.55.13:2222,10.211.55.14:2222

Sync-parallelism mode:

image

 On 192.168.0.107:
 python train.py --replicas=True --is_sync=True --job_name=ps --task_index=0 --ps_hosts=192.168.0.107:2222\
 --worker_hosts=10.211.55.13:2222,10.211.55.14:2222
 On 10.211.55.13:
 python train.py --replicas=True --is_sync=True --job_name=worker --task_index=0 --ps_hosts=192.168.0.107:2222\
 --worker_hosts=10.211.55.13:2222,10.211.55.14:2222 
 On 10.211.55.14:
 python train.py --replicas=True --is_sync=True --job_name=worker --task_index=1 --ps_hosts=192.168.0.107:2222\
 --worker_hosts=10.211.55.13:2222,10.211.55.14:2222

Evaluation

 python eval.py --help 
 usage: eval.py [-h] [--input_text_file INPUT_TEXT_FILE][--single_url SINGLE_URL]
 [--input_label_file INPUT_LABEL_FILE] [--batch_size BATCH_SIZE]
 [--checkpoint_dir CHECKPOINT_DIR] [--eval_train [EVAL_TRAIN]]
 [--noeval_train]
 [--allow_soft_placement [ALLOW_SOFT_PLACEMENT]]
 [--noallow_soft_placement]
 [--log_device_placement [LOG_DEVICE_PLACEMENT]]
 [--nolog_device_placement]
python eval.py --checkpoint_dir ./runs/{TIME_DIR}/checkpoints}

Single URL Detection

python eval.py --checkpoint_dir ./runs/{TIME_DIR}/checkpoints} --single_url=hottraveljobs.com/forum/docs/info.php

Here I use the defualt checkpoint_dir to detection single_url

python eval.py --single_url=hottraveljobs.com/forum/docs/info.php

Panel Testing

python eval.py --checkpoint_dir ./runs/{TIME_DIR}/checkpoints} --input_text_file="../data/data2.csv"

HTTP Server API

This is the HTTP service to load TensorFlow model and inference to predict malicious url.

Usage

Run HTTP server with [Django] and use HTTP client under /server

 ./manage.py runserver 0.0.0.0:8000

Inference to predict url

Use url as your GET parameter

 127.0.0.1:8000/detection/predict/?url=appst0re.net/upload.aspx

And you will get

Success to predict appst0re.net/upload.aspx, result: bad

Implementation

django-admin startproject server
python manage.py startapp detection
#Add customized urls and views.

References

[1] Cnn-Text-Classification-TF

[2] Convolutional Neural Networks for Sentence Classification

[3] Using Word2Vec+ CNN to Detect Malicious URL

[4] deep_recommend_system

[5] using-machine-learning-detect-malicious-urls

[6] Malware URLs

[7]Malicious URL Detection using Machine Learning

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WusamX/DetectMaliciousURL

Folders and files

Latest commit

History

Repository files navigation

Characterstic

Requirements

Training

Distribution

Async-parallelism mode:

Sync-parallelism mode:

Evaluation

Single URL Detection

Panel Testing

HTTP Server API

Usage

Inference to predict url

Implementation

References

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Characterstic

Requirements

Training

Distribution

Async-parallelism mode:

Sync-parallelism mode:

Evaluation

Single URL Detection

Panel Testing

HTTP Server API

Usage

Inference to predict url

Implementation

References

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages