WusamX/DetectMaliciousURL


Characteristics

  • Uses Word2Vec + CNN to detect malicious URLs, with a compact and elegant architecture

  • Final result: about 96.2% precision

  • Highly scalable, with support for distributed training

  • Supports online learning
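
For readers unfamiliar with the metric quoted above: precision is the share of URLs flagged as malicious that really are malicious. A minimal sketch follows. The counts are hypothetical, chosen only so the arithmetic lands on 0.962, and are not taken from this repository's evaluation.

```python
def precision(true_positives, false_positives):
    """Fraction of URLs predicted malicious that are actually malicious."""
    predicted_positive = true_positives + false_positives
    if predicted_positive == 0:
        # No URL was flagged at all, so precision is undefined; report 0.0.
        return 0.0
    return true_positives / predicted_positive


# Hypothetical counts for illustration only (not results from this project).
score = precision(true_positives=962, false_positives=38)
print(round(score, 3))
```

Precision says nothing about malicious URLs that were missed, so it is usually read alongside recall.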

Requirements

  • Tensorflow 1.1.0
  • Numpy
  • Gensim 2.0.0

Training

python train.py --help
usage: train.py [-h] [--data_file DATA_FILE] [--num_labels NUM_LABELS]
 [--embedding_dim EMBEDDING_DIM] [--filter_sizes FILTER_SIZES]
 [--num_filters NUM_FILTERS]
 [--dropout_keep_prob DROPOUT_KEEP_PROB]
 [--l2_reg_lambda L2_REG_LAMBDA] [--batch_size BATCH_SIZE]
 [--num_epochs NUM_EPOCHS] [--evaluate_every EVALUATE_EVERY]
 [--checkpoint_every CHECKPOINT_EVERY]
 [--num_checkpoints NUM_CHECKPOINTS]
 [--allow_soft_placement [ALLOW_SOFT_PLACEMENT]]
 [--noallow_soft_placement]
 [--log_device_placement [LOG_DEVICE_PLACEMENT]]
 [--nolog_device_placement]
 [--noreplicas] [--is_sync [IS_SYNC]] [--nois_sync]
 [--ps_hosts PS_HOSTS] [--worker_hosts WORKER_HOSTS]
 [--job_name JOB_NAME] [--task_index TASK_INDEX]
 [--log_dir LOG_DIR]
 optional arguments:
 -h, --help show this help message and exit
 --data_file DATA_FILE
 Data source
 --num_labels NUM_LABELS
 Number of labels for data. (default: 2)
 --embedding_dim EMBEDDING_DIM
 Dimensionality of character embedding (default: 128)
 --filter_sizes FILTER_SIZES
 Comma-spearated filter sizes (default: '3,4,5')
 --num_filters NUM_FILTERS
 Number of filters per filter size (default: 128)
 --dropout_keep_prob DROPOUT_KEEP_PROB
 Dropout keep probability (default: 0.5)
 --l2_reg_lambda L2_REG_LAMBDA
 L2 regularization lambda (default: 0.0)
 --batch_size BATCH_SIZE
 Batch Size (default: 64)
 --num_epochs NUM_EPOCHS
 Number of training epochs (default: 200)
 --evaluate_every EVALUATE_EVERY
 Evalue model on dev set after this many steps
 (default: 100)
 --checkpoint_every CHECKPOINT_EVERY
 Save model after this many steps (defult: 100)
 --num_checkpoints NUM_CHECKPOINTS
 Number of checkpoints to store (default: 5)
 --allow_soft_placement [ALLOW_SOFT_PLACEMENT]
 Allow device soft device placement
 --noallow_soft_placement
 --log_device_placement [LOG_DEVICE_PLACEMENT]
 Log placement of ops on devices
 --nolog_device_placement
 --replicas [REPLICAS]
 Use the dirstribution mode
 --noreplicas
 --is_sync [IS_SYNC] Use the async or sync mode
 --nois_sync
 --ps_hosts PS_HOSTS comma-separated lst of hostname:port pairs
 --worker_hosts WORKER_HOSTS
 comma-separated lst of hostname:port pairs
 --job_name JOB_NAME job name:worker or ps
 --task_index TASK_INDEX
 Worker task index,should be >=0, task=0 is the master
 worker task the performs the variable initialization
 --log_dir LOG_DIR parameter and log info 
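
The --filter_sizes and --num_filters flags work together. In the CNN text-classification design described in references [1] and [2], each filter size gets its own bank of filters, and the max-pooled outputs of all banks are concatenated. The sketch below parses the flag value the way such scripts commonly do and computes the resulting sizes. The helper names and the sequence length of 200 are assumptions for illustration, and train.py may differ in detail.

```python
def parse_filter_sizes(flag_value):
    """Turn a comma-separated flag such as '3,4,5' into a list of ints."""
    return [int(part) for part in flag_value.split(",") if part.strip()]


def conv_output_length(sequence_length, filter_size):
    """Number of window positions for an unpadded ('VALID') 1-D convolution."""
    return sequence_length - filter_size + 1


def pooled_feature_width(filter_sizes, num_filters):
    """Width of the concatenated vector after max-pooling each filter bank."""
    return len(filter_sizes) * num_filters


sizes = parse_filter_sizes("3,4,5")          # default from the help text
width = pooled_feature_width(sizes, 128)     # default --num_filters
# Hypothetical input length of 200 tokens, used only to show the arithmetic.
windows = [conv_output_length(200, size) for size in sizes]
print(sizes, width, windows)
```

With the defaults, the classifier layer would therefore see 3 × 128 = 384 features per URL under this design.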

Distribution

In the examples below, 192.168.0.107 is the parameter server (ps) and 10.211.55.13 and 10.211.55.14 are the training servers (workers).
Make sure every machine has a copy of the code.
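
The --ps_hosts, --worker_hosts, --job_name and --task_index flags together describe one cluster and this process's place in it. A minimal sketch of that mapping, in plain Python, is shown below. The function names are hypothetical. In TensorFlow 1.x a dictionary of this shape is what is typically passed to tf.train.ClusterSpec, but whether train.py builds it exactly this way is an assumption.

```python
def build_cluster(ps_hosts, worker_hosts):
    """Map each job name to its list of host:port addresses."""
    return {
        "ps": ps_hosts.split(","),
        "worker": worker_hosts.split(","),
    }


def task_address(cluster, job_name, task_index):
    """Address this process listens on, given its job name and task index."""
    return cluster[job_name][task_index]


cluster = build_cluster(
    "192.168.0.107:2222",
    "10.211.55.13:2222,10.211.55.14:2222",
)
print(task_address(cluster, "ps", 0))
print(task_address(cluster, "worker", 1))
```

Every process receives the same two host lists. Only --job_name and --task_index change from machine to machine.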

Async-parallelism mode:

 On 192.168.0.107:
 python train.py --replicas=True --job_name=ps --task_index=0 --ps_hosts=192.168.0.107:2222\
 --worker_hosts=10.211.55.13:2222,10.211.55.14:2222
 On 10.211.55.13:
 python train.py --replicas=True --job_name=worker --task_index=0 --ps_hosts=192.168.0.107:2222\
 --worker_hosts=10.211.55.13:2222,10.211.55.14:2222 
 On 10.211.55.14:
 python train.py --replicas=True --job_name=worker --task_index=1 --ps_hosts=192.168.0.107:2222\
 --worker_hosts=10.211.55.13:2222,10.211.55.14:2222 

Sync-parallelism mode:

 On 192.168.0.107:
 python train.py --replicas=True --is_sync=True --job_name=ps --task_index=0 --ps_hosts=192.168.0.107:2222\
 --worker_hosts=10.211.55.13:2222,10.211.55.14:2222
 On 10.211.55.13:
 python train.py --replicas=True --is_sync=True --job_name=worker --task_index=0 --ps_hosts=192.168.0.107:2222\
 --worker_hosts=10.211.55.13:2222,10.211.55.14:2222 
 On 10.211.55.14:
 python train.py --replicas=True --is_sync=True --job_name=worker --task_index=1 --ps_hosts=192.168.0.107:2222\
 --worker_hosts=10.211.55.13:2222,10.211.55.14:2222 
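
The two modes above differ in how worker gradients reach the shared parameters. The toy sketch below is an illustration only, not code from train.py. In sync mode the gradients from all workers are averaged into one update. In async mode each worker's gradient is applied as soon as it arrives, even if it was computed from parameters that are already stale. The quadratic loss, the targets, and the function names are all assumptions made for the example.

```python
def gradient(weight, target):
    """Gradient of the toy loss 0.5 * (weight - target) ** 2."""
    return weight - target


def sync_update(weight, targets, learning_rate):
    """All workers compute gradients on the same weight; one averaged step."""
    grads = [gradient(weight, target) for target in targets]
    return weight - learning_rate * sum(grads) / len(grads)


def async_update(weight, targets, learning_rate):
    """Each worker's gradient is applied on its own, one after another.

    Both gradients are computed from the same starting snapshot, so the
    second one is stale by the time it is applied.
    """
    stale_grads = [gradient(weight, target) for target in targets]
    for grad in stale_grads:
        weight = weight - learning_rate * grad
    return weight


# Two workers, each holding a different (hypothetical) data shard.
targets = [1.0, 3.0]
print(sync_update(0.0, targets, learning_rate=0.5))
print(async_update(0.0, targets, learning_rate=0.5))
```

Sync mode takes one step per round and waits for the slowest worker. Async mode takes one step per worker and never waits, at the cost of noisier updates.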

Evaluation

 python eval.py --help 
 usage: eval.py [-h] [--input_text_file INPUT_TEXT_FILE][--single_url SINGLE_URL]
 [--input_label_file INPUT_LABEL_FILE] [--batch_size BATCH_SIZE]
 [--checkpoint_dir CHECKPOINT_DIR] [--eval_train [EVAL_TRAIN]]
 [--noeval_train]
 [--allow_soft_placement [ALLOW_SOFT_PLACEMENT]]
 [--noallow_soft_placement]
 [--log_device_placement [LOG_DEVICE_PLACEMENT]]
 [--nolog_device_placement]
python eval.py --checkpoint_dir ./runs/{TIME_DIR}/checkpoints

Single URL Detection

python eval.py --checkpoint_dir ./runs/{TIME_DIR}/checkpoints --single_url=hottraveljobs.com/forum/docs/info.php

To use the default checkpoint_dir, pass only the URL:

python eval.py --single_url=hottraveljobs.com/forum/docs/info.php

Panel Testing

python eval.py --checkpoint_dir ./runs/{TIME_DIR}/checkpoints --input_text_file="../data/data2.csv"

HTTP Server API

This HTTP service loads the TensorFlow model and runs inference to predict whether a URL is malicious.

Usage

Run the HTTP server with Django and use the HTTP client under /server:

 ./manage.py runserver 0.0.0.0:8000

Predicting a URL

Pass the URL to check as the url GET parameter:

 127.0.0.1:8000/detection/predict/?url=appst0re.net/upload.aspx

The response will be:

Success to predict appst0re.net/upload.aspx, result: bad
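
The same request can be made from a script. The sketch below uses only the Python standard library. The helper names are hypothetical, and the parsing assumes the response body has exactly the "result: ..." form shown above, which may not hold for every version of the server.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_predict_url(target, host="127.0.0.1:8000"):
    """Build the request URL, percent-encoding the URL under test."""
    query = urlencode({"url": target})
    return "http://{}/detection/predict/?{}".format(host, query)


def parse_result(body):
    """Extract the label from a body like '..., result: bad' (assumed format)."""
    return body.rsplit("result:", 1)[-1].strip()


def predict(target, host="127.0.0.1:8000", timeout=10):
    """Send the GET request to a running server and return the parsed label."""
    with urlopen(build_predict_url(target, host), timeout=timeout) as response:
        return parse_result(response.read().decode("utf-8"))


# With the server running locally, a call would look like:
#     predict("appst0re.net/upload.aspx")
print(build_predict_url("appst0re.net/upload.aspx"))
```

Encoding the target matters once the URL under test carries its own query string, since an unescaped & or ? would otherwise be read as part of the request to the detection server.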

Implementation

django-admin startproject server
python manage.py startapp detection
# Add customized urls and views.

References

[1] Cnn-Text-Classification-TF

[2] Convolutional Neural Networks for Sentence Classification

[3] Using Word2Vec+ CNN to Detect Malicious URL

[4] deep_recommend_system

[5] using-machine-learning-detect-malicious-urls

[6] Malware URLs

[7] Malicious URL Detection using Machine Learning
