Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

chrisPiemonte/url2vec

Repository files navigation

alt url2vec Url2vec

Abstract

In this thesis a new methodology for clustering Web pages is discussed, using Random Walks between pages, together with their textual content, to learn vector representations for nodes in the web graph. Url2vec is implemented to extract clusters of pages of the same semantic type. Unlike the clustering algorithms proposed in literature, Url2Vec does not consider a website as a collection of text documents independent from each other, but tries to combine information about the content of the pages and the structure of the website.

The experimental results produced proved to be discreet and encouraged to follow the studies in this direction to identify new ways to improve the results achieved in terms of quality.

Setup

I suggest to setup a virtual environment using miniconda

  1. Create an environment with python 2.7:
conda create --name url2vec python=2.7
  1. Install requirements:
pip install -r ./requirements.txt
  1. To check the examples:
jupyter-notebook ./notebooks

Releases

No releases published

Packages

No packages published

Contributors 2

Languages

AltStyle によって変換されたページ (->オリジナル) /