
Commit 91fc7bd

Merge pull request avinashkranjan#421 from vybhav72954/iss_414
Added Text Rank method for summarization
2 parents 661ff75 + 0c675d5 commit 91fc7bd

File tree

5 files changed: +195 -0 lines changed

Text_Summary/README.md

Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
# Text_Summary (Text Rank Approach)

[![forthebadge made-with-python](http://ForTheBadge.com/images/badges/made-with-python.svg)](https://www.python.org/)

Text Summarization is an advanced project that falls under the umbrella of Natural Language Processing.
There are multiple methods people use to summarize text.

They can be effectively grouped into two approaches:

- Abstractive: Understand the true context of the text before summarizing (like a human would).
- Extractive: Rank the text within the file and identify the impactful terms.

While both approaches are still under research, extractive summarization is the one presently used across multiple platforms.
There are multiple methods for summarizing text under the extractive approach as well.

In this script we will use the Text Rank approach for text summarization: sentences are scored with PageRank over a sentence-similarity graph, and the top-ranked sentences form the summary (see the update rule below).
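
The underlying TextRank update, which `nx.pagerank` approximates with a damping factor d (0.85 by default), scores a sentence V_i from its weighted neighbours (notation follows the original TextRank formulation; w_ji is the sentence-to-sentence similarity):

```latex
S(V_i) = (1 - d) + d \sum_{V_j \in \mathrm{In}(V_i)} \frac{w_{ji}}{\sum_{V_k \in \mathrm{Out}(V_j)} w_{jk}} S(V_j)
```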
## Dependencies

- nltk
- numpy
- networkx

## NLTK models

- `stopwords` - Stopwords are common English words which do not add much meaning to a sentence.

## Setup

- Set up a `python 3.x` virtual environment.
- `Activate` the environment.
- Install the dependencies using

```bash
pip3 install -r requirements.txt
```

- Set up the models by running the following command:

```bash
$ python -m nltk.downloader stopwords
```

- Run the `text_summary.py` file.
- Enter the source path.
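
For reference, one possible end-to-end setup on Linux/macOS (assuming the standard `venv` module; adjust the activation command for your shell):

```bash
$ python3 -m venv venv
$ source venv/bin/activate
$ pip3 install -r requirements.txt
$ python -m nltk.downloader stopwords
$ python text_summary.py
```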

## Results

The script assigns a weight (PageRank score) to every sentence, reflecting its relative importance according to the summarizer. To print the ranked sentences together with their scores, uncomment the `print` of `ranked_sentence` in `generate_summary` inside `text_summary.py`.

Results can be found [here](assets).
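
As a reference for what those scores look like, here is a minimal sketch on a hypothetical 3x3 similarity matrix (toy values, not taken from the assets):

```python
import networkx as nx
import numpy as np

# Hypothetical sentence-similarity matrix with a zero diagonal.
sim = np.array([[0.0, 0.5, 0.1],
                [0.5, 0.0, 0.2],
                [0.1, 0.2, 0.0]])

graph = nx.from_numpy_array(sim)   # similarities become edge weights
scores = nx.pagerank(graph)        # dict: sentence index -> weight
print(scores)                      # higher score = more central sentence
```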

## Author(s)

Made by [Vybhav Chaturvedi](https://www.linkedin.com/in/vybhav-chaturvedi-0ba82614a/)

Text_Summary/assets/random.txt

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
Music retrieval involves searching for music that is played over loudspeakers in public places such as a coffee shop or shopping mall, or even on the street. However, in such environments, the music is accompanied by background noise such as people’s voices, vehicular noises, or the sound of machinery.
Recently, content-based music information retrieval (MIR) systems for mobile devices have attracted great interest. MIR systems perform various functionalities such as music recommendation and music recognition. MIR applications such as Shazam, SoundHound, and Gracenote have already been developed for the iPhone, iPad, and other such mobile devices.
To develop a music retrieval system, first, it is necessary to create an audio fingerprint that can be matched against those stored in a music database. An audio fingerprint contains short summary information of an audio or a perceptual piece of audio content.
Then, to improve the retrieval rate, when a query is input in a noisy environment, first, it is necessary to find candidate audio matching the query from a lookup table (LUT). This increases the probability of correct music identification. In this study, we evaluate various pre-processing methods for a hash-based fingerprint system and determine the best one by determining the accuracy of searching for a query from an LUT.
Pre-processing can be carried out using various approaches such as normalization, noise reduction, and filtering. In this study, we evaluate the search accuracy of various such methods by calculating the number of exact matches when searching from robust fingerprints

Text_Summary/assets/random_textRank.txt

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
MIR systems perform various functionalities such as music recommendation and music recognition
MIR applications such as Shazam, SoundHound, and Gracenote have already been developed for the iPhone, iPad, and other such mobile devices.To develop a music retrieval system, first, it is necessary to create an audio fingerprint that can be matched against those stored in a music database
Music retrieval involves searching for music that is played over loudspeakers in public places such as a coffee shop or shopping mall, or even on the street
However, in such environments, the music is accompanied by background noise such as people’s voices, vehicular noises, or the sound of machinery.Recently, content-based music information retrieval (MIR) systems for mobile devices have attracted great interest
This increases the probability of correct music identification

Text_Summary/requirements.txt

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
nltk==3.2.4
numpy==1.19.5
networkx==2.5

Text_Summary/text_summary.py

Lines changed: 126 additions & 0 deletions
@@ -0,0 +1,126 @@
#!/usr/bin/env python
# coding: utf-8

# Imports
from nltk.corpus import stopwords
from nltk.cluster.util import cosine_distance
import numpy as np
import networkx as nx

# Enter the file path
file_name = input("Enter the Source File: ")
print("This script requires 'stopwords' from NLTK, see README. "
      "Quick Download Command: ```python -m nltk.downloader stopwords```")


def read_article(file_name):
    """
    Reads the text file and converts it into sentences.
    :param file_name: Path of the text file (taken from the input prompt above)
    :return: List of sentences, each a list of word tokens
    """
    with open(file_name, 'r', encoding="utf-8") as file:
        filedata = file.read().replace("\n", " ")
    article = filedata.split(". ")
    sentences = []

    for sentence in article:
        # Uncomment if you want to print the whole file on screen.
        # print(sentence)
        sentences.append(sentence.split(" "))
    sentences.pop()  # drop the trailing fragment after the final ". "

    return sentences

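# Example (hypothetical input): a file containing "Cats sleep. Dogs bark. "
# splits on ". " into ["Cats sleep", "Dogs bark", ""]; after tokenizing and
# the final pop(), read_article returns [["Cats", "sleep"], ["Dogs", "bark"]].
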

def sentence_similarity(sent1, sent2, stopwords=None):
    """
    Determine the cosine similarity between two sentences.
    :param sent1: Word tokens of sentence 1
    :param sent2: Word tokens of sentence 2
    :param stopwords: Words to be ignored in the vectors (see README.md)
    :return: Cosine similarity score
    """
    if stopwords is None:
        stopwords = []

    sent1 = [w.lower() for w in sent1]
    sent2 = [w.lower() for w in sent2]

    all_words = list(set(sent1 + sent2))

    vector1 = [0] * len(all_words)
    vector2 = [0] * len(all_words)

    # Build the count vector for the first sentence
    for w in sent1:
        if w in stopwords:
            continue
        vector1[all_words.index(w)] += 1

    # Build the count vector for the second sentence
    for w in sent2:
        if w in stopwords:
            continue
        vector2[all_words.index(w)] += 1

    return 1 - cosine_distance(vector1, vector2)

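# Worked example (hypothetical tokens, no stopwords):
#   sentence_similarity(["the", "cat", "sat"], ["the", "cat", "ran"])
# builds count vectors over {the, cat, sat, ran} -> [1, 1, 1, 0] and
# [1, 1, 0, 1], so the score is 1 - cosine_distance = 2/3.
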

def build_similarity_matrix(sentences, stop_words):
    """
    Build the pairwise similarity matrix of the sentences.
    :param sentences: Clean sentences (lists of word tokens)
    :param stop_words: Words to be ignored in the vectors (see README.md)
    :return: Similarity matrix
    """
    # Create an empty similarity matrix
    similarity_matrix = np.zeros((len(sentences), len(sentences)))

    for idx1 in range(len(sentences)):
        for idx2 in range(len(sentences)):
            if idx1 == idx2:  # skip comparing a sentence with itself
                continue
            similarity_matrix[idx1][idx2] = sentence_similarity(sentences[idx1], sentences[idx2], stop_words)

    return similarity_matrix

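# For N sentences this yields an N x N matrix with a zero diagonal whose
# (i, j) entry is the similarity of sentence i to sentence j;
# nx.from_numpy_array() below treats these entries as edge weights.
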

def generate_summary(file_name, top_n=5):
    """
    Generate a summary of the text file.
    :param file_name: Path of the text file (taken from the input prompt above)
    :param top_n: Number of top-ranked sentences to include in the summary
    :return: None; the summary is written next to the source file
    """
    stop_words = stopwords.words('english')
    summarize_text = []

    # Step 1 - Read the text and split it into sentences
    sentences = read_article(file_name)

    # Step 2 - Generate the similarity matrix across sentences
    sentence_similarity_matrix = build_similarity_matrix(sentences, stop_words)

    # Step 3 - Rank sentences in the similarity matrix
    sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_matrix)
    scores = nx.pagerank(sentence_similarity_graph)

    # Step 4 - Sort by score and pick the top sentences
    ranked_sentence = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)

    # Print the ranked sentences with their scores
    # print("Indexes of top ranked_sentence order are ", ranked_sentence)

    for i in range(min(top_n, len(ranked_sentence))):  # guard against files with fewer sentences
        summarize_text.append(" ".join(ranked_sentence[i][1]))

    # Step 5 - Write the summary next to the source text file
    filepath_index = file_name.find('.txt')
    outputpath = file_name[:filepath_index] + '_textRank.txt'

    with open(outputpath, 'w', encoding="utf-8") as w:  # utf-8 keeps characters such as curly quotes intact
        for sentence in summarize_text:
            w.write(str(sentence) + '\n')


generate_summary(file_name, 5)
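# Example run (assumed layout): entering "assets/random.txt" at the prompt
# writes the summary to "assets/random_textRank.txt" next to the input file.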
