Commit 91fc7bd

authored

Merge pull request avinashkranjan#421 from vybhav72954/iss_414

Added Text Rank method for summarization

2 parents 661ff75 + 0c675d5 commit 91fc7bd

File tree

5 files changed, +195 -0 lines changed

Text_Summary

`‎Text_Summary/README.md`

Lines changed: 56 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,56 @@`
	`1`	`+# Text_Summary (Text Rank Approach)`
	`2`	`+`
	`3`	`+[![forthebadge made-with-python](http://ForTheBadge.com/images/badges/made-with-python.svg)](https://www.python.org/)`
	`4`	`+`
	`5`	`+Text Summarization is an advanced project and comes under the umbrella of Natural Language Processing.`
	`6`	`+There are multiple methods people use in order to summarize text.`
	`7`	`+`
	`8`	`+They can be effectively grouped into 2 categories:`
	`9`	`+`
	`10`	`+- Abstractive: Understand the true context of text before summarization (like a human).`
	`11`	`+- Extractive: Rank the sentences within the text and extract the most impactful ones.`
	`12`	`+`
	`13`	`+While both these approaches are under research, extractive summarization is presently used across multiple platforms.`
	`14`	`+There are multiple methods by which text is summarized under the extractive approach as well.`
	`15`	`+`
	`16`	`+In this script we will use the Text Rank approach for text summarization.`
	`17`	`+`
	`18`	`+## Dependencies`
	`19`	`+`
	`20`	`+- nltk`
	`21`	`+- numpy`
	`22`	`+- networkx`
	`23`	`+`
	`24`	`+## NLTK models`
	`25`	`+`
	`26`	+- `stopwords` - Stopwords are the English words that do not add much meaning to a sentence.
	`27`	`+`
	`28`	`+## Setup`
	`29`	`+`
	`30`	+- Set up a `python 3.x` virtual environment.
	`31`	+- `Activate` the environment.
	`32`	`+- Install the dependencies using:`
	`33`	`+`
	`34`	+```bash
	`35`	`+pip3 install -r requirements.txt`
	`36`	+```
	`37`	`+`
	`38`	`+- Set up the models by running the following command:`
	`39`	`+`
	`40`	+```bash
	`41`	`+$ python -m nltk.downloader stopwords`
	`42`	+```
	`43`	`+`
	`44`	+- Run the `text_summary.py` file
	`45`	`+- Enter the source path.`
	`46`	`+`
	`47`	`+## Results`
	`48`	`+`
	`49`	`+The code generates the tokens (same as weights) for each set of words, which show the relative importance of words according to`
	`50`	`+the summarizer. To print them, uncomment _l-112_ of text_summary.py.`
	`51`	`+`
	`52`	`+Results can be found [here](assets).`
	`53`	`+`
	`54`	`+## Author(s)`
	`55`	`+`
	`56`	`+Made by [Vybhav Chaturvedi](https://www.linkedin.com/in/vybhav-chaturvedi-0ba82614a/)`
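The README describes the extractive approach only in prose. As a rough illustration of the sentence-scoring idea behind it, here is a minimal sketch of bag-of-words cosine similarity with stopword filtering, using only the standard library. The tiny `STOPWORDS` set and the function name `cosine_similarity` are assumptions made for this example; the script in this commit uses the NLTK stopword list and `nltk.cluster.util.cosine_distance` instead.

```python
import math
from collections import Counter

# Hypothetical, tiny stopword set for illustration only; the script loads
# the full English list from NLTK instead.
STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}


def cosine_similarity(words1, words2, stopwords=STOPWORDS):
    """Cosine similarity between two tokenized sentences (bag of words)."""
    counts1 = Counter(w.lower() for w in words1 if w.lower() not in stopwords)
    counts2 = Counter(w.lower() for w in words2 if w.lower() not in stopwords)
    # Dot product over the words of the first sentence; a Counter returns 0
    # for words that are missing from the second one.
    dot = sum(counts1[w] * counts2[w] for w in counts1)
    norm1 = math.sqrt(sum(c * c for c in counts1.values()))
    norm2 = math.sqrt(sum(c * c for c in counts2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # a sentence made only of stopwords has no direction
    return dot / (norm1 * norm2)


a = "music retrieval is the search for music".split()
b = "a music database stores audio".split()
print(round(cosine_similarity(a, b), 3))
```

Sentences that share more non-stopword terms score closer to 1, and sentences with no terms in common score 0. Returning 0 for an all-stopword sentence is a design choice of this sketch that avoids a division by zero.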

`‎Text_Summary/assets/random.txt`

Lines changed: 5 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,5 @@`
	`1`	`+Music retrieval involves searching for music that is played over loudspeakers in public places such as a coffee shop or shopping mall, or even on the street. However, in such environments, the music is accompanied by background noise such as people’s voices, vehicular noises, or the sound of machinery.`
	`2`	`+Recently, content-based music information retrieval (MIR) systems for mobile devices have attracted great interest. MIR systems perform various functionalities such as music recommendation and music recognition. MIR applications such as Shazam, SoundHound, and Gracenote have already been developed for the iPhone, iPad, and other such mobile devices.`
	`3`	`+To develop a music retrieval system, first, it is necessary to create an audio fingerprint that can be matched against those stored in a music database. An audio fingerprint contains short summary information of an audio or a perceptual piece of audio content.`
	`4`	`+Then, to improve the retrieval rate, when a query is input in a noisy environment, first, it is necessary to find candidate audio matching the query from a lookup table (LUT). This increases the probability of correct music identification. In this study, we evaluate various pre-processing methods for a hash-based fingerprint system and determine the best one by determining the accuracy of searching for a query from an LUT.`
	`5`	`+Pre-processing can be carried out using various approaches such as normalization, noise reduction, and filtering. In this study, we evaluate the search accuracy of various such methods by calculating the number of exact matches when searching from robust fingerprints`

`‎Text_Summary/assets/random_textRank.txt`

Lines changed: 5 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,5 @@`
	`1`	`+MIR systems perform various functionalities such as music recommendation and music recognition`
	`2`	`+MIR applications such as Shazam, SoundHound, and Gracenote have already been developed for the iPhone, iPad, and other such mobile devices.To develop a music retrieval system, first, it is necessary to create an audio fingerprint that can be matched against those stored in a music database`
	`3`	`+Music retrieval involves searching for music that is played over loudspeakers in public places such as a coffee shop or shopping mall, or even on the street`
	`4`	`+However, in such environments, the music is accompanied by background noise such as people’s voices, vehicular noises, or the sound of machinery.Recently, content-based music information retrieval (MIR) systems for mobile devices have attracted great interest`
	`5`	`+This increases the probability of correct music identification`
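Some lines of this output contain two sentences fused at a period with no following space (for example `devices.To develop`). That is consistent with splitting on the literal string `". "`, which does not separate sentences that meet without a space. Below is a minimal sketch of a more tolerant splitter, assuming a regular-expression split is acceptable for this kind of input. `split_sentences` is a hypothetical helper, not part of the commit.

```python
import re


def split_sentences(text):
    """Split text after '.', '!' or '?' when an uppercase letter follows."""
    # Zero-width split point: end punctuation, optional whitespace, then an
    # uppercase letter. Abbreviations such as "U.S. Army" would still be
    # split incorrectly; this is only a sketch. Requires Python 3.7+, where
    # re.split accepts patterns that can match the empty string.
    parts = re.split(r"(?<=[.!?])\s*(?=[A-Z])", text)
    return [p.strip() for p in parts if p.strip()]


print(split_sentences("mobile devices.To develop a system. It works."))
```

A dedicated tokenizer (for instance the sentence tokenizer that ships with NLTK) would handle abbreviations better. The regular expression is used here only to keep the example free of extra downloads.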

`‎Text_Summary/requirements.txt`

Lines changed: 3 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,3 @@`
	`1`	`+nltk==3.2.4`
	`2`	`+numpy==1.19.5`
	`3`	`+networkx==2.5`

`‎Text_Summary/text_summary.py`

Lines changed: 126 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,126 @@`
	`1`	`+#!/usr/bin/env python`
	`2`	`+# coding: utf-8`
	`3`	`+`
	`4`	`+# Imports`
	`5`	`+from nltk.corpus import stopwords`
	`6`	`+from nltk.cluster.util import cosine_distance`
	`7`	`+import numpy as np`
	`8`	`+import networkx as nx`
	`9`	`+`
	`10`	`+# Enter the File path`
	`11`	`+file_name = input("Enter the Source File: ")`
	`12`	`+print("This script requires 'stopwords' from NLTK, see README. "`
	`13`	+      "Quick Download Command: ```python -m nltk.downloader stopwords```")
	`14`	`+`
	`15`	`+def read_article(file_name):`
	`16`	`+    """`
	`17`	`+    Reads the text file and converts it into sentences.`
	`18`	`+    :param file_name: Path of text file (line 11)`
	`19`	`+    :return: sentences`
	`20`	`+    """`
	`21`	`+    with open(file_name, 'r', encoding="utf-8") as file:`
	`22`	`+        filedata = file.readlines()`
	`23`	`+    article = filedata[0].split(". ")  # only the first line of the file is used`
	`24`	`+    sentences = []`
	`25`	`+`
	`26`	`+    for sentence in article:`
	`27`	`+        # Uncomment if you want to print the whole file on screen.`
	`28`	`+        # print(sentence)`
	`29`	`+        sentences.append(sentence.replace("[^a-zA-Z]", " ").split(" "))  # str.replace is literal, not a regex`
	`30`	`+    sentences.pop()  # drops the last split fragment`
	`31`	`+`
	`32`	`+    return sentences`
	`33`	`+`
	`34`	`+`
	`35`	`+def sentence_similarity(sent1, sent2, stopwords=None):`
	`36`	`+    """`
	`37`	`+    Determine the cosine similarity between two sentences`
	`38`	`+    :param sent1: List of words in sentence 1`
	`39`	`+    :param sent2: List of words in sentence 2`
	`40`	`+    :param stopwords: Words to be ignored in Vectors (Read README.md)`
	`41`	`+    :return: Cosine Similarity score`
	`42`	`+    """`
	`43`	`+    if stopwords is None:`
	`44`	`+        stopwords = []`
	`45`	`+`
	`46`	`+    sent1 = [w.lower() for w in sent1]`
	`47`	`+    sent2 = [w.lower() for w in sent2]`
	`48`	`+`
	`49`	`+    all_words = list(set(sent1 + sent2))`
	`50`	`+`
	`51`	`+    vector1 = [0] * len(all_words)`
	`52`	`+    vector2 = [0] * len(all_words)`
	`53`	`+`
	`54`	`+    # build the vector for the first sentence`
	`55`	`+    for w in sent1:`
	`56`	`+        if w in stopwords:`
	`57`	`+            continue`
	`58`	`+        vector1[all_words.index(w)] += 1`
	`59`	`+`
	`60`	`+    # build the vector for the second sentence`
	`61`	`+    for w in sent2:`
	`62`	`+        if w in stopwords:`
	`63`	`+            continue`
	`64`	`+        vector2[all_words.index(w)] += 1`
	`65`	`+`
	`66`	`+    return 1 - cosine_distance(vector1, vector2)`
	`67`	`+`
	`68`	`+`
	`69`	`+def build_similarity_matrix(sentences, stop_words):`
	`70`	`+    """`
	`71`	`+    Build the pairwise similarity matrix of the sentences`
	`72`	`+    :param sentences: Clean sentences`
	`73`	`+    :param stop_words: Words to be ignored in Vectors (Read README.md)`
	`74`	`+    :return: Similarity matrix (NumPy array)`
	`75`	`+    """`
	`76`	`+    # Create an empty similarity matrix`
	`77`	`+    similarity_matrix = np.zeros((len(sentences), len(sentences)))`
	`78`	`+`
	`79`	`+    for idx1 in range(len(sentences)):`
	`80`	`+        for idx2 in range(len(sentences)):`
	`81`	`+            if idx1 == idx2:  # ignore if both are same sentences`
	`82`	`+                continue`
	`83`	`+            similarity_matrix[idx1][idx2] = sentence_similarity(sentences[idx1], sentences[idx2], stop_words)`
	`84`	`+`
	`85`	`+    return similarity_matrix`
	`86`	`+`
	`87`	`+`
	`88`	`+def generate_summary(file_name, top_n=5):`
	`89`	`+    """`
	`90`	`+    Generate Summary of the text file`
	`91`	`+    :param file_name: Path of text file (line 11)`
	`92`	`+    :param top_n: Number of top-ranked sentences to include in the summary`
	`93`	`+    :return: None (the summary is written to a file)`
	`94`	`+    """`
	`95`	`+    stop_words = stopwords.words('english')`
	`96`	`+    summarize_text = []`
	`97`	`+`
	`98`	`+    # Step 1 - Read text and split it`
	`99`	`+    sentences = read_article(file_name)`
	`100`	`+`
	`101`	`+    # Step 2 - Generate Similarity Matrix across sentences`
	`102`	`+    sentence_similarity_matrix = build_similarity_matrix(sentences, stop_words)`
	`103`	`+`
	`104`	`+    # Step 3 - Rank sentences in similarity matrix`
	`105`	`+    sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_matrix)`
	`106`	`+    scores = nx.pagerank(sentence_similarity_graph)`
	`107`	`+`
	`108`	`+    # Step 4 - Sort the rank and pick top sentences`
	`109`	`+    ranked_sentence = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)`
	`110`	`+`
	`111`	`+    # Print the index of the statements`
	`112`	`+    # print("Indexes of top ranked_sentence order are ", ranked_sentence)`
	`113`	`+`
	`114`	`+    for i in range(min(top_n, len(ranked_sentence))):`
	`115`	`+        summarize_text.append(" ".join(ranked_sentence[i][1]))`
	`116`	`+`
	`117`	`+    # Step 5 - Output of the text file`
	`118`	`+    filepath_index = file_name.find('.txt')`
	`119`	`+    outputpath = file_name[:filepath_index]+'_textRank.txt'`
	`120`	`+`
	`121`	`+    with open(outputpath, 'w', encoding="utf-8") as w:`
	`122`	`+        for sentence in summarize_text:`
	`123`	`+            w.write(str(sentence)+'\n')`
	`124`	`+`
	`125`	`+`
	`126`	`+generate_summary(file_name, 5)`
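Step 3 of `generate_summary` delegates the ranking to `nx.pagerank`. To make clear what that step computes, the following is a minimal power-iteration sketch in plain Python. It is an illustrative stand-in under stated assumptions (damping factor 0.85, a fixed number of iterations, a uniform jump for rows with no similarity at all), not the networkx implementation. `rank_sentences` is a hypothetical name.

```python
def rank_sentences(similarity, damping=0.85, iterations=100):
    """Score each sentence by power iteration over a similarity matrix."""
    n = len(similarity)
    if n == 0:
        return []
    # Normalize each row so its outgoing weights sum to 1.
    transition = []
    for row in similarity:
        total = sum(row)
        if total == 0:
            # Assumption: an isolated sentence jumps uniformly to any sentence.
            transition.append([1.0 / n] * n)
        else:
            transition.append([w / total for w in row])
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(scores[j] * transition[j][i] for j in range(n))
            for i in range(n)
        ]
    return scores


# Toy symmetric matrix: sentence 0 is similar to both other sentences.
sim = [
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.0],
    [0.5, 0.0, 0.0],
]
scores = rank_sentences(sim)
top_n = 2
# Slicing clamps automatically when fewer than top_n sentences exist.
best = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_n]
print(best)
```

Sorting indices rather than `(score, sentence)` pairs keeps the original position of each sentence available, which would make it possible to emit the summary in document order if that were preferred.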

0 commit comments