Commit b58c72c

authored

Merge pull request avinashkranjan#836 from zaverisanya/master

Bag of words model

2 parents 2ac8f5e + 7705143 commit b58c72c

File tree

2 files changed: +65 -0 lines changed

Bag of words model
- README.md
- bow.py


`‎Bag of words model/README.md`

Lines changed: 32 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,32 @@`
	`1`	`+# Bag of Words Model`
	`2`	`+`
	`3`	`+--> Package installed: NLTK`
	`4`	`+- NLTK stands for 'Natural Language Toolkit'. It provides common text-processing algorithms such as tokenization, part-of-speech tagging, stemming, sentiment analysis, topic segmentation, and named entity recognition. NLTK helps the computer analyze, preprocess, and understand written text.`
	`5`	`+`
	`6`	`+--> Pandas`
	`7`	`+- pandas is a library for storing, analyzing, and processing data in a row-and-column (tabular) representation.`
	`8`	`+`
	`9`	`+--> from sklearn.feature_extraction.text import CountVectorizer`
	`10`	`+- Scikit-learn's CountVectorizer converts a collection of text documents to a matrix of term/token counts. It can also pre-process the text (for example lowercasing and tokenizing) before generating the vector representation, which makes it a flexible feature-extraction module for text.`
	`11`	`+`
	`12`	`+## Setup instructions`
	`13`	`+`
	`14`	`+1) Input the sentences you would like to vectorize.`
	`15`	`+2) The script will split the input into sentences (sentence tokenization).`
	`16`	`+3) It will transform the text into vectors where each word is a feature and its count is the value.`
	`17`	`+4) The bag-of-words model is then ready.`
	`18`	`+5) The script creates a DataFrame, which is analogous to an Excel spreadsheet.`
	`19`	`+6) Open Excel and check 'bowp.xlsx', where the sheet name is 'data'. The DataFrame is stored there.`
	`20`	`+`
	`21`	`+`
	`22`	`+## Output`
	`23`	`+`
	`24`	`+![Image](https://i.postimg.cc/pLQq8Vdc/output.png)`
	`25`	`+`
	`26`	`+## Author(s)`
	`27`	`+`
	`28`	`+- This code is written by [Sanya Devansh Zaveri](https://github.com/zaverisanya)`
	`29`	`+`
	`30`	`+## Disclaimers, if any`
	`31`	`+`
	`32`	`+There are no disclaimers for this script.`
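Review note: the README describes the bag-of-words steps only in prose, so a small standard-library sketch of the same idea may help readers. This is not scikit-learn's implementation. The function name `bag_of_words` is hypothetical, and the regular expression is an assumption chosen to resemble CountVectorizer's default token pattern (lowercased tokens of two or more word characters, vocabulary in sorted order).

```python
import re
from collections import Counter


def bag_of_words(sentences):
    """Return (vocabulary, rows), where rows[i][j] counts vocabulary[j] in sentences[i]."""
    # Lowercase each sentence and keep runs of 2+ word characters.
    # Assumption: this approximates CountVectorizer's default tokenization.
    tokenized = [re.findall(r"\b\w\w+\b", s.lower()) for s in sentences]

    # The vocabulary is the sorted set of all tokens; each becomes one column.
    vocabulary = sorted({tok for toks in tokenized for tok in toks})

    # One row per sentence, one count per vocabulary word.
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


# Example input taken from the comment in bow.py.
vocab, rows = bag_of_words(
    ["My name is sanya.", "I am caring and loving.", "I am generous."]
)
print(vocab)
for row in rows:
    print(row)
```

Note that the single-letter token "i" is dropped by the two-character minimum, so it does not appear as a column.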

`‎Bag of words model/bow.py`

Lines changed: 33 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,33 @@`
	`1`	`+from sklearn.feature_extraction.text import CountVectorizer`
	`2`	`+import nltk`
	`3`	`+import pandas as pd #pandas is a library where your data can be stored, analyzed and processed in row and column representation`
	`4`	`+from openpyxl import Workbook`
	`5`	`+sentences=input("Enter your sentences: ")`
	`6`	`+#eg. My name is sanya. I am caring and loving. I am generous.`
	`7`	`+#converting to lower case (normalization)`
	`8`	`+sentences=sentences.lower()`
	`9`	`+#sentence tokenized`
	`10`	`+tokenized_sentences=nltk.tokenize.sent_tokenize(sentences)`
	`11`	`+print(tokenized_sentences)`
	`12`	`+tokenized_sentences1=[]`
	`13`	`+for x in tokenized_sentences:`
	`14`	`+ x=x.replace(".","") #removed .`
	`15`	`+ tokenized_sentences1.append(x)`
	`16`	`+print(tokenized_sentences1) #list of sentences with the periods removed`
	`17`	`+#instantiating CountVectorizer()`
	`18`	`+countVectorizer=CountVectorizer() #BOW`
	`19`	`+#transforming text to vectors where each word is a feature and its count is the value`
	`20`	`+tmpbow=countVectorizer.fit_transform(tokenized_sentences1)#pass list of sentences as arguments`
	`21`	`+print("tmpbow \n",tmpbow) #bag-of-words model is ready (sparse matrix)`
	`22`	`+`
	`23`	`+bow=tmpbow.toarray()`
	`24`	`+print("Vocabulary = ",countVectorizer.vocabulary_)`
	`25`	`+print("Features = ",countVectorizer.get_feature_names_out()) #get_feature_names() was removed in newer scikit-learn releases`
	`26`	`+#Features here are the column names, one per vocabulary word`
	`27`	`+print("BOW ",bow)`
	`28`	`+`
	`29`	`+#create dataframe #DataFrame is analogous to an Excel spreadsheet`
	`30`	`+cv_dataframe=pd.DataFrame(bow,columns=countVectorizer.get_feature_names_out())`
	`31`	`+`
	`32`	`+print("cv_dataframe is below\n",cv_dataframe)`
	`33`	`+cv_dataframe.to_excel('./Bag of words model/bowp.xlsx', sheet_name='data')`
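Review note: the feature-name accessor on CountVectorizer differs across scikit-learn versions. Older releases expose `get_feature_names()`, while newer ones expose `get_feature_names_out()`. If the script needs to run on both, one option is a small compatibility helper. The sketch below is an assumption about how that could look. The helper name `feature_names` and the two stand-in classes are hypothetical, used only so the example runs without scikit-learn installed.

```python
def feature_names(vectorizer):
    """Return the vectorizer's feature names as a list, using whichever accessor exists."""
    # Prefer the newer accessor and fall back to the older one.
    getter = getattr(vectorizer, "get_feature_names_out", None)
    if getter is None:
        getter = vectorizer.get_feature_names
    return list(getter())


# Hypothetical stand-ins for an old-style and a new-style vectorizer.
class OldStyleVectorizer:
    def get_feature_names(self):
        return ["am", "and"]


class NewStyleVectorizer:
    def get_feature_names_out(self):
        return ("am", "and")


old_names = feature_names(OldStyleVectorizer())
new_names = feature_names(NewStyleVectorizer())
print(old_names, new_names)
```

In bow.py this would be called as `feature_names(countVectorizer)` wherever the column names are needed, for example when building `cv_dataframe`.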

0 commit comments