
Commit b58c72c

Merge pull request avinashkranjan#836 from zaverisanya/master

Bag of words model

2 parents 2ac8f5e + 7705143, commit b58c72c

2 files changed: +65 −0 lines

Bag of words model/README.md (32 additions, 0 deletions)
# Package/Script Name

--> Package installed - NLTK

- NLTK stands for 'Natural Language Toolkit'. It provides the most common text-processing algorithms, such as tokenizing, part-of-speech tagging, stemming, sentiment analysis, topic segmentation, and named entity recognition. NLTK helps the computer analyze, preprocess, and understand written text.

--> Pandas

- pandas is a library in which your data can be stored, analyzed, and processed in a row-and-column representation.

--> from sklearn.feature_extraction.text import CountVectorizer

- Scikit-learn's CountVectorizer is used to convert a collection of text documents into a vector of term/token counts. It also enables the pre-processing of text data prior to generating the vector representation. This functionality makes it a highly flexible feature-representation module for text.
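As a quick illustration of how CountVectorizer turns documents into count vectors, here is a minimal sketch (the two toy documents are illustrative, not from this repo):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat ate"]
cv = CountVectorizer()
X = cv.fit_transform(docs)       # sparse matrix of token counts

print(sorted(cv.vocabulary_))    # ['ate', 'cat', 'sat', 'the']
print(X.toarray())               # [[0 1 1 1]
                                 #  [1 1 0 1]]
```

The learned vocabulary is ordered alphabetically, and each row of the dense array is one document's word counts over that vocabulary.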
## Setup instructions

1) Input the sentences you would like to vectorize.
2) The script will tokenize the sentences.
3) It will transform the text into vectors where each word and its count is a feature.
4) The bag-of-words model is then ready.
5) The script creates a DataFrame; a DataFrame is analogous to an Excel spreadsheet.
6) Open Excel and check 'bowp.xlsx', where the sheet name is 'data'. The DataFrame is stored there.
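The counting idea behind steps 2–4 can be sketched with the standard library alone, using the example sentence from the script (note this naive version keeps single-character tokens such as "i", which CountVectorizer drops by default):

```python
from collections import Counter

text = "My name is Sanya. I am caring and loving. I am generous."

# lowercase, split into sentences on full stops, drop empties
sentences = [s.strip() for s in text.lower().split(".") if s.strip()]

# per-sentence word counts, then a shared sorted vocabulary
counts = [Counter(s.split()) for s in sentences]
vocab = sorted({w for c in counts for w in c})

# each row is one sentence's counts over the vocabulary
bow = [[c[w] for w in vocab] for c in counts]
print(vocab)
print(bow)
```

This is illustrative only; the actual script (bow.py) delegates sentence splitting to NLTK and counting to scikit-learn.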
## Output

![Image](https://i.postimg.cc/pLQq8Vdc/output.png)

## Author(s)

- This code is written by [Sanya Devansh Zaveri](https://github.com/zaverisanya)

## Disclaimers, if any

There are no disclaimers for this script.

Bag of words model/bow.py (33 additions, 0 deletions)
```python
from sklearn.feature_extraction.text import CountVectorizer
import nltk
import pandas as pd  # pandas stores, analyzes and processes data in a row/column representation

# download the sentence-tokenizer model on first run
# (newer NLTK versions may require the 'punkt_tab' resource instead)
nltk.download("punkt", quiet=True)

sentences = input("Enter your sentences: ")
# e.g. My name is Sanya. I am caring and loving. I am generous.

# converting to lower case (normalization)
sentences = sentences.lower()

# sentence tokenization
tokenized_sentences = nltk.tokenize.sent_tokenize(sentences)
print(tokenized_sentences)

tokenized_sentences1 = []
for x in tokenized_sentences:
    x = x.replace(".", "")  # remove full stops
    tokenized_sentences1.append(x)
print(tokenized_sentences1)  # the word lists can be converted to a set to get unique words

# instantiating CountVectorizer()
countVectorizer = CountVectorizer()  # BOW
# transforming the text to vectors where each word and its count is a feature
tmpbow = countVectorizer.fit_transform(tokenized_sentences1)  # pass the list of sentences as the argument
print("tmpbow\n", tmpbow)  # the bag-of-words model is ready

bow = tmpbow.toarray()
print("Vocabulary = ", countVectorizer.vocabulary_)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
print("Features = ", countVectorizer.get_feature_names_out())
# features in machine learning are simply the names of the columns
print("BOW ", bow)

# create a DataFrame; a DataFrame is analogous to an Excel spreadsheet
cv_dataframe = pd.DataFrame(bow, columns=countVectorizer.get_feature_names_out())

print("cv_dataframe is below\n", cv_dataframe)
# writing to .xlsx requires the openpyxl package to be installed
cv_dataframe.to_excel('./Bag of words model/bowp.xlsx', sheet_name='data')
```
