Batch processing large text files for sentiment analysis

Question 1

I have written this script in Python 2.7 and I was wondering where I could improve it or efficient. This is my first script so I don't now much. The script takes a text file from a folder removes blank spaces with list comprehension and then breaks it into smaller 20 elements and saves it each 20 elements as a definition into a dictionary with iterative numbered keys. Each key is then used to fetch the definition from the dictionary and sends each through the API one by one saving the output into another dictionary. The result dictionary is then converted to text and saved into a text file.

import glob
import os
import json
import paralleldots
api_key = "iyuDsKbgZSsCxVj6Moe37MzdqCUuvkpf33t6qS3X3cH"
paralleldots.set_api_key(api_key)
output = input(r"C:\Users\User\Desktop\trial")
txt_files = os.path.join("D:\\english\\stou\\check", '*.txt')
dic = {} #dictionary for storing list elements
res = {} #results dictionary for api output
def for_high (list_e): #defining a function for taking list elements and putting them into dic
 no_keys = list_e / 20
 dic = {
 1: l[1:21]
 }
 i = 1
 start_key_value = 21
 while i <= no_keys:
 i += 1
 end_key_value = start_key_value + 20
 dic.update({i: ''.join(l[start_key_value:end_key_value])})
 start_key_value += 20
 for x in dic: #creating a for loop for getting and saving the output and creating a new file
 output = paralleldots.emotion(dic.get(x))
 res.update({x: output})
 with open(os.path.join("C:\\Users\\User\\Desktop\\trial", filename), 'w') as text_file:
 text_file.write(json.dumps(res))
for txt_file in glob.glob(txt_files): # for loop for going through all the text files in the input directory
 with open(txt_file, "r") as input_file:
 filename = os.path.splitext(os.path.basename(txt_file))[0] + '.txt'
 l = [l for l in input_file.readlines() if l.strip()]
 list_e = int(len(l)) #no. of list elements variable
 if list_e > 20: #checking if list has more than 20 elements
 if list_e % 2 != 0: #checking if list has an odd no. of elements
 list_e += 1
 for_high(list_e)
 else:
 for_high(list_e)
 else:
 in_txt = paralleldots.emotion(l)
 filename = os.path.splitext(os.path.basename(txt_file))[0] + '.txt'
 with open(os.path.join("C:\\Users\\User\\Desktop\\trial", filename), 'w') as text_file:
 text_file.write(str(in_txt))

Question 2

Please capitalize correctly. It makes it much easier to read your post.

Question 3

@FreezePhoenix Edited :)

Question 4

I understand you are a beginner and you had the guts to post your code. You deserve some feedback and I will try.

Your code is hard to read

This is most probably why you didn't get answers till now. There is code repetition, global and reused variables, no clear control flow, and bad variable names. It is really hard to find the core functionality. So I start in no particular order.

Bad names

dic = {} #dictionary for storing list elements

Well I can see this is a dict. Bad name, bad comment. Use names describing the meaning, not the type. Comments shoud describe the hidden, not so obvious parts, that could not be expressed by names. Your whole comment #dictionary for storing list elements can be split into #dictionary for storing ... elements and ... lists .... the first one is completely obvious, the other is just some type information again. No hints to the sort of content (should be in the name), no hints as to the key. You could safely replace the whole line with guesswhat={}.

Do not use globals or nonlocals for no reason

def for_high (list_e): #defining a function for taking list elements and putting them into dic

Uses filename, res, and l. While one could expect that on res and as it was declared at global scope, it is a complete no-go to use variables that look like temporaries in their scope. This renders your function absolutely untestable and your function will easily break if you alter the outer code. Pass required variables, so the reader/maintainer is aware they are used by the function. Unlike what the comment suggests, your function currently only takes a list length. And unless for_high is a well known / defined term in your community, it is a bad name again.

Do not repeat yourself (DNRY)

Your function has some code:

output = paralleldots.emotion(dic.get(x))
res.update({x: output})
with open(os.path.join("C:\\Users\\User\\Desktop\\trial", filename), 'w') as text_file:
 text_file.write(json.dumps(res))

which is duplicated outside (slightly modified)

in_txt = paralleldots.emotion(l)
filename = os.path.splitext(os.path.basename(txt_file))[0] + '.txt'
with open(os.path.join("C:\\Users\\User\\Desktop\\trial", filename), 'w') as text_file:
 text_file.write(str(in_txt))

How can something be either output or in_txt? Of course a function has a return value or output or result. None of these make a good name. What is returned? For one code snippet filename is explicitly set, for the other it magically pre-exists? So the function accesses a value from outside and the outer code overwrites the existing value with the very same value before using it. This is pure horror. Also one of this file writes is done in a loop, the file is overwritten in each iteration. It shall either be written once or appended. Create a function handling one element.

Simplify control flow

If both execution paths of an if clause nearly do the same, check whether you can simplify it. Your

if list_e % 2 != 0: #checking if list has an odd no. of elements
 list_e += 1
 for_high(list_e)
else:
 for_high(list_e)

can be simplified to (again add a useful comment):

if list_e % 2 != 0: # list length may not be odd because ...
 list_e += 1
for_high(list_e)

but: are you sure it is a good idea to increase a list-length variable while the list itself stays odd?

Define numeric constants once

you have

no_keys = list_e / 20
dic = {1: l[1:21]}
start_key_value = 21
end_key_value = start_key_value + 20
start_key_value += 20
if list_e > 20: #checking if list has more than 20 elements

I do not think these are independent constants. So define this constant once and use that name throughout the code (again change the comment to have some meaning)

max_emoticon_len = 20
no_keys = list_e / max_emoticon_len
dic = {1: l[1:max_emoticon_len + 1]}
start_key_value = max_emoticon_len + 1
end_key_value = start_key_value + max_emoticon_len
start_key_value += max_emoticon_len
if list_e > max_emoticon_len: #checking if list has more than elements than desired to fit for ...

Other

Also list_e is a bad name. e mostly is used as name/pre-/postfix for elements for e in l:. for no. of elements the term length as in len() is used. stick to those conventions to avoid irritation.

Also dic = {1: l[1:21]} - are you sure you want to skip the first element? In Python, indexing starts at 0.

Also is it desired, that when having a list of length 40, the first sequence has length 20, and the last one has length 19?

Also dic = {1: l[1:21]} hides the outer dic which therefor is never used and may be deleted

Also dic = {1: l[1:21]} - why do you set this explicitly? Why not inside the loop?

Also dic.update({i: 'some'}) may be done by dic[i] = 'some'. The same goes for res.

Also you handle lists with less than 20 elements in your main code. However, if the list is bigger than 20 elements, you do batches of 20 and drop the last ones. Desired?

Question 5

I did my best to edit this post to improve its quality (spelling and grammar wise). I understand you may not be a native English speaker (neither am I), but please at least do a quick spell-check before posting. Thanks.

stefan stefan 2,9948 silver badges17 bronze badges · Accepted Answer · 2018-08-10 12:15:35Z

I understand you are a beginner and you had the guts to post your code. You deserve some feedback and I will try.

Your code is hard to read

This is most probably why you didn't get answers till now. There is code repetition, global and reused variables, no clear control flow, and bad variable names. It is really hard to find the core functionality. So I start in no particular order.

Bad names

dic = {} #dictionary for storing list elements

Well I can see this is a dict. Bad name, bad comment. Use names describing the meaning, not the type. Comments shoud describe the hidden, not so obvious parts, that could not be expressed by names. Your whole comment #dictionary for storing list elements can be split into #dictionary for storing ... elements and ... lists .... the first one is completely obvious, the other is just some type information again. No hints to the sort of content (should be in the name), no hints as to the key. You could safely replace the whole line with guesswhat={}.

Do not use globals or nonlocals for no reason

def for_high (list_e): #defining a function for taking list elements and putting them into dic

Uses filename, res, and l. While one could expect that on res and as it was declared at global scope, it is a complete no-go to use variables that look like temporaries in their scope. This renders your function absolutely untestable and your function will easily break if you alter the outer code. Pass required variables, so the reader/maintainer is aware they are used by the function. Unlike what the comment suggests, your function currently only takes a list length. And unless for_high is a well known / defined term in your community, it is a bad name again.

Do not repeat yourself (DNRY)

Your function has some code:

output = paralleldots.emotion(dic.get(x))
res.update({x: output})
with open(os.path.join("C:\\Users\\User\\Desktop\\trial", filename), 'w') as text_file:
 text_file.write(json.dumps(res))

which is duplicated outside (slightly modified)

in_txt = paralleldots.emotion(l)
filename = os.path.splitext(os.path.basename(txt_file))[0] + '.txt'
with open(os.path.join("C:\\Users\\User\\Desktop\\trial", filename), 'w') as text_file:
 text_file.write(str(in_txt))

How can something be either output or in_txt? Of course a function has a return value or output or result. None of these make a good name. What is returned? For one code snippet filename is explicitly set, for the other it magically pre-exists? So the function accesses a value from outside and the outer code overwrites the existing value with the very same value before using it. This is pure horror. Also one of this file writes is done in a loop, the file is overwritten in each iteration. It shall either be written once or appended. Create a function handling one element.

Simplify control flow

If both execution paths of an if clause nearly do the same, check whether you can simplify it. Your

if list_e % 2 != 0: #checking if list has an odd no. of elements
 list_e += 1
 for_high(list_e)
else:
 for_high(list_e)

can be simplified to (again add a useful comment):

if list_e % 2 != 0: # list length may not be odd because ...
 list_e += 1
for_high(list_e)

but: are you sure it is a good idea to increase a list-length variable while the list itself stays odd?

Define numeric constants once

you have

no_keys = list_e / 20
dic = {1: l[1:21]}
start_key_value = 21
end_key_value = start_key_value + 20
start_key_value += 20
if list_e > 20: #checking if list has more than 20 elements

I do not think these are independent constants. So define this constant once and use that name throughout the code (again change the comment to have some meaning)

max_emoticon_len = 20
no_keys = list_e / max_emoticon_len
dic = {1: l[1:max_emoticon_len + 1]}
start_key_value = max_emoticon_len + 1
end_key_value = start_key_value + max_emoticon_len
start_key_value += max_emoticon_len
if list_e > max_emoticon_len: #checking if list has more than elements than desired to fit for ...

Other

Also list_e is a bad name. e mostly is used as name/pre-/postfix for elements for e in l:. for no. of elements the term length as in len() is used. stick to those conventions to avoid irritation.

Also dic = {1: l[1:21]} - are you sure you want to skip the first element? In Python, indexing starts at 0.

Also is it desired, that when having a list of length 40, the first sequence has length 20, and the last one has length 19?

Also dic = {1: l[1:21]} hides the outer dic which therefor is never used and may be deleted

Also dic = {1: l[1:21]} - why do you set this explicitly? Why not inside the loop?

Also dic.update({i: 'some'}) may be done by dic[i] = 'some'. The same goes for res.

Also you handle lists with less than 20 elements in your main code. However, if the list is bigger than 20 elements, you do batches of 20 and drop the last ones. Desired?

I did my best to edit this post to improve its quality (spelling and grammar wise). I understand you may not be a native English speaker (neither am I), but please at least do a quick spell-check before posting. Thanks.

Stack Exchange Network

Batch processing large text files for sentiment analysis

1 Answer 1

Your code is hard to read

Bad names

Do not use globals or nonlocals for no reason

Do not repeat yourself (DNRY)

Simplify control flow

Define numeric constants once

Other

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Batch processing large text files for sentiment analysis

1 Answer 1

Your code is hard to read

Bad names

Do not use globals or nonlocals for no reason

Do not repeat yourself (DNRY)

Simplify control flow

Define numeric constants once

Other

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions