1
\$\begingroup\$

I've been attempting to teach myself Python over the past couple of months and have been setting myself practical challenges to learn different aspects of the language.

I have a deep structure of subdirectories containing .rtf files. The objectives of the following code are as follows:

  1. Examine every file in every subdirectory to identify "flagged words" defined in expression.
  2. For each instance a flagged word is identified in a file, capture the file name and the line within that file in which the flagged word appears.
  3. Construct a dictionary, in which flagged words are keys and filenames and lines are stored as values.
  4. Insert each key/value pair in the dictionary into a MongoDB collection

The code is as follows:

from pymongo import MongoClient
import os
import re
import json
# connect to mongodb
client = MongoClient('localhost', 27017)
db = client['test-database']
restrictions = db['restrictions']
# create empty dictionary
d = {}
# setup regular expressions
expression = "sexual offences|reporting restriction|reporting restrictions|anonymous|anonymously|secret"
pattern = re.compile(expression)
# read the source files
for dname, dirs, files in os.walk("path/to/top/level/directory"):
 for fname in files:
 fpath = os.path.join(dname, fname)
 with open(fpath, 'r') as f:
 source = f.readlines()
 for i, line in enumerate(source):
 for match in re.finditer(pattern, line):
 # set the matched flagword as the key and the filename and line as the value
 d.setdefault(match.group(0), [])
 file_instance = fname, line
 print file_instance
 d[match.group(0)].append(file_instance)
# Update the restrictions database
for key, value in d.iteritems():
 xref_id = db.restrictions.insert_one({'flagged_word': key, 'instance': value})

An example document in MongoDB generated by this code:

 {
 "_id" : ObjectId("5925e0d94fb263cb1417bb73"),
 "flagged_word" : "restriction",
 "instance" : [ 
 [ 
 "Family3.txt ", 
 "ii)\tAn application for a reporting restriction order (\"reporting restriction order\") to restrict or prohibit the publication of:\n"
 ] 
 }

The code works as expected, but I have a nagging feeling that the same objectives could be achieved with a far less clunky approach in the code (particularly where all of the for loops are concerned).

In the spirit of wishing to learn how to write clearer, more Pythonic code, I wonder if anyone could offer some advice on how this code could be improved. Many thanks in advance for taking the time to read my code.

asked May 25, 2017 at 18:36
\$\endgroup\$

1 Answer 1

1
\$\begingroup\$
  • f.readlines() is redundant. Just iterate over the file: for i, line in enumerate(f):, or, as you are not using i at all: for line in f:
  • d.setdefault becomes redundant if you use d = collections.defaultdict(list)
answered May 26, 2017 at 6:50
\$\endgroup\$
0

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.