Python - A Regex Text File Searcher by Content Project

Question 1

I would like some advice on efficiency and ways to further remove redundant code.

Regex Text File Search:

1. Enter a valid directory path and a string to search for.
2. Make a list containing all the matched paths, and a dictionary with the matches of the matched paths.
3. Get a list of all the text files and go through them.
4. For each text file, search for a match, line by line.
5. If a match is found, append the path to the matched_paths list.
6. Update matched_dict to include the lines and the start and end index of each match in the matched path.
7. Format the output into an appropriate form and display the results of the find.

from pathlib import Path
from pyinputplus import inputCustom, inputStr
import re


# 2)
def check_path(path):
    """
    returns the directory if the path is a directory, and it exists otherwise asks to input again
    """
    directory = Path(path)
    if directory.is_dir():
        return directory
    else:
        raise Exception('Path is not a valid directory')


def format_output(matched_paths, matched_dict):
    """
    - outputs the directories and matches for the string to be searched, only outputs the matched text files
    Formats the output into this template:
    file1.txt
    >>> --------------------------- <<<
    ------ Line 280 ------
    (start_index, end_index) # match 1
    (start_index, end_index) # match 2 if it exists
    <<< --------------------------- >>>
    file2.txt
    ...
    """
    for path in matched_paths:
        print()
        print(path.name)
        print('>>> --------------------------- <<<')
        for line, matches in matched_dict[path].items():
            print(f' ------ Line {line} ------')
            for match in matched_dict[path][line]:
                print(' ', match)
            print()
        print('<<< --------------------------- >>>')


def search_for_string(path, string):
    """
    1) opens the string
    2) makes a check on if a match exists in the file
    3) if it does, goes line by line
    4) appending all the matches (start_index, end_index) to the line number in the dict of path
    returns True if match was found, so it will be appended to the matched_paths list
    matched_paths = [path1, path2, path3]
    matched_dict = {path1: {line1: [(s1, e1), (s2, e2)] }, path2: ... }
    """
    global matched_dict
    with open(path, 'r') as text_file:
        if re.search(string, path.read_text()):
            matched_dict[path] = dict()
            for i, line in enumerate(text_file.readlines()):  # i refers to line number
                matches = list(re.finditer(string, line))
                if matches:
                    matched_dict[path][i] = []
                    for m in matches:
                        matched_dict[path][i].append(m.span())
            return True
        return False


# 1)
path = inputCustom(check_path, prompt='Enter Path: ')
string = inputStr(prompt='Enter string: ')
# 2)
matched_paths = []
matched_dict = dict()
# 3)
for text_file in path.glob('*.txt'):
    # 4) and 6)
    if search_for_string(text_file, string):
        # 5)
        matched_paths.append(text_file)
# 7)
format_output(matched_paths, matched_dict)

Answer

Documentation

"""
returns the directory if the path is a directory, and it exists otherwise asks to input again
"""

OK; but that's not what this function does at all. The "asking to input again" is done elsewhere.
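
A docstring that matches the behaviour might look like the sketch below. It assumes, as in the question, that the function is handed to pyinputplus's inputCustom, which is what re-prompts when the validator raises. Swapping the bare Exception for ValueError is a suggestion of mine, not something the original code does.

```python
from pathlib import Path


def check_path(path):
    """
    Return `path` as a Path if it is an existing directory.

    Raises ValueError otherwise. Re-prompting is left to the caller
    (inputCustom in the question's code).
    """
    directory = Path(path)
    if directory.is_dir():
        return directory
    # ValueError is an assumption; the question raises a bare Exception.
    raise ValueError('Path is not a valid directory')
```

Dropping the else after the return is optional; the behaviour is the same either way.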

Type hints

Guessing for this signature; type hints would remove the guesswork:

def format_output(matched_paths, matched_dict):

could become

def format_output(
    matched_paths: Iterable[str],
    matched_dict: Dict[
        str,
        Dict[str, str]
    ],
):

If this is true, the second type is complex enough that it could benefit from external declaration, i.e.

from typing import Dict

MatchesDict = Dict[
    str,
    Dict[str, str]
]
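
Reading the question's code rather than guessing, the outer keys appear to be Path objects (from path.glob), the inner keys integer line indices (from enumerate), and the values lists of (start, end) tuples (from match.span()). Under that reading, a sketch of the hints might be as follows. Span is a name introduced here for illustration, and the printed layout is simplified.

```python
from pathlib import Path
from typing import Dict, Iterable, List, Tuple

# (start_index, end_index), as returned by re.Match.span()
Span = Tuple[int, int]
# {path: {line_index: [span, ...]}}
MatchesDict = Dict[Path, Dict[int, List[Span]]]


def format_output(matched_paths: Iterable[Path], matched_dict: MatchesDict) -> None:
    """Print each matched file name, then its line indices and match spans."""
    for path in matched_paths:
        print(path.name)
        for line, matches in matched_dict[path].items():
            print(f"Line {line}")
            for match in matches:
                print(" ", match)


# Usage with made-up data:
format_output([Path("file1.txt")], {Path("file1.txt"): {280: [(4, 9)]}})
```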

Dictionary iteration

for path in matched_paths:
    for line, matches in matched_dict[path].items():
        for match in matched_dict[path][line]:

should be

for path in matched_paths:
    for line, matches in matched_dict[path].items():
        for match in matches:

In other words, items gets you a key and a value; when you have the value, use it.
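
As a quick check that the two forms visit the same data, here is a small sketch with a made-up dictionary in the shape the question's code builds. The string keys and the names by_lookup and by_value are only for illustration.

```python
# Made-up data: {path: {line_index: [(start, end), ...]}}
matched_dict = {
    "file1.txt": {3: [(0, 4)], 7: [(2, 6), (10, 14)]},
}

by_lookup = []
by_value = []
for path, lines in matched_dict.items():
    for line, matches in lines.items():
        # Redundant: looks up the value that items() already returned.
        by_lookup.extend(matched_dict[path][line])
        # Direct: use the value from the (key, value) pair.
        by_value.extend(matches)

print(by_lookup == by_value)  # True
```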

Logic inversion

I find that this:

        if re.search(string, path.read_text()):
            matched_dict[path] = dict()
            for i, line in enumerate(text_file.readlines()):  # i refers to line number
                matches = list(re.finditer(string, line))
                if matches:
                    matched_dict[path][i] = []
                    for m in matches:
                        matched_dict[path][i].append(m.span())
            return True
        return False

is more legible as

        if not re.search(string, path.read_text()):
            return False
        matched_dict[path] = dict()
        for i, line in enumerate(text_file.readlines()):  # i refers to line number
            matches = list(re.finditer(string, line))
            if matches:
                matched_dict[path][i] = []
                for m in matches:
                    matched_dict[path][i].append(m.span())
        return True
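
For context, the whole function with the early return might look like the sketch below. To keep it self-contained, matched_dict is passed in as a third parameter instead of being a global; that signature change is an assumption of this sketch, not part of the question's code. Line indices stay 0-based, as in the original.

```python
import re
from pathlib import Path
from typing import Dict, List, Tuple


def search_for_string(
    path: Path,
    string: str,
    matched_dict: Dict[Path, Dict[int, List[Tuple[int, int]]]],
) -> bool:
    """Record the span of every match of `string` in `path`; return True if any."""
    # Guard clause: leave early when the file contains no match at all.
    if not re.search(string, path.read_text()):
        return False

    matched_dict[path] = {}
    with open(path, 'r') as text_file:
        for i, line in enumerate(text_file.readlines()):  # i is the 0-based line index
            matches = list(re.finditer(string, line))
            if matches:
                matched_dict[path][i] = [m.span() for m in matches]
    return True
```

With the guard clause first, the main loop sits one indentation level shallower and the only remaining return is the success case.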

Comment

On your point about dictionary iteration: matched_paths isn't a dictionary but a list of paths to the matched text files, so how am I supposed to iterate through its items() method?

Comment

My mistake; edited.

Reinderien · Accepted Answer · 2020-06-19 20:17:48Z
