I would like some advice on efficiency and ways to further remove redundant code.
Regex Text File Search:
1) enter a valid directory path, and a string to search for
2) makes a list containing all the matched paths, and a dictionary with the matches of the matched paths
3) get a list of all the text files and go through them
4) for each text file, search for a match, line by line
5) if a match is found, then the path is appended to the matched_paths list
6) update the matched_dict to include the lines and the start index and end index for each match in the matched path
7) format the output into an appropriate form and display the results of the find
from pathlib import Path
from pyinputplus import inputCustom, inputStr
import re
# 2)
def check_path(path):
    """
    returns the directory if the path is a directory, and it exists otherwise asks to input again
    """
    directory = Path(path)
    if directory.is_dir():
        return directory
    else:
        raise Exception('Path is not a valid directory')
def format_output(matched_paths, matched_dict):
    """
    - outputs the directories and matches for the string to be searched, only outputs the matched text files
    Formats the output into this template:
    file1.txt
    >>> --------------------------- <<<
    ------ Line 280 ------
    (start_index, end_index) # match 1
    (start_index, end_index) # match 2 if it exists
    <<< --------------------------- >>>
    file2.txt
    ...
    """
    for path in matched_paths:
        print()
        print(path.name)
        print('>>> --------------------------- <<<')
        for line, matches in matched_dict[path].items():
            print(f' ------ Line {line} ------')
            for match in matched_dict[path][line]:
                print(' ', match)
            print()
        print('<<< --------------------------- >>>')
def search_for_string(path, string):
    """
    1) opens the string
    2) makes a check on if a match exists in the file
    3) if it does, goes line by line
    4) appending all the matches (start_index, end_index) to the line number in the dict of path
    returns True if match was found, so it will be appended to the matched_paths list
    matched_paths = [path1, path2, path3]
    matched_dict = {path1: {line1: [(s1, e1), (s2, e2)] }, path2: ... }
    """
    global matched_dict
    with open(path, 'r') as text_file:
        if re.search(string, path.read_text()):
            matched_dict[path] = dict()
            for i, line in enumerate(text_file.readlines()):  # i refers to line number
                matches = list(re.finditer(string, line))
                if matches:
                    matched_dict[path][i] = []
                    for m in matches:
                        matched_dict[path][i].append(m.span())
            return True
        return False
# 1)
path = inputCustom(check_path, prompt='Enter Path: ')
string = inputStr(prompt='Enter string: ')
# 2)
matched_paths = []
matched_dict = dict()
# 3)
for text_file in path.glob('*.txt'):
    # 4) and 6)
    if search_for_string(text_file, string):
        # 5)
        matched_paths.append(text_file)
# 7)
format_output(matched_paths, matched_dict)
1 Answer
Documentation
"""
returns the directory if the path is a directory, and it exists otherwise asks to input again
"""
OK; but that's not what this function does at all. The "asking to input again" is done elsewhere.
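One way to make the docstring match the behaviour, as a minimal sketch: the function only validates and raises, and any re-prompting is left to whoever calls it. The choice of `ValueError` and the message wording are assumptions here; the original raises a bare `Exception`.

```python
from pathlib import Path


def check_path(path: str) -> Path:
    """
    Return the path as a Path if it is an existing directory.

    Raise ValueError otherwise; re-prompting is left to the caller.
    """
    directory = Path(path)
    if not directory.is_dir():
        # ValueError is an assumption; the original code raises Exception
        raise ValueError(f'{path!r} is not a valid directory')
    return directory
```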
Type hints
Guessing for this signature; type hints would remove the guesswork:
def format_output(matched_paths, matched_dict):
could become
def format_output(
    matched_paths: Iterable[str],
    matched_dict: Dict[
        str,
        Dict[str, str]
    ],
):
If this is true, the second type is complex enough that it could benefit from external declaration, i.e.
MatchesDict = Dict[
    str,
    Dict[str, str]
]
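Going by what the question's code appears to store, the outer keys look like `Path` objects, the inner keys like integer line indexes from `enumerate`, and the values like lists of `(start, end)` tuples from `m.span()`. A sketch of the hints under that reading follows; the alias names `Span` and `LineMatches` are illustrative, not from the original.

```python
from pathlib import Path
from typing import Dict, Iterable, List, Tuple

Span = Tuple[int, int]                 # (start_index, end_index) from m.span()
LineMatches = Dict[int, List[Span]]    # line index -> spans found on that line
MatchesDict = Dict[Path, LineMatches]  # file path -> matches per line


def format_output(matched_paths: Iterable[Path], matched_dict: MatchesDict) -> None:
    """Print each matched file name followed by its per-line match spans."""
    for path in matched_paths:
        print()
        print(path.name)
        print('>>> --------------------------- <<<')
        for line, matches in matched_dict[path].items():
            print(f' ------ Line {line} ------')
            for match in matches:
                print(' ', match)
            print()
        print('<<< --------------------------- >>>')
```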
Dictionary iteration
for path in matched_paths:
    for line, matches in matched_dict[path].items():
        for match in matched_dict[path][line]:
should be
for path in matched_paths:
    for line, matches in matched_dict[path].items():
        for match in matches:
In other words, items() gets you a key and a value; when you have the value, use it.
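A small runnable illustration of the point, using hypothetical sample data in the shape the question's code builds (plain strings stand in for the paths):

```python
# Hypothetical sample data; the real code uses Path objects as keys
matched_paths = ['file1.txt']
matched_dict = {
    'file1.txt': {0: [(4, 7)], 3: [(0, 3), (10, 13)]},
}

collected = []
for path in matched_paths:
    for line, matches in matched_dict[path].items():
        # `matches` is the very object stored at matched_dict[path][line],
        # so a second lookup would only repeat work
        assert matches is matched_dict[path][line]
        for match in matches:
            collected.append((path, line, match))

print(collected)
```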
Logic inversion
I find that this:
if re.search(string, path.read_text()):
    matched_dict[path] = dict()
    for i, line in enumerate(text_file.readlines()):  # i refers to line number
        matches = list(re.finditer(string, line))
        if matches:
            matched_dict[path][i] = []
            for m in matches:
                matched_dict[path][i].append(m.span())
    return True
return False
is more legible as
if not re.search(string, path.read_text()):
    return False
matched_dict[path] = dict()
for i, line in enumerate(text_file.readlines()):  # i refers to line number
    matches = list(re.finditer(string, line))
    if matches:
        matched_dict[path][i] = []
        for m in matches:
            matched_dict[path][i].append(m.span())
return True
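The same early-return shape in a self-contained form, as a sketch only. It assumes the function may return the per-line matches instead of writing to the global `matched_dict`, and it takes the text directly so that it runs without any files; `find_spans` is a hypothetical name, not part of the question's code.

```python
import re
from typing import Dict, List, Tuple


def find_spans(text: str, pattern: str) -> Dict[int, List[Tuple[int, int]]]:
    """Map each 0-based line index to the spans of `pattern` on that line."""
    # Guard clause: leave early when nothing matches anywhere in the text
    if not re.search(pattern, text):
        return {}

    spans: Dict[int, List[Tuple[int, int]]] = {}
    for i, line in enumerate(text.splitlines()):
        matches = [m.span() for m in re.finditer(pattern, line)]
        if matches:
            spans[i] = matches
    return spans


print(find_spans('foo bar\nbaz\nbar bar', 'bar'))
```

With the guard clause first, the main loop sits at the top indentation level of the function, which is the legibility gain described above.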
- On your point about dictionary iteration, matched_paths isn't a dictionary but a list of paths to the matched text files, so how am I supposed to iterate through its items() method? (Anonymous, Jun 19, 2020 at 20:35)
- My mistake; edited. (Reinderien, Jun 19, 2020 at 22:08)