I want to count the occurrences of a multi-line ASCII pattern within a file.
The pattern looks like this:
a1
b2
c3
The file I want to search looks like this (the _ characters represent whitespace; I thought this notation would be easier to read):
_ _ _ a1
_ _ _ b2
_ _ _ c3 _ _ _ _ a1
a1_ _ _ _ _ _ _ _ b2 _ _ _ _ a1
_ _ _ _ _ _ _ _ _ c3 _ _ _ _ b2
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ c3
The desired output should be 3 in this case. I solved this with a bunch of loops, counting characters until I find the first part of the pattern, then checking whether the second part has the same number of characters before it, to ensure the second part is directly underneath the first one, and so on.
_ _ _ _ a1 _ _ _ a1
_ _ _ _ b2 _ _ b2 _
_ _ _ _ c3 _ _ _ c3
In this example only one valid pattern is found, since the parts of the second one aren't exactly underneath each other.
The relevant code is as follows:
import sys


def load_data(path):
    data = []
    try:
        fp = open(path, 'r')
        for f_line in fp:
            data.append(f_line.rstrip())
    finally:
        fp.close()
    return data


bug_path = 'data/bug.txt'
landscape_path = 'data/landscape.txt'
bug = load_data(bug_path)
landscape = load_data(landscape_path)

findings = [[] for x in landscape]
min_len_bug = min([len(x) for x in bug])

for cnt, l_line in enumerate(landscape):
    if len(l_line) < min_len_bug:
        continue
    for bCnt, bLine in enumerate(bug):
        findings[cnt] = [(bCnt, ind) for ind in range(len(l_line)) if l_line.startswith(bLine, ind)] + findings[cnt]


def bugInLine(line, bug_part, whitespace):
    for entry in line:
        if entry[0] == bug_part and entry[1] == whitespace:
            return True


complete_bugs_cnt = 0
for cnt, l_line in enumerate(findings):
    for found in l_line:
        if found[0] == 0 and len(findings) > (cnt + len(bug) - 1):
            check = 1
            for i in range(1, len(bug)):
                if bugInLine(findings[cnt + i], i, found[1]):
                    check = check + 1
            if check == len(bug):
                complete_bugs_cnt = complete_bugs_cnt + 1

print complete_bugs_cnt
Since this isn't the most elegant solution, I'm wondering whether this problem could be solved with a regex, combined with the re.findall() method.
I'd highly appreciate any help.
1 Answer
data = []
try:
    fp = open(path, 'r')
    for f_line in fp:
        data.append(f_line.rstrip())
finally:
    fp.close()
return data
This can be simplified to:
data = []
with open(path) as fp:
    for f_line in fp:
        data.append(f_line.rstrip())
return data
The with statement in Python automatically closes the file once the block is done with it. It also closes the file if an exception is raised inside the block, which is the job the try/finally was doing.
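For reference, a minimal sketch of the complete function with the with block in place. It keeps the load_data name from the question; the list comprehension is a stylistic choice rather than a requirement, and the sketch is written for Python 3.

```python
def load_data(path):
    """Return the lines of the file at `path` with trailing whitespace removed."""
    with open(path) as fp:
        # Iterating over the file object yields one line at a time;
        # the file is closed when the with block exits, even on error.
        return [f_line.rstrip() for f_line in fp]
```

Returning from inside the with block is fine: the file is still closed before the caller receives the list.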
findings = [[] for x in landscape]
This initialization can be skipped if the algorithm is transformed.
for cnt, l_line in enumerate(findings):
    for found in l_line:
        if found[0] == 0 and len(findings) > (cnt + len(bug) - 1):
            check = 1
            for i in range(1, len(bug)):
                if bugInLine(findings[cnt + i], i, found[1]):
                    check = check + 1
            if check == len(bug):
                complete_bugs_cnt = complete_bugs_cnt + 1
This loop is close to text = [''.join(arr) for arr in zip(*text)] (turn the columns of the text into rows, then search) followed by sum(word.count(pattern_to_find) for word in text).
The current algorithm is too complicated. Consider first the simpler problem of checking whether a word is present in a single line. Given the line abc c d e a abc, the word abc occurs twice. .count() can be used to find how many times the pattern occurs; alternatively, split the line on whitespace and loop over the items, checking whether each one equals the word.
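A small sketch of both single-line approaches, using the example line from the paragraph above. The variable names are only illustrative, and the snippet is written for Python 3.

```python
line = "abc c d e a abc"
word = "abc"

# str.count counts non-overlapping occurrences of the substring.
substring_hits = line.count(word)

# Splitting on whitespace counts only items that equal the word exactly.
whole_word_hits = sum(1 for item in line.split() if item == word)

print(substring_hits, whole_word_hits)  # 2 2
```

The two approaches agree here, but they differ when the word is embedded in a longer token: "xabcx".count("abc") is 1, while the split-based check finds no match.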
However, the text is in columns instead of rows. The text can be rotated by -90 degrees, which turns it into the easy-to-parse row format.
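To illustrate the idea on a tiny grid: the sketch below uses zip(*rows), which strictly speaking transposes the grid rather than rotating it, but it has the same effect for this purpose, since each column becomes a string read from top to bottom. The rows list is made-up sample data with lines of equal length.

```python
rows = [
    "a  ",
    "b a",
    "c b",
    "  c",
]

# zip(*rows) groups the first characters of every row, then the second
# characters, and so on, so each column becomes one string.
columns = ["".join(chars) for chars in zip(*rows)]
print(columns)  # ['abc ', '    ', ' abc']

# A vertical "abc" is now an ordinary substring of a column string.
print(sum(column.count("abc") for column in columns))  # 2
```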
After applying these transformations, the code becomes the following (without file reading):
def find_text_column(haystack, pattern_to_find):
    # pattern_to_find is the pattern read top to bottom, one character per row
    text = haystack.splitlines()
    if not text:
        return 0
    maxlen = len(max(text, key=len))
    # make every line the same length, as zip() stops at the shortest line
    text = [line.ljust(maxlen, ' ') for line in text]
    # transpose the text so that each column becomes a row
    text = [''.join(arr) for arr in zip(*text)]
    return sum(word.count(pattern_to_find) for word in text)
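Note that this function treats the pattern as a single vertical string, one character per row, so it does not directly cover rows that are two characters wide, such as a1, b2, c3 in the question. A minimal sketch of one way to handle that case follows. It is not a regex solution; count_block and its parameters are hypothetical names, and it assumes both the landscape and the pattern are given as lists of lines, as load_data returns them.

```python
def count_block(landscape, pattern):
    """Count positions where every pattern row appears at the same column
    on consecutive landscape rows.

    `count_block` and its parameters are illustrative names, not part of
    the original code.
    """
    if not pattern:
        return 0
    height = len(pattern)
    count = 0
    # Only consider start rows that leave room for the whole pattern.
    for top in range(len(landscape) - height + 1):
        first = landscape[top]
        col = first.find(pattern[0])
        while col != -1:
            # str.startswith(prefix, start) tests the row at that column.
            if all(landscape[top + i].startswith(pattern[i], col)
                   for i in range(1, height)):
                count += 1
            col = first.find(pattern[0], col + 1)
    return count


landscape = [
    "    a1   a1",
    "    b2  b2 ",
    "    c3   c3",
]
print(count_block(landscape, ["a1", "b2", "c3"]))  # 1
```

The second a1 is not counted because the b2 below it starts one column to the left, which matches the behaviour described in the question.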