Match a multi line text block

Question 1

I want to count the occurrences of a multi line ASCII pattern within a file.

The pattern looks like this:

a1
b2
c3

The file I want to search looks like this (each _ represents a whitespace character; I thought the notation would make it easier to read):

_ _ _ a1
_ _ _ b2
_ _ _ c3 _ _ _ _ a1
a1_ _ _ _ _ _ _ _ b2 _ _ _ _ a1 
_ _ _ _ _ _ _ _ _ c3 _ _ _ _ b2
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ c3

The desired output should be 3 in this case. I solved this with a bunch of loops, counting symbols until I find the first part of the pattern. Then checking if the second part has the same count of numbers before to ensure the second part is underneath the first one and so on.

_ _ _ _ a1 _ _ _ a1
_ _ _ _ b2 _ _ b2 _ 
_ _ _ _ c3 _ _ _ c3

In this example there is only one valid pattern found since the parts of the second aren't exactly under each other.

The relevant code is as follows:

import sys
def load_data(path):
    data = []
    try:
        fp = open(path, 'r')
        for f_line in fp:
            data.append(f_line.rstrip())
    finally:
        fp.close()
    return data
bug_path = 'data/bug.txt'
landscape_path = 'data/landscape.txt'
bug = load_data(bug_path)
landscape = load_data(landscape_path)
findings = [[] for x in landscape]
min_len_bug = min([len(x) for x in bug])
for cnt, l_line in enumerate(landscape):
    if len(l_line) < min_len_bug:
        continue
    for bCnt, bLine in enumerate(bug):
        findings[cnt] = [(bCnt, ind) for ind in range(len(l_line)) if l_line.startswith(bLine, ind)] + findings[cnt]
def bugInLine(line, bug_part, whitespace):
    for entry in line:
        if entry[0] == bug_part and entry[1] == whitespace:
            return True
complete_bugs_cnt = 0
for cnt, l_line in enumerate(findings):
    for found in l_line:
        if found[0] == 0 and len(findings) > (cnt + len(bug) - 1):
            check = 1
            for i in range(1, len(bug)):
                if bugInLine(findings[cnt + i], i, found[1]):
                    check = check + 1
            if check == len(bug):
                complete_bugs_cnt = complete_bugs_cnt + 1
print complete_bugs_cnt

Since this isn't the most elegant solution I'm wondering if there is a possibility to solve this problem by using some regex code and combine this with the re.findall()-method.

I'd highly appreciate any help.

Question 2

What are the edge cases? If the 2nd file does not contain the first 2 lines (as you posted), _ _ _ a1\n _ _ _ b2, what should the result be?

Question 3

If the first 2 lines are missing in the 2nd file, the output should be 2 since there are only 2 occurrences of the full pattern. If the last two lines are missing, the output should be 1 since there is only one full pattern there.

Question 4

The idea formulated as "Then checking if the second part has the same count of numbers before to ensure the second part is underneath the first one and so on" is unclear to me.

Question 5

I wanted to describe that it's important for this problem that the pattern is there with each of its parts underneath each other. I'll try to edit my question to make it clearer.

Question 6

@RomanPerekhrest I added another example to make my description more clear.

Question 7

    data = []
    try:
        fp = open(path, 'r')
        for f_line in fp:
            data.append(f_line.rstrip())
    finally:
        fp.close()
    return data

This can be simplified to:

    data = []
    with open(path) as fp:
        for f_line in fp:
            data.append(f_line.rstrip())
    return data

The with statement in Python automatically closes the file once the block is finished. It also closes the file if an exception is raised inside the block, which is the job the try/finally was doing.
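The closing behaviour can be checked directly. The following is a small sketch, not part of the reviewed code: it writes a throwaway file (the temporary path and the simulated ValueError are assumptions made purely for the demonstration), raises inside the with block, and then inspects fp.closed.

```python
import os
import tempfile

# Create a small throwaway file to read from (hypothetical sample data).
fd, path = tempfile.mkstemp(text=True)
with os.fdopen(fd, "w") as out:
    out.write("a1\nb2\nc3\n")

try:
    with open(path) as fp:
        first = fp.readline().rstrip()
        # Simulate something going wrong while the file is still open.
        raise ValueError("simulated failure while reading")
except ValueError:
    pass

# The with block has already closed the file, even though it exited via an exception.
closed_after_error = fp.closed
os.remove(path)
print(first, closed_after_error)  # a1 True
```

No explicit fp.close() is needed in either the normal or the failing path.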

findings = [[] for x in landscape]

This initialization can be skipped if the algorithm is transformed.

for cnt, l_line in enumerate(findings):
    for found in l_line:
        if found[0] == 0 and len(findings) > (cnt + len(bug) - 1):
            check = 1
            for i in range(1, len(bug)):
                if bugInLine(findings[cnt + i], i, found[1]):
                    check = check + 1
            if check == len(bug):
                complete_bugs_cnt = complete_bugs_cnt + 1

This is close to transposing the text so that columns become rows, text = [''.join(arr) for arr in zip(*text)], and then counting the matches with sum([word.count(pattern_to_find) for word in text]).

The current algorithm is too complicated. Consider an algorithm which checks if a word is present in a single line. Given the line abc c d e a abc, the word abc occurs twice. .count() can be used to find how many times that pattern occurs; alternatively, split the string on whitespace and loop over the items, checking whether each one is equal to the word.
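Both single-line options can be tried on the example line. This is only an illustration; the second string, glued, is a made-up input that shows where the two approaches differ.

```python
line = "abc c d e a abc"
word = "abc"

# str.count counts non-overlapping occurrences of the substring.
substring_hits = line.count(word)

# Splitting on whitespace first counts only items that equal the word exactly.
whole_word_hits = line.split().count(word)

print(substring_hits, whole_word_hits)  # 2 2

# The two differ when the word appears inside a longer token.
glued = "abcabc x"
print(glued.count(word), glued.split().count(word))  # 2 0
```

Which of the two is appropriate depends on whether a match embedded in a longer token should count.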

However, the text is in columns instead of rows. Transposing the text (turning each column into a row) puts it into the easy-to-parse row format, where a plain substring count is enough.
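The column-to-row step can be sketched on a tiny input. Note that zip(*rows) is strictly a transposition rather than a rotation: it keeps every column in top-to-bottom order, which is what the search needs. The sample lines below are made up for illustration.

```python
lines = [
    "  a",
    "  b  a",
    "  c  b",
    "     c",
]

# zip() stops at the shortest input, so pad every line to the same width first.
width = max(len(line) for line in lines)
padded = [line.ljust(width) for line in lines]

# Each tuple produced by zip(*padded) holds one column, read top to bottom.
columns = ["".join(chars) for chars in zip(*padded)]

print(columns)  # ['    ', '    ', 'abc ', '    ', '    ', ' abc']
print(sum(column.count("abc") for column in columns))  # 2
```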

After applying these transformations, the code becomes the following (without file reading):

def find_text_column(haystack, pattern_to_find):
    text = haystack.splitlines()
    maxlen = len(max(text, key=len))
    # pad every line to the same length, because zip() stops at the shortest line
    text = [line.ljust(maxlen, ' ') for line in text]
    # transpose the text so that each column becomes a row
    text = [''.join(arr) for arr in zip(*text)]
    return sum([word.count(pattern_to_find) for word in text])
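As far as I can tell, find_text_column looks at one transposed column at a time, so it fits patterns whose rows are a single character wide (for example a, b, c stacked vertically, searched as "abc"). The pattern in the question has rows two characters wide (a1, b2, c3). The sketch below is one possible variation for that case and is not part of the original answer: count_block and its parameter names are hypothetical. It skips the transposition and instead checks, for every start line and column offset, that each pattern row begins at the same offset on consecutive lines.

```python
def count_block(haystack, pattern):
    """Count positions where every row of `pattern` sits directly below the previous one."""
    lines = haystack.splitlines()
    rows = pattern.splitlines()
    if not rows:
        return 0
    total = 0
    # Only start lines that leave room for the whole pattern are considered.
    for top in range(len(lines) - len(rows) + 1):
        first = lines[top]
        col = first.find(rows[0])
        while col != -1:
            # str.startswith(prefix, start) tests for a match at a fixed column.
            if all(lines[top + i].startswith(row, col) for i, row in enumerate(rows)):
                total += 1
            col = first.find(rows[0], col + 1)
    return total


# The first landscape from the question, with spaces written out explicitly.
landscape = "\n".join([
    " " * 3 + "a1",
    " " * 3 + "b2",
    " " * 3 + "c3" + " " * 5 + "a1",
    "a1" + " " * 8 + "b2" + " " * 4 + "a1",
    " " * 10 + "c3" + " " * 4 + "b2",
    " " * 16 + "c3",
])
print(count_block(landscape, "a1\nb2\nc3"))  # 3
```

Because the column offset is compared directly, a pattern whose middle row is shifted by one character is not counted, which matches the second example in the question.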

alexyorke · Accepted Answer · 2019-12-05 23:32:02Z
