I have few filenames in a list called data. I want to read the contents of the file and check if a given text (example - orange) appears in the file. My filenames are appended to the list in a sequential order i.e if given text "orange", appears in file pi.txt (index 2), it will be present in all the files after index 2 as well and off course i want to get the index or filename where text "orange" appeared first.
I have more than thousand files in a list, therefore i want to use binary search.
data = ['ae.txt', 'ac.txt', 'pi.txt', 'ad.txt', 'mm.txt', 'ab.txt']
target = "orange"
def binary_search(a, x):
lo = 0
hi = len(a)
while lo < hi:
mid = (lo + hi) // 2
if not x in open(a[mid]).read():
lo = mid + 1
elif x in open(a[mid]).read():
hi = mid
elif mid > 0 and x in open(a[mid-1]).read():
hi = mid
else:
return mid
return -1
print(binary_search(data, target))
$ cat ae.txt
papaya
guava
$ cat ac.txt
mango
durian
papaya
guava
$ cat pi.txt
orange
papaya
guava
$ cat ad.txt
orange
papaya
guava
$ cat mm.txt
orange
papaya
guava
$ cat ab.txt
orange
papaya
guava
-
What is your question?Priyatham– Priyatham2019年06月26日 11:19:38 +00:00Commented Jun 26, 2019 at 11:19
-
if you run the code i pasted above, it does not give the expected result. In my case , result should be 2 because "orange" first appears in "pi.txt". Thanks.Sun– Sun2019年06月26日 11:23:34 +00:00Commented Jun 26, 2019 at 11:23
3 Answers 3
I think there is a bit too many if conditions, you can manage to get the expected result like this :
data = ['ae.txt', 'ac.txt', 'pi.txt', 'ad.txt', 'mm.txt', 'ab.txt']
target = "orange"
def binary_search(a, x):
lo = 0
hi = len(a)
while lo < hi:
mid = (lo + hi) // 2
print(mid)
if not x in open(a[mid]).read():
lo = mid + 1
elif x in open(a[mid]).read():
hi = mid
if lo == hi:
return lo
print("low : {}; high : {}".format(lo,hi))
return -1
index = binary_search(data, target)
print("The index where we first found the word orange is {}, the file name is {}".format(index,data[index]))
The index where we first found the word orange is 2, the file name is pi.txt
1 Comment
Your binary search is not really looking for equality so it could be simplified a bit:
def binary_search(files, string):
lo,hi = 0,len(files)-1
while hi>=lo:
mid = (hi+lo)//2
if string in open(files[mid]).read():
hi = mid-1
else:
lo = mid+1
return lo
Since there is no equality check, hi and lo will hit the stop condition (hi>=lo) at which point lo will be on the index of the first match or at len(files) if there are no matches.
1 Comment
Binary search on files only works if your files are already sorted by the search key you're using, meaning that file X[n+1] mustn't have data lexicographically smaller than file X[n]. In this case, you'll have to go through every file manually until you go thorough all of them... or build a dictionary file like this:
'ae.txt', 'ac.txt', 'pi.txt', 'ad.txt', 'mm.txt', 'ab.txt'
durian 1
guava 0-5
mango 1
orange 2-5
papaya 0-5
First line denotes files included and gives them indices via positioning(e.g. 'ae.txt' is at position 0). Other lines denote indices of files that include each word.
From here, you can read the first line for file names, binary search your way through the lines to find the word you're looking for (you should probably find a way to read a specific line in O(1), though - or put them in separate dictionary files for e.g. individual letters if there's too much of them). Then, determine if the file name's index (its location in the first line) is denoted in the word's line.
Making code that writes the initial file and updates it accordingly seems simple enough, but I can help you with that if you want to.