search text using binary search in python

Question 1

I have few filenames in a list called data. I want to read the contents of the file and check if a given text (example - orange) appears in the file. My filenames are appended to the list in a sequential order i.e if given text "orange", appears in file pi.txt (index 2), it will be present in all the files after index 2 as well and off course i want to get the index or filename where text "orange" appeared first.

I have more than thousand files in a list, therefore i want to use binary search.

data = ['ae.txt', 'ac.txt', 'pi.txt', 'ad.txt', 'mm.txt', 'ab.txt']
target = "orange"
def binary_search(a, x):
 lo = 0
 hi = len(a)
 while lo < hi:
 mid = (lo + hi) // 2
 if not x in open(a[mid]).read():
 lo = mid + 1
 elif x in open(a[mid]).read():
 hi = mid
 elif mid > 0 and x in open(a[mid-1]).read():
 hi = mid
 else:
 return mid
 return -1
print(binary_search(data, target))
$ cat ae.txt
papaya
guava
$ cat ac.txt 
mango
durian
papaya
guava
$ cat pi.txt 
orange
papaya
guava
$ cat ad.txt 
orange
papaya
guava
$ cat mm.txt 
orange
papaya
guava
$ cat ab.txt 
orange
papaya
guava

Question 2

What is your question?

Question 3

if you run the code i pasted above, it does not give the expected result. In my case , result should be 2 because "orange" first appears in "pi.txt". Thanks.

Question 4

I think there is a bit too many if conditions, you can manage to get the expected result like this :

data = ['ae.txt', 'ac.txt', 'pi.txt', 'ad.txt', 'mm.txt', 'ab.txt']
target = "orange"
def binary_search(a, x):
 lo = 0
 hi = len(a)
 while lo < hi:
 mid = (lo + hi) // 2
 print(mid)
 if not x in open(a[mid]).read():
 lo = mid + 1
 elif x in open(a[mid]).read():
 hi = mid
 if lo == hi:
 return lo
 print("low : {}; high : {}".format(lo,hi))
 return -1
index = binary_search(data, target)
print("The index where we first found the word orange is {}, the file name is {}".format(index,data[index]))

The index where we first found the word orange is 2, the file name is pi.txt

Question 5

your solution works as expected. Thank you for your help !

Question 6

Your binary search is not really looking for equality so it could be simplified a bit:

def binary_search(files, string):
 lo,hi = 0,len(files)-1
 while hi>=lo:
 mid = (hi+lo)//2
 if string in open(files[mid]).read(): 
 hi = mid-1
 else: 
 lo = mid+1
 return lo

Since there is no equality check, hi and lo will hit the stop condition (hi>=lo) at which point lo will be on the index of the first match or at len(files) if there are no matches.

Question 7

your solution works as expected. Thank you for your help !

Question 8

Binary search on files only works if your files are already sorted by the search key you're using, meaning that file X[n+1] mustn't have data lexicographically smaller than file X[n]. In this case, you'll have to go through every file manually until you go thorough all of them... or build a dictionary file like this:

'ae.txt', 'ac.txt', 'pi.txt', 'ad.txt', 'mm.txt', 'ab.txt'
durian 1
guava 0-5
mango 1
orange 2-5
papaya 0-5

First line denotes files included and gives them indices via positioning(e.g. 'ae.txt' is at position 0). Other lines denote indices of files that include each word.

From here, you can read the first line for file names, binary search your way through the lines to find the word you're looking for (you should probably find a way to read a specific line in O(1), though - or put them in separate dictionary files for e.g. individual letters if there's too much of them). Then, determine if the file name's index (its location in the first line) is denoted in the word's line.

Making code that writes the initial file and updates it accordingly seems simple enough, but I can help you with that if you want to.

vlemaistre 3,34115 silver badges32 bronze badges · Accepted Answer · 2019-06-26 12:05:08Z

I think there is a bit too many if conditions, you can manage to get the expected result like this :

data = ['ae.txt', 'ac.txt', 'pi.txt', 'ad.txt', 'mm.txt', 'ab.txt']
target = "orange"
def binary_search(a, x):
 lo = 0
 hi = len(a)
 while lo < hi:
 mid = (lo + hi) // 2
 print(mid)
 if not x in open(a[mid]).read():
 lo = mid + 1
 elif x in open(a[mid]).read():
 hi = mid
 if lo == hi:
 return lo
 print("low : {}; high : {}".format(lo,hi))
 return -1
index = binary_search(data, target)
print("The index where we first found the word orange is {}, the file name is {}".format(index,data[index]))

The index where we first found the word orange is 2, the file name is pi.txt

CollectivesTM on Stack Overflow

search text using binary search in python

3 Answers 3

1 Comment

1 Comment

Comments

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

CollectivesTM on Stack Overflow

3 Answers 3

1 Comment

1 Comment

Comments

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related