Looking for specific words in a string and noting the order

Question 1

The problem:

I have some text that I go through line by line. If a line contains the keyword DATA_TYPE_1 and/or DATA_TYPE_2, I know that all the following values are of that data type. Right now I keep track of whether the last keyword I saw was DATA_TYPE_1 or DATA_TYPE_2. If both keywords appear in the same line, I stop checking for further keywords and only record the order in which the two appeared, from left to right. Three example inputs, each followed by the values I expect to extract:

ALL VALUES BELOW ARE DATA_TYPE_1
1
2
3
4
NOW ALL VALUES ARE DATA_TYPE_2
5
6
7
8

data_type_1 values = 1, 2, 3, and 4

data_type_2 values = 5, 6, 7, and 8

ALL VALUES ARE DATA_TYPE_2
1
2
3
NOW VALUES ARE DATA_TYPE_1
4
5
6

data_type_2 values = 1, 2, and 3

data_type_1 values = 4, 5, 6

ALL VALUES BELOW ARE DATA_TYPE_2 AND DATA_TYPE_1, RESPECTIVELY
1 2
3 4
5 6
7 8

data_type_1 values = 2, 4, 6, and 8

data_type_2 values = 1, 3, 5, and 7

The approach

Essentially, I look at a given line and use regular expressions to identify DATA_TYPE_1 and DATA_TYPE_2. If both are present, I want to know what order they are in. If possible, I would like to simplify the conditional logic in the following function, determine_data_type:

edit: I did not realize I should include the regexes. To explain what I am trying to do, I referred to the two keywords as data_type_1 and data_type_2, but they are actually achiral and chiral. I've included the regexes below.

import re

REGEX_1 = re.compile(r'(?i)(\bachiral\b)')
REGEX_2 = re.compile(r'(?i)(\bchiral\b)')

def determine_data_type(text, type_array):
    '''
    Determines which keywords are present in a given string
    Parameters:
        text: str
            Line of text to examine for DATA_TYPE_1 and DATA_TYPE_2
        type_array: List[bool]
            Initial type of data
    Returns:
        type_array: List[bool]
            Updated type_array
            A list of bools: [type_1, type_2, both_types]
            If only type_1 is present: [True, False, False]
            If only type_2 is present: [False, True, False]
            If type_1 and type_2 appear: [True, False, True] or [False, True, True]
            depending on which appears first
    '''
    type_1, type_2, both_types = type_array
    # only see if data type needs updating if
    # (1) haven't found data type keywords yet or
    # (2) I expect the data type to switch from 1 to 2 or vice versa
    if not both_types:
        if not type_1 and not type_2:
            if re.search(REGEX_1, text):
                type_1 = True
            if re.search(REGEX_2, text):
                type_2 = True
            if type_1 and type_2:
                both_types = True
                # get the positions of both words
                type_1_pos = re.search(REGEX_1, text).start()
                type_2_pos = re.search(REGEX_2, text).start()
                if type_1_pos > type_2_pos:
                    # type_1 is not first
                    type_1 = False
                else:
                    type_2 = False
        elif type_1 and not type_2:
            if re.search(REGEX_2, text):
                type_2 = True
                if not re.search(REGEX_1, text):
                    type_1 = False
                else:
                    both_types = True
        elif type_2 and not type_1:
            if re.search(REGEX_1, text):
                type_1 = True
                if not re.search(REGEX_2, text):
                    type_2 = False
                else:
                    both_types = True
    # return the updated flags
    return [type_1, type_2, both_types]

if __name__ == '__main__':
    text = ["blah, blah blah",
            "OTHER COLUMN HEADER achiral chiral",
            "blah blah blah"]
    # at first we don't know what type of data we are looking at
    data_type = [False, False, False]  # [type_1, type_2, both_types]
    for line in text:
        data_type = determine_data_type(line, data_type)
    print(data_type)  # [True, False, True]

Advice needed

As you can see, I repeat a lot of the code above. For example, I search twice for both type_1 and type_2 in order to get the position of each word when they appear on the same line. Also, in the cases where only type_1 is True or only type_2 is True, I was trying to think of a way to search only for the "Falsey" one and, if it is found, quickly check whether the "Truthy" one happens to be there as well.

I am using Python 3.6

Question 2

What version of Python are you using? Add the "Python-2.x" or "Python-3.x" tag to your question. But also, please tell us what the exact version is: 2.7, 3.4, 3.8

Question 3

regex_1 and regex_2 are not included in your code. You are also missing the import re which you must be using. Include your entire code or your question may be put on hold.

Question 4

@AJNeufeld thanks, updated with all relevant information (including that I am using Python 3.6)

Question 5

Data Representation

I think [False, True, True] is a very confusing representation of both data types in "reverse" order. Let's revisit that.

You have 2 types, and a line which may contain none, one, or both types.

types = {'DATA_TYPE_1', 'DATA_TYPE_2'}
line = "ALL VALUES BELOW ARE DATA_TYPE_2 AND DATA_TYPE_1, RESPECTIVELY"

Let's use a regex that will split the line up into individual words.

import re
word_re = re.compile(r'\w+')

Now, what we want to do is extract all the words from the line, keeping only the ones that represent the types we are looking for, keeping the words in the order they were in the line:

order = [word for word in word_re.findall(line) if word in types]
>>> order
['DATA_TYPE_2', 'DATA_TYPE_1']

Or, with your updated question, it looks like there aren't commas or other punctuation to get in the way of a simple line.split(), so we can omit the regular expression:

types = {"achiral", "chiral"}
line = "OTHER COLUMN HEADER achiral chiral"
order = [word for word in line.split() if word in types]
>>> order
['achiral', 'chiral']

A list like this makes the field order obvious. If you also maintain a list of all the types found so far, extending it as new types are found, then once its length reaches the number of types (2) you know you have encountered all (both) of the types.

def determine_data_type(text, found):
    found.extend(word for word in text.split() if word in types)
    return len(found) == len(types)

types = {"achiral", "chiral"}
line = "OTHER COLUMN HEADER achiral chiral"
found = []
all_found = determine_data_type(line, found)
>>> found
['achiral', 'chiral']
>>> all_found
True
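
To connect this back to the three sample inputs in the question, here is a minimal sketch of how an ordered list of keywords could drive the actual value extraction. It is an illustration under assumptions, not part of the code above: `parse_values`, `TYPES` and `WORD_RE` are hypothetical names, every non-header line is assumed to hold whitespace-separated integers, and each header line is assumed to replace the previous column order.

```python
import re
from collections import defaultdict

TYPES = {"DATA_TYPE_1", "DATA_TYPE_2"}   # keywords to look for
WORD_RE = re.compile(r"\w+")             # splits a line into words, ignoring punctuation


def parse_values(lines):
    """Group values by type, using the keyword order of the latest header line.

    Hypothetical helper: assumes data lines contain only integers.
    """
    values = defaultdict(list)
    order = []  # types named by the most recent header, left to right
    for line in lines:
        found = [word for word in WORD_RE.findall(line) if word in TYPES]
        if found:
            # header line: remember the new column order and move on
            order = found
            continue
        # data line: pair each column with the type in the same position
        for type_name, value in zip(order, line.split()):
            values[type_name].append(int(value))
    return dict(values)


lines = [
    "ALL VALUES BELOW ARE DATA_TYPE_2 AND DATA_TYPE_1, RESPECTIVELY",
    "1 2",
    "3 4",
]
print(parse_values(lines))  # {'DATA_TYPE_2': [1, 3], 'DATA_TYPE_1': [2, 4]}
```

Because the order is an ordinary list, the single-keyword and two-keyword cases go through the same loop, and no boolean flags are needed.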

Enums

Using strings to represent data types is awkward. When you have a finite set of named items, enum should be the tool you reach for.

from enum import Enum
Type = Enum('Type', 'ACHIRAL CHIRAL')

def determine_data_type(text, found):
    found.extend(Type[word] for word in text.upper().split() if word in Type.__members__)
    return len(found) == len(Type)

line = "OTHER COLUMN HEADER achiral chiral"
found = []
all_found = determine_data_type(line, found)
>>> found
[<Type.ACHIRAL: 1>, <Type.CHIRAL: 2>]
>>> all_found
True

Being able to use Type.ACHIRAL or Type.CHIRAL as named constants in your program, instead of strings, which can be mistyped without raising any error, makes programs safer and can make them faster.
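
As a small illustration of the safety point, the sketch below contrasts a mistyped string with a mistyped enum member. The `lookup` helper is a hypothetical name introduced only for this example.

```python
from enum import Enum

Type = Enum('Type', 'ACHIRAL CHIRAL')

# A mistyped string is still a valid string: the comparison quietly fails.
print("achirl" == "achiral")  # False

# A mistyped enum member does not exist, so the mistake surfaces immediately.
try:
    Type.ACHIRL
except AttributeError as error:
    print("caught:", error)


def lookup(word):
    """Return the matching Type member, or None for an unknown word (hypothetical helper)."""
    return Type.__members__.get(word.upper())


print(lookup("achiral"))  # Type.ACHIRAL
print(lookup("achirl"))   # None
```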

From comment:

Let's say that the keywords I am looking for are not exactly always the same. Instead of just chiral and achiral the words I am looking for could also be chirality and achirality or chiral and not chiral or chiral and a-chiral. With these two keywords it's hard to come up with examples, but in the case where maybe it's more difficult to keep a list of finite set of words to look for but the keywords all have a similar 'root' word, how would you modify your approach?

With chiral/chirality and achiral/achirality, you could just use the Enum type's support for aliases: a name that repeats an existing value becomes an alternative name for the same member.

from enum import Enum

class Type(Enum):
    CHIRAL = 1
    ACHIRAL = 2
    CHIRALITY = 1
    ACHIRALITY = 2

def determine_data_type(text, found):
    found.extend(Type[word] for word in text.upper().split() if word in Type.__members__)
    return len(found) == len(Type)

line = "OTHER COLUMN HEADER achiral chirality"
found = []
all_found = determine_data_type(line, found)
>>> found
[<Type.ACHIRAL: 2>, <Type.CHIRAL: 1>]
>>> all_found
True

len(Type) == 2 because there are only two enum values, but len(Type.__members__) == 4 because there are 4 names for those two values, so you can safely use variants of the name.
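
The alias behaviour described here can be checked directly. This short sketch redefines the same Type class so that it runs on its own:

```python
from enum import Enum


class Type(Enum):
    CHIRAL = 1
    ACHIRAL = 2
    CHIRALITY = 1    # alias of CHIRAL (same value)
    ACHIRALITY = 2   # alias of ACHIRAL (same value)


# An alias resolves to the very same member object.
print(Type.CHIRALITY is Type.CHIRAL)      # True
print(Type['ACHIRALITY'])                 # Type.ACHIRAL

# Iteration and len() skip aliases; __members__ includes them.
print([member.name for member in Type])   # ['CHIRAL', 'ACHIRAL']
print(list(Type.__members__))             # ['CHIRAL', 'ACHIRAL', 'CHIRALITY', 'ACHIRALITY']
```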

For not chiral or a-chiral, you'll have to use a regex that detects the whole term, with spaces and/or special characters.

# non-capturing groups (?:...) make findall() return each whole matched term
regex = re.compile(r"(?i)\b(?:not |a-|a)?chiral(?:ity)?\b")
for term in regex.findall(text):
    ...

You can't use Type[term] to map those terms to the Type(Enum) directly, since the enum identifiers can't have spaces or special characters. But you could create your own dictionary to map the terms to the enum types.

Types = {
    'not chiral': Type.ACHIRAL,
    'a-chiral': Type.ACHIRAL,
    ...
}
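
Putting the regex and the dictionary together, one possible sketch looks like the following. The names `TERM_RE`, `TERM_TO_TYPE` and `find_types` are hypothetical, and the dictionary entries are only an assumption about which spellings should count as which type. The pattern uses non-capturing groups so that findall() returns the whole matched term rather than the contents of the groups, and the term is lower-cased before the lookup because the pattern is case-insensitive.

```python
import re
from enum import Enum


class Type(Enum):
    CHIRAL = 1
    ACHIRAL = 2


# Optional prefix ("not ", "a-" or "a"), the root "chiral", optional suffix "ity".
TERM_RE = re.compile(r"(?i)\b(?:not |a-|a)?chiral(?:ity)?\b")

# Hypothetical mapping: one entry for every spelling the pattern can match.
TERM_TO_TYPE = {
    'chiral': Type.CHIRAL,
    'chirality': Type.CHIRAL,
    'achiral': Type.ACHIRAL,
    'achirality': Type.ACHIRAL,
    'a-chiral': Type.ACHIRAL,
    'a-chirality': Type.ACHIRAL,
    'not chiral': Type.ACHIRAL,
    'not chirality': Type.ACHIRAL,
}


def find_types(text):
    """Return the types named in text, in left-to-right order (hypothetical helper)."""
    return [TERM_TO_TYPE[term.lower()] for term in TERM_RE.findall(text)]


print(find_types("HEADER Not Chiral chirality"))
# [<Type.ACHIRAL: 2>, <Type.CHIRAL: 1>]
```

Note that the "not " prefix in this pattern expects exactly one space, so a header with irregular spacing would need a looser pattern and a normalisation step before the dictionary lookup.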

Question 6

I learned a lot from this. Thank you! I have a follow up question - let's say that the keywords I am looking for are not exactly always the same. Instead of just chiral and achiral the words I am looking for could also be chirality and achirality or chiral and not chiral or chiral and a-chiral. With these two keywords it's hard to come up with examples, but in the case where maybe it's more difficult to keep a list of finite set of words to look for but the keywords all have a similar 'root' word, how would you modify your approach?

Accepted answer by AJNeufeld, posted 2020-02-28 06:16:15Z

