The problem:
I have some text that I go through line by line. If I find a line containing the keyword DATA_TYPE_1 and/or DATA_TYPE_2, I know all the following values are that data type. Right now I keep track of whether the last keyword I saw was DATA_TYPE_1 or DATA_TYPE_2. (If I see both in the same line, I stop checking whether the other data type keyword has come up and just keep track of the order in which the keywords appeared, from left to right.)
ALL VALUES BELOW ARE DATA_TYPE_1
1
2
3
4
NOW ALL VALUES ARE DATA_TYPE_2
5
6
7
8
data_type_1 values = 1, 2, 3, and 4
data_type_2 values = 5, 6, 7, and 8
ALL VALUES ARE DATA_TYPE_2
1
2
3
NOW VALUES ARE DATA_TYPE_1
4
5
6
data_type_2 values = 1, 2, and 3
data_type_1 values = 4, 5, 6
ALL VALUES BELOW ARE DATA_TYPE_2 AND DATA_TYPE_1, RESPECTIVELY
1 2
3 4
5 6
7 8
data_type_1 values = 2, 4, 6, and 8
data_type_2 values = 1, 3, 5, and 7
The approach
Essentially, I look at a given line and use regular expressions to identify DATA_TYPE_1 and DATA_TYPE_2. If both are present, I want to know what order they are in. If possible, I would like to clean up some of the logic in the following function, determine_data_type:
Edit: I did not realize I should include the regexes. To explain what I am trying to do, I referred to the two keywords I am looking for as data_type_1 and data_type_2, but they are actually achiral and chiral. I've included the regexes below.
import re
REGEX_1 = re.compile(r'(?i)(\bachiral\b)')
REGEX_2 = re.compile(r'(?i)(\bchiral\b)')
def determine_data_type(text, type_array):
    '''
    Determines which keywords are present in a given string

    Parameters:
        text: str
            Line of text to examine for DATA_TYPE_1 and DATA_TYPE_2
        type_array: List[bool]
            Initial type of data

    Returns:
        type_array: List[bool]
            Updated type_array
            A list of bools: [type_1, type_2, both_types]
            If only type_1 is present: [True, False, False]
            If only type_2 is present: [False, True, False]
            If type_1 and type_2 appear: [True, False, True] or [False, True, True]
            depending on which appears first
    '''
    type_1, type_2, both_types = type_array
    # only see if data type needs updating if
    # (1) haven't found data type keywords yet or
    # (2) I expect the data type to switch from 1 to 2 or vice versa
    if not both_types:
        if not type_1 and not type_2:
            if re.search(REGEX_1, text):
                type_1 = True
            if re.search(REGEX_2, text):
                type_2 = True
            if type_1 and type_2:
                both_types = True
                # get the positions of both words
                type_1_pos = re.search(REGEX_1, text).start()
                type_2_pos = re.search(REGEX_2, text).start()
                if type_1_pos > type_2_pos:
                    # type_1 is not first
                    type_1 = False
                else:
                    type_2 = False
        elif type_1 and not type_2:
            if re.search(REGEX_2, text):
                type_2 = True
                if not re.search(REGEX_1, text):
                    type_1 = False
                else:
                    both_types = True
        elif type_2 and not type_1:
            if re.search(REGEX_1, text):
                type_1 = True
                if not re.search(REGEX_2, text):
                    # mirror of the branch above: the data switched to type_1
                    type_2 = False
                else:
                    both_types = True
    # return the updated flags (the input list is never modified in place)
    return [type_1, type_2, both_types]
if __name__ == '__main__':
    text = ["blah, blah blah",
            "OTHER COLUMN HEADER achiral chiral",
            "blah blah blah"]
    # at first we don't know what type of data we are looking at
    data_type = [False, False, False]  # [type_1, type_2, both_types]
    for line in text:
        data_type = determine_data_type(line, data_type)
    print(data_type)  # [True, False, True]
Advice needed
As you can see, I repeat a lot of the code above, for example when I search twice for both type_1 and type_2 in order to get the position of each word when they appear on the same line. Also, in the cases where only type_1 is True or only type_2 is True, I was trying to think of a way to search only for the "falsy" one and, if it is found, quickly check whether the "truthy" one happens to be there as well.
I am using Python 3.6.
1 Answer
Data Representation
I think [False, True, True] is a very confusing representation of both data types in "reverse" order. Let's revisit that.
You have 2 types, and a line which may contain none, one, or both types.
types = {'DATA_TYPE_1', 'DATA_TYPE_2'}
line = "ALL VALUES BELOW ARE DATA_TYPE_2 AND DATA_TYPE_1, RESPECTIVELY"
Let's use a regex that will split the line up into individual words.
import re
word_re = re.compile(r'\w+')
Now, what we want to do is extract all the words from the line, keeping only the ones that represent the types we are looking for, and keeping the words in the order they were in the line:
order = [word for word in word_re.findall(line) if word in types]
>>> order
['DATA_TYPE_2', 'DATA_TYPE_1']
Or, with your updated question, it looks like there aren't commas or other punctuation to get in the way of a simple line.split(), so we can omit the regular expression:
types = {"achiral", "chiral"}
line = "OTHER COLUMN HEADER achiral chiral"
order = [word for word in line.split() if word in types]
>>> order
['achiral', 'chiral']
If you produce this list, it is quite clear what the field order is. If you also maintain a list of all the types which have been found, adding new types as they are found, then once the list size reaches the number of types (2) you've encountered all (both) of the types. Note that this length check assumes each keyword shows up at most once.
def determine_data_type(text, found):
    found.extend(word for word in text.split() if word in types)
    return len(found) == len(types)
types = {"achiral", "chiral"}
line = "OTHER COLUMN HEADER achiral chiral"
found = []
all_found = determine_data_type(line, found)
>>> found
['achiral', 'chiral']
>>> all_found
True
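To tie the ordered list back to the three examples in the question, here is a minimal sketch of how it could drive the actual bucketing of values. The helper name collect_values is hypothetical, and two things are assumptions rather than something stated in the question: values are whitespace-separated, and a new header line replaces the previous column order.

```python
from collections import defaultdict

TYPES = {"DATA_TYPE_1", "DATA_TYPE_2"}


def collect_values(lines):
    """Group values by data type, following the most recent header line."""
    order = []                    # types named by the latest header, left to right
    values = defaultdict(list)
    for line in lines:
        # Strip trailing punctuation so "DATA_TYPE_1," still counts as a keyword.
        words = [word.strip(",.") for word in line.split()]
        header = [word for word in words if word in TYPES]
        if header:
            order = header        # a new header replaces the previous column order
        else:
            # zip pairs each column with its type; extra columns are ignored
            for data_type, value in zip(order, words):
                values[data_type].append(value)
    return dict(values)


lines = [
    "ALL VALUES BELOW ARE DATA_TYPE_2 AND DATA_TYPE_1, RESPECTIVELY",
    "1 2",
    "3 4",
]
print(collect_values(lines))
# {'DATA_TYPE_2': ['1', '3'], 'DATA_TYPE_1': ['2', '4']}
```

Because the order list is just the keywords in left-to-right order, the single-keyword and two-keyword cases go through the same code path, and no boolean flags are needed.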
Enums
Using strings to represent data types is awkward. When you have a finite set of named items, enum should be the tool you reach for.
from enum import Enum
Type = Enum('Type', 'ACHIRAL CHIRAL')
def determine_data_type(text, found):
    found.extend(Type[word] for word in text.upper().split() if word in Type.__members__)
    return len(found) == len(Type)
line = "OTHER COLUMN HEADER achiral chiral"
found = []
all_found = determine_data_type(line, found)
>>> found
[<Type.ACHIRAL: 1>, <Type.CHIRAL: 2>]
>>> all_found
True
Being able to use Type.ACHIRAL or Type.CHIRAL as named constants in your program, instead of using strings which can be mistyped, will result in safer and faster programs.
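As a small illustration of the "safer" part, the sketch below contrasts a misspelled string with a misspelled enum member. The misspellings are of course made up for the example.

```python
from enum import Enum

Type = Enum('Type', 'ACHIRAL CHIRAL')

data_type = Type.CHIRAL

# A misspelled string simply compares unequal, with no warning.
print(data_type == "chirl")      # False

# A misspelled member name fails loudly at the point of use.
try:
    Type.CHIRL
except AttributeError as error:
    print("caught:", error)
```

The typo in the string version can go unnoticed for a long time, whereas the enum version raises as soon as that line runs.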
From comment:
Let's say that the keywords I am looking for are not exactly always the same. Instead of just chiral and achiral the words I am looking for could also be chirality and achirality or chiral and not chiral or chiral and a-chiral. With these two keywords it's hard to come up with examples, but in the case where maybe it's more difficult to keep a list of finite set of words to look for but the keywords all have a similar 'root' word, how would you modify your approach?
With chiral/chirality and achiral/achirality, you could just use the Enum type's ability to have aliases.
from enum import Enum
class Type(Enum):
    CHIRAL = 1
    ACHIRAL = 2
    CHIRALITY = 1
    ACHIRALITY = 2
def determine_data_type(text, found):
    found.extend(Type[word] for word in text.upper().split() if word in Type.__members__)
    return len(found) == len(Type)
line = "OTHER COLUMN HEADER achiral chirality"
found = []
all_found = determine_data_type(line, found)
>>> found
[<Type.ACHIRAL: 2>, <Type.CHIRAL: 1>]
>>> all_found
True
len(Type) == 2 because there are only two enum values, but len(Type.__members__) == 4 because there are 4 names for those two values, so you can safely use variants of the name.
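A short sketch of that alias behaviour, using the same class definition as above:

```python
from enum import Enum


class Type(Enum):
    CHIRAL = 1
    ACHIRAL = 2
    CHIRALITY = 1     # alias for CHIRAL (same value)
    ACHIRALITY = 2    # alias for ACHIRAL (same value)


# Looking up an alias returns the canonical member.
print(Type['CHIRALITY'] is Type.CHIRAL)   # True

# Iteration and len() skip aliases; __members__ includes them.
print(list(Type))              # [<Type.CHIRAL: 1>, <Type.ACHIRAL: 2>]
print(list(Type.__members__))  # ['CHIRAL', 'ACHIRAL', 'CHIRALITY', 'ACHIRALITY']
```

This is why the membership test uses Type.__members__ while the "have we found everything" test uses len(Type).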
For not chiral or a-chiral, you'll have to use a regex that detects the whole term, with spaces and/or special characters.
# non-capturing groups, so findall() returns each whole term instead of tuples of groups
regex = re.compile(r"(?i)\b(?:not |a-|a)?chiral(?:ity)?\b")
for term in regex.findall(text):
    ...
You can't use Type[term] to map those terms to the Type(Enum) directly, since the enum identifiers can't have spaces or special characters. But you could create your own dictionary to map the terms to the enum types.
Types = {'not chiral': Type.ACHIRAL,
         'a-chiral': Type.ACHIRAL,
         ...
         }
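Putting the regex and the dictionary together might look like the sketch below. The names TERM_RE, TERM_TO_TYPE and find_types are hypothetical, and the table entries are only the spellings mentioned in the comment, so treat them as an assumption to adapt to your data. The regex uses non-capturing groups so that each match is the whole term.

```python
import re
from enum import Enum


class Type(Enum):
    CHIRAL = 1
    ACHIRAL = 2


# Non-capturing groups: match.group(0) is the complete term, e.g. "not chiral".
TERM_RE = re.compile(r"(?i)\b(?:not |a-|a)?chiral(?:ity)?\b")

# Hypothetical lookup table, keyed by the lower-cased term.
TERM_TO_TYPE = {
    "chiral": Type.CHIRAL,
    "chirality": Type.CHIRAL,
    "achiral": Type.ACHIRAL,
    "achirality": Type.ACHIRAL,
    "a-chiral": Type.ACHIRAL,
    "not chiral": Type.ACHIRAL,
}


def find_types(text):
    """Return the Type of every recognised term in text, left to right."""
    found = []
    for match in TERM_RE.finditer(text):
        term = match.group(0).lower()
        # The regex can match spellings missing from the table; skip those.
        if term in TERM_TO_TYPE:
            found.append(TERM_TO_TYPE[term])
    return found


print(find_types("OTHER COLUMN HEADER not chiral chirality"))
# [<Type.ACHIRAL: 2>, <Type.CHIRAL: 1>]
```

Keeping the regex permissive and the dictionary explicit means a new spelling only needs a new table entry, while an unexpected match is skipped instead of raising a KeyError.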
I learned a lot from this. Thank you! I have a follow-up question: let's say that the keywords I am looking for are not exactly always the same. Instead of just chiral and achiral, the words I am looking for could also be chirality and achirality, or chiral and not chiral, or chiral and a-chiral. With these two keywords it's hard to come up with examples, but in the case where maybe it's more difficult to keep a list of a finite set of words to look for but the keywords all have a similar 'root' word, how would you modify your approach? (Jinx, commented Feb 28, 2020 at 15:29)
regex_1 and regex_2 are not included in your code. You are also missing the import re which you must be using. Include your entire code or your question may be put on hold.