Given a string, the task is to find its first word, subject to these rules:
- The string can contain periods and commas
- A word can start with a letter, a period, or a space
- A word can contain one apostrophe and still be valid
For example:
assert first_word("Hello world") == "Hello"
assert first_word(" a word ") == "a"
assert first_word("don't touch it") == "don't"
assert first_word("greetings, friends") == "greetings"
assert first_word("... and so on ...") == "and"
assert first_word("hi") == "hi"
assert first_word("Hello.world") == "Hello"
The code:
import re

def first_word(text: str) -> str:
    """
    returns the first word in a given text.
    """
    # Drop everything except letters, apostrophes, whitespace and periods.
    text = re.sub(r"[^A-Za-z'\s.]", '', text)
    words = text.split()
    for word in words:
        for i in range(len(word)):
            if word[i].isalpha() or word[i] == "'":
                if i == len(word) - 1:
                    if word.find('.') != -1:
                        return word.split('.')[0]
                    else:
                        return word
How could we improve it?
2 Answers
You could make the code better (and shorter) by using a regex to split on any delimiters that occur in the string. For example, for Hello.world the resulting list is ['', 'Hello', ''] (after splitting the first word from the delimiters), so you can always access the first word at index [1]. Like this:
import re

def first_word(s):
    return re.split(r"(\b[\w']+\b)(?:.+|$)", s)[1]
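To see why index [1] holds the first word, it may help to print what re.split returns for a few inputs. This is only an illustrative sketch; the names PATTERN and split_parts are introduced here and are not part of the answer's code.

```python
import re

# Same pattern as in first_word above; the name PATTERN is an assumption of this sketch.
PATTERN = r"(\b[\w']+\b)(?:.+|$)"

def split_parts(s):
    """Return the raw list produced by re.split, before indexing with [1]."""
    return re.split(PATTERN, s)

for s in ["Hello.world", "... and so on ...", " a word "]:
    # Element 0 is whatever precedes the first word, element 1 is the word itself.
    print(repr(s), "->", split_parts(s))
```

Because the first word is a capturing group, re.split keeps it in the result, right after whatever text precedes it.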
Here are some tests:
tests = [
    "Hello world",
    "a word",
    "don't touch it",
    "greetings, friends",
    "... and so on ...",
    "hi",
    "Hello.world",
    "Hello.world blah"]

# The assertions do not depend on the loop variable, so they run once.
assert first_word("Hello world") == "Hello"
assert first_word(" a word ") == "a"
assert first_word("don't touch it") == "don't"
assert first_word("greetings, friends") == "greetings"
assert first_word("... and so on ...") == "and"
assert first_word("hi") == "hi"
assert first_word("Hello.world") == "Hello"
assert first_word("Hello.world blah") == "Hello"

for test in tests:
    print('{}'.format(first_word(test)))
The pattern (\b[\w']+\b)(?:.+|$) is used above, where (\b[\w']+\b) captures the first word of the string (it ends up in the split list). \b matches a word boundary, which lets you perform a "whole words only" search with a regular expression of the form \bword\b. Note that using [\w'] (instead of plain \w) keeps the apostrophe in don't. The non-capturing group (?:.+|$) consumes the rest of the line, or matches the end of the string when nothing follows the word.
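One caveat worth noting: if the input contains no word at all (for example "..."), re.split finds no match and returns a one-element list, so indexing with [1] raises IndexError. A minimal sketch of a variant that returns an empty string instead, assuming that is the desired behaviour (the question does not say), could use re.search with the same word pattern. The names WORD and first_word_or_empty are hypothetical.

```python
import re

# Same word pattern as above, without the trailing non-capturing group.
WORD = re.compile(r"\b[\w']+\b")

def first_word_or_empty(s: str) -> str:
    """Return the first word in s, or "" when s contains no word (an assumed convention)."""
    match = WORD.search(s)
    return match.group(0) if match else ""

print(first_word_or_empty("Hello.world"))  # Hello
print(repr(first_word_or_empty("...")))    # ''
```

Note that \w also matches digits and underscores, so this is slightly more permissive than the "letters only" reading of the task.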
Here are the expected outputs:
Hello
a
don't
greetings
and
hi
Hello
Hello
Timing it:
%timeit first_word(test)
>>> 1.54 μs ± 17.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
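%timeit is an IPython magic, so it only works in IPython or Jupyter. Outside of those, a rough equivalent can be sketched with the standard timeit module. The repetition count below is an arbitrary choice for this sketch, and the numbers will differ from machine to machine.

```python
import re
import timeit

def first_word(s):
    return re.split(r"(\b[\w']+\b)(?:.+|$)", s)[1]

NUMBER = 100_000  # arbitrary repetition count chosen for this sketch

# timeit.timeit returns the total time in seconds for NUMBER calls.
elapsed = timeit.timeit(lambda: first_word("Hello.world blah"), number=NUMBER)
print(f"{elapsed / NUMBER * 1e6:.2f} µs per call")
```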
NOTE - A delimiter is a sequence of one or more characters used to specify the boundary between separate, independent regions in plain text or other data streams. An example of a delimiter is the comma character, which acts as a field delimiter in a sequence of comma-separated values.
Hope this helps!
Your code looks pretty great, much better than mine!
The beauty of regular expressions is that sometimes they can do the entire task, as in our case here, which saves us from writing additional if and then statements. Maybe here we could find an expression that would do so, something similar to:
(\b[\w']+\b)(?:.+|$)
which wraps our desired first word in a capturing group:
(\b[\w']+\b)
followed by a non-capturing group:
(?:.+|$)
Of course, if we wish to add more boundaries, reduce our boundaries, or change our char list [\w'], we can surely do so.
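As one possible example of changing the char list, here is a sketch that accepts only ASCII letters and allows at most one apostrophe inside a word. This is an assumption about how strictly the rules should be read, not something the question states, and the names LETTERS_WORD and first_word_letters are made up for illustration.

```python
import re

# Assumption: a word is a run of ASCII letters, optionally followed by
# a single apostrophe and more letters (e.g. "don't").
LETTERS_WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")

def first_word_letters(s: str) -> str:
    """Return the first letters-only word in s, or "" if there is none."""
    match = LETTERS_WORD.search(s)
    return match.group(0) if match else ""

print(first_word_letters("don't touch it"))  # don't
print(first_word_letters("x_1 y"))           # x
```

Unlike [\w'], this version stops at digits and underscores, so "x_1" yields "x" rather than "x_1".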
Test
Let's test our expression with re.finditer to see if that would work:
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re

regex = r"(\b[\w']+\b)(?:.+|$)"

test_str = ("Hello world\n"
            " a word \n"
            "don't touch it\n"
            "greetings, friends\n"
            "... and so on ...\n"
            "hi\n"
            "Hello.world")

matches = re.finditer(regex, test_str, re.MULTILINE)

for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))

    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
Output
Match 1 was found at 0-11: Hello world
Group 1 found at 0-5: Hello
Match 2 was found at 13-20: a word 
Group 1 found at 13-14: a
Match 3 was found at 21-35: don't touch it
Group 1 found at 21-26: don't
Match 4 was found at 36-54: greetings, friends
Group 1 found at 36-45: greetings
Match 5 was found at 59-72: and so on ...
Group 1 found at 59-62: and
Match 6 was found at 73-75: hi
Group 1 found at 73-75: hi
Match 7 was found at 76-87: Hello.world
Group 1 found at 76-81: Hello
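If only the captured words are of interest, the same idea can be condensed. The following sketch collects group 1 of every match into a list, using the same regex and test string as above. The name first_words is introduced here for illustration.

```python
import re

regex = r"(\b[\w']+\b)(?:.+|$)"

test_str = ("Hello world\n"
            " a word \n"
            "don't touch it\n"
            "greetings, friends\n"
            "... and so on ...\n"
            "hi\n"
            "Hello.world")

# re.MULTILINE lets $ match at the end of every line, so a line that holds
# a single word (such as "hi") is matched as well.
first_words = [m.group(1) for m in re.finditer(regex, test_str, re.MULTILINE)]
print(first_words)
# ['Hello', 'a', "don't", 'greetings', 'and', 'hi', 'Hello']
```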
RegEx Circuit
jex.im visualizes regular expressions.
Basic Performance Test
const repeat = 1000000;
const start = Date.now();

for (var i = repeat; i >= 0; i--) {
    const regex = /(\b[\w']+\b)(?:.+|$)/gm;
    const str = `Hello.world`;
    const subst = `$1`;

    var match = str.replace(regex, subst);
}

const end = Date.now() - start;
console.log("YAAAY! \"" + match + "\" is a match 💚💚💚 ");
console.log(end / 1000 + " is the runtime of " + repeat + " times benchmark test. 😳 ");
first_word(" a word ") == "a"
if a word can start with a space?