Why I made it
I’m building a voice assistant and thought it would be great if it could deliver news updates.
What it does
It cleans up the user’s query, determines whether it refers to a news category or a specific search term, and then calls the NewsAPI to fetch the most recent matching headline. After receiving the response, it extracts and tidies the article’s title, author, source, and description before returning a conversational summary which the text-to-speech module of my voice assistant (which I haven't posted yet) can say out loud with ease.
Code
Here is the Python code:
import requests
import json5
def NewsConcerning(text):
    """ Search for news concerning the text """
    text = (
        str(text)
        .replace("news", "")
        .replace("new", "")
        .replace("for ", "")
        .replace("concerning", "")
        .replace("can ", "")
        .replace("you ", "")
        .replace("give ", "")
        .replace("me ", "")
        .replace("the ", "")
        .replace("most ", "")
        .replace("recent ", "")
        .replace("in ", "")
        .replace("category ", "")
        .replace(" currently ", "")
        .replace("current ", "")
        .replace("now ", "")
        .replace("what ", "")
        .replace("whats ", "")
        .replace(" is ", "")
        .replace("about ", "")
        .replace(" ", "")
        .replace("today", "")
    )
if "business" in text:
text = "category=business"
elif "entertainment" in text:
text = "category=entertainment"
elif "general" in text:
text = "category=general"
elif "health" in text:
text = "category=health"
elif "science" in text:
text = "category=science"
elif "sports" in text:
text = "category=sports"
elif "technology" in text:
text = "category=technology"
elif text == "": # Eg input has been fully stripped and nothing is left
text = "category=general"
else: # If specific text is queried:
text = "q=" + text
    url = (
        "https://newsapi.org/v2/top-headlines?"
        + text
        + "&sortBy=publishedAt&pageSize=1&apiKey=THE_API_KEY_HAS_BEEN_REMOVED"
    )
    response = requests.get(url)
    parsed_data = json5.loads(response.text)
    #print(parsed_data)
    try:
        #print(text)
        if "articles" in parsed_data and parsed_data["articles"]:
            print("<ALL OK>")
            first_article = parsed_data["articles"][0]
            name = first_article["source"]["name"]
            author = first_article["author"]
            title = first_article["title"]
            desc = first_article["description"]
            title = (
                title.replace(name, "")
                .replace(" - ", " ")
                .replace("LISTEN | ", "")
                .replace(
                    " | Opinion",
                    ". I have also come to notice how this is an opinion-based text.",
                )
                .replace(" | ", " ")
                .replace("Watch Live: ", "live, ")
                .replace("!!!!!!", ", followed by six exclamation marks")
                .replace("!!!!!", ", followed by five exclamation marks")
                .replace("!!!!", ", followed by four exclamation marks")
                .replace("!!!", ", followed by three exclamation marks")
                .replace("!!", ", followed by two exclamation marks")
                .replace("\n", "")
            )
            if author:
                if author.lower() != name.lower():
                    author = author.replace(",", "and")
                    author = " by " + author
                    title = title.replace(author, "")
                else:
                    author = ""
            else:
                author = ", though I did not find who wrote it"
            if desc == "":
                desc = "Sadly, I could not find any description."
            else:
                content = desc.split("<p>")  # In case description comes in HTML format
                if len(content) > 1:
                    desc = content[1]
            desc = desc.replace("&quot;", "").replace(".com", "")
            if len(name):
                name = "on " + name
            return f"\n\nHere is the recently published news I found {name}{author}:\nThe title to it is {title}... \n{desc}"
    except IndexError:
        text = text.replace("q=", "").replace("category=", "")
        return f"I could not find anything concerning what you asked for. Maybe I misheard, were you asking about {text}? Try again if not."
    except KeyError:
        return "Whoops! Looks like you have been very interested in the news today! I'm sorry to say that my daily request limit, one hundred per day or fifty per twelve hours, has been reached."
i = input("Enter what you would ask of the voice assistant (eg: news concerning bitcoin): ")
print(NewsConcerning(i))
Critique request
Please, tell me anything that comes to mind.
5 Answers
Request
Instead of:
url = (
    "https://newsapi.org/v2/top-headlines?"
    + text
    + "&sortBy=publishedAt&pageSize=1&apiKey=THE_API_KEY"
)
response = requests.get(url)
I find this more legible:
url = "https://newsapi.org/v2/top-headlines"
params = {
"category": "general",
"sortBy": "publishedAt",
"pageSize": 1,
"apiKey": "THE_API_KEY"
}
response = requests.get(url, params=params)
NB:
- You can use requests.Session() instead of requests.get...
- Always check the status code. It should normally always be 200. Otherwise, the rest of the program is going to fail or at the very least yield unexpected results.
- An API key should not be hardcoded. Use a .env file instead.
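A minimal sketch of the .env point: read the key from the environment at runtime and fail loudly when it is missing. (NEWS_API_KEY is an assumed variable name; during development, python-dotenv can populate the environment from a .env file.)

```python
import os

def api_key_from_env():
    """Read the NewsAPI key from the environment instead of source code.

    NEWS_API_KEY is an assumed name; python-dotenv can load it
    from a .env file during development.
    """
    key = os.environ.get("NEWS_API_KEY")
    if not key:
        raise RuntimeError("NEWS_API_KEY is not set")
    return key
```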
HTML parsing
This approach does not scale well:
desc = desc.replace("&quot;", "").replace(".com", "")
Who knows how many more HTML tags are going to show up over time. Use a lib like BeautifulSoup instead, or lxml would be sufficient.
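BeautifulSoup is the comfortable choice; just to illustrate what a real parser buys you over chained replace calls, here is a stdlib-only sketch using html.parser (the class and function names are my own):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text content; tags like <p> are dropped and
    entities like &quot; are decoded automatically."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(desc):
    extractor = TextExtractor()
    extractor.feed(desc)
    extractor.close()
    return "".join(extractor.parts)
```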
If you are always expecting a JSON response, it would be appropriate to build a small function that retrieves content from a URL, based on one or more parameters you supply (notably the category). Then you can have another dedicated function for parsing the JSON response. The idea is: separation of responsibility. One function should do one thing and do it well.
Also, isolating concerns makes debugging easier. For example, if the JSON response is invalid, you can jump straight to the function that is responsible for retrieving the content.
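A minimal sketch of that split, one function to retrieve, one to parse (urllib stands in for requests so the example is self-contained; fetch_headlines and first_article are hypothetical names):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://newsapi.org/v2/top-headlines"

def fetch_headlines(params):
    """Retrieve the raw JSON response for the given query parameters.
    (Assumes `params` already carries the API key.)"""
    with urlopen(f"{BASE_URL}?{urlencode(params)}") as response:
        return json.load(response)

def first_article(parsed):
    """Parse the decoded JSON and return the first article, or None."""
    articles = parsed.get("articles") or []
    return articles[0] if articles else None
```

If the JSON ever comes back malformed, you now know the bug lives in fetch_headlines, not somewhere in a 100-line function.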
Securing credentials
The API key will most likely show up in web server logs as it is part of the URL. Whenever possible, transmit it differently, using headers for instance.
The documentation states that this method is supported:
You can attach your API key to a request in one of three ways:
Via the apiKey querystring parameter.
Via the X-Api-Key HTTP header.
Via the Authorization HTTP header. Including Bearer is optional, and be sure not to base 64 encode it like you may have seen in other authentication tutorials. We strongly recommend either of the last two so that your API key isn't visible to others in logs or via request sniffing.
(emphasis is mine)
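Following that recommendation is a one-liner with requests; a sketch (auth_headers is a name of my choosing):

```python
def auth_headers(api_key):
    """Send the key via the X-Api-Key header so it stays out of the URL
    and therefore out of web server logs."""
    return {"X-Api-Key": api_key}

# Usage (not executed here):
# response = requests.get(url, params=params, headers=auth_headers(key))
```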
Naming
NewsConcerning is styled the way PEP 8 would tell us to style a class name. As a function it should be in snake case: news_concerning.
String replacements
Many of the strings you replace with spaces are problematic. Replacing "in " with "", for instance, would change "vein " to "ve".
To address this you can utilize word boundaries (\b) in a regular expression.
DRY
You call str.replace several times. Instead, create a list of strings, join them with | and turn them into a regular expression that replaces any of them with "".
Taking these into consideration:
import re

def NewsConcerning(text):
    """ Search for news concerning the text """
    substrings = [
        "news", "new", "for ", "concerning", "can ",
        "you ", "give ", "me ", "the ", "most ", "recent ",
        "in ", "category ", " currently ", "current ",
        "now ", "what ", "whats ", " is ", "about ",
        " ", "today"
    ]
    pat = re.compile("|".join("\\b" + re.escape(s) + "\\b" for s in substrings))
    text = pat.sub("", str(text))
This also removes the possibility that one str.replace call changes the string in a way that triggers a later replacement which otherwise wouldn't have matched.
More repetition
Your conditional has more repetition.
categories = (
    "business", "entertainment", "general", "health",
    "science", "sports", "technology"
)
cat_in_text = next((cat for cat in categories if cat in text), None)
if cat_in_text:
    text = f"category={cat_in_text}"
elif text == "":  # Eg input has been fully stripped and nothing is left
    text = "category=general"
else:  # If specific text is queried:
    text = "q=" + text
URLs
But there's a problem with this approach. You should next be escaping this text variable when you insert it into your URL, but you can't without escaping that = you inserted intentionally.
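One way out is to stop building the query string by hand entirely and return a params dict, so the library escapes the value but not the separator. A sketch (query_params is a hypothetical helper):

```python
from urllib.parse import urlencode

CATEGORIES = ("business", "entertainment", "general", "health",
              "science", "sports", "technology")

def query_params(text):
    """Return a params dict rather than a pre-built 'category=...' string,
    so urlencode (or requests' params=) can escape the value safely."""
    for category in CATEGORIES:
        if category in text:
            return {"category": category}
    if not text:
        return {"category": "general"}
    return {"q": text}
```

For example, urlencode(query_params("c++")) yields "q=c%2B%2B": the user's text is escaped, while the = separator stays intact.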
__main__ guard
The final bit of code should be wrapped in a __main__ guard.
if __name__ == "__main__":
    i = input("Enter what you would ask of the voice assistant (eg: news concerning bitcoin): ")
    print(NewsConcerning(i))
Case sensitivity
As it stands, your program doesn't work very well if the user inputs mixed-case words. The fix is easy: convert the input to lowercase.
You can either convert the input immediately to lowercase, or let the NewsConcerning function do it.
if __name__ == "__main__":
    i = input("Enter what you would ask of the voice assistant (eg: news concerning bitcoin): ")
    print(NewsConcerning(i.lower()))
Exception handling
Your try is miles from your actual exception handlers. The code in your try block should be as minimal as possible.
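A sketch of what a tight try block looks like here, only the risky subscripts live inside it (pick_first_article is a name of my choosing):

```python
def pick_first_article(parsed_data):
    """Keep the try block minimal: only the lookups that can
    actually raise are inside it."""
    try:
        return parsed_data["articles"][0]
    except (KeyError, IndexError):
        return None
```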
Boolean tests and false-y values
In a couple of places you do boolean tests on the length of lists or on whether strings are empty. You should be aware that empty strings are false in Python, as are empty lists.
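For instance, the desc == "" test can lean on falsiness directly; a small sketch (describe is a hypothetical helper):

```python
def describe(desc):
    """An empty string is already false-y, so no `desc == ""` needed."""
    return desc if desc else "Sadly, I could not find any description."
```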
String Replacements and Matching
You should use a case-insensitive regular expression using the re.I flag to do your matching. For example:
import re
...
matching_words = (
    'business',
    'entertainment',
    ...
    'general'
)
matching_rex = re.compile(fr'\b({"|".join(matching_words)})\b', flags=re.I)
m = matching_rex.search(text)
if m:
    ...
If any of the words in matching_words contain characters that have special meaning in regular expressions (e.g. '.', '|', '(', etc.), then each word should be escaped either manually in matching_words or, preferably by calling re.escape on each word to be matched.
A case-insensitive regular expression can also be used with re.sub to do your replacements on the input text. For the sake of efficiency, you should only run the replacement logic if you are unable to find a keyword match.
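Such a case-insensitive substitution might look like this (an illustrative subset of the stop words; strip_stopwords is a name of my choosing):

```python
import re

# Illustrative subset of the stop words; re.escape guards any
# future words that contain regex metacharacters.
STOPWORDS = ("news", "concerning", "the", "most", "recent", "about", "today")
STOPWORDS_REX = re.compile(
    r"\b(" + "|".join(re.escape(w) for w in STOPWORDS) + r")\b\s*",
    flags=re.I,
)

def strip_stopwords(text):
    """Case-insensitively remove filler words in a single pass."""
    return STOPWORDS_REX.sub("", text).strip()
```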
Docstrings and Comments
These would be helpful.
single responsibility
The SRP says that a function should do one thing well.
The expression that replaces "concerning" and other stop words should be in a helper function. For one thing, that would make it very easy to unit test.
consistency
.replace(" currently ", "")
.replace("current ", "")
.replace("now ", "")
.replace("what ", "")
.replace("whats ", "")
.replace(" is ", "")
It's a bit worrying that "new" has no SPACE characters,
"for " ends with one, and a few words are surrounded by them.
I would have a hard time writing a docstring that explains
just what cleaned up result we're computing here.
That is to say, there doesn't appear to be a proper specification
for the cleanup behavior.
Also, if the input text is on the long side,
then repeatedly scanning it for two dozen words is scaling poorly.
Given that we already have a bunch of stop words, I would
anticipate that in a few months we might accumulate, say,
twice as many. That's just how code maintenance tends to go.
Consider putting a stake in the ground:
"We parse text as English words, and remove stopwords."
And then you might start to implement that spec with import nltk.
That package offers nltk.word_tokenize(), which is robust
and convenient.
Or you might go a bit further, such as saying you will strip punctuation, as you would apparently want a "what's" → "whats" conversion. You might even lemmatize, so a common word stem would match e.g. both "current" and "currently".
Consider putting your stop words in a set(), so that when
you loop over for word in words: you can probe set membership
in O(1) constant time, rather than consuming time
proportional to the number of stopwords.
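A sketch of that set-based probe, using a crude whitespace tokenizer in place of nltk to keep it self-contained (remove_stopwords is a name of my choosing):

```python
STOPWORDS = {"what", "whats", "is", "the", "most", "recent",
             "news", "about", "concerning", "today"}

def remove_stopwords(text):
    """Split on whitespace and probe the set: O(1) per word,
    regardless of how many stop words accumulate over time."""
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)
```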
break out a helper
if "business" in text:
text = "category=business"
...
elif "technology" in text:
text = "category=technology"
This should be a for loop over the enumerated categories,
with an assignment of text = f"category={category}",
rather than the copy-pasta we see here.
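One possible shape for that loop (category_query is a hypothetical name):

```python
CATEGORIES = ("business", "entertainment", "general", "health",
              "science", "sports", "technology")

def category_query(text):
    """One loop replaces the seven near-identical elif branches."""
    for category in CATEGORIES:
        if category in text:
            return f"category={category}"
    return "category=general" if not text else f"q={text}"
```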
Any time you're tempted to start leaning on that CMD-V paste key,
take a breath, step back, and think about what the appropriate
function signature should be, so you can parameterize
instead pasting then editing.
Also, don't put a secret API key in source code, checked into git,
and certainly don't publish it to the world.
Prefer to obtain it from an env var with os.environ["API_KEY"].
comments
The "followed by six exclamation marks" thing is weird enough that
it deserves either a unit test which exercises that case,
or a # comment or helper """docstring""" that explains
the business Use Case.
Apparently there's some site that is fond of exclamation marks,
but it's unclear how that interferes with the proper functioning
of NewsConcerning().
Which brings us back to: write an explicit spec for the function,
please, preferably as a docstring.
Then some months from now a maintainer can decide whether it
is still delivering correct results, even as news content and
formatting continues to change.
Oh, and
PEP 8
asks that you spell it def news_concerning():
check for error at the site of the error
response = requests.get(url)
...
except KeyError:
    return "Whoops! ..."
That diagnostic message is very nice and helpful, thank you. But it's very odd to defer the check for so many dozens of lines down in the source code. We should not have to juggle "are we in error state here? / normal state?" while reading so much code that manipulates the abnormal text result.
The usual idiom is
response = requests.get(url)
response.raise_for_status()
on the assumption that a 400- or 500-series error won't let
us accomplish anything useful with response.text.
Or you might examine response.status_code directly.
Just a high-level comment: A large part of the code consists of rule-based parsing of text (all the string-replace calls). This is extremely brittle (there are lots of cases that you won't cover) and error-prone. Most importantly, it's not maintainable. It's perhaps good as a first stab, to get an idea of what you want to do, or to build a rough first prototype, but it's practically impossible to build this out into a bigger program. I would advise spending time learning to use a language model (see the tutorials on the HuggingFace site), and either fine-tuning it or constraining it.
- I disagree. Rule-based parsing is significantly easier to understand and maintain than a language model, provided that it's well-structured – which, as other answers have pointed out, it isn't. If you want to do more sophisticated NLP, I'd recommend a package like spacy, rather than jumping straight to the solution-looking-for-a-problem of an LLM. – wizzwizz4, Nov 25 at 20:44
- I don't disagree that rule-based parsing is easier to understand. But it's totally unsuitable for handling free-flowing natural language text. The proof is in the pudding, the dev who wrote that code will notice this soon enough. (But spacy is a very nice library to get familiar with - also agree with that.) – mudskipper, Nov 25 at 21:28
- I also agree, btw, that it is good to first get some familiarity with rule-based parsing before diving into LLMs. – mudskipper, Nov 25 at 21:30
- Actually, rule-based parsing works quite well for voice interfaces: users tend to use a restricted grammar when interacting with them, and by careful observation this can be turned into a system of rules. I'd recommend re-running the studies every few years, to keep up with linguistic drift, but you can get a lot of mileage out of the simple solutions even here. – wizzwizz4, Nov 25 at 21:37