I developed this feature that I think can be improved quite a bit. The result is the desired one, but it took me many lines of code. Any idea to optimize it?
import json
import urllib.request
dni = 71179047
def get_cuit(dni):
request = urllib.request.Request('https://www.cuitonline.com/search.php?q=' + str(dni))
response = urllib.request.urlopen(request)
data_content = response.read()
html = data_content.decode('ISO-8859-1')
cuit = re.findall(r"\(?\b[0-9]{2}\)?-?[0-9]{7,8}-[0-9]{1}\b", html)[0]
result = re.sub('[^0-9]','', cuit)
return result
get_cuit(dni)
-
9\$\begingroup\$ It's cliché, but: Do not parse HTML with regexes. Do not parse HTML with regexes. Do not parse HTML with regexes. \$\endgroup\$Reinderien– Reinderien2020年12月18日 19:52:15 +00:00Commented Dec 18, 2020 at 19:52
3 Answers 3
- Use Requests for HTTP and let it deal with the encoding
- Do not bake the query into the URL string
- Use BeautifulSoup for HTML parsing
- Use typehints
Nice and simple:
from requests import get
from bs4 import BeautifulSoup
def get_cuit(dni: int) -> str:
with get(
'https://www.cuitonline.com/search.php',
params={'q': dni},
) as resp:
resp.raise_for_status()
doc = BeautifulSoup(resp.text, 'html.parser')
span = doc.find('span', class_='cuit')
return span.text
print(get_cuit(71_179_047))
-
\$\begingroup\$ As you said, 'nice and simple'. Thanks! ;) \$\endgroup\$Raymont– Raymont2020年12月19日 01:52:18 +00:00Commented Dec 19, 2020 at 1:52
-
\$\begingroup\$ @Raymont BeautifulSoup really helps too. If you aren't familiar with it, I highly recommend checking it out to see what it can do. \$\endgroup\$BruceWayne– BruceWayne2020年12月20日 18:00:37 +00:00Commented Dec 20, 2020 at 18:00
-
1\$\begingroup\$ Why do Python devs tend to prefer
from requests import get
/get(foo)
overimport requests
/requests.get(foo)
? It just seems very short-sighted to me. Surely, readingrequests.get(foo)
is more informative/self-explanatory than just readingget(foo)
? In the latter case, I'd have no idea where thisget
came from unless I scrolled all the way up to check all the import statements. Why not just keep namespaces separate? Ease of reading is vastly more important than ease of writing. \$\endgroup\$Will– Will2020年12月21日 01:08:53 +00:00Commented Dec 21, 2020 at 1:08 -
1\$\begingroup\$ @Will This is a 12-line program. The provenance of
get
is somewhat painfully obvious. In a much longer file, it would be worth considering the otherimport
style. \$\endgroup\$Reinderien– Reinderien2020年12月21日 04:36:20 +00:00Commented Dec 21, 2020 at 4:36 -
1\$\begingroup\$ Reading your answer just happened to remind me that I don't see the merit of the
from X import Y
import style in any case, generally speaking. That said, I admit that it isn't a big deal per se in your specific example code (even ifget
seems overly generic to me), and that this wasn't the ideal place to express that notion. \$\endgroup\$Will– Will2020年12月21日 08:29:01 +00:00Commented Dec 21, 2020 at 8:29
Couple additional points to what @Reinderien mentioned:
- if you would need to execute this function often, consider initializing a
requests.Session()
and re-using it for subsequent requests - you could improve the speed of HTML parsing here as well:
use
lxml
instead ofhtml.parser
(lxml
needs to be installed):doc = BeautifulSoup(resp.text, 'lxml')
since you are looking for a single element,
SoupStrainer
would really help out to allow your parser not waste time and resources parsing other parts of HTML:from bs4 import BeautifulSoup, SoupStrainer parse_only = SoupStrainer('span', class_='cuit') doc = BeautifulSoup(resp.text, 'lxml', parse_only=parse_only)
If you did have to use a regular expression for this task (which you shouldn't in this case - see the other answer), you could improve it by:
[0-9]
is equivalent to\d
, a digit character- A number inside brackets indicates the number of times the previous token should be repeated. In the case of
{1}
, it's superfluous - you can leave out the brackets entirely and the token will be matched exactly once anyway.
\(?\b\d{2}\)?-?\d{7,8}-\d\b
- Since you don't want all matches, but just the first match,
.search
could be a bit more appropriate than using.findall
to find all matches followed by extracting the[0]
th match.
Explore related questions
See similar questions with these tags.