8
\$\begingroup\$

I developed this feature that I think can be improved quite a bit. The result is the desired one, but it took me many lines of code. Any idea to optimize it?

import json
import urllib.request
dni = 71179047
 
def get_cuit(dni):
 request = urllib.request.Request('https://www.cuitonline.com/search.php?q=' + str(dni))
 response = urllib.request.urlopen(request)
 data_content = response.read()
 html = data_content.decode('ISO-8859-1')
 cuit = re.findall(r"\(?\b[0-9]{2}\)?-?[0-9]{7,8}-[0-9]{1}\b", html)[0]
 result = re.sub('[^0-9]','', cuit)
 return result
get_cuit(dni)
asked Dec 18, 2020 at 19:33
\$\endgroup\$
1
  • 9
    \$\begingroup\$ It's cliché, but: Do not parse HTML with regexes. Do not parse HTML with regexes. Do not parse HTML with regexes. \$\endgroup\$ Commented Dec 18, 2020 at 19:52

3 Answers 3

15
\$\begingroup\$
  • Use Requests for HTTP and let it deal with the encoding
  • Do not bake the query into the URL string
  • Use BeautifulSoup for HTML parsing
  • Use typehints

Nice and simple:

from requests import get
from bs4 import BeautifulSoup
def get_cuit(dni: int) -> str:
 with get(
 'https://www.cuitonline.com/search.php',
 params={'q': dni},
 ) as resp:
 resp.raise_for_status()
 doc = BeautifulSoup(resp.text, 'html.parser')
 span = doc.find('span', class_='cuit')
 return span.text
print(get_cuit(71_179_047))
answered Dec 18, 2020 at 22:55
\$\endgroup\$
8
  • \$\begingroup\$ As you said, 'nice and simple'. Thanks! ;) \$\endgroup\$ Commented Dec 19, 2020 at 1:52
  • \$\begingroup\$ @Raymont BeautifulSoup really helps too. If you aren't familiar with it, I highly recommend checking it out to see what it can do. \$\endgroup\$ Commented Dec 20, 2020 at 18:00
  • 1
    \$\begingroup\$ Why do Python devs tend to prefer from requests import get/get(foo) over import requests/requests.get(foo)? It just seems very short-sighted to me. Surely, reading requests.get(foo) is more informative/self-explanatory than just reading get(foo)? In the latter case, I'd have no idea where this get came from unless I scrolled all the way up to check all the import statements. Why not just keep namespaces separate? Ease of reading is vastly more important than ease of writing. \$\endgroup\$ Commented Dec 21, 2020 at 1:08
  • 1
    \$\begingroup\$ @Will This is a 12-line program. The provenance of get is somewhat painfully obvious. In a much longer file, it would be worth considering the other import style. \$\endgroup\$ Commented Dec 21, 2020 at 4:36
  • 1
    \$\begingroup\$ Reading your answer just happened to remind me that I don't see the merit of the from X import Y import style in any case, generally speaking. That said, I admit that it isn't a big deal per se in your specific example code (even if get seems overly generic to me), and that this wasn't the ideal place to express that notion. \$\endgroup\$ Commented Dec 21, 2020 at 8:29
5
\$\begingroup\$

Couple additional points to what @Reinderien mentioned:

  • if you would need to execute this function often, consider initializing a requests.Session() and re-using it for subsequent requests
  • you could improve the speed of HTML parsing here as well:
    • use lxml instead of html.parser (lxml needs to be installed):

      doc = BeautifulSoup(resp.text, 'lxml')
      
    • since you are looking for a single element, SoupStrainer would really help out to allow your parser not waste time and resources parsing other parts of HTML:

      from bs4 import BeautifulSoup, SoupStrainer
      parse_only = SoupStrainer('span', class_='cuit')
      doc = BeautifulSoup(resp.text, 'lxml', parse_only=parse_only)
      
answered Dec 19, 2020 at 2:47
\$\endgroup\$
4
\$\begingroup\$

If you did have to use a regular expression for this task (which you shouldn't in this case - see the other answer), you could improve it by:

  • [0-9] is equivalent to \d, a digit character
  • A number inside brackets indicates the number of times the previous token should be repeated. In the case of {1}, it's superfluous - you can leave out the brackets entirely and the token will be matched exactly once anyway.
\(?\b\d{2}\)?-?\d{7,8}-\d\b
  • Since you don't want all matches, but just the first match, .search could be a bit more appropriate than using .findall to find all matches followed by extracting the [0]th match.
answered Dec 19, 2020 at 1:52
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.