Function in Python to extract web Data

Question 1

I developed this feature that I think can be improved quite a bit. The result is the desired one, but it took me many lines of code. Any idea to optimize it?

import json
import urllib.request
dni = 71179047
 
def get_cuit(dni):
 request = urllib.request.Request('https://www.cuitonline.com/search.php?q=' + str(dni))
 response = urllib.request.urlopen(request)
 data_content = response.read()
 html = data_content.decode('ISO-8859-1')
 cuit = re.findall(r"\(?\b[0-9]{2}\)?-?[0-9]{7,8}-[0-9]{1}\b", html)[0]
 result = re.sub('[^0-9]','', cuit)
 return result
get_cuit(dni)

Question 2

It's cliché, but: Do not parse HTML with regexes. Do not parse HTML with regexes. Do not parse HTML with regexes.

Question 3

Use Requests for HTTP and let it deal with the encoding
Do not bake the query into the URL string
Use BeautifulSoup for HTML parsing
Use typehints

Nice and simple:

from requests import get
from bs4 import BeautifulSoup
def get_cuit(dni: int) -> str:
 with get(
 'https://www.cuitonline.com/search.php',
 params={'q': dni},
 ) as resp:
 resp.raise_for_status()
 doc = BeautifulSoup(resp.text, 'html.parser')
 span = doc.find('span', class_='cuit')
 return span.text
print(get_cuit(71_179_047))

Question 4

As you said, 'nice and simple'. Thanks! ;)

Question 5

@Raymont BeautifulSoup really helps too. If you aren't familiar with it, I highly recommend checking it out to see what it can do.

Question 6

Why do Python devs tend to prefer from requests import get/get(foo) over import requests/requests.get(foo)? It just seems very short-sighted to me. Surely, reading requests.get(foo) is more informative/self-explanatory than just reading get(foo)? In the latter case, I'd have no idea where this get came from unless I scrolled all the way up to check all the import statements. Why not just keep namespaces separate? Ease of reading is vastly more important than ease of writing.

Question 7

@Will This is a 12-line program. The provenance of get is somewhat painfully obvious. In a much longer file, it would be worth considering the other import style.

Question 8

Reading your answer just happened to remind me that I don't see the merit of the from X import Y import style in any case, generally speaking. That said, I admit that it isn't a big deal per se in your specific example code (even if get seems overly generic to me), and that this wasn't the ideal place to express that notion.

Question 9

Couple additional points to what @Reinderien mentioned:

if you would need to execute this function often, consider initializing a requests.Session() and re-using it for subsequent requests
you could improve the speed of HTML parsing here as well:
- use lxml instead of html.parser (lxml needs to be installed):
```
doc = BeautifulSoup(resp.text, 'lxml')
```
- since you are looking for a single element, SoupStrainer would really help out to allow your parser not waste time and resources parsing other parts of HTML:
```
from bs4 import BeautifulSoup, SoupStrainer
parse_only = SoupStrainer('span', class_='cuit')
doc = BeautifulSoup(resp.text, 'lxml', parse_only=parse_only)
```

Question 10

If you did have to use a regular expression for this task (which you shouldn't in this case - see the other answer), you could improve it by:

[0-9] is equivalent to \d, a digit character
A number inside brackets indicates the number of times the previous token should be repeated. In the case of {1}, it's superfluous - you can leave out the brackets entirely and the token will be matched exactly once anyway.

\(?\b\d{2}\)?-?\d{7,8}-\d\b

Since you don't want all matches, but just the first match, .search could be a bit more appropriate than using .findall to find all matches followed by extracting the [0]th match.

Reinderien Reinderien 71k5 gold badges76 silver badges256 bronze badges · Accepted Answer · 2020-12-18 22:55:34Z

15

\$\begingroup\$

Use Requests for HTTP and let it deal with the encoding
Do not bake the query into the URL string
Use BeautifulSoup for HTML parsing
Use typehints

Nice and simple:

from requests import get
from bs4 import BeautifulSoup
def get_cuit(dni: int) -> str:
 with get(
 'https://www.cuitonline.com/search.php',
 params={'q': dni},
 ) as resp:
 resp.raise_for_status()
 doc = BeautifulSoup(resp.text, 'html.parser')
 span = doc.find('span', class_='cuit')
 return span.text
print(get_cuit(71_179_047))

Share

answered Dec 18, 2020 at 22:55

Reinderien's user avatar

Reinderien Reinderien

71k5 gold badges76 silver badges256 bronze badges

\$\endgroup\$

8

\$\begingroup\$ As you said, 'nice and simple'. Thanks! ;) \$\endgroup\$

Raymont
– Raymont

2020年12月19日 01:52:18 +00:00
Commented Dec 19, 2020 at 1:52
\$\begingroup\$ @Raymont BeautifulSoup really helps too. If you aren't familiar with it, I highly recommend checking it out to see what it can do. \$\endgroup\$

BruceWayne
– BruceWayne

2020年12月20日 18:00:37 +00:00
Commented Dec 20, 2020 at 18:00
1

\$\begingroup\$ Why do Python devs tend to prefer from requests import get/get(foo) over import requests/requests.get(foo)? It just seems very short-sighted to me. Surely, reading requests.get(foo) is more informative/self-explanatory than just reading get(foo)? In the latter case, I'd have no idea where this get came from unless I scrolled all the way up to check all the import statements. Why not just keep namespaces separate? Ease of reading is vastly more important than ease of writing. \$\endgroup\$

Will
– Will

2020年12月21日 01:08:53 +00:00
Commented Dec 21, 2020 at 1:08
1

\$\begingroup\$ @Will This is a 12-line program. The provenance of get is somewhat painfully obvious. In a much longer file, it would be worth considering the other import style. \$\endgroup\$

Reinderien
– Reinderien

2020年12月21日 04:36:20 +00:00
Commented Dec 21, 2020 at 4:36
1

\$\begingroup\$ Reading your answer just happened to remind me that I don't see the merit of the from X import Y import style in any case, generally speaking. That said, I admit that it isn't a big deal per se in your specific example code (even if get seems overly generic to me), and that this wasn't the ideal place to express that notion. \$\endgroup\$

Will
– Will

2020年12月21日 08:29:01 +00:00
Commented Dec 21, 2020 at 8:29

| Show 3 more comments

Stack Exchange Network

Function in Python to extract web Data

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Function in Python to extract web Data

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions