Your get_url function is confusing. It looks like you keep assigning new values to url and ignoring all the previous values. This is what happens when I run it:

>>> get_url("Hello", "World")
'https://thepiratebay.se/user/World/'

Surely this is either a bug or redundant behaviour, since the only argument that matters is the last one? It seems like instead you should be using str.join, which will concatenate all the arguments together with a string separator, so for example:

def get_url(*args):
    return BASE_USER + "/".join(args)

>>> get_url("Hello", "World")
'https://thepiratebay.se/user/Hello/World'
>>> get_url("ban", "an", "a")
'https://thepiratebay.se/user/ban/an/a'

Though as @Mathias Ettinger noted, you should use map(str, args) to ensure that all the arguments are converted to strings, as join will raise a TypeError if any of the arguments aren't strings.
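As a minimal sketch of that fix (assuming BASE_USER is the same constant your code already defines):

def get_url(*args):
    # Convert every argument to a string first, so passing
    # ints (e.g. page numbers) doesn't raise a TypeError.
    return BASE_USER + "/".join(map(str, args))
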
check_page_content is hard to understand at first; a lot happens in one line, so you should try to spread it out over multiple lines. It would be easier to build rows with a list comprehension that filters in advance, rather than gathering up a lot of rows just to remove them later:

def check_page_content(url):
    bs = BeautifulSoup(request_html(url), "html.parser")
    rows = [row for row in bs.findAll("tr")
            if row.findChildren('div', {'class': 'detName'})]

But then I see that you're just returning the boolean value of the list! That means you can stop as soon as you find any matching row; there's no need to store the list at all. Luckily there's the any function that will do this for you. It supports short circuiting, which means it will return as soon as it finds a value that evaluates as True:

def check_page_content(url):
    bs = BeautifulSoup(request_html(url), "html.parser")
    return any(row.findChildren('div', {'class': 'detName'})
               for row in bs.findAll("tr"))

any checks the truthiness of each value produced by the generator, so as soon as one of them is truthy your function returns True. Even if you have to check every single row, this is still faster than building a full list and then mapping and filtering it.
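
To see the short circuiting in action, here's a small illustration with a hypothetical noisy helper that reports each check:

>>> def noisy(n):
...     print("checking", n)
...     return n > 1
...
>>> any(noisy(n) for n in [0, 1, 2, 3])
checking 0
checking 1
checking 2
True

Note that 3 is never checked, because any stops at the first truthy result.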

In get_max_pages you have the default value links=list(). I'm not sure if you know about the mutable default argument pitfall and thought this would avoid it, but it won't. links will be created once as an empty list. Every time you call the function the same list exists, so the same list will be appended to, which is not what you need. Here's a simple example:

>>> def a(b=list()):
...     b.append("another")
...     return b
...
>>> a()
['another']
>>> a()
['another', 'another']
>>> a()
['another', 'another', 'another']

Instead you need to use links=None and then use if links is None: links = [] so that a new list is created within each function call.
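
As a minimal sketch of that pattern applied to your signature (the other parameters left as they are):

def get_max_pages(user, url=None, placeholder=0, links=None, valid=True):
    if links is None:
        links = []  # a fresh list is created on every call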

You have most of your code nested inside if valid, but if you flipped it around you could save nesting by doing this:

def get_max_pages(user, url=None, placeholder=0, links=None, valid=True):
    if links is None:
        links = []
    if url is None:
        url = get_url(user, str(placeholder), 3)
    if not valid:
        return links
    td = BeautifulSoup(request_html(url), "html.parser").find('td', {'colspan': 9, 'style': 'text-align:center;'})

Less nesting generally makes it easier to read code like this.

You have another map/filter combination; I'd recommend a similar list comprehension, though this one is a little harder to follow:

    regex = re.compile(r"/user/%s/\d{,3}/\d{,2}" % user)
    pages = [int(a.text) - 1
             for a in td.findAll('a', {'href': regex}) if a.text]

What's happening here is that I'm iterating through td.findAll, and whenever a.text is truthy (i.e. non-empty) I'm storing int(a.text) - 1 in pages. This does what your next line does without needing multiple calls.
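
As a toy demonstration with made-up values, the filter runs before the conversion, so empty strings never reach int:

>>> items = ['3', '', '7']
>>> [int(x) - 1 for x in items if x]
[2, 6]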

However, you may then need to filter pages further based on a condition. You could still do this in a comprehension, though; just run a list comprehension over pages itself:

if links:
    pages = [x for x in pages if x > placeholder]

Note I removed the int call because you already store them as integers, so it's not necessary.

Lastly, your while loop here is strange. It would be easier to just do for element in pages and then break from the loop instead of setting valid = False. This gives you a much simpler loop with fewer lines:

for element in pages:
    valid_page = check_page_content(get_url(user, element, 3))
    if valid_page:
        links.append(element)
    else:
        break
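
Putting these pieces together, here's a rough sketch of how get_max_pages might read with all of the above applied. This is only an outline under my assumptions: I'm guessing the function ends by returning links, and I've left out any recursion or termination logic you have beyond the fragment shown, with your helpers (get_url, request_html, check_page_content) kept as-is:

def get_max_pages(user, url=None, placeholder=0, links=None, valid=True):
    if links is None:
        links = []  # fresh list per call, avoiding the mutable default
    if url is None:
        url = get_url(user, placeholder, 3)  # get_url now handles non-strings
    if not valid:
        return links
    td = BeautifulSoup(request_html(url), "html.parser").find(
        'td', {'colspan': 9, 'style': 'text-align:center;'})
    regex = re.compile(r"/user/%s/\d{,3}/\d{,2}" % user)
    pages = [int(a.text) - 1
             for a in td.findAll('a', {'href': regex}) if a.text]
    if links:
        pages = [x for x in pages if x > placeholder]
    for element in pages:
        if check_page_content(get_url(user, element, 3)):
            links.append(element)
        else:
            break
    return links  # assumption: the original returns the collected links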