3

I want to find all links in a div, for example:

<div>
 <a href="#0"></a>
 <a href="#1"></a>
 <a href="#2"></a>
</div>

So I wrote a function as follows:

def get_links(div):
    links = []
    if div.tag == 'a':
        links.append(div)
        return links
    else:
        for a in div:
            links + get_links(a)
        return links

Why is the result [] rather than [a, a, a]?

I suspect this has to do with list references; could you explain in some detail?

This is the complete module:

import lxml.html

def get_links(div):
    links = []
    if div.tag == 'a':
        links.append(div)
        return links
    else:
        for a in div:
            links + get_links(a)
        return links

if __name__ == '__main__':
    fragment = '''
    <div>
        <a href="#0">1</a>
        <a href="#1">2</a>
        <a href="#2">3</a>
    </div>'''
    fragment = lxml.html.fromstring(fragment)
    links = get_links(fragment)  # <---------------
asked Jan 5, 2015 at 8:03
3
  • 2
    Try changing links + get_links(a) to links += get_links(a) Commented Jan 5, 2015 at 8:04
  • If you don't change links, who else should do it? Commented Jan 5, 2015 at 8:17
  • Yes, this is the right way. Thanks. I meant to write +=, but I forgot and believed I had written it, so I didn't spot the error and assumed it was a list reference problem. Commented Jan 5, 2015 at 8:26

3 Answers

2

List addition in Python returns a new list obtained by concatenating the arguments; it does not change them:

x = [1, 2, 3, 4]
print(x + [5, 6]) # displays [1, 2, 3, 4, 5, 6]
print(x) # here x is still [1, 2, 3, 4]

To modify the list in place, you can use the extend method:

x.extend([5, 6])

or +=:

x += [5, 6]

The latter is IMO a bit "strange", because it is a case in which x = x + y is not the same as x += y. I therefore prefer to avoid it and make the in-place extension explicit with extend.

For your code

links = links + get_links(a)

would also work, but remember that it does something different: it allocates a new list holding the concatenation and then rebinds the name links to it. It does not change the original object that links referred to:

x = [1, 2, 3, 4]
y = x
x = x + [5, 6]
print(x) # displays [1, 2, 3, 4, 5, 6]
print(y) # displays [1, 2, 3, 4]

whereas with += both names still refer to the same, modified list:

x = [1, 2, 3, 4]
y = x
x += [5, 6]
print(x) # displays [1, 2, 3, 4, 5, 6]
print(y) # displays [1, 2, 3, 4, 5, 6]
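
The difference tends to matter most when a list is passed to a function, because the parameter is just another name for the caller's list. A minimal sketch of that situation; the function names append_rebind and append_in_place are hypothetical and not taken from the question:

```python
def append_rebind(items):
    # `items + [...]` builds a new list; the assignment only rebinds the
    # local name, so the caller's list is left untouched.
    items = items + [5, 6]
    return items


def append_in_place(items):
    # `+=` on a list extends the existing object in place, so the caller
    # sees the change through its own reference.
    items += [5, 6]
    return items


original = [1, 2, 3, 4]
rebound = append_rebind(original)
print(original)             # [1, 2, 3, 4]
print(rebound is original)  # False

mutated = append_in_place(original)
print(original)             # [1, 2, 3, 4, 5, 6]
print(mutated is original)  # True
```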
answered Jan 5, 2015 at 8:07
1
  • Yes, this is the right way. Thank you! I meant to write +=, but I forgot and believed I had written it, so I didn't spot the error and assumed it was a list reference problem. Commented Jan 5, 2015 at 8:20
1

If the tag is not 'a', your code effectively does this:

# You create an empty list
links = []
for a in div:
    # You combine <links> with the result of get_links(), but you do not assign it to anything
    links + get_links(a)
# So you return an empty list
return links

You should replace + with +=:

links += get_links(a)

Or use extend():

links.extend(get_links(a))
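
With the fix applied, the whole function might look like the sketch below. It assumes the standard library's xml.etree.ElementTree in place of lxml.html so that it runs without third-party packages; its elements also expose .tag and iterate over their children. The nested p element is added only to show that the recursion reaches deeper levels.

```python
import xml.etree.ElementTree as ET


def get_links(element):
    """Recursively collect the <a> elements at or below `element`."""
    if element.tag == 'a':
        return [element]
    links = []
    for child in element:
        # extend() mutates `links` in place; a bare `links + ...` would
        # build a new list and immediately discard it.
        links.extend(get_links(child))
    return links


fragment = ET.fromstring('''
<div>
  <a href="#0">1</a>
  <p><a href="#1">2</a></p>
  <a href="#2">3</a>
</div>''')

links = get_links(fragment)
print([a.get('href') for a in links])  # ['#0', '#1', '#2']
```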
answered Jan 5, 2015 at 8:08
1
  • Yes, this is the right way. I meant to write +=, but I forgot and believed I had written it, so I didn't spot the error and assumed it was a list reference problem. Commented Jan 5, 2015 at 8:18
0

Another option is to use the xpath method to get all a tags inside a div at any level.

Code:

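from lxml import etree

# content is assumed to hold the markup from the question
content = '<div><a href="#0">1</a><a href="#1">2</a><a href="#2">3</a></div>'
root = etree.fromstring(content)
print(root.xpath('//div//a'))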

Output:

[<Element a at 0xb6cef0cc>, <Element a at 0xb6cef0f4>, <Element a at 0xb6cef11c>]
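
For comparison, something similar is possible without XPath: elements provide an iter() method that walks all descendants with a given tag. The sketch below uses the standard library's xml.etree.ElementTree, an assumption made so that the example has no third-party dependency; lxml elements offer iter() as well. The sample markup is made up for illustration.

```python
import xml.etree.ElementTree as ET

content = (
    '<div><a href="#0">1</a>'
    '<span><a href="#1">2</a></span>'
    '<a href="#2">3</a></div>'
)
root = ET.fromstring(content)

# iter('a') yields matching elements in document order, at any depth.
hrefs = [a.get('href') for a in root.iter('a')]
print(hrefs)  # ['#0', '#1', '#2']
```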
answered Jan 5, 2015 at 8:10
2
  • 2
    Your code only returns a tags that are direct children of the div tag. '//div//a' is better. Commented Jan 5, 2015 at 8:12
  • @infgeoax: Yes, agreed. I updated the code to get a tags from the div at any level. Thanks. Commented Jan 5, 2015 at 8:16
