|
|
cssselect(self,
expr,
translator='html')
Run the CSS expression on this element and its children,
returning a list of the results.
|
|
|
drop_tag(self)
Remove the tag, but not its children or text. The children and text
are merged into the parent.
|
|
|
drop_tree(self)
Removes this element from the tree, including its children and
text. The tail text is joined to the previous element or
parent.
|
|
|
find_class(self,
class_name)
Find any elements with the given class name.
|
|
|
find_rel_links(self,
rel)
Find any links like <a rel="{rel}">...</a>; returns a list of elements.
|
|
|
get_element_by_id(self,
id,
*default)
Get the first element in a document with the given id. If none is
found, return the default argument if provided or raise KeyError
otherwise.
|
|
|
iterlinks(self)
Yield (element, attribute, link, pos), where attribute may be None
(indicating the link is in the text). pos is the position
where the link occurs; often 0, but sometimes something else in
the case of links in stylesheets or style tags.
|
|
|
make_links_absolute(self,
base_url=None,
resolve_base_href=True,
handle_failures=None)
Make all links in the document absolute, given the
base_url for the document (the full URL where the document
came from), or if no base_url is given, then the .base_url
of the document.
|
|
|
resolve_base_href(self,
handle_failures=None)
Find any <base href> tag in the document, and apply its
values to all links found in the document. Also remove the
tag once it has been applied.
|
|
|
rewrite_links(self,
link_repl_func,
resolve_base_href=True,
base_href=None)
Rewrite all the links in the document. For each link
link_repl_func(link) will be called, and the return value
will replace the old link.
|
|
|
set(self,
key,
value=None)
Sets an element attribute. If no value is provided, or if the value is None,
creates a 'boolean' attribute without value, e.g. "<form novalidate></form>"
for form.set('novalidate').
|
|
|
text_content(self)
Return the text content of the tag (and the text in any children).
|
|
Inherited from object:
__delattr__,
__format__,
__getattribute__,
__hash__,
__init__,
__new__,
__reduce__,
__reduce_ex__,
__repr__,
__setattr__,
__sizeof__,
__str__,
__subclasshook__
|
|
|
base_url
Returns the base URL, given when the page was parsed.
|
|
|
body
Return the <body> element. Can be called from a child element
to get the document's body.
|
|
|
classes
A set-like wrapper around the 'class' attribute.
|
|
|
forms
Return a list of all the forms in the document.
|
|
|
head
Returns the <head> element. Can be called from a child
element to get the document's head.
|
|
|
label
Get or set any <label> element associated with this element.
|
|
Inherited from object:
__class__
|
cssselect(self,
expr,
translator='html')
Run the CSS expression on this element and its children,
returning a list of the results.
Equivalent to lxml.cssselect.CSSSelector(expr, translator='html')(self).
Note that pre-compiling the expression can provide a substantial
speedup.
drop_tag(self)
Remove the tag, but not its children or text. The children and text
are merged into the parent.
Example:
>>> from lxml.html import fragment_fromstring, tostring
>>> h = fragment_fromstring('<div>Hello <b>World!</b></div>')
>>> h.find('.//b').drop_tag()
>>> print(tostring(h, encoding='unicode'))
<div>Hello World!</div>
get_element_by_id(self,
id,
*default)
Get the first element in a document with the given id. If none is
found, return the default argument if provided or raise KeyError
otherwise.
Note that there can be more than one element with the same id,
and this isn't uncommon in HTML documents found in the wild.
Browsers return only the first match, and this function does
the same.
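As an illustration of the lookup and default behaviour described above, here is a minimal sketch. The markup and the id values are made up for the example; only get_element_by_id() and lxml.html.fromstring() come from the library.

```python
from lxml import html

# Hypothetical markup: two elements share the id "intro".
doc = html.fromstring(
    '<div><p id="intro">First</p><p id="intro">Second</p></div>'
)

# Only the first match is returned.
first = doc.get_element_by_id("intro")
print(first.text)

# A positional default is returned instead of raising KeyError.
missing = doc.get_element_by_id("no-such-id", None)
print(missing)

# Without a default, a missing id raises KeyError.
try:
    doc.get_element_by_id("no-such-id")
    raised = False
except KeyError:
    raised = True
print(raised)
```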
iterlinks(self)
Yield (element, attribute, link, pos), where attribute may be None
(indicating the link is in the text). pos is the position
where the link occurs; often 0, but sometimes something else in
the case of links in stylesheets or style tags.
Note: <base href> is not taken into account in any way. The
link you get is exactly the link in the document.
Note: multiple links inside of a single text string or
attribute value are returned in reversed order. This makes it
possible to replace or delete them from the text string value
based on their reported text positions. Otherwise, a
modification at one text position can change the positions of
links reported later on.
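The tuples yielded by iterlinks() can be inspected as in the following sketch. The markup is invented for the example, and it covers only the simple case of one link per attribute, where pos is 0.

```python
from lxml import html

# Hypothetical fragment with one href link and one src link.
doc = html.fromstring(
    '<div><a href="/about">About</a><img src="logo.png"></div>'
)

# Collect (tag, attribute, link, pos) for every link in the fragment.
found = [
    (el.tag, attribute, link, pos)
    for el, attribute, link, pos in doc.iterlinks()
]
for item in found:
    print(item)
```

The links are reported exactly as written in the markup; "/about" is not resolved against any base URL.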
make_links_absolute(self,
base_url=None,
resolve_base_href=True,
handle_failures=None)
Make all links in the document absolute, given the
base_url for the document (the full URL where the document
came from), or if no base_url is given, then the .base_url
of the document.
If resolve_base_href is true, then any <base href>
tags in the document are used and removed from the document.
If it is false then any such tag is ignored.
If handle_failures is None (default), a failure to process
a URL will abort the processing. If set to 'ignore', errors
are ignored. If set to 'discard', failing URLs will be removed.
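A small sketch of the call with an explicit base_url follows. The URL and the markup are placeholders chosen for the example.

```python
from lxml import html

# Hypothetical fragment with a relative and a root-relative link.
doc = html.fromstring(
    '<div><a href="page.html">Page</a><a href="/root.html">Root</a></div>'
)

# Resolve every link against the given base URL, in place.
doc.make_links_absolute("http://example.com/docs/")

hrefs = [a.get("href") for a in doc.iter("a")]
print(hrefs)
```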
resolve_base_href(self,
handle_failures=None)
Find any <base href> tag in the document, and apply its
values to all links found in the document. Also remove the
tag once it has been applied.
If handle_failures is None (default), a failure to process
a URL will abort the processing. If set to 'ignore', errors
are ignored. If set to 'discard', failing URLs will be removed.
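The effect on a document that carries a <base href> tag might look like the sketch below. The document content is invented for the example.

```python
from lxml import html

# Hypothetical full document with a <base href> in its head.
doc = html.document_fromstring(
    '<html><head><base href="http://example.com/a/"></head>'
    '<body><a href="b.html">b</a></body></html>'
)

# Apply the base href to all links, then drop the <base> tag.
doc.resolve_base_href()

remaining_base_tags = doc.xpath("//base")
hrefs = [a.get("href") for a in doc.iter("a")]
print(remaining_base_tags, hrefs)
```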
rewrite_links(self,
link_repl_func,
resolve_base_href=True,
base_href=None)
Rewrite all the links in the document. For each link
link_repl_func(link) will be called, and the return value
will replace the old link.
Note that links may not be absolute (unless you first called
make_links_absolute()), and may be internal (e.g.,
'#anchor'). They can also be values like
'mailto:email' or 'javascript:expr'.
If you give base_href then all links passed to
link_repl_func() will take that into account.
If the link_repl_func returns None, the attribute or
tag text will be removed completely.
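The following sketch shows one possible link_repl_func. The function name upgrade_link and the rewriting policy (upgrade http to https, drop javascript: links) are assumptions made for the example, not part of lxml.

```python
from lxml import html


def upgrade_link(link):
    """Hypothetical policy: drop javascript: links, upgrade http to https."""
    if link.startswith("javascript:"):
        return None  # None removes the attribute entirely
    if link.startswith("http://"):
        return "https://" + link[len("http://"):]
    return link  # internal and other links are left unchanged


doc = html.fromstring(
    '<div>'
    '<a href="http://example.com/">site</a>'
    '<a href="javascript:void(0)">js</a>'
    '<a href="#top">top</a>'
    '</div>'
)
doc.rewrite_links(upgrade_link)

hrefs = [a.get("href") for a in doc.iter("a")]
print(hrefs)
```

The callback receives the link as it appears in the document, so it has to cope with internal anchors and non-HTTP schemes itself.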
base_url
Returns the base URL, given when the page was parsed.
Use with urllib.parse.urljoin(el.base_url, href) (urlparse.urljoin
in Python 2) to get absolute URLs.
- Get Method:
- unreachable.base_url(self)
- Returns the base URL, given when the page was parsed.
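A minimal sketch of combining base_url with urljoin, assuming Python 3 (where urljoin lives in urllib.parse) and a placeholder URL passed at parse time:

```python
from urllib.parse import urljoin

from lxml import html

# The base URL is supplied when the document is parsed (placeholder value).
doc = html.document_fromstring(
    '<html><body><a href="page.html">Page</a></body></html>',
    base_url="http://example.com/docs/",
)

link = doc.find(".//a")

# Any element in the tree reports the document's base URL.
absolute = urljoin(link.base_url, link.get("href"))
print(absolute)
```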
body
Return the <body> element. Can be called from a child element
to get the document's body.
- Get Method:
- unreachable.body(self)
- Return the <body> element. Can be called from a child element
to get the document's body.
classes
A set-like wrapper around the 'class' attribute.
- Get Method:
- unreachable.classes(self)
- A set-like wrapper around the 'class' attribute.
- Set Method:
- unreachable.classes(self,
classes)
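The set-like behaviour can be sketched as follows. The class names are invented for the example, and only membership tests, add() and discard() are shown.

```python
from lxml import html

# Hypothetical element with two classes.
el = html.fromstring('<div class="note warning">text</div>')

has_note = "note" in el.classes  # membership test against the attribute
el.classes.add("hidden")         # appends to the class attribute
el.classes.discard("warning")    # removes the class if present

result = el.get("class")
print(has_note, result)
```

Changes made through the wrapper are written straight back to the 'class' attribute of the element.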
forms
Return a list of all the forms in the document.
- Get Method:
- unreachable.forms(self)
- Return a list of all the forms in the document.
head
Returns the <head> element. Can be called from a child
element to get the document's head.
- Get Method:
- unreachable.head(self)
- Returns the <head> element. Can be called from a child
element to get the document's head.
label
Get or set any <label> element associated with this element.
- Get Method:
- unreachable.label(self)
- Get or set any <label> element associated with this element.
- Set Method:
- unreachable.label(self,
label)
- Delete Method:
- unreachable.label(self)
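As a sketch of reading the label property, the example below assumes the association is made through a for attribute that matches the element's id. The form markup is invented for the example.

```python
from lxml import html

# Hypothetical form: the label's "for" attribute points at the input's id.
form = html.fromstring(
    '<form>'
    '<label for="name">Name</label>'
    '<input id="name" type="text">'
    '<input type="text">'
    '</form>'
)

named_input = form.get_element_by_id("name")
label_text = named_input.label.text
print(label_text)

# An element without an id has no associated label.
unlabelled = form.findall(".//input")[1]
print(unlabelled.label)
```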