| Home | Trees | Indices | Help |
|
|---|
re.compile(r'(?is)<body.*?>')
re.compile(r'(?is)</body.*?>')
re.compile(r'(?is)</?(ins|del).*?>')
re.compile(r'[ \t\n\r]$')
(u'param', u'img', u'area', u'br', u'basefont', u...
(u'address', u'blockquote', u'center', u'di...
(u'dd', u'dt', u'frameset', u'li'...
re.compile(r'(?u)\S+(?:\s+|$)')
re.compile(r'^[ \t\n\r]')
None
{u'html_annotate (line 35)':...
doclist should be ordered from oldest to newest, like:
>>> version1 = 'Hello World'
>>> version2 = 'Goodbye World'
>>> print(html_annotate([(version1, 'version 1'),
...                      (version2, 'version 2')]))
<span title="version 2">Goodbye</span> <span title="version 1">World</span>
The documents must be fragments (str/UTF8 or unicode), not complete documents.
The markup argument is a function used to mark up the spans of words. It is called like markup('Hello', 'version 2') and returns HTML. The first argument is text and never includes any markup. The default uses a span with a title:
>>> print(default_markup('Some Text', 'by Joe'))
<span title="by Joe">Some Text</span>
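A custom markup callable only needs to accept the text and the version label and return HTML. The following is a minimal, stdlib-only sketch; the name author_markup and the class attribute are hypothetical and are not part of the module.

```python
from html import escape


def author_markup(text, version):
    """Hypothetical replacement for the default markup callable.

    `text` is passed through unchanged (it never contains markup, per the
    description above); only `version` is escaped for use in an attribute.
    """
    label = escape(str(version), quote=True)
    return '<span class="annotated" title="%s">%s</span>' % (label, text)


print(author_markup('Hello', 'version 2'))
```

Assuming the markup argument is accepted by keyword, this would be passed as html_annotate(doclist, markup=author_markup).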
Do a diff of the old and new document. The documents are HTML fragments (str/UTF8 or unicode), not complete documents (i.e., no <html> tag).
Returns HTML with <ins> and <del> tags added around the appropriate text.
Markup is generally ignored, with the markup from new_html preserved, and possibly some markup from old_html (though it is considered acceptable to lose some of the old markup). Only the words in the HTML are diffed. The exceptions are <img> tags, which are treated like words, and the href attribute of <a> tags, which is noted inside the tag itself when it changes.
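The idea of diffing only words can be illustrated with the standard library. The sketch below is not this module's implementation: it works on plain text, uses difflib.SequenceMatcher, and ignores all markup handling. The name word_diff is hypothetical, and the order in which it emits <del> and <ins> is a choice made for the example.

```python
from difflib import SequenceMatcher


def word_diff(old_text, new_text):
    """Word-level diff of two plain-text strings (no markup handling)."""
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    out = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == 'equal':
            out.append(' '.join(new_words[j1:j2]))
        if op in ('delete', 'replace'):
            # Words present only in the old text.
            out.append('<del>%s</del>' % ' '.join(old_words[i1:i2]))
        if op in ('insert', 'replace'):
            # Words present only in the new text.
            out.append('<ins>%s</ins>' % ' '.join(new_words[j1:j2]))
    return ' '.join(out)


print(word_diff('Hello World', 'Goodbye World'))
```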
Cleans up any DEL_START/DEL_END markers in the document, replacing them with <del></del>. To do this while keeping the document valid, it may need to drop some tags (either start or end tags).
It may also move the del into adjacent tags so that it ends up close to where it was originally located (e.g., moving a delete into a preceding <div> tag, if the del looks like (DEL_START, 'Text</div>', DEL_END)).
Return (unbalanced_start, balanced, unbalanced_end), where each is a list of text and tag chunks.
unbalanced_start is a list of all the tags that are opened, but not closed in this span. Similarly, unbalanced_end is a list of tags that are closed but were not opened. Extracting these might mean some reordering of the chunks.
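A simplified sketch of this three-way split, assuming chunks are strings and that tags start with '<'. The function name, the VOID_TAGS set, and the handling of mismatched end tags are assumptions made for illustration; the module's own function may treat edge cases differently.

```python
# Assumed subset of tags that never take an end tag.
VOID_TAGS = {'br', 'img', 'hr', 'input', 'meta', 'param', 'area', 'base'}


def split_unbalanced_sketch(chunks):
    unbalanced_start, balanced, unbalanced_end = [], [], []
    open_stack = []  # entries: (tag name, index in `balanced`, chunk)
    for chunk in chunks:
        if not chunk.startswith('<'):
            balanced.append(chunk)  # plain text
            continue
        name = chunk.split()[0].strip('<>/')
        if name in VOID_TAGS:
            balanced.append(chunk)
        elif chunk.startswith('</'):
            if open_stack and open_stack[-1][0] == name:
                open_stack.pop()
                balanced.append(chunk)
            else:
                # Closed here but never opened in this span.
                unbalanced_end.append(chunk)
        else:
            open_stack.append((name, len(balanced), chunk))
            balanced.append(chunk)
    # Tags still open were never closed: pull them out of `balanced`,
    # highest index first so the remaining indices stay valid.
    for _name, index, chunk in reversed(open_stack):
        del balanced[index]
        unbalanced_start.insert(0, chunk)
    return unbalanced_start, balanced, unbalanced_end


print(split_unbalanced_sketch(
    ['</p>', '<b>', 'bold', '</b>', 'text', '<div>', 'more']))
```

Here the leading '</p>' lands in unbalanced_end and the dangling '<div>' in unbalanced_start, while the matched '<b>' pair stays in the balanced list.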
pre_delete and post_delete implicitly point to a place in the document (where the two were split). This moves that point (by popping items from one and pushing them onto the other). It moves the point to try to find a place where unbalanced_start applies.
As an example:
>>> unbalanced_start = ['<div>']
>>> doc = ['<p>', 'Text', '</p>', '<div>', 'More Text', '</div>']
>>> pre, post = doc[:3], doc[3:]
>>> pre, post
(['<p>', 'Text', '</p>'], ['<div>', 'More Text', '</div>'])
>>> locate_unbalanced_start(unbalanced_start, pre, post)
>>> pre, post
(['<p>', 'Text', '</p>', '<div>'], ['More Text', '</div>'])
As you can see, we moved the point so that the dangling <div> that we found will be effectively replaced by the div in the original document. If this doesn't work out, we just throw away unbalanced_start without doing anything.
Parses the given HTML and returns token objects (words with attached tags).
This parses only the content of a page; anything in the head is ignored, and the <head> and <body> elements are themselves optional. The content is then parsed by lxml, which ensures the validity of the resulting parsed document (though lxml may make incorrect guesses when the markup is particularly bad).
<ins> and <del> tags are also eliminated from the document, as that gets confusing.
If include_hrefs is true, then the href attribute of <a> tags is included as a special kind of diffable token.
Parses an HTML fragment, returning an lxml element. Note that the HTML will be wrapped in a <div> tag that was not in the original document.
If cleanup is true, make sure there's no <head> or <body>, and get rid of any <ins> and <del> tags.
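The cleanup step can be sketched with the three regular expressions listed at the top of this page. The variable and function names below are assumptions (the page shows only the pattern values), and this is an illustration of the idea rather than a copy of the module's code.

```python
import re

# Patterns as listed in the module's variables; the names are assumed.
body_re = re.compile(r'(?is)<body.*?>')
end_body_re = re.compile(r'(?is)</body.*?>')
ins_del_re = re.compile(r'(?is)</?(ins|del).*?>')


def cleanup_sketch(html):
    """Keep only what is inside <body>, and drop <ins>/<del> tags."""
    match = body_re.search(html)
    if match:
        html = html[match.end():]      # discard everything up to <body ...>
    match = end_body_re.search(html)
    if match:
        html = html[:match.start()]    # discard </body> and what follows
    return ins_del_re.sub('', html)    # remove the tags, keep their text


page = ('<html><head><title>x</title></head>'
        '<body class="a"><p>Hi <ins>there</ins></p></body></html>')
print(cleanup_sketch(page))
```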
This function takes a word, such as 'test\n', and returns ('test', '\n').
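One straightforward way to get this behavior is to measure the word after stripping trailing whitespace and slice at that point. The sketch below is an illustration with a hypothetical name, not necessarily how the module does it.

```python
def split_trailing_whitespace_sketch(word):
    """Return (word without trailing whitespace, the trailing whitespace)."""
    stripped_length = len(word.rstrip())
    return word[:stripped_length], word[stripped_length:]


print(repr(split_trailing_whitespace_sketch('test\n\n')))
```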
Takes an lxml element el, and generates all the text chunks for that tag. Each start tag is a chunk, each word is a chunk, and each end tag is a chunk.
If skip_tag is true, then the outermost container tag is not returned (just its contents).
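The chunking described above can be sketched with the standard library's ElementTree instead of lxml. This is a simplified illustration under several assumptions: the name flatten_sketch is hypothetical, attributes are dropped from start tags, whitespace is discarded rather than kept with each word, and the input must be well-formed XML.

```python
import xml.etree.ElementTree as ET


def flatten_sketch(el, skip_tag=False):
    """Yield start tags, words, and end tags for `el` in document order."""
    if not skip_tag:
        yield '<%s>' % el.tag
    for word in (el.text or '').split():
        yield word
    for child in el:
        # Nested elements are always emitted with their own tags.
        yield from flatten_sketch(child)
        # Text following the child's end tag belongs to the parent.
        for word in (child.tail or '').split():
            yield word
    if not skip_tag:
        yield '</%s>' % el.tag


root = ET.fromstring('<div>Hello <b>big</b> world</div>')
print(list(flatten_sketch(root)))
print(list(flatten_sketch(root, skip_tag=True)))
```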
Serialize a single lxml element as HTML. The serialized form includes the element's tail.
If skip_outer is true, then don't serialize the outermost tag.
(u'param',u'img',u'area',u'br',u'basefont',u'input',u'base',u'meta',...
(u'address',u'blockquote',u'center',u'dir',u'div',u'dl',u'fieldset',u'form',...
(u'dd',u'dt',u'frameset',u'li',u'tbody',u'td',u'tfoot',u'th',...