Message 231278 - Python tracker


In-reply-to
Author	Alexander.Todorov
Recipients	Alexander.Todorov
Date	2014-11-17.10:42:23
Message-id	<1416220944.83.0.361625539874.issue22891@psf.upfronthosting.co.za>

Content

In the urllib.parse (or urlparse on Python 2.X) module there is this function:
 157 def urlsplit(url, scheme='', allow_fragments=True):
 158     """Parse a URL into 5 components:
 159     <scheme>://<netloc>/<path>?<query>#<fragment>
 160     Return a 5-tuple: (scheme, netloc, path, query, fragment).
 161     Note that we don't break the components up in smaller bits
 162     (e.g. netloc is a single string) and we don't expand % escapes."""
 163     allow_fragments = bool(allow_fragments)
 164     key = url, scheme, allow_fragments, type(url), type(scheme)
 165     cached = _parse_cache.get(key, None)
 166     if cached:
 167         return cached
 168     if len(_parse_cache) >= MAX_CACHE_SIZE: # avoid runaway growth
 169         clear_cache()
 170     netloc = query = fragment = ''
 171     i = url.find(':')
 172     if i > 0:
 173         if url[:i] == 'http': # optimize the common case
 174             scheme = url[:i].lower()
 175             url = url[i+1:]
 176             if url[:2] == '//':
 177                 netloc, url = _splitnetloc(url, 2)
 178             if allow_fragments and '#' in url:
 179                 url, fragment = url.split('#', 1)
 180             if '?' in url:
 181                 url, query = url.split('?', 1)
 182             v = SplitResult(scheme, netloc, url, query, fragment)
 183             _parse_cache[key] = v
 184             return v
 185         for c in url[:i]:
 186             if c not in scheme_chars:
 187                 break
 188         else:
 189             scheme, url = url[:i].lower(), url[i+1:]
 190
 191     if url[:2] == '//':
 192         netloc, url = _splitnetloc(url, 2)
 193     if allow_fragments and '#' in url:
 194         url, fragment = url.split('#', 1)
 195     if '?' in url:
 196         url, query = url.split('?', 1)
 197     v = SplitResult(scheme, netloc, url, query, fragment)
 198     _parse_cache[key] = v
 199     return v
There is an issue here (or a few of them) as follows:
* if url[:i] is already lowercase (equals "http") (line 173) then .lower() on line 174 is redundant:
174 scheme = url[:i].lower() # <--- no need for .lower() b/c value is "http"
* OTOH the condition on line 173 could be refactored to also match URLs where the scheme is uppercase or mixed case. For example:
 173 if url[:i].lower() == 'http': # optimize the common case
* The code as is returns the same results (as far as I've tested it) for all three of:
urlsplit("http://github.com/atodorov/repo.git?param=value#myfragment")
urlsplit("HTTP://github.com/atodorov/repo.git?param=value#myfragment")
urlsplit("HTtP://github.com/atodorov/repo.git?param=value#myfragment")
but the last 2 invocations miss the fast path and go through lines 185-199 instead
* Lines 174-184 are essentially the same as lines 189-199. The only optimization I can see is avoiding the for loop on lines 185-187, which checks for valid characters in the URL scheme and runs only a few iterations because scheme names are usually quite short.
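The equivalence claimed for the three invocations above can be checked directly against the standard library. This is only a minimal sketch, assuming Python 3 (on Python 2 the import would be `from urlparse import urlsplit`); the helper name `split_all` is made up for illustration and is not part of urllib.

```python
from urllib.parse import urlsplit

# The three URLs from the report; they differ only in the case of the scheme.
URLS = [
    "http://github.com/atodorov/repo.git?param=value#myfragment",
    "HTTP://github.com/atodorov/repo.git?param=value#myfragment",
    "HTtP://github.com/atodorov/repo.git?param=value#myfragment",
]


def split_all(urls):
    """Return the urlsplit() result for each URL as a plain tuple (hypothetical helper)."""
    return [tuple(urlsplit(u)) for u in urls]


results = split_all(URLS)
# Every variant should produce the same 5-tuple, with the scheme lowercased.
all_equal = all(r == results[0] for r in results)
print(results[0])
print("all equal:", all_equal)
```

This only shows that the results match; it says nothing about which internal code path each call takes.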
My personal vote goes for removal of lines 173-184. 
Version-Release number of selected component (if applicable):
This is present in both Python 3 and Python 2 on all versions I have access to:
python3-libs-3.4.1-16.fc21.x86_64.rpm
python-libs-2.7.8-5.fc21.x86_64.rpm
python-libs-2.7.5-16.el7.x86_64.rpm
python-libs-2.6.6-52.el6.x86_64
Versions are from Fedora Rawhide and RHEL. Also the same code is present in the Mercurial repository.
Bug first reported as
https://bugzilla.redhat.com/show_bug.cgi?id=1160603
and now filing here for upstream consideration.

History
Date	User	Action	Args
2014-11-17 10:42:24	Alexander.Todorov	set	recipients: + Alexander.Todorov
2014-11-17 10:42:24	Alexander.Todorov	set	messageid: <1416220944.83.0.361625539874.issue22891@psf.upfronthosting.co.za>
2014-11-17 10:42:24	Alexander.Todorov	link	issue22891 messages
2014-11-17 10:42:23	Alexander.Todorov	create