This issue tracker has been migrated to GitHub ,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python's Developer Guide.
Created on 2014年11月17日 10:42 by Alexander.Todorov, last changed 2022年04月11日 14:58 by admin. This issue is now closed.
| Files | ||||
|---|---|---|---|---|
| File name | Uploaded | Description | Edit | |
| urlparse_refactor.patch | Alexander.Todorov, 2014年11月17日 10:42 | patch which removes unnecessary lines from urlparse() | review | |
| Pull Requests | |||
|---|---|---|---|
| URL | Status | Linked | Edit |
| PR 661 | Tim.Graham, 2017年03月14日 13:55 | ||
| PR 16837 | merged | miss-islington, 2019年10月18日 13:07 | |
| PR 16839 | merged | orsenthil, 2019年10月18日 13:51 | |
| Messages (10) | |||
|---|---|---|---|
| msg231278 - (view) | Author: Alexander Todorov (Alexander.Todorov) * | Date: 2014年11月17日 10:42 | |
In the urllib.parse (or urlparse on Python 2.X) module there is this function:
157 def urlsplit(url, scheme='', allow_fragments=True):
158 """Parse a URL into 5 components:
159 <scheme>://<netloc>/<path>?<query>#<fragment>
160 Return a 5-tuple: (scheme, netloc, path, query, fragment).
161 Note that we don't break the components up in smaller bits
162 (e.g. netloc is a single string) and we don't expand % escapes."""
163 allow_fragments = bool(allow_fragments)
164 key = url, scheme, allow_fragments, type(url), type(scheme)
165 cached = _parse_cache.get(key, None)
166 if cached:
167 return cached
168 if len(_parse_cache) >= MAX_CACHE_SIZE: # avoid runaway growth
169 clear_cache()
170 netloc = query = fragment = ''
171 i = url.find(':')
172 if i > 0:
173 if url[:i] == 'http': # optimize the common case
174 scheme = url[:i].lower()
175 url = url[i+1:]
176 if url[:2] == '//':
177 netloc, url = _splitnetloc(url, 2)
178 if allow_fragments and '#' in url:
179 url, fragment = url.split('#', 1)
180 if '?' in url:
181 url, query = url.split('?', 1)
182 v = SplitResult(scheme, netloc, url, query, fragment)
183 _parse_cache[key] = v
184 return v
185 for c in url[:i]:
186 if c not in scheme_chars:
187 break
188 else:
189 scheme, url = url[:i].lower(), url[i+1:]
190
191 if url[:2] == '//':
192 netloc, url = _splitnetloc(url, 2)
193 if allow_fragments and '#' in url:
194 url, fragment = url.split('#', 1)
195 if '?' in url:
196 url, query = url.split('?', 1)
197 v = SplitResult(scheme, netloc, url, query, fragment)
198 _parse_cache[key] = v
199 return v
There is an issue here (or a few of them) as follows:
* if url[:1] is already lowercase (equals "http") (line 173) then .lower() on line 174 is reduntant:
174 scheme = url[:i].lower() # <--- no need for .lower() b/c value is "http"
* OTOH line 173 could refactor the condition and match URLs where the scheme is uppercase. For example
173 if url[:i].lower() == 'http': # optimize the common case
* The code as is returns the same results (as far as I've tested it) for both:
urlsplit("http://github.com/atodorov/repo.git?param=value#myfragment")
urlsplit("HTTP://github.com/atodorov/repo.git?param=value#myfragment")
urlsplit("HTtP://github.com/atodorov/repo.git?param=value#myfragment")
but the last 2 invocations also go through lines 185-199
* Lines 174-184 are essentially the same as lines 189-199. The only optimization I can see is avoiding the for loop around lines 185-187 which checks for valid characters in the URL scheme and executes only a few loops b/c scheme names are quite short usually.
My personal vote goes for removal of lines 173-184.
Version-Release number of selected component (if applicable):
This is present in both Python 3 and Python 2 on all versions I have access to:
python3-libs-3.4.1-16.fc21.x86_64.rpm
python-libs-2.7.8-5.fc21.x86_64.rpm
python-libs-2.7.5-16.el7.x86_64.rpm
python-libs-2.6.6-52.el6.x86_64
Versions are from Fedora Rawhide and RHEL. Also the same code is present in the Mercurial repository.
Bug first reported as
https://bugzilla.redhat.com/show_bug.cgi?id=1160603
and now filing here for upstream consideration.
|
|||
| msg231283 - (view) | Author: Senthil Kumaran (orsenthil) * (Python committer) | Date: 2014年11月17日 14:23 | |
Thanks for filing this. Does test cases pass after removal of those lines? (I will be surprised) and if the tests fail, then after analyzing it, this could only be considered a new change (thus change to made in latest python) in order not to break compatibility. |
|||
| msg231309 - (view) | Author: Alexander Todorov (Alexander.Todorov) * | Date: 2014年11月18日 08:53 | |
> Does test cases pass after removal of those lines? (I will be surprised) Yes they do. Also see my detailed explanation above, there's no reason for test cases to fail. |
|||
| msg238254 - (view) | Author: Martin Panter (martin.panter) * (Python committer) | Date: 2015年03月17日 02:01 | |
The patch seems sensible. The only behaviour change I can forsee would be the odd case of http:1234 no longer being parsed like this:
>>> urlsplit("http:1234")
SplitResult(scheme='http', netloc='', path='1234', query='', fragment='')
Instead it would be parsed the same as HTTP:1234 (or tel:1234!):
>>> urlsplit("HTTP:1234")
SplitResult(scheme='', netloc='', path='HTTP:1234', query='', fragment='')
If optimizing for "http:" really is important, it might still be done without the code duplication. Other options might be factoring a subroutine, using str.strip(), set.issubset(), or regular expressions.
|
|||
| msg271712 - (view) | Author: Martin Panter (martin.panter) * (Python committer) | Date: 2016年07月30日 23:52 | |
Issue 27657 has been opened about the quirk with numeric URLs |
|||
| msg289586 - (view) | Author: Tim Graham (Tim.Graham) * | Date: 2017年03月14日 13:55 | |
I sent a pull request for this issue's dependency (issue 27657) and added a second commit there with this patch. A test added in the first commit demonstrates that removing this code no longer results in the behavior change described in msg238254. |
|||
| msg354890 - (view) | Author: Senthil Kumaran (orsenthil) * (Python committer) | Date: 2019年10月18日 13:07 | |
New changeset 5a88d50ff013a64fbdb25b877c87644a9034c969 by Senthil Kumaran (Tim Graham) in branch 'master': bpo-27657: Fix urlparse() with numeric paths (#661) https://github.com/python/cpython/commit/5a88d50ff013a64fbdb25b877c87644a9034c969 |
|||
| msg354895 - (view) | Author: miss-islington (miss-islington) | Date: 2019年10月18日 13:24 | |
New changeset 82b5f6b16e051f8a2ac6e87ba86b082fa1c4a77f by Miss Islington (bot) in branch '3.7': bpo-27657: Fix urlparse() with numeric paths (GH-661) https://github.com/python/cpython/commit/82b5f6b16e051f8a2ac6e87ba86b082fa1c4a77f |
|||
| msg354904 - (view) | Author: Senthil Kumaran (orsenthil) * (Python committer) | Date: 2019年10月18日 15:23 | |
New changeset 0f3187c1ce3b3ace60f6c1691dfa3d4e744f0384 by Senthil Kumaran in branch '3.8': [3.8] bpo-27657: Fix urlparse() with numeric paths (GH-661) (#16839) https://github.com/python/cpython/commit/0f3187c1ce3b3ace60f6c1691dfa3d4e744f0384 |
|||
| msg393915 - (view) | Author: Gregory P. Smith (gregory.p.smith) * (Python committer) | Date: 2021年05月19日 01:37 | |
There are still a lot of cleanliness and even logical issues lurking within urllib.parse. But this bug is dated and other changes have happened since. I'm closing it as out of date. Open new issues for specific action items. |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022年04月11日 14:58:10 | admin | set | github: 67080 |
| 2021年05月19日 01:37:03 | gregory.p.smith | set | status: open -> closed nosy: + gregory.p.smith messages: + msg393915 resolution: out of date stage: patch review -> resolved |
| 2019年10月18日 15:23:21 | orsenthil | set | messages: + msg354904 |
| 2019年10月18日 13:51:49 | orsenthil | set | pull_requests: + pull_request16391 |
| 2019年10月18日 13:24:32 | miss-islington | set | nosy:
+ miss-islington messages: + msg354895 |
| 2019年10月18日 13:07:37 | miss-islington | set | pull_requests: + pull_request16385 |
| 2019年10月18日 13:07:36 | orsenthil | set | messages: + msg354890 |
| 2017年03月14日 13:55:02 | Tim.Graham | set | nosy:
+ Tim.Graham messages: + msg289586 pull_requests: + pull_request549 |
| 2016年07月30日 23:52:19 | martin.panter | set | dependencies:
+ urlparse fails if the path is numeric messages: + msg271712 |
| 2015年03月17日 02:01:15 | martin.panter | set | nosy:
+ martin.panter messages: + msg238254 |
| 2014年11月18日 08:53:29 | Alexander.Todorov | set | messages: + msg231309 |
| 2014年11月17日 14:23:34 | orsenthil | set | messages: + msg231283 |
| 2014年11月17日 13:28:24 | berker.peksag | set | nosy:
+ orsenthil title: [patch]: code removal from urllib.parse.urlsplit() -> code removal from urllib.parse.urlsplit() stage: patch review versions: - Python 2.7, Python 3.2, Python 3.3, Python 3.4, Python 3.6 |
| 2014年11月17日 10:42:24 | Alexander.Todorov | create | |