RFC 1738: Uniform Resource Locators (URL) specification
The specification for URLs (RFC 1738, Dec. '94) poses a problem: it restricts the characters allowed in URLs to a small subset of the US-ASCII character set:
"...Only alphanumerics [0-9a-zA-Z], the special characters "$-_.+!*'()," [not including the quotes - ed], and reserved characters used for their reserved purposes may be used unencoded within a URL."
HTML, on the other hand, allows the entire range of the ISO-8859-1 (ISO-Latin) character set to be used in documents, and HTML4 expands the allowable range to include all of the Unicode character set as well. Non-ISO-8859-1 characters (characters above FF hex/255 decimal in the Unicode set) cannot be used in URLs at all, because there is not yet a safe way to specify character set information in the URL content [RFC2396].
URLs should be encoded everywhere an HTML document references a URL to import an object (the A, APPLET, AREA, BASE, BGSOUND, BODY, EMBED, FORM, FRAME, IFRAME, ILAYER, IMG, ISINDEX, INPUT, LAYER, LINK, OBJECT, SCRIPT, SOUND, TABLE, TD, TH, and TR elements).
What characters need to be encoded and why?
ASCII Control characters
Why: These characters are not printable.
Characters: The ISO-8859-1 (ISO-Latin) character ranges 00-1F hex (0-31 decimal) and 7F hex (127 decimal).
Non-ASCII characters
Why: These are by definition not legal in URLs since they are not in the ASCII set.
Characters: The entire "top half" of the ISO-Latin set, 80-FF hex (128-255 decimal).
"Reserved characters"
Why: URLs use some characters for special purposes in defining their syntax. When these characters are not used in their special role inside a URL, they need to be encoded.
Characters:
| Character | Code Point (Hex) | Code Point (Dec) |
| --- | --- | --- |
| Dollar ("$") | 24 | 36 |
| Ampersand ("&") | 26 | 38 |
| Plus ("+") | 2B | 43 |
| Comma (",") | 2C | 44 |
| Forward slash/Virgule ("/") | 2F | 47 |
| Colon (":") | 3A | 58 |
| Semi-colon (";") | 3B | 59 |
| Equals ("=") | 3D | 61 |
| Question mark ("?") | 3F | 63 |
| 'At' symbol ("@") | 40 | 64 |
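The hex and decimal columns for the reserved characters can be reproduced from the characters themselves. The following is a minimal sketch; the names `RESERVED` and `code_points` are illustrative and not part of any specification.

```python
# Reserved characters as listed in this section (illustrative constant name).
RESERVED = "$&+,/:;=?@"


def code_points(chars):
    """Return (character, two-digit hex code point, decimal code point) tuples."""
    return [(ch, format(ord(ch), "02X"), ord(ch)) for ch in chars]


for ch, hex_cp, dec_cp in code_points(RESERVED):
    print(f"{ch!r}  hex {hex_cp}  dec {dec_cp}")
```

Since all of these characters are plain ASCII, `ord` gives the same value as the ISO-Latin code point.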
"Unsafe characters"
Why: Some characters can be misunderstood within URLs for various reasons. These characters should also always be encoded.
Characters:
| Character | Code Point (Hex) | Code Point (Dec) | Why encode? |
| --- | --- | --- | --- |
| Space | 20 | 32 | Significant sequences of spaces may be lost in some uses (especially multiple spaces). |
| Quotation marks | 22 | 34 | These characters are often used to delimit URLs in plain text. |
| 'Less Than' symbol ("<") | 3C | 60 | These characters are often used to delimit URLs in plain text. |
| 'Greater Than' symbol (">") | 3E | 62 | These characters are often used to delimit URLs in plain text. |
| 'Pound' character ("#") | 23 | 35 | This is used in URLs to indicate where a fragment identifier (bookmarks/anchors in HTML) begins. |
| Percent character ("%") | 25 | 37 | This is used to URL encode/escape other characters, so it should itself also be encoded. |
| Left Curly Brace ("{") | 7B | 123 | Some systems may modify these characters. |
| Right Curly Brace ("}") | 7D | 125 | Some systems may modify these characters. |
| Vertical Bar/Pipe ("\|") | 7C | 124 | Some systems may modify these characters. |
| Backslash ("\") | 5C | 92 | Some systems may modify these characters. |
| Caret ("^") | 5E | 94 | Some systems may modify these characters. |
| Tilde ("~") | 7E | 126 | Some systems may modify these characters. |
| Left Square Bracket ("[") | 5B | 91 | Some systems may modify these characters. |
| Right Square Bracket ("]") | 5D | 93 | Some systems may modify these characters. |
| Grave Accent ("`") | 60 | 96 | Some systems may modify these characters. |
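The four categories described in this section can be expressed as a small classifier. This is only a sketch of the groupings as this page lists them; the function name `classify` and the category labels are made up for illustration.

```python
# Character groups as listed in this section (constant names are illustrative).
RESERVED = set("$&+,/:;=?@")
UNSAFE = set(' "<>#%{}|\\^~[]`')


def classify(ch):
    """Name the category of a single character, following the groupings above."""
    cp = ord(ch)
    if cp <= 0x1F or cp == 0x7F:
        return "control"      # 00-1F hex and 7F hex
    if cp >= 0x80:
        return "non-ASCII"    # 80 hex and above
    if ch in RESERVED:
        return "reserved"     # encode unless used in its special role
    if ch in UNSAFE:
        return "unsafe"       # should always be encoded
    return "safe"             # may appear unencoded


for sample in ["a", " ", "/", "\n", "\u00e9"]:
    print(repr(sample), classify(sample))
```

Anything reported as "control", "non-ASCII" or "unsafe" should be encoded; "reserved" characters need encoding only when they are not serving their syntactic purpose.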
How are characters URL encoded?
URL encoding of a character consists of a "%" symbol, followed by the two-digit hexadecimal representation (case-insensitive) of the ISO-Latin code point for the character.
Example:
- Space = decimal code point 32 in the ISO-Latin set.
- 32 decimal = 20 in hexadecimal.
- The URL encoded representation will be "%20".
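The rule above can be sketched in a few lines of Python. The function name `percent_encode_char` is hypothetical; the sketch assumes the character has an ISO-Latin code point (0-255) and rejects anything above that range, since this scheme gives no encoding for it.

```python
def percent_encode_char(ch):
    """Return "%" plus the two-digit hex ISO-Latin code point of one character."""
    cp = ord(ch)
    if cp > 0xFF:
        # Outside ISO-8859-1: the scheme described here has no encoding for it.
        raise ValueError("character is outside the ISO-Latin range")
    # Upper-case hex is used here; lower-case digits are equally valid.
    return "%{:02X}".format(cp)


print(percent_encode_char(" "))  # %20
print(percent_encode_char("~"))  # %7E
```

The `02` in the format string pads code points below 10 hex with a leading zero, so a line feed (0A hex) becomes "%0A" rather than "%A".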
URL encoding converter
The box below allows you to convert content between its unencoded and encoded forms. The initial input state is considered to be "unencoded" (hit 'Convert' at the beginning to start in the encoded state). Further, to allow actual URLs to be encoded, this little converter does not encode URL syntax characters (the ";", "/", "?", ":", "@", "=", "#" and "&" characters). If you also need to encode these characters for any reason, see the "Reserved characters" table above for the appropriate encoded values.
NOTE:
This converter uses the String.charCodeAt and String.fromCharCode functions, which are only available in JavaScript version 1.2 or later, so it doesn't work in Opera 3.x and below, Netscape 3 and below, or IE 3 and below. Browser detection can be tiresome, so this will just fail in those browsers...you have been warned. 8-}
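The behaviour described for the converter can be approximated as follows. This is a sketch in Python rather than the page's own JavaScript, and it is not the converter's actual source: the names `encode`, `decode` and `KEEP` are invented, and it assumes that alphanumerics and the RFC 1738 special characters are left alone along with the URL syntax characters. The sample URL is likewise made up.

```python
import re
import string

URL_SYNTAX = ";/?:@=#&"    # left unencoded so whole URLs can be converted
SPECIALS = "$-_.+!*'(),"   # RFC 1738 special characters (assumed to be kept)
KEEP = set(string.ascii_letters + string.digits + SPECIALS + URL_SYNTAX)


def encode(text):
    """Percent-encode every character of text that is not in KEEP."""
    out = []
    for ch in text:
        if ch in KEEP:
            out.append(ch)
        elif ord(ch) <= 0xFF:
            out.append("%{:02X}".format(ord(ch)))
        else:
            raise ValueError("character is outside the ISO-Latin range")
    return "".join(out)


def decode(text):
    """Replace each %XX sequence with the character at that code point."""
    return re.sub(r"%([0-9A-Fa-f]{2})",
                  lambda m: chr(int(m.group(1), 16)), text)


url = "http://example.com/my file.html?a=1&b=2"
print(encode(url))          # http://example.com/my%20file.html?a=1&b=2
print(decode(encode(url)))  # http://example.com/my file.html?a=1&b=2
```

Because "%" itself is not in `KEEP`, it is encoded as "%25", which keeps a round trip through `encode` and `decode` lossless.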
Browser Peculiarities
- Internet Explorer is notoriously relaxed in its requirements for encoding spaces in URLs. This tends to encourage sloppy URL authoring. Keep in mind that Netscape and Opera are much stricter on this point, and spaces MUST be encoded if the URL is to be considered correct.