S\"
parse\" ( "ccc" -- c-addr u = Parse the input stream for an escaped string ) (see also 6.2.2008 PARSE)
\u and \U to handle Unicode codepoints (4 or 8 digits) in a portable way.
ESCAPE (c-addr len -- c-addr' len') and UNESCAPE (c-addr len -- c-addr' len') should be defined. These words could then form the basis of S\" and parse\". I have had to 're-invent' the escape mechanism, leading to potential inconsistencies in the way S\" and my ESCAPE work!
S" 6.1.2165 is the primary word for generating strings.
In more complex applications, it suffers from several deficiencies:
  S" string can only contain printable characters,
  S" string cannot contain the '"' character,
  S" string cannot be used with wide characters as discussed
     in the Forth 200x internationalisation and XCHAR proposals.
The surveyed Forth systems support S\" with very
similar operations. S\" behaves like S", but uses the '\' character
as an escape character for the entry of characters that cannot be
used with S".
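As a minimal sketch of the difference, the following defines a string containing double quotes and a tab, which the text above says cannot be entered with S". It assumes the host system already provides S\" with the behaviour proposed here (or that the reference implementation in this document has been loaded); the name QUOTED is hypothetical and used only for illustration.

```forth
\ Sketch only: assumes S\" is available as proposed in this document.
\ QUOTED is a hypothetical name, not part of the proposal.

: quoted ( -- c-addr u )
  S\" \"hi\"\t!" ;      \ \" gives a double quote, \t gives a tab

quoted TYPE             \ displays "hi", then a tab, then !
```

The string is six characters long: the two escaped quotes and the tab each occupy one character in the result.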
This technique is widespread in languages other than Forth.
It has benefit in a number of application areas.

The majority of current Forth systems contain code, either in the kernel or in application code, that assumes char=byte=au. To avoid breaking existing code, we have to live with this practice.
The following list describes what is currently available in the surveyed Forth systems that support escaped strings.
  \a          BEL (alert, ASCII 7)
  \b          BS (backspace, ASCII 8)
  \e          ESC (escape, ASCII 27)
  \f          FF (form feed, ASCII 12)
  \l          LF (line feed, ASCII 10)
  \m          CR/LF pair (ASCII 13, 10) - for HTML etc.
  \n          newline - CR/LF for Windows/DOS, LF for Unices
  \q          double-quote (ASCII 34)
  \r          CR (carriage return, ASCII 13)
  \t          HT (horizontal tab, ASCII 9)
  \v          VT (vertical tab, ASCII 11)
  \z          NUL (no character, ASCII 0)
  \"          double-quote (ASCII 34)
  \[0-7]+     Octal numerical character value, finishes at the first non-octal character
  \x[0-9a-f]+ Hex numerical character value, finishes at the first non-hex character
  \\          backslash itself (ASCII 92)
  \           before any other character represents that character

Item 1) is about the common char=byte=au assumption.
Item 2) includes the use of COUNT to step through memory and the
impact of char in the file word sets.
Item 3) has to rationalise a fixed width serial/comms channel with 1..4 byte characters, e.g. UTF-8.
Item 4) should enable 16 bit systems to handle UTF-8 and UTF-32.
The basis of the current approach is to use the terminology of
primitive characters and extended characters. A primitive character
(called a pchar here) is a fixed-width unit handled by EMIT and
friends as well as C@, C! and friends. A pchar corresponds to the
current ANS definition of a character. Characters that may be
wider than a pchar are called "extended characters" or xchars.
The xchars are an integer multiple of pchars. An xchar consists
of one or more primitive characters and represents the encoding
for a "display unit". A string is represented by caddr/len
in terms of primitive characters.
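To illustrate that a string length counts primitive characters rather than extended characters, here is a small sketch. It assumes 8-bit pchars, UTF-8 as the xchar encoding, and a system that provides S\"; the name E-ACUTE is hypothetical.

```forth
\ Sketch only: assumes 8-bit pchars, UTF-8 encoding, and that S\" exists.
\ E-ACUTE is a hypothetical name, not part of the proposal.

: e-acute ( -- c-addr u )
  S\" \xC3\xA9" ;       \ U+00E9 encoded in UTF-8 as the two pchars C3 A9

e-acute NIP .           \ prints 2: one xchar, but two pchars
```

The caddr/len pair therefore describes storage in pchars; counting display units would need the xchar words discussed in the XCHAR proposal.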
S\" is used like S" but treats the '\' character specially. One
or more characters after the '\' indicate what is substituted.
The following three of these cause parsing and readability
problems. As far as I know, requiring characters to come in
8 bit units will not upset any systems. Systems with characters
less than 7 bits are non-compliant, and I know of no 7 bit CPUs.
All current systems use character units of 8 bits or more.
  \[0-7]+     Octal numerical character value, finishes at the first non-octal character
  \x[0-9a-f]+ Hex numerical character value, finishes at the first non-hex character
  \           before any other character represents that character

Note that now that the number-prefix extension has been accepted, 3.4.1 Parsing contains a definition of <hexdigit> to be a case insensitive hexadecimal digit [0-9a-fA-F].
6.2.xxxx S\" s-slash-quote CORE EXT
Parse ccc delimited by " (double-quote), using the translation
rules below. Append the run-time semantics given below to the
current definition.
  \a      BEL (alert, ASCII 7)
  \b      BS (backspace, ASCII 8)
  \e      ESC (escape, ASCII 27)
  \f      FF (form feed, ASCII 12)
  \l      LF (line feed, ASCII 10)
  \m      CR/LF pair (ASCII 13, 10)
  \n      newline (implementation dependent newline, e.g. CR/LF or LF)
  \q      double-quote (ASCII 34)
  \r      CR (carriage return, ASCII 13)
  \t      HT (horizontal tab, ASCII 9)
  \v      VT (vertical tab, ASCII 11)
  \z      NUL (no character, ASCII 0)
  \"      double-quote (ASCII 34)
  \x<hexdigit><hexdigit>
          two-digit hexadecimal character value. An ambiguous condition
          exists if \x is not followed by two hexadecimal characters.
  \\      backslash itself (ASCII 92)
An ambiguous condition exists if a \ is placed before any
character, other than those defined in 6.2.xxxx S\".

See also: C", 11.6.1.2165 S", A.6.1.2165 S"

An ambiguous condition exists if:
  \x is not followed by two hexadecimal characters;
  \ is placed before any character, other than those defined
  in 6.2.xxxx S\".
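The fixed two-digit form of \x means that a digit following the escape is never absorbed into the character value. The sketch below shows this, assuming a system that provides S\" as specified above; the name HEX-THEN-DIGIT is hypothetical.

```forth
\ Sketch only: assumes S\" is available as specified in this document.
\ HEX-THEN-DIGIT is a hypothetical name, not part of the proposal.

: hex-then-digit ( -- c-addr u )
  S\" \x410" ;          \ \x41 is the letter A; the trailing 0 is an ordinary digit

hex-then-digit TYPE     \ displays A0
```

With the variable-length form found in some surveyed systems, the same source text would instead have been read as a single value built from all three hex digits.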
decimal
: c+!           \ c c-addr --
\ *G Add character C to the contents of address C-ADDR.
  tuck c@ + swap c!
;

: addchar       \ char string --
\ *G Add the character to the end of the counted string.
  tuck count + c!
  1 swap c+!
;

: append        \ c-addr u $dest --
\ *G Add the string described by C-ADDR U to the counted string at
\ ** $DEST. The strings must not overlap.
  >r
  tuck  r@ count +  swap cmove          \ add source to end
  r> c+!                                \ add length to count
;

: extract2H     \ c-addr len -- c-addr' len' u
\ *G Extract a two-digit hex number from the start of the string,
\ ** returning the remaining string and the converted number.
\ ** BASE is preserved.
  base @ >r  hex
  0 0 2over drop 2 >number 2drop drop   \ convert at most two digits
  >r  2 /string  r>                     \ step over the two digits
  r> base !
;

create EscapeTable \ -- addr
\ *G Table of translations for \a..\z.
7 c, \ \a BEL (Alert)
8 c, \ \b BS (Backspace)
char c c, \ \c
char d c, \ \d
27 c, \ \e ESC (Escape)
12 c, \ \f FF (Form feed)
char g c, \ \g
char h c, \ \h
char i c, \ \i
char j c, \ \j
char k c, \ \k
10 c, \ \l LF (Line feed)
char m c, \ \m
10 c, \ \n (Unices only)
char o c, \ \o
char p c, \ \p
char " c, \ \q " (Double quote)
13 c, \ \r CR (Carriage Return)
char s c, \ \s
9 c, \ \t HT (horizontal tab)
char u c, \ \u
11 c, \ \v VT (vertical tab)
char w c, \ \w
char x c, \ \x
char y c, \ \y
0 c, \ \z NUL (no character)
create CRLF$ \ -- addr ; CR/LF as counted string
2 c, 13 c, 10 c,
: addEscape     \ c-addr len dest -- c-addr' len'
\ *G Add an escape sequence to the counted string at dest,
\ ** returning the remaining string.
  over 0=                               \ zero length check
  if  drop  exit  then
  >r                                    \ -- caddr len ; R: -- dest
  over c@ [char] x = if                 \ hex number?
    1 /string extract2H r> addchar  exit
  then
  over c@ [char] m = if                 \ CR/LF pair
    1 /string  13 r@ addchar  10 r> addchar  exit
  then
  over c@ [char] n = if                 \ CR/LF pair? (Windows/DOS only)
    1 /string  crlf$ count r> append  exit
  then
  over c@ [char] a [char] z 1+ within if
    over c@ [char] a - EscapeTable + c@  r> addchar
  else
    over c@ r> addchar
  then
  1 /string
;

: parse\" \ c-addr len dest -- c-addr' len'
\ *G Parses a string up to an unescaped '"', translating '\'
\ ** escapes to characters. The translated string is a
\ ** counted string at *\i{dest}.
\ ** The supported escapes (case sensitive) are:
\ *D \a BEL (alert)
\ *D \b BS (backspace)
\ *D \e ESC (not in C99)
\ *D \f FF (form feed)
\ *D \l LF (ASCII 10)
\ *D \m CR/LF pair - for HTML etc.
\ *D \n newline - CRLF for Windows/DOS, LF for Unices
\ *D \q double-quote
\ *D \r CR (ASCII 13)
\ *D \t HT (tab)
\ *D \v VT
\ *D \z NUL (ASCII 0)
\ *D \" double-quote
\ *D \xAB Two char Hex numerical character value
\ *D \\ backslash itself
\ *D \ before any other character represents that character
  dup >r  0 swap c!                     \ zero destination
  begin                                 \ -- caddr len ; R: -- dest
    dup
  while
    over c@ [char] " <>                 \ check for terminator
  while
    over c@ [char] \ = if               \ deal with escapes
      1 /string  r@ addEscape
    else                                \ normal character
      over c@ r@ addchar  1 /string
    then
  repeat then
  dup                                   \ step over terminating "
  if  1 /string  then
  r> drop
;

create pocket   \ -- addr
\ *G A temporary buffer to hold the processed string.
\ This would normally be an internal system buffer.
  s" /COUNTED-STRING" environment? 0= [if] 256 [then]
  1 chars + allot

: readEscaped \ "ccc" -- c-addr
\ *G Parses an escaped string from the input stream according to
\ ** the rules of *\fo{parse\"} above, returning the address
\ ** of the translated counted string in *\fo{POCKET}.
  source >in @ /string tuck             \ -- len caddr len
  pocket parse\" nip
  - >in +!                              \ advance >IN by the amount consumed
  pocket
;

: S\" \ "string" -- caddr u
\ *G As *\fo{S"}, but translates escaped characters using
\ ** *\fo{parse\"} above.
  readEscaped count  state @
  if  postpone sliteral  then
; IMMEDIATE
HEX
T{ : GC5 S\" \a\b\e\f\l\m\q\r\t\v\x0F0\x1Fa\xaBx\z\"\\" ; -> }T
T{ GC5 SWAP DROP -> 14 }T \ String length
T{ GC5 DROP C@ -> 07 }T \ \a BEL Bell
T{ GC5 DROP 1 CHARS + C@ -> 08 }T \ \b BS Backspace
T{ GC5 DROP 2 CHARS + C@ -> 1B }T \ \e ESC Escape
T{ GC5 DROP 3 CHARS + C@ -> 0C }T \ \f FF Form feed
T{ GC5 DROP 4 CHARS + C@ -> 0A }T \ \l LF Line feed
T{ GC5 DROP 5 CHARS + C@ -> 0D }T \ \m CR of CR/LF pair
T{ GC5 DROP 6 CHARS + C@ -> 0A }T \ LF of CR/LF pair
T{ GC5 DROP 7 CHARS + C@ -> 22 }T \ \q " Double Quote
T{ GC5 DROP 8 CHARS + C@ -> 0D }T \ \r CR Carriage Return
T{ GC5 DROP 9 CHARS + C@ -> 09 }T \ \t TAB Horizontal Tab
T{ GC5 DROP A CHARS + C@ -> 0B }T \ \v VT Vertical Tab
T{ GC5 DROP B CHARS + C@ -> 0F }T \ \x0F Given Char
T{ GC5 DROP C CHARS + C@ -> 30 }T \ 0 0 Digit follow on
T{ GC5 DROP D CHARS + C@ -> 1F }T \ \x1F Given Char
T{ GC5 DROP E CHARS + C@ -> 61 }T \ a a Hex follow on
T{ GC5 DROP F CHARS + C@ -> AB }T \ \xaB Insensitive Given Char
T{ GC5 DROP 10 CHARS + C@ -> 78 }T \ x x Non hex follow on
T{ GC5 DROP 11 CHARS + C@ -> 00 }T \ \z NUL No Character
T{ GC5 DROP 12 CHARS + C@ -> 22 }T \ \" " Double Quote
T{ GC5 DROP 13 CHARS + C@ -> 5C }T \ \\ \ Back Slash
Note this does not test \n as this is a system dependent value.
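A weaker, system-independent check of \n is still possible if only the length of the result is examined. The sketch below is not part of the proposal's test suite; it assumes the same T{ }T test harness as the tests above and assumes that the system newline is one or two characters long (for example LF or CR/LF). The name NL$ is hypothetical.

```forth
\ Sketch only: assumes the newline sequence is 1 or 2 characters long.
\ A system with a longer newline would need a different upper bound.
\ NL$ is a hypothetical name, not part of the proposal.

T{ : NL$ S\" \n" ; -> }T
T{ NL$ SWAP DROP 1 3 WITHIN -> TRUE }T   \ length is 1 or 2
```

The literals 1 and 3 read the same in HEX and DECIMAL, so the check is unaffected by the HEX in force for the tests above.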
\" (now the same as for \q).S\".
Peter Knaggs <pjk@bcs.org.uk>
Engineering, Mathematics and Physical Sciences,
University of Exeter, Exeter, Devon EX4 7QF, England
web: www.rigwit.co.uk
Note that you can be both a system implementor and a programmer, so you can submit both kinds of ballots.