Escaped Strings S\"

Poll Standings

See below for voting instructions.

Systems

[ ] conforms to ANS Forth.
    VFX Forth (Windows/DOS/Linux) (Stephen Pelc)
    SwiftForth and SwiftX (Leon Wagner)
    bigForth (Bernd Paysan)
[ ] already implements the proposal in full since release [ ].
    VFX Forth (Windows/DOS/Linux) [since 3.80] (Stephen Pelc)
    SwiftForth and SwiftX [since 3.0] (Leon Wagner)
[ ] implements the proposal in full in a development version.
[ ] will implement the proposal in full in release [ ].
[ ] will implement the proposal in full in some future release.
    Bernd Paysan
[ ] There are no plans to implement the proposal in full in [ ].
[ ] will never implement the proposal in full.

Programmers

[ ] I have used (parts of) this proposal in my programs.
    Dick van Oudheusden
    Stephen Pelc
    Leon Wagner
    Graham Smith
[ ] I would use (parts of) this proposal in my programs if the systems I am interested in implemented it.
    Stephen Pelc
    Mark Wills
    Bernd Paysan
    Gerry Jackson
    David N. Williams
    Graham Smith
[ ] I would use (parts of) this proposal in my programs if this proposal was in the Forth standard.
    Stephen Pelc
    Mark Wills
    Gerry Jackson
    David N. Williams
    Graham Smith
[ ] I would not use (parts of) this proposal in my programs.
[ ] I couldn't care less or spoil ballot in computer readable form.
    Jacko

Informal Results

Dick van Oudheusden
The Forth Foundation Library uses this proposal.
Missing functionality in this proposal would be:
parse\" ( "ccc" -- c-addr u = Parse the input stream for an escaped string ) (see also 6.2.2008 PARSE)
Peter Fälth
I am still missing \u and \U to handle Unicode code points (4 or 8 digits) in a portable way.
Graham Smith
I have more use for 'escaping' text strings at run time. So, the words:
ESCAPE (c-addr len -- c-addr' len') and
UNESCAPE (c-addr len -- c-addr' len')
should be defined.
These words could then form the basis of S\" and parse\".
I have had to 're-invent' the escape mechanism, leading to potential inconsistencies in the way S\" and my ESCAPE work!

Problem

The word S" 6.1.2165 is the primary word for generating strings. In more complex applications, it suffers from several deficiencies:

the S" string can only contain printable characters,
the S" string cannot contain the '"' character,
the S" string cannot be used with wide characters as discussed in the Forth 200x internationalisation and XCHAR proposals.
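A small illustration of the second limitation, written with standard words only. The names quote-buf and quoted-hi are made up for this sketch; the point is merely that a double-quote has to be stored separately from the S" text.

```forth
decimal
create quote-buf 4 chars allot          \ room for "hi" plus two quotes

: quoted-hi ( -- c-addr u )
  [char] " quote-buf c!                 \ opening double-quote
  s" hi" quote-buf char+ swap cmove     \ body text copied after it
  [char] " quote-buf 3 chars + c!       \ closing double-quote
  quote-buf 4 ;
```

Executing quoted-hi type should display the two letters enclosed in double-quotes.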

Current practice

At least SwiftForth, Gforth and VFX Forth support S\" with very similar operations. S\" behaves like S", but uses the '\' character as an escape character for the entry of characters that cannot be used with S".

This technique is widespread in languages other than Forth.

It has benefit in areas such as

construction of multiline strings for display by operating system services,
construction of HTTP headers,
generation of GSM modem and Telnet control strings.
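As a sketch of the HTTP header case, assuming a system that already provides S\" with the \m escape listed in this document (http-ok is a hypothetical name chosen for the example):

```forth
decimal
\ Status line, then the empty line that ends an HTTP header.
\ Each \m contributes a CR/LF pair.
: http-ok ( -- c-addr u )
  s\" HTTP/1.0 200 OK\m\m" ;
```

The returned string could then be passed to TYPE or to whatever word writes to the connection.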

The majority of current Forth systems contain code, either in the kernel or in application code, that assumes char=byte=au. To avoid breaking existing code, we have to live with this practice.

The following list describes what is currently available in the surveyed Forth systems that support escaped strings.

\a BEL (alert, ASCII 7)

\b BS (backspace, ASCII 8)

\e ESC (escape, ASCII 27)

\f FF (form feed, ASCII 12)

\l LF (line feed, ASCII 10)

\m CR/LF pair (ASCII 13, 10) - for HTML etc.

\n newline CR/LF for Windows/DOS, LF for Unices

\q double-quote (ASCII 34)

\r CR (carriage return, ASCII 13)

\t HT (horizontal tab, ASCII 9)

\v VT (vertical tab, ASCII 11)

\z NUL (no character, ASCII 0)

\" double-quote (ASCII 34)

\[0-7]+ Octal numerical character value, finishes at the first non-octal character

\x[0-9a-f]+ Hex numerical character value, finishes at the first non-hex character

\\ backslash itself (ASCII 92)

\ before any other character represents that character

Considerations

We are trying to integrate several issues:

1) no/least code breakage
2) minimal standards changes
3) variable width character sets
4) small system functionality

Item 1) is about the common char=byte=au assumption.

Item 2) includes the use of COUNT to step through memory and the impact of char in the file word sets.

Item 3) has to rationalise a fixed width serial/comms channel with 1..4 byte characters, e.g. UTF-8.

Item 4) should enable 16-bit systems to handle UTF-8 and UTF-32.
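The COUNT idiom mentioned under Item 2) can be sketched as follows. This is only an illustration under the char=byte=au assumption of Item 1); sum-chars is a hypothetical word, not part of the proposal.

```forth
decimal
\ Add up the character codes of a string, using COUNT to fetch
\ one primitive character and advance the address each time round.
: sum-chars ( c-addr u -- n )
  0 swap 0 ?do      \ -- c-addr sum
    swap count      \ -- sum c-addr' char
    rot +           \ -- c-addr' sum'
  loop nip ;
```

Code of this kind only keeps working if strings remain sequences of fixed-width primitive characters.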

The basis of the current approach is to use the terminology of primitive characters and extended characters. A primitive character (called a pchar here) is a fixed-width unit handled by EMIT and friends as well as C@, C! and friends. A pchar corresponds to the current ANS definition of a character. Characters that may be wider than a pchar are called "extended characters" or xchars. The xchars are an integer multiple of pchars. An xchar consists of one or more primitive characters and represents the encoding for a "display unit". A string is represented by caddr/len in terms of primitive characters.

The consequences of this are:

No existing code is broken.
Most systems have only one keyboard and only one screen/display unit, but may have several additional comms channels. The impact of a keyboard driver having to convert Chinese or Russian characters into a (say) UTF-8 sequence is minimal compared to handling the key stroke sequences. Similarly on display.
Comms channels and files work as expected.
16-bit embedded systems can handle all character widths as they are described as strings.
No conflict arises with the XCHARs proposal.

Multiple encodings can be handled if they share a common primitive character size - nearly all encodings are described in terms of octets, e.g. TCP/IP, UTF-8, UTF-16, UTF-32, ...

Approach

This proposal does not require systems to handle xchars, and does not disenfranchise those that do.

S\" is used like S" but treats the '\' character specially. One or more characters after the '\' indicate what is substituted. Three of the escapes in current practice cause parsing and readability problems; they are discussed below.

As far as I know, requiring characters to come in 8-bit units will not upset any systems. Systems with characters of less than 8 bits are non-compliant, and I know of no 7-bit CPUs. All current systems use character units of 8 bits or more.

Of observed current practice, the following two numeric forms are problematic.

\[0-7]+ Octal numerical character value, finishes at the first non-octal character

\x[0-9a-f]+ Hex numerical character value, finishes at the first non-hex character

Why do we need two representations, both of variable length? This proposal selects the hexadecimal representation, requiring two hex digits. A consequence of this is that xchars must be represented as a sequence of pchars. Although initially seen as a problem by some people, it avoids at least the following problems:

Endian issues when transmitting an xchar, e.g. big-endian host to little-endian comms channel.
Issues when an xchar is larger than a cell, e.g. UTF-32 on a 16-bit system.
Difficulty in distinguishing the end of the number from a following character such as '0' or 'A'.
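A short sketch of both points, assuming a system that provides S\" as proposed here (euro and a-zero are hypothetical names). An extended character is written as its sequence of primitive characters, and the fixed two-digit form leaves a following digit untouched.

```forth
decimal
\ U+20AC EURO SIGN is the three octets E2 82 AC in UTF-8.
: euro ( -- c-addr u )
  s\" \xE2\x82\xAC" ;

\ Exactly two hex digits are consumed, so the trailing 0 is
\ ordinary text: the string is 'A' (hex 41) followed by '0'.
: a-zero ( -- c-addr u )
  s\" \x410" ;
```

Because each \x escape yields one primitive character, the byte order of the stored string is exactly the order written in the source.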

At least one system (Gforth) already supports UTF-8 as its native character set, and one system (JaxForth) used UTF-16. These systems are not affected.

\ before any other character represents that character

This is an unnecessary general case, and so is not mandated. By making it an ambiguous condition, we do not disenfranchise existing implementations, and leave the way open for future extensions.

Note that now the number-prefix extension has been accepted, 3.4.1 Parsing contains a definition of <hexdigit> to be a case insensitive hexadecimal digit [0-9a-fA-F].

Proposal

6.2.xxxx S\" s-slash-quote CORE EXT

Interpretation: Interpretation semantics for this word are undefined.
Compilation: ( "ccc<quote>" -- ) Parse ccc delimited by " (double-quote), using the translation rules below. Append the run-time semantics given below to the current definition.
Translation rules: Characters are processed one at a time and appended to the compiled string. If the character is a '\' character it is processed by parsing and substituting one or more characters as follows, where the character after the backslash is case sensitive:
\a BEL (alert, ASCII 7)

\b BS (backspace, ASCII 8)

\e ESC (escape, ASCII 27)

\f FF (form feed, ASCII 12)

\l LF (line feed, ASCII 10)

\m CR/LF pair (ASCII 13, 10)

\n newline (implementation dependent newline, e.g. CR/LF, LF, or LF/CR)

\q double-quote (ASCII 34)

\r CR (carriage return, ASCII 13)

\t HT (horizontal tab, ASCII 9)

\v VT (vertical tab, ASCII 11)

\z NUL (no character, ASCII 0)

\" double-quote (ASCII 34)

\x<hexdigit><hexdigit>

The resulting character is the conversion of these two hexadecimal digits. An ambiguous condition exists if \x is not followed by two hexadecimal characters.

\\ backslash itself (ASCII 92)

\ An ambiguous condition exists if a \ is placed before any character other than those defined in 6.2.xxxx S\".
Run-time: ( -- c-addr u ) Return c-addr and u describing a string consisting of the translation of the characters ccc. A program shall not alter the returned string.
See: 3.4.1 Parsing, 6.2.0855 C", 11.6.1.2165 S", A.6.1.2165 S"
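A brief usage sketch of the definition above, assuming a system that implements it. Since interpretation semantics are undefined, S\" appears only inside colon definitions; greeting and bell-line are hypothetical names.

```forth
decimal
\ Embedded double-quotes and a tab in one string.
: greeting ( -- c-addr u )
  s\" He said \"hi\"\tnow" ;

\ BEL, the text Done, then '!' given as hex 21.
: bell-line ( -- c-addr u )
  s\" \aDone\x21" ;
```

Both words return a string that the program must treat as read-only, as the run-time semantics require.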

Labelling

Ambiguous conditions occur:

If \x is not followed by two hexadecimal characters.
If a \ is placed before any character, other than those defined in 6.2.xxxx S\".

Reference Implementation

Taken from the VFX Forth source tree and modified to remove implementation dependencies. This code assumes the system is case insensitive.

Another implementation (with some deviations) can be found in the Gforth source tree.


decimal
: c+! \ c c-addr --
\ *G Add character C to the contents of address C-ADDR.
 tuck c@ + swap c!
;
: addchar \ char string --
\ *G Add the character to the end of the counted string.
 tuck count + c!
 1 swap c+!
;
: append \ c-addr u $dest --
\ *G Add the string described by C-ADDR U to the counted string at
\ ** $DEST. The strings must not overlap.
>r
 tuck r@ count + swap cmove \ add source to end
 r> c+! \ add length to count
;
: extract2H	\ c-addr len -- c-addr' len' u
\ *G Extract a two-digit hex number from the start of the
\ ** string, returning the remaining string and the
\ ** converted number.
 base @ >r hex
 0 0 2over drop 2 >number 2drop drop
 >r 2 /string r>
 r> base !
;
create EscapeTable \ -- addr
\ *G Table of translations for \a..\z.
 7 c,	\ \a BEL (Alert)
 8 c,	\ \b BS (Backspace)
 char c c, \ \c
 char d c, \ \d
 27 c,	\ \e ESC (Escape)
 12 c,	\ \f FF (Form feed)
 char g c, \ \g
 char h c, \ \h
 char i c, \ \i
 char j c, \ \j
 char k c, \ \k
 10 c,	\ \l LF (Line feed)
 char m c, \ \m
 10 c, \ \n (Unices only)
 char o c, \ \o
 char p c, \ \p
 char " c, \ \q " (Double quote)
 13 c,	\ \r CR (Carriage Return)
 char s c, \ \s
 9 c,	\ \t HT (horizontal tab)
 char u c, \ \u
 11 c,	\ \v VT (vertical tab)
 char w c, \ \w
 char x c, \ \x
 char y c, \ \y
 0 c,	\ \z NUL (no character)
create CRLF$ \ -- addr ; CR/LF as counted string
 2 c, 13 c, 10 c,
: addEscape	\ c-addr len dest -- c-addr' len'
\ *G Add an escape sequence to the counted string at dest,
\ ** returning the remaining string.
 over 0= \ zero length check
 if drop exit then
>r \ -- caddr len ; R: -- dest
 over c@ [char] x = if \ hex number?
 1 /string extract2H r> addchar exit
 then
 over c@ [char] m = if \ CR/LF pair
 1 /string 13 r@ addchar 10 r> addchar exit
 then
 over c@ [char] n = if \ CR/LF pair? (Windows/DOS only)
 1 /string crlf$ count r> append exit
 then
 over c@ [char] a [char] z 1+ within if
 over c@ [char] a - EscapeTable + c@ r> addchar
 else
 over c@ r> addchar
 then
 1 /string
;
: parse\"	\ c-addr len dest -- c-addr' len'
\ *G Parses a string up to an unescaped '"', translating '\'
\ ** escapes to characters. The translated string is a
\ ** counted string at *\i{dest}.
\ ** The supported escapes (case sensitive) are:
\ *D \a BEL (alert)
\ *D \b BS (backspace)
\ *D \e ESC (not in C99)
\ *D \f FF (form feed)
\ *D \l LF (ASCII 10)
\ *D \m CR/LF pair - for HTML etc.
\ *D \n newline - CRLF for Windows/DOS, LF for Unices
\ *D \q double-quote
\ *D \r CR (ASCII 13)
\ *D \t HT (tab)
\ *D \v VT
\ *D \z NUL (ASCII 0)
\ *D \" double-quote
\ *D \xAB Two char Hex numerical character value
\ *D \\ backslash itself
\ *D \ before any other character represents that character
 dup >r 0 swap c! \ zero destination
 begin \ -- caddr len ; R: -- dest
 dup
 while
 over c@ [char] " <> \ check for terminator
 while
 over c@ [char] \ = if \ deal with escapes
 1 /string r@ addEscape
 else \ normal character
 over c@ r@ addchar 1 /string
 then
 repeat then
 dup \ step over terminating "
 if 1 /string then
 r> drop
;
create pocket \ -- addr
\ *G A temporary buffer to hold processed string.
\ This would normally be an internal system buffer.
s" /COUNTED-STRING" environment? 0= [if] 256 [then]
1 chars + allot
: readEscaped	\ "ccc" -- c-addr
\ *G Parses an escaped string from the input stream according to
\ ** the rules of *\fo{parse\"} above, returning the address
\ ** of the translated counted string in *\fo{POCKET}.
 source >in @ /string tuck \ -- len caddr len
 pocket parse\" nip
 - >in +!
 pocket
;
: S\" \ "string" -- caddr u
\ *G As *\fo{S"}, but translates escaped characters using
\ ** *\fo{parse\"} above.
 readEscaped count state @
 if postpone sliteral then
; IMMEDIATE

Test Cases

HEX
T{ : GC5 S\" \a\b\e\f\l\m\q\r\t\v\x0F0\x1Fa\xaBx\z\"\\" ; -> }T
T{ GC5 SWAP DROP -> 14 }T \ String length
T{ GC5 DROP C@ -> 07 }T \ \a BEL Bell
T{ GC5 DROP 1 CHARS + C@ -> 08 }T \ \b BS Backspace
T{ GC5 DROP 2 CHARS + C@ -> 1B }T \ \e ESC Escape
T{ GC5 DROP 3 CHARS + C@ -> 0C }T \ \f FF Form feed
T{ GC5 DROP 4 CHARS + C@ -> 0A }T \ \l LF Line feed
T{ GC5 DROP 5 CHARS + C@ -> 0D }T \ \m CR of CR/LF pair
T{ GC5 DROP 6 CHARS + C@ -> 0A }T \ LF of CR/LF pair
T{ GC5 DROP 7 CHARS + C@ -> 22 }T \ \q " Double Quote
T{ GC5 DROP 8 CHARS + C@ -> 0D }T \ \r CR Carriage Return
T{ GC5 DROP 9 CHARS + C@ -> 09 }T \ \t TAB Horizontal Tab
T{ GC5 DROP A CHARS + C@ -> 0B }T \ \v VT Vertical Tab
T{ GC5 DROP B CHARS + C@ -> 0F }T \ \x0F Given Char
T{ GC5 DROP C CHARS + C@ -> 30 }T \ 0 0 Digit follow on
T{ GC5 DROP D CHARS + C@ -> 1F }T \ \x1F Given Char
T{ GC5 DROP E CHARS + C@ -> 61 }T \ a a Hex follow on
T{ GC5 DROP F CHARS + C@ -> AB }T \ \xaB Insensitive Given Char
T{ GC5 DROP 10 CHARS + C@ -> 78 }T \ x x Non hex follow on
T{ GC5 DROP 11 CHARS + C@ -> 00 }T \ \z NUL No Character
T{ GC5 DROP 12 CHARS + C@ -> 22 }T \ \" " Double Quote
T{ GC5 DROP 13 CHARS + C@ -> 5C }T \ \\ \ Back Slash

Note this does not test \n as this is a system dependent value.

Change History

2010-03-27 6.3 Revised Reference Implementation, removing endif and the assumption about the length of a counted string.
2009-03-31 6.2 Revised Reference Implementation, taking into account the fact that no standard word may use the PAD.
2008-11-23 6.1 Replaced description of \" (now the same as for \q).
Replaced the test cases with tests that do not assume the word can be used in interpretation mode, in keeping with the definition.
2007-10-30 6 Clarification of case sensitivity: Escape character is case sensitive, Hex digits are not.
2007-09-13 5 Added clarifications
2007-07-19 4 Modified ambiguous condition.
Added ambiguous conditions to definition of S\".
Added test cases.
Corrected Reference Implementation.
2007-07-12 3 Redrafted non-normative portions
2006-08-22 2 Updated solution section
2006-08-21 1 First draft

Credits
Stephen Pelc, <stephen@mpeforth.com>
MicroProcessor Engineering Ltd - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)23 8063 1441,
fax: +44 (0)23 8033 9691
web: www.mpeforth.com - free VFX Forth downloads
Peter Knaggs <pjk@bcs.org.uk>
Engineering, Mathematics and Physical Sciences,
University of Exeter, Exeter, Devon EX4 7QF, England
web: www.rigwit.co.uk

Voting Instructions
Fill out the appropriate ballot(s) below and mail it/them to <vote@forth200x.org>. Your vote will be published (including your name (without email address) and/or the name of your system) here. You can vote (or change your vote) at any time, and the results will be published here.
Note that you can be both a system implementor and a programmer, so you can submit both kinds of ballots.

Ballot for systems
If you maintain several systems, please mention the systems separately in the ballot. Insert the system name or version between the brackets. Multiple hits for the same system are possible (if they do not conflict).

[ ] conforms to ANS Forth.

[ ] already implements the proposal in full since release [ ].

[ ] implements the proposal in full in a development version.

[ ] will implement the proposal in full in release [ ].

[ ] will implement the proposal in full in some future release.

[ ] There are no plans to implement the proposal in full in [ ].

[ ] will never implement the proposal in full.

If you want to provide information on partial implementation, please do so informally, and I will aggregate this information in some way.

Ballot for programmers
Just mark the statements that are correct for you (e.g., by putting an "x" between the brackets). If some statements are true for some of your programs, but not others, please mark the statements for the dominating class of programs you write.

[ ] I have used (parts of) this proposal in my programs.

[ ] I would use (parts of) this proposal in my programs if the systems I am interested in implemented it.

[ ] I would use (parts of) this proposal in my programs if this proposal was in the Forth standard.

[ ] I would not use (parts of) this proposal in my programs.

If you feel that there is closely related functionality missing from the proposal (especially if you have used that in your programs), make an informal comment, and I will collect these, too. Note that the best time to voice such issues is the RfD stage.