Learn Something Old Every Day, Part XIV: read() Return Value May Surprise
Last week I amused myself by porting some source code from Watcom C to Microsoft C. In general that is not difficult, because Watcom C was intended to achieve a high degree of compatibility with Microsoft’s C dialect.
Yet one small-ish program kept crashing when built with Microsoft C. It didn’t seem to be doing anything suspicious and didn’t produce any noteworthy warnings when built with either compiler.
After some head scratching and debugging, I traced the difference to a piece of code like this:
if( read( hdl, buf, BUF_SIZE ) != BUF_SIZE ) {
    /* Last file block read, deal with EOF */
} else {
    /* Not near end of file */
}
To my surprise, the return value from read() is rather different between the two compilers' run-time libraries when the file is opened with the O_TEXT flag (and is therefore meant to translate line endings from CR/LF to LF when reading).
It’s hard to call it a bug because both run-time libraries behave as documented. Here’s what the Watcom documentation says:
The read function returns the number of bytes of data transmitted from the file to the buffer (this does not include any carriage-return characters that were removed during the transmission). Normally, this is the number given by the len argument. When the end of the file is encountered before the read completes, the return value will be less than the number of bytes requested.
While this is perhaps not as clear as it might be, read() will attempt to fill the entire buffer, regardless of how many CR characters it might need to delete.
Microsoft’s documentation says something different (quoted from Microsoft C 5.0, but it has not substantially changed):
The read function returns the number of bytes actually read, which may be less than count if there are fewer than count bytes left in the file or if the file was opened in text mode (see below).
…
If the file was opened in text mode, the return value may not correspond to the number of bytes actually read. When text mode is in effect, each carriage-return-line-feed pair (CR-LF) is replaced with a single line-feed character (LF). Only the single line-feed character is counted in the return value. The replacement does not affect the file pointer.
The discrepancy stems from the fact that for files opened in text mode, the total number of bytes stored in the file on disk will likely be higher than the total number of bytes read into application buffers, because any CR/LF sequence shrinks to just LF.
While Microsoft chose the length argument to read() to mean the number of bytes read from disk, Watcom instead interprets it as the number of bytes written to the application's buffer.
Although both approaches make some logical sense, the approach chosen by Microsoft and other library writers (including at least Borland and IBM) has two drawbacks:
- It makes it impossible to test for end-of-file and error conditions by simply checking whether the number of bytes read equals the number of bytes requested.
- It is inconsistent with the behavior of fread() for text files.
In all the run-time libraries tested, fread() on text files behaves like Watcom's read(). That is, fread() attempts to fill the destination buffer with the number of bytes specified by the caller, regardless of how many bytes actually need to be read from disk. This is perhaps understandable given the specification of fread(), which does not take a single size argument but rather uses the product of "number of items" and "size of item" to determine the number of bytes to read.
The library writers quite possibly felt that while a call such as
n = fread( buf, 1, BUF_SIZE, f );
could conceivably return less than BUF_SIZE for text files, analogous to what Microsoft's read() does, a call like
n = fread( buf, BUF_SIZE, 1, f );
would of necessity have to return zero in such a case… and that would be pretty useless. It is much better to try to fill the specified buffer.
Needless to say, the behavior of read() for text files is not specified by any standard, because the one standard which does specify read() behavior (that is, POSIX and its successors) knows nothing of text files.
The behavior of the Watcom runtime in this regard has not changed since at least version 8.5 (1991); likewise the Microsoft runtime hasn’t changed at least since Microsoft C 5.0 (1987). Most likely it never changed. The fact that the discrepancy exists probably indicates that very few programmers use the POSIX file I/O calls with text files at all.
There are also differences in exactly how different run-time libraries deal with CR characters in text files (Microsoft, Watcom, and Borland run-times all behave differently), but that’s perhaps something for a different blog post.
11 Responses to Learn Something Old Every Day, Part XIV: read() Return Value May Surprise
Hello Michal,
Regarding read ()’s behaviour, I think POSIX.1-2024 does have a bit to say — about short read counts:
“The value returned [by read ()] may be less than nbyte […] if the read() request was interrupted by a signal [etc. …]
“If a read() is interrupted by a signal before it reads any data, it shall return -1 with errno set to [EINTR].” (https://pubs.opengroup.org/onlinepubs/9799919799/functions/read.html)
So even on POSIX systems where programs do not need to do newline conversion, a short read count does not necessarily mean that an end-of-file is reached. I normally test for end-of-file by checking if read () returns exactly 0 (with no error).
Thank you!
I do recall using read() under DOS years ago, generally in binary mode, but occasionally in text mode.
That would have been with the Borland C compilers.
However the method used to detect EOF was not return length less than specified read length, but rather read() eventually returning 0.
Simply because that is what one has to do under UNIX in order to detect EOF irrespective of what the fd is connected to.
i.e. short reads are possible even before EOF for “slow” devices, e.g. tty’s, sockets, etc.
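The EOF-detection idiom the comments above describe, treating only a return of 0 as end-of-file and retrying on EINTR, might look like this minimal POSIX sketch (read_all is a made-up helper name, not a standard function):

```c
#include <errno.h>
#include <unistd.h>

/* Read up to 'len' bytes, retrying on EINTR and tolerating short
 * reads.  Returns the number of bytes read (less than 'len' only at
 * EOF), or -1 on error with errno set. */
ssize_t read_all(int fd, void *buf, size_t len)
{
    size_t total = 0;
    while (total < len) {
        ssize_t n = read(fd, (char *)buf + total, len - total);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any data: retry */
            return -1;          /* real error */
        }
        if (n == 0)
            break;              /* EOF: 0 is the only reliable signal */
        total += n;
    }
    return (ssize_t)total;
}
```

A loop like this works the same for disk files, pipes, ttys, and sockets, which is exactly why it became the portable habit, even where a single short read from a disk file would have been a safe EOF indicator.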
True. The code in question was meant to work with disk files, and I’m fairly certain any read shorter than the required size really meant either EOF or an error.
True. A good reminder that the C stdio interface is much easier to use, in addition to being more portable.
@Necasek: Sorry, but the superficial convenience of stdio(3) fails to
charm me. There are too many edge cases, too many obfuscations of
what’s really going on. On modern systems, the buffering largely just
gets in the way, and then there are “physically impossible” calls like
ungetc(3), implemented through hackery. Ugh.
With read(2), the return value (combined with errno if it’s -1) gives
all the required information needed to proceed. No feof(3), no
ferror(3) — just the return value (and errno).
Me understands that a “text mode” kludge, however unfortunate, is
needed here (bad design decisions…), but is the ensuing confusion
about ‘len’ (which meknows as ‘nbytes’) really the fault of the read(2)
interface? Medoesn’t think so.
Which was the bad design decision? Using LF for line endings? Using CR/LF for line endings? Something else?
That’s actually a rather good question. Me’ll try to provide a
somewhat satisfying answer.
There are several layers of design mistakes involved here. Zeroth,
there’s the fact that ASCII does not have a separate newline character
(though something like RS could be used). This made sense at the time,
it reduced complexity; just like it made sense for some operating
systems to settle on either LF xor CR for a newline, in lieu of a
dedicated character. But in hindsight, they’re still mistakes. In fact,
it’s an instance of the classic pattern of compounding one design
mistake with another.
Me supposes that, at the time, few realized how important line-based
(as opposed to traditional record-based) processing would become; lines
have virtually replaced old-school records, at least in lower-level
applications. If they had, the issue would’ve probably been approached
with a bit more care at the time.
A *really* painful mistake, though, is to try to compensate for all that
in something low-level like the read(2) routine. Sure, it makes porting
of UNIX programs to mess-dos superficially easier, but as you found out,
it confuses the interface. Death to O_TEXT! 🙂
Whoa, when me posted the above comment, WordPress returned a comment
page where me new comment was #5 — me orig comment and your response
were missing. You might want to have another look at the caching
there… (Refreshing the page, of course, fixed it.)
@zeurkous:
The WordPress caching or whatnot seems to suck most of the time. Pro tip: The RSS feed seems to be updated even when the actual web page has an old cached copy.
In general, re CR/LF and whatnot, I have a strong suspicion that this has crept into what I think were supposed to be binary or at least application-specific file formats.
In particular, you can copy a Firefox profile directory from a Linux computer and use it on a Windows computer. Web sites you visit using that profile even still think you are using Linux, as the profile seems to contain the browser ID and whatnot. But going the other way around, copying a Firefox profile from Windows to Linux, seems to just not work.
This is just strong speculation, but I think that the file I/O for the Firefox profile uses "cooked" files, i.e. uses some part of the compiler (or OS library/API) to use CR/LF on Windows and only LF on Linux, and the code on Windows accepts a lone LF but the code on Linux just gets angry if it sees CR+LF.
In hindsight I think it would have been great if ASCII, the actual committee, had at some point in the late 70's put its foot down and told everyone that except for output to mechanical devices like printers, one of CR or LF would be deprecated, and any system/OS using that code would not be eligible for government contracts in the US. I.e., any computer put on the market starting in 1980 that isn't software-wise clearly based on an earlier product from the same company would have to use whichever of CR or LF the committee decided on. It would have taken some time for this to propagate, but in particular all the 16-bit and 32-bit systems would have had the same standard, i.e. IBM PC, Mac and for that sake Amiga, Atari ST and whatnot. Also all the home computers that weren't already on the market, or based on older systems, would have been compatible. (I.e., Commodore would have gotten away with continuing to use CR on all their 8-bit computers even if LF had been selected, as the original PET used CR and the APIs of all later 8-bit Commodore computers were add-ons to the API of the PET.)
I am sorry, but if some standard is designed to solve a specific problem (such as ANSI and teletypes/terminals), how is solving that problem a “design mistake”? ANSI had nothing to do with files and almost nothing to do with computers.
Me already admitted that caveat: likely, at the time, no-one realized
that–
a) ASCII’s main use would be in computing, not in oldskool
telecomms;
b) line-by-line processing would become more important than
record-by-record processing, at least in lower-level applications;
and, thus:
c) that a separate, dedicated newline character would solve a lot of
future problems.
As MiaM pointed out, though, the issue could’ve been addressed
relatively early on, and it wasn’t. Another way of doing so could’ve
been to issue another revision of the standard. (In the end, we did get
a ‘NEL’ character in C1, but that just opened another can of worms as
implementing C1 turns ASCII from 7-bit into 8-bit.)