Re: Reading large files
- Subject: Re: Reading large files
- From: Philippe Lhoste <PhiLho@...>
- Date: 24 August 2005 23:56:34 +0200
Wim Couwenberg wrote:
Anyway, here's a simplistic
script to test binary-ness. Adjust the pattern in "find" to something
more sensible, if you like. Usage:
lua isbin.lua <file-name>
---------------
file isbin.lua:
---------------
local now = os.clock()
local input, err = io.open(arg[1], "rb")
assert(input, err)
local isbin = false
local chunk_size = 2^12
local find = string.find
local read = input.read
repeat
  local chunk = read(input, chunk_size)
  if not chunk then break end
  if find(chunk, "[^\f\n\r\t\032-\128]") then
    isbin = true
    break
  end
until false
input:close()
now = os.clock() - now
if isbin then
  print "this file is binary..."
else
  print "this is a text file..."
end
print(string.format("this took %.3f seconds", now))
-----------
end of file
-----------
Woah, so non-English text files are binary? ;-)
(Perhaps by old FTP and Mail standards...)
Also, I believe \127 is seen as binary (the DEL code), and \128 is already
in the high-ASCII area...
So perhaps I would rewrite your pattern as: [^\f\n\r\t\032-\126\192-\255]
Note I excluded the \128-\191 area, seen in ISO (8859-1 for example) as
control characters, but if you consider the quite common Windows ANSI
encoding (CP1252), that range contains many valid characters, including
the euro symbol, (c), (R), etc.
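To make the difference between the two character classes concrete, here is a small sketch that runs both against a few sample strings. The helper name looks_binary and the sample strings are made up for illustration; the upper bound is written as \255 because that is the largest decimal escape Lua accepts.

```lua
-- Character classes discussed in the thread, written with decimal escapes.
local ascii_only = "[^\f\n\r\t\032-\128]"            -- pattern from the quoted script
local latin1     = "[^\f\n\r\t\032-\126\192-\255]"   -- revised pattern (upper bound \255)

-- Hypothetical helper: true if `pattern` finds a byte it treats as binary in `s`.
local function looks_binary(s, pattern)
  return string.find(s, pattern) ~= nil
end

-- Illustrative samples (assumptions, not taken from the thread).
local samples = {
  plain    = "hello, world\n",
  french   = "caf\233\n",     -- "cafe" with e-acute as byte 233 (ISO 8859-1)
  with_del = "abc\127def",    -- contains the DEL code
  with_nul = "abc\0def",      -- contains a NUL byte
}

for name, s in pairs(samples) do
  print(name, looks_binary(s, ascii_only), looks_binary(s, latin1))
end
```

With the original class the accented sample is flagged as binary and DEL slips through; with the revised class it is the other way around.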
And, of course, your test doesn't work for UTF-8 and most other Unicode
encodings. But that's another can of worms...
Additional note: many implementations of this kind of test agree that
testing the first bytes (256, 512...) of a file is enough to see if it
is binary or not. Perhaps it is too simplistic for some file formats,
but it can work most of the time.
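A minimal sketch of that "first bytes only" approach, assuming a 512-byte prefix and the revised character class with \255 as its upper bound (the largest byte value). The function name is_binary_prefix and the limit parameter are hypothetical, chosen here for illustration.

```lua
-- Sketch: classify a file by inspecting only its first `limit` bytes.
-- Returns true/false, or nil plus an error message if the file cannot be opened.
local function is_binary_prefix(filename, limit)
  limit = limit or 512                     -- assumed default prefix size
  local input, err = io.open(filename, "rb")
  if not input then return nil, err end
  local head = input:read(limit)           -- nil when the file is empty
  input:close()
  if not head then return false end        -- treat an empty file as text
  return string.find(head, "[^\f\n\r\t\032-\126\192-\255]") ~= nil
end

-- Usage when run as a script: lua isbin_prefix.lua <file-name>
if arg and arg[1] then
  print(is_binary_prefix(arg[1]))
end
```

The trade-off is the one already mentioned: a file whose first suspicious byte lies beyond the prefix will be reported as text.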
And I doubt there are many text files over 1 GB... except perhaps
some exceptional log files or XML data files.
--
Philippe Lhoste
-- (near) Paris -- France
-- http://Phi.Lho.free.fr
-- -- -- -- -- -- -- -- -- -- -- -- -- --