EDIT: A comment from Warren Young made me realize that I was not clear on one quite relevant point. My search string is already in UTF-16LE byte order (not in Unicode codepoint order, which corresponds to UTF-16BE), so perhaps the Unicode issue is somewhat moot.
Perhaps my issue is really a question of how to grep for bytes (not characters) in groups of two, i.e. so that the UTF-16LE pair \x09\x0A is not treated as a TAB and a newline, but just as 2 bytes which happen to be UTF-16LE ऊ? ... Note: I do not need to be concerned about UTF-16 surrogate pairs, so 2-byte blocks are fine.
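(For what it's worth, forcing the C locale is one way to make grep match raw bytes rather than characters; a minimal sketch, though it cannot help with \x0A itself, since grep still splits its input on newline bytes:)
$ LC_ALL=C grep -aP '\x09\x2A' myfile.txt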
Here is a sample pattern for this 3-character string ऊपर:
\x09\x0A\x09\x2A\x09\x30
but it returns nothing, though the string is in the file.
(here is the original post)
When searching a UTF-16LE file with a pattern in \x00\x01\x... etc. format, I have encountered problems for some values. I've been using sed (and experimented with grep), but being in a UTF-8 locale, they recognize some UTF-16LE values as ASCII characters. I'm locked in to using UTF-16, so recoding to UTF-8 is not an option.
e.g. in this text, ऊ (Unicode 090A), though it is a single character, is perceived as the two ASCII chars \x09 and \x0A.
grep has a -P (perl) option which can search for \x00\x... patterns, but I'm getting the same ASCII interpretation.
Is there some way to use grep -P to search in a UTF-16 mode, or perhaps better, how can this be done in perl or some other script? grep seems the most appealing because of its compactness, but whatever gets the job done will override that preference.
PS: My ऊ example uses a literal string, but my actual usage needs a regex-style search, so this perl example is not quite what I'm after, though it does process the file as UTF-16... I'd prefer not to have to open and close the file explicitly... I think perl has more compact ways for basic things like a regex search. I'm after something with that type of compact syntax.
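For example, a one-liner along these lines would have the kind of compact syntax I mean (just a sketch, assuming UTF-16LE on stdin and stdout and a single-codepoint pattern):
$ perl -ne 'BEGIN { binmode STDIN, ":encoding(UTF-16LE)"; binmode STDOUT, ":encoding(UTF-16LE)" } print if /\x{090A}/' < myfile.txt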
4 Answers
My answer is essentially the same as in your other question on this topic:
$ iconv -f UTF-16LE -t UTF-8 myfile.txt | grep pattern
As in the other question, you might need line ending conversion as well, but the point is that you should convert the file to the local encoding so you can use native tools directly.
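If the file also has CRLF line endings (common for UTF-16 text produced on Windows), you can strip them in the same pipeline and, if needed, re-encode the matches on the way out; a sketch:
$ iconv -f UTF-16LE -t UTF-8 myfile.txt | tr -d '\r' | grep pattern | iconv -f UTF-8 -t UTF-16LE > matches.txt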
-
Thanks Warren, but as I mentioned in the question: "I'm locked-in to using UTF-16, so recoding to UTF-8 is not an option." ... I could, if all else fails, do something like what you suggest, but I'm certainly trying to avoid it, because all the search criteria are already in \xXX UTF-16 format, and that would mean converting them also, plus I'd need to re-convert the result back into UTF-16. So a more direct (probably/possibly perl) way is preferred... and also, I'd just like to learn how to do it without re-encoding... – Peter.O, Jun 9, 2012 at 13:34
-
I think you may be borrowing trouble. If you provide grep a Unicode code point, it should find it, if the input is in its native Unicode encoding. The only way I see it not working is if you are searching for hex byte pairs instead, and they are byte-swapped compared to how grep sees the data. Keep in mind that internally, grep is processing the input as 32-bit Unicode characters, not as a raw byte stream. Anyway, try it before you reject the answer. You might be surprised and find that it works. – Warren Young, Jun 9, 2012 at 14:50
-
As the codepoint for @ is 0x0040, the codepoint for ऊ is 0x090A (U+090A). My patterns are flipped into little-endian order, \x0A\x09, which is how they are stored. This basically works fine for most patterns, but produces unexpected results when the UTF-16 representation of the codepoint(s) clashes with grep's UTF-8 interpretation of the pattern and data; especially with the \x0A\x09 combination, which I do encounter. – Peter.O, Jun 9, 2012 at 16:38
-
Your method will certainly work, and I'll mark it up once the dust has settled. At the moment, I'm just hanging out for a method which doesn't need to re-encode the data. (I'm currently diving into perl. The last time I did that, I think I drowned :) ... Perhaps what I am looking for is to grep raw byte data, but I'm not sure yet. – Peter.O, Jun 9, 2012 at 16:39
-
I don't see any virtue in not re-coding the data. The Perl answer to your other question also re-coded it on the fly. It's not like I'm asking you to change your files on disk; we're just performing a bit of a transform on the data in order to get it into the form we need to process it. This is what computers are best at: input-process-output. – Warren Young, Jun 9, 2012 at 20:56
Install the ripgrep utility, which supports UTF-16.
For example:
rg pattern filename
ripgrep supports searching files in text encodings other than UTF-8, such as UTF-16, latin-1, GBK, EUC-JP, Shift_JIS and more. (Some support for automatically detecting UTF-16 is provided. Other text encodings must be specified explicitly with the -E/--encoding flag.)
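For example, to specify the encoding explicitly rather than rely on autodetection (a sketch; useful when the file carries no BOM):
rg -E utf-16le pattern filename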
To print all lines, run: rg -N . filename.
I believe that Warren's answer is a better general *nix solution, but this perl script works exactly as I wanted (for my somewhat non-standard situation). It does require that I change the search pattern's current format slightly:
from \x09\x0A\x09\x2A\x09\x30\x00\x09
to \x{090A}\x{092A}\x{0930}\x{0009}
It does everything in one process, which is particularly what I was after.
#! /usr/bin/env perl
use strict;
use warnings;

die "3 args are required\n" if scalar @ARGV != 3;

my $if  = $ARGV[0];   # input file, UTF-16LE encoded
my $of  = $ARGV[1];   # output file, UTF-16LE encoded
my $pat = $ARGV[2];   # pattern in \x{....} codepoint form

# The :encoding(UTF-16LE) layer decodes and encodes on the fly, so the
# regex engine sees whole characters rather than byte pairs.
open(my $ifh, '<:encoding(UTF-16LE)', $if) or die "Can't open $if: $!";
open(my $ofh, '>:encoding(UTF-16LE)', $of) or die "Can't open $of: $!";

while (<$ifh>) { print $ofh $_ if /^$pat/; }
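A hypothetical invocation (the script name and file names are placeholders):
$ ./grep16.pl input.txt matches.txt '\x{090A}\x{092A}\x{0930}'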
-
Your main loop can be rewritten as while (<$ifh>) { print $ofh $_ if /^$pat/; }. You won't get the diagnostic on a bad readline, but that's not going to happen on a modern OS unless the hardware is failing while you read the file. – Warren Young, Jun 9, 2012 at 23:52
-
@Warren, thanks for the help. I've changed the script to the simpler loop. – Peter.O, Jun 10, 2012 at 0:28
ugrep (Universal grep) supports Unicode, UTF-8/16/32 files, detects invalid Unicode to ensure proper results, displays text and binary files, and is fast and free:
ugrep searches UTF-8/16/32 input and other formats. Option -Q permits many other file formats to be searched, such as ISO-8859-1 to 16, EBCDIC, code pages 437, 850, 858, 1250 to 1258, MacRoman, and KOI8.
Simply give it a pattern of Unicode characters to match:
ugrep -QUTF-16LE "ऊपर" filename
or with the code points in hex:
ugrep -QUTF-16LE "\x{090A}\x{092A}\x{0930}" filename
See ugrep on GitHub for details.