MSORT
Msort's graphical user interface
- Description
- Comparison with GNU Sort and BSD Sort
- Details
- Documentation
- Downloads
- Environment
- Change Log
- Bugs
- Roadmap
msort is a program for sorting files in sophisticated ways.
It was originally developed for alphabetizing dictionaries of "exotic" languages
in formats like those used by Shoebox and Toolbox,
for which it has been extensively used, but is useful for many other purposes.
msort differs from typical sort utilities in providing greater flexibility
in parsing the input into records and identifying key fields
and greater control over the sort order. Its main distinctive features are:
- Msort can be used as a command-line program or via a graphical user
interface that is helpful not only to those who find a complicated
command line difficult to deal with but also to those unfamiliar with
the finer points of sorting.
- Records need not be single lines of text but may be delimited in a number of ways. Fixed length records are also supported.
- Key fields may be selected by position in the record (counting from the beginning or the end),
by character ranges (e.g. the key consists of the fourth through eighth characters),
or by matching a regular expression to a tag.
- For each key an arbitrary sort order may be specified. Msort also understands locales.
- For each key an effectively unlimited number of multigraphs (sequences of characters to be
treated as a single unit for purposes of sorting, "collating elements" in Unicode parlance)
of effectively unlimited length may be defined.
- In addition to the usual lexicographic and numerical comparisons, msort supports
hybrid lexicographic-numeric comparison (for things like filenames and section headings,
so that, e.g., 2a will precede 10b),
random comparison, and ordering by
angle,
date,
time,
month name,
domain name/email address,
ISO8601 date-time,
and string length.
- Numbers may be in just about any known number system, e.g. Chinese
五十七 or Devanagari ३८२४९.
- For each key a distinct set of characters may be excluded from
consideration when sorting in any combination of initial, final, and medial
position in the key field.
- For each key a distinct set of regular expression substitutions may be defined.
These provide the means to make names like McCarthy sort before MacCawley,
as if McCarthy were spelled MacCarthy, as well as to handle the rare cases
in which a single character is treated for purposes of sorting as a sequence, such
as German ß "Eszett", which is traditionally sorted as if it were ss.
- Lexicographic keys may be reversed, allowing the construction of reverse dictionaries.
- Any or all keys may be optional. For optional keys, the user may specify how records
missing the key field should compare to records in which the key field is present.
- A choice of sorting algorithms with different properties is provided.
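Two of the features listed above, hybrid lexicographic-numeric comparison and per-key substitutions, can be illustrated with a short Python sketch. This is not msort's implementation or its option syntax. The names hybrid_key, substituted_key, and SUBSTITUTIONS are invented for the illustration, and the sketch assumes that only ASCII digits form the numeric parts of a hybrid key.

```python
import re

def hybrid_key(text):
    """Build a sort key in which runs of ASCII digits compare numerically
    and everything else compares as text."""
    key = []
    for part in re.findall(r"[0-9]+|[^0-9]+", text):
        if part[0] in "0123456789":
            # Numeric run: tag 0 so that numbers sort before text runs.
            key.append((0, int(part), ""))
        else:
            # Text run: tag 1, compared as an ordinary string.
            key.append((1, 0, part))
    return key

# Hypothetical substitution table: rewrite the key, not the record itself.
SUBSTITUTIONS = [
    (re.compile(r"\bMc"), "Mac"),  # treat McCarthy as if spelled MacCarthy
    (re.compile("ß"), "ss"),       # treat ß as the sequence ss
]

def substituted_key(text):
    """Apply each substitution in order and return the rewritten key."""
    for pattern, replacement in SUBSTITUTIONS:
        text = pattern.sub(replacement, text)
    return text

headings = ["10b", "2a", "1c"]
names = ["MacCawley", "McCarthy"]
print(sorted(headings, key=hybrid_key))
print(sorted(names, key=substituted_key))
```

In both cases the record is left untouched and only the key used for comparison is transformed, which is the idea the feature descriptions above rely on.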
msort understands UTF-8 Unicode. Unicode may be used anywhere that text is entered:
in the text to be sorted, in sort order and exclusion definitions, as a field or record
separator, or as a field tag. Full Unicode case-folding is available.
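To show what full case-folding means in practice, here is a small Python sketch. It uses Python's built-in str.casefold, which implements full Unicode case folding. It says nothing about how msort itself performs folding, and the word list and the name folded_key are made up for the example.

```python
def folded_key(text):
    """Return the full Unicode case folding of text for use as a sort key."""
    return text.casefold()

words = ["Zebra", "apple", "Straße", "STRASSE", "banana"]

# Plain comparison is by code point, so capitals sort before lowercase letters.
print(sorted(words))

# With full case folding, "Straße" and "STRASSE" receive the same key.
print(sorted(words, key=folded_key))
```

Simple lowercasing is not enough here: lowercasing leaves ß unchanged, whereas full case folding maps it to ss.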
Review by Ben Martin at linux.com
(上の日本語訳)
If you are looking for the specialized Hungarian sort program also called msort,
try here.
Msort's capabilities are very close to a superset of those of
GNU sort and BSD sort.
Msort provides greater flexibility in selecting key fields, more comparison
types, the ability to use collation rules from different locales on different keys,
the ability to handle numbers in non-Western number systems,
and a variety of other options lacking in GNU sort and BSD sort.
Whereas msort understands Unicode, GNU sort and BSD sort do not.
It is a property of the UTF-8 transfer format that a binary sort will sort in Unicode
codepoint order, so for some purposes
GNU sort will behave in an acceptable manner on Unicode input. However, operations
requiring an understanding of the encoding of the input do not work properly in GNU
sort and BSD sort with Unicode input. Capabilities of
GNU sort and BSD sort lacking in msort are the
ability to merge files without sorting them (the --merge option) and
the ability to emit only the first of an equal run (the --unique option).
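The UTF-8 property mentioned above (a binary sort yields Unicode codepoint order) can be checked with a few lines of Python. This is only a demonstration of the encoding property on an arbitrary sample list. It does not involve msort, GNU sort, or BSD sort.

```python
# Arbitrary sample covering one-, two-, three-, and four-byte UTF-8 sequences.
words = ["zebra", "éclair", "中文", "apple", "😀", "ñandú", "Ωmega"]

# Python compares str values code point by code point.
by_codepoint = sorted(words)

# Comparing the UTF-8 encodings compares raw bytes, as a binary sort would.
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))

print(by_codepoint == by_utf8_bytes)
```

Codepoint order is of course not the same thing as a linguistically sensible order, which is why the operations that need real knowledge of the encoding and of collation still differ between the programs.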
Generally speaking, msort is the more powerful program: it is either the only choice
or the more convenient choice whenever something other than a standard sort
of positionally selected fields is required.
On the other hand, if GNU sort or BSD sort
is capable of doing what you want, it will generally
be faster. The exact ratio varies with the details of the sort and the nature of the
input, but in my tests, where msort and GNU sort are capable of performing
the same sort, GNU sort is typically several times faster than msort.
BSD sort seems to be slightly faster than GNU sort.
Language: C (main program)
Current version: 8.53
Last modified: 2010-01-10
A standard Unix manual page is included in the package, or you can read it
here. The full
documentation is the
reference manual (PDF), a copy of which is included in the package.
The manual contains a number of examples, including how to use msort to
sort SIL Standard Dictionary Format files as used by Shoebox and Toolbox.
If you would like to be notified of new releases, subscribe to msort at Freshmeat.
Packages
- Debian
- Debian package (testing)
- Debian package (unstable)
- FreeBSD
- FreeBSD Freshport
- Mac OS X
- Macport
- Mac OS X binaries
- Softpedia (PPC and Intel)
- Darwinports
- Nexenta/GNU Solaris
- Nexenta packages
- OpenPKG
- OpenPKG package
- Redhat Linux
- Redhat RPMs
- SUSE Linux
- Source and i686 executable RPMs courtesy of Pascal Bleser: SUSE RPMs.
- Solaris (SPARC and Intel)
- Solaris Package Index
- T2
- T2
- Ubuntu
- Ubuntu packages
The underlying command-line program msort should compile and run without difficulty
on any POSIX-conformant system on which the requisite libraries are available.
In practice, this should mean just about anywhere.
It is known to compile and run without modification under
GNU/Linux, FreeBSD, Mac OS X, and SunOS. I am not sure whether the current version
will compile and run properly under MS Windows, even under Cygwin, due to the fact
that MS Windows uses UTF-16 Unicode internally while msort expects UTF-32.
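The difference between the two encodings can be made concrete with a short Python sketch. It only illustrates the general UTF-16 versus UTF-32 distinction and makes no claim about msort's internals. The helper name code_units is invented for the example.

```python
def code_units(text, encoding):
    """Count the fixed-size code units needed to encode text.
    The -le variants are used so that no byte order mark is added."""
    unit_size = {"utf-16-le": 2, "utf-32-le": 4}[encoding]
    return len(text.encode(encoding)) // unit_size

# One character inside the Basic Multilingual Plane, one outside it.
for ch in ["a", "😀"]:
    print(ch, code_units(ch, "utf-16-le"), code_units(ch, "utf-32-le"))
```

In UTF-32 every character occupies exactly one unit, whereas in UTF-16 characters outside the Basic Multilingual Plane occupy two (a surrogate pair), so code that assumes one unit per character does not carry over unchanged.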
Note also that msort may be configured
to compile without the GMP and Uninum libraries, at the cost of
forgoing the ability to handle numbers in non-Western number systems. If you cannot
or do not want to install these libraries, run configure with the option
--disable-uninum. This will also disable linkage with libgmp.
The graphical user interface should run anywhere
that Tcl/Tk is available, but a few features may not work on non-Unix systems. In particular,
the Abort Sort command depends on the existence of a Unix-style kill
program that can be used to send a signal to another process.
It is known to run under GNU/Linux, FreeBSD, and SunOS.
msg will run properly under Mac OS X if you have installed X11 and use Tk-X11.
msg now adapts itself to Tk-Aqua well enough to be usable, but some
details remain to be dealt with.
Note: obtaining the necessary Tcl/Tk environment.
The GUI requires both the basic Tcl/Tk distribution and the iwidgets library.
If you already have Tcl/Tk and just need to add iwidgets, you can obtain the
package from the Sourceforge project
site.
On the download page you will find source and binary packages
for both [incr Tcl/Tk], which is the basic part of this package, and [incr widgets],
which is the part that contains the widgets. You will need to install both.
(iwidgets is an alternative name for [incr widgets].)
The easiest way to obtain the Tcl/Tk environment you need is to install the
ActiveTcl
distribution from ActiveState.
This distribution provides the Tcl language, the Tk graphics library, and a bunch
of extensions, including [incr tcl] and [incr widgets].
Don't be concerned by the fact that ActiveState is a commercial outfit.
The Tcl/Tk distribution that they provide is free as in both beer and speech.
They make their money selling services and programming tools. The ActiveTcl
distribution is currently available for: GNU/Linux, HP-UX, AIX, Solaris, Mac OS X,
and MS Windows.
For FreeBSD, Tcl and Tk are available at:
8.53 - 2010-01-10
- Adapted to be compatible with libtre 0.8
- Removed unnecessary conditioning of Hybrid mapping code on availability of locale support.
- Added -Z option for copying the first record to the output without sorting it. This is useful for sorting files with a header.
- Considerably reduced the memory used for exclusions
- Fixed a bug in the reporting of exclusions
8.52 - 2008-12-06
- ISO8601 keys may now have an optional leading sign.
- If a key has comparison type "random", it is no longer stored
since it won't be used. This saves a little time and possibly a good bit of
storage.
- If one or more records have been discarded due to problems in key
extraction but the run is otherwise successful, the exit code is
now RECORDEXCLUDED (13) rather than BADRECORD (8).
- Cleaned up and improved the log.
- Made error-checking and reporting finer-grained in GetMonthNames.
- A few of the regression tests depend on the locale system, which may fail
for reasons independent of msort. These tests have now been separated
so that their failure will not suggest that msort itself is not working.
Typing "make test" runs the main set of tests. Typing "make localetest"
runs the locale-dependent tests, the results of which are written to
LocaleTestResults.
- Split time and iso8601 date/time regression tests so as not to
mix data with and without time zone offsets since mixing them causes
tests to fail if executed in some time zones.
- Added regression test for more complex substitution.
- Added information to the manual section on random comparison.
If you don't know how random comparison can be useful other than
for unsorting, you might want to check this out.
8.51 - 2008-10-14
- It is now possible to set the random number generator seed from the command-line,
allowing replication of random sorts. Whatever its origin, the seed used is now
reported in the log.
- Added regression tests for angles and collating sequences.
- Rearranged the start and completion time stamps in the log so that
the former immediately precedes the latter, facilitating comparison.
Under obscure conditions date sorts may produce a segmentation fault or
valid date fields may be rejected as invalid. I have been unable to
reproduce this bug on my own system. It may or may not be significant that
the machine on which this bug has been reported is a 64-bit machine.
Known bugs in the GUI are:
- Under Mac OS X invocation of a browser from within msg's help system does not work.
- Under Mac OS X using Tk-Aqua in some cases Unicode characters produce the wrong glyph.
For example, codepoints in the Armenian range are displayed as Chinese characters.
This may be a matter of font selection.
- Handle the disparities among wchar_ts, UTF32s, and UChars
in a more portable way.
- It might be useful to extend the ability to use a variety of number systems
to hybrid keys. This looks very complex since in principle each numerical section
of a hybrid key might use a different writing system and since in some cases distinguishing
between the numerical portion and the adjacent textual portion could be difficult.
- It may also be useful to provide a higher-level means of specifying exclusions.
Currently, exclusions must just be listed. A higher-level approach might be
to allow Boolean combinations of predicates based on the Unicode ranges and
general character properties. Then one could, for example, exclude all punctuation,
or anything other than letters and digits, or anything not in the ASCII or Tamil
ranges. Implementing this would add significantly to the footprint since it would
probably take several hundred thousand bytes to store all the Unicode property
information, unless we could get it from a shared library.
- Another open question is whether to allow for more of the things that are
currently specified in files to be specified directly on the command line.
Currently, you can specify exclusions either in a file or on the command line
since it is not uncommon to want to exclude just a few characters. On the other
hand, if you want an unusual sort order, that is probably not something you'll do
on the fly anyhow, and it takes a fair amount of space to do it, so it seems
appropriate to use a file for sort orders and not to provide for defining
sort orders directly from the command line. But perhaps some people would
find this useful. I can see that it might be useful to be able to define
substitutions on the command line.
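The higher-level exclusion idea in the list above (Boolean combinations of predicates over Unicode ranges and general character properties) might look something like the following Python sketch. It is purely illustrative and is not a proposed msort syntax. All function names are invented, and the property data comes from Python's standard unicodedata module rather than anything msort ships.

```python
import unicodedata

def is_punctuation(ch):
    """True for characters whose general category is P* (punctuation)."""
    return unicodedata.category(ch).startswith("P")

def is_letter_or_digit(ch):
    """True for letters (L*) and decimal digits (Nd)."""
    category = unicodedata.category(ch)
    return category.startswith("L") or category == "Nd"

def in_range(low, high):
    """Return a predicate that is true for code points in [low, high]."""
    return lambda ch: low <= ord(ch) <= high

is_ascii = in_range(0x0000, 0x007F)
is_tamil = in_range(0x0B80, 0x0BFF)  # the Tamil block

def strip_excluded(text, excluded):
    """Drop every character for which the predicate excluded is true."""
    return "".join(ch for ch in text if not excluded(ch))

# Exclude all punctuation.
print(strip_excluded("don't, stop!", is_punctuation))
# Exclude anything other than letters and digits.
print(strip_excluded("a-b_c 1", lambda ch: not is_letter_or_digit(ch)))
# Exclude anything not in the ASCII or Tamil ranges.
print(strip_excluded("தமிழ் abc ñ", lambda ch: not (is_ascii(ch) or is_tamil(ch))))
```

Here the property tables come from a library that is already present, which is the kind of sharing the footprint concern above is about.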
If you care about any of these, please feel free to drop me a line.