RE: [ILUG] [OT] silly shell thing

From: Caolan McNamara (cmc at domain stardivision.de)
Date: Wed 24 May 2000 - 13:28:18 IST


>>>>>>>>>>>>>>>>>> Original Message <<<<<<<<<<<<<<<<<<

On 24.05.00, 12:54:55, "McDaid, Aaron" <Aaron.McDaid at domain compaq.com> wrote
regarding RE: [ILUG] [OT] silly shell thing:

> > I've a file that for some reason a windows editor
> > has put in double newlines everywhere.
> > Anyone got a good way of getting rid of them ?

If they are true LFLF (0A 0A) then
grep -v '^$' filename > filename.new
will do the trick. Note that it drops every empty line, including any
you actually wanted.
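
If grep isn't to hand, the same filter can be sketched in Python. This is
only an illustration: the name drop_empty_lines is made up here, and it
assumes bare LF line endings, just as the grep one-liner does.

```python
def drop_empty_lines(data: bytes) -> bytes:
    """Remove every empty line, roughly like grep -v '^$' on LF-terminated text."""
    # Split on LF only. A line holding a lone CR is NOT empty, so a file
    # with CRLF CRLF pairs passes through unchanged, as it would with grep.
    kept = [line for line in data.split(b"\n") if line != b""]
    # Like grep, always terminate the last line with LF.
    return b"".join(line + b"\n" for line in kept)


if __name__ == "__main__":
    print(drop_empty_lines(b"one\n\ntwo\n\nthree\n\n"))
```

If the doubled newlines turn out to be CRLF pairs, the CRs need stripping
first (or a different pattern), which is why the LFLF caveat matters.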

> This newline silliness (Macs, Windows and *nix
> disagreeing on what a newline is) really bugs me.
> Is there any good news on the horizon?
> ie. Will Unicode stop this?
> If so, is Unicode going to replace ASCII for
> most/all text files

Hmm, well it's going to be tricky. For a start the CR and LF characters
are both Unicode characters (http://charts.unicode.org/Web/U0000.html),
so files that contain them are valid Unicode strings.

On the other hand the Unicode recommendation is that you feck those LF
and CRLF pairs out the window and actually use proper meaningful
formatting characters, namely Paragraph Separator (U+2029) and Line
Separator (U+2028), which do the obvious.
(http://www.unicode.org/unicode/reports/tr13/)

In the mythical perfect world that we strive for, LF and CRLF disappear
from our lives to be replaced with "LS".

Of course in the real world unix will use the utf-8 encoding for Unicode.
Unicode can be encoded in a number of ways: big-endian 16 bit numbers (the
mac, I bet, for instance), little-endian 16 bit numbers (windows), and
utf-8 for unix. utf-8 is cunning in that old ascii characters are encoded
as single bytes of the same value as in ascii. Every other Unicode
character is stored as a lead byte with the highest bit set, which flags
that the following byte or bytes (each also with the highest bit set) are
to be combined with it to create a full Unicode character. That runs to
at most 4 bytes for anything in the Unicode range, though the scheme as
first specified could stretch to 6. While this sounds horrible it means
that we can retain backwards compatibility with the vast majority of
existing unix programs (and we western language speakers will save 50% of
our disk space compared with the 16 bit encodings, but keep quiet on that
one).
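
To make the byte layout concrete, here is a small Python sketch. It is
illustrative only: the helper names utf8_hex and high_bit_set_everywhere
and the sample characters are my own choices, not anything from the
standard.

```python
def utf8_hex(ch: str) -> str:
    """Return the utf-8 bytes of ch as space-separated hex."""
    return " ".join(f"{b:02x}" for b in ch.encode("utf-8"))


def high_bit_set_everywhere(ch: str) -> bool:
    """True if every utf-8 byte of ch has its top bit set."""
    return all(b & 0x80 for b in ch.encode("utf-8"))


# An ascii letter, an accented Latin letter, and the euro sign.
for ch in ("A", "\u00e9", "\u20ac"):
    print(f"U+{ord(ch):04X}  utf-8: {utf8_hex(ch)}")

# Pure ascii text: one byte per character in utf-8, two in 16 bit Unicode.
text = "plain old ascii text"
size_utf8 = len(text.encode("utf-8"))
size_utf16 = len(text.encode("utf-16-be"))
print(size_utf8, size_utf16)
```

The ascii letter comes out as the single byte it always was, while the
other two become multi-byte sequences in which no byte could be mistaken
for ascii. That is what keeps the older byte-oriented tools working.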

The upshot being that we unix users will have LFs lying around our
drives for about the next 20 years, as there's little impetus for us to
swap them for the longer byte sequence for an "LS", which would otherwise
serve the same purpose but somewhat confuse our older utilities, C
libraries etc. And in the windows world there will be a large set of
programs which will convert CRLF into Unicode as the Unicode characters
CR and LF, as it makes no real difference to them.
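
A rough Python sketch of the trade-off, for illustration only (the
variable names and sample text are made up). It compares the byte cost
of LF and "LS" in utf-8, and shows a Unicode-aware splitter coping with
LS where a plain split on the LF byte does not.

```python
LF = "\n"
LS = "\u2028"  # LINE SEPARATOR
PS = "\u2029"  # PARAGRAPH SEPARATOR

# Byte cost of each terminator once encoded as utf-8.
lf_cost = len(LF.encode("utf-8"))
ls_cost = len(LS.encode("utf-8"))
print(lf_cost, ls_cost)

text = "one" + LS + "two" + PS + "three\nfour\r\nfive"

# str.splitlines() treats LS and PS as line boundaries,
# along with LF, CR and CRLF.
aware = text.splitlines()

# An "older utility" view: split the raw bytes on the LF byte only.
naive = text.encode("utf-8").split(b"\n")

print(aware)
print(len(naive))
```

The byte-level split finds only three pieces here, since the LS and PS
sequences are just opaque high-bit bytes to it.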

Nevertheless Unicode does address this problem, but does not firmly stamp
it out. Unicode files should use "LS" to denote that the line has ended
and that a new line is to be begun. Whether the document was created
under Windows, SillyOs or the Hurd shouldn't matter. But we will almost
certainly still see Unicode files with CR LF pairs and single LFs. How we
will recognize a file as being utf-8, LE 16 bit Unicode, BE 16 bit
Unicode, GNU style 32 bit Unicode etc is left as an exercise for the
reader. Metadata of the mimetype style will still be required.
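
One partial answer to the exercise is to sniff for a byte order mark at
the start of the file. The sketch below is a guess-maker, not a
detector: guess_encoding is a hypothetical helper, it only works when
the writer bothered to emit a BOM, and plenty of utf-8 files carry none.

```python
import codecs

# The 32 bit marks must be tried before the 16 bit ones, because the
# little-endian 32 bit BOM (FF FE 00 00) begins with the 16 bit one (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def guess_encoding(data: bytes):
    """Return an encoding name if data starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None  # no BOM: fall back on external metadata


if __name__ == "__main__":
    sample = codecs.BOM_UTF16_LE + "hi\n".encode("utf-16-le")
    print(guess_encoding(sample))
    print(guess_encoding(b"plain ascii, no mark at all\n"))
```

A file with no mark comes back as None, and a little-endian 16 bit file
that starts with a NUL character looks just like a 32 bit one, so the
metadata is still needed.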

C.


