LINUX.IE, website of the Irish Linux Users' Group
[ILUG] [OT] silly shell thing

Caolan McNamara cmc at stardivision.de
Wed May 24 13:28:18 IST 2000


>>>>>>>>>>>>>>>>>> Original Message <<<<<<<<<<<<<<<<<<

On 24.05.00, 12:54:55, "McDaid, Aaron" <Aaron.McDaid at compaq.com> wrote 
regarding RE: [ILUG] [OT] silly shell thing:


> > I've a file that for some reason a windows editor
> > has put in double newlines everywhere.
> > Anyone got a good way of getting rid of them ?

If they are true LFLF (0A 0A) then
grep -v '^$' filename > filename.new
will do the trick (it drops every completely empty line).
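
The same filter can be sketched in Python for anyone without grep to hand. This is only an illustration: the name drop_empty_lines is made up here, and it works on raw bytes so that a line holding a stray CR is left alone, much as the grep would leave it.

```python
def drop_empty_lines(data: bytes) -> bytes:
    """Remove lines that contain nothing but their LF, like grep -v '^$'."""
    # keepends=True keeps each line's own terminator attached to it
    kept = [line for line in data.splitlines(keepends=True) if line != b"\n"]
    return b"".join(kept)


sample = b"one\n\ntwo\n\nthree\n"
print(drop_empty_lines(sample))
```

If the doubled newlines turn out to be CR LF CR LF instead, neither this nor the grep will touch them, which is the point of the "true LFLF" caveat.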

> This newline silliness (Macs, Windows and *nix
> disagreeing on what a newline is) really bugs me.
> Is there any good news on the horizon?
> ie. Will Unicode stop this?
> If so, is Unicode going to replace ASCII for
> most/all text files

Hmm, well it's going to be tricky. For a start the CR and LF characters
are both Unicode characters (http://charts.unicode.org/Web/U0000.html),
so files that contain them are valid Unicode strings.

On the other hand the Unicode recommendation is that you feck those LF 
and CRLF pairs out the window and actually use proper meaningful 
formatting characters, namely Paragraph Separator (U+2029) and Line 
Separator (U+2028), which do the obvious. 
(http://www.unicode.org/unicode/reports/tr13/)

In the mythical perfect world that we strive for, LF and CRLF disappear 
from our lives to be replaced with "LS". 
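
As a rough illustration of what that world looks like, here is a small Python sketch. It assumes nothing beyond the standard library; the variable names are made up for the example. It prints the code points involved and shows that Python's str.splitlines already treats LS and PS as line boundaries.

```python
import unicodedata

LS = "\u2028"  # LINE SEPARATOR
PS = "\u2029"  # PARAGRAPH SEPARATOR

for ch in ("\r", "\n", LS, PS):
    # CR and LF are control characters without a formal name, hence the default
    print(f"U+{ord(ch):04X}", unicodedata.name(ch, "<control>"))

text = "first line" + LS + "second line" + PS + "new paragraph"
lines = text.splitlines()
print(lines)
```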

Of course in the real world Unix will use the UTF-8 encoding for Unicode. 
Unicode can be encoded in a number of ways: big-endian 16-bit numbers (the 
Mac, I bet, for instance), little-endian 16-bit numbers (Windows), and 
UTF-8 for Unix. UTF-8 is cunning in that old ASCII characters are encoded 
as single bytes of the same value as in ASCII. Other Unicode characters 
are stored as a leading byte whose highest bits are set to flag how many 
following bytes are to be combined with it to create a full Unicode 
character (up to 6 bytes in the original design, though 4 are enough to 
cover every Unicode character). While this sounds horrible, it means that 
we can retain backwards compatibility with the vast majority of existing 
Unix programs (and we western language speakers will save 50% of our disk 
space compared with 16-bit Unicode, but keep quiet on that one).
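
To make the byte layout concrete, here is a minimal Python sketch. The helper utf8_bits is a made-up name for this example; it just prints each UTF-8 byte in binary so the high-bit flags are visible, then compares the size of an ASCII-only string in UTF-8 and in big-endian 16-bit form.

```python
def utf8_bits(ch: str) -> list:
    """Return the UTF-8 bytes of one character as 8-bit binary strings."""
    return [f"{b:08b}" for b in ch.encode("utf-8")]


print(utf8_bits("A"))        # ASCII: one byte, top bit clear
print(utf8_bits("\u00e9"))   # e-acute: lead byte 110xxxxx, then 10xxxxxx
print(utf8_bits("\u20ac"))   # euro sign: lead byte 1110xxxx, then two 10xxxxxx

text = "plain old ascii text"
utf8_size = len(text.encode("utf-8"))
utf16_size = len(text.encode("utf-16-be"))
print(utf8_size, utf16_size)
```

For pure ASCII text the 16-bit form is exactly twice the size, which is where the 50% saving comes from.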

The upshot being that we Unix users will have LFs lying around our 
drives for about the next 20 years, as there's little impetus for us to 
swap them for the longer byte sequence for an "LS", which would otherwise 
serve the same purpose but somewhat confuse our older utilities, C 
libraries etc. And in the Windows world there will be a large set of 
programs which will convert CRLF into Unicode as the Unicode characters CR 
and LF, as it makes no real difference to them.
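
The "longer byte sequence" is easy to check. A short Python sketch, with made-up variable names, that counts the UTF-8 bytes each line ending costs:

```python
newlines = {"LF": "\n", "CRLF": "\r\n", "LS": "\u2028"}

# Number of bytes each line ending occupies once encoded as UTF-8
sizes = {name: len(s.encode("utf-8")) for name, s in newlines.items()}
print(sizes)

# The actual bytes an LS turns into
ls_hex = "\u2028".encode("utf-8").hex()
print(ls_hex)
```

So an LS costs three bytes where an LF costs one, and none of those three bytes means anything to a tool that only looks for 0A.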

Nevertheless Unicode does address this problem, but does not firmly stamp 
it out. Unicode files should use "LS" to denote that the line has ended 
and that a new line is to be begun. Whether this document was created 
under Windows, SillyOs or the Hurd shouldn't matter. But we will almost 
certainly still see Unicode files with CR LF pairs and single LFs. How we 
will recognize a file as being UTF-8, LE 16-bit Unicode, BE 16-bit 
Unicode, GNU-style 32-bit Unicode etc. is left as an exercise for the 
reader. Metadata of the MIME-type style will still be required.
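
One partial answer to that exercise is to look for a byte order mark at the start of the file. The sketch below is an assumption about how one might do it, not something this post prescribes; sniff_bom and the BOMS table are made-up names, while the BOM constants come from Python's standard codecs module.

```python
import codecs

# Longest marks first: the UTF-32 LE mark begins with the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Guess an encoding from a leading byte order mark, or return None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None  # no mark found: the encoding is still unknown


print(sniff_bom(codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")))
print(sniff_bom(b"plain old bytes"))
```

The mark is optional, so a None result proves nothing, which is why the metadata is still needed.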

C.



