PerlMonks  

Handling Mac, Unix, Win/DOS newlines at readtime...

by strredwolf (Chaplain)
on Sep 16, 2002 at 01:03 UTC ( [id://198126]=perlquestion )

strredwolf has asked for the wisdom of the Perl Monks concerning the following question:

Well, I have a problem with our HTML 'compiler'. We're not only replacing tags, we're converting whole files one line at a time into Javascript (that is, the HTML that results from the compiler goes into the converter and ends up inside a document.writeln('');). BUT (and a big but it is), we cannot assume that the HTML we're processing was created on Linux. We have folks on Windows, Linux, and Mac, and we've got to deal with all those line endings!!!

I'm thinking of doing this:

$file='';
open(IN,"<$filename") || die "$file can't be opened: $!";
{
    local $/=undef;
    $file=<IN>;
}
@lines=split /[\r\n]+/, $file;
foreach $line (@lines) {
    # do some processing here
}
Any suggestions?

--
$Stalag99{"URL"}="http://stalag99.keenspace.com";

Replies are listed 'Best First'.
Re: Handling Mac, Unix, Win/DOS newlines at readtime...
by Flexx (Pilgrim) on Sep 16, 2002 at 09:09 UTC

    Good Monks!

    I'd think that using \r or \n in this script should be considered harmful. If you really want to be portable (which might not be the case), you should exclusively use \015 and \012 to match DOS/Linux/Mac line endings.

    Imagine this script being run on some box using EBCDIC (I know this is most probably a hypothetical assumption but, oh well, I just want to demonstrate something here...)

    Read perlebcdic. It has an example saying:

    $is_ebcdic_37 = "\n" eq chr(37);
    $is_ebcdic_1047 = "\n" eq chr(21);

    Uh-Oh... That means if you'd split on an EBCDIC system's perl on \n, you'd actually split on '%' or NAK respectively.

    You really want to use HTML::Parser (or even XML::Parser) to parse your input. At least do something like this (untested):

    @lines = split /\012\015?|\015\012?/, $file;

    Whatever approach you choose, input normalization is not really a trivial problem...
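
    A minimal sketch of the split above, run on a made-up sample string rather than a real file (the sample data and variable names are assumptions for illustration only). It uses the same octal-escape pattern, so the result does not depend on what the local perl takes "\n" and "\r" to mean:

    ```perl
    use strict;
    use warnings;

    # Hypothetical sample: one string holding Unix, DOS and old-Mac line ends.
    my $file = "unix\012dos\015\012mac\015last";

    # Same alternation as in the post, written with octal escapes.
    my @lines = split /\012\015?|\015\012?/, $file;

    print join('|', @lines), "\n";   # prints unix|dos|mac|last
    ```

    Note that this pattern also treats LF followed by CR as a single line end, so an LF-terminated line followed by an empty CR-terminated line would be merged.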

    So long,
    Flexx

      Beat me to it, but I am not as revered as Aristotle. The only reason I knew is that I got caught out badly reading DOS text files on Linux. Everyone who uses perl on different platforms, be they Mac, *nix and/or Windows/DOS based, should read perlport for the lowdown on when \n is \012 or \015 or \015\012.

      To quote some words of wisdom:

      In most operating systems, lines in files are terminated with newlines. Just what is used as a newline may vary from OS to OS. Unix traditionally uses \012, one kind of Windows I/O uses \015\012, and Mac OS uses \015.

      Perl uses \n to represent the ``logical'' newline, where what is logical may depend on the platform in use. In MacPerl, \n always means \015. In DOSish perls, \n usually means \012, but when accessing a file in ``text'' mode, STDIO translates it to (or from) \015\012.

      Amen ;-)
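
      To illustrate the text-mode point in that quote, here is a hedged sketch that reads a file with binmode so that no CRLF translation can happen on DOSish perls. The temporary file and its contents are made up for the example; File::Temp is a core module.

      ```perl
      use strict;
      use warnings;
      use File::Temp qw(tempfile);

      # Write a small DOS-style sample to a temporary file, byte for byte.
      my ($out, $name) = tempfile(UNLINK => 1);
      binmode $out;
      print {$out} "first\015\012second\015\012";
      close $out or die "close failed: $!";

      # Read it back with binmode so STDIO does no text-mode translation.
      open my $in, '<', $name or die "$name can't be opened: $!";
      binmode $in;
      my $raw = do { local $/; <$in> };
      close $in;

      # The raw bytes still contain \015\012, so split on explicit ordinals.
      my @lines = split /\015\012|\015|\012/, $raw;
      print scalar(@lines), "\n";   # prints 2
      ```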

      Beat me to it. That's the only sane proposal here. ++

      Makeshifts last the longest.

        Wow! Reading that from you makes me proud (and I mean this 100% honest!).
      I think the problem is only when the data being processed doesn't use the same encoding as the source file.

        Why? Do you mean the perl script's 'source' encoding? That doesn't really matter here.

        Perl's impression of what \r and \n mean differs from system to system. A Mac would output \015 when it sees \n while a unix box would generate \012 (IIRC now, it might also be the other way round... ;).

        So if a script tries to handle DOS text files using \r\n to split, chomp, substitute, etc. input, it will work on a system that has a compatible native encoding (ASCII). If, however, you run the same script on an incompatible system (for example on an EBCDIC platform like IBM's AS400, or even something more common like a pre-OS X Mac) it will do really funny things with your input, as it translates \r\n to some other ordinals, regardless of the script's source encoding (which will most likely be the platform's "native" encoding).

        The same is true, of course, when you want to operate on files coming from AS400 machines on your unix box (which is more likely, I guess). If you try to split these files on (unix') \n, you'll hit anything but a line end...

        To sum up: DOS files want to be split on \015\012, regardless of what line terminators the splitting system uses...
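
        A small sketch for checking this on whatever box you are on (nothing here comes from the original posts; the sample string is made up). It prints the ordinals the local perl assigns to "\n" and "\r", then splits a DOS-style string on explicit ordinals:

        ```perl
        use strict;
        use warnings;

        # Show, in octal, what this perl means by the two escapes.
        printf "\\n is \\%03o, \\r is \\%03o\n", ord("\n"), ord("\r");

        # Octal escapes name the ordinals directly, whatever the platform.
        my @lines = split /\015\012/, "one\015\012two\015\012";
        print scalar(@lines), "\n";   # prints 2
        ```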

        So long,
        Flexx

Re: Handling Mac, Unix, Win/DOS newlines at readtime...
by graff (Chancellor) on Sep 16, 2002 at 02:30 UTC
    Your code will work just fine for preserving the original line boundaries no matter what system created the text. In that respect, I personally don't know of any other approach that would improve on yours.

    But since it's HTML data that you're working with, line breaks are only meaningful as such within <pre> ... </pre> -- which leads to at least three points that might be interesting for your situation:

    • original data may have strange variations in the placement of line breaks, though this does not affect browser behavior;
    • you can often "revise" the distribution of line breaks without any noticeable effect on browser behavior;
    • the previous two points do not apply in certain portions of some HTML data (i.e. within <pre> elements).

    If all you're doing is taking html data that is already "okay" and replicating it with some particular wrapping around it, your suggested code will be fine.

    If your process involves any sort of filtering, enhancement or other modification of the content, then you will be much better off looking through the various HTML modules (especially HTML::Parser or HTML::TokeParser) to read the input properly. I frankly don't know how these will handle the subtler details of input from different systems. At worst, you may need to keep something like the code you suggested when handling the contents of <pre> blocks.

    update: it sounds like you're producing all your output for just one system (the one running the perl script), which means you want to eliminate the variations in line-break characters. But if you had to keep the line-breaks as-is, so that the results could be read back nicely on the particular system that created each original, you'd want to modify your code just a little:

    $file='';
    open(IN,"<$filename") || die "$filename can't be opened: $!";
    {
        local $/=undef;
        $file=<IN>;
    }
    # make output rec-separator same as the first line end found in the input
    ($\) = ($file =~ /(\r\n|\r|\n)/);
    @lines=split /[\r\n]+/, $file;
    foreach $line (@lines) {
        # do some processing here
    }
Re: Handling Mac, Unix, Win/DOS newlines at readtime...
by Zaxo (Archbishop) on Sep 16, 2002 at 01:59 UTC

    Your proposal looks fine. Ideally, your file transfers should be in ASC mode but you can't rely on people doing that.

    You could take the opportunity to translate lineends to native format for the server.

    Do you mean ... die "$filename can't... ?

    After Compline,
    Zaxo

      Ideally, your file transfers should be in ASC mode but you can't rely on people doing that.

      That assumes FTP and doesn't even apply to other common (and arguably better) methods such as HTTP, scp, rsync, etc.

      -sauoq
      "My two cents aren't worth a dime.";
      
        We ask folks to FTP into the server. Unfortunately, this does pose some problems. The first one is "I can't get it working behind my firewall" (solution: WebFTP). The second one is having consistent file transfers in the right format. Given that neither the users nor the programs are smart enough to do it, we'd better code properly.

        --
        $Stalag99{"URL"}="http://stalag99.keenspace.com";

Re: Handling Mac, Unix, Win/DOS newlines at readtime...
by sauoq (Abbot) on Sep 16, 2002 at 03:53 UTC

    I would split on /\r\n?/ instead. That avoids removing blank lines.

    Update: In answer to graff's reply, /\r\n?|\n/ will work on all three platforms. I would probably just fix the original files with something based on the first regex I gave though. Better to standardize the files right off the bat. Customizing all sorts of code to deal with all three file types will get old real quick.

    -sauoq
    "My two cents aren't worth a dime.";
    
      /\r\n?/ will fail to split lines that were created on unix systems. Eliminating blank lines might not be so bad, but if it's an issue, then:
      split(/\r\n|\r|\n/);
      Just doing /[\r\n]{1,2}/ will lose some blank lines on unix or mac input; and it's important to try to match the longer pattern first.
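
      A quick comparison of the three patterns discussed so far, on a made-up Unix-style sample with one blank line. This is only a sketch, and it assumes a platform where "\n" is LF and "\r" is CR:

      ```perl
      use strict;
      use warnings;

      # Hypothetical sample: three lines, the middle one blank.
      my $text = "one\n\nthree\n";

      my @greedy  = split /[\r\n]+/,    $text;   # a run of line ends counts as one
      my @crfirst = split /\r\n?/,      $text;   # needs a CR, so never matches here
      my @each    = split /\r\n|\r|\n/, $text;   # one line end per match

      printf "%d %d %d\n", scalar @greedy, scalar @crfirst, scalar @each;
      # prints 2 1 3
      ```

      Only the last pattern keeps the blank line as its own (empty) element; split drops the trailing empty field in every case.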
        but what if a file was created on a Windows machine, but this code was being run on a Mac?

        I remember reading somewhere in this thread that \r and \n have reversed semantics on the Mac (vs. *nix, Windows).

        So maybe we really want the following: split(/ \r\n | \n\r | \r | \n /x); # (yoicks!)

        My $0.02,

        -- jkahn

      I would split on /\r\n?/ instead. That avoids removing blank lines.
      But not on a Mac. On a Mac, the meaning of "\n" and "\r" got reversed. "\n" is what you use as native end-of-line characters, remember? And on a Mac, that's chr(13).

      Also, as people tend to forget to upload their HTML as text, you often get sequences of two CR characters and one LF. You want to deal with that, too. So here's my solution:

      /\015\015?\012|\015|\012/
      which you might want to replace with "\n" using s///g, instead of splitting on it, so you get one cleaned up string, to feed into HTML::Parser or similar.
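
      A sketch of that substitution on a made-up string (the sample bytes and variable names are assumptions, not taken from a real upload):

      ```perl
      use strict;
      use warnings;

      # Hypothetical upload containing CR CR LF, CR LF, a lone CR and a lone LF.
      my $html = "a\015\015\012b\015\012c\015d\012e";

      # Replace every recognised line end with the local "\n".
      (my $clean = $html) =~ s/\015\015?\012|\015|\012/\n/g;

      my @lines = split /\n/, $clean;
      print scalar(@lines), " lines\n";   # prints 5 lines
      ```

      The longest alternative comes first, so CR CR LF is consumed as a single line end rather than as three separate ones.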
Re: Handling Mac, Unix, Win/DOS newlines at readtime...
by Anonymous Monk on Sep 17, 2002 at 03:02 UTC
    Well, all those replies are good, but if you are getting the data directly from the wire through the HTTP protocol, then there is no point worrying about which platform the HTML file comes from, as the line break would be \r\n according to RFC 2616, regardless of platform. That will ease up the whole thing. Pei Guo
      That only applies to the headers, not the body of the response. If the Transfer-Encoding is "identity" or omitted (in which case I believe "identity" is implied), the line endings can be anything; the body is just a stream of octets. If the line ending had to be CRLF in the body, what would that do to binary downloads?

      Even if the RFC can be interpreted to say that the message body MUST have CRLF line endings (which I doubt), in real life, documents served by HTTP servers have all sorts of line endings (I've even seen mixed line endings in HTML files delivered via HTTP). If you think otherwise, look at the bytes delivered from http://www.microsoft.com/ (CRLF), http://www.linux.org/ (LF), http://www.linux.com/ (mostly LF with a few errant CRLFs), and (drat... couldn't find any pure-CR URLs).

        Yes, I agree with you that the CRLF rule only applies to the header (but there is no chance of missing that blank line between header and body, as we agreed those CRLFs are the same in the header regardless of platform). However, in the body, the line breaks do not 'really' mean anything, as we are dealing with a markup language, not plain text. This is true not only of clear text, but also of encoded text. If the text is encoded, the file system should not change anything within the encoded part. I do understand what his concern is, but I am just wondering whether that's something really worth dealing with. I guess we don't really know, as it is not clear what the objective is; I mean the actual objective required by the project, instead of the objective as interpreted by the programmer.

        This is incorrect. From RFC2046:

        4.1.1.  Representation of Line Breaks
        
           The canonical form of any MIME "text" subtype MUST always represent a
           line break as a CRLF sequence.  Similarly, any occurrence of CRLF in
           MIME "text" MUST represent a line break.  Use of CR and LF outside of
           line break sequences is also forbidden.
        
           This rule applies regardless of format or character set or sets
           involved.
        

        A web server should be converting any text/* content type from the server's native newlines to a proper "network" CRLF pair. This is what the browser should always expect to get back from a web server, and user agents should convert this to their own native newline format when saving the file (if a conversion is necessary).

        In theory, the onus of newline conversions should always be on the mechanisms that push data from one environment to another. FTP has the "ASCII" protocol and commands to determine architecture type. HTTP and SMTP rely on proper implementation of MIME (as noted above). If the implementations are done properly, newlines should never be an issue. Obviously this isn't a perfect world, but I think at least some resources should be used to address the root problem here (poor implementations of the standards or users not doing something correctly) instead of just writing convoluted newline handling mechanisms.
