
Re: Handling Mac, Unix, Win/DOS newlines at readtime...

by Anonymous Monk
on Sep 17, 2002 at 03:02 UTC ( [id://198419]=note )


in reply to Handling Mac, Unix, Win/DOS newlines at readtime...

Well, all those replies are good, but if you are getting the data directly off the wire over HTTP, there is no point in worrying about which platform the HTML file came from: the line break will be \r\n according to RFC 2616, regardless of platform. That simplifies the whole thing. Pei Guo

Re: Re: Handling Mac, Unix, Win/DOS newlines at readtime...
by mdillon (Priest) on Sep 17, 2002 at 03:24 UTC
    That only applies to the headers, not the body of the response. If the Transfer-Encoding is "identity" or omitted (in which case I believe "identity" is implied), the line endings can be anything; the body is just a stream of octets. If the line ending had to be CRLF in the body, what would that do to binary downloads?
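The header/body distinction above can be sketched in a few lines. This is a minimal illustration, not a real HTTP parser: the function name `split_http_response` and the sample response are made up for the example, and it assumes a complete, unchunked HTTP/1.x response already held in memory as bytes. The headers are split on CRLF, while the body is handed back as an untouched stream of octets.

```python
from typing import List, Tuple


def split_http_response(raw: bytes) -> Tuple[List[bytes], bytes]:
    """Split a raw HTTP/1.x response into header lines and an unmodified body.

    Hypothetical helper for illustration; assumes the whole response is in `raw`.
    """
    # The first blank line (CRLF CRLF) ends the header section.
    head, sep, body = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no blank line separating headers from body")
    # Header lines are CRLF-delimited; the body is left exactly as received.
    return head.split(b"\r\n"), body


# Made-up sample: CRLF in the headers, deliberately mixed endings in the body.
raw = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html>\n<body>\rhello\r\n</body>\n</html>\n"
)
header_lines, body = split_http_response(raw)
```

Whatever mix of CR, LF, and CRLF the server sent in the body survives the split, which is why the line-ending question still has to be answered by whoever processes the body.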

    Even if the RFC can be interpreted to say that the message body MUST have CRLF line endings (which I doubt), in real life, documents served by HTTP servers have all sorts of line endings (I've even seen mixed line endings in HTML files delivered via HTTP). If you think otherwise, look at the bytes delivered from http://www.microsoft.com/ (CRLF), http://www.linux.org/ (LF), http://www.linux.com/ (mostly LF with a few errant CRLFs), and (drat... couldn't find any pure-CR URLs).

      Yes, I agree with you that the CRLF rule only applies to the headers (and there is no chance of missing the blank line between header and body, since we agree those CRLFs are the same in the headers regardless of platform). However, in the body the line breaks do not 'really' mean anything, as we are dealing with a markup language, not plain text. This is true not only of clear text but also of encoded text: if the text is encoded, the file system should not change anything within the encoded part. I do understand what his concern is, but I am wondering whether it is really worth dealing with. I guess we don't really know, as the objective is not clear; I mean the actual objective required by the project, rather than the objective as interpreted by the programmer.

        This is an HTML problem/question, not an HTTP one. Well, the essence of the original question, put into a broader context, was: "How do I split on line boundaries when I don't know which line-ending scheme is used?" That in itself makes quite an interesting problem, IMHO.
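One possible answer to that broader question, as a rough sketch rather than a definitive solution: treat CRLF, lone CR, and lone LF all as line terminators, trying CRLF first so it is not counted as two breaks. The names `LINE_END` and `split_lines` are invented for this example, and the input is assumed to be already decoded text.

```python
import re

# CRLF must come first in the alternation; otherwise "\r\n" would match
# as a CR followed by a separate LF and produce a spurious empty line.
LINE_END = re.compile(r"\r\n|\r|\n")


def split_lines(text: str) -> list:
    """Split text on Unix (LF), Win/DOS (CRLF), or old Mac (CR) line endings."""
    lines = LINE_END.split(text)
    # A trailing terminator leaves one empty string at the end; drop it.
    if lines and lines[-1] == "":
        lines.pop()
    return lines


mixed = "unix\nwindows\r\nold mac\rlast"
print(split_lines(mixed))  # ['unix', 'windows', 'old mac', 'last']
```

Python's built-in `str.splitlines()` handles the same three endings, but it also breaks on characters such as form feed and vertical tab, so an explicit pattern is the safer choice when only CR, LF, and CRLF should count.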

        And in this context, it's not irrelevant what the endings are, even if I'd agree that it's not important from HTTP's view of an HTML body (ah.. which it does not care about at all; it's a transfer protocol that (of course) neither alters nor interprets any message content.)

        Should you, however, do any processing of HTML data, you need to worry about line endings -- at least when a pre tag is involved... Just because a browser can simply strip a document of its line endings (outside of pre tags) doesn't mean other applications should. A browser won't output anything. It's the end of the road, so it need not worry about a document's internal integrity.. ;)

        So long,
        Flexx

      This is incorrect. From RFC2046:

      4.1.1.  Representation of Line Breaks
      
         The canonical form of any MIME "text" subtype MUST always represent a
         line break as a CRLF sequence.  Similarly, any occurrence of CRLF in
         MIME "text" MUST represent a line break.  Use of CR and LF outside of
         line break sequences is also forbidden.
      
         This rule applies regardless of format or character set or sets
         involved.
      

      A web server should convert any text/* content type from the server's native newlines to a proper "network" CRLF pair. This is what the browser should always expect to get back from a web server, and user agents should convert it to their own native newline format when saving the file (if a conversion is necessary).

      In theory, the onus of newline conversion should always be on the mechanisms that push data from one environment to another. FTP has the "ASCII" transfer type and commands to determine architecture type. HTTP and SMTP rely on proper implementation of MIME (as noted above). If the implementations are done properly, newlines should never be an issue. Obviously this isn't a perfect world, but I think at least some resources should be spent on the root problem here (poor implementations of the standards, or users not doing something correctly) instead of just writing convoluted newline handling mechanisms.
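To make the "convert at the boundary" idea concrete, here is a small sketch of what a sender and receiver might do under the canonical form quoted from RFC 2046. The helper names `to_crlf` and `to_native` are hypothetical, and whether HTTP bodies are actually required to follow this rule is exactly what this thread disputes.

```python
import re


def to_crlf(text: str) -> str:
    """Sender side: rewrite any CR, LF, or CRLF line break as CRLF."""
    # CRLF is listed first so an existing CRLF is kept as one break,
    # which also makes the function safe to apply twice.
    return re.sub(r"\r\n|\r|\n", "\r\n", text)


def to_native(text: str, newline: str = "\n") -> str:
    """Receiver side: turn canonical CRLF breaks into a local newline.

    Assumes the input really is in canonical CRLF form.
    """
    return text.replace("\r\n", newline)


wire = to_crlf("one\ntwo\rthree\r\nfour")
local = to_native(wire)
```

If both ends behaved this way, the reading side would never see anything but its own native newline; the mixed endings reported earlier in the thread show why a reader often cannot rely on that.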

        HTTP is not MIME. HTTP Content-Type values may be the same as MIME types, but what you've quoted doesn't apply to HTTP message bodies.
