Re: Re: Handling Mac, Unix, Win/DOS newlines at readtime...

Re: Re: Re: Handling Mac, Unix, Win/DOS newlines at readtime... by Anonymous Monk on Sep 17, 2002 at 04:05 UTC
Yes, I agree that the CRLF rule applies only to the header (and there is no chance of missing the blank line between header and body, since we agreed that those CRLFs are the same in the header regardless of platform). In the body, however, the line breaks do not 'really' mean anything, because we are dealing with a markup language, not plain text. This is true not only of clear text but also of encoded text: if the text is encoded, the file system should not change anything within the encoded part. I do understand what his concern is, but I wonder whether it is really worth dealing with. I guess we don't really know, because the objective is not clear; I mean the actual objective required by the project, as opposed to the objective as interpreted by the programmer.
Re^4: Handling Mac, Unix, Win/DOS newlines at readtime... by Flexx (Pilgrim) on Sep 17, 2002 at 09:09 UTC
This is an HTML problem/question, not an HTTP one. Well, the essence of the original question, put in a broader context, was: "How do I split on line boundaries when I don't know which line-ending scheme is used?" That in itself makes quite an interesting problem, IMHO. And in this context it is not irrelevant what the endings are, even if I'd agree that it's not important from HTTP's idea of an HTML body (ah.. which it does not care about at all: it's a transfer protocol that (of course) neither alters nor interprets any message content).
Should you, however, do any processing of HTML data, you need to worry about line endings -- at least when a pre tag is involved... Just because a browser can simply strip a document of its line endings (outside of pre tags), other applications should not. A browser won't output anything. It's the end of the road, so it need not worry about a document's internal integrity.. ;)
So long, Flexx
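The splitting problem quoted above can be sketched roughly as follows. This is only an illustration in Python (the thread itself posts no code), and the helper name `split_lines` is made up for the example. The one detail that matters is trying CRLF before bare CR or LF, so that a DOS line ending is not counted as two breaks.

```python
import re

# Try CRLF first so a Windows line ending is not counted as two breaks.
_LINE_BREAK = re.compile(r"\r\n|\r|\n")


def split_lines(text):
    """Split text on CRLF, bare CR, or bare LF (hypothetical helper)."""
    lines = _LINE_BREAK.split(text)
    # A trailing line break leaves one empty final element; drop it.
    if lines and lines[-1] == "":
        lines.pop()
    return lines


print(split_lines("unix\nmac\rdos\r\nlast"))
# → ['unix', 'mac', 'dos', 'last']
```

Python's built-in `str.splitlines()` handles the same three endings, but it also breaks on a few other characters (form feed, vertical tab and so on), which may or may not be wanted when processing HTML.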
Re: Re: Re: Handling Mac, Unix, Win/DOS newlines at readtime... by Fastolfe (Vicar) on Sep 17, 2002 at 20:52 UTC
~~This is incorrect.~~ From RFC2046:
4.1.1. Representation of Line Breaks
The canonical form of any MIME "text" subtype MUST always represent a line break as a CRLF sequence. Similarly, any occurrence of CRLF in MIME "text" MUST represent a line break. Use of CR and LF outside of line break sequences is also forbidden. This rule applies regardless of format or character set or sets involved.
A web server should be converting any text/* content type from the server's native newlines to a proper "network" CRLF pair. This is what the browser should always expect to get back from a web server, and a user agent should convert it to its own native newline format when saving the file (if a conversion is necessary).
In theory, the onus of newline conversion should always be on the mechanisms that push data from one environment to another. FTP has its "ASCII" transfer mode and commands to determine architecture type. HTTP and SMTP rely on proper implementation of MIME (as noted above). If the implementations are done properly, newlines should never be an issue. Obviously this isn't a perfect world, but I think at least some resources should be spent on addressing the root problem here (poor implementations of the standards, or users not doing something correctly) instead of just writing convoluted newline handling mechanisms.
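The two conversions described here (native to "network" CRLF on the way out, CRLF back to native when saving) might look something like the sketch below. It is in Python for illustration only; `to_network` and `to_native` are hypothetical names, not part of any server or library.

```python
import os
import re

# Any of the three common line endings; CRLF is tried first.
_ANY_BREAK = re.compile(r"\r\n|\r|\n")


def to_network(text):
    """Rewrite every line break as CRLF, the MIME canonical form."""
    return _ANY_BREAK.sub("\r\n", text)


def to_native(text, native=os.linesep):
    """Rewrite every line break as the local platform's convention."""
    # A function is used as the replacement so `native` is inserted literally.
    return _ANY_BREAK.sub(lambda _match: native, text)
```

Both functions accept input in any of the three styles, so applying `to_network` twice gives the same result as applying it once.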
Re: Re: Re: Re: Handling Mac, Unix, Win/DOS newlines at readtime... by mdillon (Priest) on Sep 17, 2002 at 21:02 UTC
HTTP is not MIME. HTTP Content-Type values may be the same as MIME types, but what you've quoted doesn't apply to HTTP message bodies.
Re: Re: Re: Re: Re: Handling Mac, Unix, Win/DOS newlines at readtime... by Fastolfe (Vicar) on Sep 17, 2002 at 21:27 UTC
It looks like you're closer to the truth here than I was. Perhaps I misinterpreted. Here is an excerpt from the HTTP/1.1 specification:
3.7.1 Canonicalization and Text Defaults
Internet media types are registered with a canonical form. An entity-body transferred via HTTP messages MUST be represented in the appropriate canonical form prior to its transmission except for "text" types, as defined in the next paragraph.
When in canonical form, media subtypes of the "text" type use CRLF as the text line break. HTTP relaxes this requirement and allows the transport of text media with plain CR or LF alone representing a line break when it is done consistently for an entire entity-body. HTTP applications MUST accept CRLF, bare CR, and bare LF as being representative of a line break in text media received via HTTP.
So it does appear HTTP doesn't really care how line endings are specified. This strikes me as a little brain-dead, but ah well...
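The "done consistently for an entire entity-body" condition in that excerpt is easy to check on the receiving side. A rough Python sketch follows; the function names are invented for this example, and the body is assumed to be already decoded to a string.

```python
import re

_BREAK = re.compile(r"\r\n|\r|\n")
_NAMES = {"\r\n": "CRLF", "\r": "CR", "\n": "LF"}


def line_ending_styles(text):
    """Return the set of line-ending styles that occur in text."""
    return {_NAMES[match.group()] for match in _BREAK.finditer(text)}


def is_consistent(text):
    """True if the text uses at most one line-ending style throughout."""
    return len(line_ending_styles(text)) <= 1
```

A receiver that accepts all three styles does not strictly need this check, but it can be handy for logging bodies that mix conventions.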
Re: Re: Re: Re: Re: Handling Mac, Unix, Win/DOS newlines at readtime... by Fastolfe (Vicar) on Sep 17, 2002 at 21:11 UTC
You're ~~both~~ right ~~and wrong~~ here. Let me first quote a bit of the HTML 4 specification:
4.3 The text/html content type
HTML documents are sent over the Internet as a sequence of bytes accompanied by encoding information (described in the section on character encodings). The structure of the transmission, termed a message entity, is defined by RFC2045 and RFC2616. A message entity with a content type of "text/html" represents an HTML document.
Here, the HTML specification explicitly indicates a reliance upon RFC2045 (and by association, RFC2046). An HTML document is stored as a native plain-text document either on a user's PC or a web server. Newlines at this point are native to the OS.
Do not make the assumption that HTTP servers are filesystem-based! It just so happens that a filesystem-oriented document root, with URI contents represented as files, is the simplest and most prevalent way HTTP servers are implemented, but do not blindly assume that HTTP was written with the intent that a URI should map to a file on a filesystem. That's just not accurate. But because most HTTP servers do this, they have to do a certain number of things to ensure that a requested resource is delivered in a fashion consistent with HTTP and MIME. This usually involves examining a file extension to find a MIME type, and delivering the contents of the file in a fashion consistent with that MIME type.
If an HTML document is stored on a filesystem with native newlines, an HTTP server that relies on filesystem-oriented content should take steps to ensure that "special cases" like newlines are addressed as well. Conversion should be performed by the web server as a consequence of the web server's filesystem-oriented implementation of an HTTP service. What is the alternative? Turn HTML files into what are effectively binary files because of their quirky (with respect to the native text format) line endings? If not, what else is supposed to be converting newlines here?
Think of this from the user agent's point of view. It's expecting content with the MIME type of text/html. MIME explicitly states that text/html must use CRLF line endings. How does it get that way? If the server isn't responsible for it, what is? The HTTP servers have assumed this responsibility as a consequence of choosing a filesystem-oriented mechanism for storing content. They have to live with content stored with native line endings and should thus be responsible for getting that converted into something appropriate when delivering text/* content over the Internet.
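For illustration, the filesystem-oriented behaviour argued for in this post (look up a MIME type from the file extension, then convert newlines only for text/* content) could be sketched like this in Python. `prepare_body` is a hypothetical helper and not how any particular web server works; it also assumes an ASCII-compatible encoding, since a byte-level replacement would corrupt something like UTF-16.

```python
import mimetypes
import re

# Byte pattern for the three common line endings; CRLF is tried first.
_ANY_BREAK = re.compile(rb"\r\n|\r|\n")


def prepare_body(filename, raw):
    """Return (content_type, body) for a file about to be served.

    Hypothetical helper: only unencoded text/* bodies get CRLF line
    endings; everything else is passed through byte for byte.
    """
    content_type, encoding = mimetypes.guess_type(filename)
    if content_type is None:
        content_type = "application/octet-stream"
    # `encoding` is set for compressed files such as .gz; leave those alone.
    if content_type.startswith("text/") and encoding is None:
        raw = _ANY_BREAK.sub(b"\r\n", raw)
    return content_type, raw
```

Whether a server ought to do this at all is exactly what the thread disputes; the sketch only shows that the conversion has to be keyed on the content type, so that binary files are never touched.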
Re: Re: Re: Re: Re: Re: Handling Mac, Unix, Win/DOS newlines at readtime... by mdillon (Priest) on Sep 17, 2002 at 21:23 UTC
Re: Re: Re: Re: Re: Re: Re: Handling Mac, Unix, Win/DOS newlines at readtime... by Fastolfe (Vicar) on Sep 17, 2002 at 21:33 UTC
