Distinguish between HTML and Plain text

vit has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Distinguish between HTML and Plain text by JavaFan (Canon) on Sep 27, 2011 at 01:10 UTC
You cannot. Remember that the content of P elements can consist of just PCDATA, which can be just "plain text". And even if you have a piece of data that validates against an HTML DTD, you still cannot know whether the author intended it as HTML or as plain text. If you need to know, you either have to use some heuristics (for instance, whether it "validates", either against a DTD or by the more usual "my browser doesn't barf on it" test), or ask the user.
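To make "use some heuristics" concrete, here is a minimal sketch of one possible guess. It is not a validator: it counts tags whose names appear in a short list plus character entity references. The `%known` list and the name `guess_is_html` are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical, deliberately short list of element names to look for.
my %known = map { $_ => 1 }
    qw(a b br div em h1 h2 h3 html i img li ol p pre span strong table td tr ul);

# Guess whether $text is HTML: count tags whose names are in %known and
# character entity references such as &amp; or &#8217;.
sub guess_is_html {
    my ($text) = @_;
    my $tags = 0;
    while ($text =~ m{</?([A-Za-z][A-Za-z0-9]*)\b[^<>]*>}g) {
        $tags++ if $known{ lc $1 };
    }
    my $entities = () = $text =~ /&(?:[A-Za-z]+|#\d+|#x[0-9A-Fa-f]+);/g;
    return ($tags + $entities) > 0 ? 1 : 0;
}

print guess_is_html("<p>Hello &amp; welcome</p>") ? "html\n" : "text\n";
print guess_is_html("2 < 3 and 5 > 4")            ? "html\n" : "text\n";
```

As the reply says, this only produces a guess. Plain text that happens to mention a tag or an entity will be reported as HTML.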
Re: Distinguish between HTML and Plain text by ikegami (Patriarch) on Sep 26, 2011 at 23:11 UTC
Impossible. At best, you can take a guess. But you can guess very reliably because HTML must have an HTML element. If you don't know if it's text or HTML, then you're surely dealing with bytes, so you need to handle UTF-16le, UTF-16be, UCS-2le, UCS-2be, UCS-4le, UCS-4be: `/<HTML|<\0H\0T\0M\0L|<\0\0\0H\0\0\0T\0\0\0M\0\0\0L/` If you're somehow dealing with decoded text: `/<HTML/` Update: No, that's still not good enough. A text version of this very post would fail, for example.
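A small sketch of how the byte-level pattern behaves, assuming the core Encode module is available. The helper name `bytes_mention_html` is made up for illustration. The sample string is encoded several ways and each result is run through the pattern.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Byte-level pattern from the post above: "<HTML" as single-byte text,
# as 2-byte code units, or as 4-byte code units.
my $html_marker = qr/<HTML|<\0H\0T\0M\0L|<\0\0\0H\0\0\0T\0\0\0M\0\0\0L/;

# Returns 1 if the raw bytes contain something that looks like "<HTML".
sub bytes_mention_html {
    my ($bytes) = @_;
    return $bytes =~ $html_marker ? 1 : 0;
}

my $sample = "<HTML><BODY>hi</BODY></HTML>";
for my $enc (qw(UTF-8 UTF-16LE UTF-16BE UTF-32LE UTF-32BE)) {
    my $bytes = encode($enc, $sample);
    printf "%-9s %s\n", $enc, bytes_mention_html($bytes) ? "match" : "no match";
}
```

The big-endian encodings match as well because the pattern starts at the `<` byte and the zero bytes that follow it line up in the same way. As posted, the pattern is case-sensitive, so a lowercase `<html` would not match without the `/i` modifier.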
Re^2: Distinguish between HTML and Plain text by vit (Friar) on Sep 26, 2011 at 23:26 UTC
"But you can guess very reliably because HTML must have an HTML element" I forgot to mention that the HTML entered may be just a fragment of an HTML document, so assuming the presence of an "<html" tag will not work.
Re^3: Distinguish between HTML and Plain text by ikegami (Patriarch) on Sep 26, 2011 at 23:36 UTC
This is HTML: `Please use <code>...</code> tags around your code.` This is text: `Please use <code>use strict;</code> in your code.` How can one possibly correctly identify them programmatically? PS - This is the reason Atom is better than RSS. RSS doesn't provide a means of specifying the content type, so it can't distinguish between text and HTML content. Clients have to guess. You could take a peek at how RSS clients do it, but I suspect they might work with less ambiguous content than you.
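To illustrate the point, here is a sketch that runs the two strings above through a naive check. The check itself (`has_code_element`) is an assumption for illustration, not something from this thread. Both strings get the same result, so a test of this kind cannot separate them.

```perl
use strict;
use warnings;

# The two example strings, labelled as in the post above.
my $labelled_html = 'Please use <code>...</code> tags around your code.';
my $labelled_text = 'Please use <code>use strict;</code> in your code.';

# Naive check (illustrative only): does the string contain a
# matching <code>...</code> pair?
sub has_code_element {
    my ($s) = @_;
    return $s =~ m{<code>.*?</code>}s ? 1 : 0;
}

for my $s ($labelled_html, $labelled_text) {
    printf "%d  %s\n", has_code_element($s), $s;
}
```

Any check that looks only at the markup sees the same structure in both strings. What differs is the author's intent, which is not in the data.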
Re: Distinguish between HTML and Plain text by Khen1950fx (Canon) on Sep 27, 2011 at 02:24 UTC
It can be done. Is it precise? You be the judge:

```perl
#!/usr/bin/perl -l
use strict;
use warnings;
use Text::FromAny;

# Redirect STDOUT to a log file, so everything printed below lands in it.
my $log = '/root/Desktop/text.log';
open STDOUT, '>', $log or die "Cannot open $log: $!";

# Write the sample input to the log file.
my $entries = "<TITLE>Page 7</TITLE>";
print $entries;

# Ask Text::FromAny which type it detects for the file, and print that too.
my $tFromAny = Text::FromAny->new(file => $log);
print $tFromAny->detectedType;
close STDOUT;
```

Instead of thinking "can't", think "don't do that". It works, but it's not best practice.

