I've got zillions of lines of stuff that should be html but, you know, isn't very clean.

Every line needs to be cleaned up. The problem I'm having is html that contains exotic characters like

&rsquo;

What the hell is that anyway? I don't know, and I don't care. It seems to have meaning only under utf-8, and the team I am delivering the data to hasn't switched to utf-8 yet. So the agreed workaround is that we skip anything that is "utf-8 only". For everything else, we'd still like to quick-convert html to text using HTML::Strip. Is there a way to do this? Or is there a better way to quick-convert html to text than HTML::Strip?
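For reference, &rsquo; is the HTML entity for the right single quotation mark, U+2019. Here is a minimal sketch of why it is "utf-8 only" from a Latin-1 point of view while &Uuml; is not. It assumes HTML::Entities (part of the HTML-Parser distribution) is installed, and describeEntity is a made-up helper name for illustration only:

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# Hypothetical helper: report the code point(s) an entity decodes to,
# and whether they all fit in Latin-1 (ISO-8859-1), i.e. are <= 0xFF.
sub describeEntity {
    my ($entity) = @_;

    # Called in non-void context, decode_entities returns a decoded copy
    my $decoded    = decode_entities($entity);
    my @codePoints = map { ord } split //, $decoded;
    my $fitsLatin1 = !grep { $_ > 0xFF } @codePoints;

    return sprintf "%s => %s (%s)", $entity,
        join(' ', map { sprintf 'U+%04X', $_ } @codePoints),
        $fitsLatin1 ? 'fits in Latin-1' : 'needs a wider encoding';
}

print describeEntity($_), "\n" for '&Uuml;', '&rsquo;';
```

Anything that decodes above U+00FF cannot be written as a single Latin-1 byte, which is probably also where a "Wide character in print" warning comes from.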

Below are tests and code that demonstrate the problem.

The meat is in two functions, stripUtf8Entities and stripUtf8EntitiesBetter, which I call before converting my "html" to text. stripUtf8Entities lets me pass my tests, but only for that one "ugly" special character; I guess it won't work in general. stripUtf8EntitiesBetter doesn't pass the tests because it's just a stub, but it's the code to change if you have a better idea of how to do this. Test output:

ok 1 - stripUtf8Entities
# before:blah
# after: blah
ok 2 - stripUtf8Entities
# before:&Uuml --
# after: Ü --
ok 3 - stripUtf8Entities
# before:blah -- &rsquo; -- blah
# after: blah --  -- blah
ok 4 - stripUtf8Entities
# before:&Uuml; -- &rsquo; -- blah
# after: Ü --  -- blah
ok 5 - stripUtf8EntitiesBetter
# before:blah
# after: blah
ok 6 - stripUtf8EntitiesBetter
# before:&Uuml --
# after: Ü --
not ok 7 - stripUtf8EntitiesBetter
# before:blah -- &rsquo; -- blah
# after: blah --  -- blah
#   Failed test 'stripUtf8EntitiesBetter
# before:blah -- &rsquo; -- blah
# after: blah --  -- blah'
#   at shopImporter-test.pl line 49.
Wide character in print at /home/hartman/idealo_external_dependencies/current/localperl/lib/5.8.8/Test/Builder.pm line 1192.
#          got: 'blah -- â -- blah'
#     expected: 'blah --  -- blah'
not ok 8 - stripUtf8EntitiesBetter
# before:&Uuml; -- &rsquo; -- blah
# after: Ü --  -- blah
#   Failed test 'stripUtf8EntitiesBetter
# before:&Uuml; -- &rsquo; -- blah
# after: Ü --  -- blah'
#   at shopImporter-test.pl line 49.
Wide character in print at /home/hartman/idealo_external_dependencies/current/localperl/lib/5.8.8/Test/Builder.pm line 1192.
#          got: 'Ã -- â -- blah'
#     expected: 'Ã --  -- blah'
1..8
# Looks like you failed 2 tests of 8.

Code:

$ cat utf8-and-html-entities.pl
#!/usr/angebote/perlroot/bin/perl
use strict;
use warnings;

# use strict;
# use IO::File;
# use Text::CSV_XS;
# use DBI;
# use Time::Local;
# use Time::HiRes;
# use Compress::Zlib;
# use LWP::UserAgent;
#use POSIX qw(locale_h);
use HTML::Strip;
use Test::More qw(no_plan);
use Data::Dumper;

#setlocale(LC_CTYPE, "de_DE.ISO8859-1");

require "../../perl/agentFunc.pl";

my $stringsBeforeAfter = [
               [ 'blah', 'blah' ],
               [ '&Uuml --', 'Ü --'],
               ["blah -- &rsquo; -- blah", "blah --  -- blah"],
               ["&Uuml; -- &rsquo; -- blah", "Ü --  -- blah"],
              ];


foreach my $beforeAfter ( @$stringsBeforeAfter ) {
  my ( $before, $after )  = @$beforeAfter;
  my $transformed = HTML2Text( stripUtf8Entities( $before ) );
  my $strings = [ [ "before", $before ],
                  [ "after", $after ],
                  [ "transformed", $transformed ]
                ];
  #print "strings: " . Dumper($strings);
  is($transformed, $after, "stripUtf8Entities");
}

foreach my $beforeAfter ( @$stringsBeforeAfter ) {
  my ( $before, $after )  = @$beforeAfter;
  my $transformed = HTML2Text( stripUtf8EntitiesBetter( $before ) );
  my $strings = [ [ "before", $before ],
                  [ "after", $after ],
                  [ "transformed", $transformed ]
                ];
  #print "strings: " . Dumper($strings);
  is($transformed, $after, "stripUtf8EntitiesBetter");
}

sub HTML2Text {
    my ($changeText) = @_;

    my $htmlStripObject = HTML::Strip->new();

    $changeText = $htmlStripObject->parse($changeText);

    return $changeText;
}

# works, but only for one special character: &rsquo;
# what happens when I hit another char that doesn't translate well out
# of utf8?
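sub stripUtf8Entities {
   my $string = shift;
   # default only when nothing was passed, so a literal "0" survives
   $string = "" unless defined $string;

   # entities known not to translate well out of utf8
   my $utf8Entities = ["&rsquo;"];

   foreach my $utf8Entity ( @$utf8Entities ) {
     # \Q...\E matches the entity literally rather than as a regex
     $string =~ s/\Q$utf8Entity\E//g;
   }

   return $string;
}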

#just a stub -- is there a better, more general way to do this?
sub stripUtf8EntitiesBetter {
   my $string = shift || "";
   return $string;

}
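
One possible direction for the stub, offered as a sketch rather than a tested answer. It assumes HTML::Entities (from the HTML-Parser distribution) is installed, and it reads "utf-8 only" as "decodes to a code point above U+00FF", which is my guess at what the team means. stripNonLatin1Entities is a hypothetical name, not something from the script above:

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# Hypothetical helper: remove every entity whose decoded character
# does not fit in Latin-1 (code point above 0xFF). Everything else,
# including entities HTML::Entities does not recognize, is left as-is.
sub stripNonLatin1Entities {
    my $string = shift;
    $string = "" unless defined $string;

    # Matches &name; &#8217; and &#x2019; (the semicolon is required)
    $string =~ s{(&\#?\w+;)}{
        my $entity  = $1;
        # Non-void context: decode_entities returns a decoded copy
        my $decoded = decode_entities($entity);
        $decoded =~ /[^\x00-\xFF]/ ? "" : $entity;
    }gex;

    return $string;
}

print stripNonLatin1Entities("&Uuml; -- &rsquo; -- blah"), "\n";
```

The result could then be handed to HTML2Text in place of what stripUtf8Entities returns. Entities written without a trailing semicolon are not touched by this pattern, so whether HTML::Strip decodes those on its own would still need checking.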

In reply to HTML::Strip and UTF8 -- is there some way I can just skip all the "UTF8 only" entities? by tphyahoo
