Temporarily strip HTML

epoptai has asked for the wisdom of the Perl Monks concerning the following question:

I've got a string with text and html tags destined for a process that will damage the html. So I want to temporarily replace the html with placeholders while storing a "placeholder => html" map in a hash, so it can be restored after the destructive process.

As an example I need to turn this:

$str =
 qq~This <b>contains</b> both text and <a href="http://www.w3c.org">html</a>.~;

Into this string and hash:

$str = qq~This <1>contains<2> both text and <3>html<4>.~;

$html = {
    1 => 'b',
    2 => '/b',
    3 => 'a href="http://www.w3c.org"',
    4 => '/a'
    };

Then I can process the string and restore the html with something like:

$str =~ s|<(\d+)>|<$$html{$1}>|g;

I thought this would be easy, but I'm still staring at the wall. Thanks.

--
Check out my Perlmonks Related Scripts like framechat, reputer, and xNN.


Replies are listed 'Best First'.
(jeffa) Re: Temporarily strip HTML by jeffa (Bishop) on Jul 19, 2002 at 01:09 UTC
These days i tend to cringe anytime i see a hash whose keys are element indexes ... just use an array, but how about using HTML::TreeBuilder for this instead? I can't elaborate more because you have not said what you are doing with the stuff before you put it back together.

jeffa

L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)

Re: Temporarily strip HTML by DamnDirtyApe (Curate) on Jul 19, 2002 at 03:00 UTC
I believe this is what you're looking for.

Code:

#! /usr/bin/perl

use strict ;
use warnings ;

use Data::Dumper ;

my $str = qq~This <b>contains</b> both text and <a href="http://www.w3c.org">html</a>.~ ;

print $str, "\n\n" ;

# Do the replacement...
my @tags = () ;
my $index = -1 ;
$str =~ s|<([^>]+)>(?{ push @tags, $1 ; $index++ })|<$index>|gs ;

# Show the replaced text & the stored tags.
print "-----\n", Dumper( \@tags ), "\n\n", $str, "\n\n" ;

# Sub the tags back in.
$index = -1 ;
$str =~ s|<([^>]+)>(?{ $index++ })|<$tags[$index]>|gs ;

# Show the string with the HTML put back in.
print "-----\n", $str, "\n\n" ;

Output:

This <b>contains</b> both text and <a href="http://www.w3c.org">html</a>.

-----
$VAR1 = [
          'b',
          '/b',
          'a href="http://www.w3c.org"',
          '/a'
        ];


This <0>contains<1> both text and <2>html<3>.

-----
This <b>contains</b> both text and <a href="http://www.w3c.org">html</a>.

Update: Argh. I was downtown about an hour after I posted this, and it suddenly occurred to me that the second substitution makes much more sense as:

# Sub the tags back in.
$str =~ s|<(\d+)>|<$tags[$1]>|gs ;

_______________
D a m n D i r t y A p e

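For comparison, the same strip-and-restore round trip can probably be written without the embedded `(?{ ... })` code block, by letting the `/e` modifier evaluate the replacement as Perl code. This is only a sketch under the same assumption as the code above (tags are matched naively as `<...>` with no `>` inside); the variable names mirror the original.

```perl
use strict;
use warnings;

my $str = qq~This <b>contains</b> both text and <a href="http://www.w3c.org">html</a>.~;

# Strip: /e runs the replacement as code, so we can save the tag
# and then return a placeholder built from its array index.
my @tags;
$str =~ s{<([^>]+)>}{ push @tags, $1; '<' . $#tags . '>' }ge;

print $str, "\n";   # This <0>contains<1> both text and <2>html<3>.

# Restore: each placeholder carries the index of its saved tag.
$str =~ s{<(\d+)>}{<$tags[$1]>}g;

print $str, "\n";
```

Because the index is computed from `@tags` itself, there is no separate counter to keep in sync.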
Re: Temporarily strip HTML by Juerd (Abbot) on Jul 19, 2002 at 06:45 UTC
Regexp::IgnoreHTML?

- Yes, I reinvent wheels.
- Spam: Visit eurotraQ.

Temporarily strip HTML (solved!) by epoptai (Curate) on Jul 20, 2002 at 00:54 UTC
Despite the title of this reply i'm still interested in any more guidance on how to do this properly.

Since jeffa asked, this problem involves giving framechat the ability to translate the chatterbox in real time using babelfish. I couldn't figure out how to implement an HTML::TreeBuilder solution so i went with DamnDirtyApe's code because it works and resembles the way i had tried and failed to solve the problem. His example works as posted but had problems when tested in the wild. So it evolved into the following working example.

Two problems: i had to use two global variables, and the links break sometimes when the translation mixes up the order of the html placeholders. Testing with the live chatterbox today revealed that the second problem is relatively rare, but i'm still thinking about how to prevent it.

This test script translates a small sentence of English into a random European language while preserving the HTML:

#!/usr/bin/perl -w
use strict;
use WWW::Babelfish;
use CGI 'header';
use Data::Dumper;

my %html = ();
my $C = 1;
my @langs = qw(German French Spanish Italian Portuguese);
shuffle(\@langs);

my $str = qq~This <b>contains</b> both text and <a href="http://www.w3c.org">html</a>.~;

$str = translate($str);

print "$langs[0]: ", $str, '<pre>', Dumper(\%html);

sub translate {
    my ($txt) = @_;
    $txt =~ s|<([^>]+)>|savetags($1)|eg; # weak parsing :-/
    my $obj = new WWW::Babelfish( 'agent' => 'DeBabelizer' );
    return $_[0] unless defined($obj);
    my $ttxt = $obj->translate(
        'source'      => 'English',
        'destination' => $langs[0],
        'text'        => $txt
    );
    return $_[0] unless defined($ttxt);
    $ttxt = encode($ttxt);
    # replace placeholders with corresponding html
    # need ;? cause the fish sometimes screws up that colon
    $ttxt =~ s|\&lt\;(\d+)\&gt\;?|<$html{$1}>|g;
    return $ttxt
}

sub savetags { # replace html tag content with placeholders
    my $htm = pop;
    $html{$C} = $htm; # ack, global
    $_ = '<'.$C.'>';
    $C++; # global
    return $_
}

## and two third-party subs that add to the fun

sub shuffle { # Perl Cookbook recipe 4.17
    my $array = shift;
    for(my$i = @$array; --$i;){
        my$j = int rand ($i+1);
        next if $i == $j;
        @$array[$i,$j] = @$array[$j,$i]
    }
}

sub encode { # UTF-8 to latin1 regex from XML::TiePYX (thanks to mirod)
    my($text) = @_;
    $text =~ s{([\xc0-\xc3])(.)}{
        my $hi = ord($1);
        my $lo = ord($2);
        chr((($hi & 0x03) <<6) | ($lo & 0x3F))
    }ge;
    return $text;
}

Thanks to everyone who replied to my original query!

--
Check out my Perlmonks Related Scripts like framechat, reputer, and xNN.
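On the first of the two problems (the global `%html` and `$C`), one possible way around it is to have the stripping sub build the tag list in a lexical and hand it back to the caller. The sketch below assumes nothing about WWW::Babelfish; `strip_tags` and `restore_tags` are hypothetical names, not part of the script above, and the tag matching is the same weak `<...>` regex.

```perl
use strict;
use warnings;

# Replace each tag with a numbered placeholder (<1>, <2>, ...).
# Returns the new text and a reference to the saved tag contents.
sub strip_tags {
    my ($text) = @_;
    my @tags;
    $text =~ s{<([^>]+)>}{ push @tags, $1; '<' . scalar(@tags) . '>' }ge;
    return ($text, \@tags);
}

# Put the saved tags back. Accepts literal <N> as well as the
# entity-encoded form &lt;N&gt; (the final semicolon is optional).
# Placeholders with no saved tag are left untouched.
sub restore_tags {
    my ($text, $tags) = @_;
    $text =~ s{((?:<|&lt;)(\d+)(?:>|&gt;?))}{
        my ($whole, $i) = ($1, $2 - 1);
        ($i >= 0 && defined $tags->[$i]) ? "<$tags->[$i]>" : $whole
    }ge;
    return $text;
}

my ($stripped, $saved) = strip_tags('This <b>contains</b> text.');
print $stripped, "\n";                          # This <1>contains<2> text.
print restore_tags($stripped, $saved), "\n";    # This <b>contains</b> text.
```

Since each placeholder carries its own number, restoring by number still pairs every placeholder with its saved tag if the translator moves them around. It cannot repair HTML whose opening and closing tags end up in the wrong order, so the second problem remains open.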
