http://qs321.pair.com?node_id=191864

chuleto1 has asked for the wisdom of the Perl Monks concerning the following question:

Problem:
I would like to replace every tab ("\t"), newline ("\n"), and run of more than one consecutive space (" ") inside opening and closing tags with ONE space per instance.
$text = "<tag>
      The purpose of the applicant rating session is for you,
the applicant, to provide a sample of your       
effective teaching skills.</tag>"


The desired result would be:

$text = "<tag>The purpose of the applicant rating session is for you, the applicant, to provide a sample of your effective teaching skills.</tag>"

Re: taking white space out between closing and opening tags
by Ovid (Cardinal) on Aug 21, 2002 at 20:58 UTC

    Here's a quick, untested, stab at it. Let's assume for this example that you are talking about <p> tags.

    use HTML::TokeParser::Simple;

    # assumes that $text is a scalar containing the actual HTML
    my $p = HTML::TokeParser::Simple->new( \$text );
    my $token;

    # skip ahead to the opening <p> tag
    do { $token = $p->get_token } until $token->is_start_tag('p');
    my $new_text = $token->return_text;

    # copy tokens up to and including the closing </p> tag
    do {
        $token = $p->get_token;
        my $temp = $token->return_text;
        if ( $token->is_text ) {
            $temp =~ s/\s+/ /g;  # collapse whitespace
            $temp =~ s/^\s//;    # remove initial whitespace
            $temp =~ s/\s$//;    # remove trailing whitespace
        }
        $new_text .= $temp;      # the closing tag is appended here as well
    } until $token->is_end_tag('p');

    This is a much cleaner (and more accurate) method of accomplishing this task than most regex solutions. I also happen to think that HTML::TokeParser::Simple is easier to use than many other HTML parsing modules. Of course, I may be biased as I wrote that module :)

    Cheers,
    Ovid


Re: taking white space out between closing and opening tags
by dpuu (Chaplain) on Aug 21, 2002 at 20:50 UTC
    Split the problem: first get the string, then condense it. Assuming you can't use any of the standard XML/HTML modules to get the text, you could try:
    sub condense { my $s = shift; $s =~ s/\s+/ /g; return $s }  # return the condensed copy, not the substitution count
    $in =~ s/(<tag>)(.*?)(<\/tag>)/ $1 . condense($2) . $3 /gse;  # /s lets . match newlines
    --Dave
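    A minimal, self-contained sketch of the split-then-condense idea, using the sample text from the question. It assumes the literal tag name "tag" and no nested tags of the same name. The helper works on a copy and returns it, because s/// returns the number of substitutions rather than the modified string, and the /s modifier is needed so that . also matches the newlines in the sample.

```perl
use strict;
use warnings;

# Return a condensed copy of the argument.
sub condense {
    my ($s) = @_;
    $s =~ s/\s+/ /g;    # every run of whitespace becomes one space
    return $s;
}

# Sample input from the question (tabs, newlines, and runs of spaces).
my $in = "<tag>\n      The purpose of the applicant rating session is for you,\nthe applicant, to provide a sample of your       \neffective teaching skills.</tag>";

# /s lets . match newlines; /e evaluates the replacement as Perl code.
$in =~ s{(<tag>)(.*?)(</tag>)}{ $1 . condense($2) . $3 }gse;

print "$in\n";
```

    Note that this still leaves a single space right after the opening tag, since the leading run of whitespace is condensed rather than removed.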
Re: taking white space out between closing and opening tags
by Mr. Muskrat (Canon) on Aug 21, 2002 at 20:51 UTC
    I'll help with the regex requirements.
    #!/usr/bin/perl -w
    use strict;

    my $text = "<tag>\n\tThe purpose of the applicant rating session is for you,\nthe applicant, to provide a sample of your\t\neffective teaching skills.</tag>";
    $text =~ s/\s+/ /g;  # convert white space to a single space
    print $text;
    The print statement is just there to show what's taken place.
    edited to match the given example...
    Of course, this still leaves a space between the tag and the text... dpuu and Ovid both give better ways of doing it... but I wasn't paying attention.
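    One way to get rid of that leftover space, sketched under the assumption that whitespace directly next to a tag is never significant (which is not true for inline markup in general, and a literal > or < in the text would also be affected): collapse first, then trim next to the angle brackets.

```perl
use strict;
use warnings;

my $text = "<tag>\n\tThe purpose of the applicant rating session is for you,\nthe applicant, to provide a sample of your\t\neffective teaching skills.</tag>";

$text =~ s/\s+/ /g;     # collapse every run of whitespace to one space
$text =~ s/>\s+/>/g;    # drop whitespace right after a tag
$text =~ s/\s+</</g;    # drop whitespace right before a tag

print "$text\n";
```

    Like the one-liner above, this touches the whole string rather than only the text inside one particular tag, so for real HTML the parser-based approach is the safer choice.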