PerlMonks  

How can I delete characters between < and > in Perl?

by boom (Scribe)
on Apr 18, 2009 at 13:26 UTC

boom has asked for the wisdom of the Perl Monks concerning the following question:

I need to write a Perl script to read in a file, and delete anything inside < >, even if they're on different lines. That is, if the input is:

    Hello, world.
    I <enjoy eating bagels.
    They are quite tasty.
    I prefer when I ate a bagel to when I >ate a sandwich.
    bananas.

I want the output to be:

    Hello, world.
    I ate a sandwich.
    bananas.

I know how to do this with a regex if the text is on one line, but I don't know how to do it across multiple lines. Ultimately I need to conditionally delete parts of a template so I can generate parametrized config files. I thought Perl would be a good language for this, but I am still getting the hang of it.

Replies are listed 'Best First'.
Re: How can I delete characters between < and > in Perl?
by Anonymous Monk on Apr 18, 2009 at 13:54 UTC
    use File::Slurp;
    my $text = read_file( 'filename' );
    $text =~ s!<[^>]+>!!g;
      Needs to be non-greedy:
      $text =~ s!<[^>]+?>!!g;

        What's the difference? The character class ([^>]) is never going to accidentally slurp up closing hoinkies anyway.

        That is incorrect. That's part of the reason for using negated match classes. It cannot over-match.
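
    To make the point above concrete, here is a small sketch that runs both versions of the substitution on the text from the question. The line breaks in the sample string are an assumption, since the original post does not show exactly where they fall. Because [^>] also matches newlines, the same substitution works across lines once the whole file is in one string.

```perl
use strict;
use warnings;

# Sample text from the question; where the line breaks fall is an assumption.
my $text = "Hello, world.\n"
         . "I <enjoy eating bagels.\n"
         . "They are quite tasty.\n"
         . "I prefer when I ate a bagel to when I >ate a sandwich.\n"
         . "bananas.\n";

# Greedy and non-greedy quantifiers on a negated character class.
# [^>] can never match '>', so neither version can run past the
# first closing bracket.
(my $greedy = $text) =~ s/<[^>]+>//g;
(my $lazy   = $text) =~ s/<[^>]+?>//g;

print $greedy;
print $greedy eq $lazy ? "same result\n" : "different result\n";
```

    Note that + requires at least one character between the brackets, so an empty <> is left alone; use * instead if those should go too.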

Re: How can I delete characters between < and > in Perl?
by roboticus (Chancellor) on Apr 18, 2009 at 14:00 UTC
Re: How can I delete characters between < and > in Perl?
by ambrus (Abbot) on Apr 18, 2009 at 21:18 UTC

    If you want to delete matches spanning multiple lines while reading the file line by line: when a line contains an unmatched < sign, delete the rest of that line with an s/// substitution, and check the return value of that substitution to see whether it matched. If it did, set a flag and keep throwing lines away until you find one with a > sign; on that line, delete everything up to and including the sign, then continue applying the ordinary replacements.
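
    The line-by-line approach described above could be sketched roughly as follows. This is one possible reading of it, not necessarily the exact code ambrus had in mind; strip_angle_spans is a hypothetical helper name, and the sample lines assume the line breaks of the question's input.

```perl
use strict;
use warnings;

# strip_angle_spans is a hypothetical name, not something from the thread.
# It takes a list of lines and returns them with every <...> span removed,
# including spans that start on one line and end on a later one.
sub strip_angle_spans {
    my @lines  = @_;
    my $inside = 0;    # true while between an unmatched < and its >
    my @out;
    for my $line (@lines) {
        if ($inside) {
            # Throw the line away unless it contains the closing >;
            # if it does, delete everything up to and including it.
            next unless $line =~ s/^[^>]*>//;
            $inside = 0;
        }
        # Ordinary replacement: spans that open and close on this line.
        $line =~ s/<[^>]*>//g;
        # An unmatched < remains: delete the rest of the line (newline
        # included) and use the return value to set the flag.
        $inside = 1 if $line =~ s/<.*//s;
        push @out, $line;
    }
    return @out;
}

my @sample = (
    "Hello, world.\n",
    "I <enjoy eating bagels.\n",
    "They are quite tasty.\n",
    "I prefer when I ate a bagel to when I >ate a sandwich.\n",
    "bananas.\n",
);
print strip_angle_spans(@sample);
```

    To filter a real file, the same helper could be fed the lines of an open filehandle instead of @sample. For files that fit in memory, slurping the whole text and doing a single substitution is simpler.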

    Note, however, that if you are trying to strip tags from an HTML or XML file, you'd better use a proper module instead of hand-written regexen. A module will cope better with unusual HTML constructs, and also with malformed but common HTML such as text containing unescaped angle brackets. E.g. try something like

    perl -we 'use 5.010; use XML::Twig; binmode STDOUT, "encoding(iso8859-2)"; $twig = XML::Twig->new->parsefile_html($ARGV[0]); say $twig->root->text;' somefile.html
