Substitution Problem

rtlm has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Substitution Problem by meraxes (Friar) on Jul 10, 2004 at 06:54 UTC
The '.' metacharacter doesn't match newlines by default. You'll need to add the s pattern modifier to make it do that. You may want to add the i modifier as well to make it case-insensitive if you don't know that the HTML tags are all uppercase: `s{<FORM(.*?)/FORM>}{replacement text}is;` It may also be worth noting that if the HTML is not well formed, you could end up removing a heck of a lot more than you intended with this regexp. Update: Whoops. Quite right, davido. I assumed that everything was in a single scalar variable. Additionally, for a quick list of regexp modifiers, see perlreref.
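To make the point about the s modifier concrete, here is a minimal sketch using only core Perl. The sample HTML and the variable names are made up for illustration, and the whole document is assumed to sit in one scalar. It also shows why a greedy `.*` can remove far more than intended when a page holds more than one form.

```perl
use strict;
use warnings;

# Hypothetical sample document; the form spans several lines.
my $html = <<'HTML';
<p>before</p>
<FORM action="/cgi-bin/demo">
<input name="q">
</FORM>
<p>after</p>
HTML

# Without /s, '.' cannot cross a newline, so the pattern never matches.
( my $without_s = $html ) =~ s{<form.*?/form>}{[form removed]}i;

# With /s, '.' also matches newlines and the whole form is replaced.
( my $with_s = $html ) =~ s{<form.*?/form>}{[form removed]}is;

# Greedy versus non-greedy on a page with two forms (hypothetical input).
my $two_forms = '<form>a</form><p>keep</p><form>b</form>';

# Greedy '.*' runs from the first <form to the LAST /form>.
( my $greedy = $two_forms ) =~ s{<form.*/form>}{}is;

# Non-greedy '.*?' stops at the nearest /form>; /g repeats for each form.
( my $lazy = $two_forms ) =~ s{<form.*?/form>}{}isg;

print $with_s;
```

The non-greedy form is still only a heuristic: a missing closing tag or a literal "/form>" inside an attribute value will fool it.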
Re^2: Substitution Problem by perldeveloper (Scribe) on Jul 10, 2004 at 15:54 UTC
That's actually perlre; must be a small typo there. Update: No typo, I misread perlreref as perlref.
Re^3: Substitution Problem by meraxes (Friar) on Jul 10, 2004 at 22:44 UTC
Ummmmm... no... perlreref is the regex quick ref. If it were a typo then the link wouldn't have worked. ;) Update: Um... still no. perlreref (perl regex reference faq), perlre (perl regex faq) and perlref (perl references and nested datastructures faq) are quite distinct.
Re^4: Substitution Problem by perldeveloper (Scribe) on Jul 10, 2004 at 23:47 UTC
Re: Substitution Problem by wfsp (Abbot) on Jul 10, 2004 at 09:30 UTC
I agree. This uses HTML::TokeParser. I have found that it is easily adaptable to do any chore you may have parsing html. Since I've started using it I've never used a regex on html. It's never worth the effort.

    #!/bin/perl5
    use strict;
    use warnings;
    use HTML::TokeParser;

    open HTML_FILE, 'form.html' or die;
    my $tp = HTML::TokeParser->new( \*HTML_FILE ) or die;

    my $html;
    my $found_form = 0;
    while ( my $t = $tp->get_token ) {
        $found_form++, next if $t->[0] eq 'S' and $t->[1] eq 'form';
        $found_form--, next if $t->[0] eq 'E' and $t->[1] eq 'form';
        next if $found_form;
        $html .= $t->[4] if $t->[0] eq 'S';
        $html .= $t->[1] if $t->[0] eq 'T' or $t->[0] eq 'C';
        $html .= $t->[2] if $t->[0] eq 'E';
    }
    close HTML_FILE;
    print "$html\n";

    # ["S", $t, $attr, $attrseq, $text]
    # ["E", $t, $text]
    # ["T", $text, $is_data]
    # ["C", $text]
    # ["D", $text]
    # ["PI", $token0, $text]

wfsp
Re: Substitution Problem by davido (Cardinal) on Jul 10, 2004 at 06:58 UTC
You didn't mention how you're reading in the document. It may be necessary, in addition to using the /s modifier on your substitution, to also slurp the entire file at once. Otherwise, you'll probably just be reading it one line at a time, and that could foul up your matching. Dave
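A rough sketch of the difference, assuming the form spans several lines. To keep it self-contained it reads from an in-memory filehandle rather than a real file; the sample text and variable names are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical document contents standing in for the real file.
my $document = "<p>before</p>\n<FORM>\n<input>\n</FORM>\n<p>after</p>\n";

# Reading a line at a time: no single line holds both <FORM and /FORM>,
# so the substitution never fires, even with /s.
open my $by_line, '<', \$document or die "cannot open in-memory file: $!";
my $line_hits = 0;
while ( my $line = <$by_line> ) {
    $line_hits++ if $line =~ s{<form.*?/form>}{}is;
}
close $by_line;

# Slurping: undefine the input record separator locally so a single
# read returns the whole file, then substitute across the newlines.
open my $whole, '<', \$document or die "cannot open in-memory file: $!";
my $content = do { local $/; <$whole> };
close $whole;
my $slurp_hits = ( $content =~ s{<form.*?/form>}{}is ) ? 1 : 0;

print "line-at-a-time matches: $line_hits, slurped matches: $slurp_hits\n";
```

With a real file, the same `do { local $/; <$fh> }` idiom applies to a handle opened on the file's path.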
Re: Substitution Problem by beable (Friar) on Jul 10, 2004 at 08:43 UTC
You should really consider using a module like HTML::Parser if you can. It is very difficult to write a regex which matches arbitrary HTML. Consider the Perl FAQ entry How do I remove HTML from a string?.
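As one possible shape for the HTML::Parser route, here is a hedged sketch that drops everything between form tags and passes the rest through unchanged. The sub name strip_forms and the sample input are assumptions made for the example, not something from the original question, and HTML::Parser must be installed from CPAN.

```perl
use strict;
use warnings;
use HTML::Parser;

# strip_forms is a hypothetical helper: it returns the HTML with every
# <form>...</form> element (tags and contents) left out.
sub strip_forms {
    my ($html) = @_;
    my $depth = 0;     # greater than zero while inside a form
    my $out   = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'tagname' arrives lowercased, so <FORM> and <form> both match.
        start_h => [
            sub {
                my ( $tag, $text ) = @_;
                if ( $tag eq 'form' ) { $depth++; return }
                $out .= $text unless $depth;
            },
            'tagname, text',
        ],
        end_h => [
            sub {
                my ( $tag, $text ) = @_;
                if ( $tag eq 'form' ) { $depth-- if $depth; return }
                $out .= $text unless $depth;
            },
            'tagname, text',
        ],
        # Text, comments, declarations and so on are copied as-is.
        default_h => [
            sub { $out .= $_[0] if defined $_[0] and not $depth },
            'text',
        ],
    );

    $parser->parse($html);
    $parser->eof;      # flush any buffered text
    return $out;
}

print strip_forms('<p>a</p><FORM action="x"><input></FORM><p>b</p>'), "\n";
```

Unlike the regex, this does not care about case, newlines, or what appears inside attribute values, although badly nested forms would still need extra handling.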

