Refactoring Regular Expressions

logie17 has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I have a string of data that contains essentially encoded HTML text and could be any size. I take a substring of it, then use a regular expression to search and replace the HTML-encoded tags. I then use a second regular expression to chop off any unclosed HTML-encoded tag that might be left at the end of the string. To make a long story short, my code appears to be working, but I'm sure there is a way to refactor my regular expressions, perhaps into a single statement. Any help would be highly appreciated.
Example Code

$string = substr($incoming, 0, 177);
$string =~ s/\&lt\;?[-.a-zA-Z0-9]*.*?\&gt\;//gs; # This does a great job axing basically anything that starts with a < and ends with a >
$string =~ s/\&lt\;?[-.a-zA-Z0-9]*.*$//gs; # Due to my substring there is sometimes an open bracket such as <img src...chomp

Is there any way to combine the two regex examples? Thanks again.

s;;5776?12321=10609$d=9409:12100$xx;;s;(\d*);push @_,$1;eg;map{print chr(sqrt($_))."\n"} @_;


Replies are listed 'Best First'.
Re: Refactoring Regular Expressions by chromatic (Archbishop) on Apr 25, 2007 at 06:14 UTC
How "essentially HTML" is this HTML? If it's actually HTML, I would instead decode it and then use HTML::Strip to remove the HTML. Writing a reliable regular expression to parse potentially nested tags properly is difficult.
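A minimal sketch of that decode-then-strip approach, assuming the HTML::Entities and HTML::Strip modules are installed from CPAN (neither ships with core Perl). The sample string and the variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);
use HTML::Strip;

# Hypothetical sample input: entity-encoded HTML, as described in the question.
my $incoming = 'Hello &lt;b&gt;world&lt;/b&gt;, welcome &lt;i&gt;back&lt;/i&gt;.';

# Step 1: turn the entities back into real markup.
my $html = decode_entities($incoming);

# Step 2: let HTML::Strip remove the tags.
my $hs   = HTML::Strip->new();
my $text = $hs->parse($html);
$hs->eof;    # reset the parser state before reusing $hs on another document

# Step 3: truncate only after the markup is gone.
my $string = substr($text, 0, 177);
print "$string\n";
```

Truncating last means the cut can no longer land inside a tag, so the second clean-up regex should not be needed.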
Re: Refactoring Regular Expressions - HTML encoded text by varian (Chaplain) on Apr 25, 2007 at 06:51 UTC
Logie17, the HTML::Entities module deals with HTML-encoded text (entities).

#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);
my $htmlstring = 'This text contains an encoded "&lt;" tag';
print decode_entities($htmlstring), "\n";

This will output:

This text contains an encoded "<" tag

Update: Your remark on encoded HTML text may have set me off in the wrong direction; apparently you just wanted to remove any tags. It struck me that if you split up the string into substrings you may accidentally split in the middle of an encoded tag, and as a result your regex would fail to match the tag.
Re: Refactoring Regular Expressions by Herkum (Parson) on Apr 25, 2007 at 11:44 UTC
A big regular expression is not only harder to read and debug, it will be slower as well. If you can use some common sense to break up a string so that you use a minimal regexp, then you are doing the right thing.
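For comparison, here is a rough sketch of what folding the two substitutions from the question into one could look like. The sample string is hypothetical, and the pattern is just one possible reading of the original pair, not a tested replacement for every input.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample: already truncated, so the last tag is left unclosed.
my $string = 'Hello &lt;b&gt;world&lt;/b&gt;, see &lt;img src="x.png';

$string =~ s{
    &lt;?             # encoded "<" (semicolon optional, as in the original patterns)
    .*?               # tag contents, as few characters as possible
    (?: &gt; | \z )   # up to the encoded ">", or the end of the string if none follows
}{}gsx;

print "[$string]\n";    # prints "[Hello world, see ]"
```

The /x flag allows the pattern to be spread over several commented lines, which helps with readability. Whether the single statement is actually easier to maintain than the two simple ones is a judgment call.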
Re: Refactoring Regular Expressions by graff (Chancellor) on Apr 26, 2007 at 01:22 UTC
I don't think it should be a question of refactoring the regexes. Instead, you should refactor the part that involves using "substr" in the first place. Why are you using substr at all? There seems to be no need to do that, and if you don't use it, you'll only need the one regex (the first one) -- unless of course the data turns out to be malformed in some way. But I would agree with the suggestion to use modules for this. In the long run, you'll be glad you did.
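If a length limit is still wanted, one way to read this advice is to strip the tags from the full string first and truncate afterwards. A minimal sketch, assuming the input is well formed (every encoded tag is closed); the sample data and the simplified pattern are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample input, long enough that truncation actually happens.
my $incoming = 'Intro &lt;p class="lead"&gt;' . ('word ' x 50) . '&lt;/p&gt;';

# Remove the encoded tags from the whole string first ...
(my $text = $incoming) =~ s/&lt;.*?&gt;//gs;

# ... and only then cut it down to the desired length.
my $string = substr($text, 0, 177);

print length($string), "\n";    # prints 177
```

With this ordering the 177 characters count only the remaining text rather than the encoded markup, and no half-cut tag can appear at the end. Other entities, such as an encoded ampersand, could still be split by the cut.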
