Lady_Aleena has asked for the wisdom of the Perl Monks concerning the following question:
I have become somewhat enamored of how text is parsed by POD for inline HTML elements, like B<a string> or I<a string>. I have gotten a regex to work for them, however I have not been able to figure out how to nest them such as I<B<a string>>. Here is what I have so far:
s/([A-Z]+)<(.+?)>/<\L$1\E>$2<\/\L$1\E>/g;
I have not gotten into adding ids, classes, styles, or other attributes yet.
Here is the list of tags I may use with this regex:
use strict;
use warnings;
while ( my $string = <DATA>) {
$string =~ s/([A-Z]+)<(.+?)>/<\L$1\E>$2<\/\L$1\E>/g;
print $string;
}
__DATA__
ABBR<a string>
ACRONYM<a string>
B<a string>
BIG<a string>
CITE<a string>
CODE<a string>
DFN<a string>
EM<a string>
I<a string>
KBD<a string>
SAMP<a string>
SMALL<a string>
SPAN<a string>
STRONG<a string>
SUB<a string>
SUP<a string>
TT<a string>
VAR<a string>
This is a B<string> which I want to I<parse> almost like EM<a POD>. CO
+DE<Let's see if I can do it.>
Before you ask, yes, I looked for it (for about an hour) on CPAN to no avail.
Thanks in advance!
Have a cookie and a very nice day!
Lady Aleena
Re: POD style regex for inline HTML elements
by Anonymous Monk on Apr 15, 2013 at 00:41 UTC
|
| [reply] |
Re: POD style regex for inline HTML elements
by Lady_Aleena (Curate) on Apr 15, 2013 at 05:52 UTC
|
Here is another sample, with a mix.
SPAN<text I<italic text>|class="span_class"> some more text B<bold tex
+t>.
Here is the desired output:
<span class="span_class">text <i>italic text</i></span> some more text
+ <b>bold text</b>.
Have a cookie and a very nice day!
Lady Aleena
| [reply] [d/l] [select] |
|
#!/usr/bin/perl --
use strict; use warnings;
use Test::More qw' no_plan ';
use Regexp::Common qw/ balanced /;
use Data::Dump qw/ dd pp /;
sub TRACE; sub DEBUG;
*TRACE = *DEBUG = sub { print STDERR @_,"\n" };
#~ *TRACE = *DEBUG = sub { };
our $allowed = join '|', qw[
ABBR ACRONYM B BIG
CITE CODE DFN EM I KBD SAMP SMALL
SPAN STRONG SUB SUP TT VAR
];
{
my $in = 'SPAN<text I<italic text>|class="span_class"> some more t
+ext B<bold text>.';
my $out = shabba($in);
my $wanted = '<span class="span_class">text <i>italic text</i></sp
+an> some more text <b>bold text</b>.';
is( $out, $wanted , 'shabba');
}
{
my $in = 'SPAN<text I<italic B<and bold> text>|class="span_class">
+ some more text B<bold text>.';
my $out = shabba($in);
my $wanted = '<span class="span_class">text <i>italic <b>and bold<
+/b> text</i></span> some more text <b>bold text</b>.';
is( $out, $wanted , 'shabba');
}
exit( 0 );
sub shabba {
local $_ = $_[0];
my $dent = $_[1] || 0;
pos = 0;
my $ret = "";
SHABBALOOP:
while( length > pos ){
m{\G(\s+)}gcsx and do {
$ret .= $1;
next SHABBALOOP;
};
m{
\G( $allowed )( $RE{balanced}{-parens=>'<>'} )
}gcsx and do {
TRACE "# $dent allowed<> { $1 ( $2 ) }";
$ret .= shabba_allowed( "$1" , "$2" , $dent );
next SHABBALOOP;
};
#~ confusion :)
#~ \G(\w+\b)
#~ fail #~ \G([^<]+)(?!:$allowed)\b
#~ \w+\b #~ \G([^<]+?)(?!:$allowed\<)\b
#~ fail #~ \G([^<]+?)(?!:$allowed\<)
#~ fail #~ \G([^<]+)(?!:$allowed\<)
#~ fail \G([^<]+)(?!:\<)
#~ inch #~ \G([^<]+?)(?!:\<)
#~ \G([^<]+?\b[^<])
#~ \G([^<]+?[^<])
#~ FAIL #~ \G([^<]+[^<])
#~ \G([^<]+[^<]\b)
#~ 2same#~ \G( (?!:$allowed\<) .+ )
#~ 2same#~ \G( .+(?!:$allowed\<) )
m{
\G([^<]+\s)
}gcmx and do {
TRACE "# $dent text { $1 }";
$ret .= shabba_text( "$1" );
next SHABBALOOP;
};;;
m{
\G([\<\>])
}gcmx and do {
TRACE "## $dent error-stray<> { $1 } at pos(@{[pos]})";
last SHABBALOOP;
};;;
m{
\G(\S)
}gcmx and do {
TRACE "# $dent inch-forward { $1 }";
$ret .= shabba_text( "$1" );
next SHABBALOOP;
};;;
}
$ret;
}
#~ sub shabba_allowed { join'',@_ }
#~ confusion :)
#~ use Text::Balanced qw' :ALL ';
#~ dd([ extract_multiple( $stuff,[\&extract_bracketed, ],)]);
#~ my $extract_allowed = gen_extract_tagged(qw/$allowed</,'>');dd(
+[ extract_multiple( $stuff,[ $extract_allowed , $extract_allowed , ],
+)]);
#~
#~ 0 and $stuff =~ s{
#~ ( $allowed )( $RE{balanced}{-parens=>'<>'} )
#~ |
#~ (.)
#~ }{
#~ if( defined $2 ){
#~ $ret .= $2;
#~ } else {
#~ $ret .= shabba_allowed( "$1" );
#~ }
#~ "";
#~ }gsex;
sub shabba_allowed {
my( $tag , $stuff, $dent ) = @_;
$stuff = $1 if $stuff =~ m{^<(.*)>$}gs;
my $ret = "";
$ret .= "<\L$tag\E" if $tag;
$stuff =~ s{\|([^<>]+)$}{
$ret .= " $1"; ## shabba_allowed_atts($tag,$1);
"";
}gsex if defined $stuff ;
$ret .= ">" if $tag;
if( defined $stuff and length $stuff and $stuff =~ m{[<>]}g ){
$ret .= shabba( $stuff , $dent+1) ; ## recurse
} else {
$ret .= $stuff;
}
$ret .= "</\L$tag\E>" if $tag;
$ret;
}
sub shabba_text { join'',@_ }
__END__
$ prove -vb lady.alena.balanced.podlike.pl
lady.alena.balanced.podlike.pl .. # 0 allowed<> { SPAN ( <text I<itali
+c text>|class="span_class"> ) }
# 1 text { text }
# 1 allowed<> { I ( <italic text> ) }
# 0 text { some more text }
# 0 allowed<> { B ( <bold text> ) }
# 0 inch-forward { . }
# 0 allowed<> { SPAN ( <text I<italic B<and bold> text>|class="span_cl
+ass"> ) }
ok 1 - shabba
# 1 text { text }
# 1 allowed<> { I ( <italic B<and bold> text> ) }
# 2 text { italic }
# 2 allowed<> { B ( <and bold> ) }
# 2 inch-forward { t }
# 2 inch-forward { e }
# 2 inch-forward { x }
# 2 inch-forward { t }
# 0 text { some more text }
# 0 allowed<> { B ( <bold text> ) }
# 0 inch-forward { . }
ok 2 - shabba
1..2
ok
All tests successful.
Files=1, Tests=2, 0 wallclock secs ( 0.11 usr + 0.02 sys = 0.12 CPU
+)
Result: PASS
| [reply] [d/l] |
|
This would be beautiful as a module. I am having problems wading through it. There are a lot of things in this I do not understand, and I am cargo culting it as best I can. Thank you for your work, Anonymous Monk (whoever you are).
PS. If you see this, would you please tell me what shabba stands for? :)
Have a cookie and a very nice day!
Lady Aleena
| [reply] [d/l] |
|
|
|
|
|
Re: POD style regex for inline HTML elements
by RonW (Parson) on Nov 06, 2014 at 23:21 UTC
|
| [reply] |
|
#!/usr/bin/perl
use strict;
use warnings FATAL => qw( all );
use Text::Balanced qw(extract_bracketed);
use Data::Dumper;
my $text = 'A line with B<bold>, I<italic>, and B<I<bold and italic>>
+text.';
my @line = extract_bracketed( $text, '<>');
print Dumper(\@line);
Here is the returned results...
$VAR1 = [
undef,
'A line with B<bold>, I<italic>, and B<I<bold and italic>> t
+ext.',
undef
];
Either I did something horribly wrong, or it doesn't extract anything just returns the original string with undefs in an array.
No matter how hysterical I get, my problems are not time sensitive. So, relax, have a cookie, and a very nice day!
Lady Aleena
| [reply] [d/l] [select] |
|
Hi Aleena,
The extract_* functions are meant to operate on the start of a string, not from an arbitrary point. As mentioned in the Text::Balanced description, you may skip a prefix before the start of the balanced text, but by default this will only skip whitespace.
So if you were to change text to:
my $text = ' <bold>, I<italic>, and B<I<bold and italic>> text.';
Your output would be:
$VAR1 = [
'<bold>',
', I<italic>, and B<I<bold and italic>> text.',
' '
];
Where the return is a triple of the bracketed text, the remaining string, and the prefix that was bypassed before the bracketed text was found.
If you leave your $text input as it was in your example but change the function call to consider everything preceding a < as a prefix:
my @line = extract_bracketed($text, '<>', qr(.*?(?=<)));
You'll get:
$VAR1 = [
'<bold>',
', I<italic>, and B<I<bold and italic>> text.',
'A line with B'
];
Where the prefix is again everything before the <. but includes the bold code at the end, which you'd have to deal with appropriately.
HTH | [reply] [d/l] [select] |
|
|