forestcreature has asked for the wisdom of the Perl Monks concerning the following question:
I feel certain that this sort of question will have been asked before, but I just can't find it! So sorry about that...
I have a couple of substitutions that replace e.g. \n with line-break tags and \t with several non-breaking spaces. However, I would like to make the regexes not match anything enclosed in pre tags.
e.g. currently
$$body=~s{\n}{<br/>}gis;
e.g. I would like:
$$body=~s{(!PRECEDED BY SOMETHING.*)\n(!FOLLOWED BY .*SOMETHING)}{<br/>}gis;
I guess a lookahead would handle the second condition, but the first one (i.e. don't match if preceded by a pre tag and/or a bunch of arbitrary other characters) won't work with a lookbehind, because they can't be of arbitrary length, right? So I'm a little stuck...
Cheers, JJ
Re: regex in form !regex->regex<-!regex
by ELISHEVA (Prior) on Feb 23, 2011 at 15:46 UTC
This kind of problem is exactly why parsing HTML with regexes is not recommended. Sooner or later one runs into this sort of problem: you need to treat something one way if it is outside certain tags and another way inside those tags.
I wonder if you could explain more about what you are trying to accomplish. Where is this white space you are replacing - are you trying to beautify the layout of an HTML file? Or are you trying to do replacements inside some tags but not others (like <pre>)?
If you are beautifying, I would seriously consider using an HTML parser like HTML::Parser or some similar module (search CPAN - there are a lot of variants). This will hand the document to you node by node, and you can pretty-print each element exactly as you wish, with as much or as little white space as you desire. There might also be a CPAN module that does all this for you, although my very brief attempts at searching (on the keywords "html" and "beautify") turned up nothing useful. Your mileage may vary.
If you are doing your magic on attribute values or the text between tags, I still strongly recommend that you consider using an HTML parsing module. Instead of trying to guess at whether you are operating on the right sort of tag, a parser will hand you the HTML element by element and you can choose exactly which elements you wish to beautify and what part of those elements (attribute values, text between tags).
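To make the "element by element" idea concrete, here is a minimal sketch using the HTML::Parser version 3 event API. The helper name format_text_nodes and the choice of four non-breaking spaces per tab are assumptions made for illustration, not part of any existing plugin; the sketch only rewrites text nodes that are not inside a pre element and passes everything else through unchanged.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: convert \n and \t in text nodes outside <pre>.
sub format_text_nodes {
    my ($html) = @_;
    my $out       = '';
    my $pre_depth = 0;    # greater than 0 while inside a <pre> element

    my $parser = HTML::Parser->new(
        api_version => 3,
        # "tagname" is the lower-cased tag name, "text" the original source text
        start_h => [ sub {
            my ($tag, $text) = @_;
            $pre_depth++ if $tag eq 'pre';
            $out .= $text;
        }, 'tagname, text' ],
        end_h => [ sub {
            my ($tag, $text) = @_;
            $pre_depth-- if $tag eq 'pre' && $pre_depth > 0;
            $out .= $text;
        }, 'tagname, text' ],
        text_h => [ sub {
            my ($text) = @_;
            if (!$pre_depth) {
                $text =~ s{\n}{<br/>}g;
                # Assumption: one tab becomes four non-breaking spaces
                $text =~ s{\t}{&nbsp;&nbsp;&nbsp;&nbsp;}g;
            }
            $out .= $text;
        }, 'text' ],
        # Comments, declarations and the like are copied through verbatim
        default_h => [ sub { $out .= $_[0] }, 'text' ],
    );

    $parser->parse($html);
    $parser->eof;
    return $out;
}

print format_text_nodes("<p>one\ntwo</p><pre>a\n\tb</pre>\n"), "\n";
```

Because the parser tracks the pre element itself, upper-case tags and attributes on the opening tag should need no special handling in the calling code.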
Yup, I'm just trying to implement a simple blog (using loathsxome, which is based on blosxom). Posts are based on text files, with or without meta-data (parsed out by one of the existing plugins).
I'm just creating a very simple auto-formatting plugin that will come closest to representing posts in much the same way as I'd format a plain ASCII text file. Most of the text needs to wrap and behave like text does, hence an approximation of tabbing (4 non-breaking spaces), and <br/> instead of \n. There is also a quick and dirty syntax for hyperlinks and images. Pretty simple stuff done with a few substitution regexes. The only thing that is giving me trouble is saving ASCII art (or 'properly' tabbed stuff) in <pre> from the same treatment.
I suppose either I could break things up element-wise like you suggest, or perhaps write a last set of substitutions that just reinstates \n and \t for all cases enclosed in <pre>... Even though that seems wasteful and stupid, is it worse than invoking a module to do something simple?
Cheers, JJ
p.s. I don't know the answer to that, as I'm not a real programmer. My hunch is "yes". :) My second hunch is TMTOWTDI
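For illustration, the "break things up" option mentioned above could look roughly like this in core Perl, with no module at all. This is only a sketch: the helper name autoformat is hypothetical, and it assumes pre blocks are never nested and are always properly closed.

```perl
use strict;
use warnings;

# Hypothetical helper: format everything except <pre>...</pre> blocks.
sub autoformat {
    my ($body) = @_;

    # Because the pattern has a capturing group, split returns the <pre>
    # blocks as well, so they sit at the odd-numbered positions of the list.
    my @parts = split m{(<pre\b[^>]*>.*?</pre>)}is, $body;

    for my $i (0 .. $#parts) {
        next if $i % 2;    # leave <pre> blocks untouched
        $parts[$i] =~ s{\n}{<br/>}g;
        $parts[$i] =~ s{\t}{&nbsp;&nbsp;&nbsp;&nbsp;}g;    # 4 non-breaking spaces
    }
    return join '', @parts;
}

print autoformat("one\ntwo\t<pre>a\n\tb</pre>three\n"), "\n";
```

Unlike the "reinstate afterwards" option, nothing inside the pre blocks is ever modified, so no information about the original tabs is lost.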
perhaps write a last set of substitutions that just reinstates \n and \t for all cases enclosed in <pre>...
You could "reinstate" those tabs, but how would you know which white space was meant to be a tab (whose width depends on settings) and which was meant to be a hard-coded, specific amount of space? The "reinstate" solution loses information. If that information matters, it isn't going to be a satisfactory solution.
Even though that seems wasteful and stupid, is it worse than invoking a module to do something simple?
Modules are cheap. Your time isn't.
What you want to do is not as simple as it first seems. This is only the first of many complications you are likely to run into. As someone who has studied Mediawiki's markup parsing, I can almost promise you that you will end up with a lot of ugliness if you try to do everything with regexes.
It doesn't take perl a lot to load in a module. It is designed for that sort of thing. It may not even take up extra space on your server. HTML::Parser is such a standard module that some distros and hosting companies just make it available as a matter of course. But even if you have to install it, if learning and using a module will help you do the job better and save you time over the full course of your project, you should leap at it.
For what you want to do, getting hands-on experience with HTML::Parser will open a lot of doors for you. For one, it will give you options about how much HTML you want to integrate into your markup. Using a module to do something simple in a way that gives you expansion room is a very smart move.
I'm not a fan of using modules for every 5 line snippet I can write and test just as easily myself. However, a module like HTML::Parser represents a lot of work done for you: testing and debugging a lot of corner cases and gotchas. I'd also explore CPAN to see if there are already parsing modules for the kind of blog markup you want to do. Why invent your own markup from the get-go (unless this is a learning exercise), if it turns out that you can adapt the work of someone else who is 80% there?
Re: regex in form !regex->regex<-!regex
by kennethk (Abbot) on Feb 23, 2011 at 16:19 UTC
As ELISHEVA rightly points out, this really is a job for an HTML parser. The task you want to accomplish is generally not worth the effort it takes for the result - probably the two most challenging aspects for getting your desired result are the possibility of nested tags and the lack of support for variable width look-behinds (Looking ahead and looking behind). You could get something like your desired behavior with:
#!/usr/bin/perl
use strict;
use warnings;
my $text = <<EOT;
<p>This is a line
with a break.</p><pre>This is a pre
with a break.</pre><p>This is a line
with a break.</p>
EOT
1 while $text =~ s{^((?:(?!<pre>).|<pre>(?:(?!</pre>).)*</pre>)*?)\n}{$1<br/>}is;
print $text;
which outputs
<p>This is a line<br/>with a break.</p><pre>This is a pre
with a break.</pre><p>This is a line<br/>with a break.</p><br/>
YAPE::Regex::Explain breaks this down as follows.
The regular expression:
(?is-mx:^((?:(?!<pre>).|<pre>(?:(?!</pre>).)*</pre>)*?)\n)
matches as follows:
NODE                     EXPLANATION
----------------------------------------------------------------------
(?is-mx:                 group, but do not capture (case-insensitive)
                         (with . matching \n) (with ^ and $ matching
                         normally) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  ^                        the beginning of the string
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
----------------------------------------------------------------------
      (?!                      look ahead to see if there is not:
----------------------------------------------------------------------
        <pre>                    '<pre>'
----------------------------------------------------------------------
      )                        end of look-ahead
----------------------------------------------------------------------
      .                        any character
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      <pre>                    '<pre>'
----------------------------------------------------------------------
      (?:                      group, but do not capture (0 or more
                               times (matching the most amount
                               possible)):
----------------------------------------------------------------------
        (?!                      look ahead to see if there is not:
----------------------------------------------------------------------
          </pre>                   '</pre>'
----------------------------------------------------------------------
        )                        end of look-ahead
----------------------------------------------------------------------
        .                        any character
----------------------------------------------------------------------
      )*                       end of grouping
----------------------------------------------------------------------
      </pre>                   '</pre>'
----------------------------------------------------------------------
    )*?                      end of grouping
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  \n                       '\n' (newline)
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
Note that you have to rerun the regex (as opposed to using the g modifier) since you always have to anchor at the start. Also note the trailing <br/>. That hints at a larger problem: are you absolutely certain you want to change all newlines in your input? They tend to show up in strange locations. It's all these corner cases that make a pre-built library so worthwhile. HTML::Parser has been tested and debugged for 15 years, not the 15 minutes one would like to spend.
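If rerunning an anchored regex feels awkward, one alternative that keeps the g modifier is to let the pattern match a whole pre block first and put it back unchanged. The following is only a sketch under the same assumption as above (no nested pre blocks); the helper name br_outside_pre is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: replace newlines outside <pre> blocks in one pass.
sub br_outside_pre {
    my ($text) = @_;

    # The first branch consumes a complete <pre>...</pre> block and captures
    # it; the /e replacement then re-inserts it verbatim. Only a newline
    # outside such a block can reach the second branch, where $1 is undefined.
    $text =~ s{(<pre\b[^>]*>.*?</pre>)|\n}{ defined $1 ? $1 : '<br/>' }gise;
    return $text;
}

my $text = <<EOT;
<p>This is a line
with a break.</p><pre>This is a pre
with a break.</pre><p>This is a line
with a break.</p>
EOT

print br_outside_pre($text), "\n";
```

The newline at the very end of the input is still converted, so the question of whether every newline should really become a line break applies here as well.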
Re: regex in form !regex->regex<-!regex
by AnomalousMonk (Archbishop) on Feb 23, 2011 at 23:19 UTC
>perl -wMstrict -le
"my $s =
qq{foo\nbar\t<pRe> no\nnot\tnever </PrE> x\ty\nz };
;;
my %replace = (
qq{\n} => '<br/>',
qq{\t} => ' ',
);
;;
my $pre = qr{ (?i) <pre> [^<]* </pre> }xms;
;;
print qq{[[$s]]};
$s =~ s{ $pre (*SKIP) (*FAIL) | ([\n\t]) }{$replace{$1}}xmsg;
print qq{[[$s]]};
"
[[foo
bar <pRe> no
not never </PrE> x y
z ]]
[[foo<br/>bar <pRe> no
not never </PrE> x y<br/>z ]]
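For readers who have not met them: (*SKIP) and (*FAIL) are backtracking control verbs, available since Perl 5.10. When the left branch matches a whole pre block, (*FAIL) rejects the match and (*SKIP) makes the engine resume after the block rather than inside it, so only whitespace outside pre ever reaches the right branch. The one-liner's [^<]* does not allow any markup inside the block, so here is a slightly looser variant as a script. It is a sketch only: the helper name skip_pre is hypothetical, the four non-breaking spaces per tab are an assumption taken from the original question, and pre blocks are still assumed not to nest.

```perl
use strict;
use warnings;

my %replace = (
    "\n" => '<br/>',
    "\t" => '&nbsp;&nbsp;&nbsp;&nbsp;',    # assumption: 4 non-breaking spaces per tab
);

# Looser than the [^<]* version: allows attributes on the opening tag and
# other markup inside the block. Still assumes <pre> blocks do not nest.
my $pre = qr{ <pre\b [^>]* > .*? </pre> }xmsi;

# Hypothetical helper wrapping the substitution.
sub skip_pre {
    my ($s) = @_;
    # A matched <pre> block is skipped and then failed, so the engine
    # restarts after it; only the right-hand branch ever replaces anything.
    $s =~ s{ $pre (*SKIP) (*FAIL) | ([\n\t]) }{$replace{$1}}xmsg;
    return $s;
}

print skip_pre(qq{foo\nbar\t<pRe class="art"> <b>no</b>\nnot\tnever </PrE> x\ty\nz}), "\n";
```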