regex in form !regex->regex<-!regex

by forestcreature (Novice)
on Feb 23, 2011 at 15:24 UTC

forestcreature has asked for the wisdom of the Perl Monks concerning the following question:

I feel certain that this sort of question will have been asked before, but I just can't find it! So sorry about that...

I have a couple of substitutions that replace e.g. \n with line-break tags and \t with several non-breaking spaces. However, I would like the regexes not to match anything enclosed in pre tags.

e.g. currently
$$body=~s{\n}{<br/>}gis;
e.g. I would like:
$$body=~s{(!PRECEDED BY SOMETHING.*)\n(!FOLLOWED BY .*SOMETHING)}{<br/>}gis;

I guess a lookahead would handle the second condition (i.e. don't match if followed by a bunch of arbitrary other characters and a closing pre tag), but the first condition (don't match if preceded by a pre tag and a bunch of arbitrary other characters) needs a lookbehind, and that won't work because lookbehinds can't be of arbitrary length, right? So I'm a little stuck...

Cheers,
JJ

Re: regex in form !regex->regex<-!regex
by ELISHEVA (Prior) on Feb 23, 2011 at 15:46 UTC

    This kind of problem is exactly why parsing HTML with regexes is not recommended. Sooner or later one runs into this sort of problem: you need to treat something one way if it is outside certain tags and another way inside those tags.

    I wonder if you could explain more about what you are trying to accomplish. Where is this white space you are replacing - are you trying to beautify the layout of an HTML file? Or are you trying to do replacements inside some tags but not others (like <pre>)?

    If you are beautifying, I would seriously consider using an HTML parser like HTML::Parser or some similar module (search CPAN - there are a lot of variants). This will hand the document to you node by node and you pretty print each element exactly as you wish with as much or as little white space as you desire. There might also be a CPAN module that does all this for you, although my very brief attempts at searching (on the key words "html", "beautify") turned up nothing useful. Your mileage may vary.

    If you are doing your magic on attribute values or the text between tags, I still strongly recommend that you consider using an HTML parsing module. Instead of trying to guess at whether you are operating on the right sort of tag, a parser will hand you the HTML element by element and you can choose exactly which elements you wish to beautify and what part of those elements (attribute values, text between tags).
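
A minimal sketch of the parser-based approach described above, assuming HTML::Parser (version 3 API) is installed from CPAN. The helper name format_outside_pre and the four-&nbsp; tab width are illustrative choices, not something taken from the thread. The handlers copy markup through unchanged and rewrite \n and \t only in text seen while no <pre> is open.

```perl
use strict;
use warnings;

# HTML::Parser is a CPAN module, not core Perl. Loading it at run time lets
# the script report a missing install instead of failing to compile.
my $have_parser = eval { require HTML::Parser; 1 };

# Hypothetical helper: convert \n and \t everywhere except inside <pre>.
sub format_outside_pre {
    my ($html) = @_;
    my $out    = '';
    my $in_pre = 0;    # depth counter, so a nested <pre> is tolerated

    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'tagname' arrives lower-cased; 'text' is the original markup
        start_h => [ sub {
            my ($tag, $text) = @_;
            $in_pre++ if $tag eq 'pre';
            $out .= $text;
        }, 'tagname, text' ],
        end_h => [ sub {
            my ($tag, $text) = @_;
            $in_pre-- if $tag eq 'pre' && $in_pre > 0;
            $out .= $text;
        }, 'tagname, text' ],
        text_h => [ sub {
            my ($text) = @_;
            if (!$in_pre) {
                $text =~ s{\n}{<br/>}g;
                $text =~ s{\t}{&nbsp;&nbsp;&nbsp;&nbsp;}g;
            }
            $out .= $text;
        }, 'text' ],
        # comments, declarations and anything else pass through untouched
        default_h => [ sub { $out .= $_[0] if defined $_[0] }, 'text' ],
    );

    $parser->parse($html);
    $parser->eof;    # flush any buffered trailing text
    return $out;
}

if ($have_parser) {
    print format_outside_pre("a\nb<pre>c\n\td</pre>\te\n"), "\n";
}
else {
    print "HTML::Parser is not installed\n";
}
```

Because the parser lower-cases tag names, <PRE> and <pre> are treated alike, and the text handler never sees tag markup, so there is nothing to guess about where an element starts or ends.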

      Yup, I'm just trying to implement a simple blog (using loathsxome, which is based on blosxom). Posts are based on text files, with or without meta-data (parsed out by one of the existing plugins).

      I'm just creating a very simple auto-formatting plugin that will come closest to representing posts in much the same way as I'd format a plain ascii text file. Most of the text needs to wrap and behave like text does, hence an approximation of tabbing (4 non-breaking spaces), and <br/> instead of \n. There is also a quick and dirty syntax for hyperlinks and images. Pretty simple stuff done with a few substitution regexes. The only thing that is giving me trouble is saving ascii art (or 'properly' tabbed stuff) in <pre> from the same treatment.

      I suppose either I could break things up element-wise like you suggest, or perhaps write a last set of substitutions that just reinstates \n and \t for all cases enclosed in <pre>... Even though that seems wasteful and stupid, is it worse than invoking a module to do something simple?
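
For the "break things up" option, here is a minimal core-Perl sketch, not something posted in the thread. It assumes <pre> blocks are never nested and are always closed; the name format_post and the four-&nbsp; tab are illustrative. A capturing split keeps the <pre> blocks in the result list, so they always sit at the odd indexes and can simply be skipped.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the substitutions only outside <pre>...</pre>.
sub format_post {
    my ($body) = @_;

    # The capturing parentheses make split return the separators as well,
    # so the list alternates: outside text, <pre> block, outside text, ...
    my @chunks = split m{(<pre\b[^>]*>.*?</pre>)}is, $body;

    for my $i (0 .. $#chunks) {
        next if $i % 2;    # odd index: a <pre> block, leave it alone
        $chunks[$i] =~ s{\n}{<br/>}g;
        $chunks[$i] =~ s{\t}{&nbsp;&nbsp;&nbsp;&nbsp;}g;
    }
    return join '', @chunks;
}

print format_post("line one\n\tline two<pre>ascii\n\tart</pre>after\n"), "\n";
```

This avoids lookbehind entirely, but it inherits the usual regex limits: an unclosed or nested <pre> will be handled wrongly.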

      Cheers,
      JJ

      p.s. I don't know the answer to that, as I'm not a real programmer. My hunch is "yes". :) My second hunch is TMTOWTDI

        perhaps write a last set of substitutions that just reinstates \n and \t for all cases enclosed in <pre>...

        You could "reinstate" those tabs - but how would you know which white space was meant to be a tab (whose width depends on settings) and which was meant to be a hard coded specific amount of space. The "reinstate" solution loses information. If that information matters, it isn't going to be a satisfactory solution.
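
A small sketch of that information loss, assuming a tab is rendered as four &nbsp; entities as described earlier in the thread. The helper names to_html and reinstate are hypothetical. The input holds one real tab and one run of non-breaking spaces typed by hand; after the round trip the two can no longer be told apart.

```perl
use strict;
use warnings;

my $TAB_HTML = '&nbsp;' x 4;    # assumed tab replacement, per the thread

# Forward pass: what the auto-formatter does to every newline and tab.
sub to_html {
    my ($text) = @_;
    $text =~ s{\n}{<br/>}g;
    $text =~ s{\t}{$TAB_HTML}g;
    return $text;
}

# "Reinstate" pass: best-effort inverse, meant for the contents of a <pre>.
sub reinstate {
    my ($html) = @_;
    $html =~ s{<br/>}{\n}g;
    $html =~ s{\Q$TAB_HTML\E}{\t}g;
    return $html;
}

# One real tab, plus four non-breaking spaces the author typed by hand.
my $original   = "\tindented\n${TAB_HTML}hand-spaced";
my $round_trip = reinstate(to_html($original));

print $round_trip eq $original
    ? "round trip ok\n"
    : "round trip changed the text\n";
```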

        Even though that seems wasteful and stupid, is it worse than invoking a module to do something simple?

        Modules are cheap. Your time isn't.

        What you want to do is not as simple as it first seems. This is only the first of many complications you are likely to run into. As someone who has studied Mediawiki's markup parsing, I can almost promise you that you will end up with a lot of ugliness if you try to do everything with regexes.

        It doesn't take perl a lot to load in a module. It is designed for that sort of thing. It may not even take up extra space on your server. HTML::Parser is such a standard module, some distros and hosting companies just make it available as a matter of course. But even if you have to install it, if learning and using a module will help you do the job better and save you time over the full course of your project, you should leap at it.

        For what you want to do, getting hands on experience with HTML::Parser will open a lot of doors for you. For one it will give you options about how much HTML you want to integrate into your markup. Using a module to do something simple in a way that gives you expansion room is a very smart move.

        I'm not a fan of using modules for every 5 line snippet I can write and test just as easily myself. However, a module like HTML::Parser represents a lot of work done for you testing and debugging a lot of corner cases and gotchas. I'd also explore CPAN to see if there are already parsing modules for the kind of blog markup you want to do. Why invent your own markup from the get go (unless this is a learning exercise), if it turns out that you can adapt the work of someone else who is 80% there?

        If I understand it well, what you are trying to do is to transform some kind of pseudo-HTML into real HTML.

        Doesn't loathsxome have a module which does this?

        CountZero

        "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: regex in form !regex->regex<-!regex
by kennethk (Abbot) on Feb 23, 2011 at 16:19 UTC
    As ELISHEVA rightly points out, this really is a job for an HTML parser. The task you want to accomplish is generally not worth the effort it takes for the result - probably the two most challenging aspects for getting your desired result are the possibility of nested tags and the lack of support for variable width look-behinds (Looking ahead and looking behind). You could get something like your desired behavior with:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $text = <<EOT;
    <p>This is a line
    with a break.</p><pre>This is a pre
    with a break.</pre><p>This is a line
    with a break.</p>
    EOT

    1 while $text =~ s{^((?:(?!<pre>).|<pre>(?:(?!</pre>).)*</pre>)*?)\n}{$1<br/>}is;

    print $text;

    which outputs

    <p>This is a line<br/>with a break.</p><pre>This is a pre
    with a break.</pre><p>This is a line<br/>with a break.</p><br/>

    YAPE::Regex::Explain breaks this down as

    The regular expression:

    (?is-mx:^((?:(?!<pre>).|<pre>(?:(?!</pre>).)*</pre>)*?)\n)

    matches as follows:

    NODE                     EXPLANATION
    ----------------------------------------------------------------------
    (?is-mx:                 group, but do not capture (case-insensitive)
                             (with . matching \n) (with ^ and $ matching
                             normally) (matching whitespace and #
                             normally):
    ----------------------------------------------------------------------
      ^                        the beginning of the string
    ----------------------------------------------------------------------
      (                        group and capture to \1:
    ----------------------------------------------------------------------
        (?:                      group, but do not capture (0 or more
                                 times (matching the least amount
                                 possible)):
    ----------------------------------------------------------------------
          (?!                      look ahead to see if there is not:
    ----------------------------------------------------------------------
            <pre>                    '<pre>'
    ----------------------------------------------------------------------
          )                        end of look-ahead
    ----------------------------------------------------------------------
          .                        any character
    ----------------------------------------------------------------------
         |                        OR
    ----------------------------------------------------------------------
          <pre>                    '<pre>'
    ----------------------------------------------------------------------
          (?:                      group, but do not capture (0 or more
                                   times (matching the most amount
                                   possible)):
    ----------------------------------------------------------------------
            (?!                      look ahead to see if there is not:
    ----------------------------------------------------------------------
              </pre>                   '</pre>'
    ----------------------------------------------------------------------
            )                        end of look-ahead
    ----------------------------------------------------------------------
            .                        any character
    ----------------------------------------------------------------------
          )*                       end of grouping
    ----------------------------------------------------------------------
          </pre>                   '</pre>'
    ----------------------------------------------------------------------
        )*?                      end of grouping
    ----------------------------------------------------------------------
      )                        end of \1
    ----------------------------------------------------------------------
      \n                       '\n' (newline)
    ----------------------------------------------------------------------
    )                        end of grouping
    ----------------------------------------------------------------------

    Note that you have to rerun the regex (as opposed to using the g modifier) since you always have to anchor at the start of the string. Also note the trailing <br/>. That hints at a larger problem - are you absolutely certain you want to change all newlines in your input? They tend to show up in strange locations. It's all these corner cases that make a pre-built library so worthwhile. HTML::Parser has been tested and debugged for 15 years, not the 15 minutes one would like to spend.
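
On the rerun-versus-/g point, one possible variation, not part of the original reply, is to anchor with \G instead of ^. \G matches where the previous match ended, so a single s///g pass can pick up from there rather than rescanning from the start each time. The sketch below reuses the same pattern; the names $unit, via_loop and via_g are illustrative, and it still assumes lower-level caveats such as unnested, properly closed <pre> blocks.

```perl
use strict;
use warnings;

# Either one character that does not start "<pre>", or a whole <pre> block;
# repeated lazily, as in the reply's pattern.
my $unit = qr{(?:(?!<pre>).|<pre>(?:(?!</pre>).)*</pre>)*?}is;

# The reply's approach: re-anchor at the start and rerun until nothing matches.
sub via_loop {
    my ($text) = @_;
    1 while $text =~ s{^($unit)\n}{$1<br/>};
    return $text;
}

# Variation: \G anchors each attempt where the previous match ended.
sub via_g {
    my ($text) = @_;
    $text =~ s{\G($unit)\n}{$1<br/>}g;
    return $text;
}

my $sample = "one\ntwo<pre>keep\nthis</pre>three\n";
print via_loop($sample), "\n";
print via_g($sample),    "\n";
```

Both versions should give the same text; the \G form just avoids walking over the already converted prefix on every iteration.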

      You make a good case, many thanks for the help both of you!

      JJ

Re: regex in form !regex->regex<-!regex
by AnomalousMonk (Archbishop) on Feb 23, 2011 at 23:19 UTC

    Gentle forestcreature: I strongly endorse the wise advice of others to use a proper HTML parser, but here's another (very) naive regex approach using the Special Backtracking Control Verbs of 5.10+ (see perlre):

    >perl -wMstrict -le
    "my $s = qq{foo\nbar\t<pRe> no\nnot\tnever </PrE> x\ty\nz };
     ;;
     my %replace = (
       qq{\n} => '<br/>',
       qq{\t} => '&nbsp; &nbsp; &nbsp;',
       );
     ;;
     my $pre = qr{ (?i) <pre> [^<]* </pre> }xms;
     ;;
     print qq{[[$s]]};
     $s =~ s{ $pre (*SKIP) (*FAIL) | ([\n\t]) }{$replace{$1}}xmsg;
     print qq{[[$s]]};
    "
    [[foo
    bar	<pRe> no
    not	never </PrE> x	y
    z ]]
    [[foo<br/>bar&nbsp; &nbsp; &nbsp;<pRe> no
    not	never </PrE> x&nbsp; &nbsp; &nbsp;y<br/>z ]]
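
One reason the approach above is naive: [^<]* cannot cross any "<", so a <pre> that contains another tag is never recognised as a block and its newlines get converted anyway. The sketch below, which is an illustration rather than part of the original reply, compares that pattern with a lazy .*? variant under the same (*SKIP)(*FAIL) scheme (Perl 5.10 or later). The names $pre_naive, $pre_lazy and convert are hypothetical, and nested or unclosed <pre> blocks remain unhandled.

```perl
use strict;
use warnings;

my %replace = (
    "\n" => '<br/>',
    "\t" => '&nbsp; &nbsp; &nbsp;',
);

# Pattern from the reply: stops matching at the first "<" inside the block.
my $pre_naive = qr{ <pre> [^<]* </pre> }xmsi;

# Variation: allow attributes and any content up to the first </pre>.
my $pre_lazy = qr{ <pre\b [^>]* > .*? </pre> }xmsi;

# Skip whole <pre> blocks; replace \n and \t everywhere else.
sub convert {
    my ($s, $pre) = @_;
    $s =~ s{ $pre (*SKIP) (*FAIL) | ([\n\t]) }{$replace{$1}}xmsg;
    return $s;
}

my $s = "a\n<pre>x <b>bold</b>\ny</pre>\tz";
print "[[", convert($s, $pre_naive), "]]\n";
print "[[", convert($s, $pre_lazy),  "]]\n";
```

With the naive pattern the newline inside the <pre> is turned into <br/>; with the lazy one it survives. Neither copes with every corner case, which is the point the other replies make about using a parser.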
