Re: regex in form !regex->regex<-!regex

This kind of problem is exactly why parsing HTML with regexes is not recommended. Sooner or later you run into it: you need to treat something one way if it is outside certain tags and another way if it is inside them.

I wonder if you could explain more about what you are trying to accomplish. Where is this white space you are replacing - are you trying to beautify the layout of an HTML file? Or are you trying to do replacements inside some tags but not others (like <pre>)?

If you are beautifying, I would seriously consider using an HTML parser like HTML::Parser or some similar module (search CPAN - there are a lot of variants). This will hand the document to you node by node, and you can pretty-print each element exactly as you wish, with as much or as little white space as you desire. There might also be a CPAN module that does all this for you, although my very brief attempts at searching (on the key words "html", "beautify") turned up nothing useful. Your mileage may vary.

If you are doing your magic on attribute values or the text between tags, I still strongly recommend that you consider using an HTML parsing module. Instead of trying to guess at whether you are operating on the right sort of tag, a parser will hand you the HTML element by element and you can choose exactly which elements you wish to beautify and what part of those elements (attribute values, text between tags).
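
To make the element-by-element idea concrete, here is a minimal sketch using HTML::Parser's version 3 event API. It assumes the formatting in question is "tab becomes four non-breaking spaces, newline becomes `<br/>`", and that `<pre>` is the only element to leave alone; the name `format_post` is made up for illustration, so adapt it to whatever your plugin actually needs.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: reformat text everywhere except inside <pre>.
sub format_post {
    my ($src) = @_;
    my $out       = '';
    my $pre_depth = 0;    # > 0 while we are inside a <pre> element

    my $p = HTML::Parser->new(
        api_version => 3,
        # Start and end tags are copied through unchanged; we only
        # keep track of whether we are inside <pre>.
        start_h => [ sub {
            my ($tag, $text) = @_;
            $pre_depth++ if $tag eq 'pre';
            $out .= $text;
        }, 'tagname, text' ],
        end_h => [ sub {
            my ($tag, $text) = @_;
            $pre_depth-- if $tag eq 'pre' && $pre_depth > 0;
            $out .= $text;
        }, 'tagname, text' ],
        # Text between tags: reformat it only when outside <pre>.
        text_h => [ sub {
            my ($text) = @_;
            unless ($pre_depth) {
                $text =~ s/\t/&nbsp;&nbsp;&nbsp;&nbsp;/g;
                $text =~ s/\n/<br\/>\n/g;
            }
            $out .= $text;
        }, 'text' ],
        # Comments, declarations and so on pass through untouched.
        default_h => [ sub { $out .= shift }, 'text' ],
    );

    $p->parse($src);
    $p->eof;
    return $out;
}
```

Calling `format_post($post_text)` returns the reformatted string. The point of the sketch is that the "am I inside `<pre>`?" question is answered by a counter the parser keeps up to date, not by a regex trying to guess at context.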

Re^2: regex in form !regex->regex<-!regex by forestcreature (Novice) on Feb 23, 2011 at 16:07 UTC
Yup, I'm just trying to implement a simple blog (using loathsxome, which is based on blosxom). Posts are based on text files, with or without meta-data (parsed out by one of the existing plugins). I'm just creating a very simple auto-formatting plugin that will come closest to representing posts in much the same way as I'd format a plain ascii text file.

Most of the text needs to wrap and behave like text does, hence an approximation of tabbing (4 non-breaking spaces), and `<br/>` instead of `\n`. There is also a quick and dirty syntax for hyperlinks and images. Pretty simple stuff done with a few substitution regexes. The only thing that is giving me trouble is saving ascii art (or 'properly' tabbed stuff) in `<pre>` from the same treatment.

I suppose either I could break things up element-wise like you suggest, or perhaps write a last set of substitutions that just reinstates `\n` and `\t` for all cases enclosed in `<pre>`... Even though that seems wasteful and stupid, is it worse than invoking a module to do something simple?

Cheers,
JJ

p.s. I don't know the answer to that, as I'm not a real programmer. My hunch is "yes". :) My second hunch is TMTOWTDI
Re^3: regex in form !regex->regex<-!regex by ELISHEVA (Prior) on Feb 23, 2011 at 16:33 UTC
"perhaps write a last set of substitutions that just reinstates `\n` and `\t` for all cases enclosed in `<pre>`..."

You could "reinstate" those tabs - but how would you know which white space was meant to be a tab (whose width depends on settings) and which was meant to be a hard-coded, specific amount of space? The "reinstate" solution loses information. If that information matters, it isn't going to be a satisfactory solution.

"Even though that seems wasteful and stupid, is it worse than invoking a module to do something simple?"

Modules are cheap. Your time isn't. What you want to do is not as simple as it first seems. This is only the first of many complications you are likely to run into. As someone who has studied Mediawiki's markup parsing, I can almost promise you that you will end up with a lot of ugliness if you try to do everything with regexes.

It doesn't take perl a lot to load in a module. It is designed for that sort of thing. It may not even take up extra space on your server. HTML::Parser is such a standard module that some distros and hosting companies just make it available as a matter of course. But even if you have to install it, if learning and using a module will help you do the job better and save you time over the full course of your project, you should leap at it.

For what you want to do, getting hands-on experience with HTML::Parser will open a lot of doors for you. For one, it will give you options about how much HTML you want to integrate into your markup. Using a module to do something simple in a way that gives you expansion room is a very smart move.

I'm not a fan of using modules for every 5 line snippet I can write and test just as easily myself. However, a module like HTML::Parser represents a lot of work done for you, testing and debugging a lot of corner cases and gotchas.

I'd also explore CPAN to see if there are already parsing modules for the kind of blog markup you want to do. Why invent your own markup from the get go (unless this is a learning exercise), if it turns out that you can adapt the work of someone else who is 80% there?
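
If you do decide to stay with plain substitutions for now, one way to avoid the information loss is to never touch the `<pre>` blocks in the first place, rather than converting everything and trying to undo it afterwards. The following is only a rough sketch under some assumptions: `format_outside_pre` is a made-up name, the two substitutions stand in for your real ones, and it assumes `<pre>` blocks are well formed and not nested (which is exactly the kind of corner case a parser handles for you).

```perl
use strict;
use warnings;

# Hypothetical helper: apply the formatting only outside <pre>...</pre>.
sub format_outside_pre {
    my ($src) = @_;

    # The capturing group makes split return the <pre> blocks too,
    # so the list alternates: plain text, <pre> block, plain text, ...
    my @parts = split m{(<pre\b[^>]*>.*?</pre>)}is, $src;

    for my $i (0 .. $#parts) {
        next if $i % 2;    # odd indexes hold the <pre> blocks: skip them
        $parts[$i] =~ s/\t/&nbsp;&nbsp;&nbsp;&nbsp;/g;
        $parts[$i] =~ s/\n/<br\/>\n/g;
    }

    return join '', @parts;
}
```

Because the tabs and newlines inside `<pre>` are never converted, there is nothing to guess at when putting them back. It is still a regex deciding what counts as a `<pre>` block, though, so treat it as a stopgap.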
Re^3: regex in form !regex->regex<-!regex by CountZero (Bishop) on Feb 23, 2011 at 17:23 UTC
If I understand it well, what you are trying to do is to transform some kind of pseudo-HTML into real HTML. Doesn't loathsxome have a module which does this?

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

