PerlMonks  

Substitutions are happening in the wrong place

by Guildenstern (Deacon)
on Apr 25, 2001 at 21:16 UTC ( [id://75543]=perlquestion )

Guildenstern has asked for the wisdom of the Perl Monks concerning the following question:

Well, progress marches on for my conversion of HTML to a proprietary ML. (See this node for more info.)

As it turns out, the source HTML needs quite a bit of doctoring before I can run it through the conversion process with a high certainty of success. One of the cleaning tasks I must perform is some link management. The markup language I am converting to only allows linking within a page, so external links must be edited. This part I can handle through an XSL transform. The real problem lies within the intra-page links.

The HTML defines a rather large number of <a name="foo"> anchors, with <a href="#foo"> links. What happens, however, is that there is also a large number of anchors defined that are not linked to, and links defined to anchors that have not been declared. What I do, then, is to parse the HTML and generate two hashes, one for anchors declared, and one for links that target anchors within the document. From there it's a simple task to see where the two hashes meet and count those links and anchors as valid.
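The bookkeeping described above could look roughly like the following. This is only a sketch: the sample `$intext` is made up, the `%anchors` hash name is an assumption (the post only names `%links`), and the regexes assume tags written exactly as `<a name="...">` and `<a href="#...">`, which real HTML will not always honor.

```perl
use strict;
use warnings;

# Hypothetical sample input; the real document is not shown in the post.
my $intext = <<'HTML';
<a name="intro">Intro</a>
<a name="unused">Nobody links here</a>
<a href="#intro">go to intro</a>
<a href="#missing">dangling link</a>
HTML

my (%anchors, %links);

# Anchors declared in the page: name => full opening tag.
while ($intext =~ /(<a\s+name="([^"]+)"\s*>)/ig) {
    $anchors{$2} = $1;
}

# Intra-page links: target => full opening tag.
while ($intext =~ /(<a\s+href="#([^"]+)"\s*>)/ig) {
    $links{$2} = $1;
}

# Where the two hashes meet, and where they do not.
my @valid            = sort grep { exists $links{$_} }    keys %anchors;
my @unlinked_anchors = sort grep { !exists $links{$_} }   keys %anchors;
my @dangling_links   = sort grep { !exists $anchors{$_} } keys %links;

print "valid: @valid\n";
print "unlinked anchors: @unlinked_anchors\n";
print "dangling links: @dangling_links\n";
```

Keeping the full opening tag as the hash value matches the `foo => <a href="#foo">` layout used later in the post.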

For unlinked anchors, it makes sense just to remove the declaration, since nothing can link to it given the restriction that links cannot point to other documents. Removing invalid links is a bit tougher, but I worked up this simple regex to handle it:
    foreach (keys %links) {
        if ($intext =~ s#$links{$_}([^</]+)</a>#$1#ig) {
            print "Link removed: $_\n";
        }
        else {
            print "Problem removing link: $_\n";
        }
    }

Basically, I wanted to be able to preserve the text within the link while removing the tags. %links is a hash that has the link target as the keys and the full tag as the values. e.g. foo => <a href="#foo">. This makes it easy to compare to the defined anchors to determine valid links.


The problem (finally!).

There are two entries in %links that are acting a bit strangely:
    use() => <a href="#use()">
    use   => <a href="#use">

The problem arises when the above regex is applied to these two entries. The use() entry replaces all instances of use, and the use entry fails to make any replacements. The resulting output still contains every occurrence of <a href="#use()">; none of them is replaced.

What I can't understand is why /<a href="#use()">/ is matching <a href="#use">. Is there something happening due to the parens? Am I just smoking crack? I'm really stuck at this point, and while I could manually fix the missed replacements, that would kind of defeat the whole notion of an automated process.

Guildenstern
Negated character class uber alles!

Replies are listed 'Best First'.
Re: Substitutions are happening in the wrong place
by suaveant (Parson) on Apr 25, 2001 at 21:22 UTC
     /<a href="#use\(\)">/ your parens form an empty capture group that puts '' into $1... you need to escape them

    Update: sorry... putting \Q and \E around a variable takes care of escaping the non-word characters in that variable,
    so use \Q$links{$_}\E in your regexp
                    - Ant
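To make the diagnosis above concrete, here is a small sketch that runs the original substitution with and without \Q...\E. The sample `$text` and the variable names are made up for illustration; only the pattern shape comes from the question.

```perl
use strict;
use warnings;

my $tag  = '<a href="#use()">';
my $text = 'See <a href="#use()">use()</a> and <a href="#use">use</a>.';

# Unescaped: the interpolated () is an empty capture group. The pattern
# therefore matches <a href="#use"> instead, the empty group becomes $1,
# and the link text ends up in $2, so it is dropped from the output.
(my $buggy = $text) =~ s#$tag([^</]+)</a>#$1#ig;

# Escaped: \Q...\E makes the parens literal, so only the use() link
# matches and $1 holds its text.
(my $fixed = $text) =~ s#\Q$tag\E([^</]+)</a>#$1#ig;

print "buggy: $buggy\n";   # buggy: See <a href="#use()">use()</a> and .
print "fixed: $fixed\n";   # fixed: See use() and <a href="#use">use</a>.
```

This also accounts for both symptoms in the question: the use() entry eats the use links, and the real use() links are never touched.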

Re: Substitutions are happening in the wrong place
by hdp (Beadle) on Apr 25, 2001 at 21:32 UTC
    () are metacharacters in regexes; for a full list, look in perldoc perlre. (I'm guessing you knew this but didn't expect them to be magical when interpolated, as this seems to catch people often.)

    As Ant says, you can use \Q. Also, look at quotemeta, which can be more readable sometimes (higher \w to punctuation ratio).

    hdp.
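As a sketch of the quotemeta variant mentioned above, the original loop could escape each tag once before it is interpolated. The two %links entries are the ones from the question; the sample `$intext` and the `$target` and `$tag` names are assumptions for illustration.

```perl
use strict;
use warnings;

my %links = (
    'use()' => '<a href="#use()">',
    'use'   => '<a href="#use">',
);

# Hypothetical input containing both problem links.
my $intext = 'Call <a href="#use()">use()</a> or <a href="#use">use</a> here.';

foreach my $target (sort keys %links) {
    # Escape the tag once, outside the pattern, so the regex stays readable.
    my $tag = quotemeta $links{$target};
    if ($intext =~ s#$tag([^</]+)</a>#$1#ig) {
        print "Link removed: $target\n";
    }
    else {
        print "Problem removing link: $target\n";
    }
}

print "$intext\n";   # Call use() or use here.
```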

Re: Substitutions are happening in the wrong place
by Xxaxx (Monk) on Apr 25, 2001 at 23:50 UTC
    Your use of
    [^</]
    in the regex:
    if ($intext =~ s#$links{$_}([^</]+)</a>#$1#ig)
    leads me to wonder if you might be thinking that the </ is acting as a unit.

    The way it is written I believe you'll be looking for any character not equal to < and not equal to /.

    There could be a / in your text for some of the links.

    For example: "This stuff and/or that stuff"

    Claude
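A quick sketch of the point above, using a made-up link whose text contains a slash. The two alternatives shown ([^<]+ and a non-greedy .*?) are suggestions added here, not something proposed in the thread, and the `#opts` anchor is hypothetical.

```perl
use strict;
use warnings;

my $text = '<a href="#opts">this stuff and/or that stuff</a>';
my $tag  = quotemeta '<a href="#opts">';

# [^</] excludes "<" and "/" individually, so the capture stops at the
# slash in "and/or", </a> cannot follow, and the substitution fails.
(my $with_class = $text) =~ s#$tag([^</]+)</a>#$1#i;

# Excluding only "<" lets a slash through while still stopping at </a>.
(my $with_lt = $text) =~ s#$tag([^<]+)</a>#$1#i;

# A non-greedy match up to the first </a> is another option.
(my $lazy = $text) =~ s#$tag(.*?)</a>#$1#is;

print "class: $with_class\n";
print "lt:    $with_lt\n";
print "lazy:  $lazy\n";
```

Only the first form leaves the link in place; which of the other two is preferable depends on whether the link text may itself contain tags.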
