PerlMonks  

appending a unique marker to each url in a file

by cat2014 (Monk)
on Aug 08, 2001 at 07:29 UTC ([id://102966]=perlquestion)

cat2014 has asked for the wisdom of the Perl Monks concerning the following question:

This should be a pretty simple problem, but it's got me at a dead end. I have a file with a set of semi-random marker strings, like so:

    anfnf11 iopi1p83288 9032-jjjf

and an HTML page which has a bunch of links inside it. My problem seemed really simple when I started this: I want to take the first marker string and stick it on the end of the first link, then take the second marker string and append it to the end of the second link. I can guarantee that all the links start with http:// or https://, so I'm looking for that in my regex. This is what my text would look like before the script runs:

    i have <a href="http://www.somewhere.org/foo/">a link</a> and <a href=https://somewhere.else.net/bar/>the second link</a>.

After the script runs, it would look like this:

    i have <a href="http://www.somewhere.org/foo/anfnf11">a link</a> and <a href=https://somewhere.else.net/bar/iopi1p83288>the second link</a>.

Theoretically, the number of links in the HTML page is the same as the number of marker strings, but I'd like to fail gracefully if I run out of markers or if there are more markers than links. Anyway, I thought it would be no problem: slurp the markers into an array, then do a substitution on the values of the links with the marker at the end. I started with something like this:
    # @markers already holds each marker in each array spot
    # $htmlfile already has the text of my html file
    foreach my $m (@markers){
        $htmlfile =~ s/\G<a\s+href\s+=\s+\"?(http[\s\>]+)"?>/<a href="$1$m">/gi;
    }
That, of course, didn't work: the first link ended up with all the markers at the end of it. I tried a while loop, too:
    # @markers already holds each marker in each array spot
    # $htmlfile already has the text of my html file
    my $count = 0;
    while ($htmlfile =~ m/<a\s+href\s*=\s*\"?(http[\s\>]+)"?>/gi){
        $htmlfile =~ s/$1/$1$markers[$count]/;
        $count ++;
    }
The while loop feels like it's close, but it's not working either: no markers end up in the output. So I'm kind of stuck here. I feel like the solution to this is really simple, and I'm just missing it entirely.

-- cat

Replies are listed 'Best First'.
Re: appending a unique marker to each url in a file
by thatguy (Parson) on Aug 08, 2001 at 07:54 UTC
    I think using a regex on the entire file may get a little complicated.

    I would use HTML::TokeParser to pull the links out of your file and then modify them from there, like so:

        #!/usr/bin/perl -w
        use HTML::TokeParser;
        use strict;

        my $i=0;

        ## set marker definitions
        my @markers = qw/ anfnf11 iopi1p83288 9032-jjjf /;

        my $htmlfile = "index.html";
        my $content;

        ## get contents of your html file
        open (FILE," $htmlfile") || die "Cannot open HTML file for parsing!: $!\n";
        while(<FILE>) {
            $content .= $_;
        }
        close(FILE);

        my $parse = HTML::TokeParser->new(\$content);

        while (my $token = $parse->get_tag("a")) {
            my $url = $token->[1]{href} || "-";          ## put link into $url
            my $text = $parse->get_trimmed_text("/a");   ## put link desc into $text
            if ($markers[$i]) {
                print "<a href=$url/$markers[$i]>$text</a>\n";
            } else {
                ## no more markers...
            }
            $i++;
        }
        exit;

    Update: fixed the way data was put into $content courtesy of Hofmator.

    -p

          open (FILE," $htmlfile") || die "Cannot open HTML file for parsing!: $!\n";
          while(<FILE>) {
              $content="$content$_\n";
          }
          close(FILE);
      This construct is not ideal. You are interpolating the variable $content into a new string for every line. You should use concatenation and just append to the string:
      while(<FILE>) { $content .= "$_\n"; }
      But what you are doing now is slurping in the whole file and adding an extra newline at the end of each line (for which I see absolutely no reason). The same thing can be achieved by undefing $/ like this:
          {
              local $/; # undefs $/ for this block of code only
              open (FILE," $htmlfile") || die "Cannot open HTML file for parsing!: $!\n";
              $content = <FILE>; # reads in whole file
              $content =~ s/\n/\n\n/g; # if really necessary to duplicate newlines
              close(FILE);
          }

      -- Hofmator
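
For reference, here is a minimal sketch of the localized-slurp idiom wrapped in a sub, so the change to $/ cannot leak into the rest of the program. The sub name slurp_file, the lexical filehandle, and the three-argument open are assumptions made for illustration and do not appear in the posts above.

```perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption): read a whole file into one string.
sub slurp_file {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!\n";
    local $/;                 # undef the input record separator for this sub only
    my $content = <$fh>;      # a single read now returns the entire file
    close($fh);
    return defined $content ? $content : '';
}
```

Calling it as `my $content = slurp_file('index.html');` leaves $/ at its usual value once the sub returns.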

        How about just

            $content = join '', <FILE>;

        No need to mess with $/ and have to remember to localise it. Woe betide he who forgets to localise $, $" $/ or $\

        Update

        As chipmunk points out, this is slower than undef $/. For the gory details see Re: Re: Re: Re: Re: appending a unique marker to each url in a file. For big files the difference is significant; for small ones it is negligible, but who wants to paint themselves into a scaling corner? It is better to undef $/, just remember to localise it.

        Ugh posted bad code again.

        cheers

        tachyon

        s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: appending a unique marker to each url in a file
by Cubes (Pilgrim) on Aug 08, 2001 at 08:12 UTC
    As wonderful as regexes are, sometimes they're more trouble than they're worth. The snippet below will do what you want, regardless of whether your links start with http or not. It won't do the right thing if the href targets aren't quoted, or if you have an <a> tag without an href followed by some other tag with an href before the next link, but this is the 5-minute version.

        $pos = 0;
        while ($m = shift @markers) {
            # locate the beginning of the link
            last if (($pos = index $htmlfile, '<a', $pos) < 0);
            # ...then the start of the link's href
            last if (($pos = index $htmlfile, 'href="', $pos) < 0);
            # skip past the first "
            $pos += 6;
            # ...then the end of the quoted href target
            last if (($pos = index $htmlfile, '"', $pos) < 0);
            substr($htmlfile, $pos, 0) = $m;
        }

    At the end, $pos will be -1 and @markers will be empty if you ran out of links before you ran out of markers. If $pos is not -1, do one more index looking for <a and/or href=. If it hits (i.e., does not return -1), you ran out of markers before all of the links were done. If it does return -1, your links and @markers matched up perfectly.

    Update: Woops, my ending logic was broken (it's fixed now). The final index check has to be done if $pos is not -1, not just if there's anything left in @markers as I originally stated.
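
The loop and the end-of-run checks described in this reply could be packaged roughly as follows. This is only a sketch: the sub name tag_links_by_index and the status strings are made up for illustration. Unlike the loop above, it shifts a marker only after a link has been found, so unused markers are not consumed. It shares the original's limitation that href targets must be double-quoted.

```perl
use strict;
use warnings;

# Hypothetical wrapper (name and status strings are assumptions).
# Returns the modified text and one of 'ok', 'links_left', 'markers_left'.
sub tag_links_by_index {
    my ($html, @markers) = @_;
    my $pos = 0;
    while (@markers) {
        # find the next <a, then its href=", stopping if either is missing
        last if ($pos = index($html, '<a', $pos)) < 0;
        last if ($pos = index($html, 'href="', $pos)) < 0;
        $pos += 6;                                      # skip past href="
        last if ($pos = index($html, '"', $pos)) < 0;   # closing quote of the target
        substr($html, $pos, 0) = shift @markers;        # insert marker before the quote
    }
    my $status = 'ok';
    if ($pos < 0) {
        $status = 'markers_left';      # ran out of links first
    }
    elsif (index($html, 'href="', $pos) >= 0) {
        $status = 'links_left';        # ran out of markers first
    }
    return ($html, $status);
}
```

A caller can then warn rather than die, for example `my ($tagged, $status) = tag_links_by_index($htmlfile, @markers); warn "mismatch: $status\n" if $status ne 'ok';`.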

Re: appending a unique marker to each url in a file
by chipmunk (Parson) on Aug 08, 2001 at 17:50 UTC
    What you need to do is grab the next marker for each substitution. Here's one way of doing it, using /e so you can execute Perl code in the replacement.
        $htmlfile =~ s/<a\s+href\s*=\s*(["']?)(http.*?)\1>/
            @markers or die "More URLs than markers.\n";
            qq{<a href="$2} . shift(@markers) . '">'/gie;
        @markers and die "More markers than URLs.\n";
    However, this approach still has all the drawbacks of using a regex to match HTML. Using a proper HTML parser would give you a much more robust solution.
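
Since the original question asked for graceful failure, here is a hedged variation on the /e idea that counts mismatches instead of dying and keeps each link's original quoting. The sub name tag_links_with_regex and the stricter https?:// pattern are assumptions for illustration, and the usual caveats about matching HTML with a regex still apply.

```perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption): append one marker per link.
# Returns the new text, the number of links left untagged, and the number
# of markers left unused.
sub tag_links_with_regex {
    my ($html, @markers) = @_;
    my $untagged = 0;
    # $1 = everything up to the target, $2 = optional quote, $3 = the URL
    $html =~ s{(<a\s+href\s*=\s*)(["']?)(https?://[^\s"'>]*)\2}{
        my $marker = @markers ? shift @markers : do { $untagged++; '' };
        "$1$2$3$marker$2"
    }gie;
    return ($html, $untagged, scalar @markers);
}
```

A caller might use it as `my ($tagged, $untagged, $unused) = tag_links_with_regex($htmlfile, @markers);` and warn when either count is nonzero.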
My end solution to appending a unique marker to each url in a file
by cat2014 (Monk) on Aug 09, 2001 at 02:27 UTC
    For the curious, I'll post what I ended up doing. One of the main requirements of this script was that it carefully preserve the formatting of the files it ran on: the only change could be the addition of the url markers, and that influenced my solution. The script is called with two files, one html and one plain text. Sample html file to call with:

        this a link: <br>
        <a href="http://www.somewhere.org/foo">somewhere</a><p>
        <a href= https://somewhere.else.net/bar/>another place to go</a>.<p>

    And the sample text file that pairs with the html file:

        this a link: http://www.somewhere.org/foo
        another place to go: https://somewhere.else.net/bar/

    So here's my code to tag the urls in these files:
        #$unprocessed_html holds the text of the html version of my file
        #$unprocessed_text holds the plain text version of my file
        #@markers holds the unique url markers, ie [/qfk33pe][/nnd92093]

        #handle html version
        my $processed_html = "";
        while (length($unprocessed_html) > 0) {
            if($unprocessed_html =~ s/^(.*?\b(href|action)\s*=\s*)//si) {
                $processed_html .= $1;
            }
            else {
                $processed_html .= $unprocessed_html;
                last;
            }
            if (not(@markers)){
                $processed_html .= $unprocessed_html;
                warn "no more markers available for remaining links\n";
                last;
            }
            if($unprocessed_html =~ s/^([^\"\'][^<>\s]*)//) {
                my $url = $1 . shift(@markers);
                #strip double //s due to sloppy input, keeping the :// after the scheme
                $url =~ s|(?<!:)//|/|g;
                $processed_html .= $url;
            }
            elsif($unprocessed_html =~ s/^([\"\'])([^<>]*?)\1//) {
                my $url = $1 . $2 . shift(@markers) . $1;
                #strip double //s that can result from sloppy input, keeping the :// after the scheme
                $url =~ s|(?<!:)//|/|g;
                $processed_html .= $url;
            }
            else {
                die "something happened here";
            }
        }

        #handle text version
        my $processed_text = "";
        while (length($unprocessed_text) > 0) {
            if($unprocessed_text =~ s/^(.*?\b)http/http/si) {
                $processed_text .= $1;
            }
            else {
                $processed_text .= $unprocessed_text;
                last;
            }
            if (not(@markers)){
                $processed_text .= $unprocessed_text;
                warn "no more markers available for remaining links\n";
                last;
            }
            if($unprocessed_text =~ s/(http[\S]+)\s+//){
                my $urlfound = $1 . shift (@markers);
                #keep the :// after the scheme when collapsing double slashes
                $urlfound =~ s|(?<!:)//|/|g;
                $processed_text .= "$urlfound\n";
            }
            elsif($unprocessed_text =~ s/(http[\S]+)$//){
                my $urlfound = $1 . shift (@markers);
                $urlfound =~ s|(?<!:)//|/|g;
                $processed_text .= "$urlfound\n";
            }
        }

    -- cat
