appending a unique marker to each url in a file

cat2014 has asked for the wisdom of the Perl Monks concerning the following question:

this should be a pretty simple problem, but it's got me at a dead end. i have a file with a set of semi-random marker strings, like so:

anfnf11
iopi1p83288
9032-jjjf

and an html page which has a bunch of links inside it. my problem seemed really simple when i started this: i want to take the first marker string and stick it on the end of the first link, then take the second marker string and append it to the end of the second link. I can guarantee that all the links start with http:// or https://, so i'm looking for that in my regex. this is what my text would look like before the script runs:

i have <a href="http://www.somewhere.org/foo/">a link</a> and <a href=https://somewhere.else.net/bar/>the second link</a>.

after the script runs, it would look like this:

<body>this is <a href="http://www.somewhere.org/foo/anfnf11">a link</a> and <a href=https://somewhere.else.net/bar/iopi1p83288>the second link</a>

theoretically, the number of links in the html page is the same as the number of marker strings, but i'd like to fail gracefully if i run out of markers or if there are more markers than links. anyway, i thought it would be no problem- slurp the markers into an array, then do a substitution on the values of the links with the marker at the end. i started with something like this:

# @markers already holds each marker in each array spot
#$htmlfile  already has the text of my html file

foreach my $m (@markers){
    $htmlfile =~ s/\G<a\s+href\s+=\s+\"?(http[\s\>]+)"?>/<a href="$1$m">/gi;
}

That, of course, didn't work- the first link ended up with all the markers at the end of it. I tried a while loop, too:

# @markers already holds each marker in each array spot
#$htmlfile  already has the text of my html file

my $count = 0;
while ($htmlfile =~ m/<a\s+href\s*=\s*\"?(http[\s\>]+)"?>/gi){
    $htmlfile =~ s/$1/$1$markers[$count]/;
    $count++;
}

The while loop feels like it's close, but it's not working, either- no markers end up in the output. So I'm kind of stuck here-- I feel like the solution to this is really simple, and I'm just missing it entirely. -- cat


Replies are listed 'Best First'.
Re: appending a unique marker to each url in a file by thatguy (Parson) on Aug 08, 2001 at 07:54 UTC
I think using a regex on the entire file may get a little complicated. I would use HTML::TokeParser to pull the links out of your file and then modify them from there, like so:

#!/usr/bin/perl -w
use HTML::TokeParser;
use strict;

my $i = 0;

## set marker definitions
my @markers = qw/ anfnf11 iopi1p83288 9032-jjjf /;
my $htmlfile = "index.html";
my $content;

## get contents of your html file
open (FILE," $htmlfile") || die "Cannot open HTML file for parsing!: $!\n";
while(<FILE>) {
    $content .= $_;
}
close(FILE);

my $parse = HTML::TokeParser->new(\$content);
while (my $token = $parse->get_tag("a")) {
    my $url  = $token->[1]{href} || "-";        ## put link into $url
    my $text = $parse->get_trimmed_text("/a");  ## put link desc into $text
    if ($markers[$i]) {
        print "<a href=$url/$markers[$i]>$text</a>\n";
    } else {
        ## no more markers...
    }
    $i++;
}
exit;

Update: fixed the way data was put into $content courtesy of Hofmator. -p
Re: Re: appending a unique marker to each url in a file by Hofmator (Curate) on Aug 08, 2001 at 14:14 UTC
open (FILE," $htmlfile") || die "Cannot open HTML file for parsing!: $!\n";
while(<FILE>) {
    $content="$content$_\n";
}
close(FILE);

This construct is not ideal. You are interpolating the variable $content into a new string for every line. You should use concatenation and just append to the string:

while(<FILE>) {
    $content .= "$_\n";
}

But what you are doing now is slurping in the whole file and adding an extra newline at the end of each line (for which I see absolutely no reason). The same thing can be achieved by undefing $/ like this:

{
    local $/; # undefs $/ for this block of code only
    open (FILE," $htmlfile") || die "Cannot open HTML file for parsing!: $!\n";
    $content = <FILE>; # reads in whole file
    $content =~ s/\n/\n\n/g; # if really necessary to duplicate newlines
    close(FILE);
}

-- Hofmator
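The slurp idiom above can be wrapped in a small reusable sub. This is only a sketch: the name `slurp_file` is made up for illustration, and it swaps in a lexical filehandle and three-argument `open`, which the posts above do not use.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): read a whole file into one string.
sub slurp_file {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!\n";
    local $/;                # undef the input record separator, for this sub only
    my $content = <$fh>;     # with $/ undefined, one read returns the whole file
    close($fh);
    return defined $content ? $content : '';
}
```

Because `local` restores `$/` when the sub returns, code elsewhere that reads line by line is unaffected.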
Re: Re: Re: appending a unique marker to each url in a file by tachyon (Chancellor) on Aug 08, 2001 at 17:36 UTC
How about just

$content = join'', <FILE>;

No need to mess with $/ and have to remember to localise it. Woe betide he who forgets to localise $, $" $/ $\

Update: As chipmunk points out, this is slower than undef $/; for the gory details see Re: Re: Re: Re: Re: appending a unique marker to each url in a file. For big files the difference is significant; for small ones it is negligible, but who wants to paint themselves into a scaling corner? It is better to undef $/, just remember to localise it. Ugh, posted bad code again.

cheers

tachyon

s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
Re: Re: Re: Re: appending a unique marker to each url in a file by chipmunk (Parson) on Aug 08, 2001 at 17:59 UTC
Re: Re: Re: Re: Re: appending a unique marker to each url in a file by tachyon (Chancellor) on Aug 08, 2001 at 18:32 UTC
Re: appending a unique marker to each url in a file by Cubes (Pilgrim) on Aug 08, 2001 at 08:12 UTC
As wonderful as regexes are, sometimes they're more trouble than they're worth. The snippet below will do what you want, regardless of whether your links start with http or not. It won't do the right thing if the href targets aren't quoted, or if you have an <a> tag without an href followed by some other tag with an href before the next link, but this is the 5-minute version.

$pos = 0;
while ($m = shift @markers) {
    # locate the beginning of the link
    last if (($pos = index $htmlfile, '<a', $pos) < 0);
    # ...then the start of the link's href
    last if (($pos = index $htmlfile, 'href="', $pos) < 0);
    # skip past the first "
    $pos += 6;
    # ...then the end of the quoted href target
    last if (($pos = index $htmlfile, '"', $pos) < 0);
    substr($htmlfile, $pos, 0) = $m;
}

At the end, `$pos` will be -1 and `@markers` will be empty if you ran out of links before you ran out of markers. If `$pos` is not -1, do one more `index` looking for `<a` and/or `href=`. If it hits (i.e., does not return -1), you ran out of markers before all of the links were done. If it does return -1, your links and `@markers` matched up perfectly.

Update: Woops, my ending logic was broken (it's fixed now). The final index check has to be done if `$pos` is not -1, not just if there's anything left in `@markers` as I originally stated.
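For anyone who wants the end-of-loop bookkeeping spelled out, here is a rough sketch of the same index/substr idea as a pair of subs. The names `next_href_end` and `tag_links_by_index` are invented for this example. It differs from the snippet above in one respect: it only takes a marker off the list once a link has actually been found. It keeps the same limitations (double-quoted hrefs only, and `<a` also matches tags such as `<abbr`).

```perl
use strict;
use warnings;

# Hypothetical helper: position of the closing quote of the next
# double-quoted href at or after $pos, or -1 if there is none.
sub next_href_end {
    my ($html, $pos) = @_;
    my $start = index($html, '<a', $pos);
    return -1 if $start < 0;
    my $href = index($html, 'href="', $start);
    return -1 if $href < 0;
    return index($html, '"', $href + 6);    # 6 == length of 'href="'
}

# Hypothetical wrapper: returns the tagged text, the number of unused
# markers, and a flag saying whether untagged links remain.
sub tag_links_by_index {
    my ($html, @markers) = @_;
    my $pos = 0;
    while (@markers) {
        my $end = next_href_end($html, $pos);
        last if $end < 0;                    # ran out of links
        my $m = shift @markers;              # consume a marker only for a real link
        substr($html, $end, 0) = $m;         # insert just before the closing quote
        $pos = $end + length($m) + 1;        # resume after that closing quote
    }
    my $links_left = next_href_end($html, $pos) >= 0 ? 1 : 0;
    return ($html, scalar(@markers), $links_left);
}
```

A caller can then warn rather than die: a non-zero second value means more markers than links, and a true third value means more links than markers.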
Re: appending a unique marker to each url in a file by chipmunk (Parson) on Aug 08, 2001 at 17:50 UTC
What you need to do is grab the next marker for each substitution. Here's one way of doing it, using /e so you can execute Perl code in the replacement.

$htmlfile =~ s/<a\s+href\s*=\s*(["']?)(http.*?)\1>/
    @markers or die "More URLs than markers.\n";
    qq{<a href="$2} . shift(@markers) . '">'/gie;
@markers and die "More markers than URLs.\n";

However, this approach still has all the drawbacks of using a regex to match HTML. Using a proper HTML parser would give you a much more robust solution.
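If dying is too abrupt for the "fail gracefully" requirement in the question, the same /e technique can warn instead. The following is a sketch under a few assumptions: `tag_links_by_regex` is a made-up name, the URL is taken to end at whitespace, a quote, or `>`, and the original quoting of each href is kept as found. It carries the same caveats about matching HTML with a regex.

```perl
use strict;
use warnings;

# Hypothetical helper: append one marker per link, warning (not dying)
# when the counts of links and markers differ.
sub tag_links_by_regex {
    my ($html, @markers) = @_;
    my $untagged = 0;
    $html =~ s{(<a\s+href\s*=\s*)(["']?)(https?://[^\s"'>]*)\2}{
        # take the next marker if there is one, otherwise leave the link alone
        my $m = @markers ? shift @markers : do { $untagged++; '' };
        "$1$2$3$m$2"
    }gie;
    warn "$untagged link(s) had no marker\n" if $untagged;
    warn scalar(@markers) . " marker(s) left over\n" if @markers;
    return ($html, $untagged, scalar @markers);
}
```

The replacement rebuilds each match from its own captures, so everything outside the href value is left untouched.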
My end solution to appending a unique marker to each url in a file by cat2014 (Monk) on Aug 09, 2001 at 02:27 UTC
For the curious, I'll post what I ended up doing. One of the main requirements of this script was that it carefully preserved the formatting of the files that it ran on- the only change could be the addition of the url markers, so that influenced my solution. The script would be called with two files- one html and one plain text. Sample html file to call with:

this a link: <br> <a href="http://www.somewhere.org/foo">somewhere</a><p>
<a href= https://somewhere.else.net/bar/>another place to go</a>.<p>

And sample text file pair to the html file:

this a link: http://www.somewhere.org/foo
another place to go: https://somewhere.else.net/bar/

so here's my code to tag the urls in these files:

#$unprocessed_html holds the text of the html version of my file
#$unprocessed_text holds the plain text version of my file
#@markers holds the unique url markers, ie [/qfk33pe][/nnd92093]

#handle html version
my $processed_html = "";
while (length($unprocessed_html) > 0) {
    if($unprocessed_html =~ s/^(.*?\b(href|action)\s*=\s*)//si) {
        $processed_html .= $1;
    } else {
        $processed_html .= $unprocessed_html;
        last;
    }
    if (not(@markers)){
        $processed_html .= $unprocessed_html;
        warn "no more markers available for remaining links\n";
        last;
    }
    if($unprocessed_html =~ s/^([^\"\'][^<>\s]*)//) {
        my $url = $1 . shift(@markers);
        #strip double //s due to sloppy input
        $url =~ s|//|/|g;
        $processed_html .= $url;
    } elsif($unprocessed_html =~ s/^([\"\'])([^<>]*?)\1//) {
        my $url = $1 . $2 . shift(@markers) . $1;
        #strip double //s that can result from sloppy input
        $url =~ s|//|/|g;
        $processed_html .= $url;
    } else {
        die "something happened here";
    }
}

#handle text version
my $processed_text = "";
while (length($unprocessed_text) > 0) {
    if($unprocessed_text =~ s/^(.*?\b)http/http/si) {
        $processed_text .= $1;
    } else {
        $processed_text .= $unprocessed_text;
        last;
    }
    if (not(@markers)){
        $processed_text .= $unprocessed_text;
        warn "no more markers available for remaining links\n";
        last;
    }
    if($unprocessed_text =~ s/(http[\S]+)\s+//){
        my $urlfound = $1 . shift (@markers);
        $urlfound =~ s|//|/|g;
        $processed_text .= "$urlfound\n";
    } elsif($unprocessed_text =~ s/(http[\S]+)$//){
        my $urlfound = $1 . shift (@markers);
        $urlfound =~ s|//|/|g;
        $processed_text .= "$urlfound\n";
    }
}

-- cat
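One thing to watch with the double-slash cleanup: a global replace of `//` with `/` also collapses the `//` that follows `http:` or `https:`, so `http://www.somewhere.org/foo//qfk33pe` would come out as `http:/www.somewhere.org/foo/qfk33pe`. A possible variation, sketched here with a made-up helper name `join_url_marker` and assuming markers that start with a slash as in the comment above, skips any slash run that directly follows a colon:

```perl
use strict;
use warnings;

# Hypothetical helper: append a marker to a URL and collapse doubled
# slashes, leaving the "//" after the scheme's colon alone.
sub join_url_marker {
    my ($url, $marker) = @_;
    my $joined = $url . $marker;
    $joined =~ s#(?<!:)/{2,}#/#g;   # runs of 2+ slashes not preceded by ':'
    return $joined;
}
```

For example, `join_url_marker('http://www.somewhere.org/foo/', '/qfk33pe')` returns `http://www.somewhere.org/foo/qfk33pe`.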

