multiple matches per line *AND* multiple capture groups per match

Special_K has asked for the wisdom of the Perl Monks concerning the following question:

I have the following precompiled regexp:

$test_regexp = qr/url="(http:\/\/downloads\.bbc\.co\.uk\/podcasts\/worldservice\/globalnews\/(globalnews_${year}${mon}${mday}-\d{4}[a-z]\.mp3))"/;

As you can see, I have 2 capture groups built into this match. One captures the complete URL, the other capture gets just the filename itself. The webpage I use this particular regexp on may contain multiple valid matches on the same line. I would like to capture both the complete URL and the filename for each one. If it matters (and I don't believe it does), the precompiled regexp is passed to a function that dumps the webpage to a file, opens the file with the TEMP_XML_FILE handle, and searches for the $test_regexp matches on each line. Right now I have this:

while (<TEMP_XML_FILE>)
{

    if ((@complete_url, @filename) = ($_ =~ /$test_regexp/g))
    {
                
        printf("found %d matches\n", scalar(@filename));
        <>;
                
        for ($i = 0; $i < @filename; $i++)
        {
            printf("filename = %s, complete_url = %s\n", $filename[$i], $complete_url[$i]);
            <>;
        }

The problem is that the printf statement is reporting 0 matches. After reading through the entire file, I want the @complete_url array to contain the complete list of URLs, and the @filename array to contain the complete list of filenames. How can I accomplish this? I realize I might be able to capture just the complete url and derive the filenames from it in a separate step, but for the sake of this discussion how can I capture both the filenames and urls into their respective arrays when there could be multiple matches per line?


Replies are listed 'Best First'.
Re: multiple matches per line AND multiple capture groups per match by hdb (Monsignor) on Dec 21, 2013 at 18:24 UTC
In the assignment `(@complete_url, @filename) = ($_ =~ /$test_regexp/g)`, all matches are assigned to the array `@complete_url`, because the first array in a list assignment absorbs every value and leaves `@filename` empty. I would expect both the URLs and the filenames to end up in that one array, so you need to split it into two afterwards.
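To make the point above concrete, here is a minimal sketch. The sample line and the simplified pattern are assumptions for illustration (hypothetical example.com URLs standing in for the real `$test_regexp`); only the list-assignment behaviour matters.

```perl
use strict;
use warnings;

# Hypothetical sample line with two url="..." attributes.
my $line = 'url="http://example.com/a/one.mp3" url="http://example.com/b/two.mp3"';

# Simplified stand-in for $test_regexp: group 1 is the URL, group 2 the filename.
my $rx = qr{url="(http://example\.com/\w+/(\w+\.mp3))"};

# The first array on the left absorbs the whole list; @filename gets nothing.
my (@complete_url, @filename) = $line =~ /$rx/g;

printf "complete_url has %d elements, filename has %d\n",
    scalar @complete_url, scalar @filename;
print "$_\n" for @complete_url;
```

The URLs and filenames come back interleaved in `@complete_url`, which is why the original `printf` reported 0 for `@filename`.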
Re^2: multiple matches per line AND multiple capture groups per match by educated_foo (Vicar) on Dec 21, 2013 at 23:25 UTC
This. Your solution is fine, but you need to do a bit of postprocessing, e.g.

    if (@groups = /$test_regexp/g) {
        while (@groups) {
            ($url, $file) = splice @groups, 0, 2;
            # ...
        }
    }

or use a loop:

    while (/$test_regexp/g) {
        ($url, $file) = ($1, $2);
        # ...
    }
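As a rough sketch of how the loop form could fill both arrays across a whole file, assuming a list of hypothetical input lines in place of the `TEMP_XML_FILE` handle and a simplified example.com pattern in place of `$test_regexp`:

```perl
use strict;
use warnings;

# Hypothetical input lines; the second one contains two matches.
my @lines = (
    'no match here',
    'url="http://example.com/a/one.mp3" x url="http://example.com/b/two.mp3"',
    'url="http://example.com/c/three.mp3"',
);

# Simplified stand-in for $test_regexp: group 1 is the URL, group 2 the filename.
my $rx = qr{url="(http://example\.com/\w+/(\w+\.mp3))"};

my (@complete_url, @filename);
for my $line (@lines) {
    # In scalar context, /g resumes after the previous match on each iteration.
    while ($line =~ /$rx/g) {
        push @complete_url, $1;
        push @filename,     $2;
    }
}

printf "filename = %s, complete_url = %s\n", $filename[$_], $complete_url[$_]
    for 0 .. $#filename;
```

Because each iteration pushes one value onto each array, the two arrays stay index-aligned however many matches a line holds.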
Re^3: multiple matches per line AND multiple capture groups per match by Special_K (Monk) on Dec 22, 2013 at 01:41 UTC
Thanks, that also solves my problem.
Re^2: multiple matches per line AND multiple capture groups per match by Special_K (Monk) on Dec 22, 2013 at 01:39 UTC
Thanks, this answers my original question. So in general, is it not possible to do a multi-array assignment in a single line, i.e.: (@a, @b) = <some_expression>
Re^3: multiple matches per line AND multiple capture groups per match by kcott (Archbishop) on Dec 22, 2013 at 05:48 UTC
"So in general, is it not possible to do a multi-array assignment in a single line, i.e.: (@a, @b) = <some_expression>"

That's correct with the syntax you're using there; however, you can do it with references. Here's a rather contrived example to demonstrate.

    #!/usr/bin/env perl -l
    use strict;
    use warnings;

    my ($letters, $digits) = get_arrays();

    print "Letters REF: $letters";
    print "Letters: @$letters";
    print for @$letters;

    print "Digits REF: $digits";
    print "Digits: @$digits";
    print for @$digits;

    sub get_arrays {
        my @three_letters = qw{A B C};
        my @three_digits = qw{1 2 3};
        return (\@three_letters, \@three_digits);
    }

Output:

    Letters REF: ARRAY(0x7ff684047ad0)
    Letters: A B C
    A
    B
    C
    Digits REF: ARRAY(0x7ff684047938)
    Digits: 1 2 3
    1
    2
    3

If you're unfamiliar with references, a good place to start is "perlreftut - Mark's very short tutorial about references". In the "The Rest" section, you'll find links to more detailed documentation on this topic.

-- Ken
Re: multiple matches per line AND multiple capture groups per match by roboticus (Chancellor) on Dec 21, 2013 at 18:10 UTC
Special_K:

The URL part is always constant, so instead just capture the filenames, then build the URLs:

    if (@filename = ($_ =~ /$test_regexp/g)) {
        my @complete_urls = map { "http://...." . $_ } @filename;
        ...
    }

Alternatively, capture the URL and split off the filenames:

    if (@complete_urls = ($_ =~ /$test_regexp/g)) {
        my @filename = map { s{^.*/}{}; $_ } @complete_urls;
        ...
    }

...roboticus

When your only tool is a hammer, all problems look like your thumb.
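One caveat with the second variant: inside `map`, `$_` is an alias for each element, so a substitution on `$_` also rewrites the strings in `@complete_urls`. A minimal sketch of a non-destructive alternative follows; the example.com URLs are hypothetical, and it matches the last path component instead of deleting the prefix.

```perl
use strict;
use warnings;

# Hypothetical URLs standing in for the captured matches.
my @complete_urls = (
    'http://example.com/a/one.mp3',
    'http://example.com/b/two.mp3',
);

# Capture everything after the last slash; the URL strings are left untouched.
my @filename = map { m{([^/]+)\z} ? $1 : () } @complete_urls;

print "$_\n" for @filename;
```

On Perl 5.14 or later, `s{^.*/}{}r` inside the `map` would be another way to get the stripped copy without modifying the original.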
Re: multiple matches per line AND multiple capture groups per match by Laurent_R (Canon) on Dec 21, 2013 at 18:54 UTC
Since you have so many slashes in your regex, I would suggest that you use a different delimiter. This lets you write slashes without escaping them and makes the regex much more readable. For example:

    $test_regexp = qr[url="(http://downloads\.bbc\.co\.uk/podcasts/worldservice/globalnews/ ... mp3))"];
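For illustration, here is a sketch of the question's full pattern rewritten with brace delimiters. The date values and the sample line are assumptions made up for this example, since the original post does not show how `$year`, `$mon` and `$mday` are set.

```perl
use strict;
use warnings;

# Hypothetical date parts; the original post does not show where these come from.
my ($year, $mon, $mday) = ('2013', '12', '21');

# Same pattern as in the question, written with qr{...} so slashes need no escaping.
my $test_regexp = qr{url="(http://downloads\.bbc\.co\.uk/podcasts/worldservice/globalnews/(globalnews_${year}${mon}${mday}-\d{4}[a-z]\.mp3))"};

# Hypothetical sample line in the format the pattern expects.
my $line = 'url="http://downloads.bbc.co.uk/podcasts/worldservice/globalnews/globalnews_20131221-1400a.mp3"';

if ($line =~ $test_regexp) {
    print "url = $1\nfilename = $2\n";
}
```

Bracketing delimiters such as `{}` and `[]` nest, so the `\d{4}` quantifier and the `[a-z]` class inside the pattern do not end it early.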
Re^2: multiple matches per line AND multiple capture groups per match by Special_K (Monk) on Dec 22, 2013 at 01:43 UTC
Wow, I didn't even know you could change the beginning and ending delimiter of a regexp. Is that only if you use the qr function, or can you change the delimiter in any context?
Re^3: multiple matches per line AND multiple capture groups per match (delimiters) by kcott (Archbishop) on Dec 22, 2013 at 05:57 UTC
"Is that only if you use the qr function, or can you change the delimiter in any context?"

Not "any", but certainly "many". Take a look at "perlop: Quote and Quote-like Operators".

-- Ken
Re^3: multiple matches per line AND multiple capture groups per match by Laurent_R (Canon) on Dec 22, 2013 at 10:28 UTC
You can do that with the quote and quote-like operators, but also, for regexes, with the m// and the s/// operators, which can be written, for example, `m{...}` and `s[...]{...}` or even `m#...#`, etc., as shown in the following Perl one-liners:

    $ perl -e 'print $1 if "foobar" =~ m{f(oo)ba}'
    oo
    $ perl -e 'print $1 if "foobar" =~ m#f(oo)ba#'
    oo

Update: well, thinking again about what I wrote above, m// and s/// are in fact part of the quote and quote-like operators (so Ken's answer said it all), but I just wanted to point out that this can be done in direct regex constructs.
Re: multiple matches per line AND multiple capture groups per match by AnomalousMonk (Archbishop) on Dec 21, 2013 at 19:48 UTC
I think I prefer roboticus's alternate suggestion to extract complete URLs first, then extract the filename from each URL, but if it absolutely must be done "in one line", this might serve (Perl version 5.10+ needed for the `state` built-in, but this could be an ordinary `my` variable in the `for`-loop outside the `if` statement):

    >perl -wMstrict -le
    "use 5.010;
     ;;
     use List::MoreUtils qw(part);
     ;;
     my $rx = qr{ \b ([[:alpha:]]+ (\d+)) \b }xms;
     ;;
     for my $s (
       'foo abc333 bar de4444 baz fghi22 xyzzy jk123 z',
       'zzz123 xx yyyy12 xx xx1234',
       ) {
       if (my @matches = part { state $i = 0; $i++ % 2 } $s =~ m{ $rx }xmsg) {
         print qq{matched: full (@{$matches[0]}); digits (@{$matches[1]})};
         }
       else {
         print 'no matches';
         }
       }
    "
    matched: full (abc333 de4444 fghi22 jk123); digits (333 4444 22 123)
    matched: full (zzz123 yyyy12 xx1234); digits (123 12 1234)

See `List::MoreUtils::part`.
Re: multiple matches per line AND multiple capture groups per match by johngg (Canon) on Dec 21, 2013 at 19:39 UTC
You can use a ternary to push onto either the complete URL array or the filename array. To save space I have simplified the URLs and pattern but the principle would still hold for your data. Things would get more complicated if your URLs broke across lines.

    $ perl -Mstrict -Mwarnings -MData::Dumper -e '
    open my $xmlFH, q{<}, \ <<EOF or die $!;
    blarg http://a.b.co.uk/path/to/file.mp3 bloop http://x.y.com/stuff.mp3 blooble http://some.firm.com/downloads/glooble.mp3 sploffle
    EOF

    my $rx = qr{(?x) ( http:// .*? ( [^/]+ \.mp3 ) ) };
    my( @comp, @fn );
    my $xmlText = do { local $/; <$xmlFH>; };
    push @{ $_ =~ m{^http://} ? \ @comp : \ @fn }, $_
        for $xmlText =~ m{$rx}g;
    print Data::Dumper->Dumpxs( [ \ @comp, \ @fn ], [ qw{ *comp *fn } ] );'
    @comp = (
              'http://a.b.co.uk/path/to/file.mp3',
              'http://x.y.com/stuff.mp3',
              'http://some.firm.com/downloads/glooble.mp3'
            );
    @fn = (
            'file.mp3',
            'stuff.mp3',
            'glooble.mp3'
          );
    $

I hope this is helpful.

Update: Corrected unescaped dot in regex and added `(?x)` extended syntax to space things out for readability.

Cheers,

JohnGG

