Grabbing numbers from a URL

by htmanning (Friar)
on Jul 09, 2017 at 04:21 UTC ( [id://1194574]=perlquestion )

htmanning has asked for the wisdom of the Perl Monks concerning the following question:

Monks, I'm currently using the following to grab numbers out of a URL. The URL always has a similar form: it ends in 4 digits followed by .html (e.g. something-1234.html). $itemID is the URL that includes the digits.
$url=q|$itemID|;
$url=~/(\d{4})+\.htm/i;
$num = $url;
This works perfectly, but soon the 4 digits will turn to 5 digits as we move from 9999.html to 10000.html. I don't know how to accommodate the extra digit.

Thanks!

Replies are listed 'Best First'.
Re: Grabbing numbers from a URL
by Marshall (Canon) on Jul 09, 2017 at 05:37 UTC
    Usually in this situation, anchor the regex to the end of the string and report however many digits are before .htm or html. Unless there is some specific reason to disallow 1,2,3 digit numbers, I wouldn't code it that way. Go with something simple that works for 1,2,3,4,5,6,7,8 or more digits.
    #!/usr/bin/perl
    use strict;
    use warnings;

    foreach my $url ("abgc100.html", "xyz1000.htMl", "qwer10.html",
                     "abc123.htm", "qrz12345.htm", "something-12341234.html")
    {
        my ($number) = $url =~ /(\d+)\.htm(l)?$/i;
        print "$number\n";
    }
    __END__
    100
    1000
    10
    123
    12345
    12341234
    You have to decide about something-12341234.html. The code above captures 12341234 which is usually what is desired. Do you really only want to have 41234, 5 digits in that situation? I suspect not.
      Your final comment brings up a bigger problem with the specification. If the current four-digit format allows more than four digits (only the last four of which are used), without more information, it will be impossible to distinguish between this case and the new five-digit numbers. All solutions so far assume that if there are at least five digits, all five are part of the number. I like your solution because it extends this assumption to any number of digits.
      Bill
        I was immediately thinking that this "number" is intended to be a "unique id number". I suppose that could be a wrong assumption, but in my experience if there is some number like 12389799 and somehow those last 5 digits are "special", then the name should be "....123_89799.htm". My algorithm will work with that. If I have any influence over the file naming convention, I will put an "_" in there to separate the fields. In my opinion, stuff like a fixed 5 digit deal, maybe 00123.html is a very bad idea. Often these software things grow and maybe at some point a 6th digit is needed? Then what? In general, I like the idea of having the basic parsing being one thing and if needed the validation of those fields being a separate thing. If two digits aren't allowed, then if ($number < 100) {}. "....123_89799" and "....123_897" should be valid if the naming convention guy has his eye on the ball. These details DO matter.
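        The split between parsing and validation described above could look roughly like the sketch below. The helper names parse_number and is_valid_number and the "at least 100" rule are illustrative assumptions, not part of the original code; the regex is the same end-anchored idea as in the reply above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parsing step: return the digits immediately before .htm/.html at the
# end of the name, or undef when there are none.
sub parse_number {
    my ($name) = @_;
    my ($number) = $name =~ /(\d+)\.html?$/i;
    return $number;
}

# Validation step (hypothetical rule): accept only numbers >= 100.
sub is_valid_number {
    my ($number) = @_;
    return defined $number && $number >= 100;
}

foreach my $name ("abc_123_89799.htm", "abc_123_897.html",
                  "qwer10.html", "no-digits.html")
{
    my $number = parse_number($name);
    my $shown  = defined $number ? $number : 'none';
    my $status = is_valid_number($number) ? 'ok' : 'rejected';
    print "$name: $shown ($status)\n";
}
```

        Keeping the two steps apart means a change in the naming convention touches only the regex, while a change in what counts as a legal number touches only the validation.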
      Very slick! Thank you. Some of this is above my head.
Re: Grabbing numbers from a URL
by BrowserUk (Patriarch) on Jul 09, 2017 at 04:48 UTC

    Did you try changing \d{4} to \d{4,5}?

    m[(\d{4,5})\.htm] and print $1 for qw[ xxx1234.htm xxxx12345.htm ];;
    1234
    12345

    If you need to exclude urls that might contain more than 4 or 5 digits:

    m[(?<=\D)(\d{4,5})\.htm] and print $1 for qw[ xxx123.htm xxx1234.htm xxxx12345.htm xxx123456789.htm ];;
    1234
    12345

Re: Grabbing numbers from a URL
by kevbot (Vicar) on Jul 09, 2017 at 05:10 UTC
    Hello htmanning,

    Take a look at the documentation for regex quantifiers, and capture groups.

    Because of the + after the capture group, your pattern matches one or more consecutive groups of 4 digits. For example, something-1234.html will match, as will something-12341234.html. For matching only 4 digits, your pattern can be simplified to:

    $url=~/(\d{4})\.htm/i;
    Note that the + has been removed from your regex. Also, as your code is written, $num will not contain the number. It will contain the whole URL. To get just the number, you need to get the value of the first capture group:
    $num = $1;

    To allow for 4 or more digits, use the following

    $url=~/(\d{4,})\.htm/i;

    To allow for only 4 or 5 digits, use the following

    $url=~/(\d{4,5})\.htm/i;
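    As a rough illustration of how the three quantifiers above behave, the sketch below runs each pattern against a few sample names. The helper grab and the sample names are assumptions made for the demonstration. Note that none of these patterns is anchored, so a pattern can match the tail of a longer run of digits.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Return the captured digits, or undef if the pattern does not match.
# Assigning the match in list context avoids reading a stale $1 after
# a failed match.
sub grab {
    my ($pattern, $url) = @_;
    my ($num) = $url =~ $pattern;
    return $num;
}

# Label/pattern pairs kept in an array so the output order is fixed.
my @patterns = (
    [ '{4}'   => qr/(\d{4})\.htm/i   ],
    [ '{4,}'  => qr/(\d{4,})\.htm/i  ],
    [ '{4,5}' => qr/(\d{4,5})\.htm/i ],
);

for my $url (qw( something-123.html something-1234.html
                 something-12345.html something-123456.html ))
{
    for my $pair (@patterns) {
        my ($label, $pattern) = @$pair;
        my $num = grab($pattern, $url);
        printf "%-22s %-6s %s\n", $url, $label,
            defined $num ? $num : '(no match)';
    }
}
```

    With something-12345.html, the {4} pattern still matches and captures 2345, because the match is free to start at the second digit. If that is not wanted, anchor the pattern or require a non-digit before the number.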

    UPDATE: I really like the named capture groups feature that comes with perl versions 5.10 and greater. They can be overkill when you are only dealing with one or two groups, but can make the code much clearer if you are dealing with multiple capture groups.

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use v5.10;

    my $url = 'something-12345.html';

    $url =~ /(?<num>\d{4,5})\.htm/i;

    my $num = $+{num};
    print "$num\n";

    exit;
      Thanks so much. Now I'm thoroughly confused. Apparently the code I posted grabs the contents of $itemID, which includes a number but is the entire string, like something-1234.html. What I really need to do is simply capture the 4 digits in the string, or 5 digits if it is a 5-digit number. There would be no other numbers in any of the URLs. Perhaps there is a better way to achieve this. Can you tell me what each of these lines does? I think the first line takes the query string and sets it to $url. I'm not sure I know what the second line does.
      $url=q|$itemID|;
      $url=~/(\d{4})+\.htm/i;
      Here's what I'm doing. I'm using this recipe in the .htaccess file to run a perl script to display a page, but it displays a .html file in the browser.
      RewriteEngine on
      RewriteRule ^(.*)$ $1 [nc]
      RewriteRule ^(.*)$ /cgi-bin/getpage.pl?itemID=$1
      So in the script getpage.pl, I grab the $itemID with the code above and turn it into a filename. I search the database for a filename field that includes the page. Sometimes it's 1234.html, other times it's something-something-1234.html. It would really be best to simply grab the 1234 but I don't know how to do that.
        According to Quote and Quote Like Operators, the q operator does not interpolate the string. So, this code
        $url = q|$itemID|;
        results in $url containing the literal string $itemID, not the contents of $itemID. To test this out, run the following code:
        #!/usr/bin/env perl
        use strict;
        use warnings;

        my $itemID = 'something-12345.html';
        my $url = q|$itemID|;
        $url =~ /(\d{4,5})\.htm/i;

        print "Item ID: $itemID\n";
        print " URL: $url\n";

        my $num = $1;
        print " NUM: $1\n";

        exit;
        You should get a warning about an uninitialized value, since there is no number found in the URL and $1 is therefore undefined. If you change the code to use the qq operator, then the string is interpolated, the match succeeds and you get the number in the $1 capture group variable.
        #!/usr/bin/env perl
        use strict;
        use warnings;

        my $itemID = 'something-12345.html';
        my $url = qq|$itemID|;
        $url =~ /(\d{4,5})\.htm/i;

        print "Item ID: $itemID\n";
        print " URL: $url\n";

        my $num = $1;
        print " NUM: $1\n";

        exit;
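        For comparison, here is a small sketch that puts q, qq and a plain assignment side by side. The variable names $literal, $interpolated and $plain and the helper number_from are made up for the example. A plain assignment gives the same value as qq here, so the quoting step can probably be dropped altogether.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $itemID = 'something-12345.html';

my $literal      = q|$itemID|;    # no interpolation: the text "$itemID"
my $interpolated = qq|$itemID|;   # interpolation: a copy of the value
my $plain        = $itemID;       # plain assignment: also a copy

# Return the 4 or 5 digits before .htm, or undef if there is no match.
sub number_from {
    my ($url) = @_;
    my ($num) = $url =~ /(\d{4,5})\.htm/i;
    return $num;
}

for my $case ( [ 'q'     => $literal ],
               [ 'qq'    => $interpolated ],
               [ 'plain' => $plain ] )
{
    my ($label, $url) = @$case;
    my $num = number_from($url);
    printf "%-6s url=%-22s num=%s\n", $label, $url,
        defined $num ? $num : '(no match)';
}
```

        Checking whether the returned value is defined, instead of printing $1 unconditionally, also avoids the uninitialized-value warning when the match fails.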
        If you want a description of a regular expression in plain English, then you can use the YAPE::Regex::Explain module. Run this one-liner on your regex:
        perl -MYAPE::Regex::Explain -E 'say YAPE::Regex::Explain->new(q{(\d{4})+\.htm})->explain();'
        You can compare the result to the explanation of one of the modified regex patterns that I gave you. For example,
        perl -MYAPE::Regex::Explain -E 'say YAPE::Regex::Explain->new(q{(\d{4,5})\.htm})->explain();'
Re: Grabbing numbers from a URL
by Your Mother (Archbishop) on Jul 09, 2017 at 13:19 UTC

    You got some good pointers for some one-off code that will likely work just fine for you. None of the answers is particularly robust for real-world URI handling though. So to round it out–

    #!/usr/bin/env perl
    use strictures;
    use URI;
    use Path::Tiny;

    my @uris = map URI->new($_), <DATA>;

    for my $uri ( @uris )
    {
        if ( $uri =~ /(\d+)\.htm(l)?$/i )
        {
            print "Naive matched -> $uri\n";
        }
        else
        {
            print "Naive rejected -> $uri\n";
        }

        my $file = path( $uri->path )->basename;
        if ( $file =~ /\A[1-9][0-9]{3}\.html?\z/i )
        {
            print "Robust matched -> $uri\n";
        }
        else
        {
            print "Robust rejected -> $uri\n";
        }
        print "\n";
    }

    __DATA__
    https://moo.cow/queso/1234.HTM
    https://www.google.com/search?num=100&q=1234.htm
    http://moo.cow/queso/1234.htm?so=what#taco
    Naive matched -> https://moo.cow/queso/1234.HTM
    Robust matched -> https://moo.cow/queso/1234.HTM

    Naive matched -> https://www.google.com/search?num=100&q=1234.htm
    Robust rejected -> https://www.google.com/search?num=100&q=1234.htm

    Naive rejected -> http://moo.cow/queso/1234.htm?so=what#taco
    Robust matched -> http://moo.cow/queso/1234.htm?so=what#taco

    Update: removed superfluous, silly chomp from code.
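
    Where URI and Path::Tiny are not installed, the same idea can be roughly approximated with core Perl only. This is a sketch under assumptions, not a replacement for real URI parsing: it strips the query string and fragment by hand, takes the last path segment, and (unlike the check above) captures a 4- or 5-digit number that may follow a name prefix. The helpers file_part and item_number are hypothetical names.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Crude stand-in for URI->path plus basename: drop everything from the
# first ? or # onward, then keep what follows the last slash.
sub file_part {
    my ($uri) = @_;
    $uri =~ s/[?#].*\z//s;
    my ($file) = $uri =~ m{([^/]*)\z};
    return $file;
}

# Return a 4- or 5-digit number (no leading zero) that sits directly
# before .htm/.html in the file name, or undef otherwise. An optional
# prefix ending in a non-digit (e.g. "something-") is allowed.
sub item_number {
    my ($uri) = @_;
    my ($num) = file_part($uri)
        =~ /\A(?:.*\D)?([1-9][0-9]{3,4})\.html?\z/i;
    return $num;
}

for my $uri ( 'https://moo.cow/queso/1234.HTM',
              'https://www.google.com/search?num=100&q=1234.htm',
              'http://moo.cow/queso/1234.htm?so=what#taco',
              'http://moo.cow/something-something-12345.html' )
{
    my $num = item_number($uri);
    print defined $num ? "matched $num" : "rejected", " -> $uri\n";
}
```

    Hand-rolled splitting like this ignores many URI details (encoding, unusual schemes, and so on), so the URI module remains the safer choice when it is available.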
