 
PerlMonks  

Re^3: Grab input from the user and Open the file

by AnomalousMonk (Archbishop)
on Nov 27, 2016 at 16:52 UTC


in reply to Re^2: Grab input from the user and Open the file
in thread Grab input from the user and Open the file

(Note: haukex, a faster typist than I, has already covered most of the points below, but I just want to emphasize that the regex in your code does match invalid IP addresses.)

Basically, you want to "associate" (hint, hint) a string that represents a "dotted decimal" IP address with a domain name and a count. The data structure (see perldsc) for this might look like

    my %IP = (  # create and initialize
        '198.144.186.184' => {
            'domain' => 'host.colocrossing.com',
            'count'  => 1,
        }
    );
    ...
    # add to structure
    my $captured_IP     = ...;
    my $captured_domain = ...;
    $IP{$captured_IP}{domain} = $captured_domain;
    $IP{$captured_IP}{count}++;
    ...
    # print structure -- see perldsc
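A minimal, self-contained sketch of that hash-of-hashes idea, using only core Perl. The sub name `tally_ips`, the `@captured` sample list, and the second address and domain are hypothetical, made up for illustration; in real use the pairs would come from your log records.

```perl
use strict;
use warnings;

# Hypothetical sample data: (IP, domain) pairs as they might be captured
# from log records. The second address and domain are invented.
my @captured = (
    [ '198.144.186.184', 'host.colocrossing.com' ],
    [ '203.0.113.7',     'example.net'           ],
    [ '198.144.186.184', 'host.colocrossing.com' ],
);

# Build the hash of hashes: one entry per IP, holding a domain and a count.
sub tally_ips {
    my @pairs = @_;
    my %ip;
    for my $pair (@pairs) {
        my ($addr, $domain) = @$pair;
        $ip{$addr}{domain} = $domain;   # keeps only the most recent domain
        $ip{$addr}{count}++;            # inner hash is autovivified
    }
    return \%ip;
}

my $ip = tally_ips(@captured);

# Print the structure, sorted by IP string so the order is stable.
for my $addr (sort keys %$ip) {
    printf "%-15s %-25s %d\n",
        $addr, $ip->{$addr}{domain}, $ip->{$addr}{count};
}
```

Returning a reference keeps the sub easy to reuse; the plain `%IP` hash shown above works just as well inside a single script.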

Of course, you must already have captured an IP and a domain name from a "POSSIBLE BREAK-IN ATTEMPT!" record, which you seem to be able to identify. It's always best to use what's tried and tested: CPAN.

The module Regexp::Common::net (but first see Regexp::Common for how to use Regexp::Common::net) contains regexes for matching various forms of IP address. Note that the regex pattern you're using,
    /(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/
allows a match with, e.g., '999.999.999.999' or '99999.1.2.99999': the first is plain invalid; the second contains a possibly valid IP – or does it? A better regex can be built using Regexp::Common::net, maybe something like:

    use Regexp::Common qw(net);
    ...
    my $dotted_decimal_ip = qr{ (?<! \d) $RE{net}{IPv4} (?! \d) }xms;
    my $domain_name = qr{ ... }xms;
    ...
    my ($break_in_ip)     = $break_in_message =~ m{ ($dotted_decimal_ip) }xms;
    my ($break_in_domain) = $break_in_message =~ m{ ($domain_name) }xms;
    $IP{$break_in_ip}{domain} = $break_in_domain;
    $IP{$break_in_ip}{count}++;
    ...

Note that this approach only captures the most recent domain name associated with a particular IP address. I've been a bit vague about the domain-name regex. Such a regex can be quite involved if it must cover all possible domain names, but you may not need anything like this level of coverage. I leave it to you to search CPAN for an appropriate module.
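To see the difference between the loose pattern and a bounded one without installing anything, here is a rough core-Perl sketch. The `$octet` and `$strict_ip` patterns are hand-rolled stand-ins written for this illustration, not what Regexp::Common::net provides, and `find_ip` is a hypothetical helper name. As a simplifying assumption, octets with leading zeros are rejected.

```perl
use strict;
use warnings;

# The loose pattern from the thread: any four runs of 1-3 digits.
my $loose_ip = qr{ (\d{1,3} \. \d{1,3} \. \d{1,3} \. \d{1,3}) }xms;

# One octet, 0 through 255 (leading zeros are not accepted here).
my $octet = qr{ (?: 25[0-5] | 2[0-4]\d | 1\d\d | [1-9]\d | \d ) }xms;

# Four octets that are not embedded in a longer run of digits.
my $strict_ip = qr{ (?<! \d ) ( $octet (?: \. $octet ){3} ) (?! \d ) }xms;

# Return the first strictly valid dotted-decimal address, or undef.
sub find_ip {
    my ($text) = @_;
    my ($found) = $text =~ $strict_ip;
    return $found;
}

for my $text ('from 198.144.186.184 port 22',
              '999.999.999.999',
              '99999.1.2.99999') {
    my ($loose) = $text =~ $loose_ip;
    my $strict  = find_ip($text);
    printf "%-30s loose=%-17s strict=%s\n",
        $text, $loose // 'none', $strict // 'none';
}
```

The loose pattern reports a match for all three strings, while the bounded one accepts only the first. For production use, the tested regexes on CPAN remain the safer choice.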


Give a man a fish:  <%-{-{-{-<

Replies are listed 'Best First'.
Re^4: Grab input from the user and Open the file
by valerydolce (Novice) on Nov 27, 2016 at 18:23 UTC
    Thanks again for your input. I think I have a lot to learn, as your posts include some concepts that I haven't covered yet.
Re^4: Grab input from the user and Open the file
by valerydolce (Novice) on Nov 27, 2016 at 20:33 UTC

    Since I haven't explored modules, I used an if statement to match all possible URLs, as shown below:

    if (m/^(((ht|f)tp(s?))\://)?(www.|[a-zA-Z].)[a-zA-Z0-9\-\.]+\.(com|edu|gov|mil|net|org|biz|info|name|museum|us|ca|uk)(\:[0-9]+)*(/($|[a-zA-Z0-9\.\,\;\?\'\\\+&amp;%\$#\=~_\-]+))*$/){
        my($host,$path) = ($4,$5);
        print "$host => $path\n";
    }

    I'm having the following error after testing

  • Unmatched ( in regex; marked by <-- HERE in m/^( <-- HERE ((ht|f)tp(s?))\:/ at challenge2.pl line 44.
      if (m/^(((ht|f)tp(s?))\://)?(www.|[a-zA-Z].)[a-zA-Z0-9\-\.]+\.(com|edu|gov|mil|net|org|biz|info|name|museum|us|ca|uk)(\:[0-9]+)*(/($|[a-zA-Z0-9\.\,\;\?\'\\\+&amp;%\$#\=~_\-]+))*$/){
          my($host,$path) = ($4,$5);
          print "$host => $path\n";
      }

      Some thoughts:

      • You cannot use a  m// delimiter character (e.g.,  / forward slash) within a regex without escaping it. Your quoted regex
            m/^(((ht|f)tp(s?))\://)?...(/($|[a-z...]+))*$/
        has  / delimiter characters within it. I have rewritten the regex below with my favorite, balanced curlies, as delimiters.
      • The  /x regex modifier and whitespace are your friends (see Modifiers). As it stands, your regex is almost unreadable (at least by me) even if it were correct. Rewriting the regex as syntactically (but not necessarily semantically) correct:
            if (m{ ^
                   (   # open capturing group 1 (was unbalanced)
                   ((ht|f)tp(s?) \: //)?   # removed close paren after (s?)
                   (www.|[a-zA-Z].)
                   [a-zA-Z0-9\-\.]+
                   \.
                   (com|edu|gov|mil|net|org|biz|info|name|museum|us|ca|uk)
                   (\:[0-9]+)*
                   (/($|[a-zA-Z0-9\.\,\;\?\'\\\+&amp;%\$#\=~_\-]+))*
                   ) $   # (maybe?) added for balance: close capture group 1
                 }x
               ) {
                my($host,$path) = ($4,$5);
                print "$host => $path\n";
            }
        I can identify parts of this regex, but what, for instance, is  (www.|[a-zA-Z].) ('www' followed by any character, or else a single alpha character followed by any character) supposed to match in a URL? (This is immediately followed by  [a-zA-Z0-9\-\.]+ \. without any delimiter.)
      • In  (com|...|museum|us|ca|uk) the  us ca uk bit is suspicious. These are (I think) country codes, and should appear as part of a ccTLD like www.bbc.co.uk or www.what.ever.ac.ca and not on their own, as your regex allows. (Again, I'm sure CPAN has a module with regexes for matching URLs.)
      • Is  (/($|[a-zA-Z0-9\.\,\;\?\'\\\+&amp;%\$#\=~_\-]+))* really supposed to contain &amp;? (This may be an artifact of Perlmonks site rendering.)
      • Other parts I just can't figure out. (This may simply be due to ignorance on my part.)
      • A nit: Non-capturing groups (see  (?:pattern) in Extended Patterns in perlre) are your friends. Entirely too much stuff is captured unnecessarily in this regex for my taste.
      • A similar nit: Far too many characters are escaped unnecessarily; not every non-alphanumeric needs escaping.
      Bottom line: I doubt this regex would do what you intend even if it would compile (and the rewritten version is at least syntactically correct). Yet again: CPAN.
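      To illustrate the points above about alternative delimiters, /x, and non-capturing groups, here is a deliberately simplified sketch. The pattern `$simple_url` and the helper `split_url` are hypothetical names invented for this example; the pattern does not attempt to cover all valid URLs and is not a substitute for a CPAN module. It assumes Perl 5.10 or later for named captures and `//`.

```perl
use strict;
use warnings;

# A deliberately simplified URL pattern (NOT full RFC 3986 coverage).
# Curly delimiters mean the forward slashes need no escaping.
my $simple_url = qr{
    \A
    (?: (?: https? | ftp ) :// )?                      # optional scheme, not captured
    (?<host> [A-Za-z0-9-]+ (?: \. [A-Za-z0-9-]+ )+ )   # dotted host name
    (?: : \d+ )?                                       # optional port, not captured
    (?<path> / [^\s?\#]* )?                            # optional path, up to ? or the fragment mark
}x;

# Return (host, path) for a matching string, or an empty list.
sub split_url {
    my ($url) = @_;
    return unless $url =~ $simple_url;
    return ( $+{host}, $+{path} // '/' );
}

for my $url ('http://www.bbc.co.uk/news?x=1', 'perlmonks.org', 'not a url') {
    my ($host, $path) = split_url($url);
    print defined($host) ? "$url: host=$host path=$path\n"
                         : "$url: no match\n";
}
```

      Only the two pieces of interest are captured, and by name, so there is no counting of parentheses to work out which of $4 or $5 holds the host.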


      Give a man a fish:  <%-{-{-{-<

      Hi valerydolce,

      Designing your own regex is certainly very good practice. For a production system I'd recommend existing modules, for example Regexp::Common to find URIs and URI to parse them. For example:

      use warnings;
      use strict;

      my $str = <<'END_STR';
      I am an example http://www.perlmonks.org/?parent=1176663;node_id=3333 text that contains <https://perlmonks.pair.com/?node_id=1176651> two URIs
      END_STR

      use Regexp::Common qw/URI/;
      use URI;

      while ($str=~/$RE{URI}{-keep}/g) {
          my $uri = URI->new($1);
          print "$uri\n";
          print "  Scheme: ", $uri->scheme, "\n";
          print "    Host: ", $uri->host,   "\n";
          print "    Path: ", $uri->path,   "\n";
          print "   Query: ", $uri->query,  "\n";
      }

      See the URI documentation for lots more ways to access the different parts of the URI. I did notice that unfortunately Regexp::Common apparently doesn't match the #fragment part of the URI, so here's an attempt at an alternate solution, using a regex based on the characters allowed in URIs from RFC 3986.

      # NOTE this is based on a quick skim of RFC 3986 and may not be complete!
      my $url_re = qr{
          # https://tools.ietf.org/html/rfc3986#section-2
          # URI = scheme ":" hier-part ...; hier-part = "//" ...
          # scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
          [A-Za-z][A-Za-z0-9+\-.]* ://
          # gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
          # sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
          #            / "*" / "+" / "," / ";" / "="
          # unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
          (   [:/?#\[\]@!\$&'()*+,;=A-Za-z0-9\-._~]
          # pct-encoded = "%" HEXDIG HEXDIG
          |   %[0-9A-Fa-f]{2} )*
      }x;
      while ($str=~/($url_re)/g) {
          my $uri = URI->new($1);
          print "$uri\n";
      }

      Hope this helps,
      -- Hauke D
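
      Once a URI has been found, and if installing the URI module is not an option, its parts (including the #fragment) can be separated with the component-splitting regex given in RFC 3986, Appendix B. The following is only a sketch: the helper name `uri_components` and the `#top` fragment on the sample URI are made up for illustration, and the regex splits a string into parts without validating it.

```perl
use strict;
use warnings;

# Component-splitting regex from RFC 3986, Appendix B ("\#" is a literal #).
my $uri_parts = qr{\A(([^:/?\#]+):)?(//([^/?\#]*))?([^?\#]*)(\?([^\#]*))?(\#(.*))?}s;

# Split a URI string into its five generic components.
# Components that are absent come back as undef.
sub uri_components {
    my ($uri) = @_;
    $uri =~ $uri_parts or return;
    my %parts = (
        scheme    => $2,
        authority => $4,
        path      => $5,
        query     => $7,
        fragment  => $9,
    );
    return \%parts;
}

# Hypothetical sample: the "#top" fragment is invented for this example.
my $parts = uri_components('http://www.perlmonks.org/?parent=1176663;node_id=3333#top');
for my $name (qw(scheme authority path query fragment)) {
    printf "%-10s %s\n", $name, $parts->{$name} // '(none)';
}
```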
