PerlMonks  

regexp with repetition of non-capturing group

by pelagic (Priest)
on Mar 16, 2016 at 14:19 UTC ( [id://1157933]=perlquestion )

pelagic has asked for the wisdom of the Perl Monks concerning the following question:

Monks!
I got a string that I want to analyse, preferably with a regexp.
It looks like this, for example:
somthing apache24 up 11572 Mar 15 16:25 161.20.224.243:8808 161.20.224.243:8802 161.20.224.243:8809

The number of 'ip-address:port' strings at the end of the line can vary.
I am having difficulty capturing all the information, especially all occurrences of the 'ip-address:port' strings.
I tried the following regexp:
    qr/
        ^\s*
        (\S+)       ## something
        \s+
        (\S+)       ## apache24
        \s+
        (\S+)       ## up
        \s+
        (\S+)       ## 11572
        \s+
        (.*?)       ## date-time
        (?:         ## start non-capturing group for ports*
            \s+
            (       ## PORTS
            (?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2})
            [.]
            (?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2})
            [.]
            (?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2})
            [.]
            (?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2})
            [:]
            \d+
            )
        )+          ## end non-capturing group for ports*
        $
    /x

It captures only the last occurrence, but I want all of them.
Any ideas about this?


pelagic

Replies are listed 'Best First'.
Re: regexp with repetition of non-capturing group
by choroba (Cardinal) on Mar 16, 2016 at 14:22 UTC
    Quantifying a capture group doesn't create several capture groups, and a quantified capture group always returns only the last match.
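A minimal sketch of that behaviour, using a made-up input string rather than the original log line. The variable names are only for illustration:

```perl
use strict;
use warnings;

# Hypothetical sample: three "letter+digit" tokens separated by spaces.
my $text = 'a1 b2 c3';

# The capture group sits inside a quantified group, but it is still
# only ONE capture group ($1); each repetition overwrites it.
my @captures = $text =~ / ^ (?: \s* (\w\d) )+ $ /x;

# prints "1 capture: c3"
print scalar(@captures), " capture: @captures\n";
```

The match succeeds across all three tokens, yet only the final repetition is left in the single capture.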

    Wrap the repetition in another capture group to grab all the IP:port strings at once, then use split to separate them.

    qr/
        ^\s*
        (\S+)       ## something
        \s+
        (\S+)       ## apache24
        \s+
        (\S+)       ## up
        \s+
        (\S+)       ## 11572
        \s+
        (.*?)       ## date-time
        ((?:        ## start non-capturing group for ports*
            \s+
            (?:     ## PORTS
            (?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2})
            [.]
            (?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2})
            [.]
            (?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2})
            [.]
            (?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2})
            [:]
            \d+
            )
        )+)         ## end non-capturing group for ports*
        $
    /x;
    my @ips = split ' ', $6;

    Update: fixed the regex.
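For anyone who wants to run this end to end, here is a hedged sketch of the same idea: capture the whole address list as one group, then split it. To keep it short it assumes a simplified ip:port pattern with no octet range checks, and the names $ip_port, $re and $date_time are made up for the example:

```perl
use strict;
use warnings;

my $line = 'somthing apache24 up 11572 Mar 15 16:25 '
         . '161.20.224.243:8808 161.20.224.243:8802 161.20.224.243:8809';

# Simplified ip:port pattern (assumption: no octet range checks).
my $ip_port = qr/\d{1,3}(?:\.\d{1,3}){3}:\d+/;

my $re = qr/
    ^\s*
    (\S+) \s+ (\S+) \s+ (\S+) \s+ (\S+) \s+   # four leading fields
    (.*?)                                     # date-time
    ( (?: \s+ $ip_port )+ )                   # whole address list as ONE capture
    $
/x;

my ($date_time, @ips);
if ( $line =~ $re ) {
    $date_time = $5;
    @ips       = split ' ', $6;   # split ' ' also skips the leading space
    print "date-time: $date_time\n";
    print "$_\n" for @ips;
}
```

Because the quantifier now sits inside the sixth capture group, that group holds every repetition as one string, which split then breaks apart.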

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re: regexp with repetition of non-capturing group
by Old_Gray_Bear (Bishop) on Mar 16, 2016 at 15:51 UTC
    You said:
    I got a string that I want to analyse, preferably with a regexp.
    Insert obligatory line about when you only have a hammer....

    This is just what split was made for. Start with the eighth element (index 7) of the list returned by split and proceed until you reach the end. No need to confuse the issue with a fancy and fragile regex.....
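A rough sketch of the split-and-slice approach, assuming the line always starts with the seven fixed columns shown in the question, so the address tokens begin at index 7. The array names are made up for the example:

```perl
use strict;
use warnings;

my $line = 'somthing apache24 up 11572 Mar 15 16:25 '
         . '161.20.224.243:8808 161.20.224.243:8802 161.20.224.243:8809';

# split ' ' splits on runs of whitespace and ignores leading whitespace.
my @fields = split ' ', $line;

# Indexes 0..6 hold the seven fixed columns; everything from index 7
# onward is an ip-address:port token. An empty range yields an empty list.
my @ips = @fields[ 7 .. $#fields ];

print "$_\n" for @ips;
```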

    ----
    I Go Back to Sleep, Now.

    OGB

Re: regexp with repetition of non-capturing group
by Marshall (Canon) on Mar 17, 2016 at 00:00 UTC
    I agree with Old_Gray_Bear, split looks like the right solution here.
    my $line = 'somthing apache24 up 11572 Mar 15 16:25 161.20.224.243:8808 161.20.224.243:8802 161.20.224.243:8809';
    my ($col1,$col2,$col3,$size,$month,$date,$time,@ips) = split /\s+/, $line;
    print "@ips";
    __END__
    161.20.224.243:8808 161.20.224.243:8802 161.20.224.243:8809
    @ips will hold all of the tokens left after assigning the scalars at the beginning of the line. It will work fine even if there are 100 of them!

    It is possible to write a regex that does exactly what the split would do, but since splitting on whitespace works so well in this example, I didn't bother to do any regex testing. The key regex feature that applies here is "match global". If you have trouble getting everything to work as a single regex, there is nothing wrong with writing one regex to get the initial scalars and then running a second regex to get all of the IPs: @ips=/($ipregex)/g; The "g" at the end is what does the "match global" magic. In most cases, running a couple of regexes on the same line causes no performance problem.
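A sketch of the two-regex idea, under some assumptions: $ipregex here is a simplified stand-in that does not range-check the octets, and the scalar names other than $size are guesses at what the columns mean:

```perl
use strict;
use warnings;

my $line = 'somthing apache24 up 11572 Mar 15 16:25 '
         . '161.20.224.243:8808 161.20.224.243:8802 161.20.224.243:8809';

# Simplified pattern (assumption): no octet range checks.
my $ipregex = qr/\d{1,3}(?:\.\d{1,3}){3}:\d+/;

# First regex: only the four leading scalars.
my ($name, $service, $state, $size) =
    $line =~ /^\s*(\S+)\s+(\S+)\s+(\S+)\s+(\S+)/
    or die "unexpected line format: $line";

# Second regex: with /g in list context, the match returns every
# occurrence of the capture group, not just the last one.
my @ips = $line =~ /($ipregex)/g;

print "$name $service $state $size\n";
print "$_\n" for @ips;
```

The date and time fields contain no dotted quads, so the global match picks up only the three address tokens.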

    Update: You didn't show the left-hand side of your assignment, but what I show above is what you want, whether you use split or a regex. Of course you should have better names than $col1, $col2, etc. In general I recommend not messing around with $1, $2, etc.; get these values into a readable name ASAP. I very seldom have code like $tokens[3], because then you need a comment to describe what index 3 means. Far better is a name like $size, as I demonstrate.
