PerlMonks  

Filtering one array using another array

by teabag (Pilgrim)
on Sep 04, 2003 at 13:07 UTC ( [id://288854]=perlquestion )

teabag has asked for the wisdom of the Perl Monks concerning the following question:

Fellow Monks, lend me your ears...

I'm trying to write a URL grabber and I thought it would be nice to include a sort of blacklisting system with it. The code for the grabbing/webscraping works perfectly, but I'm having problems with the blacklisting system.

I have 2 arrays. One (@sites) contains the grabbed URLs, the other one (@blacklist) contains the blacklisted keywords. The code I came up with does "kinda" spot the blacklisted URLs, but it is clumsy, slow and inefficient, and its output still has to be filtered.

Could anyone point out an easier (and faster) way?

    #!/usr/bin/perl
    # example blacklisting

    @sites = (
        "http://www.rtfm.com",
        "http://www.alottatax.com",
        "http://www.kingdom.com/cgi-bin/script.pl"
    );
    @blacklist = ( "cgi", "blabla", "testme" );

    foreach $site (@sites) {
        &blacklist();
    }

    sub blacklist {
        foreach $blacklist (@blacklist) {
            if ( $site =~ m/$blacklist/gi ) {
                print "$site blacklisted - $blacklist\n";
            }
            else {
                print "$site ok\n";
            }
        }
    }

Teabag
I'm sure there's more than one way, but /me just needs one anyway - Teabag

Replies are listed 'Best First'.
Re: Filtering one array using another array
by Abigail-II (Bishop) on Sep 04, 2003 at 13:27 UTC
    You mean that a site is blacklisted, if one of the elements of @blacklist matches against the url? In that case, I'd make a single regex out of @blacklist. Something like (untested):
        my $re = join "|" => map {"(?:$_)"} @blacklist;
        my @blacklisted = grep {/$re/} @sites;

    Abigail
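
A slightly fuller sketch of the single-regex idea above. It makes two assumptions that are not in the reply: the blacklist entries are literal keywords rather than patterns (hence the quotemeta call), and matching should be case-insensitive, as in the original question. The sub name build_blacklist_re is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: build one alternation regex from literal keywords.
# quotemeta escapes regex metacharacters; drop it if the entries are patterns.
sub build_blacklist_re {
    my @keywords = @_;
    my $alternation = join '|', map { quotemeta($_) } @keywords;
    return qr/(?:$alternation)/i;
}

my @sites = (
    "http://www.rtfm.com",
    "http://www.alottatax.com",
    "http://www.kingdom.com/cgi-bin/script.pl",
);
my @blacklist = ( "cgi", "blabla", "testme" );

my $re = build_blacklist_re(@blacklist);

# One pass per list: URLs that match any keyword, and URLs that match none.
my @blacklisted = grep { /$re/ } @sites;
my @allowed     = grep { !/$re/ } @sites;

print "blacklisted: $_\n" for @blacklisted;
print "ok: $_\n"          for @allowed;
```

Note that an empty keyword list would produce a pattern that matches every URL, so a caller would probably want to check for that case first.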

      Excellent Abigail-II!
      That works perfectly for me, thanks!

      Teabag
      Sure there's more than one way, but one just needs one anyway - Teabag
Re: Filtering one array using another array
by broquaint (Abbot) on Sep 04, 2003 at 13:33 UTC
    How's about
        use strict;

        my @sites = (
            "http://www.rtfm.com",
            "http://www.alottatax.com",
            "http://www.kingdom.com/cgi-bin/script.pl"
        );
        my $blacklist = join '|', map quotemeta, qw/ cgi blabla testme /;

        for(@sites) {
            printf "%s %s\n", $_, $_ =~ /($blacklist)/ ? "blacklisted - $1" : 'ok';
        }

        __output__
        http://www.rtfm.com ok
        http://www.alottatax.com ok
        http://www.kingdom.com/cgi-bin/script.pl blacklisted - cgi

    See perlre and quotemeta for more info. Also a module that might be of interest: Regex::PreSuf.
    HTH

    _________
    broquaint

      Perfect broquaint!
      I understand the very simple regexes, but get lost when they get complicated.

      I'll check out that module you mentioned right away. Sounds very handy.

      Thanks

      Teabag
      Sure there's more than one way, but one just needs one anyway - Teabag

Re: Filtering one array using another array
by gjb (Vicar) on Sep 04, 2003 at 13:24 UTC

    Put the blacklisted sites in a hash rather than a list, that way you can do a fast lookup. A bit cleaner would be to put them in a Set::Scalar, but this is a matter of taste. Since you need approximate matching, you could have a look at Tie::Hash::Approx.

    If you'd like to get a list of URLs not in the blacklist, you could use a grep.

    Hope this helps, -gjb-

    Update: Oops, I should thoroughly read a post before replying, I didn't notice that a regex match was required. Thanks Abigail-II.

      Put the blacklisted sites in a hash rather than a list, that way you can do a fast lookup.

      Eh, hashes are great for exact matches, but exact matching isn't what the OP is doing. He's matching against regexes, and hashes aren't going to help him there.

      Abigail
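
To illustrate the point above: a hash lookup only answers "is this exact string a key?", so it cannot tell whether a keyword occurs somewhere inside a URL. A small sketch, with variable names made up for the example:

```perl
use strict;
use warnings;

my %blacklist = map { $_ => 1 } qw( cgi blabla testme );
my $url = "http://www.kingdom.com/cgi-bin/script.pl";

# Exact lookup: the whole URL is not a key in the hash, so this is 0.
my $exact_hit = exists $blacklist{$url} ? 1 : 0;

# Substring test: 1, because the keyword "cgi" occurs inside the URL.
my $substring_hit =
    ( grep { index( lc($url), lc($_) ) >= 0 } keys %blacklist ) ? 1 : 0;

print "exact lookup: $exact_hit, substring match: $substring_hit\n";
```

This prints "exact lookup: 0, substring match: 1". The substring test still has to visit every keyword, which is why the hash buys nothing here.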

Re: Filtering one array using another array
by Kimi_1973 (Initiate) on Sep 04, 2003 at 13:40 UTC
    Try the following:
        #!/usr/bin/perl
        # example blacklisting

        @sites = (
            "http://www.rtfm.com",
            "http://www.alottatax.com",
            "http://www.kingdom.com/cgi-bin/script.pl"
        );
        @blacklist = ( "cgi", "blabla", "testme" );

        foreach $site (@sites) {
            &blacklist();
        }

        sub blacklist {
            foreach $blacklist (@blacklist) {
                if ( $site =~ m/$blacklist/gi ) {
                    print "$site blacklisted - $blacklist\n";
                    return; #Change1
                }
                else {
                    print "$site ok\n";
                    return; #Change2
                }
            }
        }
      Yup, that works in this example. But it seems to be working only for the first word in @blacklist?


      Teabag
      Sure there's more than one way, but one just needs one anyway - Teabag

      Please ignore Change1 and Change2.
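
Why only the first word is tested: in the code in this sub-thread both branches of the if return during the first loop iteration, so the remaining keywords are never tried. One possible fix, sketched here under the assumption that the keywords are literal strings (hence \Q...\E), is to return early only on a match and report "ok" after the loop has finished. The sub name first_blacklisted_keyword is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first keyword found in $site, or undef.
sub first_blacklisted_keyword {
    my ( $site, @keywords ) = @_;
    for my $keyword (@keywords) {
        # Leave the loop early only when a keyword actually matches.
        return $keyword if $site =~ /\Q$keyword\E/i;
    }
    return;    # reached only after every keyword has been tried
}

# "cgi" is deliberately placed last to show that all keywords are checked.
my @blacklist = ( "blabla", "testme", "cgi" );

for my $site ( "http://www.rtfm.com", "http://www.kingdom.com/cgi-bin/script.pl" ) {
    my $hit    = first_blacklisted_keyword( $site, @blacklist );
    my $status = defined $hit ? "blacklisted - $hit" : "ok";
    print "$site $status\n";
}
```

Passing the site and the keywords as arguments, rather than relying on the globals $site and @blacklist, also lets the sub run under use strict.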
