
Improving script's speed and performance...

by Anonymous Monk
on Dec 28, 2015 at 11:20 UTC ( [id://1151249] )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi PerlMonks,

I have implemented the following Perl script, which compares each entry of array1 with all entries of array2 in order to find the longest common subsequence in terms of words. The script seems to work fine, but it is too slow. Please note that array1 contains 1,000 entries while array2 contains more than 300,000. Is it possible to make it faster without changing the order of array1?

Thanks in advance for your help.

This is my code:

#!/usr/bin/perl
use String::LCSS_XS qw( lcss );
use utf8;
use warnings;

my $f1 = shift;
my $f2 = shift;

open (FILE1, "<:encoding(UTF-8)", "$f1") or die "can't open file '$f1' $!";
open (FILE2, "<:encoding(UTF-8)", "$f2") or die "can't open file '$f2' $!";

my @array1 = <FILE1>;
chomp(@array1);
close (FILE1);

my @array2 = <FILE2>;
chomp(@array2);
close (FILE2);

my $subseq;
my ($bestsource, $besttarget, $bestalignment, $bestlength);

for my $i (0 .. $#array1) {
    my $best_subseq       = "";
    my $best_subseq_words = 0;
    my $best_subseq_chars = 0;
    my $found             = 0;
    for my $j (0 .. $#array2) {
        $subseq = lcssw("$array1[$i]", "$array2[$j]");
        my $num_words = count_words($subseq);
        my $num_chars = count_chars($subseq);
        if ($num_words > $best_subseq_words && $num_chars > $best_subseq_chars) {
            $best_subseq       = $subseq;
            $best_subseq_words = $num_words;
            $best_subseq_chars = $num_chars;
            $found             = 1;
        }
    }
    if ($found == 1) {
        print "$best_subseq is the lcssw of $array1[$i]\n";
    }
}

sub lcssw {
    my ($s1, $s2) = @_;
    my $i;
    my %codes;
    my %words;
    for ($s1, $s2) {
        $_ = join '', map {
            $codes{$_} = chr(++$i) if !exists($codes{$_});
            $codes{$_};
        } $_ =~ /\w+/g;
    }
    my $lcss = lcss($s1, $s2);
    $lcss = "" if (!defined $lcss);
    @words{values %codes} = keys %codes;
    return join ' ', @words{ $lcss =~ /./sg };
}

sub count_words {
    my $line = shift;
    my @text_words = split(/\s+/, $line);
    return scalar(@text_words);
}

sub count_chars {
    my $line = shift;
    my @text_words = split(//, $line);
    return scalar(@text_words);
}

Replies are listed 'Best First'.
Re: Improving script's speed and performance...
by RichardK (Parson) on Dec 28, 2015 at 12:17 UTC

    How slow is too slow? You are running through the inner loop 300+ million times, so it might take a while.

    Have you tried profiling your code? It will tell you where most of the time is being spent, and then you'll know where improvements are needed.
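
    For example, Devel::NYTProf gives a per-subroutine and per-line breakdown (the script and file names below are just placeholders):

        # assumes Devel::NYTProf is installed from CPAN
        perl -d:NYTProf yourscript.pl file1.txt file2.txt
        nytprofhtml    # writes an HTML report under ./nytprof/

    That will tell you whether the time goes into lcss itself, the word/character counting, or the packing done in lcssw.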

    A couple of things that might help a bit -

    * You could use length to count the characters in a string, which should be quicker than creating an array each time (a sketch combining both suggestions follows the loop example below).

    * You don't care about the indexes of your arrays, so you could write your loops like this:

    for my $lhs (@array1) {
        ...
        for my $rhs (@array2) {
            lcssw($lhs, $rhs);
            ...
        }
    }
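
    Putting both of those together, the inner loop might look something like this (just a sketch, untested, keeping your variable names):

        for my $lhs (@array1) {
            my ($best_subseq, $best_subseq_words, $best_subseq_chars) = ("", 0, 0);
            for my $rhs (@array2) {
                my $subseq    = lcssw($lhs, $rhs);
                my $num_chars = length $subseq;            # length() instead of split //
                my $num_words = () = $subseq =~ /\S+/g;    # count matches in list context
                if ($num_words > $best_subseq_words && $num_chars > $best_subseq_chars) {
                    ($best_subseq, $best_subseq_words, $best_subseq_chars)
                        = ($subseq, $num_words, $num_chars);
                }
            }
            print "$best_subseq is the lcssw of $lhs\n" if length $best_subseq;
        }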
Re: Improving script's speed and performance...
by Discipulus (Canon) on Dec 28, 2015 at 12:35 UTC
Re: Improving script's speed and performance...
by Anonymous Monk on Dec 28, 2015 at 14:51 UTC
    sub lcssw {
        my ($s1, $s2) = @_;
        my $i;
        my %codes;
        my %words;
        for ($s1, $s2) {
            $_ = join '', map {
                $codes{$_} = chr(++$i) if !exists($codes{$_});
                $codes{$_};
            } $_ =~ /\w+/g;
        }
        my $lcss = lcss($s1, $s2);
        $lcss = "" if (!defined $lcss);
        @words{values %codes} = keys %codes;
        return join ' ', @words{ $lcss =~ /./sg };
    }
    Well, whatever you do, you should start right here. You call this function 300,000,000 times, and it does a lot of things. Does it actually increase performance (compared to just using lcss on the raw strings)? I mean, it might, but it allocates hashes and strings, accesses keys and values in hashes, runs the regex engine...

    Also, while your lcssw could be improved (for example, return '' if not defined $lcss), real optimizations depend on your data. Are there many strings that won't match? You could try to filter them out using some simpler algorithm, and then run the full version on others. How gigantic are your strings? If not too big, you can preprocess them (that is, do most of the work that lcssw does, but just once per array - at least for the smaller array).
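
    For example, a rough sketch of the "filter first" idea (the @word_sets name is made up here, and whether the pre-check pays off depends entirely on your data):

        # build a word set for every line of @array2, once
        my @word_sets;
        for my $line (@array2) {
            my %set = map { $_ => 1 } $line =~ /\w+/g;
            push @word_sets, \%set;
        }

        for my $i (0 .. $#array1) {
            my %w1 = map { $_ => 1 } $array1[$i] =~ /\w+/g;
            for my $j (0 .. $#array2) {
                # cheap pre-check: if the two lines share no word at all,
                # their word-level LCS is empty, so skip the expensive call
                next unless grep { $w1{$_} } keys %{ $word_sets[$j] };
                my $subseq = lcssw($array1[$i], $array2[$j]);
                # ... keep the best result as before ...
            }
        }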

    The whole thing looks pretty parallelizable; threads or forks could help you use more CPUs to process your arrays.
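
    A minimal forks sketch with Parallel::ForkManager (assuming it is installed; the worker count is arbitrary, and since you want results in @array1's order you would still have to collect each child's output, e.g. via run_on_finish, and print it in order):

        use Parallel::ForkManager;

        my $workers = 4;
        my $pm      = Parallel::ForkManager->new($workers);

        # split @array1 into one chunk per worker
        my @chunks;
        push @{ $chunks[ $_ % $workers ] }, $array1[$_] for 0 .. $#array1;

        for my $chunk (@chunks) {
            $pm->start and next;    # parent: move on to the next chunk
            for my $line (@$chunk) {
                # ... run the inner @array2 loop for $line here ...
            }
            $pm->finish;            # child exits
        }
        $pm->wait_all_children;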

    Anyway, without some representative examples of your data it's hard to tell, so how about posting some?

Re: Improving script's speed and performance...
by GrandFather (Saint) on Dec 28, 2015 at 20:52 UTC

    Give us some context so we can better understand what you want to do. We can endlessly play 20,000 questions to try and home in on an efficient solution for your data and objective. Or you can tell us what you need to achieve so we can determine things like "Does it always match from the start?" and "Are they really only 30-character sentences?".

    The more you tell us about your actual problem the better we are likely to be able to help you.

    Premature optimization is the root of all job security
Re: Improving script's speed and performance...
by Lennotoecom (Pilgrim) on Dec 28, 2015 at 12:09 UTC
    Please provide examples of what exactly is in your original files,
    and what you consider to be the "longest common subsequence in terms of words".
    Cheers
Re: Improving script's speed and performance...
by Anonymous Monk on Dec 28, 2015 at 15:21 UTC

    Hi, thank you for your replies.

    Here is an example of my files. Suppose that array1 contains the following sentence:

    This sentence is stored in array1

    and array2 contains the following sentences:

    Thank you for looking on this.
    This sentence is a test sentence.
    This sentence is stored in array2
    This is an other sentence.
    All variables are stored in array2

    I would like to compare the sentence of array1 with all the sentences of array2 so that my algorithm will extract the longest common subsequence (i.e. "This sentence is stored in").

    I will try to change the code and make it more efficient as you described above, but I would appreciate your help on this.

      OK. So you're saying your strings are pretty short and consist almost entirely of the words of the English language. The unused variable $bestalignment in your code makes me doubt that, but OK.

      Here's one approach (just an illustration, you should definitely debug and improve it). It depends on certain properties of the English language, such as the fact that there are not very many unique words (even allowing for stuff like "array1"); certainly English has fewer than a million unique words, and only several tens of thousands are actually used. So you should preprocess your data:

      use strict;
      use warnings;
      use String::LCSS_XS qw( lcss );

      my @array1 = ('This sentence is stored in array1') x 1_000;
      my @array2 = (
          'Thank you for looking on this.',
          'This sentence is a test sentence.',
          'This sentence is stored in array2',
          'This is an other sentence.',
          'All variables are stored in array2',
      ) x 20_000;

      preprocess( \@array1 );
      preprocess( \@array2 );
      process( \@array1, \@array2 );
      exit 0;

      my %words;
      my %codes;
      my $word_num;

      sub process {
          my ( $array1, $array2 ) = @_;
          for my $elem1 (@$array1) {
              my $best_subseq       = "";
              my $best_subseq_words = 0;
              my $best_subseq_chars = 0;
              my $found             = 0;
              for my $elem2 (@$array2) {
                  my $subseq = lcss( $elem1->[1], $elem2->[1] );
                  if ( $subseq && length($subseq) > $best_subseq_words ) {
                      my $real_subseq = join ' ', map $codes{$_}, split //, $subseq;
                      if ( length($real_subseq) > $best_subseq_chars ) {
                          $best_subseq       = $real_subseq;
                          $best_subseq_words = length($subseq);
                          $best_subseq_chars = length($real_subseq);
                          $found             = 1;
                      }
                  }
              }
              if ( $found == 1 ) {
                  printf "<<%s>> is the lcss of\n\t<<%s>>\n",
                      $best_subseq, $elem1->[0];
              }
          }
      }

      sub preprocess {
          my ($array) = @_;
          for my $i ( 0 .. $#$array ) {
              my $packed = join '', map {
                  my $code = $words{$_};
                  if ( not defined $code ) {
                      $code = chr( $word_num += 1 );
                      $words{$_}    = $code;
                      $codes{$code} = $_;
                  }
                  $code;
              } $array->[$i] =~ /\w+/g;
              $array->[$i] = [ $array->[$i], $packed ];
          }
      }
      On my ancient laptop, time perl lcss.pl > /dev/null finished in 1 min 33 seconds; that's for 100 million iterations (1,000 elements in @array1, 20,000 * 5 in @array2).

      Note that if you actually use some wildly different data, this approach may break (and I didn't test it very much).

Re: Improving script's speed and performance...
by Anonymous Monk on Dec 29, 2015 at 19:30 UTC
