
Improving script's speed and performance...

by Anonymous Monk
on Dec 28, 2015 at 11:20 UTC ( [id://1151249] )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi PerlMonks,

I have implemented the following Perl script, which compares each entry of array1 with all entries of array2 in order to find the longest common subsequence in terms of words. The script seems to work fine, but it is too slow. Please note that array1 contains 1,000 entries while array2 contains more than 300,000. Is it possible to make it faster without changing the order of array1?

Thanks in advance for your help.

This is my code:

#!/usr/bin/perl
use String::LCSS_XS qw( lcss );
use utf8;
use warnings;

my $f1 = shift;
my $f2 = shift;

open (FILE1, "<:encoding(UTF-8)", "$f1") or die "can't open file '$f1' $!";
open (FILE2, "<:encoding(UTF-8)", "$f2") or die "can't open file '$f2' $!";

my @array1 = <FILE1>;
chomp(@array1);
close (FILE1);

my @array2 = <FILE2>;
chomp(@array2);
close (FILE2);

my $subseq;
my ($bestsource, $besttarget, $bestalignment, $bestlength);

for my $i (0 .. $#array1) {
    my $best_subseq       = "";
    my $best_subseq_words = 0;
    my $best_subseq_chars = 0;
    my $found             = 0;
    for my $j (0 .. $#array2) {
        $subseq = lcssw("$array1[$i]", "$array2[$j]");
        my $num_words = count_words($subseq);
        my $num_chars = count_chars($subseq);
        if ($num_words > $best_subseq_words && $num_chars > $best_subseq_chars) {
            $best_subseq       = $subseq;
            $best_subseq_words = $num_words;
            $best_subseq_chars = $num_chars;
            $found             = 1;
        }
    }
    if ($found == 1) {
        print "$best_subseq is the lcssw of $array1[$i]\n";
    }
}

sub lcssw {
    my ($s1, $s2) = @_;
    my $i;
    my %codes;
    my %words;
    for ($s1, $s2) {
        $_ = join '', map {
            $codes{$_} = chr(++$i) if !exists($codes{$_});
            $codes{$_};
        } $_ =~ /\w+/g;
    }
    my $lcss = lcss($s1, $s2);
    $lcss = "" if (!defined $lcss);
    @words{values %codes} = keys %codes;
    return join ' ', @words{ $lcss =~ /./sg };
}

sub count_words {
    my $line = shift;
    my @text_words = split(/\s+/, $line);
    return scalar(@text_words);
}

sub count_chars {
    my $line = shift;
    my @text_words = split(//, $line);
    return scalar(@text_words);
}

Replies are listed 'Best First'.
Re: Improving script's speed and performance...
by RichardK (Parson) on Dec 28, 2015 at 12:17 UTC

    How slow is too slow? You are running through the inner loop 300+ million times, so it might take a while.

    Have you tried profiling your code? It will tell you where most of the time is being spent, and then you'll know where improvements are needed.
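
    For example, Devel::NYTProf gives a per-subroutine and per-line breakdown (the script and file names below are just placeholders):

        # assumes Devel::NYTProf is installed from CPAN
        perl -d:NYTProf yourscript.pl file1.txt file2.txt
        nytprofhtml    # writes an HTML report under ./nytprof/

    That will tell you whether the time goes into lcss itself, the word/character counting, or the packing done in lcssw.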

    A couple of things that might help a bit -

    * You could use length to count the characters in a string, which should be quicker than creating an array each time (a sketch combining both suggestions follows the loop example below).

    * You don't care about the indexes of your arrays, so you could write your loops like this:

    for my $lhs (@array1) {
        ...
        for my $rhs (@array2) {
            lcssw($lhs, $rhs);
            ...
        }
    }
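
    Putting both of those together, the inner loop might look something like this (just a sketch, untested, keeping your variable names):

        for my $lhs (@array1) {
            my ($best_subseq, $best_subseq_words, $best_subseq_chars) = ("", 0, 0);
            for my $rhs (@array2) {
                my $subseq    = lcssw($lhs, $rhs);
                my $num_chars = length $subseq;            # length() instead of split //
                my $num_words = () = $subseq =~ /\S+/g;    # count matches in list context
                if ($num_words > $best_subseq_words && $num_chars > $best_subseq_chars) {
                    ($best_subseq, $best_subseq_words, $best_subseq_chars)
                        = ($subseq, $num_words, $num_chars);
                }
            }
            print "$best_subseq is the lcssw of $lhs\n" if length $best_subseq;
        }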
Re: Improving script's speed and performance...
by Discipulus (Canon) on Dec 28, 2015 at 12:35 UTC
Re: Improving script's speed and performance...
by Anonymous Monk on Dec 28, 2015 at 14:51 UTC
    sub lcssw {
        my ($s1, $s2) = @_;
        my $i;
        my %codes;
        my %words;
        for ($s1, $s2) {
            $_ = join '', map {
                $codes{$_} = chr(++$i) if !exists($codes{$_});
                $codes{$_};
            } $_ =~ /\w+/g;
        }
        my $lcss = lcss($s1, $s2);
        $lcss = "" if (!defined $lcss);
        @words{values %codes} = keys %codes;
        return join ' ', @words{ $lcss =~ /./sg };
    }
    Well, whatever you do, you should start right here. You call this function 300,000,000 times, and it does a lot of things. Does it actually increase performance (compared to just using lcss on the raw strings)? I mean, it might, but it allocates hashes and strings, accesses keys and values in hashes, runs the regex engine...

    Also, while your lcssw could be improved (for example, return '' if not defined $lcss), real optimizations depend on your data. Are there many strings that won't match? You could try to filter them out using some simpler algorithm, and then run the full version on others. How gigantic are your strings? If not too big, you can preprocess them (that is, do most of the work that lcssw does, but just once per array - at least for the smaller array).
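
    For example, a rough sketch of the "filter first" idea (the @word_sets name is made up here, and whether the pre-check pays off depends entirely on your data):

        # build a word set for every line of @array2, once
        my @word_sets;
        for my $line (@array2) {
            my %set = map { $_ => 1 } $line =~ /\w+/g;
            push @word_sets, \%set;
        }

        for my $i (0 .. $#array1) {
            my %w1 = map { $_ => 1 } $array1[$i] =~ /\w+/g;
            for my $j (0 .. $#array2) {
                # cheap pre-check: if the two lines share no word at all,
                # their word-level LCS is empty, so skip the expensive call
                next unless grep { $w1{$_} } keys %{ $word_sets[$j] };
                my $subseq = lcssw($array1[$i], $array2[$j]);
                # ... keep the best result as before ...
            }
        }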

    The whole thing looks pretty parallelizable; threads or forks could help you use more CPUs to process your arrays.
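
    A minimal forks sketch with Parallel::ForkManager (assuming it is installed; the worker count is arbitrary, and since you want results in @array1's order you would still have to collect each child's output, e.g. via run_on_finish, and print it in order):

        use Parallel::ForkManager;

        my $workers = 4;
        my $pm      = Parallel::ForkManager->new($workers);

        # split @array1 into one chunk per worker
        my @chunks;
        push @{ $chunks[ $_ % $workers ] }, $array1[$_] for 0 .. $#array1;

        for my $chunk (@chunks) {
            $pm->start and next;    # parent: move on to the next chunk
            for my $line (@$chunk) {
                # ... run the inner @array2 loop for $line here ...
            }
            $pm->finish;            # child exits
        }
        $pm->wait_all_children;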

    Anyway, without some representative examples of your data it's hard to tell, so how about posting some?

Re: Improving script's speed and performance...
by GrandFather (Saint) on Dec 28, 2015 at 20:52 UTC

    Give us some context so we can better understand what you want to do. We can endlessly play 20,000 questions to try and home in on an efficient solution for your data and objective. Or you can tell us what you need to achieve so we can determine things like "Does it always match from the start?" and "Are they really only 30-character sentences?".

    The more you tell us about your actual problem the better we are likely to be able to help you.

    Premature optimization is the root of all job security
Re: Improving script's speed and performance...
by Lennotoecom (Pilgrim) on Dec 28, 2015 at 12:09 UTC
    Please provide examples of what exactly is in your original files,
    and what you consider to be the "longest common subsequence in terms of words".
    Cheers
Re: Improving script's speed and performance...
by Anonymous Monk on Dec 28, 2015 at 15:21 UTC

    Hi, thank you for your replies.

    Here is an example of my files. Suppose that array1 contains the following sentence:

    This sentence is stored in array1

    and array2 contains the following sentences:

    Thank you for looking on this.
    This sentence is a test sentence.
    This sentence is stored in array2
    This is an other sentence.
    All variables are stored in array2

    I would like to compare the sentence of array1 with all the sentences of array2 so that my algorithm will extract the longest common subsequence (i.e. "This sentence is stored in").

    I will try to change the code and make it more efficient as you described above, but I would appreciate your help on this.

      OK. So you're saying your strings are pretty short and consist almost entirely of the words of the English language. The unused variable $bestalignment in your code makes me doubt that, but OK.

      Here's one approach (just an illustration, you should definitely debug and improve it). It depends on certain properties of the English language, such as the fact that there are not very many unique words (even allowing for stuff like "array1"); certainly English has fewer than a million unique words, and only several tens of thousands are actually used. So you should preprocess your data:

      use strict;
      use warnings;
      use String::LCSS_XS qw( lcss );

      my @array1 = ('This sentence is stored in array1') x 1_000;
      my @array2 = (
          'Thank you for looking on this.',
          'This sentence is a test sentence.',
          'This sentence is stored in array2',
          'This is an other sentence.',
          'All variables are stored in array2',
      ) x 20_000;

      preprocess( \@array1 );
      preprocess( \@array2 );
      process( \@array1, \@array2 );
      exit 0;

      my %words;
      my %codes;
      my $word_num;

      sub process {
          my ( $array1, $array2 ) = @_;
          for my $elem1 (@$array1) {
              my $best_subseq       = "";
              my $best_subseq_words = 0;
              my $best_subseq_chars = 0;
              my $found             = 0;
              for my $elem2 (@$array2) {
                  my $subseq = lcss( $elem1->[1], $elem2->[1] );
                  if ( $subseq && length($subseq) > $best_subseq_words ) {
                      my $real_subseq = join ' ', map $codes{$_}, split //, $subseq;
                      if ( length($real_subseq) > $best_subseq_chars ) {
                          $best_subseq       = $real_subseq;
                          $best_subseq_words = length($subseq);
                          $best_subseq_chars = length($real_subseq);
                          $found             = 1;
                      }
                  }
              }
              if ( $found == 1 ) {
                  printf "<<%s>> is the lcss of\n\t<<%s>>\n",
                      $best_subseq, $elem1->[0];
              }
          }
      }

      sub preprocess {
          my ($array) = @_;
          for my $i ( 0 .. $#$array ) {
              my $packed = join '', map {
                  my $code = $words{$_};
                  if ( not defined $code ) {
                      $code = chr( $word_num += 1 );
                      $words{$_}    = $code;
                      $codes{$code} = $_;
                  }
                  $code;
              } $array->[$i] =~ /\w+/g;
              $array->[$i] = [ $array->[$i], $packed ];
          }
      }
      On my ancient laptop, time perl lcss.pl > /dev/null finished in 1 min 33 seconds; that's for 100 million iterations (1,000 elements in @array1, 20,000 * 5 in @array2).

      Note that if you actually use some wildly different data, this approach may break (and I didn't test it very much).

Re: Improving script's speed and performance...
by Anonymous Monk on Dec 29, 2015 at 19:30 UTC
