experiencing slowness due to matching algorithm

ovedpo15 has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks!
I'm facing some performance issues with my algorithm so I ask for your help. I'm using the Graph module to implement a graph that represents the Linux filesystem. Each vertex is a directory/file/link path from root and each edge is the "relationship" between them. For example for /a/b/c (assuming c is a file) you will get 3 vertices a and /a/b (dirs) and /a/b/c (file) and two edges a->/a/b and /a/b->/a/b/c (Actually there is another vertex "/" which is a directory and points to "/a").
The way I initialize the graph is:

my $graph = Graph->new;
$graph->set_vertex_attributes("/",  { "type" => "dir" });
my $self = {
    graph => $graph
};
[download]

Once I fill the graph, I want to have a sub that returns all the possible paths, including links. So for example, assume you have /a/b/c again and /p->/a/b then I want it to return /a,/a/b,/a/b/c,/p,/p/c.
So what I tried to do: after going over all the vertices, I want to find all the links and check if I can replace the target with the link. I also need to support recursive links so if I found changes, I'll do another iteration. The code:

sub extract_paths_from_graph {
    my ($self,$paths_href) = @_;

    foreach my $vertex ($self->{"graph"}->unique_vertices) {
        $paths_href->{$vertex}++;
    }

    while (1) {
        my $found_changes = 0;

        foreach my $vertex ($self->{"graph"}->unique_vertices) {
            my $type = $self->{"graph"}->get_vertex_attribute($vertex,
+ 'type');
            if (index($type,"link") != -1) {
                # Ignore cycle in graph to prevent infinite loop
                my $target = ($self->{"graph"}->successors($vertex))[0
+];
                if (path($target)->subsumes($vertex)) {
                    next;
                }

                foreach my $subpath (keys(%{$paths_href})) {
                    if ($subpath =~ /^$target\// || $subpath =~ /^$tar
+get$/) {
                        my $new_vertex = $subpath =~ s/^$target/$verte
+x/gr;
                        $found_changes = 1 unless (defined($paths_href
+->{$new_vertex}));
                        $paths_href->{$new_vertex}++;
                    }
                }
            }
        }
        last unless ($found_changes);
    }
}
[download]

My code was very slow and I managed to see that it comes from this method. I used NYTProf to create a profiling report and you can find it here. It looks like the problem is with the following line which takes to much time:

if ($subpath =~ /^$target\// || $subpath =~ /^$target$/) {
[download]

My assumptions are correct? Can you please suggest a better alternative way to perform the check? I though regex here will be the fastest.

Comment on experiencing slowness due to matching algorithm Select or Download Code

Replies are listed 'Best First'.
Re: experiencing slowness due to matching algorithm by hippo (Bishop) on Jan 09, 2023 at 09:53 UTC
I though regex here will be the fastest. I have no idea why you thought that. Perhaps you could explain your reasoning? An exact string match and an index match should beat 2 dynamic regexen into a cocked hat for speed. `if ($subpath eq $target \|\| 0 == index $subpath, "$target/") {` [download] Try profiling that and see if it improves matters for you. 🦛	[reply] [d/l]
Re^2: experiencing slowness due to matching algorithm by ovedpo15 (Pilgrim) on Jan 09, 2023 at 10:36 UTC
I will try it thanks! The problem is that `0 == index $subpath, "$target/"` is not quite the same as `$subpath =~ /^$target\//` because I want it to start with `$target`. Is there something similar to `startswith` of Python?	[reply] [d/l] [select]
Re^3: experiencing slowness due to matching algorithm by hippo (Bishop) on Jan 09, 2023 at 11:28 UTC
Please provide a value each for $subpath and $target where `0 == index $subpath, "$target/"` is true and where $subpath does not start with $target. `use strict; use warnings; use Test::More; my @tests = ( { subpath => 'foo/bar', target => 'bar' }, { subpath => 'foo/bar', target => 'ba' }, { subpath => 'foo/bar', target => 'oo' }, ); plan tests => scalar @tests; for my $t (@tests) { isnt index ($t->{subpath}, "$t->{target}/"), 0, "False, as expecte +d" }` [download] 🦛	[reply] [d/l] [select]
Re: experiencing slowness due to matching algorithm by NERDVANA (Deacon) on Jan 09, 2023 at 15:57 UTC
To add a little more context than Hippo's answer: Because you have a variable substitution on the pattern, perl can't pre-compile the regex and so each time it hits that line it performs two regex compiles and two regex comparisons. While in this case "index" will be the ideal approach, if you ever have a similar case where you do need a regex, I recommend this: `if ($subpath =~ m:^\Q$target\E(/\|$):) {` [download] The \Q and \E perform escaping of the $target so that regex special characters in $target don't get interpreted as regex structure, I used `m:...:` instead of `/.../` because you have a literal / in the pattern and this way you don't need to escape it, and the alternation ending of / or $ lets you match with a single regex.	[reply] [d/l] [select]
Re: experiencing slowness due to matching algorithm by bliako (Monsignor) on Jan 10, 2023 at 08:35 UTC
Once you build your Graph there are algorithms to return all paths https://metacpan.org/dist/Graph/view/lib/Graph.pod#All-Pairs-Shortest-Paths-(APSP) (see https://metacpan.org/dist/Graph/view/lib/Graph.pod#all_paths therein). If standard algorithms exist for what you want to do then they will most-likely beat any low-level DIY algorithms. So, use Graph algorithms if you have a Graph. Also, links can be resolved at construction time with, say, https://metacpan.org/pod/Cwd#abs_path-and-friends or readlink. Caveat: querying the filesystem with said functions will be slower than your methods I guess. But it will be safer. bw, bliako	[reply]

Back to Seekers of Perl Wisdom