Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses
 
PerlMonks  

Masking part of a string

by citromatik (Curate)
on Jun 27, 2007 at 12:53 UTC ( [id://623608]=perlquestion: print w/replies, xml ) Need Help??

citromatik has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, monks

I'm trying to mask part of a string based on a bit vector. i.e:

Str: AGACGAGTA Mask: 001111100 --------------- Res: AGxxxxxTA

To do that I wrote the following script (to deal with a dummy example):

use strict; use warnings; my $seq = "AGACGAGTA"; my $mask=''; vec ($mask,$_,1)=1 for (2..6); print "Str: $seq\n"; print "Mask: ", unpack ("b*",$mask),"\n";

At this point the output of the program is:

Str: AGACGAGTA Mask: 00111110

To apply the mask I found two possible solutions:

#Solution 1: my @mask = split "",unpack ("b*",$mask); my @seq = split "",$seq; print "Res: "; print map {! shift @mask ? $_ : "x"} @seq; print "\n";

and...

# Solution 2: my $s_mask = unpack ("b*",$mask); my $pos=0; print "Res: "; while ($pos < length($seq)){ no warnings; ## $s_mask could be shorter than $seq print ! substr($s_mask,$pos,1) ? substr($seq,$pos,1) : "x"; $pos++; } print "\n";

Both outputs as expected:

Res: AGxxxxxTA

But I do not feel to taste with any of the two: In real examples (strings to mask having lengths ~10e5), step through both strings or arrays (the string to mask and the bit vector) could be very inefficient and one of the reason to use bit vectors is efficiency building them ($mask1 | $mask2, $mask1 & $mask2, etc...).

Is there any way to do the masking without having to traverse all the sequence? Am I in the wrong direction for doing the job?

Thanks in advance!

citromatik

Replies are listed 'Best First'.
Re: Masking part of a string
by ikegami (Patriarch) on Jun 27, 2007 at 13:31 UTC

    You're dealing with characters (or octects), not bits, so you need to set 8 bits in the mask instead of 1 for each letter. substr is a good alternative to vec for doing that. Perl even has Bitwise String Operators (see perlop) that allow you to do bit operations on entire strings "at once".

    my $str = 'AGACGAGTA'; my $mask = "\xFF" x length($str); substr($mask, $_, 1, "\x00") for 2..6; my $res = $str & $mask; $res =~ s/\x00/x/g; print("$res\n");

    You can avoid the regexp call:

    my $str = 'AGACGAGTA'; my $mask = "\xFF" x length($str); substr($mask, $_, 1, "\x00") for 2..6; my $x = 'x' x length($str); my $res = ($str & $mask) | ($x & ~$mask); print("$res\n");

    Of course, you really don't need the mask at all.

    my $str = 'AGACGAGTA'; my $res = $str; substr($res, $_, 1, "x") for 2..6; print("$res\n");

    Update: Added the last two snippets.

      I've had a go at running a benchmark. If I've not made a mess of it, the pure masking solution seems to be by far the quickest.

      use strict; use warnings; use Benchmark q{cmpthese}; my $str = q{AGACGAGTA} x 12000; my $mask = qq{\xFF} x length $str; my $x = q{x} x length $str; my @toMask = ( 2 .. 6, 78, 506 .. 1473, 4863, 26290 .. 37907, 107889 .. 107996); substr $mask, $_, 1, qq{\x00} for @toMask; my $resAndOrAndNot = andOrAndNot(); my $resAndRegex = andRegex(); my $resSubstrList = substrList(); die qq{Inconsistent\n} unless $resAndOrAndNot eq $resAndRegex and $resAndOrAndNot eq $resSubstrList; cmpthese (-3, { andOrAndNot => \&andOrAndNot, andRegex => \&andRegex, substrList => \&substrList, }); sub andOrAndNot { my $masked = ($str & $mask) | ($x & ~ $mask); return $masked; } sub andRegex { my $masked = $str & $mask; $masked =~ s{\x00}{x}g; return $masked; } sub substrList { my $masked = $str; substr $masked, $_, 1, q{x} for @toMask; return $masked; }

      produces

      Rate substrList andRegex andOrAndNot substrList 26.0/s -- -8% -91% andRegex 28.3/s 9% -- -90% andOrAndNot 278/s 969% 884% --

      Cheers,

      JohnGG

        A relevant benchmark would factor in the time to build the mask while preserving the real application's ratio of mask rebuilds to the number of operations found. Right now, you ignore the time to build the mask in two tests, while including it in the third test.

        Depending on that ratio, substrList is faster than andOrAndNot.

Re: Masking part of a string
by ferreira (Chaplain) on Jun 27, 2007 at 19:44 UTC

    I could not resist giving a functional and expensive solution to the problem with pairwise function from List::MoreUtils.

    my $mask = '001111100'; my $str = 'AGACGAGTA'; my @mask = split( '', $mask ); my @str = split( '', $str ); my $masked = join '', pairwise { $a ? 'x' : $b } @mask, @str; print "Masked: $masked\n";
    And now using regexes:
    my $mask = '001111100'; my $str = 'AGACGAGTA'; my $masked = ''; while ( $mask =~ /\G(.)/gc ) { # walk the mask my $bit = $1; if ( $str =~ /\G(.)/gc ) { # walk the string to be masked $masked .= $bit ? 'x' : $1; } } print "Masked: $masked\n";

    Probably it could be improved in space if mask is represented as a bit vector and there is an iterator to walk the mask just like $mask =~ /\G(.)/gc is doing.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://623608]
Approved by grep
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others learning in the Monastery: (2)
As of 2024-04-26 06:26 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found