PerlMonks  

Remove double bracket and single quotes

by lobs (Acolyte)
on May 02, 2016 at 14:50 UTC ( [id://1162017]=perlquestion )

lobs has asked for the wisdom of the Perl Monks concerning the following question:

So I am trying to remove double brackets and single quotes. Here is some example text:

'C-3PO' or 'See-Threepio' is a humanoid robot character from the [[Star Wars]] universe who appears in the original ''Star Wars'' films, the prequel trilogy and the sequel trilogy.

What I have done is:

$doc =~ s/\[\[//g;
$doc =~ s/\]\]//g;
$doc =~ s/\'//g;

It does not work at all. Please help.

Replies are listed 'Best First'.
Re: Remove double bracket and single quotes
by toolic (Bishop) on May 02, 2016 at 14:55 UTC
    Works for me...
    use warnings;
    use strict;

    my $doc = q('C-3PO' or 'See-Threepio' is a humanoid robot character from the [[Star Wars]] universe who appears in the original ''Star Wars'' films, the prequel trilogy and the sequel trilogy.);
    $doc =~ s/\[\[//g;
    $doc =~ s/\]\]//g;
    $doc =~ s/\'//g;
    print "$doc\n";

    __END__
    C-3PO or See-Threepio is a humanoid robot character from the Star Wars universe who appears in the original Star Wars films, the prequel trilogy and the sequel trilogy.

    Basic debugging checklist

Re: Remove double bracket and single quotes
by AnomalousMonk (Archbishop) on May 02, 2016 at 16:37 UTC

    As a general rule, "It doesn't work!" is not a useful problem description. What the monks would like to see is something along the lines of:

    Here is some code:
    my $doc = '...';
    $doc =~ s/.../.../g;
    ...;
    print ">>$doc<< \n";
    As you can see, I am getting  >>...<< as output, but I really want  >>...<< instead. Can you please help?
    Ideally, the example code and data should be brief and runnable. Please see the Short, Self Contained, Correct (Compilable), Example discussion. Please also see How do I post a question effectively? and How (Not) To Ask A Question. Please help us to help you.


    Give a man a fish:  <%-{-{-{-<

Re: Remove double bracket and single quotes
by Marshall (Canon) on May 02, 2016 at 16:06 UTC
    Also confirming that your code does indeed work.

    Your test case doesn't contain a simple [something]. If you are willing to delete all of the brackets, not just the double ones, then a shorter, simpler tr statement can be used. tr is in general faster than a regex substitution, and because its features are so limited, there is no need to escape the characters, so the expression is more readable.

    use warnings;
    use strict;

    my $doc = q('C-3PO' or 'See-Threepio' is a humanoid robot character from the [[Star Wars]] universe who appears in the original ''Star Wars'' films, the prequel trilogy and the sequel trilogy.);
    $doc =~ tr/[]'//d;
    print "$doc\n";

    __END__
    C-3PO or See-Threepio is a humanoid robot character from the Star Wars universe who appears in the original Star Wars films, the prequel trilogy and the sequel trilogy.
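To make the caveat about a simple [something] concrete, here is a minimal sketch. The sample string is made up for illustration and is not from the original question. It shows that tr with /d deletes every listed character, including single brackets, and returns the number of characters it deleted, whereas the substitution only touches doubled brackets.

```perl
use strict;
use warnings;

# Hypothetical sample containing a single-bracketed [1].
my $text = "x[1] = [[Star Wars]] 'quoted'";

# Substitution: only doubled brackets are removed; every single quote goes.
(my $by_regex = $text) =~ s/\[\[|\]\]|'//g;

# tr with /d deletes each listed character and returns the count deleted.
my $by_tr   = $text;
my $deleted = ($by_tr =~ tr/[]'//d);

print "regex : $by_regex\n";   # x[1] = Star Wars quoted
print "tr    : $by_tr\n";      # x1 = Star Wars quoted
print "count : $deleted\n";    # 8
```

If single brackets must survive, tr is not a drop-in replacement for the three substitutions.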

      Update: Added more tests. Thank you, wee.

      Indeed, tr is fast. I compared the 3 regex statements to tr against a 724 MB string. Testing was done on a 2.6 GHz Core i7 machine with Perl v5.16.2.

      use strict;
      use warnings;
      use Time::HiRes 'time';

      my $doc = "'C-3PO' or 'See-Threepio' is a humanoid robot character from the [[Star Wars]] universe who appears in the original ''Star Wars'' films, the prequel trilogy and the sequel trilogy.\n";

      $doc .= $doc for 1 .. 22;   ## expand string to 724 MB

      print "length   : ", length($doc), "\n";   # 759169024

      my $start = time;

      # $doc =~ s/\[\[//g;             ##  8.626 secs.
      # $doc =~ s/\]\]//g;
      # $doc =~ s/\'//g;

      # $doc =~ s/\[//g;               ## 10.493 secs.
      # $doc =~ s/\]//g;
      # $doc =~ s/\'//g;

      # $doc =~ s/\[+//g;              ##  7.050 secs.
      # $doc =~ s/\]+//g;
      # $doc =~ s/\'+//g;

      # $doc =~ s/(?:\[|\]|\')//g;     ## 19.559 secs.
      # $doc =~ s/(?:\[|\]|\')+//g;    ## 56.150 secs. <- did not expect this

      # $doc =~ s/[\[\]\']//g;         ##  9.072 secs.
      # $doc =~ s/[\[\]\']+//g;        ##  6.915 secs.

      $doc =~ tr/[]'//d;               ##  1.908 secs.

      printf "duration : %7.03f secs.\n", time - $start;
      print "length   : ", length($doc), "\n";   # 708837376

      It's unfortunate that Perl doesn't know to optimize the following automatically :(

      $doc =~ s/(?:\[|\]|\')//g   -->  $doc =~ s/[\[\]\']//g
      $doc =~ s/(?:\[|\]|\')+//g  -->  $doc =~ s/[\[\]\']+//g

        Very cool on the benchmarks!

        As we see, fewer lines in Perl doesn't always mean faster execution speed. Sometimes 3 lines can beat 1 line, as demonstrated by your code. That is important and this is often missed here.

        I figure Benchmark #2 is slower than Benchmark #1 because more thinking has to go on for each encounter with a bracket. In this case, specifying an action on a double bracket is faster than an action on each single bracket.

        The fastest is tr, which I expected. This thing is "dumb", but fast.

        That's a good start, but I wouldn't really count on such a limited benchmark too heavily. Take a look at Re: Faster way to do this? for an example of how trustworthy a quick benchmark with one data set really isn't.

        The first question should be whether 8 seconds is too long in the first place. If someone's only running this once, you've already spent far more time saving 6 seconds than it's worth.
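For anyone who does want to measure, a minimal sketch using the core Benchmark module follows. The input size, the iteration count, and the names in %strip are arbitrary choices for illustration and are not taken from the posts above. It first checks that the candidates agree on this particular input (which has no single brackets), then compares them; results on other data or other Perl versions may well differ.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Assumed test data: the sample sentence repeated 1000 times.
my $line = "'C-3PO' or 'See-Threepio' is a humanoid robot character from the [[Star Wars]] universe who appears in the original ''Star Wars'' films.\n";
my $doc  = $line x 1000;

# Each candidate works on its own copy and returns the cleaned string.
my %strip = (
    three_subs => sub { my $d = shift; $d =~ s/\[\[//g; $d =~ s/\]\]//g; $d =~ s/'//g; return $d },
    char_class => sub { my $d = shift; $d =~ s/[\[\]']+//g; return $d },
    tr_delete  => sub { my $d = shift; $d =~ tr/[]'//d; return $d },
);

# Sanity check: all candidates must produce the same output on this input.
my %seen;
$seen{ $strip{$_}->($doc) } = 1 for keys %strip;
die "candidates disagree\n" if keys(%seen) != 1;

# Wrap each candidate so it runs against the shared input.
my %tests;
for my $name (keys %strip) {
    my $code = $strip{$name};
    $tests{$name} = sub { $code->($doc) };
}

# 500 iterations each; use a negative count (CPU seconds) for steadier numbers.
cmpthese(500, \%tests);
```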

        It's unfortunate that Perl doesn't know to optimize the following automatically :(
        $doc =~ s/(?:\[|\]|\')//g   -->  $doc =~ s/[\[\]\']//g
        $doc =~ s/(?:\[|\]|\')+//g  -->  $doc =~ s/[\[\]\']+//g

        Maybe unfortunate in your specific example, but really not surprising at all. The statements (whilst the same in this very specific instance) are not actually the same at all. Consider:

        use 5.16.2;
        use warnings;

        my $doc = "'C-3PO' or 'See-Threepio' is a humanoid robot character from the [[Star Wars]] universe who appears in the original ''Star Wars'' films, the prequel trilogy and the sequel trilogy.\n";
        $doc =~ s/(?:\[\[|\]\]|'')//g;   # Only replace doubles, to make sequence longer than a single character
        say $doc;

        # Reset the string (plain assignment, not a match) before the second test
        $doc = "'C-3PO' or 'See-Threepio' is a humanoid robot character from the [[Star Wars]] universe who appears in the original ''Star Wars'' films, the prequel trilogy and the sequel trilogy.\n";
        $doc =~ s/[\[\]']//g;   # set based replacement
        say $doc;

        In English, the first match could be described as Match sequence x or sequence y or sequence z, do not keep the matching group. The second match could be described as Match any characters in set a.

        Sometimes there is a trade-off in just how much work the optimiser will do. Obvious conversions due to simple style differences are easy and cheap. Less obvious conversions like this, where in some edge cases it is faster to optimise, I suspect you'll find the decision is to leave the optimisation up to the developer more often than not.
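A small sketch of why the two forms are not interchangeable in general. The input string here is invented for illustration. Once the alternatives are longer than one character, the alternation removes only those exact sequences, while the character class removes every member of the set wherever it appears.

```perl
use strict;
use warnings;

# Hypothetical input with a single-bracketed index and an apostrophe.
my $text = "a[1] and [[b]] and it's ''c''";

# Alternation of two-character sequences: lone brackets and quotes survive.
(my $sequences = $text) =~ s/(?:\[\[|\]\]|'')//g;

# Character class: every bracket and quote is removed.
(my $set = $text) =~ s/[\[\]']//g;

print "$sequences\n";   # a[1] and b and it's c
print "$set\n";         # a1 and b and its c
```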

      I came across FFI::TinyCC and FFI::Platypus recently. The Alien::TinyCC module is what compiles the C code and does so very quickly in memory.

      This demonstration completes in 1.832 seconds, ahead of tr.

      use strict;
      use warnings;
      use Time::HiRes qw( time );

      # -------------------------------------------------------------------

      use FFI::TinyCC;
      use FFI::Platypus::Declare qw( string int );

      my $tcc = FFI::TinyCC->new;

      $tcc->compile_string (q{
          int filter_str ( char* str )
          {
              char* p = str;
              int i = 0;

              /* strip chars in-place */
              while ( *str ) {
                  if ( *str == '[' || *str == ']' || *str == '\'' ) {
                      str++;
                      continue;
                  }
                  *p++ = *str++;
                  i++;
              }

              return i;
          }
      });

      my $address = $tcc->get_symbol('filter_str');
      attach [ $address => 'filter_str' ] => [ string ] => int;

      # -------------------------------------------------------------------

      my $doc = "'C-3PO' or 'See-Threepio' is a humanoid robot character from the [[Star Wars]] universe who appears in the original ''Star Wars'' films, the prequel trilogy and the sequel trilogy.\n";

      $doc .= $doc for 1 .. 22;   ## expand string to 724 MB

      my $doclen = length $doc;
      print "length   : $doclen\n";   ## 759169024

      my $start  = time;
      my $newlen = filter_str($doc);

      ## resize string to its new length
      substr $doc, $newlen, $doclen - $newlen, '';

      printf "duration : %7.03f secs.\n", time - $start;
      print "length   : $newlen\n";   ## 708837376


      Well, curiosity got the better of me, which is why I tried TinyCC for the first time :)

Re: Remove double bracket and single quotes
by wee (Scribe) on May 02, 2016 at 19:59 UTC
    This works for me:
    #!/usr/bin/perl
    use 5.010;
    use strict;
    use warnings;

    my $doc = "'C-3PO' or 'See-Threepio' is a humanoid robot character from the [[Star Wars]] universe who appears in the original ''Star Wars'' films, the prequel trilogy and the sequel trilogy.";
    $doc =~ s/[\[\]']+//g;
    say $doc;
    Did you mean that you want to only remove one single quote and not two of them?
      I like how the OP hasn't come back to clarify his problem and qualify what "Doesn't work at all" really means.

      But, I think the answer to your question is "yes"; only a single quote and not double quotes. Which then begs the question (from me) HOW DO YOU do this? As soon as you put a single quote followed by the "g" option on the search-and-replace, it removes all of them.
        $doc =~ s/(?<!\')\'(?!\')//g;

        The above seems to work for OP's example. Only matches single "'" by using negative look-behind and negative look-ahead assertions to avoid matches of more than one consecutive single quote.
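As a quick check of the lookaround approach, here is a sketch run against a shortened version of the example sentence (the shortening is mine). A quote is removed only when it has no quote immediately before or after it, so the doubled quotes around Star Wars are left alone.

```perl
use strict;
use warnings;

# Shortened form of the example text from the question.
my $doc = "'C-3PO' or 'See-Threepio' appears in the original ''Star Wars'' films.";

# Remove a single quote only if it is not adjacent to another single quote.
(my $lone_removed = $doc) =~ s/(?<!')'(?!')//g;

print "$lone_removed\n";
# C-3PO or See-Threepio appears in the original ''Star Wars'' films.
```

Note that runs of three or more quotes are also left untouched by this pattern, since every quote in such a run has a neighbour.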

Re: Remove double bracket and single quotes
by sedninja (Initiate) on May 05, 2016 at 16:25 UTC
    You can do this with just one regex:
    use strict;
    use warnings;

    my $input = "'C-3PO' or 'See-Threepio' is a humanoid robot character from the [[Star Wars]] universe who appears in the original ''Star Wars'' films, the prequel trilogy and the sequel trilogy.";
    my $regex = qr/('|\[\[|\]\])/;
    $input =~ s/$regex//g;
    print $input;
