http://qs321.pair.com?node_id=49053


in reply to Supersplit

OK, I took the idea and did basically a complete rewrite. In particular I noticed the following:
  1. I made the interfaces less magic. For instance you have this magic stuff on the filehandle. I made that a separate function. This will work with tied filehandles as well. For the same reason I stopped using $/ because the author of a tied method may not pay attention to that.
  2. If you are a module, there is no need to do initializations in a BEGIN block.
  3. I would have moved your functions into @EXPORT_OK as Exporter suggests, but you want this for one-offs. OK, TIMTOWTDI. But if I was using it I would have made that change.
  4. I wondered if your @VERSION was meant to be $VERSION.
  5. I note that there is no equivalent to the third argument to split. I played both ways with that then left it alone. Just note that trailing blanks will get split.
  6. I am doing a rewrite and didn't include any POD. You should.
  7. I made this n-dimensional because, well, because I can.
  8. You were not completely clear what the argument order was, and naming the first one $second and the second one $first is IMO confusing. I made it recursive, but still you should note the naming issue. If you wanted 2-dim I would suggest $inner and $outer as names.
  9. You are using explicit indexes. I almost never find that necessary. In this version I use map. Otherwise you could push onto the anon array. Avoiding ever thinking about the index leads to fewer opportunities to mess up, and often results in faster code as well!
  10. I am using qr// to avoid recompiling REs. Given the function call overhead this probably isn't a win. I did it mainly to mention that if you are going to do repeated uses of an RE, you can and should avoid compilation overhead.
  11. The reason for my wrappers is so that my recursion won't mess up on the defaults. :-)
  12. I considered checking wantarray, but the complication in the interface did not seem appropriate for short stuff.
  13. Note that this entire approach is going to fail miserably on formats with things like escape characters and escape sequences. For instance the CSV format is never going to be easily handled using this. Something to consider before using this for an interesting problem.
Oh right, and you want to see code? OK.
package SuperSplit; use strict; use Exporter; use vars qw( @EXPORT @ISA $VERSION ); $VERSION = 0.02; @ISA = 'Exporter'; @EXPORT = qw( superjoin supersplit supersplit_io ); # Takes a reference to an n-dim array followed by n strings. # Joins the array on those strings (inner to outer), # defaulting to "\t", "\n" sub superjoin { my $a_ref = shift; push (@_, "\t") if @_ < 1; push (@_, "\n") if @_ < 2; _join($a_ref, @_); } sub _join { my $a_ref = shift; my $str = pop; if (@_) { @$a_ref = map {_join($_, @_)} @$a_ref; } join $str, @$a_ref; } # Splits the input from a filehandle sub supersplit_io { my $fh = shift; unless (defined($fh)) { $fh = \*STDIN; } unshift @_, join '', <$fh>; supersplit(@_); } # n-dim split. First arg is text, rest are patterns, listed # inner to outer. Defaults to /\t/, /\n/ sub supersplit { my $text = shift; if (@_ < 1) { push @_, "\t"; } if (@_ < 2) { push @_, "\n"; } _split($text, map {qr/$_/} @_); } sub _split { my $text = shift; my $re = pop; my @res = split($re, $text); # Consider the third arg? if (@_) { @res = map {_split($_, @_)} @res; } \@res; } 1;
Cheers,
Ben

PS Please take the quantity and detail of my response as a sign that I liked the idea enough to critique it, and not as criticism of the effort you put in...

Replies are listed 'Best First'.
(tye)Re: Supersplit
by tye (Sage) on Dec 31, 2000 at 09:33 UTC

    If you are a module, there is no need to do initializations in a BEGIN block.

    Maybe you haven't found any but I certainly have. They aren't easy to run into but I still find it a valuable habit. Leaving a potentially large window for a race condition to fit into isn't my idea of a good programming technique. I suspect that future features of Perl will make this even more important to avoid.

    Along those lines, I'd use base qw(Exporter) myself, though I look forward to just being able to dispatch &import and avoid the sledgehammer of inheritance. (:

    I consider qr// to be new enough of a feature that if I were to use it in a module I would make it optional:

    BEGIN { my $sub; if( $] < 5.005 ) { $sub= 'return $_[0]'; } else { $sub= 'return qr/$_[0]/'; } eval "sub _qr { $sub }; 1" or die $@; } # ... _split($text, map {_qr($_)} @_);
    (untested, though).

            - tye (but my friends call me "Tye")
      I disagree quite strongly on the BEGIN issue.

      When someone loads your module via use or require, there is no gap between the finish of parsing your module and the execution of code in your module. Therefore there is no possibility of a race if you don't play games with BEGIN in your module, and stupid games are not played while checking who is loading you. I assume, of course, that you are nt using an unstable experimental feature. (ie Threads. And if Perl 6 gets into races with initialization code while loading modules, then that is a showstopper in my books!)

      If I am wrong then please show me how exactly this race can happen in Perl. If it is something which I think could possibly happen to me, then I will start blocking it. But not unless. In general I won't put energy into defensively coding against things that I don't think I will get wrong. Conversely if it is something that I can conceivably get wrong by accident, I will become a paranoid nut. :-)

      Now I will give you a very good reason not to move your initialization into BEGIN. If you don't move your manipulation of @ISA into BEGIN, then on 5.005 (and if they fix the stupid $Carp::CarpLevel games in Exporter on 5.6 as well) if you mess up a use statement then you will by default get it correctly reported from the correct package in your module rather than from the module which uses you. If you move the manipulation of @ISA into BEGIN or use base to achieve the same effect you will mess that up. (Note that the fixed Carp that will appear in 5.8 does not have this issue.) Therefore by not playing games with what parts of your initializations occur before your module is done compiling, you will get error reporting which is more likely to be informative.

      So the BEGIN time initialization not only doesn't buy me anything that I care about, it loses me something that I consider very important!

      The qr// point is a matter of taste and environment. No, it is not supported in older Perl's. If that is an issue for you, then it is easy enough to drop it and just split on /$re/.

        I hate having to say, "Oops".

        Everything that I said about BEGIN is true. But my promise that Perl 5.8 will have a version of Carp which fixes the begin-time Exporter issue is incorrect. Here are some details on the situation.

        Suppose that you have 3 packages, A, B, and C. Suppose that B and C both inherit from A but C does not inherit from B. Today when C makes a call into B and then B makes a call into A which generates an error, Carp will not label the error as coming from A, B, or C, but instead will report it from whoever called C. In 5.8 the error will in this case be reported from C because Carp has not been told that B trusts C.

        This is good because the fact that B inherits from A is an internal implementation detail which C should not have to know about. What is not good at the moment is that the trust relationship is synonymous with "inherits". But the code for Carp has been written in a way where this can be fixed by just replacing the trusts_directly() function in Carp::Heavy with something smarter. This is all intended.

        What I had not thought through, though, is the situation where C calls a method in B, which actually is inherited from A. Now when Carp looks at the information, what it sees from caller is that C called a function in A. But C trusts A, so that call is trusted. In fact if the call generates an error, then C should be fingered because B does not trust C, and where the error is reported from should not depend on that implementation detail. But the critical piece of information required is not reported by caller, on a method call we don't know what call was intended, we only know what function was called. If we had a way to get that information, then set $called to the appropriate thing in the short_error_loc() function in Carp::Heavy, and the underlying problem would go away in Perl 5.8. But unless someone feels motivated to champion this on p5p, that won't happen and in this situation the error will continue to be incorrectly reported.

        This last situation is exactly what we get with Exporter and playing games with @ISA during compile time. If you do not play the games, then while your code is being compiled you do not inherit from Exporter during your use invocations, so you can be correctly fingered as causing errors. If you do set @ISA bright and early, your use of other packages will go to Exporter's import() function, and then when Carp tries to figure out the error it will decide not to finger you because what it sees is you calling a function in a package you inherit from.

        An incidental note. Capturing information about method calls in caller would allow that information to be properly propagated in Perl's SUPER mechanism. This is a necessary step towards having Perl's SUPER mechanism play nicely with multiple inheritance. Of course making that change in Perl 5 raises backwards compatibility issues, so likely a new keyword would need to be created. That would require even more championining. Since I am personally of the opinion that multiple inheritance leads to overly fragile and complex design issues, I won't be seen caring much either way, but some others might...

Re: Re (tilly) 1: Supersplit
by jeroenes (Priest) on Dec 31, 2000 at 20:11 UTC
    Well, tilly, thanx a lot for the thorough critics. And also for the last remark, I needed that ;-). I have taken your code in the script, but merged it to some of my code where appropiate.

    Of course, I have some counter-remarks, here we go.

    1. Why reverse the order for the arguments? join and split both first start with the separator, and than the input. So I changed the order back to strings, input.

    2. I don't like to pre-compile the regex's, otherwise the split couldn't cope with changing delimiters, as in a text file (see the SYNOPSIS), or with sprintf'ed data. So I changed that back. Furthermore, I couldn't find any reference to qr// in manpages. Could you please explain?

    3. On tie's comments: more dimensional arrays are a perl 5 feature, so I should check for that anyway. Out of time now, so next version of supersplit.

    4. I really like the recursive approach.

    5. I don't see the need for a separate IO version, so I changed that back, too. I just try to treat the string as a filehandle, or try to open it as file (new feature). I didn't succeed to get supersplit( INPUT ), with INPUT as a filehandle, to work. That's peculiar, because the manpage tells me that <$fh>, with $fh='INPUT', should work.

    6. You are totally right on the matter of the inner/ outer naming convention.

    7. And ++ for the join( $_, @_) stuff. I never would have dared to use it. But of course $_ and @_ have different namespaces...

    8. I removed the BEGIN blocks. Is this something for the manpages (perldoc perlmod)?

    Finally, I tested the code with 2D-arrays. It works. I'm leaving home for the remainder of this year, so we'll continue next year.

    Happy new year everyone, best wishes, and thanx for the comments!

    Jeroen

    The new code, with POD, are here:

      Here is explanation for my feedback:
      1. Why reverse the order for the arguments? join and split both first start with the separator, and than the input. So I changed the order back to strings, input.

        Because if you have positionally determined arguments with one list being variable length, it is usual to have the variable list at the end of your argument list. In this case when I made it recursive (and therefore capable of handling n-dim arrays) you had a variable length list of things to join and split. So I moved those arguments to the end.

      2. I don't like to pre-compile the regex's, otherwise the split couldn't cope with changing delimiters, as in a text file (see the SYNOPSIS), or with sprintf'ed data. So I changed that back. Furthermore, I couldn't find any reference to qr// in manpages. Could you please explain?

        Pre-compiling the regexes should not pose a problem for having patterns that handle multiple delimiters. Could you try it and report back? As for documents, the docs on this server are 5.003 specific. (Same as Camel 2.) Most people are on 5.005 or 5.6. On those machines you can find out about the feature from the local documentation using the perldoc utility. In fact in this case: perldoc -f qr directs you to perlop/"Regexp Quote-Like Operators". So try perldoc perlop and then type /Quote-Like to get to the relevant section. Then /qr and hit 'n' until you get to the right spot. (The same search/paging tricks work with utilities like man and less on *nix systems.)

      3. ...more dimensional arrays are a perl 5 feature, so I should check for that anyway. Out of time now, so next version of supersplit.

        I already did that with the recursion. :-)

      4. I really like the recursive approach.

        So did I. :-)

      5. I don't see the need for a separate IO version, so I changed that back, too. I just try to treat the string as a filehandle, or try to open it as file (new feature). I didn't succeed to get supersplit( INPUT ), with INPUT as a filehandle, to work. That's peculiar, because the manpage tells me that <$fh$gt;, with $fh='INPUT', should work.

        The need is due to your having overloaded the input too much. For instance if someone tried to use your current version of supersplit() on an uploaded file from CGI they would fail miserably. I also really don't like trying an open and silently failing.

        Additionally it is generally a bad idea to limit how your caller can pass information. What if I really want to pass you data from a socket? Or from IO::Scalar? Or from a string I have already pre-processed? Having two functions, one of which is a wrapper around the other, for that situation leaves you with a consistent interface and more flexibility.

        As for your comment on what you are surprised is failing, I would not expect that to work. Which manpage led you to expect that it would?

      6. You are totally right on the matter of the inner/outer naming convention.

        Get bitten often enough and you become sensitive to potential confusions in names. :-)

        The real issue here is the same one which makes it hard for programmers to find their own bugs. You need to step out of your own pre-conceptions of how you are supposed to be working and thinking and see the problem from what another person's PoV is likely to be. This is frequently much easier for another person to do...

      7. And ++ for the join( $_, @_) stuff. I never would have dared to use it. But of course $_ and @_ have different namespaces..

        :-)

      8. I I removed the BEGIN blocks. Is this something for the manpages (perldoc perlmod)?

        Well it is something that I know because I looked in some detail at Carp and Exporter a while ago. While the principles of what happens when are documented, I don't think that the conclusion is stated anywhere. I certainly had to learn it by reading and thinking through the code.

        I have to ponder this for a while, and my time for perl is limited at the moment. But for the time being I will like to go into more detail about the filehandle stuff. I read CGI, and I understand the problem, because returns a variable containing both name and filehandle. I would guess that a check for reference should avoid this problem.

        My assumption about the workings of $fh='INPUT'; join '',<$fh>; are based on perldoc:perlop, saying 'If the string inside the angle brackets is a reference to a scalar variable (e.g., <$foo>), then that variable contains the name of the filehandle to input from, or a reference to the same.'.

        I understand the argument about the pipe and stuff, and I think a open.. "$fh" || open .. "<$fh" should do the trick here. I don't have IO::Scalar here, I'll ponder that later.

        You worry about a trying open and silently failing. Well, I tried it, and the code simply returns the name if the open failed (and there is nothing to split on) as the first element. That leaves room for a check.

        Furthermore, the code returns a prepared string as well, if reading as a filehandle fails. I quickly glanced at IO::Scalar, and I think the ref check should allow that as well. I think the following code should work.

        sub _text{ my $fh = shift; unless (defined($fh)) { $fh = \*STDIN; } if ( (! ref $fh) && ((open INPUT, "$fh") || open INPUT, "<$fh" )) +{ $fh = join '', <INPUT>; close INPUT; } no strict 'refs'; (join '', <$fh>) || $fh; }

        Bye,

        Jeroen
        I was dreaming of guitarnotes that would irritate an executive kind of guy (FZ)

        After some CB discussion, tilly convinced me why the open was not such good magic. So I devoted a separate routine to that. The filehandle still is in the normal supersplit routine.

        I personally vote for the split order of arguments, because people are more familiar with that. I think the variable length list can be coped with by the pop's. If that's unusual, so be it. I know that this order is mostly not a good idea.

        Moreover, you can provide a limit option now. Because there is some magic involved, I doubled a function that doesn't try to find a limit parameter at the end.

        The code has been posted more than enough now, so I hooked it up to my homenode. POD has been updated. Code has been tested.

        Cheers,

        Jeroen
        I was dreaming of guitarnotes that would irritate an executive kind of guy (FZ)