http://qs321.pair.com?node_id=988118


in reply to Re: Get random unique lines from file
in thread Get random unique lines from file

The problem with this is that it only works with individual lines, but FASTA files contain multi-line records, each consisting of one header line and a variable number of payload lines.
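
To make the difference concrete, here is a minimal sketch of reading a file record by record rather than line by line. It assumes only that a header is any line starting with '>'; the sub name read_fasta_records is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return a list of records, each an array ref of lines
# with the header line first. Assumes a header is any line starting with '>'.
sub read_fasta_records {
    my ($fh) = @_;
    my @records;
    while ( my $line = <$fh> ) {
        if ( $line =~ /^>/ ) {
            push @records, [$line];           # a header starts a new record
        }
        elsif (@records) {
            push @{ $records[-1] }, $line;    # payload belongs to the latest header
        }
        # lines before the first header are ignored
    }
    return @records;
}

# Usage with an in-memory file holding two records
my $fasta = ">seq1\nACGT\nTTGA\n>seq2\nGGCC\n";
open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";
my @records = read_fasta_records($fh);
close $fh;
printf "%d records\n", scalar @records;
```

Sampling then has to pick among the elements of @records, not among raw lines.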


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

Re^3: Get random unique lines from file
by roboticus (Chancellor) on Aug 17, 2012 at 23:50 UTC

    BrowserUk:

    Ah, well, I know jack about FASTA files, so I didn't consider that. Of course, by changing the reader to accumulate records instead of lines, it could be adapted. Though since there are already a couple working examples from you and Marshall, and since mine has a bias in it, there's no real reason to do so.

    I know that *you* know how to make the changes, but if someone tripping across this node in the future wants to do it, they could do something (untested!) like this:

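    my @record;
    while (1) {
        my $line = <$FH>;
        # A marker line (or end of file) completes the record accumulated so far
        if (@record and (!defined $line or $line =~ /start of record marker/)) {
            ++$cnt_recs;
            if ($num/$cnt_recs > rand) {
                my $i = @samples;
                # Sample already full: overwrite a random slot instead of growing
                if ($i >= $num) { $i = rand @samples; }
                $samples[$i] = [$cnt_recs, [@record]];
            }
            @record = ();
        }
        last unless defined $line;
        # Accumulate record (marker line included)
        push @record, $line;
    }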

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      I don't think that this is the best way...
      BrowserUk and I both used shuffle() from the core module List::Util.

      He understood the FASTA format better than I did and that is fine given that the format of the OP's question was hard to "decode".

      The main point is that this core shuffle() function works very well, is very fast (it is implemented in C), and has an interface that is easy to understand. I recommend using it rather than trying to "roll your own".

      Oh, BTW, "core" means that the module ships with Perl itself, so no extra module installation is required. Well, I don't know exactly about "all" Perl systems, but List::Util has been bundled since Perl 5.8 (it first appeared in 5.7.3), so for about a decade.
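
      As a rough illustration of that recommendation, here is a minimal sketch that picks a few distinct records with shuffle(). The sub name pick_random and the toy records are made up for this example, and it assumes the whole set of records fits in memory.

      ```perl
      use strict;
      use warnings;
      use List::Util qw(shuffle);

      # Hypothetical helper: pick $n distinct records at random.
      # @records is assumed to already hold whole records, one per element.
      sub pick_random {
          my ( $n, @records ) = @_;
          $n = @records if $n > @records;    # cannot pick more than we have
          return () if $n < 1;
          my @shuffled = shuffle @records;
          return @shuffled[ 0 .. $n - 1 ];   # the first $n of a random ordering
      }

      # Ten toy records, each a header line plus one payload line
      my @records = map { ">seq$_\nACGT\n" } 1 .. 10;
      my @picked  = pick_random( 3, @records );
      print @picked;
      ```

      Because the picks are a slice of one shuffled list, no record can be chosen twice.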

        Marshall:

        I wasn't really worried whether it was the best way or not, nor whether it used modules or not. I was just amused by the technique for getting a single random line from a file with equal probability, and wanted to generalize it so I could use it for multiple lines.

        Unfortunately, I haven't come up with any ideas that don't introduce a bias. (I haven't thought about it really hard for the last few days, but I've given it occasional thought during my daily commutes.)

        I might be able to come up with something if I would sit down and analyze the probabilities, but it's not quite interesting enough to work *that* hard on it! ;^)
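
        For reference, the usual generalization of the single-line trick is reservoir sampling (often called Algorithm R): keep the first k items, then let item number i replace a uniformly chosen slot with probability k/i. A minimal sketch follows; the sub name reservoir_sample is made up here, and it samples plain lines rather than FASTA records.

        ```perl
        use strict;
        use warnings;

        # Hypothetical helper: return $k lines chosen uniformly from $fh in one pass.
        sub reservoir_sample {
            my ( $k, $fh ) = @_;
            my @sample;
            my $seen = 0;
            while ( my $line = <$fh> ) {
                ++$seen;
                if ( @sample < $k ) {
                    push @sample, $line;               # fill the reservoir first
                }
                else {
                    my $j = int rand $seen;            # uniform in 0 .. $seen-1
                    $sample[$j] = $line if $j < $k;    # true with probability k/seen
                }
            }
            return @sample;
        }

        # Usage with an in-memory file of 100 distinct lines
        my $text = join '', map { "line $_\n" } 1 .. 100;
        open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
        my @sample = reservoir_sample( 5, $fh );
        close $fh;
        print @sample;
        ```

        Only k lines are held in memory at any time, which is the appeal over shuffling the whole file.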

        ...roboticus
