Re: Would Perl be a good choice for this?

by Discipulus (Canon)
on Oct 02, 2017 at 19:42 UTC


in reply to Would Perl be a good choice for this?

> ..where I would even start.

Hello Speed_Freak, your question is confusing me: too much data, no code at all from your part, no expected results, and I do not really understand these subgroups and the goal.

But since you are asking where to start: "know your data" is a good suggestion, and another good quote goes like: when you know your data deeply, the algorithm is simply a matter of implementation.

So where to start? ordering => array and indexing => hash

I mean that when you are processing your data you split up the elements and fill a datastructure that suits your needs. So the basis is a simple loop that consumes lines of data:

    use strict;
    use warnings;

    # read the data one line at a time
    while (<DATA>){
        chomp;
        my @ele = split /\s+/, $_;   # split the line into its fields
        # ... fill your datastructure here ...
    }

Now that you have @ele you need to coerce it to your logic: supposing you need to store which ID ( $ele[0] ) has $ele[1] + $ele[2], you can index the presence of $ele[1] $ele[2] by using it as the key of a hash and pushing the IDs as values onto an anonymous array:

    use strict;
    use warnings;

    my %res;
    while (<DATA>){
        chomp;
        my @ele = split /\s+/, $_;
        # key: the two markers, value: array of the IDs that have them
        push @{ $res{"$ele[1] $ele[2]"} }, $ele[0];
    }
    __DATA__
    1 monkey cow hammer nail
    2 monkey sheep hammer nail
    3 dog cat hammer nail
    4 monkey cow hammer nail

This leads you to a datastructure like: ("dog cat", [3], "monkey sheep", [2], "monkey cow", [1, 4])
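To check what the loop really built, one option is to dump the hash with the core module Data::Dumper. This is only a small sketch: it holds the same sample records in an array (@lines is a name chosen for this example) instead of reading them from __DATA__.

```perl
use strict;
use warnings;
use Data::Dumper;

# Same sample records as above, kept in an array for this sketch
my @lines = (
    '1 monkey cow hammer nail',
    '2 monkey sheep hammer nail',
    '3 dog cat hammer nail',
    '4 monkey cow hammer nail',
);

my %res;
for my $line (@lines) {
    my @ele = split /\s+/, $line;
    # key: the two markers, value: array of the IDs that have them
    push @{ $res{"$ele[1] $ele[2]"} }, $ele[0];
}

$Data::Dumper::Sortkeys = 1;   # print the keys in a stable, sorted order
$Data::Dumper::Indent   = 1;   # compact indentation
print Dumper(\%res);
```

Hash keys have no inherent order, so setting Sortkeys makes the dump easier to compare between runs.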

If you just need to know which IDs have monkey, you'll loop over the keys of the hash searching for the pattern monkey, as in:

    foreach my $key (keys %res){
        if ($key =~ /monkey/) {
            print "monkey [occurrence in $key] found in IDs: ",
                  (join ', ', @{$res{$key}}), "\n";
        }
    }

This is my "where to start".
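If lookups by a single marker turn out to be the common case (an assumption about your needs, not something stated in the question), another option is to index every marker word on its own, so no regex scan over the keys is needed. In this sketch %ids_for and @lines are made-up names, and all fields after the ID are treated as markers:

```perl
use strict;
use warnings;

# Same sample records as above, kept in an array for this sketch
my @lines = (
    '1 monkey cow hammer nail',
    '2 monkey sheep hammer nail',
    '3 dog cat hammer nail',
    '4 monkey cow hammer nail',
);

my %ids_for;    # marker word => array of the IDs that carry it
for my $line (@lines) {
    my ($id, @markers) = split /\s+/, $line;
    push @{ $ids_for{$_} }, $id for @markers;
}

# Direct hash lookup; an unknown marker gives an empty list
my @monkey_ids = @{ $ids_for{monkey} // [] };
print "monkey found in IDs: ", join(', ', @monkey_ids), "\n";
```

The trade-off is memory: every marker of every record gets its own entry. Note also that a marker repeated within one line would push the same ID twice.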

L*

PS: perldsc and Using Perl for Statistics: Data Processing and Statistical Computing (2004) as further reading suggestions.

L*

There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.

Re^2: Would Perl be a good choice for this?
by Speed_Freak (Sexton) on Oct 02, 2017 at 20:32 UTC

    Thanks for the response! Sorry for not including any code, I haven't even gotten that far yet.

    Maybe I can try to better explain what I am doing if you're interested... The markers are actually genetic sequences (1-138k, yes/no for presence), the items are samples, and the sub-groups are animals. I'm using an R program that uses a Gibbs sampler to look for the commonality between the known sub-groups and an unknown sample... The idea being that you can identify proportions of the known sub-groups in the unknown sample.

    I currently have a large library of known samples that correspond to various sub-groups of animals. But the 138k markers are causing the R script to bog down substantially. (4+ days per unknown due to single core limitations.) So I want to choose a subset of the 138k markers to run. Ideally this subset would have markers that are unique to each sub-group, but the "uniqueness" could be variable. As in, total list output per sub-group, and % unique from other sub-groups. (By altering parameters, I would be able to request a list of 10k IDs from each sub-group that are 80% dissimilar from every other sub-group. Or a list of 5k that are 95% dissimilar...etc.) I definitely need to read up on statistics to figure out what I'm actually asking for!
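One possible reading of that request, sketched in Perl under loud assumptions: "X% dissimilar" is taken here to mean that a marker's presence frequency in one sub-group exceeds its frequency in every other sub-group by at least X. That definition, the toy data, and the names %samples and pick_markers are all invented for illustration; the real statistics may call for something quite different.

```perl
use strict;
use warnings;

# Hypothetical toy input: sample => [ sub-group, { marker => 1 if present } ]
my %samples = (
    s1 => [ 'cow',   { m1 => 1, m2 => 1 } ],
    s2 => [ 'cow',   { m1 => 1, m3 => 1 } ],
    s3 => [ 'sheep', { m2 => 1, m3 => 1 } ],
    s4 => [ 'sheep', { m3 => 1 } ],
);

# Count samples per sub-group, and marker presence per sub-group
my (%n, %seen);
for my $s (values %samples) {
    my ($group, $markers) = @$s;
    $n{$group}++;
    $seen{$_}{$group}++ for keys %$markers;
}

# Return the markers whose presence frequency in $group is higher than
# in every other sub-group by at least $min_gap (assumed definition)
sub pick_markers {
    my ($group, $min_gap) = @_;
    my @picked;
    for my $m (sort keys %seen) {
        my $own = ($seen{$m}{$group} // 0) / $n{$group};
        my $max_other = 0;
        for my $g (grep { $_ ne $group } keys %n) {
            my $f = ($seen{$m}{$g} // 0) / $n{$g};
            $max_other = $f if $f > $max_other;
        }
        push @picked, $m if $own - $max_other >= $min_gap;
    }
    return @picked;
}

print "cow, gap 0.8: @{[ pick_markers('cow', 0.8) ]}\n";
print "sheep, gap 0.5: @{[ pick_markers('sheep', 0.5) ]}\n";
```

Capping the list at a fixed size (say 10k per sub-group) would then be a matter of sorting the picked markers by their gap and keeping the top entries.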
