Would Perl be a good choice for this?

Speed_Freak has asked for the wisdom of the Perl Monks concerning the following question:

I am looking to do some pattern recognition in datasets. I have ~138,000 marker IDs, each scored as either 1 (yes) or 0 (no). I have a collection of items that fall into their own sub-groups. I need to find out which markers are distinctive for each sub-group, that is, shared within it but not with the other sub-groups. In the example below, Items 1 and 2 would be one sub-group, 3 and 4 another, and 5 and 6 a third. (This dataset is a poor representation because of its small size, and these items are all actually from one sub-group in my collection, so they should be fairly similar.)

My goal is a script that I can run on updated collections with definable sub-groups and that outputs, for each sub-group, a list of a user-definable number of marker IDs that are common to that sub-group but not to the others. For example, I want 10k IDs for each sub-group that are relevant to that sub-group but not to the others. I wasn't sure whether this is something Perl would be good for and, if it is, where I would even start.

 Example data:
ID  Item1 Item2 Item3 Item4 Item5 Item6
1    0    0    0    0    0    0
2    1    1    0    1    1    0
3    1    0    1    1    1    1
4    0    0    0    0    0    0
5    1    1    1    1    1    0
6    0    0    0    0    0    0
7    0    0    0    0    0    0
8    0    0    0    0    0    1
9    1    1    1    1    1    1
10    0    0    0    0    0    0
11    1    1    1    1    1    1
12    0    1    0    1    0    0
13    0    0    0    0    0    0
14    1    1    1    1    1    1
15    1    1    1    1    1    1
16    1    1    1    1    1    1
17    1    1    1    1    1    1
18    1    1    1    1    1    1
19    0    0    0    1    1    1
20    1    1    1    1    1    1
21    0    0    0    0    0    0
22    0    0    0    0    0    0
23    1    1    1    1    1    1
24    1    1    1    1    1    1
25    1    1    0    1    1    1
26    1    1    1    1    1    1
27    0    0    0    0    1    1
28    0    1    0    1    1    1
29    0    0    0    0    0    0
30    0    0    0    0    0    0
31    1    1    0    1    0    0
32    0    0    0    0    0    0
33    1    0    1    0    1    1
34    0    0    0    1    0    1
35    1    0    0    1    1    1
36    0    0    0    0    0    0
37    0    0    0    0    0    0
38    0    0    1    0    0    0
39    1    0    0    0    0    0
40    0    0    0    0    0    0
41    1    1    0    1    1    1
42    0    0    0    0    0    0
43    0    0    0    0    0    0
44    0    0    0    0    0    1
45    1    0    0    0    0    0
46    1    0    0    1    0    0
47    1    1    1    1    1    1
48    0    0    0    0    0    0
49    1    0    1    1    0    1
50    1    1    1    1    1    1
51    0    0    0    0    0    0
52    1    0    0    0    0    1
53    0    0    0    0    0    0
54    1    0    1    0    0    0
55    0    0    0    0    0    0
56    1    0    0    1    1    1
57    0    0    0    0    0    0
58    0    0    0    0    0    0
59    0    0    0    0    0    0
60    0    0    0    0    0    0
61    0    0    0    0    0    0
62    0    0    0    0    0    0
63    1    0    0    0    1    0
64    1    1    0    0    1    1
65    1    0    0    0    0    0
66    1    1    1    1    1    1
67    1    1    1    1    1    1
68    0    0    0    1    1    1
69    1    0    1    1    1    0
70    0    0    0    0    0    0
71    0    0    0    0    0    0
72    1    0    1    0    1    0
73    0    0    0    0    1    1
74    0    0    0    1    1    0
75    1    1    1    1    1    1
76    1    1    1    1    1    1
77    1    0    0    0    0    0
78    1    1    1    1    1    1
79    0    0    0    0    0    0
80    0    0    0    0    0    0
81    0    1    1    0    1    1
82    1    1    1    1    1    1
83    1    1    1    1    1    0
84    0    0    0    0    0    0
85    1    1    1    1    1    1
86    0    0    0    0    0    0
87    1    1    1    1    1    1
88    0    0    0    1    1    1
89    0    0    0    0    0    0
90    0    0    0    0    0    0
91    0    0    0    0    0    0
92    1    1    1    1    1    1
93    0    0    0    0    0    0
94    1    1    1    1    1    1
95    0    0    0    0    0    0
96    1    1    1    1    1    1
97    0    0    0    0    0    0
98    0    0    0    0    0    0
99    1    0    0    1    1    1
100    0    0    0    0    0    0
101    1    1    1    1    1    1
102    1    0    0    0    0    0
103    0    0    0    0    0    0
104    0    0    0    0    0    1
105    0    0    0    0    0    0
106    0    0    0    0    1    1
107    1    1    1    1    1    1
108    1    1    1    1    1    1
109    0    0    0    0    0    0
110    0    0    0    0    0    0
111    0    0    0    0    0    0
112    0    0    0    0    0    0
113    1    1    1    1    1    1
114    1    1    1    1    1    1
115    0    0    0    0    0    0
116    0    0    0    0    0    0
117    0    0    0    0    0    0
118    0    0    0    0    0    0
119    0    0    0    0    0    0
120    0    0    0    0    0    0
121    0    0    0    0    0    0
122    1    1    0    1    1    1
123    1    1    1    1    1    1
124    1    1    0    1    1    1
125    0    0    0    0    0    0
126    0    1    0    0    1    1
127    0    0    0    0    0    1
128    1    1    1    1    1    1
129    1    0    0    1    1    1
130    1    0    0    0    0    0
131    0    0    0    0    0    0
132    1    0    0    1    1    0
133    1    1    1    1    1    1
134    1    1    1    1    1    1
135    0    0    0    0    1    1
136    0    0    0    0    0    0
137    0    0    0    0    0    0
138    0    0    0    0    0    0
139    0    0    0    0    0    0
140    1    1    1    1    1    1
141    0    0    0    0    0    0
142    0    0    0    0    0    0
143    1    1    1    1    1    1
144    1    1    1    1    1    1
145    0    0    0    0    0    0
146    0    0    0    0    0    0
147    1    1    1    1    1    1
148    1    1    1    1    1    1
149    0    1    1    1    1    1
150    0    0    0    0    0    0
151    0    0    0    0    0    0
152    1    1    1    1    1    1
153    0    0    0    0    0    0
154    0    0    0    0    0    0
155    1    1    1    1    1    1
156    1    1    1    1    1    1
157    0    0    0    0    0    0
158    1    1    1    1    1    1
159    0    0    0    0    0    0
160    0    0    0    0    0    0
161    1    1    1    1    1    1
162    0    0    0    0    0    0
163    0    0    0    0    0    0
164    0    0    0    0    0    0
165    0    0    0    0    0    0
166    0    0    0    0    0    0
167    0    0    0    0    0    0
168    1    1    0    1    0    0
169    1    1    1    1    1    1
170    1    1    1    1    1    1
171    0    1    0    1    0    0
172    0    0    0    0    0    0
173    1    1    1    1    1    0
174    0    0    0    0    0    0
175    0    0    0    0    0    0
176    0    0    0    0    0    0
177    0    0    0    0    0    0
178    0    0    0    0    0    0
179    0    0    0    0    0    0
180    1    1    1    1    1    1
181    0    0    0    0    0    0
182    1    1    1    1    1    1
183    0    0    0    0    0    0
184    1    1    1    1    1    1
185    1    1    1    1    1    1
186    0    0    0    1    0    0
187    1    1    1    1    1    1
188    1    1    1    1    1    1
189    0    0    0    0    0    0
190    0    0    0    0    0    0
191    0    0    0    0    0    0
192    1    1    0    1    1    1
193    1    1    1    1    0    0
194    0    0    0    0    0    0
195    0    0    0    0    0    0
196    1    1    1    1    1    1
197    1    1    1    1    1    1
198    1    0    0    0    0    0
199    1    1    1    1    1    1
200    0    0    0    0    0    0

...138k
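As a possible first step toward the goal described in the question, the table can be reduced to one number per marker and sub-group: the fraction of that sub-group's items that carry the marker. The sketch below is only an illustration under stated assumptions: the sub-group labels `A`, `B`, `C`, the column assignment in `%groups`, and the helper name `group_fractions` are all hypothetical, and the three data rows are copied from the example table.

```perl
use strict;
use warnings;

# Assumption: sub-groups are given as column positions after the ID column.
# The labels A, B, C are hypothetical names for the three sub-groups.
my %groups = ( A => [ 1, 2 ], B => [ 3, 4 ], C => [ 5, 6 ] );

# Returns { marker_id => { group => fraction of items in that group with a 1 } }.
sub group_fractions {
    my ( $fh, $groups ) = @_;
    my %frac;
    while ( my $line = <$fh> ) {
        next unless $line =~ /^\s*\d/;    # skip the header and blank lines
        my @col = split ' ', $line;       # $col[0] is the marker ID
        for my $g ( keys %$groups ) {
            my @idx = @{ $groups->{$g} };
            my $sum = 0;
            $sum += $col[$_] for @idx;
            $frac{ $col[0] }{$g} = $sum / scalar @idx;
        }
    }
    return \%frac;
}

# Three rows copied from the example table (markers 12, 27 and 39).
my $sample = <<'END';
ID  Item1 Item2 Item3 Item4 Item5 Item6
12    0    1    0    1    0    0
27    0    0    0    0    1    1
39    1    0    0    0    0    0
END

open my $fh, '<', \$sample or die "Cannot open in-memory data: $!";
my $frac = group_fractions( $fh, \%groups );

for my $id ( sort { $a <=> $b } keys %$frac ) {
    my @parts;
    push @parts, sprintf( '%s=%.2f', $_, $frac->{$id}{$_} ) for sort keys %groups;
    print "$id  @parts\n";
}
```

Reading line by line keeps memory use proportional to the number of markers rather than the size of the file, which should be comfortable for 138k rows; pointing `$fh` at a real file instead of the in-memory string is the only change needed for actual data.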

Replies are listed 'Best First'.
Re: Would Perl be a good choice for this? by Discipulus (Canon) on Oct 02, 2017 at 19:42 UTC
> ..where I would even start.

Hello Speed_Freak, your question is confusing me: too much data, no code at all from your part, no expected results, and I do not really understand these subgroups or the goal.

But since you are asking where to start: "know your data" is a good suggestion, and another good quote goes something like: when you know your data deeply, the algorithm is simply a matter of implementation.

So where to start? ordering => array and indexing => hash. I mean that while you are processing your data you split up the elements and fill a data structure that suits your needs. So the basis is a simple loop that consumes lines of data:

`use strict; use warnings; while (<DATA>){ chomp; my @ele = split ' ', $_; }`

Now that you have `@ele` you need to coerce it to your logic. Supposing you need to store which ID ( `$ele[0]` ) has `$ele[1] + $ele[2]`, you can index the `$ele[1] $ele[2]` presence, use it as the key of a hash, and push the IDs as values onto an anonymous array:

`use strict; use warnings; my %res; while (<DATA>){ chomp; my @ele = split ' ', $_; push @{ $res{"$ele[1] $ele[2]"} }, $ele[0]; } __DATA__ 1 monkey cow hammer nail 2 monkey sheep hammer nail 3 dog cat hammer nail 4 monkey cow hammer nail`

This leads you to a data structure like:

`("dog cat", [3], "monkey sheep", [2], "monkey cow", [1, 4])`

If you just need to know which IDs have `monkey`, you loop over the keys of the hash searching for the pattern `monkey`, as in:

`foreach my $key (keys %res){ if ($key =~ /monkey/) { print "monkey [occurrence in $key] found in IDs: ", (join ', ', @{ $res{$key} }), "\n"; } }`

This is my where to start.

L*

PS: `perldsc` and Using Perl for Statistics: Data Processing and Statistical Computing (2004) as further reading suggestions.

L*
There are no rules, there are no thumbs.. Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
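The same hash-of-arrays idea can be applied to the marker table itself: use each marker's 0/1 pattern across the six items as the hash key and collect the marker IDs that share it. This is just a sketch of that idea, not anything from the posts above; the variable names are made up, and the five data rows are copied from the example table in the question.

```perl
use strict;
use warnings;

# Five rows copied from the example table in the question.
my $table = <<'END';
ID  Item1 Item2 Item3 Item4 Item5 Item6
1    0    0    0    0    0    0
2    1    1    0    1    1    0
12    0    1    0    1    0    0
27    0    0    0    0    1    1
171    0    1    0    1    0    0
END

my %ids_by_pattern;    # "010100" => [ marker IDs showing that pattern ]

for my $line ( split /\n/, $table ) {
    my ( $id, @calls ) = split ' ', $line;
    next unless defined $id && $id =~ /^\d+$/;    # skip header and blank lines
    push @{ $ids_by_pattern{ join '', @calls } }, $id;
}

for my $pattern ( sort keys %ids_by_pattern ) {
    my @ids = @{ $ids_by_pattern{$pattern} };
    printf "%s  %d marker(s): %s\n", $pattern, scalar @ids, join ', ', @ids;
}
```

With only six items there are at most 64 distinct patterns, so this gives a quick way to "know your data": how many of the 138k markers are all-zero, all-one, or specific to a pair of columns.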
Re^2: Would Perl be a good choice for this? by Speed_Freak (Sexton) on Oct 02, 2017 at 20:32 UTC
Thanks for the response! Sorry for not including any code, I haven't even gotten that far yet. Maybe I can try to better explain what I am doing if you're interested...

The markers are actually genetic sequences (1-138k, yes/no for presence), the items are samples, and the sub-groups are animals. I'm using an R program that uses a Gibbs sampler to look for the commonality between the known sub-groups and an unknown sample... The idea being that you can identify proportions of the known sub-groups in the unknown sample.

I currently have a large library of known samples that correspond to various sub-groups of animals. But the 138k markers are causing the R script to bog down substantially. (4+ days per unknown due to single core limitations.) So I want to choose a subset of the 138k markers to run. Ideally this subset would have markers that are unique to each sub-group, but the "uniqueness" could be variable, as in: total list size per sub-group, and % unique from other sub-groups. (By altering parameters, I would be able to request a list of 10k IDs from each sub-group that are 80% dissimilar from every other sub-group. Or a list of 5k that are 95% dissimilar... etc.)

I definitely need to read up on statistics to figure out what I'm actually asking for!
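One way to read the two parameters described above (list size per sub-group and % dissimilarity) is sketched below. It rests on an assumption that the post itself does not settle: a marker's "dissimilarity" for a sub-group is taken to be its presence fraction in that sub-group minus the highest presence fraction in any other sub-group. The input hash, the marker names `m1`..`m5`, and the function `pick_markers` are all hypothetical.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical input: fraction of samples in each sub-group carrying each marker.
my %frac = (
    m1 => { A => 1,     B => 0,   C => 0.125 },
    m2 => { A => 0.75,  B => 0.5, C => 0 },
    m3 => { A => 0,     B => 1,   C => 0 },
    m4 => { A => 0.125, B => 1,   C => 0 },
    m5 => { A => 1,     B => 1,   C => 1 },
);

# Return up to $limit marker IDs for $group whose score is at least $min_score.
# Assumed score: fraction in $group minus the largest fraction in any other group.
sub pick_markers {
    my ( $frac, $group, $limit, $min_score ) = @_;
    my %score;
    for my $id ( keys %$frac ) {
        my @others = grep { $_ ne $group } keys %{ $frac->{$id} };
        my $score  = $frac->{$id}{$group} - max( 0, map { $frac->{$id}{$_} } @others );
        $score{$id} = $score if $score >= $min_score;
    }
    # Best score first; ties broken by marker ID so the output is stable.
    my @ranked = sort { $score{$b} <=> $score{$a} || $a cmp $b } keys %score;
    splice @ranked, $limit if @ranked > $limit;
    return @ranked;
}

for my $group (qw(A B C)) {
    my @picked = pick_markers( \%frac, $group, 10, 0.8 );
    print "$group: @picked\n";
}
```

Raising `$min_score` toward 1 asks for markers that are nearly exclusive to one sub-group; lowering it trades exclusivity for a longer list. Whether this score is the right statistic for the downstream Gibbs sampler is an open question.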
Re: Would Perl be a good choice for this? by SuicideJunkie (Vicar) on Oct 02, 2017 at 19:29 UTC
Problems are generally language agnostic, aside from being anti-English (or any other human language). Getting the intent refined down to an algorithm is the hard part.

Do you need to identify subgroups, or filter items into known subgroups? Either way, I'd suggest you keep those processes separate.

Also, how fuzzy are these groups? You said the examples are all from the same greater subgroup, but some have all zeros and others have all ones. Are the groups defined by simple sets of must-have/can't-have/don't-care, or are there more complex conditions where the relationship between two Items determines the group membership (e.g. Item1 XOR Item2)?
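To make the must-have/can't-have/don't-care idea concrete, here is a small sketch of how such a rule could be represented and checked. The rule itself is invented for illustration; only the 0/1 calls for markers 12, 27 and 39 are taken from the Item2 and Item5 columns of the example table. The names `%rule` and `matches_rule` are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical rule for one sub-group: marker ID => required call.
# 1 = must have, 0 = can't have; markers not listed are "don't care".
my %rule = ( 12 => 1, 27 => 0 );

# A sample is a hash of marker ID => 0/1 call; a missing call counts as 0.
sub matches_rule {
    my ( $sample, $rule ) = @_;
    for my $marker ( keys %$rule ) {
        return 0 unless ( $sample->{$marker} // 0 ) == $rule->{$marker};
    }
    return 1;
}

# Calls for markers 12, 27 and 39 from the Item2 and Item5 columns.
my %item2 = ( 12 => 1, 27 => 0, 39 => 0 );
my %item5 = ( 12 => 0, 27 => 1, 39 => 0 );

print 'Item2 matches: ', matches_rule( \%item2, \%rule ), "\n";
print 'Item5 matches: ', matches_rule( \%item5, \%rule ), "\n";

# A relational condition (here: exactly one of markers 12 and 27 present)
# does not fit the per-marker rule and needs its own predicate.
my $one_of_two = sub { ( $_[0]{12} xor $_[0]{27} ) ? 1 : 0 };
print 'Item5 12-XOR-27: ', $one_of_two->( \%item5 ), "\n";
```

If the groups turn out to need only per-marker rules, the hash form is enough; relational conditions push the design toward code references or a small rule language.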
Re^2: Would Perl be a good choice for this? by Speed_Freak (Sexton) on Oct 02, 2017 at 20:33 UTC
I think I answered these questions in the response above? Let me know if I need to elaborate. Thanks!
Re: Would Perl be a good choice for this? by Anonymous Monk on Oct 02, 2017 at 18:18 UTC
Perl is a great choice for getting started with this kind of problem. It's much easier to experiment with different algorithms and techniques in a high-level language like Perl than a low-level one like C. Once you've gotten comfortable with things, you might decide that a different language is better for your purposes, but Perl is a great place to start! This is a huge and very interesting field. You might start by reading about statistical classification.
Re: Would Perl be a good choice for this? by LanX (Saint) on Oct 02, 2017 at 18:08 UTC
Yes.

Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :)
Je suis Charlie!
Re: Would Perl be a good choice for this? by Anonymous Monk on Oct 03, 2017 at 01:29 UTC
Yes, a Maserati can get you to the end of the curb ... but that is not the question. The actual question is, do you know how to drive? And, do you know where you want to go, and do you have even the very-slightest idea how to get there?


Come for the quick hacks, stay for the epiphanies.
	PerlMonks