PerlMonks  

Consolidating nstore arrays

by Speed_Freak (Sexton)
on Aug 30, 2017 at 20:52 UTC ( [id://1198355]=perlquestion )

Speed_Freak has asked for the wisdom of the Perl Monks concerning the following question:

This is going to be a terrible post, because I'm starting from scratch here... But I have a variable number of files housed in a scratch directory that has been created in whatever project directory I am working out of at the time (the project directory will change, and so will the file count). I need to target a specific set of arrays (some of the files were created from arrays) based on their names; those arrays were written with nstore.

So I need to search the project directory for the scratch folder, then find all of the files that match this pattern: "thing1_(ABC-def1g234567).foo__bar_ar"

Then I need to pull out the data from one column (let's say 5) for every row from each array and place it in a report. (There are 150k rows in each file.)

the final report would be something like:
Number, thing1, thing2, thing3, thing4
1, 0.55, 0.345, 0.243, 0.567
2, 0.678, 0.3, 0.4563, 0.546
3, 0.3243, 0.533, 0.44, 0.7

the numbers in the first column would need to be populated, but would correspond to the row that the data was read from. (each file has the same format, but the number isn't stored in those arrays.)

#this isn't really code, just my thoughts
#!/usr/bin/perl -w
use strict;
use FindBin;
use Data::Dumper;
use DBI;
use Storable qw(nstore retrieve);

#something that looks for the scratch directory in the current directory
#something that looks for any files matching the pattern thing1_(ABC-def1g234567).foo__bar_ar
#something that cycles through each file one at a time, pulling all of the rows for [5] and pushing them in order into an excel file matching the report format

I'm not asking for turn key code, I'm just looking for some guidance towards the donkey before I start trying to stick a tail on it. Thanks in advance!

Replies are listed 'Best First'.
Re: Consolidating nstore arrays
by 1nickt (Canon) on Aug 30, 2017 at 21:16 UTC

    You might like Path::Iterator::Rule; it's very powerful. See the SEE ALSO section for a discussion of alternatives.


    The way forward always starts with a minimal test.
Re: Consolidating nstore arrays
by tybalt89 (Monsignor) on Aug 31, 2017 at 01:14 UTC

    This is not turn key code, but hopefully it will provide some guidance.
    Some problems:
    I don't understand your pattern, it doesn't look perlish.
    Certainly comment out test data generation.
    There will be problems if input files have different sizes.
    This uses my current favorite Path::Tiny but could be adapted to others.

    #!/usr/bin/perl

    # http://perlmonks.org/?node_id=1198355

    use strict;
    use warnings;
    use Path::Tiny;
    use Storable qw(nstore retrieve);

    my $column = 1; # these could be passed in...
    my $filepattern = qr/(thing.*).foo__bar_ar/;
    my $projectdirectory = './some';

    # make test data
    path("$projectdirectory/$_")->mkpath for qw( one two/three );
    nstore [[1.34, 2.53], [3.26, 4.001]], "$projectdirectory/one/thing1.foo__bar_ar";
    nstore [[5.55, 6.911], [7.373, 8.808]],
      "$projectdirectory/two/three/thing3.foo__bar_ar";
    # end make test data

    my @answer;
    my $header = 'Number';
    path($projectdirectory)->visit(
      sub
        {
        my ($path) = @_;
        /$filepattern/ or return;
        $header .= ", $1";
        my $arrayref = retrieve($path);
        my $number = 0;
        $answer[$number++] .= ", " . $_->[$column] for @$arrayref;
        },
      { recurse => 1 }
      );

    my $number = 1;
    $_ = $number++ . "$_\n" for @answer;
    print "$header\n", @answer;

      I've been able to swap in my project directory and get this to partially work. I get this error 12 times: "Use of uninitialized value in concatenation (.) or string at test.pl line 35." Line 35 corresponds to this line:

       $answer[$number++] .= ", " . $_->[$column] for @$arrayref;

      I am searching the directory for 12 files, so it seems to make sense that there are 12 errors. Once the errors print, I get the header row, and then I get row number 1 as this: 1, , , , , , , , , , , ,. Then all of the following rows print just as intended.

      When I try to point to the column I want...(in this case, the 11th column. $column = 11) I get 3,635 of those errors before getting the blank row 1, and then all of the data that I want.

      nohup: Use of uninitialized value in concatenation (.) or string at test.pl line 35.
      Number, thing1, thing2, thing3, etc
      1, , , , , , , , , , , ,
      2, 0.954370854376634, 0.342822118341448, 0.572790744083124, 0.592243610847652, 0.8851590415068, 0.991382122632363, 0.507483754645392, 0.91163751726292, 0.746774421369453, 0.834212591847216, 0.405652973225827, 0.816633320526871
      3, 0.991624960933286, 0.991327116256435, 0.941102605547385, 0.994524353181868, 0.994089351524488, 0.936750116018622, 0.984664588090976, 0.991809613768339, 0.995157524403383, 0.99036040081599, 0.5066156863869, 0.559803287167354
      4, 0.971949973836183, 0.98269241861483, 0.98789028053774, 0.969816391963118, 0.967753189092842, 0.951648687904388, 0.993352530560803, 0.988214472795065, 0.989701083946332, 0.982533372295779, 0.271015752055197, 0.563841526116344
      5, 0.035742211264196, 0.365226753494403, 0.2865774134715, 0.20995232241765, 0.472126907503153, 0.380517314225538, 0.281019146329213, 0.216224917348467, 0.0665524840406343, 0.175285452052, 0.4819820327753, 0.423563619759855
      6, 0.756579058259939, 0.890650837623651, 0.931449842413731, 0.761404770602313, 0.704484524751191, 0.399501625385289, 0.516747548785127, 0.825989268175438, 0.590967055354945, 0.524160278838733, 0.166142404171008, 0.000491664086159459
      7, 0.712831728096357, 0.254713361402047, 0.539928335806198, 0.0803424723962962, 0.0344976765160731, 0.182113998255013, 0.14620549377983, 0.0195129814144615, 0.532640937604125, 0.283745467826306, 0.348892862755017, 0.483644189994046

      One other problem I am having: I need the pattern to read more like my $filepattern = qr/(*).foo__bar_ar/, because the file names can change substantially, but the extension is always the same. But I get the error "Quantifier follows nothing in regex; marked by <-- HERE in m/(* <-- HERE ).foo__bar_ar/ at test.pl line 11." when I try things like * or *.*

      I've posted a small set of data sets in a reply to the main thread.

        You are missing a . in your pattern:

        my $filepattern = qr/(.*).foo__bar_ar/
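        To spell out the error: * is a quantifier, so in (*) it has nothing to repeat, whereas .* means "any character, any number of times". A minimal sketch of a slightly stricter variant follows. The escaped dot and the end anchor are additions, not part of the pattern above, and the sample file names are made up for illustration.

```perl
use strict;
use warnings;

# Escaped dot and end-of-string anchor are additions to the pattern above.
my $filepattern = qr/(.*)\.foo__bar_ar$/;

# Hypothetical names, for illustration only.
my @names = (
    'thing1_(ABC-def1g234567).foo__bar_ar',
    'thing2.foo__bar_ar',
    'thing3.foo__bar_ar.bak',
    'notes.txt',
);

my @labels;
for my $name (@names) {
    # $1 holds everything before the extension
    push @labels, $1 if $name =~ $filepattern;
}

print "$_\n" for @labels;
```

        Matched against a full path rather than a bare file name, (.*) would also capture the directory part; something like ([^/]*) is one way to keep only the file name on Unix-style paths.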

        Also, I'm totally confused about what is in your retrieve files.
        Please show (in code blocks) at least the first twenty lines from a Data::Dumper or Data::Dump print of one of your "retrieve" files as it is immediately after it is read in.
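        A minimal sketch of one way to produce such a dump. It assumes the top level of each stored file is an array reference; the file name and the toy data written to it are placeholders so the example runs on its own, and retrieve would be pointed at a real file instead.

```perl
use strict;
use warnings;
use Storable qw(nstore retrieve);
use Data::Dumper;
use File::Temp qw(tempdir);

# Placeholder data so the sketch is self-contained.
my $dir  = tempdir( CLEANUP => 1 );
my $file = "$dir/thing1.foo__bar_ar";
nstore [ map { [ $_, $_ * 2 ] } 1 .. 50 ], $file;

my $data = retrieve($file);    # retrieve returns a reference

# Take only the first twenty elements; guard against shorter arrays.
my $last  = $#$data < 19 ? $#$data : 19;
my @first = @{$data}[ 0 .. $last ];

$Data::Dumper::Indent = 1;     # more compact nesting
print Dumper( \@first );
```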

Re: Consolidating nstore arrays
by thanos1983 (Parson) on Aug 30, 2017 at 23:36 UTC

    Hello Speed_Freak,

    I would also like to propose an idea for your solution. Regarding the part "So I need to search the project directory for the scratch folder, then find all of the files that match this pattern: "thing1_(ABC-def1g234567).foo__bar_ar"", I would suggest using the module File::Find::Rule.

    Why use this module? For many good reasons. From the documentation: "Specifies names that should match. May be globs or regular expressions." and also "Do not apply any tests at levels less than $level (a non-negative integer).", among many other features. In other words, you can define which files to find, how deep to search recursively in subdirectories, etc.

    Sample of code in relevant question Re: Capturing and then opening multiple files.

    Hope this helps, BR.

    Seeking for Perl wisdom...on the process of learning...not there...yet!
Re: Consolidating nstore arrays
by Speed_Freak (Sexton) on Aug 31, 2017 at 16:39 UTC

    Thanks for the replies! I am putzing along making slow progress. I went with File::Find::Rule to find the files for now.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Find::Rule;
    use Storable qw(retrieve);
    use Data::Dumper;

    my @files = File::Find::Rule->file()
                                ->name( '*.foo__bar_ar' )
                                ->in('/home/foo/bar/snafu');

    #print Dumper (@files);

    foreach my $row (@files) {
        my @total_data = retrieve($row);
        my $target_data = map $_->[11], @total_data;
        print "Target data: $target_data\n";
    }

    I am unsuccessfully attempting to print the data from the nstore files. (I'm trying to test the concept and haven't even started down the path of cramming them into an array for an eventual dump to csv.) Instead I am getting the number "1" as many times as I have files in the directory.
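    A likely explanation for the "1"s, assuming each file holds a single array of rows: retrieve returns one reference, so @total_data ends up with exactly one element, and map assigned to a scalar yields the number of items it produced rather than the items themselves. A sketch with toy data (the file name and values are made up, and column 1 stands in for column 11):

```perl
use strict;
use warnings;
use Storable qw(nstore retrieve);
use File::Temp qw(tempdir);

# Toy file standing in for one of the real ones.
my $dir  = tempdir( CLEANUP => 1 );
my $file = "$dir/toy.foo__bar_ar";
nstore [ [ 'a', 0.5 ], [ 'b', 0.25 ], [ 'c', 0.125 ] ], $file;

# As in the post: one array ref lands in the list, and map runs in
# scalar context, so the result is a count.
my @total_data = retrieve($file);
my $count      = map $_->[1], @total_data;

# Keep the reference, dereference it, and collect map's output in an array.
my $rows        = retrieve($file);
my @target_data = map { $_->[1] } @$rows;

print "scalar map: $count\n";
print "list map:   @target_data\n";
```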

    Input files (arrays stored by storable)

    thing1.foo__bar_ar
    1,18.4,7.6,10.8,0.584615384615385,22,4.0,18,0.307692307692308,0.664861632672521,0.968405008381221,0.816633320526871
    0,31.5,18.9,12.6,0.75,199.7,29.2,170.5,0.255133245958934,0.150796748317197,0.968809826017511,0.559803287167354
    0,115.2,35.9,79.3,0.475181998676373,13.7,8.3,5.4,0.754545454545455,0.855054749249092,0.272628302983597,0.563841526116344
    0,969.7,1034.6,-64.8999999999999,1.03238038217832,1607.6,582.0,1025.6,0.531603945926197,0.0340815410482703,0.81304569847144,0.423563619759855
    0,3.2,13.2,-10,1.60975609756098,22.2,58.2,-36,1.44776119402985,0.000189855866797165,0.000793472305521753,0.000491664086159459

    thing2.foo__bar_ar
    0,124,24.9,99.1,0.334452652787105,533.5,764.2,-230.7,1.17777606534638,0.959457725728336,0.00783065425975528,0.483644189994046
    0,23.1,21.3,1.8,0.959459459459459,111.4,35.7,75.7,0.485384092454113,0.051736839732654,0.841995362489232,0.446866101110943
    0,65.2,106.7,-41.5,1.24141942990111,10.5,23.1,-12.6,1.375,0.00457360051269834,0.00151695997213462,0.00304528024241648
    0,4309.7,162.2,4147.5,0.0725418725821239,5949.4,350.9,5598.5,0.111391521038681,0.995577034355054,0.993485851801997,0.994531443078526
    0,10.5,17.7,-7.2,1.25531914893617,9.9,17.1,-7.2,1.26666666666667,0.00124657203727433,0.00112269442213042,0.00118463322970237

    thing3.foo__bar_ar
    0,3384.5,129.2,3255.3,0.0735407120698978,19718.2,2209.1,17509.1,0.201493115887501,0.995983099858049,0.988053036660467,0.992018068259258
    0,2483.6,571.2,1912.4,0.373968835930339,139.1,23.8,115.3,0.292203806015961,0.995670903509154,0.998383366800592,0.997027135154873
    1,13.7,26.3,-12.6,1.315,12.3,3.2,9.1,0.412903225806452,0.000255695896069042,0.821140634122146,0.410698165009108
    0,11323.1,1750.2,9572.9,0.26775183006586,1886.3,49.6,1836.7,0.0512423162353427,0.964837401973032,0.994660051822814,0.979748726897923
    0,18789.6,2845.0,15944.6,0.26300463146996,2834.9,86.2,2748.7,0.0590188627571804,0.966390332824062,0.99432887614011,0.980359604482086

    thing4.foo__bar_ar
    0,9239.1,2341.0,6898.1,0.404314297803991,8755.6,920.3,7835.3,0.190225198689528,0.938134184255461,0.986710551917049,0.962422368086255
    0,640.2,29.2,611,0.0872423065431731,291.6,19.5,272.1,0.125361620057859,0.992917934751572,0.990319885849402,0.991618910300487
    0,96.4,24.3,72.1,0.402651201325601,315.1,62.7,252.4,0.331921651667549,0.873629030952342,0.935006962479539,0.904317996715941
    1,44.7,46.5,-1.8,1.01973684210526,19.6,53.9,-34.3,1.46666666666667,0.0629076470018427,0.000324736116046742,0.0316161915589447
    1,66.5,17.8,48.7,0.422301304863582,29.6,26.0,3.6,0.935251798561151,0.908927363824235,0.0637831300468857,0.48635524693556

    Desired Output (index 11, counting from 0, for each row of each array)

    Number, thing1, thing2, thing3, thing4
    1,0.816633320526871,0.483644189994046,0.992018068259258,0.962422368086255
    2,0.559803287167354,0.446866101110943,0.997027135154873,0.991618910300487
    3,0.563841526116344,0.00304528024241648,0.410698165009108,0.904317996715941
    4,0.423563619759855,0.994531443078526,0.979748726897923,0.0316161915589447
    5,0.000491664086159459,0.00118463322970237,0.980359604482086,0.48635524693556
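    Putting the pieces together, here is a rough sketch of one way to get from the files to a report shaped like the one above. It is not code from this thread: it uses only core modules (File::Find in place of File::Find::Rule), assumes each file stores a single array of row arrays, and writes toy two-column files to a temporary directory so it can run on its own. The variable names are illustrative.

```perl
use strict;
use warnings;
use File::Find;
use File::Basename qw(basename);
use File::Temp qw(tempdir);
use Storable qw(nstore retrieve);

my $column = 1;                  # would be 11 for the real data
my $suffix = '.foo__bar_ar';

# Toy stand-ins for the real files (two rows, two columns each).
my $dir = tempdir( CLEANUP => 1 );
nstore [ [ 0, 0.81 ], [ 0, 0.55 ] ], "$dir/thing1$suffix";
nstore [ [ 0, 0.48 ], [ 1, 0.44 ] ], "$dir/thing2$suffix";

# Collect matching files, sorted so the column order is predictable.
my @files;
find(
    sub {
        return unless -f $_;
        push @files, $File::Find::name if /\Q$suffix\E$/;
    },
    $dir
);
@files = sort @files;

my @header = ('Number');
my @rows;
for my $i ( 0 .. $#files ) {
    push @header, basename( $files[$i], $suffix );
    my $data = retrieve( $files[$i] );
    for my $r ( 0 .. $#$data ) {
        # Empty string instead of undef when a row lacks this column
        $rows[$r][$i] = $data->[$r][$column] // '';
    }
}

# Row numbers are generated here, since the files do not store them.
my @lines = ( join ',', @header );
for my $r ( 0 .. $#rows ) {
    push @lines, join ',', $r + 1, map { $rows[$r][$_] // '' } 0 .. $#files;
}
print "$_\n" for @lines;
```

    Substituting an empty string for missing values keeps the columns aligned when files differ in length, at the cost of hiding the "uninitialized value" warnings that would otherwise flag unexpected data.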

Node Type: perlquestion [id://1198355]
Approved by sundialsvc4