Re^6: Best way to store/access large dataset?

I am getting the following field in the dumper " 'ID' => $VAR1->[0]{'ID'}," added to the top of the output list for each attribute

$VAR1 = [
          {
            'ID' => [
                      '1',
                      1,
                      0,
                      1,
                      1,
                      0,
                      1,
                      1,
                      1,
                      1,
                      1,
                      0,
                      0,
                      1,
                      1,
                      1,
                      1
                    ],
            'Circle' => 4,
            'Triangle' => 0,
            'Rectangle' => 4,
            'Square' => 4
          },
          {
            'ID' => $VAR1->[0]{'ID'},
            'Circle' => 4,
            'Triangle' => 0,
            'Rectangle' => 0,
            'Square' => 4
          },
[download]

I am using the following script:

use strict ;
use warnings ;
use Data::Dumper ;

open my $dataIn1, "<", "Attributes_ID.txt" or die "NO ID FILE: $!";
open my $dataIn2, "<", "Attributes.txt" or die "NO ATTR FILE: $!";

my $data = () ;
my $attrs = () ;

sub getdata {
    my ( $fileName, $type ) = split /\t/, $_[1] ;
    push @{$data}, $type unless !defined $fileName ;
}

sub getattrs {
    my @attrs = split /\t/, $_[1] ;
    #shift @attrs ;
    push @{$attrs}, \@attrs unless !defined $attrs[0] ;
}

while( <$dataIn1> ) {
    chomp ;
    getdata( 0, $_ ) ;
}

while( <$dataIn2> ) {
    chomp ;
    getattrs( 0, $_ ) ;
}

my @result;
for( my $j = 0 ; $j < @{$attrs} ; ++$j ) {
    my %subres ;
    @subres{@{$data}} = ( 0 ) x @{$attrs->[0]} ;
    $subres{ID} = $attrs->[0] ;
    for( my $i = 1 ; $i < @{$attrs->[$j]} ; ++$i ) {
        if ( $attrs->[$j][$i] == 1 ) {
            ++$subres{ $data->[$i-1]}  ; 
        }
    } ;
    push @result, \%subres ;
}
print Dumper(\@result) ;
[download]

I'll continue looking at that to see if I can see why it isn't carrying the attribute ID forward. But I wanted to ask some more questions, and answer yours!

Within the code you have written, is it possible to move the attribute ID "outside" of the grouped data? As in:

 {5}{
            'Circle' => 0,
            'Triangle' => 0,
            'Rectangle' => 0,
            'Square' => 0
          },
 {6}         {
            'Circle' => 0,
            'Triangle' => 0,
            'Rectangle' => 0,
            'Square' => 0
          },
 {7}         {
            'Circle' => 4,
            'Triangle' => 4,
            'Rectangle' => 0,
            'Square' => 4
          },
[download]

The above may answer your question about the attribute ID's being defined. The numbers in the left column of the attribute demo dataset (1-30) are identifiers for that attribute. They could be names, or serial numbers...etc. But they are how I identify that attribute so I can look at it later. At the end of this, I actually need a list of the attributes that pass a True/False statement based on a series of percentages.

This is where the grouping comes in. Your script groups the datasets by their category by shifting the category in place of the file name. Which works great, as I am not so concerned with carrying the file name forward. So the categories in this case are "Square" "Circle" Rectangle" "Triangle." What I would need to do then is look at each attribute...So for attribute 7 in the code block above. I would have a series of True/False statements that asked for each category: "Does this attribute occur in Circle more than 50% of the time, and less than 10% of the time in Triangle, Rectangle, and Square?" Then I would have another statement asking the same question about that attribute for the next category. "Does this attribute occur in Square more than 50% of the time, and less than 10% of the time in Triangle, Rectangle, and Circle?" And so on and so forth for each unique category identified in the Attributes_ID file.

At the end of that, I would generate a list of attributes that scored "True" for each category.

Which brings me to my two questions.
1) Would it be better to "melt" this data? Create a four column data structure that consists of 1)File Name 2)Category Name, 3)Attribute ID, 4)Binary Value.

FILE            CATEGORY   ATTRIB   SCORE
1.file.ext    Square     1        1
2.file.ext    Triangle   1        0
3.file.ext    Circle     1        1
4.file.ext    Square     1        1
5.file.ext    Triangle   1        0
etc...
[download]

Or, 2) Would it be better to do this one line at a time with the True/False qualifiers built into the loop? As in, read in the first attribute row, with the categories, and evaluate the attribute for each category and store that True/False for that quartet in each category before moving to the next attribute? (It would be good to note that the categories change, but are defined in the attribute_ID file. SO it would be based on unique entries there.)

Down the rabbit hole!

Comment on Re^6: Best way to store/access large dataset? Select or Download Code

Replies are listed 'Best First'.
Re^7: Best way to store/access large dataset? by Veltro (Hermit) on Jun 27, 2018 at 10:26 UTC
Hi Speed_Freak, This line was not correct, sorry for that: `$subres{ID} = $attrs->[0] ;` and should have been `$subres{ID} = $attrs->[$j][0] ;`. But it doesn't matter so much because that line is changed again in the next code. I will give you some suggestions commented in the code how to require the results you described, but you'll have do some programming there yourself: use strict ; use warnings ; use Data::Dumper ; open my $dataIn1, "<", "Attributes_ID.txt" or die "NO ID FILE: $!" ; open my $dataIn2, "<", "Attributes.txt" or die "NO ATTR FILE: $!" ; my $data = () ; my $attrs = () ; sub getdata { my ( $fileName, $type ) = split /\t/, $_[0] ; push @{$data}, $type unless !defined $fileName ; } sub getattrs { my @attrs = split /\t/, $_[0] ; push @{$attrs}, \@attrs unless !defined $attrs[0] ; } sub calcPercentages { # INPUT: Hash reference # Determine the total amount of attributes # Walk through each category: Circle, Triangle, ... # Take the hit count divided by the total amount of attributes (mu +ltiplied by 100?) # For each category add something to the hash to store the percent +age # e.g. CircleChance, TriangleChance, .... # askQuestions could potentially be called here } sub askQuestions { # INPUT: Hash reference # my $h = ... # Question 1: Does this attribute occur in Circle more than 50% of + the time, and less than 10% of the time in Triangle # if ( $h->{ CircleChance } > 50 && $h->{ TriangleChance } < 10 ) +{ # Do something here. # E.g. Store another result $h # } } while( <$dataIn1> ) { chomp ; getdata( $_ ) ; } while( <$dataIn2> ) { chomp ; getattrs( $_ ) ; } my @result; for( my $j = 0 ; $j < @{$attrs} ; ++$j ) { my %subres ; my $id = $attrs->[$j][0] ; @subres{@{$data}} = ( 0 ) x @{$attrs->[0]} ; for( my $i = 1 ; $i < @{$attrs->[$j]} ; ++$i ) { if ( $attrs->[$j][$i] == 1 ) { ++$subres{ $data->[$i-1]} ; } } ; # You could potentially start calculating hit count percentages pe +r category here: calcPercentages( \%subres ) ; push @result, { $id => \%subres } ; } print Dumper(\@result) ; [download] The results now look like this, but more work is needed to calculate the hit count percentages (as indicated in the code above): `$VAR1 = [ { '1' => { 'Circle' => 4, 'Rectangle' => 4, 'Square' => 4, 'Triangle' => 0 } }, { '2' => { 'Circle' => 4, 'Square' => 4, 'Rectangle' => 0, 'Triangle' => 0 } }, ];` [download]	[reply] [d/l] [select]
Re^8: Best way to store/access large dataset? by Speed_Freak (Sexton) on Jun 27, 2018 at 14:10 UTC
Thank you! I'll work on this and see if I can make some progress. `sub calcPercentages { # INPUT: Hash reference # Determine the total amount of attributes # Walk through each category: Circle, Triangle, ... # Take the hit count divided by the total amount of attributes (mu +ltiplied by 100?)` [download] For clarification, "# Take the hit count divided by the total amount of attributes (multiplied by 100?)" would actually be divide by the total number of files in each category. So if there are four files in the square category (four instances of square in this case) then the percentage would be the hit count for the current attribute divided by the number of possible hits for that attribute in that category. So if attribute 1 has 3 total hits across the 4 occurrences of square, the percentage would be 3/4, or .75.	[reply] [d/l]
Re^9: Best way to store/access large dataset? by poj (Abbot) on Jun 28, 2018 at 14:32 UTC
If the result you want is along these lines Attribute : 1 Category : Circle Rectangle Square Triangle Sum : 4 4 4 0 Count : 4 4 4 4 Percent : 100.00% 100.00% 100.00% 0.00% Attribute : 2 Category : Circle Rectangle Square Triangle Sum : 4 0 4 0 Count : 4 4 4 4 Percent : 100.00% 0.00% 100.00% 0.00% Attribute : 3 Category : Circle Rectangle Square Triangle Sum : 0 4 0 4 Count : 4 4 4 4 Percent : 0.00% 100.00% 0.00% 100.00% then try this #!/usr/bin/perl use strict; use warnings; #use Data::Dump 'pp'; my $t0 = time; # start # load categ look up my $fileID = 'Attributes_ID.txt'; open IN,'<',$fileID or die "$!"; my %id2categ = (); my $count = 0; while (<IN>){ chomp; next unless /^\d/; # skip junk #1.file.ext Square my ($id,$cat) = split /\s+/,$_; $id2categ{$id} = $cat; ++$count; } close IN; print "$fileID : $count records loaded\n"; #pp \%file2categ; # read header to get fileid for each column my $fileA = 'Attributes.txt'; open IN,'<',$fileA or die "$!"; chomp (my $line1 = <IN>); my @fileid = split /\s+/,$line1; # convert fileid to category my (undef,@col2categ) = map{ $id2categ{$_} }@fileid; #pp \@col2categ; # count no of cols for each categ once my %count=(); $count{$_} +=1 for @col2categ; #pp \%count; # process each attribute in turn my $PAGESIZE = 100_000 ; # show progress open OUT,'>','report.txt' or die "$!"; my $total = 0; $count = 0; while (<IN>){ chomp; next unless /^\d/; # skip junk my ($attr,@score) = split /\s+/,$_; # aggregate by category my %sum=(); for my $col (0..$#score){ my $categ = $col2categ[$col]; $sum{$categ} += $score[$col]; } #pp \%result; # calc pcent; my %pcent; my @category = sort keys %count; for (@category){ $pcent{$_} = sprintf "%9.2f%%",100*$sum{$_}/$count{$_} unless $cou +nt{$_}==0; $sum{$_} = sprintf "%10d",$sum{$_}; $count{$_} = sprintf "%10d",$count{$_}; } # output print OUT "\nAttribute : $attr\n"; print OUT join "","Category : ",map{ sprintf "%10s",$_} @category,"\ +n"; print OUT join "","Sum : ",@sum{@category},"\n"; print OUT join "","Count : ",@count{@category},"\n"; print OUT join "","Percent : ",@pcent{@category},"\n"; # progress monitor if (++$count >= $PAGESIZE ){ $total += $count; $count = 0; print "Processed $total records\n"; }; } close IN; $total += $count; my $dur = time-$t0; printf "%s records time = %s seconds\n",$total,$dur; [download] poj	[reply] [d/l]
Re^10: Best way to store/access large dataset? by Speed_Freak (Sexton) on Jun 28, 2018 at 19:53 UTC
Re^10: Best way to store/access large dataset? by Speed_Freak (Sexton) on Aug 02, 2018 at 14:57 UTC
Re^9: Best way to store/access large dataset? by Speed_Freak (Sexton) on Jun 27, 2018 at 16:56 UTC
So I added the following block to count up the instances of each category `my %categories = (); while ( my $name = <$dataIn1>) { chomp($name); my ($file, $category) = split(/\t/, $name); if (exists($categories{$category})) { $categories{$category} += 1; } else { $categories{$category} = 1; } } foreach my $category (sort { $categories{$b} <=> $categories{$a} } key +s %categories) { printf "%s: %s\n", $category, $categories{$category}; }` [download] My variable naming is terrible because I picked that up from stack overflow and adjusted it to fit. But my question then becomes how do I pluck data from this data structure, and the %subres structure? Most of my limited playing has been with arrays, so I'm struggling when it comes to this. I know this isn't anywhere near correct, but could this be considered heading in the right direction? `something like: my $score = $subres->[2] Foreach $category in $categories {where %subres [0] = $category} ($sco +re = ($subres [1] / ($categories{$category}))` [download] If you can see through the non-existent formatting, my intent is to match each category from the categories list to it's corresponding category in the attribute results list, and then divide the result value by the category count value, and push that answer into the list.	[reply] [d/l] [select]


Perl Monk, Perl Meditation
	PerlMonks