Data Structure Question

bohrme has asked for the wisdom of the Perl Monks concerning the following question:

I think my brain disappeared temporarily.

I'm trying to figure out the best way to organize my data so that I can produce output in a more flexible way.

For example, let's say that I have 3 pieces of data: Employee Number, Form Number, Date and I want to be able to produce a summary based on either the Employee Number or Form Number. E.g., For employee 1, list all the form numbers and when they were signed and, likewise, list the employees that have signed a particular form with date signed.

I know that I could stick everything in a sequential array and just step through it three elements at a time until the offset reaches the size of the array, but looking at the code is making me ill. Of course, this assumes that every set of data has exactly 3 elements, which is the case here (seems like a poor coding practice to me though).

Unfortunately, I've never really dealt with complex data structures, so I'm a little at a loss as to how to structure this data: a hash of arrays, a hash of hashes, etc.

If that doesn't make sense here's some test data:

Employee  Form   Date
10001      10    20090101
10002      10    20080515
10003      10    20090323
10001      20    20090412
10002      20    20090711

I'm trying to make the output look something like this:

10001
      10    20090101
      20    20090412
10002
      10    20080515
      20    20090711
10003
      10    20090323

10
      10001    20090101
      10002    20080515
      10003    20090323
20
      10001    20090412
      10002    20090711

Hopefully, that makes more sense than my word-based explanation.

Thanks


Replies are listed 'Best First'.
Re: Data Structure Question by GrandFather (Saint) on Nov 25, 2009 at 21:12 UTC
Your data could be stored in many ways. The key to choosing a data structure tends to relate to how you most often want to access it and how easy it is to write and maintain reliable code to manage it.

You could for example use an array of arrays - one entry per record where each record is an array containing three elements. That's good if you want to process all the data every time you perform a query, but is a maintenance nightmare if you ever need to change the number of fields in a record. Even just coding in the first place can be nasty unless you use named constants to access the individual elements in a record.

You could use an array of hashes which has most of the advantages of the AOA above, but provides named access to the fields in the records making coding and maintenance easier at the cost of needing more memory for storing the data.

If you need to access the data by some key field then a HOA or HOH is appropriate.

If you need to access the data by more than one key or there is more data than you really want to fit into memory, then you should use a database. That can actually be a lot simpler than you might think.
Consider:

```perl
use strict;
use warnings;
use DBI;

unlink 'db.SQLite';

# Build the database
my $dbh = DBI->connect ("dbi:SQLite:dbname=db.SQLite","","");
$dbh->do ('CREATE TABLE employees (employee TEXT, form TEXT, date TEXT)');

my $sth = $dbh->prepare ('INSERT INTO employees (employee, form, date) VALUES (?, ?, ?)');
$sth->execute (do {chomp; split}) while <DATA>;

print "Access by employee\n";
$sth = $dbh->prepare (
    'SELECT * FROM employees ORDER BY employee, form, date'
);
$sth->execute ();

my $employee = '';
while (my $row = $sth->fetchrow_hashref ()) {
    if ($employee ne $row->{employee}) {
        $employee = $row->{employee};
        print "$employee\n";
    }
    printf " %-6s %s\n", @{$row}{qw(form date)};
}

print "Access by form\n";
$sth = $dbh->prepare (
    'SELECT * FROM employees ORDER BY form, employee, date'
);
$sth->execute ();

my $form = '';
while (my $row = $sth->fetchrow_hashref ()) {
    if ($form ne $row->{form}) {
        $form = $row->{form};
        print "$form\n";
    }
    printf " %-8s %s\n", @{$row}{qw(employee date)};
}

__DATA__
10001 10 20090101
10002 10 20080515
10003 10 20090323
10001 20 20090412
10002 20 20090711
```

Prints:

```
Access by employee
10001
 10     20090101
 20     20090412
10002
 10     20080515
 20     20090711
10003
 10     20090323
Access by form
10
 10001    20090101
 10002    20080515
 10003    20090323
20
 10001    20090412
 10002    20090711
```

True laziness is hard work
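For comparison, here is a minimal sketch of the two in-memory layouts described at the start of this reply: an array of arrays using named constants for the field positions, and an array of hashes. The constant names (EMPLOYEE, FORM, DATE) and the variable names are made up for illustration and do not come from the thread; the records are taken from the test data in the question.

```perl
use strict;
use warnings;

# Hypothetical names for the AoA field positions
use constant { EMPLOYEE => 0, FORM => 1, DATE => 2 };

my @lines = (
    '10001 10 20090101',
    '10002 10 20080515',
    '10001 20 20090412',
);

# Array of arrays: one array ref per record, fields addressed by position
my @aoa = map { [ split ' ' ] } @lines;

# Array of hashes: one hash ref per record, fields addressed by name
my @aoh = map {
    my %rec;
    @rec{qw(employee form date)} = split ' ';
    \%rec;
} @lines;

# Positional access is only readable because of the constants
printf "AoA: employee %s signed form %s on %s\n",
    $aoa[0][EMPLOYEE], $aoa[0][FORM], $aoa[0][DATE];

# Named access documents itself, and adding a field will not shift any indexes
printf "AoH: employee %s signed form %s on %s\n",
    @{ $aoh[1] }{qw(employee form date)};
```

Both layouts hold the same records; the difference is only in how a field is reached, which is where the maintenance cost mentioned above comes from.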
Re: Data Structure Question by kyle (Abbot) on Nov 25, 2009 at 21:52 UTC
One way to handle this would be to put your data into a database and query it out. That can be useful especially if you have a great many records, or if you have a data set that grows over time and you don't want to build it repeatedly.

I'd probably represent your data with an array of hashes.

```perl
my @records = (
    {
        employee => 10001,
        form     => 10,
        date     => 20090101,
    },
    {
        employee => 10002,
        form     => 10,
        date     => 20080515,
    },
    {
        employee => 10003,
        form     => 10,
        date     => 20090323,
    },
);
```

What's nice about this is that each hash can expand to have more fields as necessary. When you want to summarize by any given field, you can do this:

```perl
sub summarize_by {
    my $field_name = shift @_;
    my %out;
    for my $r ( @records ) {
        push @{ $out{$r->{$field_name}} }, $r;
    }
    return \%out;
}
```

What you'd get from that is a hash of arrays of hashes. Each key of the top-level hash is a unique value of the field you specified, and each value is a reference to an array of the records that have that value in that field.
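As a rough sketch of how the structure returned by summarize_by could produce both of the reports asked for in the question: the report helper below is hypothetical (it is not part of kyle's reply), and the record list is filled in with all five rows of the test data. It assumes the grouping keys are numeric and leaves the records within each group in input order.

```perl
use strict;
use warnings;

my @records = (
    { employee => 10001, form => 10, date => 20090101 },
    { employee => 10002, form => 10, date => 20080515 },
    { employee => 10003, form => 10, date => 20090323 },
    { employee => 10001, form => 20, date => 20090412 },
    { employee => 10002, form => 20, date => 20090711 },
);

# Same grouping logic as summarize_by above
sub summarize_by {
    my $field_name = shift @_;
    my %out;
    for my $r (@records) {
        push @{ $out{ $r->{$field_name} } }, $r;
    }
    return \%out;
}

# Hypothetical helper: group by one field, then list the other fields
# of every record in each group
sub report {
    my ($group_field, @detail_fields) = @_;
    my $groups = summarize_by($group_field);
    my $text   = '';
    for my $key (sort { $a <=> $b } keys %$groups) {
        $text .= "$key\n";
        for my $r (@{ $groups->{$key} }) {
            $text .= sprintf "      %-8s %s\n", @{$r}{@detail_fields};
        }
    }
    return $text;
}

print report('employee', qw(form date));
print "\n";
print report('form', qw(employee date));
```

If the input were not already ordered, the inner loop would need its own sort on the detail fields.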
Re: Data Structure Question by zwon (Abbot) on Nov 25, 2009 at 21:29 UTC
You should use a database as GrandFather suggested, but here's an example of how you could do it using an array of hashes:

```perl
use strict;
use warnings;
use 5.010;
use List::MoreUtils qw(uniq);

my @data;
while (<DATA>) {
    my %row;
    @row{qw(employee form date)} = split /\s+/;
    push @data, \%row;
}

my @employees = uniq sort map { $_->{employee} } @data;

for my $employee (@employees) {
    say $employee;
    my @signed =
      sort { $a->[0] <=> $b->[0] }
      map  { [ $_->{form}, $_->{date} ] }
      grep { $_->{employee} == $employee } @data;
    for (@signed) {
        printf "\t%s\t%s\n", @$_;
    }
}

__DATA__
10001 10 20090101
10002 10 20080515
10003 10 20090323
10001 20 20090412
10002 20 20090711
```

Update: kyle suggested a more elegant solution for the AoH.
Re: Data Structure Question by scorpio17 (Canon) on Nov 25, 2009 at 21:39 UTC
Here's my version (hash of hashes):

```perl
#!/usr/bin/perl
use strict;

my %data_by_employee;
my %data_by_form;

# read whitespace-delimited data fields into
# two hashes-of-hashes (actually, two hashes of hash refs)
while (my $line = <DATA>) {
    chomp $line;
    my ($employee, $form, $date) = split(' ', $line);
    next unless ($employee && $form && $date); # skips blank lines
    $data_by_employee{$employee}{$form} = $date;
    $data_by_form{$form}{$employee}     = $date;
}

print "By Employee:\n";
for my $employee (sort keys %data_by_employee) {
    print "$employee\n";
    # note: $data_by_employee{$employee} is a hash reference,
    # so we have to dereference it by using %{ }
    for my $form (sort keys %{ $data_by_employee{$employee} } ) {
        my $date = $data_by_employee{$employee}{$form};
        print "\t$form\t$date\n";
    }
}
print "\n";

print "By Form:\n";
for my $form (sort keys %data_by_form) {
    print "$form\n";
    # note: $data_by_form{$form} is a hash reference,
    # so we have to dereference it by using %{ }
    for my $employee (sort keys %{ $data_by_form{$form} } ) {
        my $date = $data_by_form{$form}{$employee};
        print "\t$employee\t$date\n";
    }
}
print "\n";

__DATA__
10001 10 20090101
10002 10 20080515
10003 10 20090323
10001 20 20090412
10002 20 20090711
```
Re: Data Structure Question by bichonfrise74 (Vicar) on Nov 25, 2009 at 22:42 UTC
Try this.

```perl
#!/usr/bin/perl
use strict;
use Data::Dumper;

my %record;
while (<DATA>) {
    next if ( /^Employee/ );
    my ($employee, $form, $date) = split( /\s+/ );
    $record{$employee}->{$form} = $date;
}

print Dumper \%record;

__DATA__
Employee Form Date
10001 10 20090101
10002 10 20080515
10003 10 20090323
10001 20 20090412
10002 20 20090711
```