PerlMonks  

Pulling out oldest entries from a text file

by Angharad (Pilgrim)
on Sep 04, 2007 at 14:12 UTC ( [id://636944]=perlquestion )

Angharad has asked for the wisdom of the Perl Monks concerning the following question:

I have a text file that looks like this.
    item group entry_date
    34 gr1 2003-03-02
    12 gr1 1990-03-14
    39 gr3 2002-04-11
    66 gr4 2006-03-16
    32 gr3 1998-02-13
    90 gr1 2004-06-15
    55 gr4 1999-06-15
    etc ...
What I need to do is pull out the oldest entry for each group (gr1 etc.) and print them all to a file. Is there some sort of date function or other easy way in which I can do this? Any suggestions appreciated.

Replies are listed 'Best First'.
Re: Pulling out oldest entries from a text file
by duff (Parson) on Sep 04, 2007 at 14:35 UTC

    Since your dates appear to be in a nice comparable format, you can just do ordinary string comparisons on them to find out which is oldest. So ... something like this should work fine:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my (%oldest_date, %oldest_entry);
    while (<DATA>) {
        my ($item,$group,$date) = split;
        if (!exists $oldest_date{$group} || $date lt $oldest_date{$group}) {
            $oldest_date{$group} = $date;
            $oldest_entry{$group} = $_;
        }
    }

    for my $g (keys %oldest_entry) {
        print $oldest_entry{$g};
    }

    __DATA__
    34 gr1 2003-03-02
    12 gr1 1990-03-14
    39 gr3 2002-04-11
    66 gr4 2006-03-16
    32 gr3 1998-02-13
    90 gr1 2004-06-15
    55 gr4 1999-06-15
Re: Pulling out oldest entries from a text file
by Anno (Deacon) on Sep 04, 2007 at 14:43 UTC
    You don't need a date function to determine the oldest date. Your dates are formatted so that string comparison works. Here is a way to extract the oldest entry for each group:
    use List::Util qw( maxstr);

    my %tb;
    while ( <DATA> ) {
        my ( undef, $group, $entry_date) = split;
        $tb{ $group}->{ $entry_date} = $_;
    }

    # note: maxstr selects the latest date in each group;
    # minstr (also in List::Util) selects the oldest
    print $_->{ maxstr keys %$_} for values %tb;

    __DATA__
    34 gr1 2003-03-02
    12 gr1 1990-03-14
    39 gr3 2002-04-11
    66 gr4 2006-03-16
    32 gr3 1998-02-13
    90 gr1 2004-06-15
    55 gr4 1999-06-15
    Update: Code cleaned up
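
    As posted, the code calls maxstr, which returns the greatest string and therefore the newest date. A minimal sketch of the same hash-of-hashes idea using minstr to pick the oldest entry per group follows. The sub name oldest_per_group and the date check are illustrative assumptions, not part of the original reply.

```perl
use strict;
use warnings;
use List::Util qw(minstr);

# Hypothetical helper: takes "item group date" lines and returns
# the oldest line of each group, ordered by group name.
sub oldest_per_group {
    my %tb;
    for my $line (@_) {
        my (undef, $group, $entry_date) = split ' ', $line;
        # skip the header, blank lines and anything without an ISO date
        next unless defined $entry_date && $entry_date =~ /^\d{4}-\d\d-\d\d$/;
        # keep the first line seen for a given group and date
        $tb{$group}{$entry_date} //= $line;
    }
    my @oldest;
    for my $group (sort keys %tb) {
        # minstr returns the smallest string, i.e. the oldest ISO date
        my $first_date = minstr keys %{ $tb{$group} };
        push @oldest, $tb{$group}{$first_date};
    }
    return @oldest;
}

print "$_\n" for oldest_per_group(
    'item group entry_date',
    '34 gr1 2003-03-02', '12 gr1 1990-03-14', '39 gr3 2002-04-11',
    '66 gr4 2006-03-16', '32 gr3 1998-02-13', '90 gr1 2004-06-15',
    '55 gr4 1999-06-15',
);
# prints:
# 12 gr1 1990-03-14
# 32 gr3 1998-02-13
# 55 gr4 1999-06-15
```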

    Anno

Re: Pulling out oldest entries from a text file
by misc (Friar) on Sep 04, 2007 at 14:40 UTC
    Update: Seems I'm too slow today...

    here is my quick hack..
    #!/usr/bin/perl -w
    use strict;

    my $entries;

    while ( my $line = <DATA> ){
        # skip lines that do not look like "item group date"
        next unless $line =~ /\d?\W*(gr\d)\W*(\d*-\d\d-\d\d)/;
        my $group = $1;
        my $date = $2;
        $date =~ s/-//g;    # 2003-03-02 becomes 20030302, comparable as a number

        # note: "<" keeps the latest date per group, not the oldest
        if ( ! defined( $entries->{$group}) || ( $entries->{$group}->{date} < $date ) ){
            $entries->{$group}->{date} = $date;
            $entries->{$group}->{entry} = $line;
        }
    }

    foreach (keys( %{$entries} )){
        print "entry: $entries->{$_}->{entry}";
    }

    __DATA__
    item group entry_date
    34 gr1 2003-03-02
    12 gr1 1990-03-14
    39 gr3 2002-04-11
    66 gr4 2006-03-16
    32 gr3 1998-02-13
    90 gr1 2004-06-15
    55 gr4 1999-06-15
    etc ...


    2nd Update: On the other hand, my code is so far the only one which will not get confused by malformed lines .. :-)
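
    To illustrate what guarding against malformed lines can look like, here is a small sketch with a stricter, anchored pattern. It assumes group names have the form "gr" plus digits, as in the sample data; parse_entry is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (item, group, date) for a well-formed line,
# or an empty list for anything else.
sub parse_entry {
    my ($line) = @_;
    # assumption: item is a number, group is "gr" plus digits, date is YYYY-MM-DD
    if ($line =~ /^\s*(\d+)\s+(gr\d+)\s+(\d{4}-\d{2}-\d{2})\s*$/) {
        return ($1, $2, $3);
    }
    return;
}

for my $line ('item group entry_date', '34 gr1 2003-03-02', 'etc ...', '12 gr1 90-3-14') {
    my @fields = parse_entry($line);
    print @fields ? "ok:      $line\n" : "skipped: $line\n";
}
```

    Only the second line is accepted here; the header, the trailing "etc ..." and the short date are skipped.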

    3rd Update:
    Seems I'm bored..
    I just did some benchmarking..
    I created some testdata with the code below:
    #!/usr/bin/perl -w
    open F, ">testdata";

    for ( 0..1000000 ){
        print F "$_ gr".int(rand(10))." ". (1990+int(rand(25))) . '-0'. (int(rand(10))) . '-' . (10 + int(rand(20)) )."\n";
    }

    close F;

    After this I did some measures:
    my code:
    time ./latestentries.pl
    entry: 15970 gr5 2014-09-29
    entry: 79485 gr8 2014-09-29
    entry: 135788 gr7 2014-09-29
    entry: 221 gr2 2014-09-29
    entry: 18669 gr9 2014-09-29
    entry: 46760 gr1 2014-09-29
    entry: 4960 gr3 2014-09-29
    entry: 9486 gr0 2014-09-29
    entry: 19710 gr4 2014-09-29
    entry: 56757 gr6 2014-09-29

    real 0m8.689s
    user 0m8.617s
    sys 0m0.060s
    -------------------
    anno's code:
    micha@laptop ~/prog/perl/test $ time perl test-anno.pl
    962757, gr0, 2014-09-29
    964472, gr1, 2014-09-29
    984704, gr2, 2014-09-29
    980128, gr3, 2014-09-29
    985851, gr4, 2014-09-29
    931318, gr5, 2014-09-29
    976880, gr6, 2014-09-29
    988367, gr7, 2014-09-29
    992654, gr8, 2014-09-29
    962175, gr9, 2014-09-29

    real 0m4.556s
    user 0m4.424s
    sys 0m0.036s
    -------------------
    and duff's entry:
    micha@laptop ~/prog/perl/test $ time perl test-duff.pl
    100154 gr5 1990-00-10
    5654 gr8 1990-00-10
    2318 gr7 1990-00-10
    9789 gr2 1990-00-10
    19151 gr9 1990-00-10
    91314 gr1 1990-00-10
    124846 gr3 1990-00-10
    14858 gr0 1990-00-10
    175946 gr4 1990-00-10
    95691 gr6 1990-00-10

    real 0m3.497s
    user 0m3.452s
    sys 0m0.036s

    The winner is duff.. :-)
    He's the only one who looks for the oldest entry, AND wrote the fastest code...
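
    Timing whole scripts with time also counts interpreter start-up and file reading. As a rough sketch of an alternative, the core Benchmark module can compare just the comparison strategies in memory. The data set and both sub names below are made up for illustration, and the numeric variant only follows the spirit of the hyphen-stripping approach above, not its exact code. No timings are shown because they vary by machine.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Small synthetic data set in the "item group date" shape (invented here).
my @lines = map {
    sprintf '%d gr%d %04d-%02d-%02d',
        $_, $_ % 10, 1990 + $_ % 25, 1 + $_ % 12, 1 + $_ % 28
} 1 .. 20_000;

# Oldest date per group using string comparison on the ISO date.
sub oldest_by_string {
    my %oldest;
    for (@_) {
        my (undef, $group, $date) = split;
        $oldest{$group} = $date
            if !exists $oldest{$group} || $date lt $oldest{$group};
    }
    return \%oldest;
}

# Oldest date per group using numeric comparison after removing hyphens.
sub oldest_by_number {
    my %oldest;
    for (@_) {
        my (undef, $group, $date) = split;
        (my $num = $date) =~ tr/-//d;
        if (!exists $oldest{$group} || $num < $oldest{$group}{num}) {
            $oldest{$group} = { num => $num, date => $date };
        }
    }
    return { map { $_ => $oldest{$_}{date} } keys %oldest };
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    string  => sub { oldest_by_string(@lines) },
    numeric => sub { oldest_by_number(@lines) },
});
```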
Re: Pulling out oldest entries from a text file
by moritz (Cardinal) on Sep 04, 2007 at 14:35 UTC
    You can just compare the dates as strings.

    The reading should be straightforward; you can use split to access the individual fields.

    Since you only want to write one item per group, I'd suggest a hash with the group as the key. Every time you read a line, check whether its date is older than the date currently stored in the hash for that group. If it is, replace the stored entry.
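
    A minimal sketch of that approach, including the "print them all to a file" part of the question, might look like the following. The helper name write_oldest is hypothetical, and the input and output file names are placeholders taken from the command line.

```perl
use strict;
use warnings;

# Hypothetical helper: reads "item group date" lines from $in and
# writes the oldest line of each group to $out.
sub write_oldest {
    my ($in, $out) = @_;
    my (%date, %line);
    while (my $line = <$in>) {
        my ($item, $group, $date) = split ' ', $line;
        # skip the header, blank lines and anything without an ISO date
        next unless defined $date && $date =~ /^\d{4}-\d\d-\d\d$/;
        # ISO dates compare correctly as plain strings
        if (!exists $date{$group} || $date lt $date{$group}) {
            $date{$group} = $date;
            $line{$group} = $line;
        }
    }
    print {$out} $line{$_} for sort keys %line;
    return;
}

# Placeholder usage: perl oldest.pl input.txt output.txt
if (@ARGV == 2) {
    my ($infile, $outfile) = @ARGV;
    open my $in,  '<', $infile  or die "Cannot read $infile: $!";
    open my $out, '>', $outfile or die "Cannot write $outfile: $!";
    write_oldest($in, $out);
    close $out or die "Cannot close $outfile: $!";
}
```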

Re: Pulling out oldest entries from a text file
by toolic (Bishop) on Sep 04, 2007 at 14:22 UTC
    I find Date::Simple to be quite useful for date comparisons.
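
    Date::Simple is a CPAN module and may need installing. As a substitute sketch that stays with core Perl, Time::Piece (a different module, not the one named above) can parse the dates into objects that compare with the usual numeric operators. The helper name is_older is made up for illustration.

```perl
use strict;
use warnings;
use Time::Piece;

# Hypothetical helper: true if the first ISO date is earlier than the second.
sub is_older {
    my ($first, $second) = @_;
    my $format = '%Y-%m-%d';
    # Time::Piece objects overload the comparison operators
    return Time::Piece->strptime($first, $format)
         < Time::Piece->strptime($second, $format);
}

print is_older('1990-03-14', '2003-03-02') ? "older\n" : "not older\n";
# prints "older"
```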
Re: Pulling out oldest entries from a text file
by sgt (Deacon) on Sep 05, 2007 at 08:38 UTC

    What happens when you get two identical dates?
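
    With a strict lt test, the first line seen for a given date wins and later duplicates are dropped. If ties should be kept instead, one possible sketch collects every line that shares a group's oldest date. The name oldest_with_ties is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash ref mapping each group to an array ref
# holding every line that carries that group's oldest date.
sub oldest_with_ties {
    my (%date, %lines);
    for my $line (@_) {
        my (undef, $group, $date) = split ' ', $line;
        next unless defined $date && $date =~ /^\d{4}-\d\d-\d\d$/;
        if (!exists $date{$group} || $date lt $date{$group}) {
            $date{$group}  = $date;
            $lines{$group} = [$line];           # strictly older: start over
        }
        elsif ($date eq $date{$group}) {
            push @{ $lines{$group} }, $line;    # same date: keep this one too
        }
    }
    return \%lines;
}

my $oldest = oldest_with_ties(
    '34 gr1 2003-03-02', '90 gr1 2004-06-15', '10 gr1 2003-03-02',
);
print "$_\n" for @{ $oldest->{gr1} };
# prints:
# 34 gr1 2003-03-02
# 10 gr1 2003-03-02
```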

    You don't say anything about the context, so assuming a Unix-like system, here are a few one-liners to get a feel for your data:

  • UN*X golf. It is always worth playing with your system sort as it is often optimized for speed.
  • a Minimal Perl approach.
    % steph@apexPDell2 (/home/stephan/t) %
    % cat data.txt # I added the last line
    item group entry_date
    34 gr1 2003-03-02
    12 gr1 1990-03-14
    39 gr3 2002-04-11
    66 gr4 2006-03-16
    32 gr3 1998-02-13
    90 gr1 2004-06-15
    55 gr4 1999-06-15
    10 gr1 2003-03-02
    % steph@apexPDell2 (/home/stephan/t) %
    % LC_ALL=C sort -k 3 data.txt | perl -lna -e 'print if $F[1] eq q{gr1} and $F[0] == 34'
    34 gr1 2003-03-02
    % steph@apexPDell2 (/home/stephan/t) %
    % sort -k 3 data.txt | grep gr1 | sort -n | head -n1
    10 gr1 2003-03-02

    The last one reads as: sort on the date, select group gr1, sort numerically on the first field, and keep the first line. In this particular case it is faster to grep first.

    cheers --stephan
