Re: Sorting A File Of Large Records

by dingus (Friar)
on Dec 10, 2002 at 20:01 UTC ( [id://218888]=note )


in reply to Sorting A File Of Large Records

This could be a good place to use An APL trick for the Schwartzian Transform, as that is almost exactly the reason I thought it up in the first place...
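A minimal sketch of index sorting in the spirit of APL's "grade up". This is an assumption about the general shape of that trick, not a copy of the linked node; the helper name grade_up and the sample records are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the indices of @$keys in sorted key order,
# much like APL's "grade up" primitive.
sub grade_up {
    my ($keys) = @_;
    my @order = sort { $keys->[$a] cmp $keys->[$b] } 0 .. $#$keys;
    return @order;
}

# Made-up sample records, each carrying a five-digit zip code.
my @records = ("rec A Zip:90210", "rec B Zip:10001", "rec C Zip:60601");

# Extract each sort key exactly once, then slice the records by sorted index.
my @keys   = map { /Zip:(\d{5})/ ? $1 : '' } @records;
my @sorted = @records[ grade_up(\@keys) ];

print "$_\n" for @sorted;
```

The keys are computed once per record, as in a Schwartzian Transform, but the sort moves only small integers around instead of the records themselves.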

There are some subtleties, mostly covered above, that help. The first trick (suggested by dmitri) is to set $/ to '-------'.$/ so that each read returns a whole record.
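A minimal sketch of that first trick, reading from an in-memory string so that it runs on its own. The helper name read_records and the sample data are hypothetical, and the separator is assumed to be a line of seven dashes.

```perl
use strict;
use warnings;

# Hypothetical helper: return every record from an open filehandle,
# where each record ends with a line of dashes.
sub read_records {
    my ($fh) = @_;
    local $/ = '-------' . $/;   # dashes plus the usual newline
    my @records = <$fh>;         # each element is one whole record
    chomp @records;              # chomp strips the current $/ from each
    return @records;
}

# Made-up sample data standing in for the real file.
my $data = "Name: A\nZip:90210\n-------\nName: B\nZip:10001\n-------\n";
open my $fh, '<', \$data or die "Can't open in-memory file: $!";
my @records = read_records($fh);
close $fh;

print scalar(@records), " records read\n";
```

Using local keeps the change to $/ confined to the helper, so code elsewhere still reads line by line.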

If the file is far too big to fit into memory, the second trick is to sort on the zip code and the location of each record within the file rather than on the records themselves: use tell() to note where each record starts.
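A minimal sketch of building such an index, assuming each record contains a five-digit code after "Zip:" and ends with a line of seven dashes. The helper name build_index and the sample data are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return [zip, offset] pairs sorted by zip code,
# then by position in the file. Only the pairs are held in memory.
sub build_index {
    my ($fh) = @_;
    local $/ = '-------' . $/;
    my @index;
    my $offset = tell($fh);              # position before the first read
    while (my $record = <$fh>) {
        if ($record =~ /Zip:(\d{5})/) {
            push @index, [ $1, $offset ];
        }
        $offset = tell($fh);             # start of the next record
    }
    my @sorted = sort { $a->[0] cmp $b->[0] or $a->[1] <=> $b->[1] } @index;
    return @sorted;
}

# Made-up sample data standing in for the real file.
my $data = "Zip:90210\n-------\nZip:10001\n-------\nZip:90210\n-------\n";
open my $fh, '<', \$data or die "Can't open in-memory file: $!";
my @index = build_index($fh);
close $fh;

print "$_->[0] at byte $_->[1]\n" for @index;
```

Each offset can later be handed to seek to fetch the record it belongs to, so the records never need to be in memory together.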

Other tricks may depend on whether the file is local or not (whether you can afford to read it multiple times), whether you want to sort on a secondary key as well, and so on.
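If a secondary key is wanted, one possible approach is to chain comparisons so the second key only breaks ties on the first. This is a sketch with made-up index entries; the "name" field and the helper by_zip_then_name are hypothetical and do not come from the original question.

```perl
use strict;
use warnings;

# Hypothetical index entries: [zip, secondary key, file offset].
my @index = (
    [ '90210', 'Smith', 0   ],
    [ '10001', 'Jones', 120 ],
    [ '90210', 'Adams', 240 ],
);

# Compare zip codes first; when they are equal, cmp returns 0 and
# "||" falls through to the comparison on the secondary key.
sub by_zip_then_name {
    my ($x, $y) = @_;
    return $x->[0] cmp $y->[0] || $x->[1] cmp $y->[1];
}

my @sorted = sort { by_zip_then_name($a, $b) } @index;

print join(' ', @$_), "\n" for @sorted;
```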

Actually, thinking about it, assuming that you have sufficient memory and no secondary key you wish to sort on, my preferred solution would be a two-phase sort. Phase one is an insertion sort into a hash of zip codes, storing the file offset of each record under its zip code. Then you read the file again and write out the sorted version.

local $/ = '----------'.$/;
my %zips;
open (FILE,"<filename") or die "error $!";
my $teller = tell(FILE); # NB need position before file read!
while (<FILE>) {
    die "no zip code found in record at $teller\n" unless (/Zip:(\d{5})/s);
    push @{$zips{$1}}, $teller;
    $teller = tell(FILE); # FILE is optional here but a good idea!
}
open (SORTED,">newfile") or die "Can't open newfile: $!";
for (sort keys %zips) {
    for (@{$zips{$_}}) {
        seek FILE, $_, 0;
        print SORTED <FILE>;
    }
}
close FILE;
close SORTED;

Disclaimer: code untested, may contain horrible bugs

Dingus


Enter any 47-digit prime number to continue.

Replies are listed 'Best First'.
Re: Re: Sorting A File Of Large Records
by vek (Prior) on Dec 10, 2002 at 23:18 UTC
    Nicely done dingus. I'd offer one tweak. I assume that the OP wanted to print just the current record to the "sorted" file and not the whole file from the seek offset:
    for my $zipCode (sort { $a <=> $b } (keys(%zips))) {
        for my $offset (@{$zips{$zipCode}}) {
            seek FILE, $offset, 0;
            while (<FILE>) {
                last unless (/Zip:$zipCode/);
                print SORTED $_;
            }
        }
    }
    -- vek --
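For reference, `print SORTED <FILE>` evaluates <FILE> in list context, so it returns every record from the seek offset to the end of the file. A minimal sketch of reading exactly one record by calling the read in scalar context instead, assuming $/ is set to the record separator; the helper name copy_record and the sample data are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: copy exactly one record, starting at $offset,
# from filehandle $in to filehandle $out.
sub copy_record {
    my ($in, $out, $offset) = @_;
    local $/ = '-------' . $/;
    seek($in, $offset, 0) or die "Can't seek to $offset: $!";
    my $record = <$in>;                     # scalar context: one record only
    print {$out} $record if defined $record;
    return;
}

# Made-up sample data: two 18-byte records, at offsets 0 and 18.
my $data   = "Zip:90210\n-------\nZip:10001\n-------\n";
my $sorted = '';
open my $in,  '<', \$data   or die "Can't open input: $!";
open my $out, '>', \$sorted or die "Can't open output: $!";

copy_record($in, $out, 18);   # the 10001 record first
copy_record($in, $out, 0);    # then the 90210 record

close $in;
close $out;
print $sorted;
```

Because each call reads a single record, two adjacent records with the same zip code are each written once.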
