PerlMonks  

Sort lines in a file

by blackdragoen (Novice)
on Feb 14, 2008 at 12:44 UTC ( [id://667919] : perlquestion )

blackdragoen has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I need to sort a file, in.txt, whose contents are shown below:
05,sometext5
02,sometext2
10,sometext10
03,sometext3
and I need the sorted output to look like this:
02,sometext2
03,sometext3
05,sometext5
10,sometext10
I used the tell and seek functions to do this. Performance is one of my main goals, because the code will have to handle large files of about 500 MB to 1 GB. I wrote the code below. It works fine, but it takes a long time on large files. Any suggestions, please?
Regards, blackdragoen
open(IN,"in.txt");
while(<IN>) {
    if($_=~m/^\n/) {
        $tell_val=tell();
    }
    if($_=~m/^\d+/) {
        $has{$&}=$tell_val if($tell_val ne '');
    }
}
open(OUT,">output.txt");
foreach(sort{$a<=>$b}(keys %has)) {
    seek(IN,$has{$_},0);
    $line=<IN>;
    print OUT $line;
    print $_;
}

Replies are listed 'Best First'.
Re: Sort lines in a file
by lidden (Curate) on Feb 14, 2008 at 14:03 UTC
    On a Unix-like system you can simply run sort in.txt > out.txt.
Re: Sort lines in a file
by Erez (Priest) on Feb 14, 2008 at 14:18 UTC

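    This is well known, but DON'T use $&. Once it appears anywhere in a program, Perl has to save a copy of the matched string for every regex match, which slows down every match in the program. Use capturing parentheses and $1 instead.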
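A minimal sketch of the capture-group alternative, assuming each record starts with its numeric key as in the sample data. The helper name leading_key is made up for illustration:

```perl
use strict;
use warnings;

# Capture the leading digits with parentheses instead of reading $&.
sub leading_key {
    my ($line) = @_;
    my ($key) = $line =~ /^(\d+)/;    # list context: the match returns the captured group
    return $key;                      # undef when the line has no leading digits
}

for my $line ("05,sometext5\n", "\n", "10,sometext10\n") {
    my $key = leading_key($line);
    print defined $key ? "key=$key\n" : "no key\n";
}
```

The list assignment leaves $key undefined when the match fails, so blank lines can be skipped without a separate test.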

    Next, do the sorting while you're processing the input:

    open(my $IN,'<','in.txt') || die "cannot open in.txt - $!\n";
    my @output;
    while(<$IN>) {
        next if m/^\n/;
        push (@output, $_);
        @output = sort @output;
    }

    Software speaks in tongues of man.
    Stop saying 'script'. Stop saying 'line-noise'.
    We have nothing to lose but our metaphors.

      Next, do the sorting while you're processing the input:
      Isn't it faster to do the sorting outside of the loop? What am I missing?
      while(<$IN>) {
          next if m/^\n/;
          push (@output, $_);
      }
      @output = sort @output;

      I don't know that you can say that in general. Some sort algorithms have worst case timings when sorting a pre-sorted (or nearly pre-sorted) list.

      That being said, however, Perl may take that into account.

      I do not, however, without benchmark data, buy that running sort() inside of a loop is faster than running sort outside of the loop.

      Does anyone have a benchmark framework that can test inside and outside sorts with input data that is sorted, reverse-sorted, nearly sorted, and random?

      --MidLifeXis
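One possible benchmark sketch for the question above, using the core Benchmark module. The data set, its size, the iteration count, and the sub names sort_inside and sort_outside are all assumptions chosen for illustration, and the "nearly sorted" case is just a sorted list with its first and last lines swapped:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use List::Util qw(shuffle);

# Sort after every push, as in the first snippet.
sub sort_inside {
    my ($lines) = @_;
    my @output;
    for my $line (@$lines) {
        push @output, $line;
        @output = sort @output;
    }
    return \@output;
}

# Collect everything first, then sort once.
sub sort_outside {
    my ($lines) = @_;
    my @output = @$lines;
    @output = sort @output;
    return \@output;
}

# Hypothetical test data: zero-padded keys, so string order equals numeric order.
my $n      = 300;
my @sorted = map { sprintf "%05d,sometext%d\n", $_, $_ } 1 .. $n;
my @nearly = @sorted;
@nearly[0, -1] = @nearly[-1, 0];    # swap the first and last line

my %input = (
    sorted   => \@sorted,
    reversed => [ reverse @sorted ],
    nearly   => \@nearly,
    random   => [ shuffle @sorted ],
);

for my $name (sort keys %input) {
    print "--- $name input ---\n";
    cmpthese(50, {
        inside  => sub { sort_inside($input{$name}) },
        outside => sub { sort_outside($input{$name}) },
    });
}
```

The fixed count of 50 keeps the run short, and Benchmark may warn that this is too few iterations for a reliable figure. Passing a negative count such as -3 makes each case run for at least that many CPU seconds instead, which gives steadier numbers.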

        But these approaches load the whole file into an array and use much more memory. The solution should not hold the entire file in memory, because it has to handle large files.
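A sketch of one way to keep the line text out of memory, along the lines of the tell/seek idea in the original question: keep only each record's key and byte offset, sort that index, then copy the records out in key order. It assumes one record per line starting with a numeric key and a comma; the sub name sort_by_offset_index and the temporary files are made up for the demonstration:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the records of $in_path to $out_path in ascending numeric key order.
sub sort_by_offset_index {
    my ($in_path, $out_path) = @_;
    open(my $in, '<', $in_path) or die "cannot open $in_path - $!\n";

    # Pass 1: remember only [key, byte offset] per record, not the line text.
    my @index;
    my $offset = tell($in);
    while (my $line = <$in>) {
        push @index, [ $1, $offset ] if $line =~ /^(\d+),/;
        $offset = tell($in);
    }

    # Pass 2: visit the records in key order and copy each one out.
    open(my $out, '>', $out_path) or die "cannot open $out_path - $!\n";
    for my $entry (sort { $a->[0] <=> $b->[0] } @index) {
        seek($in, $entry->[1], 0) or die "seek failed - $!\n";
        my $line = <$in>;
        $line .= "\n" unless $line =~ /\n\z/;    # last line may lack a newline
        print {$out} $line;
    }
    close($out) or die "cannot close $out_path - $!\n";
    close($in);
    return;
}

# Demo with the sample data from the question (assumed one record per line).
my ($tmp_in, $in_path) = tempfile(UNLINK => 1);
print {$tmp_in} "05,sometext5\n02,sometext2\n10,sometext10\n03,sometext3\n";
close($tmp_in) or die "cannot close $in_path - $!\n";
my (undef, $out_path) = tempfile(UNLINK => 1);

sort_by_offset_index($in_path, $out_path);

open(my $check, '<', $out_path) or die "cannot open $out_path - $!\n";
my $result = do { local $/; <$check> };
close($check);
print $result;
```

The index still costs memory for every record, so this only helps when the lines are long compared with their keys. For a file made of very many short lines, the system sort command or an external merge sort is probably a better fit, and the random seeks in the second pass can be slow on large files.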