PerlMonks  

Sorting and substituting a data file, one pass

by tsk1979 (Scribe)
on Jun 21, 2010 at 05:50 UTC ( [id://845669]=perlquestion )

tsk1979 has asked for the wisdom of the Perl Monks concerning the following question:

I can do this with two passes over the file, but I am looking for a way to do it in one pass, because the files are very big. This is what I intend to do. Imagine a text file with the following data:
    Some garbage
    More garbage
    data -start <some string> \
    -intermediate <some string> \
    -intermediate <some string> \
    .
    .
    -end <some string>
    Some garbage
    More garbage
    data -start <some string> \
    -intermediate <some string> \
    -intermediate <some string> \
    .
    .
    -end <some string>
    Some garbage
    More garbage
    data -start <some string> \
    -intermediate <some string> \
    -intermediate <some string> \
    .
    .
    -end <some string>
    .
    .
    .
I want the output file to contain
    data -start <string> -end <string>
    data -start <string> -end <string>
    data -start <string> -end <string>
    .
    .
    .
The catch? After removing the intermediates, there will be lots of duplicates, which I want to remove. In my current flow, I read in the file, write the records out to an array, and then unique the array. This two-pass process seems to be a waste of time. If I can get a one-pass algorithm, it will be great!

Replies are listed 'Best First'.
Re: Sorting and substituting a data file, one pass
by ikegami (Patriarch) on Jun 21, 2010 at 06:05 UTC
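    my %seen;
    while (<>) {
        # Join a continued line: replace the trailing backslash-newline (and
        # any whitespace before it) with a single space, so the spacing is the
        # same no matter how many continuation lines a record has.
        if (s/\s*\\\n\z/ /) {
            my $next = <>;
            if (defined($next)) {
                if ($next =~ /^\s*-end\s/) {
                    $_ .= $next;      # keep the -end line
                } elsif ($next =~ /\\\n\z/) {
                    $_ .= "\\\n";     # drop an intermediate line, stay continued
                } else {
                    $_ .= "\n";       # continuation ended without an -end line
                }
                redo;                 # re-run the loop body on the joined line
            }
        }
        # Print each "data ..." record only the first time it is seen.
        print if /^data\s/ && !$seen{$_}++;
    }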
      This should be the solution. But I am stumped at <>. Won't this stop for user input at every stage?
        Only if you neither pass filenames on the command line nor redirect a file to STDIN:
        $ perl myprogram.pl file1 file2 file3
        $ myprogram.pl < file4
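To make the point about `<>` concrete, here is a minimal sketch of how the diamond operator picks its input. The helper name `lines_via_diamond` is hypothetical and only serves the illustration; the behaviour it relies on (reading the files listed in `@ARGV`, and falling back to STDIN only when `@ARGV` is empty) is standard Perl.

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only: read every line of the named
# files through the diamond operator, just as the main loop would when the
# file names are given on the command line.
sub lines_via_diamond {
    local @ARGV = @_;          # <> takes its list of files from @ARGV
    my @lines;
    while (my $line = <>) {    # moves on to the next file automatically
        push @lines, $line;
    }
    return @lines;
}

# Usage: perl thisscript.pl file1 file2 file3
# Guarded so that nothing waits on STDIN when no file names are given.
print lines_via_diamond(@ARGV) if @ARGV;
```

With an empty `@ARGV`, the same `<>` would read STDIN instead, which is why the program only waits for keyboard input when it is given neither file names nor a redirect.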
Re: Sorting and substituting a data file, one pass
by CountZero (Bishop) on Jun 21, 2010 at 06:27 UTC
    Go through your data file line by line, assembling each data -start <string> -end <string> record as you go along. Once a record is assembled, store it as the key of a hash (the value can be anything you like, or left empty). Duplicate hash keys will disappear automatically, and you can then sort the keys.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
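CountZero's hash-key suggestion could be sketched roughly as follows. This is not code from the thread: the helper name `unique_records` is hypothetical, and the parsing assumes the layout shown in the question (a record starts on a line beginning with "data", continues while lines end in a backslash, and finishes on a line beginning with "-end").

```perl
use strict;
use warnings;

# Hypothetical helper: read records from a filehandle in a single pass and
# return the unique "data -start ... -end ..." records, sorted.
sub unique_records {
    my ($fh) = @_;
    my %seen;    # record text => 1; duplicate keys collapse automatically
    my $head;    # the "data -start ..." part of the record being assembled
    while (my $line = <$fh>) {
        chomp $line;
        # True if the line ended in a continuation backslash (now removed).
        my $continued = ($line =~ s/\s*\\\s*\z//);
        $line =~ s/^\s+|\s+\z//g;             # trim surrounding whitespace
        if ($line =~ /^data\s/) {
            $head = $line;                    # a new record starts here
        }
        elsif (defined $head && $line =~ /^-end\s/) {
            $seen{"$head $line"} = 1;         # record complete
        }
        # Any line that is not continued closes the current record.
        undef $head unless $continued;
    }
    return sort keys %seen;
}

# Usage: perl thisscript.pl datafile
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    print "$_\n" for unique_records($fh);
    close $fh;
}
```

The file is still read only once. The hash holds one key per unique record, so memory grows with the number of distinct records and not with the size of the file. Sorting the keys discards the input order, whereas printing each record the first time it is seen (as in ikegami's reply) keeps it.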
