Beefy Boxes and Bandwidth Generously Provided by pair Networks
go ahead... be a heretic
 
PerlMonks  

Perl solution for current batch file to extract specific column text

by oryan (Initiate)
on Aug 03, 2015 at 17:14 UTC ( [id://1137275]=perlquestion: print w/replies, xml ) Need Help??

oryan has asked for the wisdom of the Perl Monks concerning the following question:

I was in need of extracting a ton of data from an output file that is fairly systematic in nature so I wrote it as a batch file with a little help (I can hack my way through batch files, I'm definitely not an expert). The problem is there are 500,000 lines in the output file, several files to run the batch on and the batch takes several hours to run each time - not very efficient.

I've seen things done in perl that are very fast, but I don't have any experience and I really don't know how to get it done. I assume there is no direct translator for batch->>perl, as that wouldn't really make sense. Any help would be fantastic:

Problem:

1. Find the string INTERPOLATED HYDROGRAPH

2. Copy next data column to first column of new “Output.txt” (CAC40 in this case)

3. Go down 6 rows and copy 2nd column (1223.) to column 2 of new “Output.txt”

4. Go down 2 more rows (8 total) and copy 6th column (1456.) to columns 3 of new “Output.txt”

5. repeat several thousand times

---------------------------

SAMPLE OF INPUT FILE (text I need extracted in bold)

*** *** *** *** ***

INTERPOLATED HYDROGRAPH AT CAC40

blank line here

PEAK FLOW TIME MAXIMUM AVERAGE FLOW

6-HR 24-HR 72-HR 166.58-HR

+ (CFS) (HR)

(CFS)

+ 1223. 12.67 890. 588. 245. 106.

(INCHES) .154 .408 .509 .509

(AC-FT) 441. 1166. 1456. 1456.

CUMULATIVE AREA = 53.67 SQ MI

*** *** *** *** ***

blank line here

My super crazy slow and inefficient batch file solution (though it works) is:

-----------------------------------------------------------------

@echo off>output.txt & setlocal enabledelayedexpansion set input=input.txt rem this finds the text INTERPOLATED HYDROGRAPH in the input file rem sets k0 = to that line# rem sets x = to that line# + 6 and rem sets y = to that line# + 8 for /f "tokens=1,5 delims=[] " %%a in ('find /n "INTERPOLATED HYDROGRAPH"^<%input%') do ( set /a x=%%a+6 set k0=%%b echo line !x! set /a y=%%a+8 call :xx ) goto :eof :xx rem this line takes the line #s and extracts the following: rem 2nd column of line !x! and the 6th columns of line !y! rem find /n /v ""<%input%|findstr "^\[!x!\]" for /f "tokens=2 delims= " %%a in ('find /n /v ""^<%input%^|findstr "^\[!x!\] "') do set k=%%a for /f "tokens=6 delims= " %%a in ('find /n /v ""^<%input%^|findstr "^\[!y!\] "') do set k2=%%a rem this writes the values to a text file >>output.txt echo %k0% %k% %k2%
-----------------------------------------------------------------

Any help would be greatly appreciated, thanks.

  • Comment on Perl solution for current batch file to extract specific column text
  • Download Code

Replies are listed 'Best First'.
Re: Perl solution for current batch file to extract specific column text
by 1nickt (Canon) on Aug 03, 2015 at 17:44 UTC

    Hi oryan, welcome to the monastery. It's not a soup kitchen.

    Your problem would be trivial to solve in Perl, and you could implement a solution that split the job into chunks that could be worked on simultaneously by several processes. However, you are unlikely to be handed a working solution for free.

    The first question that springs to my mind is whether the sets of lines can overlap, or whether it is guaranteed that one set completes before the next matching set begins. If it is the former, the job will be more complicated.

    In any case, using very standard Perl techniques for opening files, looping through them, matching patterns and writing out the results, you could learn how to do this in short order.

    Here are some links to the documentation to get you started:
    Introduction to Perl
    open a file
    read ines from a file
    split a line into chunks
    pattern matching quick start
    write to a filehandle (and more on opening and closing files)
    Data::Dumper, an indispensable tool for development
    The Basic Debugging Checklist by toolic
    How to post an effective question on PerlMonks after you have some code written

    Good luck, hope this helps!

    The way forward always starts with a minimal test.
Re: Perl solution for current batch file to extract specific column text
by Laurent_R (Canon) on Aug 03, 2015 at 20:25 UTC
    Hi oryan,

    a number of people will tell you that this site is not a free code-writing service and that you should show some efforts at doing things yourself, and I agree.

    Having said that, even if you did not show any Perl code, I appreciate that you have shown code in another scripting language. I consider this as a real effort. And, although I do not have any benchmark, I am fairly sure that a Perl program will be much much faster than a DOS batch script (well, you don't say it, but I guess your script is DOS batch).

    So, you should really look at the links that were provided to you, but I am willing to help and I can give you the skeleton of a Perl program for what you want (quick draft, untested):

    use strict; use warnings; my $input = "input.txt"; open my $IN, "<", $input or die "cannot open $input $!"; my $output = "output.txt"; open my $OUT, ">", $output or die "cannot open $output $!"; while (my $line = <$IN>) { chomp $line; # remove newline character from end of line if ($line =~ /INTERPOLATED HYDROGRAPH AT (\w+)$/){ print $OUT $1; next for 1..5; # skip 5 lines my $val2 = (split / /, $line)[1]; get the second column print $OUT $val2; # a separator may be needed here next; # skip one line my $val3 = (split / /, $line)[-1]; # get the last column print $OUT "$val3\n"; # a separator may be needed here } }
    This is very rough and untested, but I hope it will get you going.

    Update: There are two errors in the code above (thanks to poj for pointing them out. Please see below (Re^3: Perl solution for current batch file to extract specific column text) a corrected version.

      This prints line 1 twice
      #!perl use strict; while (my $line = <DATA>){ if ($line =~ /line 1/){ print $line; next for 1..5; # skip 5 lines print $line; } } __DATA__ line 1 line 2 line 3 line 4 line 5 line 6 line 7

      Perhaps you meant

      #!perl use strict; while (my $line = <DATA>){ if ($line =~ /line 1/){ print $line; <DATA> for 1..5; $line = <DATA>; print $line; } }
      poj
        Yes, poj, you're right, there were two mistakes.

        This is the amended (and tested) version:

        use strict; use warnings; my $input = "input.txt"; open my $IN, "<", $input or die "cannot open $input $!"; my $output = "output.txt"; open my $OUT, ">", $output or die "cannot open $output $!"; while (my $line = <$IN>) { chomp $line; # remove newline character from end of line if ($line =~ /INTERPOLATED HYDROGRAPH AT (\w+)$/){ print $OUT $1; $line = <$IN> for 1..6; # skip 5 lines my $val2 = (split / /, $line)[1]; # get the second column print $OUT " $val2"; $line = <$IN>for 1..2; # skip one line chomp $line; my $val3 = (split / /, $line)[-1]; # get the last column print $OUT " $val3\n"; } }
        And this is the output:
        $ cat output.txt CAC40 1223. 1456.
        I should add that this is not very robust code, it will probably break with any irregularity in the input data. But we would need to know more about the data (having at least three or four samples of line groups, instead of only one) to be able to do something more robust.
Re: Perl solution for current batch file to extract specific column text
by marioroy (Prior) on Aug 04, 2015 at 13:36 UTC

    Update: See this post for a faster version.

    Thank you for sharing batch scripting on Windows. Do not fear Perl. The following book is helpful for beginners.

    Perl by Example 4th Edition.

    Below, am sharing a parallel version using MCE. However, am using MCE mainly for its chunking abilities. Each record is handled individually. The record separator is anchored at the start of the line with "\nINTERPOLATED HYDROGRAPH". MCE detects this and handles automatically.

    use strict; use warnings; use MCE::Loop; use MCE::Candy; my $input_file = shift || 'input.txt'; my $output_file = shift || 'output.txt'; my $match_string = "INTERPOLATED HYDROGRAPH"; open my $ofh, ">", $output_file or die "cannot open '$output_file' for writing: $!\n"; MCE::Loop::init { use_slurpio => 1, chunk_size => 1, max_workers => 4, gather => MCE::Candy::out_iter_fh($ofh), RS => "\n${match_string}", }; ## Below, each worker receives one record at a time ## Output order is preserved via MCE::Candy::out_iter_fh ## line 1 CAC40 # INTERPOLATED HYDROGRAPH AT CAC40 ## line 2 # blank line here ## line 3 # PEAK FLOW TIME MAXIMUM AVERAGE FLOW ## line 4 # 6-HR 24-HR 72-HR 166.58-HR ## line 5 # + (CFS) (HR) ## line 6 # (CFS) ## line 7 1223. # + 1223. 12.67 890. 588. 245. 106. ## line 8 # (INCHES) .154 .408 .509 .509 ## line 9 1456. # (AC-FT) 441. 1166. 1456. 1456. ## line 10 # CUMULATIVE AREA = 53.67 SQ MI mce_loop_f { my ( $mce, $chunk_ref, $chunk_id ) = @_; ## Skip initial record containing header lines including *** *** if ( $chunk_id == 1 && $$chunk_ref !~ /^${match_string}/ ) { ## Gathering here is necessary when preserving output order, ## to let the manager process know chunk_id 1 has completed. MCE->gather( $chunk_id, "" ); MCE->next; } ## Each record begins with INTERPOLATED HYDROGRAPH. my ( $k1, $k2, $k3 ) = ( "", "", "" ); open my $ifh, "<", $chunk_ref; while ( <$ifh> ) { $k1 = $1 and next if $. == 1 && /(\S+)\s*$/; $k2 = $1 and next if $. == 7 && /^\S+\s+(\S+)/; $k3 = $1 and last if $. == 9 && /(\S+)\s*$/; } close $ifh; ## Gather values. MCE->gather( $chunk_id, "$k1 $k2 $k3\r\n" ); } $input_file;

    Kind regards, Mario.

Re: Perl solution for current batch file to extract specific column text
by Anonymous Monk on Aug 03, 2015 at 23:13 UTC
    Another possibility is to use awk. Instead of "go down 2 more rows," look for "recognizable lines" (based on some regex pattern), and capture info from each one. Some line indicates that a particular "record" is now complete.
Re: Perl solution for current batch file to extract specific column text
by anonymized user 468275 (Curate) on Aug 04, 2015 at 12:24 UTC
    Some hints:

    - use the match // operator to parse a string

    - use 'split' to separate a line of input into columns

    - use 'fork' to perform tasks in parallel

    One world, one people

Re: Perl solution for current batch file to extract specific column text
by marioroy (Prior) on Aug 04, 2015 at 14:02 UTC

    Update: Increased chunk size to 400.

    Below, a parallel version with chunking enabled for the solution provided by monk Laurent_R. I ran against an input file containing 500k records.

    Serial: 2.574 seconds. Parallel: 0.895 seconds, which includes the time to fork and reap children under a Unix environment. Afterwards, the output contains 500k lines.

    The test machine is a 2.6 GHz Haswel Core i7 with RAM at 1600 MHz.

    Optionally, the script can receive the input_file and output_file as arguments.

    use strict; use warnings; use MCE::Loop; use MCE::Candy; my $input_file = shift || 'input.txt'; my $output_file = shift || 'output.txt'; open my $ofh, ">", $output_file or die "cannot open '$output_file' for writing: $!\n"; MCE::Loop::init { use_slurpio => 1, chunk_size => 400, max_workers => 4, gather => MCE::Candy::out_iter_fh($ofh), RS => "\nINTERPOLATED HYDROGRAPH", }; ## Each worker receives many records determined by chunk_size. ## Output order is preserved via MCE::Candy::out_iter_fh mce_loop_f { my ( $mce, $chunk_ref, $chunk_id ) = @_; open my $ifh, "<", $chunk_ref; my $output = ""; while ( my $line = <$ifh> ) { chomp $line; # remove newline character from end of line if ( $line =~ /INTERPOLATED HYDROGRAPH AT (\w+)$/ ) { $output .= $1; $line = <$ifh> for 1..6; # skip 5 lines my $val2 = (split / /, $line)[1]; # get the second column $output .= " $val2"; $line = <$ifh> for 1..2; # skip one line chomp $line; my $val3 = (split / /, $line)[-1]; # get the last column $output .= " $val3\r\n"; } } close $ifh; MCE->gather( $chunk_id, $output ); } $input_file; close $ofh;

    Kind regards, Mario.

      Quite interesting. Although I vaguely knew about the MCE module(s) before, I have never tried to use it/them, because I had the feeling that it would not bring much benefit when reading only one very large file (my input files are often gigabytes or even tens of GB large). It appears from your example that I was probably dead-wrong.

      I should probably give it a try. Although one of my problem, presently, is that I am currently stuck with very old versions of Perl (5.8) (because of the AIX and VMS versions I am working on), so that dependencies might be fairly difficult to resolve.

      But, having said that, we should move relatively soon (hopefully just a few months) to new hardware (blades) with much more recent versions of Linux, thus enabling much more recent versions of Perl. If not now, at least then, I might be able to take advantage of the MCE module(s).

      Thank you for the information.

        I do appreciate the help on this issue, it has been wonderful. I have been intrigued by Perl and what it can do. It is definitely quite different from anything that I have used before. I did have to dive into the solutions that were provided a little, because it was not working right out of the gate for me, and it came down to leading spaces in a line of the input in front of "Interpolated Hydrograph". For some reason I left that off on the original post.

        I modified that, added a header row, and removed the \r on the MCE->gather so there are no blank lines. The time difference is night and day:

        Batch (~275k lines) = ~3 hours

        Perl <1 second

        The final code that works like a charm that I am using:

        use strict; use warnings; use MCE::Loop; use MCE::Candy; my $input_file = shift || 'input.txt'; my $output_file = shift || 'output.txt'; my $match_string = " INTERPOLATED HYDROGRAPH A +T "; open my $ofh, ">", $output_file or die "cannot open '$output_file' for writing: $!\n"; print $ofh "HEC1_ID,Q100,V100\n"; MCE::Loop::init { use_slurpio => 1, chunk_size => 1, max_workers => 4, gather => MCE::Candy::out_iter_fh($ofh), RS => "\n${match_string}", }; ## Below, each worker receives one record at a time ## Output order is preserved via MCE::Candy::out_iter_fh ## line 1 CAC40 # INTERPOLATED HYDROGRAPH AT CAC40 ## line 2 # blank line here ## line 3 # PEAK FLOW TIME MAXIMUM AVERAGE FLOW ## line 4 # 6-HR 24-HR 72-HR 166.58-HR ## line 5 # + (CFS) (HR) ## line 6 # (CFS) ## line 7 1223. # + 1223. 12.67 890. 588. 245. 106. ## line 8 # (INCHES) .154 .408 .509 .509 ## line 9 1456. # (AC-FT) 441. 1166. 1456. 1456. ## line 10 # CUMULATIVE AREA = 53.67 SQ MI mce_loop_f { my ( $mce, $chunk_ref, $chunk_id ) = @_; ## Skip initial record containing header lines including *** *** if ( $chunk_id == 1 && $$chunk_ref !~ /^${match_string}/ ) { ## Gathering here is necessary when preserving output order, ## to let the manager process know chunk_id 1 has completed. MCE->gather( $chunk_id, "" ); MCE->next; } ## Each record begins with INTERPOLATED HYDROGRAPH. my ( $k1, $k2, $k3 ) = ( "", "", "" ); open my $ifh, "<", $chunk_ref; while ( <$ifh> ) { $k1 = $1 and next if $. == 1 && /(\S+)\s*$/; $k2 = $1 and next if $. == 7 && /^\S+\s+(\S+)/; $k3 = $1 and last if $. == 9 && /(\S+)\s*$/; } close $ifh; ## Gather values. MCE->gather( $chunk_id, "$k1,$k2,$k3\n" ); } $input_file;

        Thanks again. I hope to be learning more of this in the future.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1137275]
Approved by 1nickt
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others taking refuge in the Monastery: (4)
As of 2024-04-18 21:20 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found