PerlMonks  

Reading a huge input line in parts

by kroach (Pilgrim)
on May 04, 2015 at 13:20 UTC ( [id://1125570] )

kroach has asked for the wisdom of the Perl Monks concerning the following question:

I need to read a line of numbers separated by spaces and ending with a 0, like this:

1 2 3 4 7 20 12334 0

Each number only needs to be processed individually, so I don't have to keep them all in memory. The problem is that input lines can be very long, and reading them with <> and splitting consumes a lot of memory.

I tried setting a space as the input record separator, but it doesn't get the last number correctly.

use strict;
use warnings;

sub do_something { print '{', $_[0], "}\n" }

local $/ = ' ';
while (<>) {
    do_something($_);
}

I also tried to simulate C++ cin behaviour with the following function:

sub cin_read {
    my $inchar = getc;
    $inchar = getc while $inchar =~ /^\s$/;
    my $result = '';
    while ($inchar =~ /^\S$/) {
        $result .= $inchar;
        $inchar = getc;
    }
    return $result;
}

However, it's overly complicated and slow.

How else could I go about doing this?

EDIT: Updated sample input to include multi-digit numbers

Re: Reading a huge input line in parts (Handles multi-digit numbers!)
by BrowserUk (Patriarch) on May 04, 2015 at 13:54 UTC

    How about reading a block at a time and splitting that?

    sub genBufferedGetNum {
        my @buf = do{ local $/ = \4096; split ' ', scalar <>; };
        my $leftover = pop @buf;
        return sub {
            unless( @buf ) {
                unless( eof ) {
                    @buf = do{ local $/ = \4096; split ' ', $leftover . <> };
                    $leftover = pop @buf;
                }
                else {
                    die 'premature eof' if $leftover != 0;
                    return $leftover;   # last number
                }
            }
            return shift @buf;
        };
    }

    my $getNum = genBufferedGetNum();
    while( my $num = $getNum->() ) {
        ## do stuff
    }

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

      Could this not be simplified?

      sub genBufferedGetNum {
          return sub {
              @buf = do{ local $/ = \10; split ' ', <> };
              return @buf;
          };
      }

      my $getNum = genBufferedGetNum();
      while( my @part = $getNum->() ) {
          print @part, "\n";
      }

      I shortened the buffer size for testing purposes

      $: cat tb.dat
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0
      $: cat tb.dat | perl tb.pl
      01234
      56789
      01234
      56789
      01234
      56789
      0
      Dum Spiro Spero
        Could this not be simplified?

        And what happens when your buffer size splits a multi-digit number in two?

        I.e., run your code against this input:

        123 456 789 1

        And it produces: 123 456 78 9 1

        And doesn't notice that the last number is supposed to be 0.
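        A quick way to reproduce the boundary problem (a sketch of my own, not code from the thread), using 10-byte fixed-size records:

```perl
use strict;
use warnings;

# Sketch: fixed-size record reads can split a number at a record boundary.
my @tokens;
{
    local $/ = \10;    # read 10-byte records
    open my $fh, '<', \"123 456 789 1\n" or die $!;
    while (<$fh>) {
        push @tokens, split ' ';
    }
}
print "@tokens\n";     # prints "123 456 78 9 1" -- 789 was split in two
```

        The first record is exactly "123 456 78", so "789" arrives in two pieces and is silently misread as two numbers.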


Re: Reading a huge input line in parts
by hdb (Monsignor) on May 04, 2015 at 14:08 UTC

    What is wrong with the last number? Is it ignored? Then undo your setting of $/ and read again. Does it have a newline? Then use a regex to get rid of it. In any case I would think you should chomp your input to get rid of the blanks.

    Update: try to add s/\s//g before your call to do_something.

      The last number is not read unless eof is encountered. I can't undo the setting of $/ and read again because I have no way of detecting the last number. If I undo it midway I would get the rest of the line, which could be enormous.

        I cannot reproduce your problem but I was thinking of

        use strict;
        use warnings;

        sub do_something { print '{', $_[0], "}\n" }

        {
            local $/ = ' ';
            while (<>) {
                do_something($_);
            }
        }
        do_something(<>);
        It should not be too costly in terms of resources and performance to check whether you have a space at the beginning and at the end of each chunk of data before splitting it, and to reconstruct the boundary numbers accordingly, especially if the chunks you read are relatively large.
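        This boundary-reconstruction idea can be sketched roughly as follows (my own illustration, not code from the thread; `for_each_number` is a hypothetical helper name):

```perl
use strict;
use warnings;

# Read fixed-size chunks, split on whitespace, and carry any trailing
# partial token into the next chunk, since it may be the front half of
# a number that straddles the chunk boundary.
sub for_each_number {
    my ( $fh, $cb, $chunk_size ) = @_;
    $chunk_size //= 4096;
    my $carry = '';
    while ( read $fh, my $chunk, $chunk_size ) {
        $chunk = $carry . $chunk;
        # a trailing run of non-space characters may be incomplete
        $carry = ( $chunk =~ s/(\S+)\z// ) ? $1 : '';
        $cb->($_) for split ' ', $chunk;
    }
    $cb->($carry) if length $carry;    # flush the final token at EOF
}
```

        With a deliberately tiny 4-byte chunk size, feeding it "12 345 6789 0" invokes the callback with 12, 345, 6789 and 0 in turn, even though every chunk boundary falls mid-number.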

        Update: Ooops, this was meant as an answer to the following post: Re^2: Reading a huge input line in parts, sorry for inconvenience.

        Je suis Charlie.
Re: Reading a huge input line in parts
by flexvault (Monsignor) on May 04, 2015 at 20:08 UTC

    Hello kroach,

    I tried to compare two ways of doing this, and clearly letting Perl do the buffering wins out; but with lines the size of yours, you may want to look at the second subroutine, 'getnum_new', to see how to do partial reads from the file. I think both will work for your requirement (memory allowing). Reading a line at a time was about 4 to 6 times faster.

    use strict;
    use warnings;
    use Benchmark qw(:all);

    our ( $eof, $buffer );

    # Build a file for testing!
    open ( my $data, ">", "./slurp.txt" ) || die "$!";
    for my $lines ( 0..10 ) {
        my $unit = '';
        for my $nos ( 0..30 ) {
            $unit .= int( rand(3000) ) . " ";   # simulate keys
        }
        $unit .= $lines;    # make sure last doesn't have space.
        print $data "$unit\n";
    }
    close $data;

    my $sa = &getnum1;
    my $sb = &getnum2;
    # print "sa|$sa\n\nsb|$sb\n"; exit;
    if ( $sa ne $sb ) { print "Didn't Work!\n"; exit(1); }

    timethese ( -9 , {
        case1 => sub { &getnum1 },
        case2 => sub { &getnum2 },
    }, );

    sub getnum1 {
        my $s1 = '';
        open ( my $data, "<", "./slurp.txt" ) || die "$!";
        while ( my $line = <$data> ) {
            chomp( $line );
            my @ar = split( /\ /, $line );
            for ( 0..$#ar ) { $s1 .= "$ar[$_],"; }
        }
        close $data;
        return $s1;
    }

    sub getnum2 {
        my $s2 = '';
        $eof = 0;
        open ( my $inp, "<", "./slurp.txt" ) || die "$!";
        while ( 1 ) {
            $s2 .= getnum_new( \$inp ) . ',';
            if ( $eof ) { chop $s2; last; }
        }
        close $inp;
        return $s2;
    }

    sub getnum_new {
        my $file = shift;
        my $ret = '';
        our $eof;
        our $buffer;
        while ( 1 ) {
            if ( ! $buffer ) {
                my $size = read ( $$file, $buffer, 1024 );
                if ( $size == 0 ) { $eof = 1; return $ret; }
            }
            my $val = substr( $buffer, 0, 1, '' );
            if ( ( $val eq ' ' ) || ( $val eq "\n" ) ) { return $ret; }
            $ret .= $val;
        }
    }

    That's one long line :-)

    Regards...Ed

    "Well done is better than well said." - Benjamin Franklin

Re: Reading a huge input line in parts
by aaron_baugher (Curate) on May 04, 2015 at 13:39 UTC

    You could probably gain quite a bit of speed by reading in chunks of the line instead of one character at a time. That way you can use the normal split function. Something like this, but with as large a buffer value as your system can handle well:

    #!/usr/bin/env perl
    use 5.010;
    use strict;
    use warnings;

    my $l;                  # chunk of a line
    my $tiny_buffer = 8;    # tiny buffer for testing

    while( read DATA, $l, $tiny_buffer ){
        for (split ' ', $l){
            if( $_ eq '0' ){
                say 'Reached the end';
                exit;
            }
            say "; $_ ;";   # do stuff with the digit
        }
    }

    __DATA__
    1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 9 0

    Aaron B.
    Available for small or large Perl jobs and *nix system administration; see my home node.

      I thought about using read; however, since the numbers are not of constant length, a single number could be split between two chunks. That would introduce additional complexity to detect and merge such split numbers. I should have included such examples in the sample input from the start; I've updated the question.
        It should not be too costly in terms of resources and performance to check whether you have a space at the beginning and at the end of each chunk of data before splitting it, and to reconstruct the boundary numbers accordingly, especially if the chunks you read are relatively large.


        In that case, I'd check the end of the buffer for digits, and if there are any, trim them off and save them to prepend to the next buffer that you read in. But you don't want to do that if it's the final 0 in the file, so I have some if statements in here. There's probably a more elegant way to do some of this, but I think this will handle it correctly:

        #!/usr/bin/env perl
        use 5.010;
        use strict;
        use warnings;

        my $l;                  # chunk of a line
        my $tiny_buffer = 8;    # tiny buffer for testing
        my $leftover    = '';   # leftover, possibly partial number at end of buffer

        while ( read DATA, $l, $tiny_buffer ) {
            $l = $leftover . $l;
            say " ;$l;";
            $leftover = '';
            if( $l =~ s/(\d+)$//g ){
                if( $1 == 0 ){
                    $l .= '0';
                    $leftover = '';
                }
                else {
                    $leftover = $1;
                }
            }
            for (split ' ', $l) {
                if ( $_ == 0 ) {
                    say 'Reached a zero';
                }
                else {
                    say "; $_ ;";   # process a number
                }
            }
        }

        __DATA__
        1 2 3 4 5 6 7 8 99 1 2 3 4 5 6 7 8 9 0 1 22 3 4 5 6 7 8 99 1 2 3 4 5 6 77 8 9 0

        Aaron B.

Re: Reading a huge input line in parts
by hdb (Monsignor) on May 04, 2015 at 16:54 UTC

    You say you cannot afford to slurp and split; can you afford just to slurp? Then use a regex to extract the numbers one by one.

    my $all = <>;
    do_something($1) while $all =~ /(\d+)/g;
      I can't afford to slurp.
Re: Reading a huge input line in parts
by CountZero (Bishop) on May 04, 2015 at 21:25 UTC
    I get a different result when using a space as the delimiter.

    The zero at the end of the line gets recognized OK, but it is the first figure at the next line that gets skipped. So this small test program takes care of that problem:

    use Modern::Perl qw/2014/;

    {
        local $/ = ' ';
        while (<DATA>) {
            chomp;
            if (/^0\n*$/) {
                say "0 - End of line";
                next;
            }
            elsif (/^0\n(\d+)$/) {
                say "0 - End of line";
                say ">$1<";
                next;
            }
            else {
                say ">$_<";
            }
        }
    }

    __DATA__
    1 34 282716 7 20 333333 91 0
    23 68 82629172 112 8271718 102 1 0
    7 211 2 123 0
    99 666 0
    Output:
    >1<
    >34<
    >282716<
    >7<
    >20<
    >333333<
    >91<
    0 - End of line
    >23<
    >68<
    >82629172<
    >112<
    >8271718<
    >102<
    >1<
    0 - End of line
    >7<
    >211<
    >2<
    >123<
    0 - End of line
    >99<
    >666<
    0 - End of line
    As you can see, a single zero is recognized as an end-of-line marker, even when not physically at the end of a line.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
Re: Reading a huge input line in parts
by CountZero (Bishop) on May 04, 2015 at 18:33 UTC
    Just out of sheer curiosity: how long is very long?

    CountZero

      The lines in question can be up to 2 700 000 000 000 000 characters.
        Given quantities of that magnitude, and the relative simplicity of the task (breaking the stream into a sequence of numerics), I'd say it's worthwhile to write an application in C and compile it.

        It would be a short and easy program to write, esp. as a stdin-stdout filter: it's just a while loop that reads a nice size char buffer (say, a few MB at a time), and steps through the buffer one character at a time, accumulating consecutive digit characters, and outputting the string of digits every time you encounter a non-digit character. It wouldn't be more than 20 lines of C code, if that, and you'll save a lot of run-time.

        I suppose there must be more to your overall process than just splitting into digit strings; you could still do that extra part of your process in perl, but have the perl script read from the output of the C program. (But again, given the quantity of data, if the other stuff can be done in C without too much trouble, I'd do that.)

        UPDATE: Okay, I admit I was wrong about how many lines of C it would take. This C program is 30 lines (not counting the 4 blank lines added for legibility):

        (2nd update: added four more lines at the end to handle the case where the last char in the stream happens to be a digit.)
        Now that is long indeed!

        Assuming you can read and process a gigabyte of data per second, handling a line that long will take you more than a month.

        CountZero

Re: Reading a huge input line in parts
by pme (Monsignor) on May 04, 2015 at 13:53 UTC
    This one may solve the problem but the performance can be poor.
    use strict;
    use warnings;

    sub do_something { print '{', $_[0], "}\n" }

    local $/ = ' ';
    while (<DATA>) {
        s/\n/ /;                    # replace end-of-line with space
        my @a = split(' ');         # split line at spaces
        do_something($_) for @a;
    }

    __DATA__
    1 2 3 4 5 0
    6 7 8 9 10 0

      The performance on that may not be as bad as you think. I tried benchmarking my read-by-chunks solution against a change-the-input-record-separator-to-space solution. The latter makes the code much simpler, since the only special thing you have to watch for is the newlines. But it was also a bit quicker:

      $ perl 1125570a.pl
                    Rate read_buffer  change_irs
      read_buffer 1.15/s          --        -33%
      change_irs  1.72/s         50%          --
      $ cat 1125570a.pl
      #!/usr/bin/env perl
      use Modern::Perl;
      use Benchmark qw(:all);

      # setup long multiline strings with lines ending in 0
      my $line1 = join ' ', (map { int(rand()*100) } 1..1000000), 0;
      $line1 =~ s/ 0 / 0\n/g;
      my $line2 = $line1;

      cmpthese( 10, {
          'read_buffer' => \&read_buffer,
          'change_irs'  => \&change_irs,
      });

      sub read_buffer {
          my $l;                      # chunk of a line
          my $tiny_buffer = 1000000;  # buffer size of chunks
          my $leftover = '';          # leftover, possibly partial number at end of buffer
          open my $in, '<', \$line1;
          while ( read $in, $l, $tiny_buffer ) {
              $l = $leftover . $l;
              # say " ;$l;";
              $leftover = '';
              if ( $l =~ s/(\d+)$//g ) {
                  if ( $1 == 0 ) {
                      $l .= '0';
                      $leftover = '';
                  }
                  else {
                      $leftover = $1;
                  }
              }
              for (split ' ', $l) {
                  if ( $_ == 0 ) {
                      # say 'Reached a zero';
                  }
                  else {
                      # say "; $_ ;";   # process a number
                  }
              }
          }
      }

      sub change_irs {
          open my $in, '<', \$line2;
          local $/ = ' ';
          while ( <$in> ) {
              # say " $_";
              if ( $_ =~ /0\n(\d+)/ ) {
                  # say 'Reached a zero';
                  # say "; $1 ;";       # process a number
              }
              elsif ( $_ == 0 ) {
                  # say 'Reached a zero';
              }
              else {
                  # say "; $_ ;";       # process a number
              }
          }
      }

      The larger the buffer you can use on the read_buffer solution, the faster it should be, I think, but I don't know if it would ever catch up to the $/=' ' solution. Considering how much clearer that one's code is, I think it wins.

      EDIT: It also occurs to me that reading the file from disc might make a difference, if the RS=space solution causes more disc reads. I'd think OS buffering would prevent that, but I don't know for sure. You'd want to benchmark that with your actual situation.

      Aaron B.

      This is no different from my first approach. Replacing the newline here happens only after the data has been read, so it doesn't change anything: since $/ was changed, the newline is just like any other character. If there were a way to treat a newline in the input as a space, or to set $/ to "\s", that would help.
Re: Reading a huge input line in parts
by Anonymous Monk on May 05, 2015 at 03:54 UTC

    A couple of simple, if hackish, ways to handle this:

    use 5.014;
    $/ = \8192;

    while (<>) {                    # like so
        state $buf .= $_ . ' ' x eof;
        $buf =~ s{ \s* (\S+) \s }{ process($1), "" }xge;
    }

    while (<>) {                    # ..or so
        state $buf .= $_ . ' ' x eof;
        $buf = pop( my $tok = [split ' ', $buf, -1] );
        process(@$tok);
    }

    sub process { say for @_ }
    The ' 'x eof may be omitted if \n endings are guaranteed.
