PerlMonks  

get the line of ith occurrence of '>' without OPENING THE FILE

by Anonymous Monk
on Oct 05, 2002 at 18:04 UTC ( [id://203066]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello, Monks. I have posted this before, but it seems I did not make myself clear. The toughest part is: I CAN NOT USE OPEN ON THE FILE! IT IS TOO LARGE TO OPEN IN MY OS. So I can only resort to the shell.
I am stuck on the problem of getting the line containing the ith occurrence of a symbol, '>' in this case, from a file. I have 3 million '>' in my file. What I want to do is split the file into 8 pieces with an even number of '>'s in each piece. (The file contains other things and I do not know where these '>' are located.) The file is too large to open in memory. I read something about csplit, but it splits a file whenever it sees a '>'. So I am thinking maybe I want to get the line of the 10000th '>' and use it for csplit. Does anyone have a better idea, or know how I can get the line? Any ideas or hints will be highly appreciated. Ginger

Replies are listed 'Best First'.
(tye)Re: get the line of ith occurrence of '>' without OPENING THE FILE
by tye (Sage) on Oct 05, 2002 at 21:35 UTC

    I can understand everyone saying that what you are saying doesn't make sense.

    I just wanted to drop a quick note saying there actually are cases where you can't open a huge file from within Perl but where you can do:
        cat huge.file | perlscript
    It is a strange consequence of how "large file support" was retrofitted into some operating system(s). In order to open a really large file under these operating systems you have to use a special version of open() that Perl will only use if you've compiled Perl to include "large file support". The operating system prevents you from opening the file otherwise just in case you plan on using seek() on the file handle.

    In some ways I think this is a pretty silly decision, but then I haven't analyzed the situation much and yet I can still understand at least the temptation to do this so it is probably a reasonable design that just looks stupid until you understand the trade-offs better.

    In any case, such operating system(s) will have a program called 'cat' that knows how to open and read such huge files so you just need to use:
        open( FILE, "cat $filename |" ) or die...
    and make very sure that $filename doesn't contain "interesting" characters. I'd show the secure way to do that, but I don't think there is one simple answer that covers all of the variables (versions of Perl, versions of operating systems, etc.). I guess, since we are assuming 'cat', we can also assume /bin/sh and so

    open( FILE, qq<cat \Q$filename\E |> ) or die ...;   # \Q's backslashes do the quoting, so no shell quotes around the name
    is probably bullet-proof.
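    If the Perl in question is 5.8 or newer, the list form of a pipe open avoids /bin/sh altogether, so the quoting question does not come up. The following is only a sketch under that assumption; open_via_cat is a hypothetical helper name, not something from this thread, and it still assumes a 'cat' on the PATH that can read the large file.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): read a file through cat(1)
# without handing the name to /bin/sh, so no shell quoting is involved.
sub open_via_cat {
    my ($filename) = @_;
    # List-form pipe open: Perl runs cat directly and passes $filename
    # as a single argument. The '--' guards against names starting with '-'.
    open(my $fh, '-|', 'cat', '--', $filename)
        or die "cannot run cat on $filename: $!";
    return $fh;
}
```

    The returned handle reads like any other: my $fh = open_via_cat($filename); while (<$fh>) { ... }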

            - tye (echo "but my friends call me "'"'"$Tye"'"')

    P.S. Please, please, please! Register a username at the site. It will make it easier for us to help you when you have these problems figuring out why you think your question has disappeared and will make it easier for you to find your questions and their answers.

    Failing that, please, please, please, don't keep posting the same question over and over! You need to have a little patience. When you post a question, it doesn't appear on the main pages of the web site right away. That does not mean that your question has been lost and it doesn't even mean that there aren't people already reading your question and writing answers to it.

      If we're going as far as having access to "cat" and "shell", then it might be fun to tinker with "dd" as well. Then the OP can get to just the part of the file he wants:
      open(THIRDMB, "dd if=$filename bs=1024 skip=2048 count=1024|") or die;
      while (<THIRDMB>) { ... }   # Reads third MB of file only.
      close(THIRDMB);
      dd's useful in Unix for skipping around in large files at the shell level where seek(2) might be a bit more complicated.
      Hi I'm new to Perl so I don't know how to do it that way.
      Conceptually though, from a UNIX/LINUX generalised standpoint
      cat filename | wc -l
      Say this returns 80000
      head -40000 filename > file1sthalf
      tail -40000 filename > file2ndhalf
      head -20000 file1sthalf > filename1stquarter
      tail -20000 file1sthalf > filename2ndquarter
      ...iterate for file2ndhalf
      ...then cut the resulting 4 quarters into 8ths by same method
      This doesn't count and evenly-distribute your >'s but it will give you 8 smaller files to play with.
      This method is crude and not at all Perlesque but I just don't have the knowledge ... yet: I'd be interested to see the Perl code to do what I described.
      Then run some Perl code to find your line of the ith occurrence
      Rich
      In the Gulf of Mexico
      Noone can hear you scream
Re: get the line of ith occurrence of '>' without OPENING THE FILE
by dws (Chancellor) on Oct 05, 2002 at 18:26 UTC
    The toughest situation is: I CAN NOT USE OPEN ON THE FILE! IT IS TOO LARGE TO OPEN IN MY OS.

    You seem to misunderstand the difference between

    1. opening a file,
    2. reading it, and
    3. reading it all into memory.

    Opening a file does not imply that you're going to read it. Reading a file does not imply that you're going to read all of it, all at once, into memory.

    To simplify, imagine that you have a file folder of papers in a file cabinet, and that there are too many papers in the folder to fit onto your desk. Can you still open the drawer to the file cabinet? Sure you can. Can you reach into the file folder and pull out one sheet of paper at a time? Sure you can. It's the same when reading files. You can pull an entire file into memory at one time, but normally you'll read it a chunk at a time.
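    To make the "one sheet at a time" idea concrete, here is a minimal sketch that counts the '>' characters while holding only the current line in memory. The name count_gt is made up for illustration, and the sketch assumes open() itself succeeds on the file.

```perl
use strict;
use warnings;

# Count '>' characters while holding only one line in memory at a time.
# count_gt is a hypothetical helper name, not something from the thread.
sub count_gt {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "$path: $!";
    my $count = 0;
    while (my $line = <$fh>) {
        # tr with an empty replacement list counts matches and leaves $line alone.
        $count += ($line =~ tr/>//);
    }
    close $fh;
    return $count;
}
```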

Re: get the line of ith occurrence of '>' without OPENING THE FILE
by Zaxo (Archbishop) on Oct 05, 2002 at 18:21 UTC

    You cannot read a file without opening it.

    If you can open the file in the shell, the OS is not the problem. You probably need perl to be built for large file support.

    After Compline,
    Zaxo

Re: get the line of ith occurrence of '>' without OPENING THE FILE
by BrowserUk (Patriarch) on Oct 05, 2002 at 18:29 UTC

    I CAN NOT USE OPEN ON THE FILE! IT IS TOO LARGE TO OPEN IN MY OS

    With respect, that doesn't make sense. For the file to exist on your system, something must have written to it, and therefore that same something must have opened it.

    Likewise, for any shell utility to process it, even if only to split it, it has to open it.

    So I can only resort to shell.

    If you really need a shell solution, I would recommend asking on a site that specialises in shell - not Perlmonks.com?

    However, I think the biggest problem here is your understanding of what open does and means. You can open a file without needing to read the whole thing into memory. You could for instance, open the file, read it in smallish chunks, counting the '>' chars and recording the positions of each as you went through. You would then know how many there are and where they are.

    You then decide how to split the file, re-open it and open a split file, read the first chunk bit by bit and write it to the split file. Then close the first split file and open a second, continue reading and writing, opening and closing new files until you have the number of smaller files that you want.

    You'd have to make sure that you had sufficient disc space (at least double and possibly more) for this to work.
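    The two passes described above might be sketched roughly as follows. This is one possible reading, not a definitive implementation: it assumes that splitting at line boundaries is acceptable, it starts each new piece on a line that contains a '>', and the name split_by_gt and the ".partN" output names are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: split $path into at most $pieces files holding
# roughly equal numbers of '>' characters. Only one line is in memory
# at any time.
sub split_by_gt {
    my ($path, $pieces) = @_;

    # Pass 1: count the '>' characters.
    open(my $in, '<', $path) or die "$path: $!";
    my $total = 0;
    while (my $line = <$in>) { $total += ($line =~ tr/>//); }
    close $in;

    # Quota per piece, rounded up (and at least 1).
    my $quota = int(($total + $pieces - 1) / $pieces) || 1;

    # Pass 2: copy lines, starting a new piece once the current one has
    # met its quota and the next line contains a '>'.
    open($in, '<', $path) or die "$path: $!";
    my ($n, $seen, $out, @names) = (0, 0);
    while (my $line = <$in>) {
        if (!$out || ($seen >= $quota && $line =~ tr/>//)) {
            if ($out) { close $out or die "close failed: $!"; }
            my $name = sprintf '%s.part%d', $path, ++$n;
            open($out, '>', $name) or die "$name: $!";
            push @names, $name;
            $seen = 0;
        }
        print {$out} $line;
        $seen += ($line =~ tr/>//);
    }
    if ($out) { close $out or die "close failed: $!"; }
    close $in;
    return @names;    # names of the pieces, in order
}
```

    Calling split_by_gt("bigfile", 8) would leave bigfile.part1 and so on next to the original, which is why the disc space warning applies.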

    That would be a Perl solution. Maybe if you at least told us which OS you are using then someone might also suggest a more efficient method using a system utility, but you'd be better asking elsewhere for that kind of help.


    Cor! Like yer ring! ... HALO dammit! ... 'Ave it yer way! Hal-lo, Mister la-de-da. ... Like yer ring!
Re: get the line of ith occurrence of '>' without OPENING THE FILE
by Anonymous Monk on Oct 06, 2002 at 14:14 UTC
    If you're on some kind of UNIX you can probably do
    
    split -b100M < hugefile
    
    
    This will create files xaa, xab, xac etc, each with 100 MB of data from the file.

    "man split" is your friend.

    /Hacker
Re: get the line of ith occurrence of '>' without OPENING THE FILE
by Anonymous Monk on Oct 05, 2002 at 22:38 UTC
    I wrote a program with only one operation:
    open( FILE, "bigfile" ) or die...
    it dies, but Ok with a smaller file. That is why I say I can not open it
      Except in very rare circumstances, the size of the file has nothing to do with whether you can open it or not. If   open(FILE, "bigfile") or die... is failing for you, avail yourself of "$!", which usually holds a valuable error message.   open(FILE, "bigfile") or die "bigfile: $!"; will do the trick.

      Also, double-check that you have a file named "bigfile" (and not "bigfile.txt") in the same directory as the script. If you're on a Unix system, then case in the filename is significant (e.g., "bigfile" isn't the same as "Bigfile").

      And please, in the future post your questions once. There might be a slight delay before your question is "approved", at which point it will be visible when you look in Newest Nodes.

      Oh I see. I'd like to direct you to scottstef's How to ask questions the smart way. and jeffa's How (Not) To Ask A Question. What you've posted so far is very unhelpful. At a bare minimum you should include your program text (or more likely just the part that fails) and a description of the errors you get.

      __SIG__
      printf "You are here %08x\n", unpack "L!", unpack "P4", pack "L!", B::svref_2object(sub{})->OUTSIDE

      Ginger, (I am assuming this Anonymous Monk is Ginger again) you can display an informative message when an open fails as follows:
      open FILE, "bigfile" or die "Open failed because: $!\n";
      What if the reason open failed is not "file too big to be opened"? What if it is "this account does not have read permission for bigfile"?

      Revealing the mystery operating system and including the output of ls -l or dir, or whatever, might provide sufficient clues to us hapless volunteers to get you where you want to go. Something I had ruled out has often been the source of my problems; perhaps the same is true of yours. Bob

      If open fails mysteriously (and you have no idea what the error message might be - you should inspect $! but others pointed this out already), you might try the sysopen() and sysread() family.

      Here you can use a buffer with a certain length and read that number of bytes at a time into this buffer.
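      As a rough sketch of the buffer approach, the following reads fixed-size blocks with sysread() and reports the byte offset of the nth '>'. The helper name offset_of_nth_gt and the 64 KB default are assumptions made for the example, and whether sysopen() succeeds where open() failed depends on the platform and on how Perl was built.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDONLY);

# Hypothetical helper: return the byte offset (0-based) of the $n-th '>'
# in $path, reading $bufsize bytes at a time. Returns nothing if the file
# holds fewer than $n of them.
sub offset_of_nth_gt {
    my ($path, $n, $bufsize) = @_;
    $bufsize ||= 64 * 1024;
    sysopen(my $fh, $path, O_RDONLY) or die "$path: $!";
    my ($seen, $base, $buf) = (0, 0, '');
    while (1) {
        my $got = sysread($fh, $buf, $bufsize);
        die "read error on $path: $!" unless defined $got;
        last if $got == 0;                  # end of file
        my $pos = -1;
        while (($pos = index($buf, '>', $pos + 1)) >= 0) {
            if (++$seen == $n) {
                close $fh;
                return $base + $pos;
            }
        }
        $base += $got;                      # bytes consumed before the next block
    }
    close $fh;
    return;
}
```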

      OTOH, if the problem is that you cannot split on the default line endings, you might also try setting $/ to something more suitable (e.g. when the file doesn't contain newlines and you're reading a line at a time).

      {
          local $/ = '>';   # you were looking for this one, right?
          open(FILE, "bigfile.big") or die $!;
          while (<FILE>) {
              # do interesting stuff
              # e.g. merely count lines...
          }
          close FILE;
      }

      --
      Cheers, Joe

Re: get the line of ith occurrence of '>' without OPENING THE FILE
by Anonymous Monk on Oct 07, 2002 at 16:49 UTC
    As others have mentioned, you probably misunderstand the difference between opening a file and reading it into memory. If you simply want to read to the 10000th '>', this might work.

    my $big_line = "";
    # Set the input record separator to '>'
    $/ = '>';
    open(FH, $filename) or die "$filename: $!";
    for (my $i = 0; $i < 10000; $i++) {
        my $record = <FH>;
        last unless defined $record;   # the file has fewer than 10000 '>'
        $big_line .= $record;
    }
    $/ = "\n";
    close FH;
    Now $big_line will contain everything up to the 10000th '>'. I'm not sure that this will work if you've been having file troubles, but it's probably worth a try.
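    If only the line number is wanted, rather than the text itself, a variation that keeps nothing but a counter might look like the sketch below. The name line_of_nth_gt is hypothetical, and it assumes ordinary newline-terminated lines.

```perl
use strict;
use warnings;

# Hypothetical helper: return the 1-based number of the line on which the
# $n-th '>' appears, or nothing if the file has fewer than $n of them.
sub line_of_nth_gt {
    my ($path, $n) = @_;
    open(my $fh, '<', $path) or die "$path: $!";
    my $seen = 0;
    while (my $line = <$fh>) {
        $seen += ($line =~ tr/>//);     # count the '>' on this line
        if ($seen >= $n) {
            my $lineno = $.;            # $. is the current line number of $fh
            close $fh;
            return $lineno;
        }
    }
    close $fh;
    return;
}
```

    The number it returns could then be given to csplit as a line-number argument, which is what the original question was after.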
