(tye)Re: get the line of ith occurrence of '>' without OPENING THE FILE
by tye (Sage) on Oct 05, 2002 at 21:35 UTC
|
I can understand everyone saying that what you are saying doesn't make sense.
I just wanted to drop a quick note saying there actually are cases where you can't open a huge file from within Perl but where you can do:
cat huge.file | perlscript
It is a strange consequence of how "large file support" was retrofitted into some operating system(s). In order to open a really large file under these operating systems you have to use a special version of open() that Perl will only use if you've compiled Perl to include "large file support". The operating system prevents you from opening the file otherwise just in case you plan on using seek() on the file handle.
In some ways I think this is a pretty silly decision. But I haven't analyzed the situation much, and I can still understand at least the temptation to do this, so it is probably a reasonable design that just looks stupid until you understand the trade-offs better.
In any case, such operating system(s) will have a program called 'cat' that knows how to open and read such huge files so you just need to use:
open( FILE, "cat $filename |" ) or die...
and make very sure that $filename doesn't contain "interesting" characters. I'd show the secure way to do that, but I don't think there is one simple answer that covers all of the variables (versions of Perl, versions of operating systems, etc.). I guess, since we are assuming 'cat', we can also assume /bin/sh and so
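open( FILE, qq<cat \Q$filename\E |> )
or die ...;
is probably bullet-proof. (Don't also wrap the \Q...\E part in double quotes: inside double quotes /bin/sh keeps most of the backslashes that \Q adds, so a name containing a space would no longer match.)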
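If the Perl at hand is 5.8 or later, the list form of a piped open skips the shell entirely, so no quoting is needed. Here is a minimal sketch under that assumption; open_via_cat and line_of_ith_gt are made-up helper names, and the `--` argument assumes a cat that follows the usual option conventions.

```perl
use strict;
use warnings;

# Hypothetical helper: start "cat" on $filename without going through a shell.
# Assumes Perl 5.8+ (list-form piped open) and a "cat" that accepts "--".
sub open_via_cat {
    my ($filename) = @_;
    open(my $fh, '-|', 'cat', '--', $filename)
        or die "cannot run cat on $filename: $!";
    return $fh;
}

# Hypothetical helper: return the line number on which the $i-th '>'
# appears in the stream, or nothing if there are fewer than $i of them.
sub line_of_ith_gt {
    my ($fh, $i) = @_;
    my $seen = 0;
    while (my $line = <$fh>) {
        $seen += ($line =~ tr/>//);    # count every '>' on this line
        return $. if $seen >= $i;      # $. is the current line number of $fh
    }
    return;
}
```

Usage would be along the lines of `my $fh = open_via_cat($filename); my $n = line_of_ith_gt($fh, 10000);`, reading one line at a time so the file never has to fit in memory.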
- tye (echo "but my friends call me "'"'"$Tye"'"')
P.S. Please, please, please! Register a username at the site. It will make it easier for us to help you when you have these problems figuring out why you think your question has disappeared and will make it easier for you to find your questions and their answers.
Failing that, please, please, please, don't keep posting the same question over and over! You need to have a little patience. When you post a question, it doesn't appear on the main pages of the web site right away. That does not mean that your question has been lost and it doesn't even mean that there aren't people already reading your question and writing answers to it.
|
If we're going as far as having access to "cat" and "shell", then it might be fun to tinker with "dd" as well. Then the OP can get to just the part of the file he wants:
open(THIRDMB, "dd if=$filename bs=1024 skip=2048 count=1024|") or die;
while(<THIRDMB>) { ... } # Reads third MB of file only.
close(THIRDMB);
dd's useful in Unix for skipping around in large files at the shell level where seek(2) might be a bit more complicated.
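The skip arithmetic generalises easily. A small sketch, assuming Perl 5.8+ for the list-form pipe (which also keeps $filename away from the shell); dd_args_for_megabyte and open_megabyte are hypothetical names, not anything from the posts above.

```perl
use strict;
use warnings;

# Hypothetical helper: dd arguments that select the $n-th megabyte
# (counting from 1) of $filename, using 1024-byte blocks.
sub dd_args_for_megabyte {
    my ($filename, $n) = @_;
    my $skip = 1024 * ($n - 1);    # blocks to skip before the wanted megabyte
    return ("if=$filename", 'bs=1024', "skip=$skip", 'count=1024');
}

# Hypothetical helper: open a read pipe on that megabyte.
# dd writes its block statistics to STDERR, which is left alone here.
sub open_megabyte {
    my ($filename, $n) = @_;
    open(my $fh, '-|', 'dd', dd_args_for_megabyte($filename, $n))
        or die "cannot run dd on $filename: $!";
    return $fh;
}
```

For the third megabyte this produces the same skip=2048 count=1024 as the one-liner above.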
|
Hi
I'm new to Perl so I don't know how to do it that way.
Conceptually though, from a UNIX/LINUX generalised standpoint
cat filename | wc -l
Say this returns 80000
head -40000 filename > file1sthalf
tail -40000 filename > file2ndhalf
head -20000 file1sthalf > filename1stquarter
tail -20000 file1sthalf > filename2ndquarter
...iterate for file2ndhalf
...then cut the resulting 4 quarters into 8ths by same method
This doesn't count and evenly-distribute your >'s but it will
give you 8 smaller files to play with.
This method is crude and not at all Perlesque but I just
don't have the knowledge ... yet: I'd be interested to see
the Perl code to do what I described.
Then run some Perl code to find your line of the i-th occurrence
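One way the head/tail approach could look in Perl is sketched below. It assumes the file can be opened from Perl at all; split_by_lines and the ".partN" output names are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: copy $filename into at most $parts files of roughly
# equal line counts, named "$filename.part1", "$filename.part2", and so on.
sub split_by_lines {
    my ($filename, $parts) = @_;

    # Pass 1: count the lines, one at a time (the "wc -l" step).
    open(my $in, '<', $filename) or die "$filename: $!";
    my $total = 0;
    while (defined(my $line = <$in>)) { $total++ }
    close $in;

    # Lines per piece, rounded up, and never zero.
    my $per_part = int(($total + $parts - 1) / $parts) || 1;

    # Pass 2: copy the lines, starting a new piece every $per_part lines.
    open($in, '<', $filename) or die "$filename: $!";
    my ($out, @names);
    my $written = 0;
    while (defined(my $line = <$in>)) {
        if ($written % $per_part == 0) {
            close $out if $out;
            push @names, "$filename.part" . (@names + 1);
            open($out, '>', $names[-1]) or die "$names[-1]: $!";
        }
        print {$out} $line;
        $written++;
    }
    close $out if $out;
    close $in;
    return @names;
}
```

Calling split_by_lines($filename, 8) on an 80000-line file would give eight pieces of 10000 lines each, in one go rather than by repeated halving.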
Rich
In the Gulf of Mexico
Noone can hear you scream
Re: get the line of ith occurrence of '>' without OPENING THE FILE
by dws (Chancellor) on Oct 05, 2002 at 18:26 UTC
|
The toughest situation is: I CAN NOT USE OPEN ON THE FILE! IT IS TOO LARGE TO OPEN IN MY OS.
You seem to misunderstand the difference between
- opening a file,
- reading it, and
- reading it all into memory.
Opening a file does not imply that you're going to read it. Reading a file does not imply that you're going to read all of it, all at once, into memory.
To simplify, imagine that you have a file folder of papers in a file cabinet, and that there are too many papers in the folder to fit onto your desk. Can you still open the drawer to the file cabinet? Sure you can. Can you reach into the file folder and pull out one sheet of paper at a time? Sure you can. It's the same when reading files. You can pull an entire file into memory at one time, but normally you'll read it a chunk at a time.
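To make "a chunk at a time" concrete, here is a minimal sketch using read() with a fixed-size buffer. The helper name count_bytes_in_chunks and the chunk size are arbitrary choices for illustration; only one chunk is ever held in memory.

```perl
use strict;
use warnings;

# Hypothetical helper: walk through $filename $chunk_size bytes at a time
# and return the total number of bytes seen.
sub count_bytes_in_chunks {
    my ($filename, $chunk_size) = @_;
    open(my $fh, '<', $filename) or die "$filename: $!";
    binmode $fh;                       # treat the file as raw bytes
    my ($total, $buffer) = (0, '');
    # read() returns the number of bytes read, 0 at end of file.
    while (my $got = read($fh, $buffer, $chunk_size)) {
        $total += $got;                # $buffer holds just this one chunk
    }
    close $fh;
    return $total;
}
```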
Re: get the line of ith occurrence of '>' without OPENING THE FILE
by Zaxo (Archbishop) on Oct 05, 2002 at 18:21 UTC
|
You cannot read a file without opening it.
If you can open the file in the shell, the OS is not the problem. You probably need perl to be built for large file support.
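A quick way to check is to ask the Config module how this perl was built. A small sketch; has_large_file_support is a made-up name, while uselargefiles and lseeksize are the Config keys that `perl -V` also reports.

```perl
use strict;
use warnings;
use Config;

# Hypothetical helper: 1 if this perl was built with large file support,
# 0 otherwise.
sub has_large_file_support {
    my $flag = $Config{uselargefiles};
    return (defined $flag && $flag eq 'define') ? 1 : 0;
}

# Hypothetical helper: a one-line summary suitable for pasting into a reply.
sub large_file_summary {
    my $lseeksize = defined $Config{lseeksize} ? $Config{lseeksize} : 'unknown';
    return "uselargefiles=" . has_large_file_support() . " lseeksize=$lseeksize";
}
```

From the shell, `perl -V:uselargefiles` prints the same setting.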
After Compline, Zaxo
Re: get the line of ith occurrence of '>' without OPENING THE FILE
by BrowserUk (Patriarch) on Oct 05, 2002 at 18:29 UTC
|
I CAN NOT USE OPEN ON THE FILE! IT IS TOO LARGE TO OPEN IN MY OS
With respect, that doesn't make sense. For the file to exist on your system, something must have written to it, and therefore that same something must have opened it.
Likewise, for any shell utility to process it, even if only to split it, it has to open it.
So I can only resort to shell.
If you really need a shell solution, I would recommend asking on a site that specialises in shell - not Perlmonks.com?
However, I think the biggest problem here is your understanding of what open does and means. You can open a file without needing to read the whole thing into memory. You could, for instance, open the file, read it in smallish chunks, counting the '>' chars and recording the positions of each as you went through. You would then know how many there are and where they are.
You then decide how to split the file, re-open it and open a split file, read the first chunk bit by bit and write it to the split file. Then close the first split file and open a second, continue reading and writing, opening and closing new files until you have the number of smaller files that you want.
You'd have to make sure that you had sufficient disc space (at least double and possibly more) for this to work.
That would be a Perl solution. Maybe if you at least told us which OS you are using then someone might also suggest a more efficient method using a system utility, but you'd be better asking elsewhere for that kind of help.
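The first pass described above might look something like this. It is only a sketch: gt_offsets is a hypothetical name, and for a file with a great many '>' characters you would probably keep only every N-th offset rather than all of them.

```perl
use strict;
use warnings;

# Hypothetical helper: read $filename in chunks of $chunk_size bytes and
# return the byte offset of every '>' found.
sub gt_offsets {
    my ($filename, $chunk_size) = @_;
    open(my $fh, '<', $filename) or die "$filename: $!";
    binmode $fh;
    my @offsets;
    my $base   = 0;      # offset of the current chunk within the file
    my $buffer = '';
    while (my $got = read($fh, $buffer, $chunk_size)) {
        my $pos = -1;
        # index() returns -1 once there are no more '>' in this chunk.
        while (($pos = index($buffer, '>', $pos + 1)) >= 0) {
            push @offsets, $base + $pos;
        }
        $base += $got;
    }
    close $fh;
    return @offsets;
}
```

The number of offsets is the count of '>' characters, and the offsets themselves say where the file can be cut in the second pass.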
Cor! Like yer ring! ... HALO dammit! ... 'Ave it yer way! Hal-lo, Mister la-de-da. ... Like yer ring!
Re: get the line of ith occurrence of '>' without OPENING THE FILE
by Anonymous Monk on Oct 06, 2002 at 14:14 UTC
|
If you're on some kind of UNIX you can probably do
split -b100M < hugefile
This will create files xaa, xab, xac etc, each with
100 MB of data from the file.
"man split" is your friend.
/Hacker
Re: get the line of ith occurrence of '>' without OPENING THE FILE
by Anonymous Monk on Oct 05, 2002 at 22:38 UTC
|
I wrote a program with only one operation:
open( FILE, "bigfile" ) or die...
It dies, but works OK with a smaller file. That is why I say I cannot open it.
|
Except in very rare circumstances, the size of the file has nothing to do with whether you can open it or not. If
open(FILE, "bigfile") or die...
is failing for you, avail yourself of "$!", which usually holds a valuable error message.
open(FILE, "bigfile") or die "bigfile: $!";
will do the trick.
Also, double-check that you have a file named "bigfile" (and not "bigfile.txt") in the same directory as the script. If you're on a Unix system, then case in the filename is significant (e.g., "bigfile" isn't the same as "Bigfile").
And please, in the future post your questions once. There might be a slight delay before your question is "approved", at which point it will be visible when you look in Newest Nodes.
|
Oh I see. I'd like to direct you to scottstef's How to ask questions the smart way and jeffa's How (Not) To Ask A Question. What you've posted so far is very unhelpful. At a bare minimum you should include your program text (or more likely just the part that fails) and a description of the errors you get.
__SIG__ printf "You are here %08x\n", unpack "L!", unpack "P4", pack "L!", B::svref_2object(sub{})->OUTSIDE
|
Ginger, (I am assuming this Anonymous Monk is Ginger again) you can display an informative message when an open fails as follows:
open FILE, "bigfile" or die "Open failed because: $!\n";
What if the reason open failed is not "file too big to be opened?" What if it is "this account does not have read permission for bigfile?"
Revealing the mystery operating system and including a copy of the ls -l or dir output, or whatever else is handy, might provide sufficient clues to us hapless volunteers to get you where you want to go. Something I had ruled out has been the source of my problems before; perhaps the same is true of yours.
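For what it's worth, the error can also be sorted into a few likely causes with the %! hash from the Errno module. This is a sketch only: explain_open_failure and the hint texts are invented, and EOVERFLOW is checked with exists because not every platform defines it.

```perl
use strict;
use warnings;
use Errno;

# Hypothetical helper: try to open $filename for reading and describe
# what happened, with a hint for the most common failures.
sub explain_open_failure {
    my ($filename) = @_;
    if (open(my $fh, '<', $filename)) {
        close $fh;
        return "opened $filename without trouble";
    }
    my $reason = "$!";    # copy the message before anything can reset $!
    my $hint =
        $!{ENOENT}                                ? 'check the name, its case, and the directory'
      : $!{EACCES}                                ? 'check read permission on the file'
      : (exists $!{EOVERFLOW} && $!{EOVERFLOW})   ? 'perl may lack large file support'
      :                                             'no specific hint';
    return "cannot open $filename: $reason ($hint)";
}
```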
Bob
|
If open fails mysteriously (and you have no idea what the error message might be - you should inspect $! but others pointed this out already), you might try the sysopen() and sysread() family.
Here you can use a buffer with a certain length and read that number of bytes at a time into this buffer.
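A minimal sketch of the sysopen()/sysread() route, counting '>' characters as it goes. The name count_gt_sysread and the buffer size are illustrative only, and whether this gets past a large-file limit still depends on how perl was built.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDONLY);

# Hypothetical helper: count the '>' characters in $filename, reading
# $buffer_size bytes at a time with unbuffered system calls.
sub count_gt_sysread {
    my ($filename, $buffer_size) = @_;
    sysopen(my $fh, $filename, O_RDONLY) or die "$filename: $!";
    my ($count, $buffer) = (0, '');
    while (1) {
        my $got = sysread($fh, $buffer, $buffer_size);
        die "read error on $filename: $!" unless defined $got;
        last if $got == 0;                 # end of file
        $count += ($buffer =~ tr/>//);     # '>' characters in this chunk
    }
    close $fh;
    return $count;
}
```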
OTOH, if the problem is that you cannot split on the default line endings, you might also try setting $/ to something more suitable (e.g. when the file doesn't contain newlines and you're reading a line at a time).
{
    local $/ = '>'; # you were looking for this one, right?
    open(FILE, "bigfile.big") or die $!;
    while (<FILE>) {
        # do interesting stuff
        # e.g. merely count lines...
    }
    close FILE;
}
--
Cheers, Joe
Re: get the line of ith occurrence of '>' without OPENING THE FILE
by Anonymous Monk on Oct 07, 2002 at 16:49 UTC
|
As others have mentioned, you probably misunderstand the difference between opening a file and reading it into memory. If you simply want to read to the 10000th '>', this might work.
my $big_line = "";
{
    # Set the input record separator to '>' for this block only
    local $/ = '>';
    open(FH, $filename) or die "$filename: $!";
    for (my $i = 0; $i < 10000; $i++) {
        my $record = <FH>;
        last unless defined $record; # stop early at end of file
        $big_line .= $record;
    }
    close FH;
}
Now $big_line will contain everything up to the 10000th '>'. I'm not sure that this will work if you've been having file troubles, but it's probably worth a try.