PerlMonks  

Script to create huge sample files

by paragkalra (Scribe)
on Jan 02, 2010 at 18:13 UTC ( [id://815331] )

paragkalra has asked for the wisdom of the Perl Monks concerning the following question:

Hello All,

A major part of my Perl scripting involves processing text files, and most of the time I need large text files (3 MB+) to perform benchmarking tests.

So I am planning to write a Perl script that will create a huge text file from the sample file it receives as its first input parameter. I have the following algorithm in mind:

1. Provide 2 input parameters to the Perl script - (i) the sample file, (ii) the size of the new file. E.g., to create a new file of size 3 MB: perl Create_Huge_File.pl Sample.txt 3

2. Read the input file and store the contents into an array.

3. Create a new file.

4. Dump the contents of the above array into the new file.

5. Check the length of the new file. If it is less than the size given by the second input parameter, repeat step 4; otherwise go to step 6.

6. Close the new file.
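The steps above could be sketched like this (a rough outline, not a finished script; the output file name and sub name are illustrative). Tracking the number of bytes written ourselves, rather than re-checking the file on disk, also answers question (a) below: the file grows by exactly one copy of the sample per pass.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Steps 2-6, wrapped in a sub: read the sample once, then write it
# repeatedly until the requested size is reached.
sub create_huge_file {
    my ( $sample, $target_bytes ) = @_;

    # Step 2: read the sample file into memory.
    open my $in, '<', $sample or die "can't open $sample: $!";
    my $chunk = do { local $/; <$in> };    # slurp the whole file
    close $in;
    die "$sample is empty\n" unless length $chunk;

    # Step 3: create the new file (name is illustrative).
    my $out_name = "huge_$sample";
    open my $out, '>', $out_name or die "can't create $out_name: $!";

    # Steps 4-5: dump the chunk until the target size is reached.
    # Counting bytes ourselves guarantees progress on every pass.
    my $written = 0;
    while ( $written < $target_bytes ) {
        print {$out} $chunk;
        $written += length $chunk;
    }

    # Step 6: close the new file.
    close $out;
    return $out_name;
}

# Step 1: two parameters, sample file and size in MB.
if (@ARGV) {
    my ( $sample, $mb ) = @ARGV;
    create_huge_file( $sample, $mb * 1024 * 1024 );
}
```

Note the output will overshoot the target by up to one copy of the sample, which is usually fine for benchmarking.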

I have the following questions:

a.) What do I need to do to make sure that the length of the new file increases every time step 4 is executed?

b.) Since a lot of I/O is involved, is this the most optimised solution? If not, does anyone have a better design for my requirement?

c.) What are the likely bugs that may creep in with this algorithm?

Parag

Replies are listed 'Best First'.
Re: Script to create huge sample files
by GrandFather (Saint) on Jan 02, 2010 at 23:45 UTC

    3 MB is not huge. In the context of terabyte drives and multiple GB of physical memory it is in fact trivial! Assembling your output image in memory by simply concatenating copies of the input file onto the end of a string, then printing the resulting string to your output file, is likely to be fast and easy to code. length can then be used to calculate how many copies you need and how big your output image currently is.
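    A minimal sketch of that in-memory approach (the sample text here is a stand-in for the slurped input file):

```perl
use strict;
use warnings;

# Stand-in for the slurped sample file; in practice read the whole
# input file into this string first.
my $sample_text = "some sample record\n";
my $target      = 3 * 1024 * 1024;    # 3 MB

# length tells us how many copies are needed to reach the target.
my $copies = 1 + int( $target / length($sample_text) );

# Assemble the whole output image in memory with the x operator;
# a single print of $image to the output file then finishes the job.
my $image = $sample_text x $copies;
```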

    However, depending on the contents of the file and how it will be processed, assembling a test file in this fashion seems highly dubious to me! You would probably be better off figuring out a way of generating realistic data that will provide a thorough test set for your benchmarking. If the code being tested behaves differently for the first instance of a record than for subsequent occurrences, or if some record data is more expensive (in terms of processor time) than other data, then your multiple-copy data file may introduce nasty biases in the benchmark results.


    True laziness is hard work
Re: Script to create huge sample files
by BrowserUk (Patriarch) on Jan 02, 2010 at 21:06 UTC

    Kinda unwieldy, but on win you could use (wrapped):

    perl -e"$f=-s $ARGV[0]; $n=1+int($ARGV[1]*1024**3/$f); system 'copy ' . join('+', ($ARGV[0]) x $n) . ' huge.txt'" sample.txt 1

    There's probably a similar formulation for *nix.
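    One possible *nix formulation (a sketch only; the output name huge.txt is assumed, and the units are scaled down to MB here, matching the OP's example, where the one-liner above uses GB):

```shell
# Create a small stand-in sample file for demonstration.
printf 'some sample data\n' > sample.txt

# Slurp the sample once, then print it enough times to reach N MB.
perl -e '{local $/; open my $i, "<", $ARGV[0] or die $!; $s = <$i>}
print $s for 1 .. 1 + int($ARGV[1]*1024**2 / length($s))' sample.txt 3 > huge.txt
```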


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Script to create huge sample files
by marto (Cardinal) on Jan 02, 2010 at 20:45 UTC

    Did you actually try to turn your pseudocode into a Perl script? It looks as though you've been given a task to do at work and just pasted your requirements here.

Re: Script to create huge sample files
by bobr (Monk) on Jan 02, 2010 at 19:41 UTC
    Why not take the size of the input file and calculate the number of repetitions = $target_size/$input_size + 1?

    The size requirement of 3 MB+ is not all that large, so the default output buffering should be good enough.

    -- Roman

Re: Script to create huge sample files
by stonecolddevin (Parson) on Jan 02, 2010 at 18:43 UTC

    Is it imperative that you use flatfiles? Or could something like SQLite be used instead?

    mtfnpy

Re: Script to create huge sample files
by Anonymous Monk on Jan 02, 2010 at 21:04 UTC
    Would it not be faster just to divide the required size by the size of the initial file and generate the output in a loop running that number of times? (You could randomize the input's order if required.) If you concatenate the input into a single string, you then only need to write once. In fact you could even simply do something like this: $output = $input x (1 + int($target_size / length $input)); and then write.
Re: Script to create huge sample files
by wazoox (Prior) on Jan 04, 2010 at 18:00 UTC
    I've made the following script to generate a large set of text files. The generated files look like real text files: they are compressible, but not too much (about 50%). It should work on any Unix-like system (or Windows with an additional dictionary file as a source of words). Feel free to test and adapt.
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Carp;

    sub loaddict {
        my $dict = shift;
        open my $fh, $dict or croak "can't open $dict: $!";
        my @words = <$fh>;
        chomp @words;
        return \@words;
    }

    #######################
    # main
    my $testdir   = $ARGV[0] or die "usage : $0 <test folder> <number of files>";
    my $filecount = $ARGV[1] or die "usage : $0 <test folder> <number of files>";
    my $seed = 0;
    $seed = $ARGV[2] if defined $ARGV[2];

    # force number
    $filecount += 0;

    if ( not -d "$testdir" ) {
        mkdir "$testdir" or die "can't mkdir $testdir";
    }

    my $wordlist = loaddict("/usr/share/dict/words");
    srand( 42 + $seed );

    for ( 1 .. $filecount ) {
        open my $file, '>', "$testdir/$_" or croak "can't open file : $!";
        my $filesize = int( rand(10000) ) + 5000;
        for ( 1 .. $filesize ) {
            my $dice = int( rand( $#{$wordlist} ) );
            print $file $wordlist->[$dice] . " ";
            if ( $_ % 12 == 0 ) { print $file "\n"; }
        }
    }
