PerlMonks  

Script to create huge sample files

by paragkalra (Scribe)
on Jan 02, 2010 at 18:13 UTC ( [id://815331] )

paragkalra has asked for the wisdom of the Perl Monks concerning the following question:

Hello All,

A major part of my Perl scripting involves processing text files, and most of the time I need large text files (3 MB+) to perform benchmarking tests.

So I am planning to write a Perl script that will create a huge text file from the sample file it receives as its first input parameter. I have the following algorithm in mind:

1. Provide 2 input parameters to the Perl script - (i) the sample file, (ii) the size of the new file. E.g., to create a new file of size 3 MB: perl Create_Huge_File.pl Sample.txt 3

2. Read the input file and store the contents into an array.

3. Create a new file.

4. Dump the contents of the above array into the new file.

5. Check the length of the new file. If it is less than the size given by the second input parameter, repeat step 4; otherwise go to step 6.

6. Close the new file.
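The steps above could be sketched like this (a rough outline, not a finished script; the output file name and sub name are illustrative). Tracking the number of bytes written ourselves, rather than re-checking the file on disk, also answers question (a) below: the file grows by exactly one copy of the sample per pass.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Steps 2-6, wrapped in a sub: read the sample once, then write it
# repeatedly until the requested size is reached.
sub create_huge_file {
    my ( $sample, $target_bytes ) = @_;

    # Step 2: read the sample file into memory.
    open my $in, '<', $sample or die "can't open $sample: $!";
    my $chunk = do { local $/; <$in> };    # slurp the whole file
    close $in;
    die "$sample is empty\n" unless length $chunk;

    # Step 3: create the new file (name is illustrative).
    my $out_name = "huge_$sample";
    open my $out, '>', $out_name or die "can't create $out_name: $!";

    # Steps 4-5: dump the chunk until the target size is reached.
    # Counting bytes ourselves guarantees progress on every pass.
    my $written = 0;
    while ( $written < $target_bytes ) {
        print {$out} $chunk;
        $written += length $chunk;
    }

    # Step 6: close the new file.
    close $out;
    return $out_name;
}

# Step 1: two parameters, sample file and size in MB.
if (@ARGV) {
    my ( $sample, $mb ) = @ARGV;
    create_huge_file( $sample, $mb * 1024 * 1024 );
}
```

Note the output will overshoot the target by up to one copy of the sample, which is usually fine for benchmarking.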

I have the following questions:

a.) What do I need to do to make sure that the length of the new file increases every time step 4 is executed?

b.) Since a lot of I/O is involved, is this the most optimised solution? If not, does anyone have a better design for my requirement?

c.) What are the likely bugs that may creep in with this algorithm?

Parag

Replies are listed 'Best First'.
Re: Script to create huge sample files
by GrandFather (Saint) on Jan 02, 2010 at 23:45 UTC

    3 MB is not huge. In the context of terabyte drives and multiple GB of physical memory it is in fact trivial! Assembling your output image in memory by simply concatenating copies of the input file onto the end of a string, then printing the resulting string to your output file, is likely to be fast and easy to code. length can then be used to calculate how many copies you need and how big your output image currently is.
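    A minimal sketch of that in-memory approach (the sample text here is a stand-in for the slurped input file):

```perl
use strict;
use warnings;

# Stand-in for the slurped sample file; in practice read the whole
# input file into this string first.
my $sample_text = "some sample record\n";
my $target      = 3 * 1024 * 1024;    # 3 MB

# length tells us how many copies are needed to reach the target.
my $copies = 1 + int( $target / length($sample_text) );

# Assemble the whole output image in memory with the x operator;
# a single print of $image to the output file then finishes the job.
my $image = $sample_text x $copies;
```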

    However, depending on the contents of the file and how it will be processed, assembling a test file in this fashion seems highly dubious to me! You would probably be better off figuring out a way of generating realistic data that will provide a thorough test set for your benchmarking. If the code being tested behaves differently for the first instance of a record than for subsequent occurrences, or if some record data is more expensive (in terms of processor time) than other data, then your multiple-copy data file may introduce nasty biases in the benchmark results.


    True laziness is hard work
Re: Script to create huge sample files
by BrowserUk (Patriarch) on Jan 02, 2010 at 21:06 UTC

    Kinda unwieldy, but on win you could use (wrapped):

    perl -e"$f=-s $ARGV[0]; $n=1+int($ARGV[1]*1024**3/$f); system 'copy ' . join('+', ($ARGV[0]) x $n) . ' huge.txt'" sample.txt 1

    There's probably a similar formulation for *nix.
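    One possible *nix formulation (a sketch only; the output name huge.txt is assumed, and the units are scaled down to MB here, matching the OP's example, where the one-liner above uses GB):

```shell
# Create a small stand-in sample file for demonstration.
printf 'some sample data\n' > sample.txt

# Slurp the sample once, then print it enough times to reach N MB.
perl -e '{local $/; open my $i, "<", $ARGV[0] or die $!; $s = <$i>}
print $s for 1 .. 1 + int($ARGV[1]*1024**2 / length($s))' sample.txt 3 > huge.txt
```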


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Script to create huge sample files
by marto (Cardinal) on Jan 02, 2010 at 20:45 UTC

    Did you actually try to turn your pseudocode into a Perl script? It looks as though you've been given a task to do at work and just pasted your requirements here.

Re: Script to create huge sample files
by bobr (Monk) on Jan 02, 2010 at 19:41 UTC
    Why not take the size of the input file and calculate the number of repetitions = $target_size/$input_size + 1?

    The size requirement of 3 MB+ is not all that large, so the default output buffering should be good enough.

    -- Roman

Re: Script to create huge sample files
by stonecolddevin (Parson) on Jan 02, 2010 at 18:43 UTC

    Is it imperative that you use flatfiles? Or could something like SQLite be used instead?

    mtfnpy

Re: Script to create huge sample files
by Anonymous Monk on Jan 02, 2010 at 21:04 UTC
    Would it not be faster just to divide the required size by the size of the initial file and generate the output in a loop running that number of times? (You could randomize the input's order if required.) If you concatenate the input into a single string, you then only need to write once. In fact you could even simply do something like this: $output = $input x (1 + int($target_size / length $input)); and then write.
Re: Script to create huge sample files
by wazoox (Prior) on Jan 04, 2010 at 18:00 UTC
    I've made the following script to generate a large set of text files. The generated files look like real text files: they are compressible, but not too much (about 50%). It should work on any Unix-like system (or Windows with an additional dictionary file as a source of words). Feel free to test and adapt.
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Carp;

    sub loaddict {
        my $dict = shift;
        open my $fh, $dict or croak "can't open $dict: $!";
        my @words = <$fh>;
        chomp @words;
        return \@words;
    }

    #######################
    # main
    my $testdir   = $ARGV[0] or die "usage : $0 <test folder> <number of files>";
    my $filecount = $ARGV[1] or die "usage : $0 <test folder> <number of files>";
    my $seed = 0;
    $seed = $ARGV[2] if defined $ARGV[2];

    # force number
    $filecount += 0;

    if ( not -d "$testdir" ) {
        mkdir "$testdir" or die "can't mkdir $testdir";
    }

    my $wordlist = loaddict("/usr/share/dict/words");
    srand( 42 + $seed );

    for ( 1 .. $filecount ) {
        open my $file, '>', "$testdir/$_" or croak "can't open file : $!";
        my $filesize = int( rand(10000) ) + 5000;
        for ( 1 .. $filesize ) {
            my $dice = int( rand( $#{$wordlist} ) );
            print $file $wordlist->[$dice] . " ";
            if ( $_ % 12 == 0 ) { print $file "\n"; }
        }
    }
