I've made the following script to generate a large set of text files. The generated files look like real text files and are compressible, but not too much (about 50%). It should work on any Unix-like system (or on Windows, with an additional dictionary file as a source of words). Feel free to test and adapt.

#!/usr/bin/perl


use strict;
use warnings;
use Carp;


sub loaddict {    
    my $dict = shift;
    
    # three-argument open, so the file name is never parsed as a mode
    open my $fh, '<', $dict or croak "can't open $dict: $!";
    my @words = <$fh>;
    chomp @words;
        
    return \@words;
}

#######################
# main

my $testdir = $ARGV[0] 
    or die "usage : $0 <test folder> <number of files> [seed]\n";
    
my $filecount = $ARGV[1] 
    or die "usage : $0 <test folder> <number of files> [seed]\n";

my $seed = 0;
$seed = $ARGV[2] if defined  $ARGV[2];

# force number
$filecount += 0;

if ( not -d $testdir ) {
    mkdir $testdir or die "can't mkdir $testdir: $!";
}

my $wordlist = loaddict("/usr/share/dict/words");
srand(42 + $seed );

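for my $n ( 1 .. $filecount ) {
    open my $file, '>', "$testdir/$n" or croak "can't open file : $!";
    
    # number of words in this file (5000 to 14999), not a size in bytes
    my $wordcount = int( rand(10000) ) + 5000 ;
    for my $i ( 1 .. $wordcount ) {
        # rand(N) returns a value in [0, N), so pass the element count
        # to make every word, including the last one, selectable
        my $dice = int( rand( scalar @{$wordlist} ) ) ;
        print $file $wordlist->[$dice] . " ";
        if ( $i % 12 == 0 ) {
            print $file "\n";
        }
    }
    close $file or croak "can't close $testdir/$n: $!";
}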
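To check the "about 50%" figure on your own system, something like the following rough sketch might help. It is not part of the script above: random_text and gzip_ratio are hypothetical helper names, and the small inline word list is only there so the sketch runs without a dictionary file. With so few distinct words the text will compress much better than text built from a full dictionary, so load your real word list to get a meaningful number.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError);

# Hypothetical helper: build a string of $count random words,
# 12 per line, picked from the array ref $words.
sub random_text {
    my ( $words, $count, $seed ) = @_;
    srand($seed);    # same seed in the same perl gives the same sequence
    my $text = '';
    for my $i ( 1 .. $count ) {
        $text .= $words->[ int( rand( scalar @{$words} ) ) ] . ' ';
        $text .= "\n" if $i % 12 == 0;
    }
    return $text;
}

# Hypothetical helper: compressed size divided by original size.
sub gzip_ratio {
    my ($text) = @_;
    my $packed;
    gzip( \$text => \$packed ) or die "gzip failed: $GzipError";
    return length($packed) / length($text);
}

# Assumption: a tiny stand-in word list instead of /usr/share/dict/words.
my @words = qw(apple river stone cloud window garden silver thunder
               paper engine harbor lantern meadow pocket violin winter);

my $text = random_text( \@words, 5000, 42 );
printf "ratio: %.2f\n", gzip_ratio($text);
```

Calling random_text twice with the same seed should give identical text, which is the same idea as the optional seed argument of the script.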

In reply to Re: Script to create huge sample files by wazoox
in thread Script to create huge sample files by paragkalra


