PerlMonks  

how to split a file.txt in multiple text files

by saulnier (Initiate)
on Feb 12, 2019 at 08:29 UTC ( [id://1229778]=perlquestion )

saulnier has asked for the wisdom of the Perl Monks concerning the following question:

I'm new to Perl. I would like to write a Perl script that splits a long Greek text file (UTF-8 encoded) called Text.txt into n text files called Textn.txt, each containing the same number of Greek characters (3000, including spaces). Can anyone give me some direction?

Replies are listed 'Best First'.
Re: how to split a file.txt in multiple text files
by hippo (Bishop) on Feb 12, 2019 at 09:00 UTC

    You could use split as the starting point and either modify the byte count to be a character count or extend it by adding character count as an alternative.

    You will almost certainly want to read perlunitut before starting.
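To make the byte count versus character count distinction concrete, here is a minimal sketch (not part of the original reply). It assumes the core Encode module; the helper name byte_and_char_counts is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: return (byte_count, character_count)
# for a string holding raw UTF-8 bytes.
sub byte_and_char_counts {
    my ($bytes) = @_;
    my $nbytes = length $bytes;            # still raw bytes here
    my $chars  = decode('UTF-8', $bytes);  # bytes -> Perl character string
    return ($nbytes, length $chars);
}

# "αβγ δ" spelled out as raw UTF-8: each Greek letter takes 2 bytes.
my $raw = "\xCE\xB1\xCE\xB2\xCE\xB3 \xCE\xB4";
my ($nbytes, $nchars) = byte_and_char_counts($raw);
print "bytes=$nbytes chars=$nchars\n";     # bytes=9 chars=5
```

A splitter that counts 3000 bytes would therefore hold roughly half as many Greek letters per file and could cut a letter in two at a chunk boundary.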

Re: how to split a file.txt in multiple text files
by Tux (Canon) on Feb 12, 2019 at 15:01 UTC

    You can use $/ :)

    $ ls -l zzz
    -rw-rw-rw- 1 tux users 60892 Feb 12 15:54 zzz
    $ perl -CS -Mautodie -wE'$/=\3000;my$i="0000";while(<>){open my $fh, ">:encoding(utf-8)", "zz".$i++;print $fh $_}' < zzz
    $ ls -l zz0*
    -rw-rw-rw- 1 tux users 3624 Feb 12 15:58 zz0000
    -rw-rw-rw- 1 tux users 3681 Feb 12 15:58 zz0001
    -rw-rw-rw- 1 tux users 3661 Feb 12 15:58 zz0002
    -rw-rw-rw- 1 tux users 3655 Feb 12 15:58 zz0003
    -rw-rw-rw- 1 tux users 3652 Feb 12 15:58 zz0004
    -rw-rw-rw- 1 tux users 3634 Feb 12 15:58 zz0005
    -rw-rw-rw- 1 tux users 3640 Feb 12 15:58 zz0006
    -rw-rw-rw- 1 tux users 3646 Feb 12 15:58 zz0007
    -rw-rw-rw- 1 tux users 3631 Feb 12 15:58 zz0008
    -rw-rw-rw- 1 tux users 3631 Feb 12 15:58 zz0009
    -rw-rw-rw- 1 tux users 3692 Feb 12 15:58 zz0010
    -rw-rw-rw- 1 tux users 3659 Feb 12 15:58 zz0011
    -rw-rw-rw- 1 tux users 3647 Feb 12 15:58 zz0012
    -rw-rw-rw- 1 tux users 3648 Feb 12 15:58 zz0013
    -rw-rw-rw- 1 tux users 3634 Feb 12 15:58 zz0014
    -rw-rw-rw- 1 tux users 3643 Feb 12 15:58 zz0015
    -rw-rw-rw- 1 tux users 2514 Feb 12 15:58 zz0016

    Enjoy, Have FUN! H.Merijn
      Thank you tux. Your script works well but I also obtain a series of warnings such as:

      utf8 "\xCE" does not map to Unicode at split2.pl line 9, <> chunk 3.

      utf8 "\x94" does not map to Unicode at split2.pl line 9, <> chunk 4.

      Wide character in print at split2.pl line 9, <> chunk 2.

      ...

      and above all, many of the files created are filled with unintelligible characters instead of fragments of my Greek text. Any idea?

        • What is your OS?
        • What is your perl version? (perl -v)
        • Did you invoke the script with the required -CS command-line option?
          $ perl -CS split2.pl < inputfile

        My example was run on UTF-8 encoded files that contained quite a few characters outside the iso-8859-1 range, so I would have seen the same warnings if my example were seriously flawed.

        Is your data secret, or can you share it? If so, some of us might want to download it (in a zip) to check.

        As you converted my command-line example to a script, maybe it would be a good idea to show what the script looks like. You might have missed a crucial detail. It might look a bit like this:

        use strict;
        use warnings;
        use autodie;

        local $/ = \3000;
        my $i = "0000";
        while (<>) {
            my $fn = "zz" . $i++;
            open my $fh, ">:encoding(utf-8)", $fn or die "$fn: $!";
            print $fh $_;
            close $fh;
        }

        Enjoy, Have FUN! H.Merijn
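The script above still relies on -CS being passed on the command line. As a hedged alternative (not Tux's code), the sketch below puts the encoding layers on the file handles themselves, so it should behave the same without any switch on a reasonably recent perl. The helper name split_by_chars and the Text<n>.txt naming are assumptions taken from the question, not from the thread's answers.

```perl
use strict;
use warnings;
use autodie;    # open/close die with a message on failure

# Hypothetical helper: split $in_path into "${prefix}1.txt", "${prefix}2.txt", ...
# with at most $size characters each. Returns the names of the files written.
sub split_by_chars {
    my ($in_path, $prefix, $size) = @_;

    # Decode UTF-8 while reading, so records are counted in characters.
    open my $in, '<:encoding(UTF-8)', $in_path;
    local $/ = \$size;    # record mode: each <$in> returns up to $size characters

    my @written;
    while (my $chunk = <$in>) {
        my $name = sprintf '%s%d.txt', $prefix, scalar(@written) + 1;
        # Encode back to UTF-8 on output, which avoids "Wide character" warnings.
        open my $out, '>:encoding(UTF-8)', $name;
        print {$out} $chunk;
        close $out;
        push @written, $name;
    }
    close $in;
    return @written;
}

# Example invocation: perl split_chars.pl Text.txt Text 3000
if (@ARGV == 3) {
    print "$_\n" for split_by_chars(@ARGV);
}
```

Warnings such as "does not map to Unicode" typically appear when a chunk boundary falls in the middle of a multi-byte character, which is what tends to happen if the input handle is read as bytes instead of decoded characters.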
Re: how to split a file.txt in multiple text files
by bliako (Monsignor) on Feb 12, 2019 at 13:28 UTC
    perl -Mstrict -Mwarnings -Mutf8 -CSD -e 'my $c=0; my $i=1; my $fn = sprintf("%010d",$i);open(FH,">",$fn)||die"open $fn,$!";while(<>){while(/(\X)/g){print FH "$1";if(++$c%3000==0){close(FH);$fn=sprintf("%010d",++$i);open(FH,">",$fn)||die"open $fn,$!";}}}close(FH);' < input.txt

    The important switch is -CSD. The 'D' makes UTF-8 the default layer for every file handle you open (for reading or writing), and the 'S' applies UTF-8 to the standard handles (STDIN, STDOUT, STDERR, e.g. when printing diagnostics). More or less (see perlrun). -Mutf8 is a different thing: it tells perl that the source code itself is UTF-8, which matters when your code contains UTF-8 string literals. For example, length() of such a literal counts characters with that switch and bytes without it.
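For a script file, the same effect can be had without command-line switches. The following is a small sketch under the assumption that the script itself is saved as UTF-8; `use open` with :encoding(UTF-8) is only roughly equivalent to -CSD (it is stricter about malformed input), and the two sub names are made up for the demonstration.

```perl
use strict;
use warnings;
use utf8;                             # the source file itself is UTF-8
use open qw(:std :encoding(UTF-8));   # in-script stand-in for -CSD

# Under `use utf8` the literal is three characters.
sub chars_in_literal { return length("αβγ") }

# With the pragma switched off, the same literal is six raw bytes.
sub bytes_in_literal { no utf8; return length("αβγ") }

printf "with utf8: %d, without: %d\n", chars_in_literal(), bytes_in_literal();
```

Note that `use utf8` only affects literals in the source; data read from files is decoded by the I/O layers, not by this pragma.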

    10' Update: btw, "%010d" in sprintf() creates a filename from the file index padded with zeros to 10 digits, so you get 0000000001, 0000000002, etc. You said long files and perl knows no limits. btw2, it reads the input line by line, walks through it one \X match at a time (an extended grapheme cluster, i.e. one user-perceived character) and counts them up to 3000. Although Perl's IO is usually buffered, you may get better performance by slurping the whole file at once, if you can afford the RAM, and processing it as before.

    Update: Tux's solution (Re: how to split a file.txt in multiple text files) is way better than mine.
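The slurping variant mentioned above could look roughly like this. It is only a sketch: it works on an already decoded string, the helper name chunk_by_graphemes is hypothetical, and writing each chunk to its own file is left out.

```perl
use strict;
use warnings;

# Hypothetical helper: split a decoded character string into pieces of
# at most $size extended grapheme clusters (\X), so a base letter is
# never separated from its combining marks.
sub chunk_by_graphemes {
    my ($text, $size) = @_;
    return $text =~ /(\X{1,$size})/g;
}

# alpha + combining acute accent (2 code points, 1 cluster), then beta, gamma.
my $text  = "\x{3b1}\x{301}\x{3b2}\x{3b3}";
my @parts = chunk_by_graphemes($text, 2);
print scalar(@parts), " chunks; first has ", length($parts[0]), " code points\n";
# prints: 2 chunks; first has 3 code points
```

Counting \X matches instead of code points means a chunk of 3000 "characters" can hold more than 3000 code points when the text uses combining accents, which is probably what a reader of polytonic Greek would expect.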

    bw, bliako

Re: how to split a file.txt in multiple text files
by roboticus (Chancellor) on Feb 13, 2019 at 17:02 UTC

    saulnier:

    You might not need to write anything yourself: if you're on a *nix box or Mac OS X, there's a command-line utility called split that should be able to do the job for you. If you're on Windows and have installed MinGW or Cygwin, it's available there as well.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: how to split a file.txt in multiple text files
by Anonymous Monk on Feb 12, 2019 at 13:03 UTC
      and this helps how?
