how to split a file.txt in multiple text files

saulnier has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: how to split a file.txt in multiple text files by hippo (Bishop) on Feb 12, 2019 at 09:00 UTC
You could use `split` as the starting point and either change its byte count to a character count or extend it by adding a character count as an alternative. You will almost certainly want to read perlunitut before starting.
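To make the byte-versus-character distinction concrete, here is a minimal sketch. The sample word is an assumption chosen for illustration and is not taken from the original poster's file.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: the Greek word "Δοκιμή" given as raw UTF-8 bytes.
my $bytes = "\xCE\x94\xCE\xBF\xCE\xBA\xCE\xB9\xCE\xBC\xCE\xAE";

# Decode the bytes into a Perl character string.
my $chars = decode('UTF-8', $bytes);

my $byte_count = length $bytes;   # counts octets
my $char_count = length $chars;   # counts code points

print "bytes: $byte_count, characters: $char_count\n";
```

Each of these Greek letters takes two bytes in UTF-8, so a byte count is twice the character count here. A splitter that counts bytes can therefore cut in the middle of a letter.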
Re: how to split a file.txt in multiple text files by Tux (Canon) on Feb 12, 2019 at 15:01 UTC
You can use `$/` :)

    $ ls -l zzz
    -rw-rw-rw- 1 tux users 60892 Feb 12 15:54 zzz
    $ perl -CS -Mautodie -wE'$/=\3000;my$i="0000";while(<>){open my $fh, ">:encoding(utf-8)", "zz".$i++;print $fh $_}' < zzz
    $ ls -l zz0*
    -rw-rw-rw- 1 tux users 3624 Feb 12 15:58 zz0000
    -rw-rw-rw- 1 tux users 3681 Feb 12 15:58 zz0001
    -rw-rw-rw- 1 tux users 3661 Feb 12 15:58 zz0002
    -rw-rw-rw- 1 tux users 3655 Feb 12 15:58 zz0003
    -rw-rw-rw- 1 tux users 3652 Feb 12 15:58 zz0004
    -rw-rw-rw- 1 tux users 3634 Feb 12 15:58 zz0005
    -rw-rw-rw- 1 tux users 3640 Feb 12 15:58 zz0006
    -rw-rw-rw- 1 tux users 3646 Feb 12 15:58 zz0007
    -rw-rw-rw- 1 tux users 3631 Feb 12 15:58 zz0008
    -rw-rw-rw- 1 tux users 3631 Feb 12 15:58 zz0009
    -rw-rw-rw- 1 tux users 3692 Feb 12 15:58 zz0010
    -rw-rw-rw- 1 tux users 3659 Feb 12 15:58 zz0011
    -rw-rw-rw- 1 tux users 3647 Feb 12 15:58 zz0012
    -rw-rw-rw- 1 tux users 3648 Feb 12 15:58 zz0013
    -rw-rw-rw- 1 tux users 3634 Feb 12 15:58 zz0014
    -rw-rw-rw- 1 tux users 3643 Feb 12 15:58 zz0015
    -rw-rw-rw- 1 tux users 2514 Feb 12 15:58 zz0016

Enjoy, Have FUN! H.Merijn
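For readers unfamiliar with setting `$/` to a reference to a number, here is a small sketch of the same record-reading technique on an in-memory handle. It assumes Perl 5.18 or later, where records on a UTF-8 decoding handle are counted in characters rather than bytes. The ten-letter sample string and the record size of 4 are illustrative assumptions.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample: ten Greek letters (alpha through kappa), 20 bytes in UTF-8.
my $text  = join '', map { chr } 0x3B1 .. 0x3BA;
my $bytes = encode('UTF-8', $text);

# Read from the in-memory byte string through a decoding layer.
open my $in, '<:encoding(UTF-8)', \$bytes or die "open: $!";

my @records;
{
    local $/ = \4;   # fixed-size records of at most 4 characters each
    while (my $rec = <$in>) {
        push @records, $rec;
    }
}
close $in;

printf "record %d: %d characters\n", $_, length $records[$_] for 0 .. $#records;
```

This is also why the output files in the listing above are somewhat larger than 3000 bytes: each holds up to 3000 characters, some of which occupy more than one byte.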
Re^2: how to split a file.txt in multiple text files by saulnier (Initiate) on Feb 14, 2019 at 14:51 UTC
Thank you, Tux. Your script works well, but I also get a series of warnings such as:

    utf8 "\xCE" does not map to Unicode at split2.pl line 9, <> chunk 3.
    utf8 "\x94" does not map to Unicode at split2.pl line 9, <> chunk 4.
    Wide character in print at split2.pl line 9, <> chunk 2.
    ...

Above all, many of the files created are filled with unintelligible characters instead of fragments of my Greek text. Any idea?
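One plausible reading of these warnings, not confirmed anywhere in the visible thread: the bytes `CE 94` are the UTF-8 encoding of the Greek capital delta, so a `\xCE` at the end of one chunk and a `\x94` at the start of the next would be consistent with a chunk boundary falling between the two bytes of a single letter. The sketch below reproduces that effect on a two-letter sample, which is an assumption for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "ΑΔ" as raw UTF-8 bytes: Α is CE 91, Δ is CE 94.
my $bytes = "\xCE\x91\xCE\x94";

# A byte-oriented cut after 3 bytes separates the two bytes of Δ.
my $first  = substr $bytes, 0, 3;   # CE 91 CE
my $second = substr $bytes, 3;      # 94

# With its default CHECK setting, decode() substitutes U+FFFD for malformed input.
my $decoded_first  = decode('UTF-8', $first);
my $decoded_second = decode('UTF-8', $second);

# Count the replacement characters produced by the broken letter.
my $broken = () = ($decoded_first . $decoded_second) =~ /\x{FFFD}/g;
print "replacement characters after a mid-character cut: $broken\n";
```

If this is what happened, the cure is to make sure the input handle decodes UTF-8 before the fixed-size read, so that the record size is counted in characters.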
Re^3: how to split a file.txt in multiple text files by Tux (Canon) on Feb 14, 2019 at 16:24 UTC
What is your OS? What is your perl version (`perl -v`)? Did you invoke the script with the required `-CS` command-line option?

    $ perl -CS split2.pl < inputfile

My example was used on UTF-8 encoded files that contained quite a few characters outside of the `iso-8859-1` range, so I should have seen the same warnings if my example was seriously flawed.

Is your data secret, or is it sharable? If it is sharable, some of us might want to download it (in a zip) to check.

As you converted my command-line example to a script, maybe it would be a good idea to show what the script looks like. You might have missed a crucial issue. It might look a bit like this:

    use strict;
    use warnings;
    use autodie;

    local $/ = \3000;       # read fixed-size records of 3000
    my $i = "0000";         # string counter, so $i++ keeps the leading zeros
    while (<>) {
        my $fn = "zz" . $i++;
        open my $fh, ">:encoding(utf-8)", $fn or die "$fn: $!";
        print $fh $_;
        close $fh;
    }

Enjoy, Have FUN! H.Merijn
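As a sketch of a variant that does not depend on remembering `-CS`, the encoding layers can be set explicitly on both the input and the output handle. The helper name `split_by_chars`, the temporary directory, and the tiny sample input are all assumptions for illustration. Character-counted records again assume Perl 5.18 or later.

```perl
use strict;
use warnings;
use autodie;
use File::Temp qw(tempdir);

# Hypothetical helper: split $in_path into files of at most $size characters,
# named $prefix0000, $prefix0001, and so on. Returns the file names written.
sub split_by_chars {
    my ($in_path, $prefix, $size) = @_;
    open my $in, '<:encoding(UTF-8)', $in_path;   # decode on input
    local $/ = \$size;                            # fixed-size records
    my $i = '0000';                               # string increment keeps the zeros
    my @written;
    while (my $chunk = <$in>) {
        my $fn = $prefix . $i++;
        open my $out, '>:encoding(UTF-8)', $fn;   # encode on output
        print {$out} $chunk;
        close $out;
        push @written, $fn;
    }
    close $in;
    return @written;
}

# Demo on a throwaway file: seven Greek alphas, 14 bytes on disk.
my $dir   = tempdir(CLEANUP => 1);
my $input = "$dir/input.txt";
open my $fh, '>:encoding(UTF-8)', $input;
print {$fh} chr(0x3B1) x 7;
close $fh;

my @files = split_by_chars($input, "$dir/zz", 3);
print scalar(@files), " files written\n";
```

Declaring the layers in the `open` calls keeps the behaviour the same whether or not the script is started with `-CS`.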
Re^4: how to split a file.txt in multiple text files by saulnier (Initiate) on Feb 14, 2019 at 21:18 UTC
Re^5: how to split a file.txt in multiple text files by choroba (Cardinal) on Feb 14, 2019 at 21:52 UTC
Re^5: how to split a file.txt in multiple text files by saulnier (Initiate) on Feb 15, 2019 at 07:34 UTC
Some notes below your chosen depth have not been shown here
Re: how to split a file.txt in multiple text files by bliako (Monsignor) on Feb 12, 2019 at 13:28 UTC
    perl -Mstrict -Mwarnings -Mutf8 -CSD -e 'my $c=0; my $i=1; my $fn = sprintf("%010d",$i);open(FH,">",$fn)||die"open $fn,$!";while(<>){while(/(\X)/g){print FH "$1";if(++$c%3000==0){close(FH);$fn=sprintf("%010d",++$i);open(FH,">",$fn)||die"open $fn,$!";}}}close(FH);' < input.txt

The important switch is `-CSD`, which tells perl that all file opens (for reading and writing) should use UTF-8 (the 'D') and that all reads and writes on the standard file handles (STDOUT, STDERR, STDIN, e.g. printing diagnostics) should be done with UTF-8 encoding (the 'S'). More or less (see perlrun). `-Mutf8` matters when the source code itself contains UTF-8 text: with it, the length of a Unicode string literal is counted in characters, without it in bytes.

10' Update: btw, "%010d" in `sprintf()` tells it to create a filename from the file index padded with zeros, which means you get 0000000001, ...2 etc. You said long files and perl knows no limits.

btw2, it walks the input character by character (strictly, one `\X` grapheme cluster at a time) and counts them up to 3000. Although Perl's IO is usually buffered, you may get better performance by slurping the file all at once, if you can afford the RAM, and processing it as before.

Update: Tux's solution in Re: how to split a file.txt in multiple text files is way better than mine.

bw, bliako
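The slurp-then-process idea mentioned above could be sketched as follows, working on a string that is already decoded and in memory. The helper name `chunk_graphemes`, the sample text, and the chunk size of 2 are assumptions for illustration. It counts `\X` grapheme clusters, as the one-liner does, so a base letter and its combining accent stay together.

```perl
use strict;
use warnings;

# Hypothetical helper: cut a decoded string into pieces of at most
# $n grapheme clusters each.
sub chunk_graphemes {
    my ($text, $n) = @_;
    my @chunks;
    # \G continues from where the previous match ended.
    while ($text =~ /\G((?:\X){1,$n})/g) {
        push @chunks, $1;
    }
    return @chunks;
}

# Sample: alpha + combining acute, beta, gamma + combining acute.
# That is 5 code points but only 3 grapheme clusters.
my $text   = "\x{3B1}\x{301}\x{3B2}\x{3B3}\x{301}";
my @chunks = chunk_graphemes($text, 2);

printf "chunk %d: %d code points\n", $_, length $chunks[$_] for 0 .. $#chunks;
```

Counting clusters rather than code points means the pieces can differ in code-point length, but no accent is ever separated from its letter.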
Re: how to split a file.txt in multiple text files by roboticus (Chancellor) on Feb 13, 2019 at 17:02 UTC
saulnier: You might not need to write anything yourself: If you're on a *nix box or Mac OS X, there's a command-line utility called `split` that should be able to do the job for you. If you're on Windows and have installed MinGW or Cygwin, it's available there as well. ...roboticus When your only tool is a hammer, all problems look like your thumb.
Re: how to split a file.txt in multiple text files by Anonymous Monk on Feb 12, 2019 at 13:03 UTC
https://www.perl.com/pub/2012/05/perlunicookbook-unicode-normalization.html/
Re^2: how to split a file.txt in multiple text files by Anonymous Monk on Feb 12, 2019 at 13:26 UTC
and this helps how?
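One possible connection, which the anonymous poster did not spell out: the same accented Greek letter can be stored either as one precomposed code point or as a base letter plus a combining mark, so a character count depends on the normalization form. A minimal sketch with the core module Unicode::Normalize, using an illustrative sample:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Alpha followed by a combining acute accent: two code points.
my $decomposed = "\x{3B1}\x{301}";

# NFC composes the pair into the single code point U+03AC.
my $composed = NFC($decomposed);

my $nfc_length = length $composed;
my $nfd_length = length NFD($composed);

print "NFC: $nfc_length code point(s), NFD: $nfd_length code point(s)\n";
```

Normalizing the text to one form before counting would make a split at every 3000 characters land at the same places regardless of how the input happened to be encoded.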

