PerlMonks  

Re^2: how to split a file.txt in multiple text files

by saulnier (Initiate)
on Feb 14, 2019 at 14:51 UTC


in reply to Re: how to split a file.txt in multiple text files
in thread how to split a file.txt in multiple text files

Thank you, Tux. Your script works well, but I also get a series of warnings such as:

utf8 "\xCE" does not map to Unicode at split2.pl line 9, <> chunk 3.

utf8 "\x94" does not map to Unicode at split2.pl line 9, <> chunk 4.

Wide character in print at split2.pl line 9, <> chunk 2.

...

Above all, many of the files created are filled with unintelligible characters instead of fragments of my Greek text. Any idea?

Re^3: how to split a file.txt in multiple text files
by Tux (Canon) on Feb 14, 2019 at 16:24 UTC
    • What is your OS?
    • What is your perl version? (perl -v)
    • Did you invoke the script with the required -CS command-line option?
      $ perl -CS split2.pl < inputfile

    My example was run on UTF-8 encoded files that contained quite a few characters outside the ISO-8859-1 range, so I would have seen the same warnings if my example were seriously flawed.

    Is your data secret, or can it be shared? If it can, some of us might want to download it (in a zip) to check.

    As you converted my command-line example to a script, it would be a good idea to show what the script looks like. You might have missed something crucial. It might look a bit like this:

    use strict;
    use warnings;
    use autodie;

    local $/ = \3000;    # read fixed-size records of 3000 units
    my $i = "0000";
    while (<>) {
        my $fn = "zz" . $i++;
        open my $fh, ">:encoding(utf-8)", $fn or die "$fn: $!";
        print $fh $_;
        close $fh;
    }

    Enjoy, Have FUN! H.Merijn
      OS: Windows 10 Home
      perl 5, version 14, subversion 2 (v5.14.2) built for MSWin32-x86-multi-thread

      This is my script split2.pl
      use strict; use warnings; use autodie; $/=\3000; my$i="000"; while(<>){open my $fh, ">:encoding(utf-8)", "input".$i++.".txt"; print $fh $_; close $fh;}
      If I invoke the script this way:
      perl -CS split2.pl <input.txt
      I get these messages:
      utf8 "\xE1" does not map to Unicode at split2.pl line 11, <> chunk 2.
      Close with partial character at (eval 21) line 67, <> chunk 2.
      and only the first fragment, "input000.txt", is created.

      If I run the script without -CS, there is no warning message and all the files are created. But they contain unintelligible characters, not my Greek text split into pieces.

      I can share my Greek text (346 kB), but I do not know exactly how to do that from here.
        -CS applies UTF-8 to the standard streams (STDIN, STDOUT and STDERR), but the diamond operator uses the ARGV handle, not STDIN.

        Use -Ci to make UTF-8 the default encoding for all input; or use -CD to make UTF-8 the default for all input and output, in which case you can even drop the encoding from the open line.
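
An alternative that does not depend on any -C switch is to open the input file by name with an explicit encoding layer. The following is only a sketch under assumptions: the helper name split_utf8_file, the output prefix and the four-digit numbering are illustrative and not taken from the scripts in this thread. It relies on read() counting characters rather than bytes once the handle carries a UTF-8 layer, so a chunk boundary should not fall inside a multi-byte character.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): split $in_path into files
# of at most $size characters each, named "${prefix}0000.txt", ...
sub split_utf8_file {
    my ($in_path, $size, $prefix) = @_;

    # Decode on input explicitly instead of relying on -C switches.
    open my $in, '<:encoding(UTF-8)', $in_path or die "$in_path: $!";

    my @written;
    my $i = 0;
    # On a handle with a UTF-8 layer, read() counts characters, not bytes.
    while (read($in, my $chunk, $size)) {
        my $out_path = sprintf '%s%04d.txt', $prefix, $i++;
        open my $out, '>:encoding(UTF-8)', $out_path or die "$out_path: $!";
        print {$out} $chunk;
        close $out or die "$out_path: $!";
        push @written, $out_path;
    }
    close $in;
    return @written;
}

# Usage: perl split_explicit.pl input.txt
if (@ARGV) {
    print "$_\n" for split_utf8_file($ARGV[0], 3000, 'input');
}
```

This only helps if the input really is UTF-8; for a file in another encoding, the layer name on the input handle would have to change accordingly.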

        Update: Using redirection works for me (Linux). Are you sure the input is encoded in UTF-8?
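
One way to answer that question is to read the raw bytes and attempt a strict decode. This is a minimal sketch; the helper name is_valid_utf8 is made up for illustration. It uses Encode's decode with the Encode::FB_CROAK check, which dies on the first malformed sequence.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns 1 if the raw bytes of $path decode as
# strict UTF-8, 0 otherwise.
sub is_valid_utf8 {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "$path: $!";
    my $bytes = do { local $/; <$fh> };    # slurp the whole file as bytes
    close $fh;
    # FB_CROAK makes decode() die on malformed input; eval traps that.
    # $bytes is a local copy, so it does not matter if decode() alters it.
    return eval { decode('UTF-8', $bytes, Encode::FB_CROAK); 1 } ? 1 : 0;
}

# Usage: perl check_utf8.pl input.txt
if (@ARGV) {
    print is_valid_utf8($ARGV[0]) ? "valid UTF-8\n" : "not valid UTF-8\n";
}
```

A file that fails this check is probably in some other encoding (a legacy single-byte Greek code page or UTF-16 would be candidates, though that is a guess), and the open layers would need to name that encoding instead.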

        Update2: Verified with the file you linked to.

        perl -CS split2.pl < input.txt
        works correctly on Linux.

        map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
        This is my input file

        https://ufile.io/v0g1c
