http://qs321.pair.com?node_id=11148376


in reply to Re^5: getting rid of UTF-8
in thread getting rid of UTF-8

I did the $line =~ s/\xef\xbb\xbf// and it didn't remove the characters!

Using the advice from kcott here to use /g, it works for me. If it really doesn't work for you, then perhaps the data you have in your Perl string is not what you think it is. See my node here for advice on how to show us the real data, in particular Devel::Peek, and make sure to provide an SSCCE that we can run to see the problem for ourselves.
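As a minimal sketch of both suggestions (the sample string below is made up for illustration and is not the poster's data), this compares the substitution with and without /g and then hands the result to Devel::Peek:

```perl
use strict;
use warnings;
use Devel::Peek qw(Dump);

# Made-up sample: the three BOM bytes appear twice in one string.
my $line = "\xef\xbb\xbfImportance,\"\xef\xbb\xbf953-1513\n";

( my $once = $line ) =~ s/\xef\xbb\xbf//;     # removes only the first occurrence
( my $all  = $line ) =~ s/\xef\xbb\xbf//g;    # removes every occurrence

# Compare the lengths: each removed BOM shortens the string by three bytes.
print length($line), " ", length($once), " ", length($all), "\n";

# Dump writes the scalar's internal representation (flags, PV bytes) to STDERR,
# which shows what is really stored in the string.
Dump($all);
```

If the lengths do not shrink as expected, the string probably does not contain the raw bytes ef bb bf, and the Dump output is the thing to post.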

Re^7: getting rid of UTF-8
by BernieC (Pilgrim) on Nov 25, 2022 at 15:08 UTC
    I must be doing something stupid. Here's my little test program:
        #!/usr/bin/perl
        use v5.10 ;
        use strict;
        use warnings ;

        my $BOM = "\xef\xbb\xbf" ;
        die "no args\n" unless @ARGV == 2 ;
        open (my $i, "<", $ARGV[0]) or die "Can't open $ARGV[0]\n" ;
        open (my $o, ">", $ARGV[1]) or die "Can't write to $ARGV[1]\n" ;

        say "marker is" ;
        printhex ($BOM) ;
        say "" ;

        while (my $line = <$i>) {
            my $newline = $line ;
            printhex ($newline) ;
            $newline =~ s/$BOM//g;
            die "didn't change" if $newline eq $line ;
            print $o $newline ;
        }
        close $i ;
        close $o ;
        exit ;

        sub printhex {
            my $str = $_[0] ;
            for my $chr (split(//,$str)) {
                printf("%x ", ord($chr)) ;
            }
        }
    and when I run it on one of the BOM'ed files I get:
        marker is
        ef bb bf
        didn't change at D:\Desktop\striputf.pl line 19, <$i> line 3.
        ef bb bf 49 6d 70 6f 72 74 61 6e 63 65 2c 46 69 72 73 74 20 4e 61 6d 65 2c 4d 69 64 64 6c 65 20 4e 61 6d 65 2c 4c 61 73 74 20 4e 61 6d 65 2c 46 75 6c 6c 20 4e 61 6d 65 2c 43 6f 6d 70 61 6e 79 2c 44 65 70 61 72 74 6d 65 6e 74 2c 4a 6f 62 20 54 69 74 6c 65 2c 53 74 72 65 65 74 20 28 62 2e 29 2c 43 69 74 79 20 28 62 2e 29 2c 53 74 61 74 65 20 28 62 2e 29 2c 5a 49 50 20 43 6f 64 65 20 28 62 2e 29 2c 43 6f 75 6e 74 72 79 2f 52 65 67 69 6f 6e 20 28 62 2e 29 2c 48 6f 6d 65 20 50 68 6f 6e 65 2c 42 75 73 69 6e 65 73 73 20 50 68 6f 6e 65 2c 4d 6f 62 69 6c 65 20 50 68 6f 6e 65 2c 42 75 73 69 6e 65 73 73 20 50 68 6f 6e 65 20 32 2c 42 75 73 69 6e 65 73 73 20 50 68 6f 6e 65 20 33 2c 42 75 73 69 6e 65 73 73 20 50 68 6f 6e 65 20 34 2c 42 75 73 69 6e 65 73 73 20 46 61 78 2c 42 75 73 69 6e 65 73 73 20 57 65 62 20 50 61 67 65 2c 53 74 72 65 65 74 20 28 68 2e 29 2c 43 69 74 79 20 28 68 2e 29 2c 53 74 61 74 65 20 28 68 2e 29 2c 5a 49 50 20 43 6f 64 65 20 28 68 2e 29 .....
    What am I getting wrong/missing?

      Thanks for providing sample code. Unfortunately, I can't reproduce the issue on either Windows or Linux. A common reason is a mixup between the code and data you've posted here and the code and data you're actually using on your machine (Update: in this case, if the AM's guess is correct, then the issue is that you didn't post the full input data and script output). Make sure to re-read my nodes here and here on how to provide the most transparent data for us, and provide us with a full set of information to reproduce the issue, like the following.

          C:\Temp>hex -b i.txt
          ef bb bf 49 6d 70 6f 72 74 61 6e 63 65 2c 46 69  # 00000000 ...Importance,Fi
          72 73 74 20 4e 61 6d 65 2c 4d 69 64 64 6c 65 20  # 00000010 rst Name,Middle 
          4e 61 6d 65 2c 4c 61 73 74 20 4e 61 6d 65 2c 46  # 00000020 Name,Last Name,F
          75 6c 6c 20 4e 61 6d 65 2c 43 6f 6d 70 61 6e 79  # 00000030 ull Name,Company
          2c 44 65 70 61 72 74 6d 65 6e 74 2c 4a 6f 62 20  # 00000040 ,Department,Job 
          54 69 74 6c 65 2c 53 74 72 65 65 74 20 28 62 2e  # 00000050 Title,Street (b.
          29 2c 43 69 74 79 20 28 62 2e 29 2c 53 74 61 74  # 00000060 ),City (b.),Stat
          65 20 28 62 2e 29 2c 5a 49 50 20 43 6f 64 65 20  # 00000070 e (b.),ZIP Code 
          28 62 2e 29 2c 43 6f 75 6e 74 72 79 2f 52 65 67  # 00000080 (b.),Country/Reg
          69 6f 6e 20 28 62 2e 29 2c 48 6f 6d 65 20 50 68  # 00000090 ion (b.),Home Ph
          6f 6e 65 2c 42 75 73 69 6e 65 73 73 20 50 68 6f  # 000000a0 one,Business Pho
          6e 65 2c 4d 6f 62 69 6c 65 20 50 68 6f 6e 65 2c  # 000000b0 ne,Mobile Phone,
          42 75 73 69 6e 65 73 73 20 50 68 6f 6e 65 20 32  # 000000c0 Business Phone 2
          2c 42 75 73 69 6e 65 73 73 20 50 68 6f 6e 65 20  # 000000d0 ,Business Phone 
          33 2c 42 75 73 69 6e 65 73 73 20 50 68 6f 6e 65  # 000000e0 3,Business Phone
          20 34 2c 42 75 73 69 6e 65 73 73 20 46 61 78 2c  # 000000f0  4,Business Fax,
          42 75 73 69 6e 65 73 73 20 57 65 62 20 50 61 67  # 00000100 Business Web Pag
          65 2c 53 74 72 65 65 74 20 28 68 2e 29 2c 43 69  # 00000110 e,Street (h.),Ci
          74 79 20 28 68 2e 29 2c 53 74 61 74 65 20 28 68  # 00000120 ty (h.),State (h
          2e 29 2c 5a 49 50 20 43 6f 64 65 20 28 68 2e 29  # 00000130 .),ZIP Code (h.)

          C:\Temp>perl 11148386.pl i.txt o.txt
          marker is
          ef bb bf
          ef bb bf 49 6d 70 6f 72 74 61 6e 63 65 2c 46 69 72 73 74 20 4e 61 6d 65 2c 4d 69 64 64 6c 65 20 4e 61 6d 65 2c 4c 61 73 74 20 4e 61 6d 65 2c 46 75 6c 6c 20 4e 61 6d 65 2c 43 6f 6d 70 61 6e 79 2c 44 65 70 61 72 74 6d 65 6e 74 2c 4a 6f 62 20 54 69 74 6c 65 2c 53 74 72 65 65 74 20 28 62 2e 29 2c 43 69 74 79 20 28 62 2e 29 2c 53 74 61 74 65 20 28 62 2e 29 2c 5a 49 50 20 43 6f 64 65 20 28 62 2e 29 2c 43 6f 75 6e 74 72 79 2f 52 65 67 69 6f 6e 20 28 62 2e 29 2c 48 6f 6d 65 20 50 68 6f 6e 65 2c 42 75 73 69 6e 65 73 73 20 50 68 6f 6e 65 2c 4d 6f 62 69 6c 65 20 50 68 6f 6e 65 2c 42 75 73 69 6e 65 73 73 20 50 68 6f 6e 65 20 32 2c 42 75 73 69 6e 65 73 73 20 50 68 6f 6e 65 20 33 2c 42 75 73 69 6e 65 73 73 20 50 68 6f 6e 65 20 34 2c 42 75 73 69 6e 65 73 73 20 46 61 78 2c 42 75 73 69 6e 65 73 73 20 57 65 62 20 50 61 67 65 2c 53 74 72 65 65 74 20 28 68 2e 29 2c 43 69 74 79 20 28 68 2e 29 2c 53 74 61 74 65 20 28 68 2e 29 2c 5a 49 50 20 43 6f 64 65 20 28 68 2e 29

          C:\Temp>hex -b o.txt
          49 6d 70 6f 72 74 61 6e 63 65 2c 46 69 72 73 74  # 00000000 Importance,First
          20 4e 61 6d 65 2c 4d 69 64 64 6c 65 20 4e 61 6d  # 00000010  Name,Middle Nam
          65 2c 4c 61 73 74 20 4e 61 6d 65 2c 46 75 6c 6c  # 00000020 e,Last Name,Full
          20 4e 61 6d 65 2c 43 6f 6d 70 61 6e 79 2c 44 65  # 00000030  Name,Company,De
          70 61 72 74 6d 65 6e 74 2c 4a 6f 62 20 54 69 74  # 00000040 partment,Job Tit
          6c 65 2c 53 74 72 65 65 74 20 28 62 2e 29 2c 43  # 00000050 le,Street (b.),C
          69 74 79 20 28 62 2e 29 2c 53 74 61 74 65 20 28  # 00000060 ity (b.),State (
          62 2e 29 2c 5a 49 50 20 43 6f 64 65 20 28 62 2e  # 00000070 b.),ZIP Code (b.
          29 2c 43 6f 75 6e 74 72 79 2f 52 65 67 69 6f 6e  # 00000080 ),Country/Region
          20 28 62 2e 29 2c 48 6f 6d 65 20 50 68 6f 6e 65  # 00000090  (b.),Home Phone
          2c 42 75 73 69 6e 65 73 73 20 50 68 6f 6e 65 2c  # 000000a0 ,Business Phone,
          4d 6f 62 69 6c 65 20 50 68 6f 6e 65 2c 42 75 73  # 000000b0 Mobile Phone,Bus
          69 6e 65 73 73 20 50 68 6f 6e 65 20 32 2c 42 75  # 000000c0 iness Phone 2,Bu
          73 69 6e 65 73 73 20 50 68 6f 6e 65 20 33 2c 42  # 000000d0 siness Phone 3,B
          75 73 69 6e 65 73 73 20 50 68 6f 6e 65 20 34 2c  # 000000e0 usiness Phone 4,
          42 75 73 69 6e 65 73 73 20 46 61 78 2c 42 75 73  # 000000f0 Business Fax,Bus
          69 6e 65 73 73 20 57 65 62 20 50 61 67 65 2c 53  # 00000100 iness Web Page,S
          74 72 65 65 74 20 28 68 2e 29 2c 43 69 74 79 20  # 00000110 treet (h.),City 
          28 68 2e 29 2c 53 74 61 74 65 20 28 68 2e 29 2c  # 00000120 (h.),State (h.),
          5a 49 50 20 43 6f 64 65 20 28 68 2e 29           # 00000130 ZIP Code (h.)

          C:\Temp>type 11148386.pl
          #!/usr/bin/perl
          use v5.10 ;
          use strict;
          use warnings ;

          my $BOM = "\xef\xbb\xbf" ;
          die "no args\n" unless @ARGV == 2 ;
          open (my $i, "<", $ARGV[0]) or die "Can't open $ARGV[0]\n" ;
          open (my $o, ">", $ARGV[1]) or die "Can't write to $ARGV[1]\n" ;

          say "marker is" ;
          printhex ($BOM) ;
          say "" ;

          while (my $line = <$i>) {
              my $newline = $line ;
              printhex ($newline) ;
              $newline =~ s/$BOM//g;
              die "didn't change" if $newline eq $line ;
              print $o $newline ;
          }
          close $i ;
          close $o ;
          exit ;

          sub printhex {
              my $str = $_[0] ;
              for my $chr (split(//,$str)) {
                  printf("%x ", ord($chr)) ;
              }
          }

          C:\Temp>perl -v
          This is perl 5, version 32, subversion 1 (v5.32.1) built for MSWin32-x86-multi-thread-64int
          ...
        I just double checked and it is actually working!! Instead of being fancy I just printed a hex dump before and a hex dump after and I could then look at the output with a text editor and it was clear.
            before: ef bb bf 49 6d 70 6f 72 74 61 6e 63 65 2c 46 69 72 73 74 20 4e 61 6d 65 2c 4d 69 64 64
            after:  49 6d 70 6f 72 74 61 6e 63 65 2c 46 69 72 73 74 20 4e 61 6d 65 2c 4d 69 64 64 6c 65 20
            before: 2c 2c 2c 2c 2c 22 ef bb bf 39 35 33 2d 31 35 31 33 a
            after:  2c 2c 2c 2c 2c 22 39 35 33 2d 31 35 31 33 a
        Sorry for sowing so much confusion due to my incompetence...
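The before/after check described above can be sketched roughly as follows. The hexdump helper and the sample line are hypothetical, chosen only to mirror the second before/after pair, where the BOM sits in the middle of a line:

```perl
use strict;
use warnings;

# Hypothetical helper: space-separated two-digit hex of every byte in a string.
sub hexdump {
    my ($str) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $str;
}

my $BOM = "\xef\xbb\xbf";

# Made-up sample line with a BOM in the middle rather than at the start.
my $before = ",,\"${BOM}953\n";
( my $after = $before ) =~ s/$BOM//g;

# Printing both dumps side by side makes the removed bytes easy to spot.
print "before: ", hexdump($before), "\n";
print "after:  ", hexdump($after),  "\n";
```

Unlike the original printhex, this pads each byte to two digits, so a newline shows up as 0a rather than a.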

      ...<$i> line 3

      I guess the hex output shown was flushed late and belongs to earlier lines, where the substitution succeeded; the die itself was triggered on input line 3.
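A minimal sketch of that explanation, assuming that only some lines of the file contain a BOM: STDOUT is buffered while die writes to STDERR right away, so the die message can appear ahead of hex output that was printed earlier. Turning on autoflush and not dying on unchanged lines avoids the confusion. The strip_bom helper and the sample lines are hypothetical.

```perl
use strict;
use warnings;

$| = 1;    # autoflush STDOUT so its output is not held back behind STDERR

my $BOM = "\xef\xbb\xbf";

# Hypothetical helper: remove every BOM byte sequence from a line and
# report how many were found (0 if the line had none).
sub strip_bom {
    my ($line) = @_;
    my $count = ( $line =~ s/$BOM//g ) || 0;
    return ( $line, $count );
}

# Made-up sample lines: a BOM at the start, none at all, and one mid-line.
my @lines = (
    "${BOM}Importance,First Name\n",
    "plain,line\n",
    ",,\"${BOM}953-1513\n",
);

for my $line (@lines) {
    my ( $clean, $count ) = strip_bom($line);
    # A line without a BOM is normal input, so it is reported, not fatal.
    print "removed $count BOM(s): $clean";
}
```

The point of the design is that an unchanged line is expected for most of the file, so treating it as an error stops the script on the first line that never had a BOM.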