
Re^3: getting rid of UTF-8

by haukex (Archbishop)
on Nov 24, 2022 at 23:04 UTC (#11148364)


in reply to Re^2: getting rid of UTF-8
in thread getting rid of UTF-8

The issue with the sample data you posted is that it is entirely ASCII with some BOMs in it, but from your description it sounded like you could have other Latin-1 (or CP1252 or Latin-9) or UTF-8 characters in it, which you don't show.

What should I do either to upload something to here with example problematic stuff

A hex dump of the raw bytes like you showed above is fine. See also my node here.
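For anyone who needs to produce such a dump from Perl itself, here is a minimal sketch. The helper names `hex_dump` and `slurp_raw` are made up for this illustration, and the sample string is only a stand-in for a real file. The point is to read with the `:raw` layer so that no decoding happens before the bytes are printed.

```perl
use strict;
use warnings;

# Return the bytes of a string as space-separated uppercase hex pairs.
# Assumes $bytes holds raw bytes (nothing above 0xFF).
sub hex_dump {
    my ($bytes) = @_;
    return uc(join(' ', unpack('(H2)*', $bytes)));
}

# Read a whole file with no decoding layer, so we see what is on disk.
# (Hypothetical helper; not called in the demonstration below.)
sub slurp_raw {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;                       # slurp mode
    my $data = <$fh>;
    close $fh;
    return $data;
}

# Demonstration on an in-memory sample instead of a real file.
my $sample = "\xEF\xBB\xBFImportance,\"First Name\"\r\n";
print hex_dump($sample), "\n";
```

With a real file you would call `hex_dump(slurp_raw($path))` and post the first few hundred bytes.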

and/or be able just to brute-force fix it?

Iff your data consists entirely of a single-byte encoding like the ones I named above, and the only UTF-8 characters that appear in it are BOMs, then the regex you showed in the root node may be acceptable. However, I very much expect that if there's a BOM, then other UTF-8 characters can be present, and if those are mixed with single-byte encodings, or you've got double-encoded characters, you'll have a tough time picking that apart. But again, you'd need to show us more representative data.
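A rough sketch of how that assumption could be checked before trusting the brute-force fix: strip every BOM byte sequence, then count whatever non-ASCII bytes remain. The sub name `strip_boms` and the sample string are invented for this example, and it assumes the input is a raw byte string (read with `:raw`), not decoded text.

```perl
use strict;
use warnings;

# Remove every UTF-8 BOM byte sequence (EF BB BF) from a byte string.
# Returns the cleaned bytes, the number of BOMs removed, and the number
# of remaining bytes outside the 7-bit ASCII range.
sub strip_boms {
    my ($bytes) = @_;
    my $removed   = ($bytes =~ s/\xEF\xBB\xBF//g) || 0;
    my $non_ascii = () = $bytes =~ /[^\x00-\x7F]/g;
    return ($bytes, $removed, $non_ascii);
}

# Stand-in for file content: one BOM at the start, one inside a field.
my ($clean, $removed, $non_ascii)
    = strip_boms("\xEF\xBB\xBFImportance,\"\xEF\xBB\xBF953-1513\"\r\n");

print "removed $removed BOM(s), $non_ascii non-ASCII byte(s) left\n";
```

If the last number is not zero, the data is not "ASCII plus BOMs" after all, and the remaining bytes need a closer look before anything is deleted.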

Edit: Typo fixes.

Re^4: getting rid of UTF-8
by BernieC (Pilgrim) on Nov 25, 2022 at 02:57 UTC
    I'll try to get something together and paste a hex dump. But: I know that there is nothing but plain lower-128 ASCII characters {I just mentioned ISO-Latin out of habit}. It is all data that I entered, and there's no data in the CSV files that isn't something I entered. I have no idea why there's a BOM in the middle of the first record... I'll get the dump.
      OK. I've got a hex dump. Here's what the file looks like in a text editor:
      Importance,"First Name","Middle Name","Last Name","Full Name",Company,Department,"Job Title","Street (b.)","City (b.)","State (b.)","ZIP Code (b.)","Country/Region (b.)","Home Phone","Business Phone","Mobile Phone","Business Phone 2","Business Phone 3","Business Phone 4","Business Fax","Business Web Page","Street (h.)","City (h.)","State (h.)","ZIP Code (h.)","Country/Region (h.)","Home Phone 2","Home Phone 3","Home Phone 4","Home Fax","Personal Web Page","Mobile Phone 2","Mobile Phone 3","Mobile Phone 4",E-mail,"E-mail 2","E-mail 3","E-mail 4",x,y,z,w,Office,Supervisor,Assistant,Salutation,Nickname,Gender,Spouse,Birthday,Anniversary,Family,Hobbies,Specialty,Strengths,Personality,Notes,"Custom 2","Custom 3","Custom 4","Custom 5","Custom 6","Custom 7","Custom 8",Comment,Group,"Birthday Reminder On/Off","Anniversary Reminder On/Off"
      Normal,,,,"A-1 Heating and Cooling","A-1 Heating and Cooling",,,"PO Box 94",Newport,Virginia,24128,"United States",,544-7810,,,,,,,,,,,"United States",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,"953-1513
      Scott - service mgr - cell 540 357 2816","Emergency contacts",No,No
      And here's the hex dump of it
      EF BB BF 49 6D 70 6F 72 74 61 6E 63 65 2C 22 46 69 72 73 74 20 4E 61 6D 65 22 2C 22 4D 69 64 64 6C 65 20 4E 61 6D 65 22 2C 22 4C 61 73 74 20 4E 61 6D 65 22 2C 22 46 75 6C 6C 20 4E 61 6D 65 22 2C 43 6F 6D 70 61 6E 79 2C 44 65 70 61 72 74 6D 65 6E 74 2C 22 4A 6F 62 20 54 69 74 6C 65 22 2C 22 53 74 72 65 65 74 20 28 62 2E 29 22 2C 22 43 69 74 79 20 28 62 2E 29 22 2C 22 53 74 61 74 65 20 28 62 2E 29 22 2C 22 5A 49 50 20 43 6F 64 65 20 28 62 2E 29 22 2C 22 43 6F 75 6E 74 72 79 2F 52 65 67 69 6F 6E 20 28 62 2E 29 22 2C 22 48 6F 6D 65 20 50 68 6F 6E 65 22 2C 22 42 75 73 69 6E 65 73 73 20 50 68 6F 6E 65 22 2C 22 4D 6F 62 69 6C 65 20 50 68 6F 6E 65 22 2C 22 42 75 73 69 6E 65 73 73 20 50 68 6F 6E 65 20 32 22 2C 22 42 75 73 69 6E 65 73 73 20 50 68 6F 6E 65 20 33 22 2C 22 42 75 73 69 6E 65 73 73 20 50 68 6F 6E 65 20 34 22 2C 22 42 75 73 69 6E 65 73 73 20 46 61 78 22 2C 22 42 75 73 69 6E 65 73 73 20 57 65 62 20 50 61 67 65 22 2C 22 53 74 72 65 65 74 20 28 68 2E 29 22 2C 22 43 69 74 79 20 28 68 2E 29 22 2C 22 53 74 61 74 65 20 28 68 2E 29 22 2C 22 5A 49 50 20 43 6F 64 65 20 28 68 2E 29 22 2C 22 43 6F 75 6E 74 72 79 2F 52 65 67 69 6F 6E 20 28 68 2E 29 22 2C 22 48 6F 6D 65 20 50 68 6F 6E 65 20 32 22 2C 22 48 6F 6D 65 20 50 68 6F 6E 65 20 33 22 2C 22 48 6F 6D 65 20 50 68 6F 6E 65 20 34 22 2C 22 48 6F 6D 65 20 46 61 78 22 2C 22 50 65 72 73 6F 6E 61 6C 20 57 65 62 20 50 61 67 65 22 2C 22 4D 6F 62 69 6C 65 20 50 68 6F 6E 65 20 32 22 2C 22 4D 6F 62 69 6C 65 20 50 68 6F 6E 65 20 33 22 2C 22 4D 6F 62 69 6C 65 20 50 68 6F 6E 65 20 34 22 2C 45 2D 6D 61 69 6C 2C 22 45 2D 6D 61 69 6C 20 32 22 2C 22 45 2D 6D 61 69 6C 20 33 22 2C 22 45 2D 6D 61 69 6C 20 34 22 2C 78 2C 79 2C 7A 2C 77 2C 4F 66 66 69 63 65 2C 53 75 70 65 72 76 69 73 6F 72 2C 41 73 73 69 73 74 61 6E 74 2C 53 61 6C 75 74 61 74 69 6F 6E 2C 4E 69 63 6B 6E 61 6D 65 2C 47 65 6E 64 65 72 2C 53 70 6F 75 73 65 2C 42 69 72 74 68 64 61 79 2C 41 6E 6E 69 76 65 72 73 61 72 79 2C 46 61 6D 69 6C 79 2C 48 6F 62 62 69 65 73 2C 53 70 65 63 69 61 6C 74 79 2C 53 74 72 65 6E 67 74 68 73 2C 50 65 72 73 6F 6E 61 6C 69 74 79 2C 4E 6F 74 65 73 2C 22 43 75 73 74 6F 6D 20 32 22 2C 22 43 75 73 74 6F 6D 20 33 22 2C 22 43 75 73 74 6F 6D 20 34 22 2C 22 43 75 73 74 6F 6D 20 35 22 2C 22 43 75 73 74 6F 6D 20 36 22 2C 22 43 75 73 74 6F 6D 20 37 22 2C 22 43 75 73 74 6F 6D 20 38 22 2C 43 6F 6D 6D 65 6E 74 2C 47 72 6F 75 70 2C 22 42 69 72 74 68 64 61 79 20 52 65 6D 69 6E 64 65 72 20 4F 6E 2F 4F 66 66 22 2C 22 41 6E 6E 69 76 65 72 73 61 72 79 20 52 65 6D 69 6E 64 65 72 20 4F 6E 2F 4F 66 66 22 0D 0A
      4E 6F 72 6D 61 6C 2C 2C 2C 2C 22 41 2D 31 20 48 65 61 74 69 6E 67 20 61 6E 64 20 43 6F 6F 6C 69 6E 67 22 2C 22 41 2D 31 20 48 65 61 74 69 6E 67 20 61 6E 64 20 43 6F 6F 6C 69 6E 67 22 2C 2C 2C 22 50 4F 20 42 6F 78 20 39 34 22 2C 4E 65 77 70 6F 72 74 2C 56 69 72 67 69 6E 69 61 2C 32 34 31 32 38 2C 22 55 6E 69 74 65 64 20 53 74 61 74 65 73 22 2C 2C 35 34 34 2D 37 38 31 30 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 22 55 6E 69 74 65 64 20 53 74 61 74 65 73 22 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 22 EF BB BF 39 35 33 2D 31 35 31 33 0D 0A
      53 63 6F 74 74 20 2D 20 73 65 72 76 69 63 65 20 6D 67 72 20 2D 20 63 65 6C 6C 20 35 34 30 20 33 35 37 20 32 38 31 36 22 2C 22 45 6D 65 72 67 65 6E 63 79 20 20 63 6F 6E 74 61 63 74 73 22 2C 4E 6F 2C 4E 6F 0D 0A
      4E 6F 72 6D 61 6C 2C 2C 2C 2C 22 41 62 69 6E 67 64 6F 6E 20 45 71 75 69 70 6D 65 6E 74 22 2C 22
      Notice from the dump that there is another EF BB BF toward the end of the file. And: I tried to brute-force it and it didn't work!! I did the
      $line =~ s/\xef\xbb\xbf//
      and it didn't remove the characters! I'll try again...
        I did the $line =~ s/\xef\xbb\xbf// and it didn't remove the characters!

        Using the advice from kcott here to use /g, it works for me. If it really doesn't work for you, then perhaps the data you have in your Perl string is not what you think it is. See my node here for advice on how to show us the real data, in particular Devel::Peek, and make sure to provide an SSCCE that we can run to see the problem for ourselves.
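To make both points concrete, here is a small sketch using a made-up sample string rather than the real file. It first shows the effect of `/g`, then one possible (unconfirmed) way for the data to be "not what you think it is": if the bytes were decoded as UTF-8 before the substitution, each BOM becomes the single character U+FEFF and the three-byte pattern no longer matches.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Made-up sample: a BOM at the start and a second one inside a field.
my $raw = "\xEF\xBB\xBFImportance,\"\xEF\xBB\xBF953-1513\"";

# Without /g only the first match is removed.
(my $first_only = $raw) =~ s/\xEF\xBB\xBF//;

# With /g every occurrence is removed.
(my $all = $raw) =~ s/\xEF\xBB\xBF//g;

# Assumption for illustration: the data was decoded somewhere upstream.
# Each BOM is now one character, U+FEFF, not three bytes.
my $decoded = decode('UTF-8', $raw);
(my $byte_pattern = $decoded) =~ s/\xEF\xBB\xBF//g;   # matches nothing
(my $char_pattern = $decoded) =~ s/\x{FEFF}//g;       # removes both BOMs

printf "raw: %d, first only: %d, all: %d\n",
    length($raw), length($first_only), length($all);
printf "decoded: %d, byte pattern: %d, char pattern: %d\n",
    length($decoded), length($byte_pattern), length($char_pattern);
```

Comparing the lengths shows which pattern actually removed something. On the real data, `Dump($line)` from Devel::Peek would show whether the string holds raw bytes or decoded characters.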
