http://qs321.pair.com?node_id=11148363


in reply to Re: getting rid of UTF-8
in thread getting rid of UTF-8

The program is out of support, and I've used it for years. It works perfectly, except that, as you deduced, it puts that stuff in when it exports to a CSV file. I don't know how to upload the broken data. My text editor opens the file without complaint, but when I save the file the UTF-8 is all still there. I loaded it into Excel, where it loaded fine and showed no anomalies, but when I saved it from Excel all the UTF-8 stuff was still there. I can't see any pattern to why byte order marks are strewn through the file.

What should I do, either to upload something here with example problematic data, and/or to be able to just brute-force fix it?

Re^3: getting rid of UTF-8
by haukex (Archbishop) on Nov 24, 2022 at 23:04 UTC

    The issue with the sample data you posted is that it is entirely ASCII with some BOMs in it, but from your description it sounded like you could have other Latin-1 (or CP1252 or Latin-9) or UTF-8 characters in it, which you don't show.

    What should I do, either to upload something here with example problematic data

    A hex dump of the raw bytes like you showed above is fine. See also my node here.
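
    For anyone who wants to produce such a dump from Perl, here is a minimal sketch. It is not necessarily how the dump in this thread was made: hex_dump is a made-up helper and the sample string is only an illustration. For a real file, the bytes would need to be read through a '<:raw' filehandle so that no I/O layer alters them.

```perl
use strict;
use warnings;

# Return the bytes of $data as space-separated uppercase hex pairs,
# 16 bytes per output line. (hex_dump is a hypothetical helper name.)
sub hex_dump {
    my ($data) = @_;
    my @lines;
    while (length $data) {
        # 4-argument substr removes the first 16 bytes and returns them.
        my $chunk = substr $data, 0, 16, '';
        push @lines, join ' ', map { sprintf '%02X', ord $_ } split //, $chunk;
    }
    return join "\n", @lines;
}

# Illustrative sample: a UTF-8 BOM followed by plain ASCII text.
print hex_dump("\xEF\xBB\xBFImportance"), "\n";
# prints: EF BB BF 49 6D 70 6F 72 74 61 6E 63 65
```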

    and/or be able just to brute-force fix it?

    Iff your data consists entirely of a single-byte encoding like the ones I named above, and the only UTF-8 characters that appear in it are BOMs, then the regex you showed in the root node may be acceptable. However, I very much expect that if there's a BOM, then other UTF-8 characters can be present, and if those are mixed with single-byte encodings, or you've got double-encoded characters, you'll have a tough time picking that apart. But again, you'd need to show us more representative data.
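
    As a rough sketch of how that condition could be checked for the pure-ASCII case: classify_bytes below is a hypothetical helper, and it assumes the data was read as raw bytes with no decoding layer. It only tells plain ASCII plus BOMs apart from everything else; it does not try to detect Latin-1, CP1252, or double encoding.

```perl
use strict;
use warnings;

# Classify a string of raw bytes (classify_bytes is a hypothetical name):
#   'ascii'      every byte is below 0x80
#   'ascii+bom'  the only high bytes belong to UTF-8 BOMs (EF BB BF)
#   'other'      high bytes remain after all BOMs are removed
sub classify_bytes {
    my ($bytes) = @_;
    return 'ascii' unless $bytes =~ /[\x80-\xFF]/;
    (my $stripped = $bytes) =~ s/\xEF\xBB\xBF//g;
    return $stripped =~ /[\x80-\xFF]/ ? 'other' : 'ascii+bom';
}

# Illustrative input: a BOM at the start and another inside a quoted field.
print classify_bytes("\xEF\xBB\xBFa,\"\xEF\xBB\xBFb\"\r\n"), "\n";
```

    Only when the result is 'ascii' or 'ascii+bom' does blindly deleting the three BOM bytes look safe; with 'other' the remaining high bytes would need a closer look first.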

    Edit: Typo fixes.

      I'll try to get something together and paste a hex dump. But I know that there is nothing but plain lower-128 ASCII characters (I just mentioned ISO-Latin out of habit). It is all data that I entered, and there's no data in the CSV files that isn't something I entered. I have no idea why there's a BOM in the middle of the first record... I'll get the dump.
        OK. I've got a hex dump. Here's what the file looks like in a text editor:
        Importance,"First Name","Middle Name","Last Name","Full Name",Company,Department,"Job Title","Street (b.)","City (b.)","State (b.)","ZIP Code (b.)","Country/Region (b.)","Home Phone","Business Phone","Mobile Phone","Business Phone 2","Business Phone 3","Business Phone 4","Business Fax","Business Web Page","Street (h.)","City (h.)","State (h.)","ZIP Code (h.)","Country/Region (h.)","Home Phone 2","Home Phone 3","Home Phone 4","Home Fax","Personal Web Page","Mobile Phone 2","Mobile Phone 3","Mobile Phone 4",E-mail,"E-mail 2","E-mail 3","E-mail 4",x,y,z,w,Office,Supervisor,Assistant,Salutation,Nickname,Gender,Spouse,Birthday,Anniversary,Family,Hobbies,Specialty,Strengths,Personality,Notes,"Custom 2","Custom 3","Custom 4","Custom 5","Custom 6","Custom 7","Custom 8",Comment,Group,"Birthday Reminder On/Off","Anniversary Reminder On/Off"
        Normal,,,,"A-1 Heating and Cooling","A-1 Heating and Cooling",,,"PO Box 94",Newport,Virginia,24128,"United States",,544-7810,,,,,,,,,,,"United States",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,"953-1513
        Scott - service mgr - cell 540 357 2816","Emergency contacts",No,No
        And here's the hex dump of it
        EF BB BF 49 6D 70 6F 72 74 61 6E 63 65 2C 22 46 69 72 73 74 20 4E 61 6D
        65 22 2C 22 4D 69 64 64 6C 65 20 4E 61 6D 65 22 2C 22 4C 61 73 74 20
        4E 61 6D 65 22 2C 22 46 75 6C 6C 20 4E 61 6D 65 22 2C 43 6F 6D 70 61
        6E 79 2C 44 65 70 61 72 74 6D 65 6E 74 2C 22 4A 6F 62 20 54 69 74 6C
        65 22 2C 22 53 74 72 65 65 74 20 28 62 2E 29 22 2C 22 43 69 74 79 20
        28 62 2E 29 22 2C 22 53 74 61 74 65 20 28 62 2E 29 22 2C 22 5A 49 50
        20 43 6F 64 65 20 28 62 2E 29 22 2C 22 43 6F 75 6E 74 72 79 2F 52 65
        67 69 6F 6E 20 28 62 2E 29 22 2C 22 48 6F 6D 65 20 50 68 6F 6E 65 22
        2C 22 42 75 73 69 6E 65 73 73 20 50 68 6F 6E 65 22 2C 22 4D 6F 62 69
        6C 65 20 50 68 6F 6E 65 22 2C 22 42 75 73 69 6E 65 73 73 20 50 68 6F
        6E 65 20 32 22 2C 22 42 75 73 69 6E 65 73 73 20 50 68 6F 6E 65 20 33
        22 2C 22 42 75 73 69 6E 65 73 73 20 50 68 6F 6E 65 20 34 22 2C 22 42
        75 73 69 6E 65 73 73 20 46 61 78 22 2C 22 42 75 73 69 6E 65 73 73 20
        57 65 62 20 50 61 67 65 22 2C 22 53 74 72 65 65 74 20 28 68 2E 29 22
        2C 22 43 69 74 79 20 28 68 2E 29 22 2C 22 53 74 61 74 65 20 28 68 2E
        29 22 2C 22 5A 49 50 20 43 6F 64 65 20 28 68 2E 29 22 2C 22 43 6F 75
        6E 74 72 79 2F 52 65 67 69 6F 6E 20 28 68 2E 29 22 2C 22 48 6F 6D 65
        20 50 68 6F 6E 65 20 32 22 2C 22 48 6F 6D 65 20 50 68 6F 6E 65 20 33
        22 2C 22 48 6F 6D 65 20 50 68 6F 6E 65 20 34 22 2C 22 48 6F 6D 65 20
        46 61 78 22 2C 22 50 65 72 73 6F 6E 61 6C 20 57 65 62 20 50 61 67 65
        22 2C 22 4D 6F 62 69 6C 65 20 50 68 6F 6E 65 20 32 22 2C 22 4D 6F 62
        69 6C 65 20 50 68 6F 6E 65 20 33 22 2C 22 4D 6F 62 69 6C 65 20 50 68
        6F 6E 65 20 34 22 2C 45 2D 6D 61 69 6C 2C 22 45 2D 6D 61 69 6C 20 32
        22 2C 22 45 2D 6D 61 69 6C 20 33 22 2C 22 45 2D 6D 61 69 6C 20 34 22
        2C 78 2C 79 2C 7A 2C 77 2C 4F 66 66 69 63 65 2C 53 75 70 65 72 76 69
        73 6F 72 2C 41 73 73 69 73 74 61 6E 74 2C 53 61 6C 75 74 61 74 69 6F
        6E 2C 4E 69 63 6B 6E 61 6D 65 2C 47 65 6E 64 65 72 2C 53 70 6F 75 73
        65 2C 42 69 72 74 68 64 61 79 2C 41 6E 6E 69 76 65 72 73 61 72 79 2C
        46 61 6D 69 6C 79 2C 48 6F 62 62 69 65 73 2C 53 70 65 63 69 61 6C 74
        79 2C 53 74 72 65 6E 67 74 68 73 2C 50 65 72 73 6F 6E 61 6C 69 74 79
        2C 4E 6F 74 65 73 2C 22 43 75 73 74 6F 6D 20 32 22 2C 22 43 75 73 74
        6F 6D 20 33 22 2C 22 43 75 73 74 6F 6D 20 34 22 2C 22 43 75 73 74 6F
        6D 20 35 22 2C 22 43 75 73 74 6F 6D 20 36 22 2C 22 43 75 73 74 6F 6D
        20 37 22 2C 22 43 75 73 74 6F 6D 20 38 22 2C 43 6F 6D 6D 65 6E 74 2C
        47 72 6F 75 70 2C 22 42 69 72 74 68 64 61 79 20 52 65 6D 69 6E 64 65
        72 20 4F 6E 2F 4F 66 66 22 2C 22 41 6E 6E 69 76 65 72 73 61 72 79 20
        52 65 6D 69 6E 64 65 72 20 4F 6E 2F 4F 66 66 22 0D 0A 4E 6F 72 6D 61
        6C 2C 2C 2C 2C 22 41 2D 31 20 48 65 61 74 69 6E 67 20 61 6E 64 20 43
        6F 6F 6C 69 6E 67 22 2C 22 41 2D 31 20 48 65 61 74 69 6E 67 20 61 6E
        64 20 43 6F 6F 6C 69 6E 67 22 2C 2C 2C 22 50 4F 20 42 6F 78 20 39 34
        22 2C 4E 65 77 70 6F 72 74 2C 56 69 72 67 69 6E 69 61 2C 32 34 31 32
        38 2C 22 55 6E 69 74 65 64 20 53 74 61 74 65 73 22 2C 2C 35 34 34 2D
        37 38 31 30 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 22 55 6E 69 74 65 64 20
        53 74 61 74 65 73 22 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C
        2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C 2C
        22 EF BB BF 39 35 33 2D 31 35 31 33 0D 0A 53 63 6F 74 74 20 2D 20 73
        65 72 76 69 63 65 20 6D 67 72 20 2D 20 63 65 6C 6C 20 35 34 30 20 33
        35 37 20 32 38 31 36 22 2C 22 45 6D 65 72 67 65 6E 63 79 20 20 63 6F
        6E 74 61 63 74 73 22 2C 4E 6F 2C 4E 6F 0D 0A 4E 6F 72 6D 61 6C 2C 2C
        2C 2C 22 41 62 69 6E 67 64 6F 6E 20 45 71 75 69 70 6D 65 6E 74 22 2C
        22
        Notice from the dump that there is another EF BB BF toward the end of the file. I also tried to brute-force it and it didn't work! I did
        $line =~ s/\xef\xbb\xbf//
        and it didn't remove the characters! I'll try again...
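
        Two things that can make a substitution like this appear to do nothing, offered as guesses since the surrounding script isn't shown: without /g only the first BOM in each string is removed, and if the filehandle was opened with an :encoding(UTF-8) layer, each BOM arrives as the single character U+FEFF rather than the three bytes EF BB BF, so the byte pattern never matches. A minimal sketch covering both cases follows; the two helper names are invented for illustration.

```perl
use strict;
use warnings;

# Remove every UTF-8 BOM (bytes EF BB BF) from a string of raw bytes,
# e.g. data read through a '<:raw' filehandle.
sub strip_bom_bytes {
    my ($bytes) = @_;
    $bytes =~ s/\xEF\xBB\xBF//g;    # /g: all occurrences, not just the first
    return $bytes;
}

# Remove every U+FEFF from a string that was already decoded,
# e.g. data read through a '<:encoding(UTF-8)' filehandle.
sub strip_bom_chars {
    my ($text) = @_;
    $text =~ s/\x{FEFF}//g;
    return $text;
}

# Illustrative input modelled on the dump: one BOM at the start of the
# data and one just inside a quoted field.
my $sample = "\xEF\xBB\xBFImportance,\"\xEF\xBB\xBF953-1513\"";
print strip_bom_bytes($sample), "\n";
# prints: Importance,"953-1513"
```

        Which of the two applies depends on how the input was opened, so it is worth checking the open call, and also that the modified $line is what actually gets written to the output file.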