Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change
 
PerlMonks  

DBD::mysql incorrect string value

by cormanaz (Deacon)
on Apr 10, 2021 at 00:02 UTC ( [id://11131078]=perlquestion: print w/replies, xml ) Need Help??

cormanaz has asked for the wisdom of the Perl Monks concerning the following question:

Hi all. I have some tweets in Russla/Ukrainian/Bulgarian in a database on another machine. I have extracted ones I want to translate using google cloud translate, which is installed on another machine. I loaded them into an array of hashes, dumped them into a Storable file, and moved that to the other machine. I am now trying to load these into a MySQL database and an getting an error I can't figure out on the other machine. Here is the code (I set the die on $sth->execute to print the error and offending values):
use DBI; use Storable qw(retrieve); $| = 1; my $j = retrieve("/mnt/c/temp/to-translate.sto"); my $dbh = connectdb('****','****','****','****','****'); foreach my $i (0..@$j-1) { my $r; $r->{tid} = $j->[$i]->{id}; $r->{orig} = $j->[$i]->{text}; #print "$r->{tid}\t$r->{orig}\n"; insertsql($dbh,'translate',$r); } sub connectdb { # connects to mysql or PgPP my ($database,$user,$password,$driver,$server) = @_; unless ($driver) { $driver = "mysql"; } unless ($server) { $server = "hdshcresearch.asu.edu"; } my $url = "DBI:$driver:$database:$server"; unless ($user) { $user = "root"; $password = "research.HDSHC.mysql"; } my $dbh = DBI->connect( $url, $user, $password ) or die "connectdb + can't connect to mysql: $!\n"; return $dbh; } sub insertsql { my ($dbh,$table,$data,$ignore) = @_; my @qm; my @keys; my @values; my $i = -1; foreach my $k (keys %$data) { if (defined($data->{$k})) { $i++; $keys[$i] = $k; $values[$i] = $data->{$k}; $qm[$i] = '?'; } } my $keylist = join(",",@keys); my $qlist = join(",",@qm); my $sqlstatement = "insert into $table ($keylist) values ($qlist)" +; if ($ignore) { my $sqlstatement = "insert ignore into $table ($keylist) value +s ($qlist)"; } my $sth = $dbh->prepare($sqlstatement); #$sth->execute(@values) || die "putsql could not execute MySQL sta +tement: $sqlstatement $sth->errstr"; $sth->execute(@values) || die $sth->errstr. " ".join(" ",@values); $sth->finish(); return $dbh->{'mysql_insertid'}; }
The encoding on the original db is utf8, and so are the table and columns on the target db. When it gets to one particular item it croaks:

Incorrect string value: '\xF0\x9F\x98\x84 "...' for column 'orig' at row 1 530248086468063232 Нужно срочно брать на роботу. Цель для него есть 😄 "Рассекречена личность морпеха застрелившего бин Ладена"  at /home/steve/load-tweets.pl line 61.

Is the problem the emoticon? If so, how can I filter these out?

Replies are listed 'Best First'.
Re: DBD::mysql incorrect string value
by choroba (Cardinal) on Apr 10, 2021 at 07:04 UTC
    Handling of UTF-8 is broken in DBD::mysql. Use DBD::MariaDB instead (see here).

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: DBD::mysql incorrect string value
by parv (Parson) on Apr 10, 2021 at 07:07 UTC

    Is it not possible to change the table/database encoding to "utf8mb4"? Or, would that also not be enough for MySQL to deal with Unicode?

      Doh! Why didn't I think of that? Thanks, that did the trick.

        If you care about Unicode in UTF-8 to|from MySQL, I do hope that you would switch to DBD::MariaDB as choroba mentioned.

        Thanks goes to Adam H for reminding me about broken "utf8" encoding scheme of MySQL & to use "utf8mb4" instead few days ago, and all the other people through whom I found previously linked article written in 2016. Some web! 8-)

Re: DBD::mysql incorrect string value
by cormanaz (Deacon) on Apr 10, 2021 at 00:27 UTC
    Well it seems it is the emoticon because I found a proposed regex here which got me past the offending tweet. However, it croaked on a later tweet:
    Incorrect string value: '\xF3\xBE\x94\x9F \xD0...' for column 'orig' at row 1 󾔟 Кол мне в лоб! 󾔟
    
    Разыгрываем кровь (почти как настоящая!) и кол в лоб.
    С помощью этих спецэффектов образ для...  392907622149791744 at /home/steve/load-tweets.pl line 64.

    And I don't know how to modify the regex pattern to intercept that sequence. Can someone teach me so I don't have to come back here every time I need to whack another mole? Or does someone have a better pattern?

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://11131078]
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others studying the Monastery: (5)
As of 2024-04-19 14:52 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found