Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options
 
PerlMonks  

Connect SQLite with unicode directory

by IB2017 (Pilgrim)
on Nov 08, 2017 at 10:03 UTC ( [id://1202951]=perlquestion: print w/replies, xml ) Need Help??

IB2017 has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks

I've been having issues with UNICODE since I started programming in Perl/Windows. Things work, but they mostly require so much adaptation, at least for me. Today I have a new problem I wasn't able to solve: connecting to a SQLite database saved in a directory containing unicode characters. The strange thing is (in my eyes): I am able to create the database without any problem, but I fail to open/access it. In the following (non-sense) script I create 2 databases in two directories (one with and one without unicode characters) and try to access them. Creation is okay for both. Access only for the database in the directory without unicode characters. What I am not understanding?

#!/usr/bin/perl use utf8; use strict; use warnings; use DBI; #first exmple without any unicode in directory my $PathCorpusDB1="a/databaseTest1.db"; print "Creating following database $PathCorpusDB1 ...\n"; my $dbh1 = DBI->connect("dbi:SQLite:$PathCorpusDB1", "", "", { RaiseEr +ror => 1, AutoCommit => 1, PrintError => 1 }); $dbh1->do( "CREATE TABLE data ( ID INTEGER PRIMARY KEY, text, filename +)" ); $dbh1->disconnect; print "Connecting to $PathCorpusDB1\n"; $dbh1 = DBI->connect("dbi:SQLite:$PathCorpusDB1", "", "", { RaiseError + => 1, AutoCommit => 1, PrintError => 1 }); my $AllDbText_ref1 = $dbh1->selectall_arrayref("SELECT filename FROM d +ata"); $dbh1->disconnect; #second example with unicode in directory my $PathCorpusDB2="ü/databaseTest2.db"; print "Creating following database $PathCorpusDB2 ...\n"; my $dbh2 = DBI->connect("dbi:SQLite:$PathCorpusDB2", "", "", { RaiseEr +ror => 1, AutoCommit => 1, PrintError => 1 }); $dbh2->do( "CREATE TABLE data ( ID INTEGER PRIMARY KEY, text, filename +)" ); $dbh2->disconnect; print "Connecting to $PathCorpusDB2\n"; $dbh2 = DBI->connect("dbi:SQLite:$PathCorpusDB2", "", "", { RaiseError + => 1, AutoCommit => 1, PrintError => 1 }); my $AllDbText_ref2 = $dbh2->selectall_arrayref("SELECT filename FROM d +ata"); $dbh2->disconnect;

This is the error message I get

D:\MyModule> perl .\UnicodeDatabaseConnect.pl Creating following database a/databaseTest1.db ... Connecting to a/databaseTest1.db Creating following database ³/databaseTest2.db ... Connecting to ³/databaseTest2.db DBI connect('³/databaseTest2.db','',...) failed: unable to open databa +se file at .\UnicodeDatabaseConnect.pl line 27.

Replies are listed 'Best First'.
Re: Connect SQLite with unicode directory
by Corion (Patriarch) on Nov 08, 2017 at 10:15 UTC

    Filesystems in general don't know about encoding.

    Unixish filesystems (and the APIs) usually expose the filename as a binary blob, which matches well with using UTF-8 encoded filenames.

    Windows filesystems (and the APIs) usually expose the filename as Wide Characters, so if you get the filename as UTF-8, you need to translate it to Wide Characters and you also need to use the Wide APIs (CreateFileW etc) to access such files.

    As a workaround to these issues, I am a fan of Text::Unidecode (and Text::CleanFragment) to downcase characters to ASCII.

    Personally, I try to avoid non-ASCII characters in the functional parts of programs and instead use the named entities:

    my $PathCorpusDB2 = "\N{LOWER CASE LATIN LETTER U WITH DIAERESIS}/data +baseTest2.db";

    This still won't solve your problem with Umlauts in the charset though. I think that using CreateFileW() with UTF-8 encoded filenames should work, but I don't know how to tell SQLite that.

Re: Connect SQLite with unicode directory
by hippo (Bishop) on Nov 08, 2017 at 11:11 UTC

    I'm not convinced that the non-ascii character in your code is actually utf-8 (at least not as represented here on this site). Replacing this with its named character (slightly different from Corion's example) I get this test script which passes fine on Linux (having first run mkdir a ü of course). YMMV with other OSes.

    #!/usr/bin/perl use utf8; use strict; use warnings; use DBI; use Test::More; my @dbs = ('a/databaseTest1.db', "\N{LATIN SMALL LETTER U WITH DIAERESIS}/databaseTest2.db"); plan tests => 7 * @dbs; for my $db (@dbs) { dbt ($db) } exit; sub dbt { my $dbname = shift; my $dbh = DBI->connect("dbi:SQLite:$dbname", "", "", { RaiseError +=> 1, AutoCommit => 1, PrintError => 1 }); ok ($dbh, "Creating database $dbname"); ok ($dbh->do( "CREATE TABLE data ( ID INTEGER PRIMARY KEY, text, f +ilename)" ), "Created table data"); ok ($dbh->disconnect, "Disconnected from $dbname"); ok (-e $dbname, "Database file exists"); $dbh = DBI->connect("dbi:SQLite:$dbname", "", "", { RaiseError => +1, AutoCommit => 1, PrintError => 1 }); ok ($dbh, "Reconnected to database $dbname"); my $AllDbText_ref = $dbh->selectall_arrayref("SELECT filename FROM + data"); ok (defined $AllDbText_ref, "Results set obtained from $dbname"); ok ($dbh->disconnect, "Disconnected from $dbname"); }

    You'll see that I've moved the operations into a subroutine to show that there aren't any differences (and to shorten it). Here's the output:

    $ perl sqlt.pl 
    1..14
    ok 1 - Creating database a/databaseTest1.db
    ok 2 - Created table data
    ok 3 - Disconnected from a/databaseTest1.db
    ok 4 - Database file exists
    ok 5 - Reconnected to database a/databaseTest1.db
    ok 6 - Results set obtained from a/databaseTest1.db
    ok 7 - Disconnected from a/databaseTest1.db
    ok 8 - Creating database ü/databaseTest2.db
    ok 9 - Created table data
    ok 10 - Disconnected from ü/databaseTest2.db
    ok 11 - Database file exists
    ok 12 - Reconnected to database ü/databaseTest2.db
    ok 13 - Results set obtained from ü/databaseTest2.db
    ok 14 - Disconnected from ü/databaseTest2.db
    

    HTH, Hippo

Re: Connect SQLite with unicode directory
by NetWallah (Canon) on Nov 08, 2017 at 17:51 UTC
    Running hippo's test on Windows 10:
    \test>perl test-unicode-sqlite.pl 1..14 ok 1 - Creating database a/databaseTest1.db ok 2 - Created table data ok 3 - Disconnected from a/databaseTest1.db ok 4 - Database file exists ok 5 - Reconnected to database a/databaseTest1.db ok 6 - Results set obtained from a/databaseTest1.db ok 7 - Disconnected from a/databaseTest1.db ok 8 - Creating database &#8319;/databaseTest2.db ok 9 - Created table data ok 10 - Disconnected from &#8319;/databaseTest2.db not ok 11 - Database file exists # Failed test 'Database file exists' # at test-unicode-sqlite.pl line 22. ok 12 - Reconnected to database &#8319;/databaseTest2.db ok 13 - Results set obtained from &#8319;/databaseTest2.db ok 14 - Disconnected from &#8319;/databaseTest2.db # Looks like you failed 1 test of 14. test>dir ü Volume in drive C is Win10 Volume Serial Number is D27E-5B7F Directory of ...test\ü 11/08/2017 09:36 AM <DIR> . 11/08/2017 09:36 AM <DIR> .. 11/08/2017 09:36 AM 2,048 databaseTest2.db 1 File(s) 2,048 bytes 2 Dir(s) 813,772,779,520 bytes free
    So - this looks like a perl filesystem api issue, but hippo's code is able to reconnect.

                    All power corrupts, but we need electricity.

Re: Connect SQLite with unicode directory
by dasgar (Priest) on Nov 08, 2017 at 17:13 UTC

    I'm not sure if you're trying to work with Windows and/or a non-Windows OS. You mention Windows in your first sentence, but your code appears to be more for Linux (based on the first line and the file paths used). The comments below would only apply to Windows.

    I'm probably not going to be able to explain this fully and might not use the correct terminology. In Windows, there's an attempt to maintain backwards compatibility. As a result, the default filesystem API has some limits (such as no Unicode support and a max path length of about 260 characters). Most programs (including file explorer and the command prompt) use this API. And this is also the API used by most Perl modules.

    There is a second filesystem API available that will allow for Unicode characters and a significantly larger max path length. The one Perl module that I've had success in dealing with longer path names is Win32::LongPath and it does support Unicode characters in paths.

    I don't have experience using the DBI module, so I took a quick look to see if it will take a file handle instead a file path. Unless I missed it, I don't think that it does. I would suggest trying to take a look at the shortpathL function from the Win32::LongPath module. This function will attempt to return the "short path" (which I'm assuming is the path in DOS 8.3 format). You might have better luck using the "short path" of the file that has Unicode characters in its path.

Re: Connect SQLite with unicode directory
by swl (Parson) on Nov 08, 2017 at 22:05 UTC

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1202951]
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others surveying the Monastery: (3)
As of 2024-04-26 04:53 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found