note
haj
<p>Thanks for the clarifications! That the data comes from a Postgres database is relevant information: a lot of encoding and decoding happens behind the scenes once a database is part of the game. Postgres has a configurable server encoding and a configurable client encoding, and either one or both might have changed between the legacy and the current application.</p>
<p>The string <tt>�</tt> is a UTF-8-encoded version of the "Unicode replacement character" (U+FFFD). You get this from software which tries to <em>decode</em> strings as UTF-8 although they contain bytes that are not valid UTF-8, and then <em>encodes</em> the result as UTF-8. I <em>guess</em> that the decoding step gets fed with plain ISO-Latin <tt>àáâä</tt>.</p>
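<p>A minimal sketch of that round trip, using the core <tt>Encode</tt> module. The sample bytes are my assumption about what the legacy data looks like (ISO-8859-1 <tt>àáâä</tt>), not something taken from your database:</p>
<code>
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: the legacy data is ISO-8859-1, so "àáâä" is four single bytes.
my $latin1_bytes = "\xE0\xE1\xE2\xE4";

# Lenient decoding (the default when no CHECK argument is given) does not die
# on malformed input; it substitutes U+FFFD, the replacement character.
my $chars = decode('UTF-8', $latin1_bytes);

# Encoding the result again gives the bytes EF BF BD for every U+FFFD.
my $utf8_bytes = encode('UTF-8', $chars);

printf "decoded:    %vX\n", $chars;
printf "re-encoded: %vX\n", $utf8_bytes;
</code>
<p>Once this has happened the original characters are gone: every bad byte ends up as the same replacement character, so the damage cannot be undone from the stored result alone.</p>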
<p>There is a chance that the bogus decoding happens in Perl's [mod://DBD::Pg|Postgres database driver]. You can check that by setting the DBH option <tt>pg_enable_utf8</tt> to zero when connecting. Your application will then be able to examine the "raw" contents, and decode accordingly.</p>
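<p>What "decode accordingly" could look like, as a rough sketch: <tt>decode_raw</tt> is a hypothetical helper of mine, and falling back to ISO-8859-1 is an assumption about your legacy data. The connect call is only shown as a comment because it needs your DSN and credentials:</p>
<code>
use strict;
use warnings;
use Encode ();

# With a real connection, the raw bytes would come from a handle opened like
#   my $dbh = DBI->connect($dsn, $user, $pass,
#                          { pg_enable_utf8 => 0, RaiseError => 1 });
# ($dsn, $user and $pass are placeholders.)

# Hypothetical helper: try strict UTF-8 first, fall back to ISO-8859-1.
sub decode_raw {
    my ($bytes) = @_;
    my $chars = eval {
        # FB_CROAK dies on malformed input, LEAVE_SRC keeps $bytes intact.
        Encode::decode('UTF-8', $bytes,
                       Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return defined $chars ? $chars : Encode::decode('ISO-8859-1', $bytes);
}

printf "%vX\n", decode_raw("\xE0\xE1\xE2\xE4");   # Latin-1 bytes for àáâä
printf "%vX\n", decode_raw("\xC3\xA0");           # UTF-8 bytes for à
</code>
<p>Both calls return proper character strings (<tt>E0.E1.E2.E4</tt> and <tt>E0</tt>). Keep in mind that this is a heuristic: a Latin-1 byte sequence can happen to be valid UTF-8, so it is no substitute for knowing how the data was written.</p>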
<p>A convenient way to examine strings is <tt>printf</tt> with the "v" format specifier:</p>
<code>printf "%vx",$string</code>
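<p>To give an idea of what to look for, here is how a few typical cases show up. The sample strings are made up for illustration, using <tt>à</tt> as the only character:</p>
<code>
use strict;
use warnings;

# Each pair is a label and an illustrative sample string.
my @samples = (
    [ 'decoded character'    => "\xE0"             ],  # à as one character
    [ 'UTF-8 bytes'          => "\xC3\xA0"         ],  # à encoded once
    [ 'double-encoded UTF-8' => "\xC3\x83\xC2\xA0" ],  # à encoded twice
    [ 'replacement char'     => "\x{FFFD}"         ],  # information lost
);

for my $sample (@samples) {
    my ($label, $string) = @$sample;
    printf "%-22s %vx\n", $label, $string;
}
</code>
<p>This prints <tt>e0</tt>, <tt>c3.a0</tt>, <tt>c3.83.c2.a0</tt> and <tt>fffd</tt>, respectively.</p>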
<p>From there you can decide how to proceed. Probably you will need to rebuild the data with a consistent encoding.</p>
11122000
11122056