PerlMonks  

How to Determine a File's Character Code

by spickles (Scribe)
on Jul 23, 2010 at 17:13 UTC [id://851079]

spickles has asked for the wisdom of the Perl Monks concerning the following question:

I have a file simply named 'EDK' without any extension. When I open it in notepad, I see some characters clearly printed, while others are garbage. Is it at all possible to run a script against this file to determine the character code or character set in use so that I can determine the appropriate decoder? I don't know how I would open a file without an extension name? Can I open a file simply named 'C:/EDK'?

Regards,

Scott

Replies are listed 'Best First'.
Re: How to Determine a File's Character Code
by repellent (Priest) on Jul 23, 2010 at 18:11 UTC
    If you don't know anything about the file, you can only be hopeful, at best. In Unix-land, you can run file against the unknown file to get some info:
    $ file EDK

    But since you're most likely in Win32-land, with Perl, you can try to Encode::Guess which character encoding was used for the unknown file. And then (fingers crossed), Encode::Repair the file if things were not quite right.
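A minimal sketch of the Encode::Guess step, assuming the file is small enough to read into memory. The helper name `guess_name` and the command-line handling are illustrative, not from the post. With no extra suspects, Encode::Guess only considers ASCII, UTF-8 and BOM-marked UTF-16/32, so legacy 8-bit encodings have to be passed in explicitly, and even then the result is only a guess.

```perl
use strict;
use warnings;
use Encode::Guess;

# Return the name of the guessed encoding for a byte string, or undef
# when Encode::Guess cannot decide. guess_encoding() returns an encoding
# object on success and a plain error string on failure.
sub guess_name {
    my ($bytes, @suspects) = @_;
    my $decoder = guess_encoding($bytes, @suspects);
    return ref($decoder) ? $decoder->name : undef;
}

# Hypothetical usage: perl guess.pl C:/EDK
if (@ARGV) {
    my $path = shift @ARGV;
    open(my $fh, '<:raw', $path) or die "Can't open $path: $!";
    my $bytes = do { local $/; <$fh> };   # slurp the raw bytes
    close $fh;
    my $name = guess_name($bytes);
    print defined $name ? "Looks like $name\n" : "No confident guess\n";
}
```

If the guess comes back undefined, the file may simply not be text at all.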

    Sure, you can open (as in read its contents) a file without an extension name. The filename just doesn't have an extension to it.

    If, by "open", you mean have Windows figure out a program to open C:\EDK, then it'll probably fail. Windows associates programs based on the target filename's extension (and doesn't provide a default for no-extension).
      But since you're most likely in Win32-land,

      You can install file

Re: How to Determine a File's Character Code
by ww (Archbishop) on Jul 23, 2010 at 21:27 UTC

    <Picky>

    As to your second (non-Perl) question, "Can I open a file simply named 'C:/EDK'?" -- of course!

    my $file = "EDK";
    open(my $fh, '<', $file) or die "Can't open file $file: $!";

    As to my phrase "second (non-Perl) question" your remark "I don't know how I would open a file without an extension name?" is not a question.

    The point is NOT grammar-Nazi-dom; it's "be clear, accurate, precise and careful." And just BTW, were one to read your remark as "please tell me how to use Perl to open a file without an extension name" an appropriate answer would be perldoc -f open

    </Picky>
Re: How to Determine a File's Character Code
by AndyZaft (Hermit) on Jul 23, 2010 at 19:11 UTC
    As for "some characters clearly printed": if a file is binary, some of its bytes can happen to fall within the a-z, A-Z and 0-9 ranges; that doesn't mean the other bytes are "garbage". Many files have header information that indicates what the file is used for. As mentioned in the other post, 'file' on *nix can tell you something about most of them. Cygwin also has it, so you can already check files on Windows; just don't expect 100% accuracy. There are probably Windows apps that try to do the same, too.
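To illustrate the header idea, here is a rough sketch that compares the first bytes of a file against a handful of well-known signatures and reports what fraction of the bytes are printable ASCII. The function names `sniff` and `printable_ratio` and the tiny signature table are assumptions made for the example; the real `file` utility knows far more formats.

```perl
use strict;
use warnings;

# A few widely documented file signatures ("magic numbers").
my %MAGIC = (
    "\x89PNG\r\n\x1a\n" => 'PNG image',
    "PK\x03\x04"        => 'ZIP archive',
    "%PDF-"             => 'PDF document',
    "GIF8"              => 'GIF image',
);

# Return a description if the data starts with a known signature.
sub sniff {
    my ($head) = @_;
    for my $sig (keys %MAGIC) {
        return $MAGIC{$sig} if index($head, $sig) == 0;
    }
    return;
}

# Fraction of bytes that are printable ASCII, tab, CR or LF.
sub printable_ratio {
    my ($bytes) = @_;
    return 0 unless length($bytes);
    my $printable = () = $bytes =~ /[\x20-\x7e\t\r\n]/g;
    return $printable / length($bytes);
}

# Hypothetical usage: perl sniff.pl C:/EDK
if (@ARGV) {
    my $path = shift @ARGV;
    open(my $fh, '<:raw', $path) or die "Can't open $path: $!";
    read($fh, my $head, 4096);            # the first 4 KB is enough here
    close $fh;
    $head = '' unless defined $head;      # empty or unreadable file
    printf "signature: %s\n", sniff($head) // 'unknown';
    printf "printable: %.0f%%\n", 100 * printable_ratio($head);
}
```

A low printable ratio suggests a binary format rather than text in an unfamiliar encoding.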
      Thanks Gents!
Re: How to Determine a File's Character Code
by morgon (Priest) on Jul 24, 2010 at 02:14 UTC
    I know nothing about Windows, so I am just guessing here, but it seems to me that without a file extension Notepad falls back to a default encoding (latin1 or utf8 or whatever) while your file uses a different one, which is why some characters (probably umlauts and the like) appear garbled.

    You can of course only guess what the proper encoding is...

    One way to guess would be to open the file with different encodings and print the result; the one that "looks right" is probably the right one.

    In perl you can supply the encoding to use to the open, so this approach could work:

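    use strict;
    use warnings;
    use Encode;

    # Encode->encodings lists only the encodings loaded so far;
    # Encode->encodings(":all") lists every one that is available.
    my @encodings = Encode->encodings;
    binmode(STDOUT, ':encoding(UTF-8)');   # avoid "Wide character" warnings
    for my $encoding (@encodings) {
        open(my $fh, "<:encoding($encoding)", "C:/EDK") or die $!;
        my $content = do { local $/; <$fh> };
        print "encoding: $encoding:\n$content\n";
    }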
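As a possible complement to eyeballing the output, the candidate list can be narrowed first by keeping only the encodings that decode the bytes without errors. This is a sketch under assumptions: the helper name `decodes_cleanly` is made up for the example, and a clean decode does not prove the encoding is right. Single-byte encodings such as iso-8859-1 accept any byte sequence, so the test mostly helps to rule out ASCII, UTF-8 and similar strict encodings.

```perl
use strict;
use warnings;
use Encode qw(decode);

# True if the byte string decodes under the named encoding without
# any malformed or unmappable bytes.
sub decodes_cleanly {
    my ($bytes, $name) = @_;
    my $copy = $bytes;   # decode() with a CHECK value may modify its input
    return eval { decode($name, $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

# Hypothetical usage: perl candidates.pl C:/EDK
if (@ARGV) {
    my $path = shift @ARGV;
    open(my $fh, '<:raw', $path) or die "Can't open $path: $!";
    my $bytes = do { local $/; <$fh> };
    close $fh;
    for my $name (Encode->encodings) {
        print "$name\n" if decodes_cleanly($bytes, $name);
    }
}
```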
Re: How to Determine a File's Character Code
by Marshall (Canon) on Jul 24, 2010 at 05:05 UTC
    In Windows land, I would, and did, run a Google search for "file .EDK". Maybe this thing really is an "Ensoniq KT Disk Image"? There are other options and you will have to hunt. If this thing is a binary file and you are seeing binary data with occasional text strings, figuring out what it all means is a pretty tall order if you have no idea what this file is, what made it, or where it came from.

Node Type: perlquestion [id://851079]
Approved by biohisham