
Tools to assist, and thoughts:

Generate a file of all 256 characters, for viewing in your viewer of choice, to match up something like <FC> to its actual character code.

perl -e 'print map chr,0..255' > chars.all

od (octal dump): dump a file in hex format. You could copy the mystery characters into a file and run this on the file.

od -t x1 chars.all

Dump a scalar in hex format. I use this when writing programs that decode binary files.

# Return the bytes of a scalar as space-separated two-digit hex values.
sub hex_it {
  return join ' ', map { sprintf '%02x', $_ } unpack('C*', $_[0]);
}
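
As a quick illustration of the hex dump helper, here is a minimal sketch run against a made-up sample string. The sample value is an assumption chosen for the example, not data from the original question.

```perl
use strict;
use warnings;

# Same helper as above: one two-digit hex value per byte.
sub hex_it {
    return join ' ', map { sprintf '%02x', $_ } unpack('C*', $_[0]);
}

# Hypothetical sample: "caf", then byte 0xE9, then a CRLF line ending.
my $sample = "caf\xe9\r\n";
print hex_it($sample), "\n";    # prints: 63 61 66 e9 0d 0a
```

The trailing `0d 0a` is the CRLF pair, which makes stray carriage returns easy to spot.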

Data::Dumper's Useqq can be set to 1, causing dumps to be encoded like you would write them in a Perl double-quoted string; just right for developing a regex.

use Data::Dumper;
$Data::Dumper::Useqq = 1;   # escape non-printable characters in the output
print Dumper $weird;

Many of the strange characters are probably coming from people pasting text directly from MS Word, which is infamous for causing these kinds of problems. Rather than just removing the characters, you may want to paste them into Word to see what they really mean, and write your regex to translate each one to its nearest plain-text equivalent. For example,

tr{\x93\x94}{""}; # Translate MS Word SmartQuotes into regular quotes.
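
The same idea extends to the other punctuation Word tends to insert. The following is only a sketch: it assumes the input is raw Windows-1252 bytes (not decoded UTF-8), and word_to_ascii is a hypothetical name made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: replace common Windows-1252 punctuation bytes
# with their nearest ASCII equivalents. Assumes raw cp1252 input.
sub word_to_ascii {
    my ($text) = @_;
    # 0x91/0x92 = curly single quotes, 0x93/0x94 = curly double quotes,
    # 0x96/0x97 = en and em dash (both become a plain hyphen here).
    $text =~ tr{\x91\x92\x93\x94\x96\x97}{''""\-\-};
    # 0x85 = ellipsis; one byte becomes three, so tr cannot do this one.
    $text =~ s/\x85/.../g;
    return $text;
}

print word_to_ascii("\x93Hi\x94 \x91x\x92 a\x96b\x97c\x85"), "\n";
```

If the text has already been decoded to Unicode, the code points differ (for example U+201C and U+201D for the double quotes), so the byte values above would not match.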

Control-M is also known as "\r", Carriage Return, or just CR. Control-J is also known as "\n", Line Feed, or just LF. The names are left over from the old teletype days. Different systems use different characters (sometimes more than one) to end a line of text; this is called the "newline" for that system. Unix uses LF, while Windows uses CRLF. When you view Windows text on a Unix system, you see the CR that is left over after your viewer interprets the LF.
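
If the leftover CRs are the only problem, one way to deal with them is to normalize every line ending to LF before doing anything else. This is a minimal sketch; normalize_newlines is a hypothetical name, and treating a lone CR as a line ending is an assumption that may not suit every file.

```perl
use strict;
use warnings;

# Hypothetical helper: turn CRLF pairs (and any lone CR) into a single LF.
sub normalize_newlines {
    my ($text) = @_;
    $text =~ s/\r\n?/\n/g;
    return $text;
}

print normalize_newlines("one\r\ntwo\rthree\n");
```

To strip only the CR that precedes an LF and leave any other CR alone, s/\r\n/\n/g would be the narrower substitution.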

In reply to Re: Stripping out special characters by Util
in thread Stripping out special characters by AssFace
