PerlMonks  

I have a script that I have mentioned on here before. It pulls down eBay feedback and also news headlines and then superimposes that text over images from recent news articles.

I have noticed that sometimes the text that it writes out doesn't have the white background that the other text does (it uses Image::Magick's annotate call and sets the background color on the text within that call).
It will also occasionally just not have any text at all.

I've narrowed the problem down to "special" characters that are in the text that it is pulling off of its various sources.

I don't know what these characters are. The script is just scraping a web page, and I don't see the characters in the page itself, but when I view the files that it populates with the text using the less command over ssh, some examples of what I see are:
"z<FC>gige"
"anf<E4>nglich"
"Mi<DF>verst<E4>ndnisse"
"v<F6>llig"

and more importantly:
"^M^MIS GAMBLING A SIN?"

Now the characters up there are what show up when viewed via less in my ssh connection - but they aren't really there like that. It looks like less can't display the characters and shows those codes instead, not knowing what else to do.
The German text seems to be okay and when it gets put into the images as text, it will still show up as whatever letter it is supposed to be - usually with an umlaut or accent, or whatever.

But what I'm most confused about is the last one - what looks like control-M-control-M at the beginning of a line - that is the text that usually then shows up without a background (like Image::Magick is somehow breaking on that text).
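
A minimal sketch of one way to see what those bytes actually are, assuming the scraped text is already in a Perl scalar. The helper name `show_bytes` is hypothetical and not part of my script. If the `^M` that less shows is the usual caret notation for a carriage return, it should come out as byte 0D here:

```perl
use strict;
use warnings;

# Hypothetical helper: render every character outside printable ASCII
# as <XX> hex, roughly the way less does, so hidden bytes become visible.
sub show_bytes {
    my ($text) = @_;
    (my $out = $text) =~ s/([^\x20-\x7E])/sprintf('<%02X>', ord $1)/ge;
    return $out;
}

print show_bytes("\r\rIS GAMBLING A SIN?"), "\n";  # <0D><0D>IS GAMBLING A SIN?
print show_bytes("v\xF6llig"), "\n";               # v<F6>llig
```

Running the real scraped lines through something like this would at least give hex values to look up, rather than guessing from the terminal display.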

I have code in to strip out various characters - but I'm not sure how to strip those out since I don't even know what they are - looking at it in my terminal doesn't offer hope.
I know that I can add a regex to yank out anything that isn't a letter/number - but that would also take out the punctuation marks and whatnot.
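
A narrower alternative, sketched under the assumption that the troublemakers are ASCII control characters (such as a carriage return) rather than the accented letters: delete only the control range and leave letters, numbers, punctuation, and high bytes alone. `strip_controls` is a hypothetical name, and keeping tab and newline is my own choice here, not something the script requires:

```perl
use strict;
use warnings;

# Hypothetical helper: delete ASCII control characters (0x00-0x1F, 0x7F)
# except tab (0x09) and newline (0x0A). Letters, digits, punctuation,
# and high bytes such as \xF6 pass through untouched.
sub strip_controls {
    my ($text) = @_;
    $text =~ tr/\x00-\x08\x0B-\x1F\x7F//d;
    return $text;
}

print strip_controls("\r\rIS GAMBLING A SIN?"), "\n";  # IS GAMBLING A SIN?
```

Whether this cures the missing background in Image::Magick is untested; it only removes the bytes that less displays with a caret.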

Any help/suggestions would be great - thanks!

-------------------------------------------------------------------
There are some odd things afoot now, in the Villa Straylight.

In reply to Stripping out special characters by AssFace
