PerlMonks  


Hello all,

I have a script that fetches data from Google AdSense. The data is in Unicode (UTF-16, I believe). When I try to pattern match on the data, I can only match a single character; any pattern that looks for two or more characters in sequence fails.

A typical line looks like:

5/18/05     184     7       3.8%    6.14    1.13

Matching \d works, but attempting to match \d{2}, \d+\/ or anything else that catches two characters in sequence fails. I take it this is because UTF-16 uses at least two bytes per character, so the bytes of two adjacent digits are not actually next to each other.

I'm only extracting data from this Unicode text, and do not need to output Unicode. Why don't the regexps work? If they're not supposed to work, how can I convert the text to ISO-8859-1/Latin1? I tried converting with iconv, but to no avail: the output was still UTF-16 regardless of the arguments (I used -f UTF-16 -t UTF-8).
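To make the problem concrete, here is a minimal sketch of what seems to be going on and one possible way around it, using the core Encode module. It assumes the data really is UTF-16 with a byte-order mark; the sample bytes are built in the script itself rather than fetched, and the variable names are only illustrative.

<code>
use strict;
use warnings;
use Encode qw(decode encode);

# Stand-in for the downloaded report (an assumption, not real AdSense
# output): the sample line encoded as UTF-16 with a BOM.
my $line  = "5/18/05\t184\t7\t3.8%\t6.14\t1.13\n";
my $bytes = encode('UTF-16', $line);

# On the raw bytes every ASCII character is paired with a NUL byte,
# so two digits are never adjacent and \d{2} cannot match.
my $raw_match = $bytes =~ /\d{2}/ ? 1 : 0;

# Decode the bytes into Perl characters first; then the regexps
# see one character per digit.
my $text = decode('UTF-16', $bytes);
my $decoded_match = $text =~ /\d{2}/ ? 1 : 0;

# Split the decoded line on whitespace to get at the columns.
my @fields = split ' ', $text;

# If Latin-1 bytes are wanted afterwards, re-encode the decoded text.
my $latin1 = encode('ISO-8859-1', $text);

print "raw match: $raw_match, decoded match: $decoded_match\n";
print "fields: @fields\n";
</code>

When reading from a file instead, opening it with an encoding layer such as '<:encoding(UTF-16)' should do the same decoding on the fly, again assuming a BOM is present.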

Thanks in advance for your help.


In reply to Unicode and Regexps: convert or am I missing something? by newrisedesigns
