PerlMonks  

How do I regex for characters like ¾, ¼ ?

by Plankton (Vicar)
on Sep 10, 2007 at 02:28 UTC ( #637975=perlquestion )

Plankton has asked for the wisdom of the Perl Monks concerning the following question:

Dear Wise Monks,

I am processing a text file that has lines like this ...

The use of the Porter and Ale is more prevalent in England. In the United States ½ Old and ½ New Ale is usually used when this drink is called for, unless otherwise specified.
... I want to change the fraction character to &frac12; so I tried this in my script ...
$origtext =~ s/½/\&frac12;/g;
... by simply cutting and pasting the ½ into my script, but that doesn't seem to work. What should I be doing here?

Thanks

Replies are listed 'Best First'.
Re: How do I regex for characters like ¾, ¼ ?
by GrandFather (Saint) on Sep 10, 2007 at 02:52 UTC
        use strict;
        use warnings;

        open OUT, '>', 'delme1.txt';
        print OUT <<STR;
        The use of the Porter and Ale is more prevalent in England. In the United States ½ Old and ½ New Ale is usually used when this drink is called for, unless otherwise specified.
        STR
        close OUT;

        open IN, '<', 'delme1.txt';
        my $str = do {local $/; <IN>};
        close IN;
        $str =~ s/½/&frac12;/g;
        print $str;

    Prints:

    The use of the Porter and Ale is more prevalent in England. In the United States &frac12; Old and &frac12; New Ale is usually used when this drink is called for, unless otherwise specified.

    as expected using ActiveState Perl v5.8.7 built for MSWin32-x86-multi-thread under Windows XP.

    Maybe the file you are using is not the character format that you think it is?
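
One way to check that suspicion is to look at the bytes the character actually occupies. The following is only a sketch: `hex_bytes` is a hypothetical helper written for this illustration, and the two encodings shown are simply the most likely candidates for a file like this, not something the thread confirms.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render the raw bytes of a string as hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

my $half = "\x{BD}";    # U+00BD VULGAR FRACTION ONE HALF

# The same character is one byte in ISO-8859-1 but two bytes in UTF-8.
my $latin1 = hex_bytes(encode('ISO-8859-1', $half));
my $utf8   = hex_bytes(encode('UTF-8',      $half));

print "ISO-8859-1: $latin1\n";
print "UTF-8:      $utf8\n";
```

If the script is saved in one of these encodings and the data file in the other, a pasted ½ in the pattern cannot match the ½ in the data, because the byte sequences differ.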


    DWIM is Perl's answer to Gödel
      Thank you for the reply. The file I am processing is an HTML file and I am using HTML::Parser to process it. Also, I am having trouble understanding your helpful reply. I am using HTML::Parser like so ...

          my $p = HTML::Parser->new(
              api_version     => 3,
              start_h         => [\&start, "tagname, attr"],
              end_h           => [\&end, "tagname"],
              text_h          => [\&text, "dtext"],
              marked_sections => 1,
          );

          # Parse directly from file
          $p->parse_file($inputFile);

      ... so I have a sub text() that looks like this ...

          sub text {
              my($origtext, $is_cdata) = @_;
              if ( $origtext =~ /^\s*$/ ) {
                  return;
              }
              $origtext = "UNDEF" if !defined $origtext;
              $is_cdata = "UNDEF" if !defined $is_cdata;
              $origtext =~ s/ \& / \&amp; /g;
              $origtext =~ s/½/\&frac12;/g;
              print $origtext;
          }

      ... so when should I do the "local $/;" call?
        print HTML::Entities::encode( $origtext );

        You don't do the "local $/;" bit. That was just to provide a stand alone chunk of demo code using a temporary file. A sample closer to your actual issue looks like:

            use strict;
            use warnings;
            use HTML::Parser;

            open OUT, '>', 'delme1.txt';
            print OUT <<STR;
            <html><head></head>
            <body>
            <p>The use of the Porter and Ale is more prevalent in England. In the United States ½ Old and ½ New Ale is usually used when this drink is called for, unless otherwise specified.</p>
            </body>
            </html>
            STR
            close OUT;

            my $p = HTML::Parser->new(
                api_version => 3,
                text_h      => [\&text, "dtext"],
            );

            $p->parse_file ('delme1.txt');

            sub text {
                my($origtext) = @_;
                $origtext =~ s/½/\&frac12;/g;
                print $origtext;
            }

        Prints:

        The use of the Porter and Ale is more prevalent in England. In the United States &frac12; Old and &frac12; New Ale is usually used when this drink is called for, unless otherwise specified.

        but still doesn't show the problem you are experiencing. Perhaps you can modify the sample until it does show the error?


        DWIM is Perl's answer to Gödel
Re: How do I regex for characters like ¾, ¼ ?
by moritz (Cardinal) on Sep 10, 2007 at 05:51 UTC
    It might be a charset issue:

    To make it work reliably, make sure that the file's content gets Encode::decode'd into Perl's internal format, that the Perl script itself is saved as UTF-8, and that the script says use utf8;.

    When you print the text again (to the browser or a file), you have to Encode::encode it into the desired charset.
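
The decode, substitute, encode round trip described above might look roughly like this. It is a sketch, not the poster's code: `replace_half` is a hypothetical helper, the charset names are assumptions about what the input could be, and the script itself is assumed to be saved as UTF-8 so that `use utf8` reads the literal ½ correctly.

```perl
use strict;
use warnings;
use utf8;    # assumption: this script's source is saved as UTF-8
use Encode qw(decode encode);

# Hypothetical helper: takes raw bytes in $charset, returns raw bytes
# in the same charset with every ½ replaced by the &frac12; entity.
sub replace_half {
    my ($bytes, $charset) = @_;
    my $text = decode($charset, $bytes);    # bytes -> Perl characters
    $text =~ s/½/&frac12;/g;                # now ½ is one character, U+00BD
    return encode($charset, $text);         # characters -> bytes for output
}

# "\xC2\xBD" is the two-byte UTF-8 encoding of ½, as it would arrive from a file.
my $input  = "In the United States \xC2\xBD Old and \xC2\xBD New Ale";
my $output = replace_half($input, 'UTF-8');
print "$output\n";
```

The same helper handles a Latin-1 file if it is called with 'ISO-8859-1' instead, since the substitution runs on decoded characters rather than on raw bytes.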

Re: How do I regex for characters like ¾, ¼ ?
by Anonymous Monk on Sep 10, 2007 at 04:10 UTC
    Because patterns are processed as double quoted strings, the following also work:

        \t          tab                   (HT, TAB)
        \n          newline               (LF, NL)
        \r          return                (CR)
        \f          form feed             (FF)
        \a          alarm (bell)          (BEL)
        \e          escape (think troff)  (ESC)
        \033        octal char (think of a PDP-11)
        \x1B        hex char
        \x{263a}    wide hex char         (Unicode SMILEY)
        \c[         control char
        \N{name}    named char
        \l          lowercase next char (think vi)
        \u          uppercase next char (think vi)
        \L          lowercase till \E (think vi)
        \U          uppercase till \E (think vi)
        \E          end case modification (think vi)
        \Q          quote (disable) pattern metacharacters till \E

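
Applied to the question, these escapes let the pattern name the fraction characters without pasting them into the script, which sidesteps the question of how the script file itself is encoded. A minimal sketch, assuming the text has already been decoded into characters; `%entity_for` and `fractions_to_entities` are hypothetical names chosen for this illustration:

```perl
use strict;
use warnings;
use charnames ':full';    # needed for \N{...} on Perls older than 5.16

# Hypothetical lookup table: code points U+00BC..U+00BE to HTML entities.
my %entity_for = (
    "\x{BC}" => '&frac14;',
    "\x{BD}" => '&frac12;',
    "\x{BE}" => '&frac34;',
);

# Hypothetical helper: replace ¼, ½ and ¾ in one pass.
sub fractions_to_entities {
    my ($text) = @_;
    $text =~ s/([\x{BC}-\x{BE}])/$entity_for{$1}/g;
    return $text;
}

# The same single substitution, written with a hex escape ...
my $by_hex = "\x{BD} Old";
$by_hex =~ s/\x{BD}/&frac12;/g;

# ... and with a named character.
my $by_name = "\x{BD} New";
$by_name =~ s/\N{VULGAR FRACTION ONE HALF}/&frac12;/g;

print "$by_hex and $by_name\n";
print fractions_to_entities("\x{BC} cup, \x{BE} pint"), "\n";
```

If the input is still raw UTF-8 bytes, these patterns will not match, because ½ is then the two bytes C2 BD rather than the single character U+00BD.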
Re: How do I regex for characters like ¾, ¼ ?
by goibhniu (Hermit) on Sep 10, 2007 at 15:46 UTC

    Other, smarter monks than I have commented on your technical problems, but I thought I'd just add that I gave you ++ for having an example that talked about beer!


    I humbly seek wisdom.
