Looking for a Better Way to Substitute Characters (accents to HTML)

svsingh has asked for the wisdom of the Perl Monks concerning the following question:

Now that I'm more comfortable with Perl, I'm going through some of my old Pascal programs and trying to rewrite them in Perl. Partly to web-enable them, and partly for practice.

I'm working on one now that reads a text file and creates an HTML file. The text file contains some accented letters (e.g. é). My first instinct is to use a substitution to replace the accented characters with the proper HTML equivalents (e.g. &eacute;), but the text file is about 200 KB and there are eight accents that I know of. Running every line through that many substitutions seems to be a lot more overhead than I want for a dynamic page.

I'm trying to find a method that can do this for me, under the assumption that built-in methods are faster. It would also be nice to have a solution that doesn't require me to add another substitution every time a new accent shows up in the text file. I tried Perl Monks, the Black Book, the Cookbook, and a few web sites. No luck yet.

Am I missing something, or do I have to build this from scratch? If so, then is there a smarter way than individual substitutions?

Thank you.

Re: Looking for a Better Way to Substitute Characters (accents to HTML) by glivings (Scribe) on May 10, 2003 at 15:04 UTC
You'll probably find that HTML::Entities is what you want. I've used it before on larger files than yours (mine were ~2 MB) and had no issues with speed. HTML::Entities also has the advantage of handling entities that you are not familiar with. In a manual solution it would be easy to overlook an entity, which is something you don't want to happen.
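A minimal sketch of what that could look like, assuming HTML::Entities (part of the HTML-Parser distribution on CPAN) is installed. The sample string and variable names are made up for illustration, and the accented characters are written as escapes so the example does not depend on the file's encoding.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

# Illustrative input only: "cafe & creme" with e-acute and e-grave.
my $text = "caf\x{e9} & cr\x{e8}me";

# With no second argument, encode_entities() escapes control characters,
# high-bit characters, and the HTML-special characters < & > " '.
my $html = encode_entities($text);

# The optional second argument is a character-class body naming what to
# encode. Here: everything except newline and printable ASCII, so
# characters such as & and < are left untouched.
my $accents_only = encode_entities($text, '^\n\x20-\x7e');

print "$html\n";
print "$accents_only\n";
```

The second form may be the better fit if the text file already contains some HTML markup that should pass through unchanged.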
Re: Looking for a Better Way to Substitute Characters (accents to HTML) by bobn (Chaplain) on May 10, 2003 at 16:26 UTC
If performance is a concern, mod_perl can significantly increase performance by eliminating the loading of the perl executable and the compilation of the script on each hit. And if the text files don't change, the HTML can be pregenerated and served statically. Bob Niederman, http://bob-n.com
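One way to read the pregeneration idea is to rebuild the HTML only when the text file is newer than the cached copy. This is only a sketch under assumptions: the sub names needs_rebuild and regenerate are hypothetical, and the conversion shown (wrapping each line in a paragraph) is a stand-in for whatever the real script does.

```perl
use strict;
use warnings;

# True when the HTML file is missing or older than the text file.
sub needs_rebuild {
    my ($text_file, $html_file) = @_;
    return 1 unless -e $html_file;
    # Element 9 of stat() is the last-modified time.
    return (stat $text_file)[9] > (stat $html_file)[9];
}

# Placeholder conversion: the real script would also do the
# accent-to-entity substitution here.
sub regenerate {
    my ($text_file, $html_file) = @_;
    open my $in,  '<', $text_file or die "Cannot read $text_file: $!";
    open my $out, '>', $html_file or die "Cannot write $html_file: $!";
    while (my $line = <$in>) {
        chomp $line;
        print {$out} "<p>$line</p>\n";
    }
    close $out or die "Cannot close $html_file: $!";
    close $in;
    return;
}

# Usage: perl pregen.pl input.txt output.html
if (@ARGV == 2) {
    my ($text_file, $html_file) = @ARGV;
    regenerate($text_file, $html_file) if needs_rebuild($text_file, $html_file);
}
```

Run from cron or a build step, this keeps the substitution cost out of the request path entirely.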
Re: Looking for a Better Way to Substitute Characters (accents to HTML) by zentara (Archbishop) on May 11, 2003 at 14:08 UTC
I don't know if it will help, but I just saw this on comp.lang.perl.misc, concerning a similar problem.

    # by Janek Schleicher
    # favor using a module:
    use Regexp::Subst::Parallel;
    my $replaces_str = subst($string,
        qr/A/ => 'Y',
        qr/B/ => 'Z'
    );
    ######################################
    # Discussion:
    # using a hash
    my %substitute = (A => 'Y', B => 'Z');
    my $keys = join "|", keys %substitute;
    s/($keys)/$substitute{$1}/g;
    # Disadvantages are that it becomes slow
    # when there are a lot of different expressions,
    # and it can lead to problems if there are some regexp characters inside.
    ###############################################
    # It's often extremely useful
    # if you know that the matches can be matched with a more general one,
    # e.g. they are words:
    s/(\w+)/$substitute{$1} || ""/ge;
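Applied to the accent question, the hash approach might look like the sketch below. It is not from the quoted post: the %entity table and the sub name accents_to_entities are hypothetical, and quotemeta is swapped in to sidestep the regexp-character caveat mentioned in the discussion. Only core Perl is used.

```perl
use strict;
use warnings;

# Hypothetical mapping for illustration; extend it as new accents appear.
# Characters are written as escapes to avoid source-encoding issues.
my %entity = (
    "\x{e9}" => '&eacute;',
    "\x{e8}" => '&egrave;',
    "\x{e0}" => '&agrave;',
    "\x{e7}" => '&ccedil;',
);

# quotemeta() escapes any regex metacharacters in the keys. Sorting the
# keys longest first lets a longer key win over a shorter prefix.
my $pattern = join '|',
    map  { quotemeta }
    sort { length($b) <=> length($a) }
    keys %entity;
my $re = qr/($pattern)/;

# One pass over the text, whatever the number of entries in the table.
sub accents_to_entities {
    my ($text) = @_;
    $text =~ s/$re/$entity{$1}/g;
    return $text;
}

print accents_to_entities("d\x{e9}j\x{e0} vu, gar\x{e7}on"), "\n";
```

Adding a new accent is then a one-line change to the table rather than another substitution statement.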

