Extracting appropriate language text from HTML data

by UnderMine (Friar)
on May 27, 2006 at 20:42 UTC

UnderMine has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I would like your advice on how to store multi-lingual text in some HTML data. Due to other (legacy) constraints I have to fit this into a single database field.

My thoughts were along the lines of using XML tags <locale lang="xx"></locale> to mark up the language specific areas. This seems to work ok.

The code for processing this to extract a single language also needs to back out to the most appropriate language if the first choice is not available.

I have code that works, and I have included a sample to show what I mean, but it lacks elegance and requires two parsing passes over the HTML if the first-choice language is unavailable.

All suggestions and comments gratefully received, as this one has me at a loss.

Thanks
UnderMine

#!/usr/bin/perl -w
use strict;
use HTML::Parser;

my $langs={};
my $in_tag = 0;
my $tags={};
our $textout='';

my $start = sub {
    my ($tag, $attr, $text) = @_;
    our ($lang, $textout);
    if ($tag eq 'locale') {
        $langs->{$attr->{lang}}=1; # mark languages found
        if ($attr->{lang} eq $lang) {
            $textout.=$text;
            $in_tag=0; # override if already in locale tag
        } else {
            $in_tag=1;
        }
    } else {
        $textout.=$text;
    }
};

my $end = sub {
    my ($tag, $attr, $text) = @_;
    our $textout;
    if ($tag eq 'locale' and $in_tag) {
        $in_tag=0;
    } else {
        $textout.=$text;
    }
};

my $p = HTML::Parser->new(
    default_h => [ sub { $textout.=shift unless $in_tag }, 'text'],
    start_h   => [ $start , 'tagname, attr, text'],
    end_h     => [ $end, 'tagname, attr, text'],
);

# Order of preference for languages
my $acceptable = [qw{ en fr de it }];

my $data;
while (<DATA>) {
    $data.=$_;
}

$textout.='';
$langs={};
foreach our $lang (@$acceptable) {
    next if (scalar keys %$langs && !(exists $langs->{$lang}));
    $textout='';
    $p->parse($data);
    last if (scalar keys %$langs && (exists $langs->{$lang}));
}
print $textout."\n";
__DATA__
<html>
<body>
<locale lang="en">Some English</locale>
<locale lang="fr">Some French</locale>
<locale lang="de">Some German</locale>
</body>
</html>
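
One possible way to avoid the second parse, offered as a sketch rather than as the poster's method: parse once, record which language each piece of output belongs to, and choose the language only after the whole document has been seen. The function name extract_language and the segment list are assumptions made for illustration. Unlike the sample above, this version drops the <locale> tags themselves from the output and does not handle nested <locale> elements.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Sketch: single parse, language chosen afterwards.
# extract_language() is a hypothetical helper, not part of HTML::Parser.
sub extract_language {
    my ($html, @preferred) = @_;

    my @segments;   # each entry: [ $lang_or_undef, $text ]
    my %seen;       # languages found in <locale> tags
    my $current;    # language of the enclosing <locale>, undef outside one

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $attr, $text) = @_;
            if ($tag eq 'locale') {
                $current = lc($attr->{lang} // '');
                $seen{$current} = 1;
            } else {
                push @segments, [ $current, $text ];
            }
        }, 'tagname, attr, text' ],
        end_h => [ sub {
            my ($tag, $text) = @_;
            if ($tag eq 'locale') {
                undef $current;
            } else {
                push @segments, [ $current, $text ];
            }
        }, 'tagname, text' ],
        default_h => [ sub {
            my ($text) = @_;
            push @segments, [ $current, $text ]
                if defined $text && length $text;
        }, 'text' ],
    );
    $parser->parse($html);
    $parser->eof;

    # First preferred language that actually occurs in the document.
    my ($chosen) = grep { $seen{$_} } @preferred;

    # Keep language-neutral text plus text tagged with the chosen language.
    return join '', map { $_->[1] } grep {
        !defined $_->[0] || (defined $chosen && $_->[0] eq $chosen)
    } @segments;
}

my $sample = <<'HTML';
<html>
<body>
<locale lang="en">Some English</locale>
<locale lang="fr">Some French</locale>
<locale lang="de">Some German</locale>
</body>
</html>
HTML

print extract_language($sample, qw(it fr en));
```

With the sample document and the preference order it, fr, en, only the French text should remain, because Italian is never seen. The cost is holding the segment list in memory, which seems acceptable for documents that already live in a single database field.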

Replies are listed 'Best First'.
Re: Extracting appropriate language text from HTML data
by stonecolddevin (Parson) on May 27, 2006 at 23:59 UTC
    Why don't you just create separate files for each language, get the language from CGI.pm (this has some information on getting/setting the language from the header), and return the appropriate document?
    meh.
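
If the language is taken from the request as suggested here, note that the browser sends its preferences in the Accept-Language request header, which a CGI script sees as the HTTP_ACCEPT_LANGUAGE environment variable. A minimal sketch of ranking those preferences by q-value follows; accept_languages is a hypothetical helper name, and the parsing is deliberately simplified (the "*" wildcard and malformed entries are skipped).

```perl
use strict;
use warnings;

# Sketch: turn an Accept-Language header value into an ordered list of
# lower-cased language tags, highest q-value first.
sub accept_languages {
    my ($header) = @_;
    return unless defined $header;

    my @ranked;
    my $position = 0;   # keeps the original order for equal q-values
    for my $item (split /,/, $header) {
        $item =~ s/\A\s+|\s+\z//g;
        my ($tag, @params) = split /\s*;\s*/, $item;
        next unless defined $tag
            && $tag =~ /\A[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*\z/;

        my $q = 1;      # a missing q parameter means q=1
        for my $param (@params) {
            $q = $1 if $param =~ /\Aq=([01](?:\.[0-9]{0,3})?)\z/i;
        }
        push @ranked, [ lc($tag), $q, $position++ ] if $q > 0;
    }

    return map { $_->[0] }
           sort { $b->[1] <=> $a->[1] or $a->[2] <=> $b->[2] } @ranked;
}

my @wanted = accept_languages($ENV{HTTP_ACCEPT_LANGUAGE});
```

The resulting list can then be used as the order of preference when choosing which document or which language section to return.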

      The main problem is that I am trying to introduce a multilingual facility to a legacy system without having to do a complete rewrite. Adding a language field to the database sounds simple but would require a redesign of the whole system.

      There are currently about 20k documents in a mix of English, French, German and Italian in the database. Some of these will not be translated, but most will need to be in at least English, probably French, and their native language.

      Adding the tags would be straightforward, and it may even be possible to automate the update. It would also make it easy to do coverage reports and other such things.

      I did mention that the code was a sample extracted out of a large system. The full system uses I18N::LangTags and I18N::AcceptLanguage to determine the best choice, backing out using panic_languages if necessary.

      Thanks
      UnderMine
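
The fallback described in this reply can be sketched with the implicate_supers and panic_languages functions from I18N::LangTags. This is an illustration only, not the code from the full system: choose_language is a hypothetical helper, and the I18N::AcceptLanguage step is left out.

```perl
use strict;
use warnings;
use I18N::LangTags qw(implicate_supers panic_languages);

# Sketch: pick the best available language for a list of wanted tags.
# $wanted and $available are array refs of language tags.
sub choose_language {
    my ($wanted, $available) = @_;
    my %have = map { lc($_) => 1 } @$available;

    # Add super-languages, e.g. "en" after "en-GB".
    my @candidates = implicate_supers(@$wanted);

    # Last resort: languages the module considers plausible substitutes.
    push @candidates, panic_languages(@candidates);

    for my $tag (map { lc $_ } @candidates) {
        return $tag if $have{$tag};
    }
    return;   # nothing suitable found
}

my $lang = choose_language([qw(en-GB fr)], [qw(fr de it)]);
```

The returned tag could then be passed to whatever routine extracts that language from the stored HTML.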

        In case you're not aware of this, you can add 'lang=xx' attributes to both block-level and inline elements in HTML4 and later, which may or may not make parsing a bit easier.

        One question for clarification: what should the system do if your user requests, for example, French, but the source document is Italian in origin, and has more translations for some 'chunks' (for want of a better term) in EN than FR?

        i.e. chunk 1 has IT & EN translations, chunk 2 has IT, EN & FR, chunk 3 has IT only - chunk 2 would obviously return the FR version and chunk 3 the IT (as it's the only one available), but what about chunk 1? What would the user expect to see for that?
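
To make the question concrete, here is a sketch of one possible per-chunk policy, which is an assumption and not something the thread settles: use the first preferred language a chunk has, and otherwise fall back to whatever the chunk does have. The @chunks layout and pick_chunk_text are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical layout: each chunk maps a language code to its text.
my @chunks = (
    { it => 'uno', en => 'one' },                 # chunk 1: IT and EN
    { it => 'due', en => 'two', fr => 'deux' },   # chunk 2: IT, EN and FR
    { it => 'tre' },                              # chunk 3: IT only
);

# Sketch of one policy: first preferred language present in the chunk,
# otherwise any language the chunk has (sorted so the choice is stable).
sub pick_chunk_text {
    my ($chunk, @preferred) = @_;
    for my $lang (@preferred) {
        return $chunk->{$lang} if exists $chunk->{$lang};
    }
    my ($any) = sort keys %$chunk;
    return defined $any ? $chunk->{$any} : '';
}

# A user asking for French, with English as second choice.
my @result = map { pick_chunk_text($_, qw(fr en)) } @chunks;
```

Under this policy chunk 1 comes back in English, chunk 2 in French and chunk 3 in Italian, so the page mixes languages. Choosing a single language for the whole document, as the original sample does, avoids the mix at the cost of ignoring some available translations.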

Re: Extracting appropriate language text from HTML data
by ww (Archbishop) on May 28, 2006 at 00:29 UTC
    And, TIMTOWTDI, in line with dhoss's suggestion and links to docs, don't forget the feasibility of using a <DTD...
     
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
     
    fact is, I generally go the belt and suspenders route, following that with
    <html lang="en">
    and, while I can't find ref just now, believe you might be able to send header and DTD or <html lang="en"> and wait for response. IIRC, browser is supposed to reply with preferences.
