Extracting appropriate language text from HTML data

UnderMine has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I would like your advice on how to store multi-lingual text in some HTML data. Due to other (legacy) constraints I have to fit this into a single database field.

My thoughts were along the lines of using XML tags <locale lang="xx"></locale> to mark up the language specific areas. This seems to work ok.

The code that extracts a single language also needs to fall back to the most appropriate language if the first choice is not available.

I have code that works, and I have included a sample to show what I mean, but it lacks elegance and requires two parsing passes over the HTML if the first-choice language is unavailable.

All suggestions and comments gratefully received, as this one has me at a loss.

Thanks
UnderMine

#!/usr/bin/perl -w

use strict;
use HTML::Parser;

my $langs={};
my $in_tag = 0;
my $tags={};
our $textout='';

my $start = sub {
    my ($tag, $attr, $text) = @_;
    our ($lang, $textout);
    if ($tag eq 'locale') {
        $langs->{$attr->{lang}}=1; # mark languages found
        if ($attr->{lang} eq $lang) {
            $textout.=$text;
            $in_tag=0; # override if already in locale tag
        } else {
            $in_tag=1;
        }
    } else {
        $textout.=$text;
    }
};

my $end = sub {
    my ($tag, $attr, $text) = @_;
    our $textout;
    if ($tag eq 'locale' and $in_tag) {
        $in_tag=0;
    } else {
        $textout.=$text;
    }
};


my $p = HTML::Parser->new(
   default_h => [ sub { $textout.=shift unless $in_tag }, 'text'],
   start_h   => [ $start , 'tagname, attr, text'],
   end_h     => [ $end, 'tagname, attr, text'],
);


# Order of preference for languages
my $acceptable = [qw{ en fr de it }];

my $data;
while (<DATA>) {
  $data.=$_;
}

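$textout='';
$langs={};
foreach our $lang (@$acceptable) {
   # once a pass has recorded the languages present, skip the missing ones
   next if (scalar keys %$langs && !(exists $langs->{$lang}));
   $textout='';
   $p->parse($data);
   $p->eof; # flush buffered text so the next pass starts from a clean parser
   last if (scalar keys %$langs && (exists $langs->{$lang}));
}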

print $textout."\n";

__DATA__
<html>
<body>
<locale lang="en">Some English</locale>
<locale lang="fr">Some French</locale>
<locale lang="de">Some German</locale>
</body>
</html>
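A sketch of one way to avoid the second pass: walk the data once, remember every locale block together with its language, and only then pick the first preferred language that was actually seen. To stay dependency-free this swaps HTML::Parser for a plain regex split, which assumes locale elements are never nested and always carry a double-quoted lang attribute. `extract_language` is a hypothetical name, and unlike the sample above it drops the locale tags themselves from the output.

```perl
use strict;
use warnings;

# Hypothetical helper: a single pass over the data, then one selection step.
sub extract_language {
    my ($data, @preferred) = @_;
    my (@parts, %seen);

    # The capture group makes split return the locale blocks as well as
    # the shared markup that sits between them.
    for my $piece (split /(<locale\s+lang="[^"]*"\s*>.*?<\/locale>)/s, $data) {
        if ($piece =~ /\A<locale\s+lang="([^"]*)"\s*>(.*)<\/locale>\z/s) {
            $seen{$1} = 1;                       # mark languages found
            push @parts, { lang => $1, text => $2 };
        }
        else {
            push @parts, { text => $piece };     # shared by every language
        }
    }

    # First preferred language that really occurs in the document.
    my ($chosen) = grep { $seen{$_} } @preferred;

    return join '', map {
          !defined $_->{lang}                        ? $_->{text}
        : (defined $chosen && $_->{lang} eq $chosen) ? $_->{text}
        :                                              ''
    } @parts;
}

my $sample = qq{<p><locale lang="fr">Bonjour</locale><locale lang="de">Hallo</locale></p>};
print extract_language($sample, qw(en fr de it)), "\n";
```

If none of the preferred languages is present, this version returns only the shared markup; whether that is the right behaviour depends on the surrounding system.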


Replies are listed 'Best First'.
Re: Extracting appropriate language text from HTML data by stonecolddevin (Parson) on May 27, 2006 at 23:59 UTC
Why don't you just create separate files for each language, get the language from CGI.pm (this has some information on getting/setting the language from the header), and return the appropriate document? meh.
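For the header route: a CGI script sees the browser's Accept-Language request header as $ENV{HTTP_ACCEPT_LANGUAGE}. A minimal hand-rolled sketch of turning that string into an ordered list follows. `accepted_languages` is a hypothetical helper, not part of CGI.pm, and it deliberately ignores the `*` wildcard and any entry with q=0.

```perl
use strict;
use warnings;

# Hypothetical helper: order the tags in an Accept-Language value by q-value.
sub accepted_languages {
    my ($header) = @_;
    $header = '' unless defined $header;

    my @ranked;
    my $index = 0;
    for my $item (split /,/, $header) {
        $item =~ s/\s+//g;                   # whitespace carries no meaning here
        my ($tag, @params) = split /;/, $item;
        next unless defined $tag && length $tag && $tag ne '*';

        my $q = 1;                           # quality defaults to 1
        for my $param (@params) {
            $q = $1 if $param =~ /\Aq=(\d(?:\.\d+)?)\z/i;
        }
        next if $q <= 0;                     # q=0 means "not acceptable"

        push @ranked, [ lc $tag, $q, $index++ ];
    }

    # Highest quality first; header order breaks ties.
    return map { $_->[0] }
           sort { $b->[1] <=> $a->[1] || $a->[2] <=> $b->[2] } @ranked;
}

print join(' ', accepted_languages($ENV{HTTP_ACCEPT_LANGUAGE})), "\n";
```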
Re^2: Extracting appropriate language text from HTML data by UnderMine (Friar) on May 28, 2006 at 00:24 UTC
The main problem is that I am trying to introduce a multilingual facility to a legacy system without having to do a complete rewrite. Adding a language field to the database sounds simple but would require a redesign of the whole system. There are currently about 20k documents in the database, variously in English, French, German or Italian. Some of these will not be translated, but most will need to be in at least English, probably French, and their native language. Adding the tags would be straightforward, and it may even be possible to automate the update. It would also make it easy to do coverage reports and other such things. I did mention that the code was a sample extracted out of a large system. The full system uses I18N::LangTags and I18N::AcceptLanguage to determine the best choice, falling back via panic_languages if necessary. Thanks, UnderMine
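For readers who have not met those functions, a small sketch of how such a fallback list might be assembled with I18N::LangTags alone. `fallback_order` is a hypothetical wrapper, not code from the system described here, and the exact languages that panic_languages suggests depend on the module's built-in tables.

```perl
use strict;
use warnings;
use I18N::LangTags qw(implicate_supers panic_languages);

# Hypothetical wrapper: the user's tags, their super-languages,
# then last-resort guesses, with duplicates removed.
sub fallback_order {
    my @wanted = implicate_supers(@_);   # e.g. fr-ca is followed by fr
    my %seen;
    return grep { !$seen{$_}++ } @wanted, panic_languages(@wanted);
}

print join(' ', fallback_order('fr-ca')), "\n";
```

The resulting list could then be checked, in order, against the languages actually present in a document.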
Re^3: Extracting appropriate language text from HTML data by john_oshea (Priest) on May 28, 2006 at 15:06 UTC
In case you're not aware of this, you can add 'lang=xx' attributes to both block-level and inline elements in HTML4 and later, which may or may not make parsing a bit easier. One question for clarification: what should the system do if your user requests, for example, French, but the source document is Italian in origin, and has more translations for some 'chunks' (for want of a better term) in EN than FR? i.e. chunk 1 has IT & EN translations, chunk 2 has IT, EN & FR, chunk 3 has IT only - chunk 2 would obviously return the FR version and chunk 3 the IT (as it's the only one available), but what about chunk 1? What would the user expect to see for that?
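To make the chunk question concrete, here is a sketch of one possible policy, not a statement of what the system does: try the user's preferences in order for each chunk, and use the document's original language as the last resort. `pick_translation`, the preference list (fr, then en) and the sample texts are all assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: choose one translation for a single chunk.
# $chunk maps language code => text; $prefs is an ordered array ref;
# $origin is the language the document was written in.
sub pick_translation {
    my ($chunk, $prefs, $origin) = @_;
    for my $lang (@$prefs, $origin) {
        next unless defined $lang;
        return ($lang, $chunk->{$lang}) if exists $chunk->{$lang};
    }
    return;    # nothing usable for this chunk
}

my @chunks = (
    { it => 'Primo',   en => 'First' },
    { it => 'Secondo', en => 'Second', fr => 'Deuxieme' },
    { it => 'Terzo' },
);

for my $chunk (@chunks) {
    my ($lang, $text) = pick_translation($chunk, [qw(fr en)], 'it');
    print "$lang: $text\n";
}
```

Under this policy chunk 1 comes back in EN, chunk 2 in FR and chunk 3 in IT, so the page may mix languages; a stricter policy would pick one language for the whole document instead.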
Re^4: Extracting appropriate language text from HTML data by UnderMine (Friar) on May 28, 2006 at 21:55 UTC
Re^5: Extracting appropriate language text from HTML data by john_oshea (Priest) on May 29, 2006 at 12:15 UTC
Re^4: Extracting appropriate language text from HTML data by UnderMine (Friar) on May 29, 2006 at 16:01 UTC
Re: Extracting appropriate language text from HTML data by ww (Archbishop) on May 28, 2006 at 00:29 UTC
And, TIMTOWTDI: in line with dhoss's suggestion and links to docs, don't forget the feasibility of using a DTD: `<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">`. Fact is, I generally go the belt-and-suspenders route, following that with `<html lang="en">`, and, while I can't find a ref just now, I believe you might be able to send the header and DTD or <html lang="en"> and wait for a response. IIRC, the browser is supposed to reply with its preferences.


Don't ask to ask, just ask