Beefy Boxes and Bandwidth Generously Provided by pair Networks
XP is just a number
 
PerlMonks  

Parsing HTML and Inserting JavaScript/HTML into Documents

by hackdaddy (Hermit)
on Oct 13, 2005 at 18:11 UTC ( [id://499980]=perlquestion: print w/replies, xml ) Need Help??

hackdaddy has asked for the wisdom of the Perl Monks concerning the following question:

I am working on a project where I am parsing HTML documents and inserting JavaScript/HTML into the document. I tried to use HTML::TreeBuilder, but it could not support languages other than English. Non-English documents would have garbage characters in the document when it was saved.

I am open to doing this in C++ or Perl.

Does anyone have any experience in creating IE/FireFox plugins? I would like to either have this working as a proxy or browser plugin.

Any assistance is greatly appreciated. Thanks.

Replies are listed 'Best First'.
Re: Parsing HTML and Inserting JavaScript/HTML into Documents
by fizbin (Chaplain) on Oct 13, 2005 at 18:36 UTC
    Your complaint about HTML::TreeBuilder sounds like you were getting tripped up by character set and unicode/non-unicode issues - this shouldn't be an HTML::TreeBuilder problem per se, and I'd encourage you to revisit your solution, possibly posting a portion to perlmonks that adds almost nothing and giving us the URL of a sample in the target language which ends up with "garbage" characters when you process it.

    Hopefully you can do that without revealing your proprietary HTML/Javascript.

    In perl, what you're proposing would be much, much simpler as an HTTP proxy.

    Incidentally, I'll note that I have some concern as to why you're wanting to do this. You wouldn't be trying to hijack the browser against the wishes of the desktop user, would you?

    --
    @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/
      Thanks, fizbin, for your reply.

      My intentions are not malicious here. There are no designs to hijack the browser or web experience. Although, now I see, after considering your response, that there would be security issues with this type of application.

      Here is the code snippet from one of my “experiments” in parsing with the HTML::TreeBuilder module.
      #!perl -w use HTML::TreeBuilder; use diagnostics; use strict; my $root = HTML::TreeBuilder->new; $root->parse_file('sample_document.htm') || die $!; my @paras = $root->find_by_tag_name('p'); foreach my $h (@paras) { foreach my $item_r ($h->content_refs_list) { next if ref $$item_r; ### proprietary JavaScript/HMTL inserted with substitution } } # end foreach print $root->as_HTML;
        Okay. I tried your sample with ActiveState's 5.6.1 perl and a russian page I found through google and got garbage. I tried the same with perl 5.8.6 (cygwin) and also got garbage, but got this helpful warning message:
        Parsing of undecoded UTF-8 will give garbage when decoding entities at + /usr/lib/perl5/site_perl/5.8/cygwin/HTML/Parser.pm line 104.
        For reference, the document I was using was http://www.ras.ru/about.aspx?_Language=ru.

        Now, this indeed looks like character set issues. Namely, the document I had was encoded in utf8, but perl assumed it was encoded in iso-latin-1. So, I modified the script to assume that the document was encoded in utf8:

        #!perl -w use HTML::TreeBuilder; use diagnostics; use strict; my $root = HTML::TreeBuilder->new; open(MYFILE, '<:utf8', 'sample_document.htm'); while (<MYFILE>) {$root->parse($_);} $root->eof(); my @paras = $root->find_by_tag_name('p'); foreach my $h (@paras) { foreach my $item_r ($h->content_refs_list) { next if ref $$item_r; ### proprietary JavaScript/HMTL inserted with substitu +tion } } # end foreach print $root->as_HTML;
        And then when I ran it, I got a document that looked very different from what went in, but looked identical in a web browser. So this is the solution for utf8 documents.

        But what about in general? After all, you can't assume that all incoming documents will be utf-8. Well, in general you won't be working from the file system, you'll be pulling stuff via http. The nice thing about that is that with http you're able to determine the content type from the headers, usually.

        After working on it a bit, I have this that is successful in general, but requires perl 5.8. It should be easy to rework into an HTTP proxy using the HTTP::Proxy module (it looks like more code than it really is, since perltidy tends to put in excessive spaces):

        use LWP::UserAgent; use HTML::Parser; use HTML::TreeBuilder; use Encode; use strict; my $ua = LWP::UserAgent->new; $ua->timeout(10); $ua->env_proxy; my $charset = undef; sub set_charset_from_content_type { if ( $_[0] =~ /.*; charset=(\S+)/ ) { $charset ||= $1; } } # This parser is active only until we get the charset my $mini_parser = HTML::Parser->new( api_version => 3, start_h => [ sub { $_[0] eq 'meta' and $_[1]->{'http-equiv'} and lc( $_[1]->{'http-equiv'} ) eq 'content-type' and set_charset_from_content_type( $_[1]->{'content'} ); }, "tagname, attr" ], end_h => [ sub { $_[0] eq 'head' and do { $charset ||= "iso-8859-1" } }, "tagname" ] ); # This doesn't do what you think it does - it does something # strange; see the HTML::Parser documentation $mini_parser->utf8_mode(1); my $root = HTML::TreeBuilder->new; my $isfirst = 1; my $unencoded_buffer = ''; my $result = ''; sub process_lwp_response { my ( $chunk, $resp_object ) = @_; $unencoded_buffer .= $chunk; if ( !$charset ) { if ($isfirst) { $isfirst = 0; set_charset_from_content_type( $resp_object->header('Content-Type') ); } $mini_parser->parse($chunk); } if ($charset) { $mini_parser = undef; $root->parse( decode( $charset, $unencoded_buffer, Encode::FB_ +QUIET ) ); } } my $targeturl = 'http://www.ras.ru/about.aspx?_Language=ru'; # $targeturl = shift; my $response = $ua->get( $targeturl, ':content_cb' => \&process_lwp_re +sponse ); if ( $response->is_success ) { $root->eof(); # original code my @paras = $root->find_by_tag_name('p'); foreach my $h (@paras) { foreach my $item_r ( $h->content_refs_list ) { next if ref $$item_r; ### proprietary JavaScript/HMTL inserted with substitution } } # end foreach print $root->as_HTML; } else { die $response->status_line; }
        Update: To work completely properly, this really needs the HTML::Parser patch I mention below. However, that's an HTML::Parser bug; this code would be fine if HTML::Parser behaved better in utf-8 environments.
        --
        @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/
      Remember when I said that it shouldn't be an HTML::TreeBuilder problem per se? Well, it's not exactly, but it turns out that HTML::Parser, which HTML::TreeBuilder uses, has an issue with characters whose utf-8 expansions include the character 0xA0. Why? Well, internal to the parser it's expanding the string into a bunch of UTF-8 bytes and then parsing as it used to before unicode came to the perl world. Unfortunately, at certain points it calls a function to skip "space characters" - and in latin1, character 0xA0 is a space. This leads to it skip part of a utf8 character, and pass along partial/truncated utf8 characters to other perl functions, which is bad.

      There are two ways around this. One is to patch the C source for HTML::Parser - the file you need to change is hctype.h and the code you need to change is:

      $ diff hctype.h.orig hctype.h 42c42 < 0x01, 0x78, 0x78, 0x78, 0x78, 0x78, 0x78, 0x78, /* 160 - 167 */ --- > 0x78, 0x78, 0x78, 0x78, 0x78, 0x78, 0x78, 0x78, /* 160 - 167 */
      The other way is to change the section that decodes incoming text into perl's representation so that you never pass anything to HTML::Parser that might contain a 0xA0 when represented in utf8; for example, in my LWP::UserAgent sample, you would do this:
      if ($charset) { $mini_parser = undef; my $decoded = decode($charset, $unencoded_buffer, Encode::FB_QUIET +); $decoded =~ s/([\x80-\x{FFFF}])/sprintf('&#x%02X;',ord($1))/ge; $root->parse($decoded); }
      Technically, that character range is larger than you need, but removing every possible character with an 0xA0 expansion in UTF-8, and nothing else, is an annoying task.

      I'll be filing a bug report with the maintainer of HTML::Parser.

      --
      @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/
Re: Parsing HTML and Inserting JavaScript/HTML into Documents
by fizbin (Chaplain) on Oct 16, 2005 at 20:06 UTC
    I've written up an extended example of how to do this and placed it at node 500607.
    --
    @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://499980]
Approved by 5mi11er
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others having an uproarious good time at the Monastery: (6)
As of 2024-03-28 14:52 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found