http://qs321.pair.com?node_id=1176832

dimitarsh1 has asked for the wisdom of the Perl Monks concerning the following question:

Hello, When reading an HTML file with XML::Twig and twig_print_outside_roots enabled, all " are replaced with &quot; and all ' are replaced with &apos;. Even if I enable the keep_encoding option, nothing changes. Does anyone have any idea? Thanks in advance, Dimitar.

Re: twig_print_outside_roots replaces " with &quote; and ' with &apos;
by Discipulus (Canon) on Nov 29, 2016 at 19:02 UTC
    Hello dimitarsh1,

    It seems to me that escaping is the default behaviour of the print methods, and also of parse and sprint. From the XML::Twig documentation:

    print ($optional_filehandle, $optional_pretty_print_style)
        Prints an entire element, including the tags, optionally to a
        $optional_filehandle, optionally with a $pretty_print_style.
        The print outputs XML data so base entities are escaped.

    print_to_file ($filename, %options)
        Prints the element to file $filename.
        options: see flush.

    sprint ($elt, $optional_no_enclosing_tag)
        Return the xml string for an entire element, including the tags.
        If the optional second argument is true then only the string
        inside the element is returned (the start and end tag for $elt
        are not). The text is XML-escaped: base entities (& and < in
        text, & < and " in attribute values) are turned into entities.

    In contrast, the documentation for the text method specifies that entities are not escaped.

    Without any code and example data I cannot tell you more.

    See also XML::Twig modify data, and I don't want that L*

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
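
    The difference between sprint and text described above can be seen in a few lines. This is only a minimal sketch on a made-up XML string (not the poster's data), assuming XML::Twig is installed and default options are used:

```perl
use strict;
use warnings;
use XML::Twig;

# Made-up sample document; the text contains two escaped base entities.
my $twig = XML::Twig->new;
$twig->parse('<doc><p>Fish &amp; chips &lt; 5</p></doc>');

my $p = $twig->root->first_child('p');

my $escaped = $p->sprint;    # XML string: tags included, & and < escaped
my $plain   = $p->text;      # text content only, entities decoded

print "sprint: $escaped\n";
print "text:   $plain\n";
```

    Here sprint should give back the markup with &amp; and &lt; intact, while text returns the plain characters.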
Re: twig_print_outside_roots replaces " with &quote; and ' with &apos;
by kcott (Archbishop) on Nov 29, 2016 at 20:14 UTC

    G'day Dimitar,

    You can write a very simple function, to convert those (and other) entities back to their original characters, if that's what you need:

    #!/usr/bin/env perl

    use strict;
    use warnings;

    {
        my %char_for_ent = qw{&quot; " &apos; ' &lt; < &gt; >};
        my $re = qr/(?x: ( @{[ join '|', keys %char_for_ent ]} ) )/;

        sub ent2char { $_[0] =~ s/$re/$char_for_ent{$1}/g; $_[0] }
    }

    print "IN: ${_}OUT: ", ent2char($_) while <DATA>;

    __DATA__
    I said, &quot;My name&apos;s Ken&quot;.
    <pre>Here&apos;s some &lt;em&gt;emphasis&lt;/em&gt;.</pre>

    Output:

    IN: I said, &quot;My name&apos;s Ken&quot;.
    OUT: I said, "My name's Ken".
    IN: <pre>Here&apos;s some &lt;em&gt;emphasis&lt;/em&gt;.</pre>
    OUT: <pre>Here's some <em>emphasis</em>.</pre>

    I suspect there may be a CPAN module with this functionality. I don't know for certain: perhaps another monk does.

    — Ken
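
    On the question of a CPAN module: HTML::Entities (distributed with HTML::Parser) provides decode_entities, which appears to cover this. A minimal sketch, assuming the module is installed; note that it decodes every entity it recognises, not just the four in the table above:

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# Same sample sentence as in the hand-rolled version above.
my $encoded = 'I said, &quot;My name&apos;s Ken&quot;.';

# In non-void context decode_entities returns a decoded copy.
my $decoded = decode_entities($encoded);

print "$decoded\n";
```

    Because it also turns &lt; and &amp; back into raw characters, it is probably best applied only to fragments that are not going to be parsed as markup again.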

Re: twig_print_outside_roots replaces " with &quote; and ' with &apos;
by CountZero (Bishop) on Nov 29, 2016 at 19:12 UTC
    That is actually totally correct and expected. That HTML file must not have contained raw single or double quote characters. These were represented in their HTML entity form &apos; and &quot;.

    Only when shown in your browser are these entities replaced on the screen by the usual ' and ".

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
Re: twig_print_outside_roots replaces " with &quote; and ' with &apos;
by dimitarsh1 (Novice) on Nov 30, 2016 at 15:43 UTC
    Dear all,

    Thank you for sharing your opinions on this issue and for the ideas on how to tackle it. The problem is that I have no control over what is printed outside the roots (or at least I haven't found a good method). What I mean is that for the roots I can write a handler that modifies the output in whatever way I want (e.g., what kcott suggests). I cannot do this for the content outside the roots.

    Furthermore, if I implement a handler that reads the style element (which is what is actually outside the roots, and which I don't want to deal with) and simply prints it, the output is just fine.

    Here is a sample input:

    <style type="text/css">
    @font-face {
        font-family: 'MyFont';
        src: url('http://mywebsite/fonts/MyFont.otf');
    }
    </style>
    <div>
    This is a test. Let's go. This is a brand name: 'D&G'. And this is "a test".
    </div>

    Here are the two example scripts and output:

    my $t= XML::Twig->new(
        twig_print_outside_roots => 1,
        twig_roots => {
            'div' => sub {
                my ( $t, $e ) = @_;
                print STDOUT $e->text;
                $t->purge;
            },
            'style' => sub {
                my ( $t, $e ) = @_;
                $e->print();
                $t->purge;
            },
        },
        keep_atts_order => 1,
    );

    and

    my $t= XML::Twig->new(
        twig_print_outside_roots => 1,
        twig_roots => {
            'div' => sub {
                my ( $t, $e ) = @_;
                print STDOUT $e->text;
                $t->purge;
            },
        },
        keep_atts_order => 1,
    );

    And this is the output from the first and then from the second version of the script.

    <html><head></head><body>´&#9559;&#9488;<style type="text/css">
    @font-face {
        font-family: 'MyFont';
        src: url('http://mywebsite/fonts/MyFont.otf');
    }
    </style>
    This is a test. Let's go. This is a brand name: 'D&G'. And this is "a test".
    </body></html>

    And

    <html><head></head><body>´&#9559;&#9488;<style type="text/css">
    @font-face {
        font-family: &apos;MyFont&apos;;
        src: url(&apos;http://mywebsite/fonts/MyFont.otf&apos;);
    }
    </style>
    This is a test. Let's go. This is a brand name: 'D&G'. And this is "a test".
    </body></html>

    Thanks a lot. Greetings,
    Dimitar

      What is wrong with the first version's output? It didn't convert any &, ' or " to entities, so it's working, right?
Re: twig_print_outside_roots replaces " with &quote; and ' with &apos;
by dimitarsh1 (Novice) on Nov 29, 2016 at 17:18 UTC
    Just a clarification: " is substituted by &quot; and ' is substituted by &apos;.
Re: twig_print_outside_roots replaces " with &quote; and ' with &apos;
by Anonymous Monk on Nov 29, 2016 at 23:46 UTC

    Have you tried the output_filter option set to html?

    If you can post a 20 line program which demonstrates the problem, I'll take a stab at fixing it :)

      Hello,

      I tried output_filter as well as input_filter, and neither had any effect. They apply to the roots, but not to the content outside them.

      Cheers, Dimitar.
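
      Since the filters reportedly do not reach what twig_print_outside_roots emits, one possible workaround is to capture the twig's output in a string (for example by printing to an in-memory filehandle) and then decode the quote entities inside the style blocks only. The following is a rough sketch in core Perl, not an XML::Twig feature: decode_style_blocks is a hypothetical helper, the sample string is made up, and the regex assumes style elements are neither nested nor commented out.

```perl
use strict;
use warnings;

# Only the two quote entities from the question are handled here.
my %char_for_ent = ( '&quot;' => '"', '&apos;' => "'" );

# Hypothetical helper: decode quote entities inside <style>...</style> only,
# leaving the rest of the markup untouched.
sub decode_style_blocks {
    my ($html) = @_;
    $html =~ s{(<style\b[^>]*>)(.*?)(</style>)}{
        my ( $open, $css, $close ) = ( $1, $2, $3 );
        $css =~ s/(&quot;|&apos;)/$char_for_ent{$1}/g;
        $open . $css . $close;
    }gise;
    return $html;
}

# Stand-in for output captured from the twig (made-up sample).
my $captured = <<'HTML';
<style type="text/css">
@font-face { font-family: &apos;MyFont&apos;; }
</style>
<div>Let&apos;s go.</div>
HTML

print decode_style_blocks($captured);
```

      Restricting the substitution to style blocks keeps entities elsewhere in the document escaped, which matters if the result is parsed as XML again.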