http://qs321.pair.com?node_id=1004080

gibsonca has asked for the wisdom of the Perl Monks concerning the following question:

I'd like to learn more about XML, but I have not found a tutorial that works for me; I guess it is my brick wall. What I want to do is use a Perl script to find the empty fields (no ASCII text between the tags). The line number in the file would be great. The files are quite large, but here is a simple example:

    <abc>fds </abc> ok
    <ddd></ddd> not ok
    <eee> </eee> not ok

Every time I start reading about XML or XML parsing, I get a headache. Thx.

Replies are listed 'Best First'.
Re: XML Newbie
by mirod (Canon) on Nov 16, 2012 at 05:57 UTC

    An XML::Twig version that does not load the entire XML in memory:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::Twig;

    my $t=XML::Twig->new(
        start_tag_handlers => { _all_ => \&store_line_number, },
        twig_handlers      => { _all_ => \&warn_on_empty_elt, },
    );
    $t->parsefile( "so_line_numbers.xml");

    sub store_line_number {
        my( $twig, $elt)= @_;
        $elt->set_att( '#line' => $twig->current_line);
        $elt->parent->set_att( '#not_empty') if $elt->parent;
    }

    sub warn_on_empty_elt {
        my( $twig, $elt)= @_;
        if( ! $elt->att( '#not_empty') && $elt->text !~ m{\S}) {
            print $elt->att( '#line'), "\n";
        }
        $twig->purge;
    }

    The little bit of cleverness here is that the code tracks for itself whether an element is empty, which allows it to purge the twig after each element (otherwise an enclosing element, with its children already purged, would appear to have no content and would trigger the warning).

      Hello mirod.

      I read your post and tried the script, and it does not seem to work correctly. It prints

      3
      4
      1
      
      My script is yours with the example XML added as DATA. As you can see from the DATA, the empty tags are ddd and eee (lines 3 and 4). The "1" in the output refers to the gibsonca tag. I wonder whether this is related to how Twig parses, since the documentation says:

      Remember that element handlers are called when the element is CLOSED, so if you have handlers for nested elements the inner handlers will be called first.

      So I guess that, because the inner elements have been purged, the gibsonca tag looks empty to Twig.

      If I comment out the purge, it prints only line numbers 3 and 4.

      regards.

      Update: Large XML files may have a repeating cluster that is easy to purge, for example the item tag in the case below. This may print the line number of the end tag (so not as accurate as yours), but I would like to purge like this.

        Duh! You need to set #not_empty to a true value. That will teach me to change tested code right before posting it.

        So it should be $elt->parent->set_att( '#not_empty', 1) if $elt->parent;

        Regarding the update: in the code I wrote, the twig is purged after each element. That is why you need the #not_empty attribute: within the twig handler, every single element appears empty unless it contains text.
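For reference, here is a minimal sketch of the script with that one-line fix applied. It is not mirod's exact file: the sample document is inlined with a heredoc instead of being read from so_line_numbers.xml, the handler is renamed note_empty_elt, and the line numbers are collected in @empty_lines before printing. Those three choices are assumptions made only to keep the example self-contained.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

my @empty_lines;    # start-tag line numbers of empty elements (illustrative name)

my $t = XML::Twig->new(
    start_tag_handlers => { _all_ => \&store_line_number },
    twig_handlers      => { _all_ => \&note_empty_elt },
);

# Sample document inlined for the example; parsefile() would be used for a real file.
$t->parse(<<'XML');
<gibsonca>
<abc>fds </abc> <!-- ok -->
<ddd></ddd> <!-- not ok -->
<eee> </eee> <!-- not ok -->
</gibsonca>
XML

print "$_\n" for @empty_lines;

# Called when a start tag is seen: remember its line, and mark the parent
# as non-empty because it has at least one child element.
sub store_line_number {
    my ( $twig, $elt ) = @_;
    $elt->set_att( '#line' => $twig->current_line );
    $elt->parent->set_att( '#not_empty' => 1 ) if $elt->parent;
}

# Called when an element is closed: record it if it had no child elements
# and no non-whitespace text, then free the memory used so far.
sub note_empty_elt {
    my ( $twig, $elt ) = @_;
    if ( !$elt->att('#not_empty') && $elt->text !~ m{\S} ) {
        push @empty_lines, $elt->att('#line');
    }
    $twig->purge;
}
```

With the sample document this should report lines 3 and 4 and no longer report line 1 for the gibsonca root.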

Re: XML Newbie
by rcrews (Novice) on Nov 16, 2012 at 05:08 UTC

    Assuming gibsonca.xml looks like this:

    <gibsonca>
    <abc>fds </abc> <!-- ok -->
    <ddd></ddd> <!-- not ok -->
    <eee> </eee> <!-- not ok -->
    </gibsonca>
    

    The following program will print "3" and "4".

    #!/opt/perl/bin/perl -T
    use strict;
    use warnings;
    use Carp;
    use English qw(-no_match_vars);
    use Try::Tiny;
    use XML::LibXML;
    
    our $VERSION = '0.1';
    
    my $file = 'gibsonca.xml';
    my $dom;
    
    open my $fh, '<', $file
        or croak "Can't open $file: $OS_ERROR";
    
    try {
        $dom = XML::LibXML->load_xml(
            {   IO           => $fh,
                line_numbers => 1,
            }
        );
    }
    catch {
        croak "Error parsing $file: $_";
    };
    
    close $fh
        or carp "Can't close $file: $OS_ERROR";
    
    for my $e ( $dom->findnodes('//*') ) {
    
        my $t = $e->textContent();
        $t =~ s{\A \s+ \z}{}xms;
    
        if ( $t eq q{} ) {
            print $e->line_number() . "\n";
        }
    }
    
    exit 0;
    __END__
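
One subtlety with any "is this element empty?" test that relies on Perl truthiness: the string "0" is false in Perl, so a plain boolean test would treat an element such as <n>0</n> as empty. Checking for a non-whitespace character avoids that. The helper below is only a sketch; the name is_blank and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Returns true when the text is undefined, empty, or whitespace only.
sub is_blank {
    my ($text) = @_;
    return 1 if !defined $text;
    return $text !~ /\S/ ? 1 : 0;
}

# Try it on a few sample strings, including the tricky "0" case.
for my $sample ( 'fds ', '', ' ', '0' ) {
    printf "%-6s => %s\n", "'$sample'",
        is_blank($sample) ? 'empty' : 'has text';
}
```

In the XML::LibXML program this would be applied to the result of textContent().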
    
Re: XML Newbie
by runrig (Abbot) on Nov 16, 2012 at 16:15 UTC
    Here's an example with XML::Rules. Note that getting at the underlying expat parser is undocumented, but I'm sure Jenda would be willing to add something as part of the official API :-)
    use strict;
    use warnings;
    use XML::Rules;

    my @rules = (
        gibsonca => undef,
        _default => sub {
            no warnings 'uninitialized';
            return if $_[1]->{_content} =~ /\S/;
            my $p = $_[4]{parser};
            print $p->current_line(),"\n";
            return;
        },
    );

    my $xr = XML::Rules->new(rules => \@rules);
    $xr->parse(<<XML);
    <gibsonca>
    <abc>fds </abc> <!-- ok -->
    <ddd></ddd> <!-- not ok -->
    <eee> </eee> <!-- not ok -->
    </gibsonca>
    XML
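
If relying on an undocumented route to the expat parser is a concern, the same idea can be sketched directly on XML::Parser, whose handlers receive the expat object as their first argument. Note that this is a different technique from the XML::Rules rules above: it keeps a small stack of open elements instead of using per-tag rules. The names @stack, @empty_lines and has_text are illustrative, not part of any API.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# One frame per open element: the line of its start tag and whether any
# non-whitespace text has been seen inside it (directly or in a descendant).
my @stack;
my @empty_lines;

my $parser = XML::Parser->new(
    Handlers => {
        Start => sub {
            my ( $expat, $name ) = @_;
            push @stack, { line => $expat->current_line, has_text => 0 };
        },
        Char => sub {
            my ( $expat, $text ) = @_;
            # Real text counts for every enclosing element, not just the innermost.
            if ( $text =~ /\S/ ) {
                $_->{has_text} = 1 for @stack;
            }
        },
        End => sub {
            my ( $expat, $name ) = @_;
            my $frame = pop @stack;
            push @empty_lines, $frame->{line} unless $frame->{has_text};
        },
    },
);

$parser->parse(<<'XML');
<gibsonca>
<abc>fds </abc> <!-- ok -->
<ddd></ddd> <!-- not ok -->
<eee> </eee> <!-- not ok -->
</gibsonca>
XML

print "$_\n" for @empty_lines;
```

Because only the stack of currently open elements is kept, memory use stays small even for large files; for a file on disk, parsefile() would replace parse(). For the sample document this should report lines 3 and 4.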
Re: XML Newbie
by choroba (Cardinal) on Nov 16, 2012 at 08:27 UTC
    Using XML::XSH2, a wrapper around XML::LibXML:
    open gibsonca.xml ;
    for //* {
        $t = .//text() ;
        if not($t and xsh:matches($t,'\S'))
            echo xsh:lineno(.) ;
    }
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
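
For anyone who does not have XML::XSH2 installed, roughly the same XPath-driven check can be sketched in plain XML::LibXML. This is not a translation of the XSH2 script: it assumes that "empty" means the element's text content is blank after whitespace normalization, and it puts that condition into the XPath expression itself. The sample document is inlined so the example is self-contained.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

my $xml = <<'XML';
<gibsonca>
<abc>fds </abc> <!-- ok -->
<ddd></ddd> <!-- not ok -->
<eee> </eee> <!-- not ok -->
</gibsonca>
XML

# line_numbers => 1 asks the parser to record a line number for each node.
my $dom = XML::LibXML->load_xml( string => $xml, line_numbers => 1 );

# normalize-space() strips leading and trailing whitespace from the element's
# text content; an empty result is false in XPath, so not(...) selects
# elements that contain no text at all or only whitespace.
my @empty_lines =
    map { $_->line_number } $dom->findnodes('//*[not(normalize-space())]');

print "$_\n" for @empty_lines;
```

For a real file, location => $file would replace the string option. Unlike the streaming approaches, this loads the whole document into memory, which may matter for the large files mentioned in the question.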
Re: XML Newbie
by Anonymous Monk on Nov 16, 2012 at 03:46 UTC
    Yeah, blah blah blah XML::Twig, there you go