PerlMonks  

Difficulty Mapping Data

by walkingthecow (Friar)
on Sep 07, 2010 at 20:02 UTC ( [id://859205]=perlquestion )

walkingthecow has asked for the wisdom of the Perl Monks concerning the following question:

Morning Monks!

I have been trying to figure out the differences between two XML files for a little while now, and I have gotten pretty far, but I am stuck at one point. Basically, I am comparing the files, and I am trying to make sure that all host-aliases that exist in file1 also exist in file2. Where I am stuck is that I need to make sure that if host-alias www.foo.com exists under hostid "bobjones" in file1, then the host-alias www.foo.com exists under the same hostid in file2.

More domains can exist in file2 than file1, but all domains that are in file1 must be in file2, and they must be under the same hostid.

Here is a sample of the input XML:
<host id="bobjones" root-directory=".">
  <host-alias>www.foo.com</host-alias>
  <host-alias>www.bar.com</host-alias>
  <host-alias>www.dj.com</host-alias>
</host>
And below is the code that I have. It's not finished, but it's what I have thus far:
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long;
use Pod::Usage;

my %alias_hash;
my %host_hash;
my %host_contents;
my %seen;
my $host_id;
my $dbh;
my $file1;
my $file2;

GetOptions(
    'h|help' => sub { pod2usage( { -verbose => 1, -input => \*DATA, } ); exit; },
    'm|man'  => sub { pod2usage( { -verbose => 2, -input => \*DATA, } ); exit; },
    'f1|file1=s' => \$file1,
    'f2|file2=s' => \$file2,
);
pod2usage( -verbose => 1 ) unless $file1 and $file2;

# Pass 1: mark every host id and host-alias seen in file1 with -1.
open(my $file1_handle, '<', $file1) or die "Could not open $file1 ($!)\n";
while (my $line=<$file1_handle>) {
    chomp $line;
    if ($line =~ /host id="(.*?)"/) {
        $host_id = $1;
        $host_hash{$host_id} = -1;
    }
    if ($line =~ m{<host-alias>(.*?)</host-alias>}) {
        $alias_hash{$host_id}{$1} = -1;
    }
}
close $file1_handle;

# Pass 2: increment each entry seen in file2, so entries from file1 rise above -1.
open(my $file2_handle, '<', $file2) or die "Could not open $file2 ($!)\n";
while (my $line=<$file2_handle>) {
    chomp $line;
    if ($line =~ /host id="(.*?)"/) {
        $host_id = $1;
        $host_hash{$host_id}++;
    }
    if ($line =~ m{<host-alias>(.*?)</host-alias>}) {
        $alias_hash{$host_id}{$1}++;
    }
}
close $file2_handle;    # was: close $file1_handle (already closed above)

# Anything still at -1 was in file1 but never seen in file2.
for my $k1 ( keys %host_hash ) {
    if ($host_hash{$k1} == -1) {
        print "$k1\n";
    }
}
for my $k1 ( keys %alias_hash ) {
    for my $k2 ( keys %{ $alias_hash{$k1} } ) {
        if ($alias_hash{$k1}{$k2} == -1) {
            print "$k2\n";
        }
    }
}

Replies are listed 'Best First'.
Re: Difficulty Mapping Data
by BrowserUk (Patriarch) on Sep 07, 2010 at 20:43 UTC

    If you combine the ID and alias for each combination in file 1; eg:

    %hash1 = (
        "bobjones-www.foo.com" => 1,
        "bobjones-www.bar.com" => 1,
        "bobjones-www.dj.com"  => 1,
    );

    Then, by combining the ID and alias from file 2 in the same way, you can do a direct lookup to determine whether each pairing you have just read from file 2 also exists in file 1.
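    A minimal sketch of the combined-key idea, assuming the line-oriented regexes from the original post are good enough for the real files. The helper names pair_keys and missing_pairs are hypothetical, and the direction is turned around to suit the stated goal: every pair read from file 2 is deleted from the file 1 hash, so whatever is left is a pairing that file 2 lacks. A "-" separator is ambiguous if host IDs can themselves contain hyphens; a character that cannot occur in either part would be safer.

```perl
use strict;
use warnings;

# Hypothetical helper: collect "hostid-alias" keys from XML-like text,
# reading it line by line with the same regexes as the original post.
sub pair_keys {
    my ($text) = @_;
    my ( %pairs, $host_id );
    for my $line ( split /\n/, $text ) {
        $host_id = $1 if $line =~ /host id="(.*?)"/;
        if ( defined $host_id and $line =~ m{<host-alias>(.*?)</host-alias>} ) {
            $pairs{"$host_id-$1"} = 1;
        }
    }
    return \%pairs;
}

# Return (sorted) the pairs present in file 1 but absent from file 2.
sub missing_pairs {
    my ( $text1, $text2 ) = @_;
    my %left = %{ pair_keys($text1) };
    delete @left{ keys %{ pair_keys($text2) } };    # direct lookup + removal
    return sort keys %left;
}

# Made-up sample data for illustration only.
my $file1 = <<'XML';
<host id="bobjones" root-directory=".">
  <host-alias>www.foo.com</host-alias>
  <host-alias>www.bar.com</host-alias>
</host>
XML

my $file2 = <<'XML';
<host id="bobjones" root-directory=".">
  <host-alias>www.bar.com</host-alias>
</host>
<host id="alice" root-directory=".">
  <host-alias>www.foo.com</host-alias>
</host>
XML

print "$_ is missing from file 2\n" for missing_pairs( $file1, $file2 );
```

    With this sample data, www.foo.com is reported as missing even though it appears in file 2, because there it sits under "alice" rather than "bobjones".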

    What that won't tell you is if the alias exists in file 1 under a different ID.

    If the latter is (also) a requirement--it's unclear from your spec--then instead, key the hash constructed from file 1 by the alias, with the ID as the value:

    %hash1 = (
        'www.foo.com' => 'bobjones',
        'www.bar.com' => 'bobjones',
        'www.dj.com'  => 'bobjones',
    );

    Or, if it is possible and legal for a single alias to appear under two (or more) IDs in file 1, use a secondary hash.
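    One way to read the alias-keyed suggestion is sketched below, under the assumption that "a secondary hash" means a hash of hashes (alias, then ID). The helper names host_alias_pairs and classify are hypothetical, and the parsing reuses the line-oriented regexes from the original post.

```perl
use strict;
use warnings;

# Hypothetical helper: return [host_id, alias] pairs from XML-like text.
sub host_alias_pairs {
    my ($text) = @_;
    my ( @pairs, $host_id );
    for my $line ( split /\n/, $text ) {
        $host_id = $1 if $line =~ /host id="(.*?)"/;
        push @pairs, [ $host_id, $1 ]
            if defined $host_id and $line =~ m{<host-alias>(.*?)</host-alias>};
    }
    return @pairs;
}

# Classify each pair from file 2 against an alias-keyed index of file 1.
sub classify {
    my ( $text1, $text2 ) = @_;
    my %ids_for;    # alias => { host_id => 1, ... }
    $ids_for{ $_->[1] }{ $_->[0] } = 1 for host_alias_pairs($text1);

    my @report;
    for my $pair ( host_alias_pairs($text2) ) {
        my ( $id, $alias ) = @$pair;
        if ( !exists $ids_for{$alias} ) {
            push @report, "$alias ($id): only in file 2";
        }
        elsif ( exists $ids_for{$alias}{$id} ) {
            push @report, "$alias ($id): same ID in file 1";
        }
        else {
            my $others = join ', ', sort keys %{ $ids_for{$alias} };
            push @report, "$alias ($id): in file 1 under $others";
        }
    }
    return @report;
}

# Made-up sample data for illustration only.
my $file1 = <<'XML';
<host id="bobjones" root-directory=".">
  <host-alias>www.foo.com</host-alias>
  <host-alias>www.bar.com</host-alias>
</host>
XML

my $file2 = <<'XML';
<host id="bobjones" root-directory=".">
  <host-alias>www.bar.com</host-alias>
</host>
<host id="alice" root-directory=".">
  <host-alias>www.foo.com</host-alias>
</host>
XML

print "$_\n" for classify( $file1, $file2 );
```

    The inner hash costs a little more memory than a plain alias => ID mapping, but it keeps working when the same alias legitimately appears under several IDs.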


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Difficulty Mapping Data
by ikegami (Patriarch) on Sep 07, 2010 at 20:12 UTC
    [ For reference, the OP is asking a simpler version of a question he asked previously. ]

    I am trying to make sure that all host-aliases that exist in file1 also exist in file2.

    Put differently, you want to know if there's a host-alias in file1 that doesn't exist in file2. (It's often convenient to convert questions that use "all" into questions that use "each".)

    Load up the host-aliases from file2. Store them in a hash keyed by the host-alias for ease of lookup. The value doesn't matter. (I'd use something true like 1.)

    Then iterate through the host-aliases in file1. For each host-alias you find in file1, check if it's in file2 (by checking if it's in the hash).

    You can stop iterating as soon as you find one that's missing and still satisfy your spec, but you might want to continue iterating to find all the missing host-aliases.
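    The steps above could be sketched roughly as follows. This is only an illustration: the helper names aliases and aliases_missing_from_file2 are hypothetical, the regex extraction stands in for real XML parsing, and host IDs are deliberately ignored, since this answers the "does every alias exist somewhere in file2" question.

```perl
use strict;
use warnings;

# Hypothetical helper: list every host-alias found in XML-like text.
sub aliases {
    my ($text) = @_;
    return $text =~ m{<host-alias>(.*?)</host-alias>}g;
}

# Load file2's aliases into a lookup hash, then walk file1's aliases
# and keep the ones that have no entry in that hash.
sub aliases_missing_from_file2 {
    my ( $text1, $text2 ) = @_;
    my %in_file2 = map { $_ => 1 } aliases($text2);
    return grep { !$in_file2{$_} } aliases($text1);
}

# Made-up sample data for illustration only.
my $file1 = <<'XML';
<host id="bobjones" root-directory=".">
  <host-alias>www.foo.com</host-alias>
  <host-alias>www.bar.com</host-alias>
  <host-alias>www.dj.com</host-alias>
</host>
XML

my $file2 = <<'XML';
<host id="bobjones" root-directory=".">
  <host-alias>www.foo.com</host-alias>
  <host-alias>www.bar.com</host-alias>
  <host-alias>www.extra.com</host-alias>
</host>
XML

print "$_ is in file1 but not in file2\n"
    for aliases_missing_from_file2( $file1, $file2 );
```

    Using grep collects every missing alias; to stop at the first one instead, first from the core List::Util module could replace grep.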

      My apologies if I am not understanding what you're saying. My question in this thread may not be very well worded.

      I am able to see if all host-aliases that exist in file1 also exist in file2. The problem that I now face is that I need to know if the host-alias exists under the same "host id" in both files.

      So, if host-alias www.foo.com exists under "host id" bobjones in file1, it also must exist under bobjones in file2. I hope that makes sense :)

        I am trying to make sure that all host-aliases that exist in file1 also exist in file2.

        and

        I need to know if the host-alias exists under the same "host id" in both files.

        are very different problems. Unfortunately, I've already spent too much time answering the first question and cannot help you at this time.

        if host-alias www.foo.com exists under "host id" bobjones in file1, it also must exist under bobjones in file2.

        Is the following also true?

        if host-alias www.foo.com exists under "host id" bobjones in file2, it also must exist under bobjones in file1.

Re: Difficulty Mapping Data
by dasgar (Priest) on Sep 08, 2010 at 07:08 UTC

    When I first saw the mention of XML, I was tempted to suggest using something like XML::Simple to parse the data. However, I wasn't sure if your "sample data" included all of the possible XML tags from your real data files. So that got me thinking about doing a custom parsing of the data.

    Anyways, I decided to challenge myself to see if I could come up with working code that would actually do the job without using an XML parsing module. Well, it may not be the "best" way, but the code below appears to do the job. Hopefully this rough bit of code is good enough to give you some ideas on how to do your file comparison. Enjoy!

    Sample File 1 - data1.txt

    Sample File 2 - data2.txt

    Code:

    Output:

    HostID: bobjones, Host-Alias: www.foo.com was missing from file 'data2.txt'
Re: Difficulty Mapping Data
by murugu (Curate) on Sep 08, 2010 at 14:42 UTC
    Thanks, dasgar, for the sample input. I have used XML::Twig and XPath expressions to do this.
    use strict;
    use warnings;
    use XML::Twig;

    my $twiga=XML::Twig->new();       # create the twig for data1
    $twiga->parsefile( 'data1.xml');  # build it
    my $twigb=XML::Twig->new();       # create the twig for data2
    $twigb->parsefile( 'data2.xml');  # build it

    foreach my $t ($twiga->get_xpath('//host[@id]')) {
        my $att = $t->{'att'}->{'id'};
        unless ($twigb->get_xpath("//host[\@id='$att']")) {
            print "$att is not found in data2.xml\n";
        }
        foreach my $alias ($t->findnodes("host-alias")) {
            my $web = $alias->text;
            unless ($twigb->get_xpath("//host[\@id='$att']/host-alias[string()='$web']")) {
                print "$att with $web is not found in data2.xml\n"
            }
        }
    }

    Regards,
    Murugesan Kandasamy
    use perl for(;;);
