http://qs321.pair.com?node_id=11117783

audioboxer has asked for the wisdom of the Perl Monks concerning the following question:

Hi Everyone -

I've been sitting here for 4 hours and can't figure out the logic to do this. I am parsing data from an XML file and need to output to a CSV. I can parse the data just fine, I just can't figure out how to write it out and then parse it to a csv file that would ultimately look like the desired output below. I certainly appreciate the help. I've tried arrays and hashes and got nowhere.

Desired output

"sku","File 1", "File 2", "File 6", "File 7"
"00123","www.site.com/path/to/file/file.pdf","www.site.com/path/to/file/file.pdf",,
"00124","www.site.com/path/to/file/file.pdf",,"www.site.com/path/to/file/file.pdf",
"00125",,"www.site.com/path/to/file/file.pdf",,"www.site.com/path/to/file/file.pdf"

Perl Script

use 5.010; use strict; use warnings; use utf8; use XML::LibXML; my $source_file = 'C:\Users\User\Files\perl\data\example\example.xml'; my $dom = XML::LibXML->load_xml(location => $source_file); foreach my $part ($dom->findnodes('/products/product')) { my $sku = $part->findnodes('sku'); my $download_item_name; my $download_item_name_url; foreach my $downloads($part->findnodes('downloads')){ foreach my $download_group($downloads->findnodes('group')){ foreach my $downloads_group_items($download_group->findnodes('gr +oup_items')){ foreach my $download_item($downloads_group_items->findnodes('i +tem')){ $download_item_name = $download_item->findnodes('name'); $download_item_name_url = $download_item->findnodes('url'); print $sku . " -> " . "->" . $download_item_name . "->" . $d +ownload_item_name_url . "\n"; } } } } }

XML Data

<products> <product> <sku>00123</sku> <downloads> <group> <group_items> <item> <name>File 1</name> <url>www.site.com/path/to/file/file.pdf</url> </item> <item> <name>File 2</name> <url>www.site.com/path/to/file/file.pdf</url> </item> </group_items> </group> </downloads> </product> <product> <sku>00124</sku> <downloads> <group> <group_items> <item> <name>File 1</name> <url>www.site.com/path/to/file/file.pdf</url> </item> <item> <name>File 6</name> <url>www.site.com/path/to/file/file.pdf</url> </item> </group_items> </group> </downloads> </product> <product> <sku>00125</sku> <downloads> <group> <group_items> <item> <name>File 7</name> <url>www.site.com/path/to/file/file.pdf</url> </item> <item> <name>File 2</name> <url>www.site.com/path/to/file/file.pdf</url> </item> </group_items> </group> </downloads> </product> </products>

Replies are listed 'Best First'.
Re: Logic Help - Output to csv
by haukex (Archbishop) on Jun 07, 2020 at 19:58 UTC

    Assuming that you need to figure out the columns dynamically, here's one way to do it:

    use warnings; use strict; use XML::LibXML; use Text::CSV; my $dom = XML::LibXML->load_xml(location => 'example.xml'); my %items; for my $product ($dom->findnodes('/products/product')) { my $sku = $product->findvalue('sku'); for my $downloads ($product->findnodes('downloads')) { for my $group ($downloads->findnodes('group')) { for my $group_items ($group->findnodes('group_items')) { for my $item ($group_items->findnodes('item')) { my $name = $item->findvalue('name'); my $url = $item->findvalue('url'); $items{$sku}{$name} = $url; } } } } } my @columns = sort keys %{{map {$_=>1} map {keys %$_} values %items}}; my $csv = Text::CSV->new({binary=>1, auto_diag=>2, eol=>$/, always_quote=>1 }); $csv->print(select, ["sku", @columns]); for my $sku (sort keys %items) { $csv->print(select, [ $sku, map { $items{$sku}{$_} } @columns ]); }

    Note that this does assume that SKUs and the item names per SKU are unique (Update: in other words, that later <item>s don't overwrite earlier ones in the %items hash). Also, note that unless you need to separate out the <downloads>, <group>, or <group_items> elements, you don't need all those nested loops, this produces the same output:

    for my $product ($dom->findnodes('/products/product')) { my $sku = $product->findvalue('sku'); for my $item ( $product->findnodes( 'downloads/group/group_items/item') ) { my $name = $item->findvalue('name'); my $url = $item->findvalue('url'); $items{$sku}{$name} = $url; } }
Re: Logic Help - Output to csv
by haukex (Archbishop) on Jun 07, 2020 at 22:11 UTC

    Assuming none of the downloads, group, or group_items elements can be repeated in a single product, here's a slightly simpler way with XML::Rules (but XML::LibXML would probably still be my personally preferred solution):

    use warnings; use strict; use XML::Rules; use Text::CSV; my $parser = XML::Rules->new( stripspaces => 3|4, rules => [ 'products|downloads|group|group_items' => 'pass', product => 'by sku', 'sku|name|url' => 'content', item => sub { $_[1]->{name} => $_[1]->{url} }, _default => sub { die "Unknown tag $_[0]" } ], ); my $itms = $parser->parse_file('example.xml'); my @columns = sort keys %{{map {$_=>1} map {keys %$_} values %$itms}}; my $csv = Text::CSV->new({binary=>1, auto_diag=>2, eol=>$/, always_quote=>1 }); $csv->print(select, ["sku", @columns]); for my $sku (sort keys %$itms) { $csv->print(select, [$sku, map { $itms->{$sku}{$_} } @columns ]); }
      Thank you Haukex, you池e always helping me out. Works great. I have no idea what you did but it gives me a place to start and learn.

      Follow up question. I知 trying to print this out to a file but I知 getting errors about print usage. This is what I did

      open my $fh, "<:encoding(utf8)", "output.csv" or die "output.csv: $!"; my $csv = Text::CSV->new({binary=>1, auto_diag=>2, eol=>$/, always_quote=>1 }); $csv->print($fh, (select, ["sku", @columns])); for my $sku (sort keys %$itms) { $csv->print($fh, (select, [$sku, map { $itms->{$sku}{$_} } @column +s ])); }
      Thanks again
        open my $fh, "<:encoding(utf8)", "output.csv" or die "output.csv: $!";

        You have called your file output.csv but have opened it for reading. That's contradictory.

        I知 getting errors about print usage

        Why keep them a secret? In general, always supply the full, exact text of any error message you have.

        I have no idea what you did but it gives me a place to start and learn.

        Yes, XML::Rules does take a bit of studying its docs to understand all the rules. As for my XML::LibXML code, I hope that's a bit more clear. If you add use Data::Dump; at the top and add a dd \%items; just after the loop, you'll see the data structure:

        { "00123" => { "File 1" => "www.site.com/path/to/file/file.pdf", "File 2" => "www.site.com/path/to/file/file.pdf", }, "00124" => { "File 1" => "www.site.com/path/to/file/file.pdf", "File 6" => "www.site.com/path/to/file/file.pdf", }, "00125" => { "File 2" => "www.site.com/path/to/file/file.pdf", "File 7" => "www.site.com/path/to/file/file.pdf", }, }

        The other tricky bit is the line that generates the @columns array. It has two main parts: map {keys %$_} values %items takes the values of the %items, which are hashes themselves ({ "File 1" => "www..." etc.), and then uses map to get the keys of each of those hashes. In other words, the map statement will return the list "File 1", "File 2", "File 1", "File 6", .... The second part is the keys %{{map {$_=>1} ...}}, which is an idiom to get only unique values from a list. It does this by first building an anonymous hash with {map {$_=>1} values}, and then immediately dereferencing the hash (%{...}) and getting its keys. Then, all that's left to do is sort the unique strings.

        I知 trying to print this out to a file but I知 getting errors about print usage.

        What hippo said: first, you're opening the file for reading instead of writing. Second, note that the first argument of $csv->print should be the filehandle, which you did, but note that I was getting the output filehandle (STDOUT by default) using select, so you need to remove that call. This works for me:

        open my $fh, ">:encoding(UTF-8)", "output.csv" or die "output.csv: $!" +; my $csv = Text::CSV->new({binary=>1, auto_diag=>2, eol=>$/, always_quote=>1 }); $csv->print($fh, ["sku", @columns]); for my $sku (sort keys %$itms) { $csv->print($fh, [$sku, map { $itms->{$sku}{$_} } @columns ]); } close $fh;
Re: Logic Help - Output to csv
by Tux (Canon) on Jun 08, 2020 at 06:53 UTC

    Desired output

    "sku","File 1", "File 2", "File 6", "File 7" "00123","www.site.com/path/to/file/file.pdf","www.site.com/path/to/fil +e/file.pdf",, "00124","www.site.com/path/to/file/file.pdf",,"www.site.com/path/to/fi +le/file.pdf", "00125",,"www.site.com/path/to/file/file.pdf",,"www.site.com/path/to/f +ile/file.pdf"

    Maybe I am just nitpicking, but I sincerely hope you do not want the spaces after the separator (", ") in the header line.

    If you do, why not after the first separator? The required output as you show is extremely hard to generate and results in incorrect CSV data.

    For all other advice, read haukex' answers.

    For more speed, use Text::CSV_XS instead of Text::CSV.


    Enjoy, Have FUN! H.Merijn

      Text::CSV already calls Text::CSV_XS when available, so in most case it won't have any impact on speed, and in some other cases it will stop working altogether.

Re: Logic Help - Output to csv
by haukex (Archbishop) on Jun 07, 2020 at 19:42 UTC

    Do you know the columns "File 1", "File 2", "File 6", "File 7" beforehand or do you have to figure those out from the XML file?

      I don't, I'm getting that from the XML data. The only conclusion I could come to is that I would need two arrays or hashes and iterate through them.