PerlMonks  

perl parsing

by cbtshare (Monk)
on Oct 04, 2017 at 03:25 UTC ( [id://1200636]=perlquestion )

cbtshare has asked for the wisdom of the Perl Monks concerning the following question:

Hello All, I am parsing a file. The file's output is laid out under horizontal headers that vary with the machine specifications. I want to parse just the devices and names, but I am not sure how to attribute all the devices to the names. Sample:

name Brian
shirt yellow
socks black
device ipad 2001
device ipad 2001
device ipad 2001
tag no
tag 0
name Andrew
shirt orange
socks black
device ipad 2009
tag no
tag 0
name ryan
shirt blue
socks black
device ipad 2005
device cell 2009
tag yes
tag 1

I have the following so far, but I am not sure how to get all the devices, and the year of each device, belonging to each name. I can parse for the name.

my @file = `cat text.txt`;
foreach my $line (@file) {
    while $line =~ /name \s+(*.?) \s+(.*?)/mgx
        my $name = $1;
}

Replies are listed 'Best First'.
Re: perl parsing
by Laurent_R (Canon) on Oct 04, 2017 at 06:27 UTC
    You've been given a solution that presumably works fine, but I would like to comment with a side note.
    my @file = `cat text.txt`;
    Calling the system or shell to read the file is really poor practice in Perl (except possibly for command-line one-liners). Perl offers all the tools to do that, with much better control over what happens and what to do if something goes wrong.

    Look at the way poj opens and reads the file in pure Perl; that's much better.
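
A minimal sketch of that pure-Perl approach, not poj's exact code: it uses a three-argument open with a lexical filehandle and checks every call that can fail. The temporary file exists only so the example runs on its own; in the original question the path would be text.txt.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Assumption for illustration: create a small sample file so the sketch is self-contained.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
print {$tmp_fh} "name Brian\ndevice ipad 2001\n";
close $tmp_fh or die "Could not close $path: $!";

# Three-argument open with a lexical filehandle; $! carries the OS error text.
open my $fh, '<', $path or die "Could not open $path: $!";

my @lines;
while (my $line = <$fh>) {
    chomp $line;            # drop the trailing newline
    push @lines, $line;
}
close $fh or die "Could not close $path: $!";

print scalar(@lines), " lines read\n";
```

Unlike the backtick call, a failure here (missing file, permission problem) stops the program with a message that names the file and the reason.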

      There's also File::Slurp, which is quite useful if you want to read an entire file in one go:

      use File::Slurp;
      my @file = read_file('text.txt');
        Please, don't recommend broken modules.

        ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re: perl parsing
by AnomalousMonk (Archbishop) on Oct 04, 2017 at 05:48 UTC
    I can parse for the name ...

    You could parse for the name if the code compiled, but it doesn't. After some fixes, you can get the following, but there seems to be another problem.

    c:\@Work\Perl\monks\cbtshare>perl -wMstrict -le
    "my @file = `cat text.txt`;
     ;;
     foreach my $line (@file) {
         while ($line =~ /name \s+(.*?) \s+(.*?)/mgx) {
             my $name = $1;
             print qq{name '$name' other '$2'};
         }
     }
    "
    name 'Brian' other ''
    name 'Andrew' other ''
    name 'ryan' other ''
    Why is  $2 always empty?

    Update: Also, is there any point to the  /g modifier in the  /name \s+(.*?) \s+(.*?)/mgx match?
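
A small sketch that may help answer both questions. It assumes one record per line, as in the sample data; the second pattern is an illustrative alternative, not something posted in the thread.

```perl
use strict;
use warnings;

my $line = "name Brian\n";

# Under /x the literal spaces in the pattern are ignored, so this is
# effectively /name\s+(.*?)\s+(.*?)/. Nothing follows the final lazy group,
# so it is satisfied by matching zero characters.
my @lazy = $line =~ /name \s+(.*?) \s+(.*?)/x;

# Matching runs of non-whitespace explicitly avoids depending on lazy quantifiers.
my @explicit = "name Brian extra\n" =~ /^name \s+ (\S+) (?: \s+ (\S+) )?/x;

print "lazy:     '$lazy[0]' '$lazy[1]'\n";
print "explicit: '$explicit[0]' '$explicit[1]'\n";
```

Because each line holds at most one "name" field, a single match per line is enough, which is why the /g modifier (and the while loop around the match) adds nothing here.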


    Give a man a fish:  <%-{-{-{-<

Re: perl parsing
by poj (Abbot) on Oct 04, 2017 at 06:02 UTC
    how to attribute all the devices to the names

    I guess you want to build a Hash of Arrays (HoA)

    #!/usr/bin/perl
    use strict;
    use Data::Dumper;

    my $infile = 'text.txt';
    open IN,'<',$infile or die "Could not open $infile : $!";

    my $name;
    my %hash = ();
    while (<IN>){
      s/^\s+|\s+$//g; # trim leading/trailing spaces
      my ($col1,$col2) = split /\s+/,$_,2;
      if ($col1 eq 'name'){
        $name = $col2;
      } elsif ($col1 eq 'device') {
        push @{$hash{$name}},$col2;
      } else {
        # skip line
      }
    }
    close IN;
    print Dumper \%hash;
    poj
Re: perl parsing
by Marshall (Canon) on Oct 04, 2017 at 08:33 UTC
    A rather strange looking solution, but with an approach that can be extended to many such situations: (and no I don't think this is the "best" solution).
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $line;
    while ( defined ($line = <DATA>))
    {
        if ($line =~ /^name/)
        {
            $line = process_record ($line);
            redo if defined $line;   # another name line
        }
    }

    sub process_record
    {
        my $line = shift;
        (my $name) = $line =~ /^name\s+(\w+)/;
        my %devices;
        while (defined ($line = <DATA>) and $line !~ /^name/)
        {
            if ( (my $device) = $line =~ /^device\s+(\w+\s+\w+)/)
            {
                $device =~ s/(\w+)\s+(\w+)/$1 $2/;
                $devices{$device}=1;
            }
        }
        print "$name:\n";
        print " device $_\n" foreach keys %devices;
        return $line;
    }
    =PRINTS:
    Brian:
     device ipad 2001
    Andrew:
     device ipad 2009
    ryan:
     device ipad 2005
     device cell 2009
    =cut
    __DATA__
    socks something
    name Brian
    shirt yellow
    socks black
    device ipad 2001
    device ipad 2001
    device ipad 2001
    tag no
    tag 0
    name Andrew
    shirt orange
    socks black
    device ipad 2009
    tag no
    tag 0
    name ryan
    shirt blue
    socks black
    device ipad 2005
    device cell 2009
    tag yes
    tag 1
      thank you all!!

      One issue is that Brian has 3 devices, but your code prints only one:

      device ipad 2001
      device ipad 2001
      device ipad 2001

        How could you change Marshall's solution or perhaps poj's here to give you the results you want?


        Give a man a fish:  <%-{-{-{-<

        Well I figured that these were "dupes". Consider what would happen if $devices{$device}=1; was changed to $devices{$device}++; and what that would mean for adapting the printout of the hash to show the number of identical devices.
Re: perl parsing
by kcott (Archbishop) on Oct 05, 2017 at 03:11 UTC

    G'day cbtshare,

    Here's the technique I might have used for this task:

    #!/usr/bin/env perl

    use strict;
    use warnings;
    use autodie;

    use constant {
        IN_FILE => 'pm_1200636_text.txt',
        HEADER => 0,
        KEY => 1,
        VALUE => 2,
    };

    my %parsed;

    {
        open my $fh, '<', IN_FILE;
        my $name;

        while (<$fh>) {
            my @fields = split;

            if ($fields[HEADER] eq 'name') {
                $name = $fields[KEY];
                next;
            }

            if ($fields[HEADER] eq 'device') {
                push @{$parsed{$name}{$fields[KEY]}}, $fields[VALUE];
                next;
            }
        }
    }

    # For testing only
    use Data::Dump;
    dd \%parsed;

    This only reads a record at a time, so there should be no memory issues that might occur when slurping entire files. The only data that persists after the anonymous block is %parsed: process that as necessary. Also note that as $fh goes out of scope at the end of the anonymous block, Perl automatically closes this for you (there's no need for a close statement in this instance).

    I used the same data as you posted (see the spoiler).

    $ cat pm_1200636_text.txt
    name Brian
    shirt yellow
    socks black
    device ipad 2001
    device ipad 2001
    device ipad 2001
    tag no
    tag 0
    name Andrew
    shirt orange
    socks black
    device ipad 2009
    tag no
    tag 0
    name ryan
    shirt blue
    socks black
    device ipad 2005
    device cell 2009
    tag yes
    tag 1

    Output from a sample run:

    {
      Andrew => { ipad => [2009] },
      Brian  => { ipad => [2001, 2001, 2001] },
      ryan   => { cell => [2009], ipad => [2005] },
    }

    See also: "perldsc - Perl Data Structures Cookbook"; autodie; open; and, Data::Dump. Everything else is very straightforward and basic Perl, but feel free to ask if anything is unclear.

    — Ken

      Thank you very much. Your solution is quite similar to that of poj. I will attempt to explain what is being done, and then how I used what I understood to try to arrive at the solution I need.

      while (<IN>){
        #remove spaces from the beginning or the end of the file
        s/^\s+|\s+$//g;
        # splits the files based on columns based on space and limit the amount split by 4
        my ($col1,$col2,$col3) = split /\s+/,$_,4;
        #checks to see if the word name is matched to get the variable next over which would be the actual name , then put it in variable $name
        if ($col1 eq 'name'){
          $name = $col2;
        #checks to see if the word device is matched to get the variable next over which would be the actual type, then next over is another attribute(not on the example)
        } elsif ($col1 eq 'device') {
          ##Here the push name, device type and other variable into a hash
          push @{$hash{$name}},$col2, $col3;
        } else {
          # skip line
        }
      }
      close IN;
      #prints everything
      print Dumper \%hash

      My issue now comes when I need to print out the content in a structure way, or into a file, as: name device $col3 device $col3. I can sort through the hash and get the name only, not all the other attributes. But why? I put them all into the hash, right?

      foreach my $line(keys %hash) { print $line }

      I believe you are doing something similar:

      ##defining the fields you want including the file, HEADER would be the first field and if name or device then KEY is the next value over and VALUE the next
      use constant {
          IN_FILE => 'pm_1200636_text.txt',
          HEADER => 0,
          KEY => 1,
          VALUE => 2,
      };
      my %parsed;
      {
          open my $fh, '<', IN_FILE;
          my $name;
          while (<$fh>) {
              my @fields = split;
              if ($fields[HEADER] eq 'name') {
                  $name = $fields[KEY];
                  next;
              }

      This is the part that gives me issues since I need to print the values in a specified format, so a data dumper wouldn't work. Any help, please?

              if ($fields[HEADER] eq 'device') {
                  push @{$parsed{$name}{$fields[KEY]}}, $fields[VALUE];
                  next;
              }
          }
      }

        Your analysis of what the code is doing is mostly correct. In places, you indicate that operations are being performed on "files"; both solutions are reading the files line-by-line, and those operations are being performed on "records". Consider these corrections:

        #remove spaces from both the beginning and the end of the record
        # splits the records based on ...

        You also appear to have misunderstood the LIMIT argument of split: you've used a value of 4 in two places, which doesn't make much sense as the maximum number of fields of any record is 3. Further reading of that documentation will explain why "@fields = split;" needs no arguments nor any preprocessing to trim whitespace.
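
A short sketch of the split behaviour described above, using a record shaped like the sample data (the leading spaces are added for illustration only):

```perl
use strict;
use warnings;

my $record = "  device ipad 2001\n";

# LIMIT caps the number of fields; the last field keeps the rest of the string.
my @limited = split /\s+/, 'device ipad 2001', 2;   # ('device', 'ipad 2001')

# With the pattern /\s+/, leading whitespace produces an empty first field.
my @pattern = split /\s+/, $record;                 # ('', 'device', 'ipad', '2001')

# A bare "split" means split ' ', $_ : it skips leading whitespace and, like
# every split without a LIMIT, drops trailing empty fields.
my @bare = split ' ', $record;                      # ('device', 'ipad', '2001')

print scalar(@limited), " / ", scalar(@pattern), " / ", scalar(@bare), "\n";
```

With at most three fields per record, a LIMIT of 4 never takes effect, and the special ' ' form makes the separate trimming step unnecessary.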

        The data structures produced by the two solutions are different: an HoA and an HoHoA. We both provided a link to perldsc: perhaps you need to read, reread or study in more detail.

        The part that seems to elude you, in both cases, is how to translate the information in the data structures to whatever output format you need. You wrote (at the end of each of those analyses, respectively):

        "My issue now comes when I need to print out the content in a structure way, ..."
        "This is the part that gives me issues since I need to print the values in a specified format, ..."

        Without any knowledge of the required output format, there's no way we can help. Again, the perldsc documentation has several sections on accessing the data in complex structures: the answer probably lies therein.
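
As a rough illustration only, here is one way to walk the HoHoA shown in the sample output. The "name device type year" line format is a made-up placeholder, since the required format was never stated.

```perl
use strict;
use warnings;

# Same shape as the dd output earlier: name => { device type => [ years ] }.
my %parsed = (
    Andrew => { ipad => [2009] },
    Brian  => { ipad => [2001, 2001, 2001] },
    ryan   => { cell => [2009], ipad => [2005] },
);

my @report;
for my $name (sort keys %parsed) {                    # outer hash: names
    for my $type (sort keys %{ $parsed{$name} }) {    # inner hash: device types
        for my $year (@{ $parsed{$name}{$type} }) {   # array: one entry per device
            push @report, "$name device $type $year"; # hypothetical output format
        }
    }
}

print "$_\n" for @report;
```

Each level of the structure gets its own loop: two hash levels are dereferenced with %{ ... } and the innermost array with @{ ... }.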

        There are a few other areas where it looks like you really don't understand certain fundamentals. For instance, using the name $line for the variable that holds a key in:

        foreach my $line(keys %hash) { print $line }

        would seem to indicate that you don't know what keys does.

        I would recommend that you bookmark perlintro and refer to it often. Make sure you understand the very basic information it presents, then follow links to related functions, in-depth documentation, tutorials, advanced topics, and so on, as necessary. For instance, the section on Hashes has links to keys and values (I half suspect that, in the code previously mentioned, "values %hash" was probably closer to what you wanted, instead of "keys %hash"); you'll also find many others such as perldata (fuller details), perlreftut (tutorial), and even perldsc (advanced topic already mentioned). Do note that's just some of the links in one of many sections: the entire document is like that and I think you'll find it a most useful resource.
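
To make the keys/values distinction concrete, here is a small sketch on a hash of arrays like the one poj's code builds. The data is typed in by hand for illustration.

```perl
use strict;
use warnings;

# Hash of arrays: each name maps to a reference to an array of devices.
my %hash = (
    Brian => ['ipad 2001', 'ipad 2001', 'ipad 2001'],
    ryan  => ['ipad 2005', 'cell 2009'],
);

my @names = sort keys %hash;    # only the names
my @refs  = values %hash;       # array references, one per name

my @out;
for my $name (@names) {
    # Dereference the array reference to reach the devices themselves.
    push @out, "$name: " . join(', ', @{ $hash{$name} });
}

print "$_\n" for @out;
```

Printing a key shows only the name; the attributes live behind $hash{$name} and need a dereference before they can be printed.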

        — Ken

Re: perl parsing
by Marshall (Canon) on Oct 06, 2017 at 02:34 UTC
    I saw your question about accounting for Brian having more than one of the same device. Here is yet another solution... I didn't use a HoH in my first solution partly because that can be a difficult concept for beginners.

    In general I don't recommend approaches that require reading the entire input file into memory and then parsing that memory copy of the file because that often essentially means that the data is being "handled" in some way more than once and can take a lot of memory in the process. Some of the files that I work with can get quite large.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %devices; # a HOH Hash of Hash {name}{device}
    my $current_name;

    while ( my $line = <DATA>)
    {
        $current_name = $1 if ($line =~ m/^name\s+(\w+)\s+/);
        if ( (my $device) = $line =~ /^device\s+([\w\s]+)\n/)
        {
            $device =~ s/[ ]+/ /g; # multiple-space to a single space
            $devices{$current_name}{$device}++;
        }
    }
    # print the %devices hash - requires 2 loops
    foreach my $name (sort keys %devices)
    {
        print "$name:\n";
        foreach my $device (keys %{$devices{$name}})
        {
            print " $devices{$name}{$device}\t$device\n";
        }
    }
    =Prints
    Andrew:
     1 ipad 2009
    Brian:
     3 ipad 2001
    ryan:
     1 ipad 2005
     1 cell 2009
    =cut
    __DATA__
    socks something
    name Brian
    shirt yellow
    socks black
    device ipad 2001
    device ipad 2001
    device ipad 2001
    tag no
    tag 0
    name Andrew
    shirt orange
    socks black
    device ipad 2009
    tag no
    tag 0
    name ryan
    shirt blue
    socks black
    device ipad 2005
    device cell 2009
    tag yes
    tag 1

      Thanks, I am understanding a bit better now. I tried your solution, mainly the HoH printing, but I am getting an error: "Not a Hash reference" at line 42.

      foreach my $line(keys %hash)
      {
          print "$line\n"; ##This works and prints the names
          foreach my $sit (keys %{$hash{$line}}) #### <--line 42 line throwing error
          {
              print "$hash{$line}{$sit}\n";
          }
      }
        Show your complete code. I can't tell what your problem is from this snippet.
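
One possible cause, offered only as a guess because the complete code was not posted: if %hash was still being filled with push, as in the annotated loop earlier in the thread, each value is an array reference, and treating it as a hash dies at run time. The sketch below reproduces that situation with hand-typed data and then walks the array instead.

```perl
use strict;
use warnings;

my %hash;
# Assumption: built as in the annotated loop, so each name maps to an ARRAY reference.
push @{ $hash{Brian} }, 'ipad', '2001';
push @{ $hash{ryan}  }, 'ipad', '2005', 'cell', '2009';

# Dereferencing an array reference as a hash raises "Not a HASH reference".
my $ok  = eval { my @k = keys %{ $hash{Brian} }; 1 };
my $err = $ok ? '' : $@;
print "error: $err" if $err;

# Walk the array two items at a time (device type, year) instead.
my @out;
for my $name (sort keys %hash) {
    my @items = @{ $hash{$name} };    # copy, so splice leaves %hash untouched
    while (my ($type, $year) = splice @items, 0, 2) {
        push @out, "$name $type $year";
    }
}
print "$_\n" for @out;
```

Checking ref $hash{$name} before the inner loop shows quickly whether a value is an ARRAY or a HASH reference.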

        I avoided a HoH (Hash of Hash) in my first code post partly because, as I suspected, beginners have problems with this. You are proving me right.

        I suggest that you perhaps use my first code, which doesn't use any complicated data structures. Would that be easier for you to work with?

        This code took me some minutes to write. It very well could be that it will take you literally hours to understand it. You will not learn if you don't put in the effort.

Node Type: perlquestion [id://1200636]
Approved by haukex