Help with parsing a file

Odar has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Help with parsing a file by GrandFather (Saint) on May 28, 2022 at 23:59 UTC
There are a bunch of issues with your code that I'll mention in passing to help you toward Perlish style programming instead of C style. The first issue is an almost complete lack of optional white space, which I find hard to read. Use white space as you would for writing prose - that's probably what people read most of and what brains are trained to parse, so keep it simple for brains.

An immediate issue is that you don't show how you parse your input data so we can't tell what is in $row. That means we don't know what is in @face_ac and the line pushing @temp into it looks dubious to me. So let's throw all of that away to start with and build something new.

First, we want this to be a small self contained correct example so we start off with strictures and some baked in data. There is a hint that you know this, but always use strictures (use strict; use warnings; - see The strictures, according to Seuss).

    use strict;
    use warnings;

    my $fileStr = <<STR;
    F001 1.2
    F101 3.2
    solvent1 0
    solvent2 3

    F001 2.2
    F101 7.2
    solvent1 5
    solvent2 0
    STR

    open my $fIn, '<', \$fileStr or die "Couldn't open \$fileStr: $!\n";

This adds strictures, provides sample data as though it were in an external file and opens an input file handle to it. Now set up a loop to parse the input data. Perl allows us to tell it what constitutes an end of line character sequence so we take advantage of that to read the data one record at a time:

    # Look for the empty line between records
    local $/ = "\n\n";

    while (defined (my $record = <$fIn>)) {

Parse the lines. Note that %recordData is declared inside the loop because we don't need it outside the loop or before the loop. Always declare variables in the smallest scope and initialize them when they are declared if appropriate (arrays and hashes are empty by default so usually they don't need to be initialized). You are familiar with split already, but grep and map may be new. Pop off and skim their documentation.

In this case we are using grep to remove empty lines and map to generate a key value pair for each line. Then we use grep to build a list of solvents and a list of Fs:

        my %recordData = map{split /\s+/, $_} grep {length $_} split "\n", $record;
        my @solvents = grep {/^solvent\d+/} keys %recordData;
        my @fractions = grep {/^F\d+/} keys %recordData;

Now we can find the solvent with the zero value. We assume there is one and only one. There could be error checking around this, but I'm skipping it for now. Note that grep operates on a list and generates a list, so $zeroSolvent needs to be in list context so that the value of the first element of the list generated by grep is assigned to it:

        my ($zeroSolvent) = grep {!$recordData{$_}} @solvents;

and now we can generate the report for the record:

        print "${zeroSolvent}_$_ => $recordData{$_}\n" for @fractions;
    }

That prints:

    solvent1_F101 => 3.2
    solvent1_F001 => 1.2
    solvent2_F101 => 7.2
    solvent2_F001 => 2.2

The code above concatenated together is:

    use strict;
    use warnings;

    my $fileStr = <<STR;
    F001 1.2
    F101 3.2
    solvent1 0
    solvent2 3

    F001 2.2
    F101 7.2
    solvent1 5
    solvent2 0
    STR

    open my $fIn, '<', \$fileStr or die "Couldn't open \$fileStr: $!\n";

    # Look for the empty line between records
    local $/ = "\n\n";

    while (defined (my $record = <$fIn>)) {
        my %recordData = map{split /\s+/, $_} grep {length $_} split "\n", $record;
        my @solvents = grep {/^solvent\d+/} keys %recordData;
        my @fractions = grep {/^F\d+/} keys %recordData;
        my ($zeroSolvent) = grep {!$recordData{$_}} @solvents;

        print "${zeroSolvent}_$_ => $recordData{$_}\n" for @fractions;
    }

There may be follow up questions. :-D

This is not the solution that a person with experience in other programming languages might come up with first off, but it's worth exploring in detail because tools such as grep and map can clean up code something wonderful (they can also obscure code something dreadful).

Update: I should note that `"${zeroSolvent}_$_ => $recordData{$_}\n"` uses variable interpolation. Perl expands the contents of variables used inside double quoted strings. The `${zeroSolvent}` bit lets us use the variable zeroSolvent with an underscore character following it in the string without Perl seeing zeroSolvent_ as the variable name instead.

Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
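Editor's sketch of the two points above that most often trip people up: the braces in `${zeroSolvent}_` and the parentheses that put grep into list context. The helper name `make_label` and the `%value` hash are made up for illustration and are not part of GrandFather's code.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): build a "<solvent>_<fraction>" label.
sub make_label {
    my ($solvent, $fraction) = @_;
    # The braces end the variable name, so Perl does not look for $solvent_.
    return "${solvent}_$fraction";
}

# Made-up sample values in the shape of one record.
my %value = (solvent1 => 0, solvent2 => 3);

# List context (note the parentheses): the first matching key is assigned.
my ($first_zero) = grep { !$value{$_} } sort keys %value;

# Scalar context (no parentheses): grep returns the number of matches instead.
my $zero_count = grep { !$value{$_} } keys %value;

print make_label($first_zero, 'F001'), "\n";
print "zero-valued solvents: $zero_count\n";
```

The scalar-context count is also a cheap way to add the error check that was skipped: if it is not exactly 1, the record does not have a unique zero-valued solvent.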
Re^2: Help with parsing a file by Odar (Novice) on May 29, 2022 at 18:57 UTC
Thank you very much for your help and for pointing out my mistakes and omissions; I am learning a lot from this type of feedback. I have added some additional description of the data file format (e.g. the data blocks are not separated by an empty line but by three lines of text with an empty line above and below) and also updated my code to show how I was trying to parse the file.
Re: Help with parsing a file by Marshall (Canon) on May 28, 2022 at 23:04 UTC
It is rare for array indices to appear in Perl code for this sort of problem. Here is another way. Perhaps helpful or not to you, this was my general thought process:

1. I started by writing "while(){" without filling in the condition yet.
2. I saw that you had blank line separated records. So, I just coded a line to get that record and coded the subroutine. There are many ways to write this sub, I just picked an obvious one.
3. Then I applied your rules to get the solvent name from that record.
4. Then I wrote a loop to iterate over the F values.
5. Then I decided to end on eof and filled in the while condition with an eof check.

So, that is how I got to draft #1. Now I see that I could move getting the record into the while condition and stop on a null hash. All sorts of improvements could be made. I wanted to demo iterating over the keys of the record and getting a subset of matching keys with grep. This is not perfect code, but I hope it is easy for you to understand.

    use strict;
    use warnings;
    use Data::Dumper;

    my %results; #pick a better name for this!!

    my $eof_seen = 0;
    while (!$eof_seen)
    {
        my %record = get_record();
        my ($solvent) = grep {/^solvent/ and $record{$_}==0}keys %record;

        foreach my $F (grep {/^F/}keys %record)
        {
            $results{$solvent."_".$F}= $record{$F};
        }
        $eof_seen=1 if (eof(DATA));
    }

    print Dumper \%results;

    sub get_record #blank line separated records
    {
        my %record;
        my $line;
        while (defined ($line = <DATA>) and $line !~ /^\s*$/)
        {
            my ($key, $value) = split ' ',$line;
            $record{$key} = $value;
        }
        return %record;
    }

    =Prints
    $VAR1 = {
              'solvent1_F101' => '3.2',
              'solvent1_F001' => '1.2',
              'solvent2_F101' => '7.2',
              'solvent2_F001' => '2.2'
            };
    =cut

    __DATA__
    F001 1.2
    F101 3.2
    solvent1 0
    solvent2 3

    F001 2.2
    F101 7.2
    solvent1 5
    solvent2 0
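Editor's sketch of the improvement Marshall mentions but does not code: fetching the record inside the while condition and stopping on an empty hash. It is a minimal variant under stated assumptions, not Marshall's own revision. Passing the filehandle as an argument, the wrapper name `parse_records`, and skipping records without a zero-valued solvent are all assumptions made here.

```perl
use strict;
use warnings;

# Read one blank-line-separated record from the given filehandle.
# Returns an empty list at end of input, which ends the caller's loop.
sub get_record {
    my ($fh) = @_;
    my %record;
    while (defined(my $line = <$fh>)) {
        if ($line =~ /^\s*$/) {
            last if %record;   # blank line after data ends the record
            next;              # skip leading or repeated blank lines
        }
        my ($key, $value) = split ' ', $line;
        $record{$key} = $value;
    }
    return %record;
}

# Hypothetical wrapper: collect "<solvent>_<F>" => value for every record.
sub parse_records {
    my ($fh) = @_;
    my %results;
    # A list assignment in boolean context is false once get_record returns nothing.
    while (my %record = get_record($fh)) {
        my ($solvent) = grep { /^solvent/ and $record{$_} == 0 } keys %record;
        next unless defined $solvent;   # assumption: silently skip such records
        $results{"${solvent}_$_"} = $record{$_} for grep { /^F/ } keys %record;
    }
    return %results;
}

my $data = "F001 1.2\nF101 3.2\nsolvent1 0\nsolvent2 3\n\n"
         . "F001 2.2\nF101 7.2\nsolvent1 5\nsolvent2 0\n";
open my $fh, '<', \$data or die "Cannot open in-memory data: $!";
my %results = parse_records($fh);
print "$_ => $results{$_}\n" for sort keys %results;
```

With this shape the separate `$eof_seen` flag and the `eof(DATA)` test are no longer needed.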
Re^2: Help with parsing a file by Odar (Novice) on May 29, 2022 at 19:19 UTC
Thank you for helping with this Marshall, very grateful. Based on some of the feedback and solutions provided I have realised that I left out a key piece of information in my attempt to strip the problem down to its most basic form (I have updated the question). The data blocks are actually not separated by an empty line but by three lines of text with an empty line at the top and bottom, and there are more than two blocks. Apologies for the confusion.
Re: Help with parsing a file (updated) by LanX (Saint) on May 28, 2022 at 23:03 UTC
maybe

    use strict;
    use warnings;
    use Data::Dump qw/pp dd/;

    local $/ = "";                  # $INPUT_RECORD_SEPARATOR to split paragraphs **

    my %res;                        # result-set

    while (my $block = <DATA>) {    # DATA as file-handle
        my (%f_num, $prefix);

        for my $line ( split /\n/, $block ) {
            my ($k,$v) = split /\s+/, $line;      # key=value

            $f_num{$k} = $v                       # collect F<num>
              if $k =~ /^F\d+$/;

            $prefix = $k                          # catch <solvent...> = 0
              if $k =~ /^solvent/ and $v == 0;    # *
        }

        $res{"${prefix}_$_"} = $f_num{$_}         # copy with prefix
          for keys %f_num;
    }

    pp \%res;                       # display

    __DATA__
    F001 1.2
    F101 3.2
    solvent1 0
    solvent2 3

    F001 2.2
    F101 7.2
    solvent1 5
    solvent2 0

Output:

    {
      solvent1_F001 => 1.2,
      solvent1_F101 => 3.2,
      solvent2_F001 => 2.2,
      solvent2_F101 => 7.2,
    }

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery

update

*) added `$k =~ /^solvent/ and` to make it more fault tolerant. An if-elsif-else chain would be even better to catch errors with unexpected data.

**) changed from "\n\n" to match longer gaps too

update

Added comments. If you need further explanation, feel free to ask.
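Editor's sketch of the difference between the two record separators discussed in this thread: a literal `"\n\n"` versus the empty string (paragraph mode). The helper `read_records` and the sample text are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: read $text record by record using separator $sep.
sub read_records {
    my ($text, $sep) = @_;
    local $/ = $sep;    # input record separator, restored when the sub returns
    open my $fh, '<', \$text or die "Cannot open in-memory data: $!";
    my @records = <$fh>;
    close $fh;
    return @records;
}

# Made-up sample with three empty lines between the two blocks.
my $text = "a 1\nb 2\n\n\n\nc 3\nd 4\n";

my @fixed     = read_records($text, "\n\n");   # exactly two newlines end a record
my @paragraph = read_records($text, "");       # any run of empty lines ends a record

printf "separator \"\\n\\n\": %d records\n", scalar @fixed;
printf "paragraph mode:      %d records\n", scalar @paragraph;
```

With the literal separator the extra empty lines show up as a record of their own, which is why code that assumes every record contains data needs a guard; paragraph mode swallows the whole gap.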
Re^2: Help with parsing a file (updated) by Odar (Novice) on May 29, 2022 at 19:36 UTC
Thank you LanX, works great, but I have realised that in my attempt to simplify the example I left out a key piece of information, i.e. the data blocks are not separated by an empty line but by three lines of text with an empty line above and below. Also, there can be more than two blocks. I have updated the question.
Re^3: Help with parsing a file (updated) by tybalt89 (Monsignor) on May 29, 2022 at 22:05 UTC
Words are not enough, please post a section of your real data that is "long enough" to show all the problems...
Re^4: Help with parsing a file (updated) by Odar (Novice) on May 29, 2022 at 23:34 UTC
Re^5: Help with parsing a file (updated) by GrandFather (Saint) on May 30, 2022 at 04:15 UTC
Re^5: Help with parsing a file (updated) by LanX (Saint) on May 29, 2022 at 23:46 UTC
Re^4: Help with parsing a file (updated) by Odar (Novice) on May 30, 2022 at 13:21 UTC
Re: Help with parsing a file by jwkrahn (Abbot) on May 29, 2022 at 03:57 UTC
    $ echo "F001 1.2
    F101 3.2
    solvent1 0
    solvent2 3

    F001 2.2
    F101 7.2
    solvent1 5
    solvent2 0
    " | perl -e'
    $/ = "";
    while ( <> ) {
        my %x = split;
        my ( $key ) = grep $x{ $_ } eq "0", keys %x;
        for ( sort keys %x ) {
            next if $x{ $_ } eq int $x{ $_ };
            print "${key}_$_ => $x{$_}\n";
            }
        }
    '
    solvent1_F001 => 1.2
    solvent1_F101 => 3.2
    solvent2_F001 => 2.2
    solvent2_F101 => 7.2
Re: Help with parsing a file by GrandFather (Saint) on May 31, 2022 at 03:14 UTC
Small tweaks to the example code I posted earlier accommodate the "real" file format:

    use strict;
    use warnings;

    my $fileStr = <<STR;
    Property job 1 : Activity coefficients ln(gamma) ;
    Settings job 1 : T= 298.15 K ; x(6)= 1.0000 ;
    Units job 1 : Concentrations x : mole fraction ;

    Nr Compound ln(gamma)
    1 F002 4.66656083
    2 F011 26.13597035
    3 F101 32.47411476
    4 F11-1 29.58963453
    5 F111 30.24092207
    6 h2o 0.00000000
    7 acetonitrile 2.14102090
    8 chlorobenzene 8.72282917
    9 chcl3 6.98143674
    10 cyclohexane 10.20251798
    11 1,2-dichloroethane 6.32324557
    12 ch2cl2 5.50767091
    13 1,2-dimethoxyethane 2.56706253
    14 n,n-dimethylacetamide -1.64673734

    Property job 2 : Activity coefficients ln(gamma) ;
    Settings job 2 : T= 298.15 K ; x(7)= 1.0000 ;
    Units job 2 : Concentrations x : mole fraction ;

    Nr Compound ln(gamma)
    1 F002 1.69945785
    2 F011 0.74578421
    3 F101 2.67268035
    4 F11-1 1.64808218
    5 F111 1.95840198
    6 h2o 2.08530828
    7 acetonitrile 0.00000000
    8 chlorobenzene 1.08379112
    9 chcl3 0.46576330
    10 cyclohexane 3.71606919
    11 1,2-dichloroethane -0.02354847
    12 ch2cl2 -0.23798262
    13 1,2-dimethoxyethane 1.22044280
    14 n,n-dimethylacetamide 0.44524110
    STR

    open my $fIn, '<', \$fileStr or die "Couldn't open \$fileStr: $!\n";

    # Look for the empty line between records
    local $/ = "Nr Compound";

    while (defined (my $record = <$fIn>)) {
        my @lines = grep {/^\s*\d+\s+\S+\s+-?\d+\.\d+/} split "\n", $record;

        next if !@lines;

        my %recordData = map{/\d+\s+(\S+)\s+(\S+)/; ($1, $2)} @lines;
        my @solvents = grep {!/^F\d+/} keys %recordData;
        my @fractions = grep {/^F\d+/} keys %recordData;
        my ($zeroSolvent) = grep {0.0 == $recordData{$_}} @solvents;

        print "${zeroSolvent}_$_ => $recordData{$_}\n" for @fractions;
    }

Prints:

    h2o_F101 => 32.47411476
    h2o_F011 => 26.13597035
    h2o_F002 => 4.66656083
    h2o_F11-1 => 29.58963453
    h2o_F111 => 30.24092207
    acetonitrile_F111 => 1.95840198
    acetonitrile_F101 => 2.67268035
    acetonitrile_F11-1 => 1.64808218
    acetonitrile_F002 => 1.69945785
    acetonitrile_F011 => 0.74578421

The key differences are choosing a different string to recognize records and only keeping interesting lines for processing from each record.

Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
Re^2: Help with parsing a file by Odar (Novice) on May 31, 2022 at 21:31 UTC
Thank you very much GrandFather, works like a charm!
Re: Help with parsing a file by tybalt89 (Monsignor) on May 30, 2022 at 13:51 UTC
Adjusted for your latest data.

    #!/usr/bin/perl

    use strict; # https://perlmonks.org/?node_id=11144249
    use warnings;

    open my $fh, '<', \<<END; # FIXME with normal open
    Property job 1 : Activity coefficients ln(gamma) ;
    Settings job 1 : T= 298.15 K ; x(6)= 1.0000 ;
    Units job 1 : Concentrations x : mole fraction ;

    Nr Compound ln(gamma)
    1 F002 4.66656083
    2 F011 26.13597035
    3 F101 32.47411476
    4 F11-1 29.58963453
    5 F111 30.24092207
    6 h2o 0.00000000
    7 acetonitrile 2.14102090
    8 chlorobenzene 8.72282917
    9 chcl3 6.98143674
    10 cyclohexane 10.20251798
    11 1,2-dichloroethane 6.32324557
    12 ch2cl2 5.50767091
    13 1,2-dimethoxyethane 2.56706253
    14 n,n-dimethylacetamide -1.64673734

    Property job 2 : Activity coefficients ln(gamma) ;
    Settings job 2 : T= 298.15 K ; x(7)= 1.0000 ;
    Units job 2 : Concentrations x : mole fraction ;

    Nr Compound ln(gamma)
    1 F002 1.69945785
    2 F011 0.74578421
    3 F101 2.67268035
    4 F11-1 1.64808218
    5 F111 1.95840198
    6 h2o 2.08530828
    7 acetonitrile 0.00000000
    8 chlorobenzene 1.08379112
    9 chcl3 0.46576330
    10 cyclohexane 3.71606919
    11 1,2-dichloroethane -0.02354847
    12 ch2cl2 -0.23798262
    13 1,2-dimethoxyethane 1.22044280
    14 n,n-dimethylacetamide 0.44524110
    END

    local $_ = do{ local $/; <$fh> }; # slurp entire file

    my %solvent_face;
    $solvent_face{ "$3_$1" } = $2 while /\b (F\S+) \h+ ([-\d.]+)
      (?=.*? (\S+) \h+ 0.00000000 )/gsx;

    use Data::Dump 'dd'; dd \%solvent_face;

Outputs:

    {
      "acetonitrile_F002" => 1.69945785,
      "acetonitrile_F011" => 0.74578421,
      "acetonitrile_F101" => 2.67268035,
      "acetonitrile_F11-1" => 1.64808218,
      "acetonitrile_F111" => 1.95840198,
      "h2o_F002" => 4.66656083,
      "h2o_F011" => 26.13597035,
      "h2o_F101" => 32.47411476,
      "h2o_F11-1" => 29.58963453,
      "h2o_F111" => 30.24092207,
    }

Correct ?
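Editor's sketch for readers studying the regex above: the same lookahead idea written out with one comment per part. This is an annotated variant, not tybalt89's exact code. The helper name `pair_with_zero_solvent` is made up, the dot in `0\.00000000` is escaped here (the original leaves it unescaped), and the sample is a shortened copy of the thread's data.

```perl
use strict;
use warnings;

# Hypothetical helper: pair every F row with the next compound whose value is all zeros.
sub pair_with_zero_solvent {
    my ($text) = @_;
    my %pairs;
    while ($text =~ /
        \b (F\S+) \h+ ([-\d.]+)      # capture 1: fraction name, capture 2: its value
        (?=                          # lookahead: peek ahead, consume nothing
            .*?                      # shortest run of any characters, newlines included
            (\S+) \h+ 0\.00000000    # capture 3: compound on the next all-zero row
        )
    /gsx) {
        $pairs{"${3}_$1"} = $2;
    }
    return %pairs;
}

my $sample = <<'END';
1 F002 4.66656083
2 F011 26.13597035
3 h2o 0.00000000
4 acetonitrile 2.14102090
Nr Compound ln(gamma)
1 F002 1.69945785
2 F011 0.74578421
3 h2o 2.08530828
4 acetonitrile 0.00000000
END

my %pairs = pair_with_zero_solvent($sample);
print "$_ => $pairs{$_}\n" for sort keys %pairs;
```

Because the lookahead consumes nothing, the next search resumes right after the F row just matched, so every F row gets paired. The approach relies on each block having exactly one all-zero row that comes after its F rows.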
Re^2: Help with parsing a file by Odar (Novice) on May 31, 2022 at 20:59 UTC
Wow, it works brilliantly on the full file, thank you tybalt89. I will study your solution to make sure I understand it fully. I hope you won't mind if I get stuck and ask a clarifying question.
Re: Help with parsing a file (hash keys are unique) by LanX (Saint) on May 30, 2022 at 17:19 UTC
This looks very different to what you presented first, especially `"F11-1"` doesn't fit any description yet.

Furthermore I have to express great doubt that the desired output is really clever.

> showing 2 blocks only (the actual file has 61).

61 blocks and 9 solvents means you will have many cases where the same "solvent" is chosen. Hence your projected hash-keys `<solvent>_<F...>` will be overwritten each time you have such a collision. (hash keys are unique!!!)

Do you really want the data of max 9 last different solvents only?

I'd say what you really need is an AoH = array of hashes

    @res = (
        { solvent => 'h2o', F001 => 1.2, F101 => 3.2, ... },
        { solvent => 'acetonitrile', F001 => 2.2, F101 => 7.2, ... },
        ...
    );

And of course you have to be sure that there is only one solvent = 0 in each block, otherwise ...

edit

or a HoAoH hash of arrays of hashes

    $res{'h2o'}[0]{F001}= 2.2;
    ...

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery
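Editor's sketch of the array-of-hashes idea, assuming the simplified blank-line-separated format from the original question. The helper name `blocks_to_aoh`, the third block and its value 9.9 are made up to show a repeated solvent; the "exactly one zero" check is one possible way to enforce the condition LanX mentions.

```perl
use strict;
use warnings;

# Hypothetical helper: return one hash per block, so repeated solvents do not collide.
sub blocks_to_aoh {
    my ($text) = @_;
    my @res;
    for my $block (split /\n{2,}/, $text) {
        # key/value pairs for every non-empty line of the block
        my %pairs = map { split ' ', $_ } grep { /\S/ } split /\n/, $block;

        # the zero-valued non-F entries; there must be exactly one
        my @zero = grep { !/^F/ and $pairs{$_} == 0 } sort keys %pairs;
        die "expected exactly one zero-valued solvent per block\n" unless @zero == 1;

        my %row = (solvent => $zero[0]);
        $row{$_} = $pairs{$_} for grep { /^F/ } keys %pairs;
        push @res, \%row;
    }
    return @res;
}

# Made-up sample: solvent1 is the zero-valued solvent in blocks 1 and 3.
my $text = "F001 1.2\nsolvent1 0\nsolvent2 3\n\n"
         . "F001 2.2\nsolvent1 5\nsolvent2 0\n\n"
         . "F001 9.9\nsolvent1 0\nsolvent2 4\n";

my @res = blocks_to_aoh($text);
printf "%s: F001 = %s\n", $_->{solvent}, $_->{F001} for @res;
```

A flat `<solvent>_<F...>` hash would keep only one `solvent1_F001` entry for this input; the array keeps all three blocks in file order.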
Re^2: Help with parsing a file (hash keys are unique) by Odar (Novice) on May 31, 2022 at 21:40 UTC
Apologies, LanX, for the confusion my first attempt to explain the problem caused, and for not being clear enough about the number of blocks. There are 61 blocks because the real file has 61 solvents rather than the truncated solvent list I provided just to save space. A very big thank you for your help on this, and also for inspiring one of the solutions provided by tybalt89.
Re: Help with parsing a file by tybalt89 (Monsignor) on May 29, 2022 at 14:59 UTC
    #!/usr/bin/perl

    use strict; # https://perlmonks.org/?node_id=11144249
    use warnings;

    open my $fh, '<', \<<END; # FIXME with normal open
    F001 1.2
    F101 3.2
    solvent1 0
    solvent2 3

    three lines of text
    that are not important
    and can be skipped

    F001 2.2
    F101 7.2
    solvent1 5
    solvent2 0
    END

    local $_ = do{ local $/; <$fh> }; # slurp entire file

    my %solvent_face;
    $solvent_face{ "$3_$1" } = $2 while /\b (F\w+) \h+ ([\d.]+)
      (?=.*? ^ (\S+) \h+ 0 \h* \n )/gmsx;

    use Data::Dump 'dd'; dd \%solvent_face;
Re^2: Help with parsing a file by Odar (Novice) on May 29, 2022 at 18:10 UTC
Thank you for your help tybalt89. I think your solution prints out (adds to the hash) only the two records for solvent1 from the first (top) block. Still very useful, as I am learning a lot.
Re^3: Help with parsing a file by tybalt89 (Monsignor) on May 29, 2022 at 18:24 UTC
Nope, does all F***

Outputs:

    {
      solvent1_F001 => 1.2,
      solvent1_F101 => 3.2,
      solvent2_F001 => 2.2,
      solvent2_F101 => 7.2,
    }
Re: Help with parsing a file by tybalt89 (Monsignor) on May 30, 2022 at 18:32 UTC
Here's a couple of different choices inspired by LanX's Re: Help with parsing a file (hash keys are unique)

    #!/usr/bin/perl

    use strict; # https://perlmonks.org/?node_id=11144249
    use warnings;

    my $file = 'data.11144249';

    open my $fh, '<', $file or die "$! opening $file";
    local $_ = do{ local $/; <$fh> }; # slurp entire file

    my %solvent_face;

    # as requested
    $solvent_face{ "$3_$1" } = $2 while /\b (F\S+) \h+ ([-\d.]+)
      (?=.*? (\S+) \h+ 0.00000000 )/gsx;

    # by solvent, then arrays of each F**
    push @{ $solvent_face{ $3 }{ $1 } }, $2 while /\b (F\S+) \h+ ([-\d.]+)
      (?=.*? (\S+) \h+ 0.00000000 )/gsx;

    # by F*, then arrays of solvent
    push @{ $solvent_face{ $1 }{ $3 } }, $2 while /\b (F\S+) \h+ ([-\d.]+)
      (?=.*? (\S+) \h+ 0.00000000 )/gsx;

    use Data::Dump 'dd'; dd \%solvent_face;

Outputs:

    {
      "acetonitrile" => {
        "F002" => [1.69945785],
        "F011" => [0.74578421],
        "F101" => [2.67268035],
        "F11-1" => [1.64808218],
        "F111" => [1.95840198],
      },
      "acetonitrile_F002" => 1.69945785,
      "acetonitrile_F011" => 0.74578421,
      "acetonitrile_F101" => 2.67268035,
      "acetonitrile_F11-1" => 1.64808218,
      "acetonitrile_F111" => 1.95840198,
      "F002" => { acetonitrile => [1.69945785], h2o => [4.66656083] },
      "F011" => { acetonitrile => [0.74578421], h2o => [26.13597035] },
      "F101" => { acetonitrile => [2.67268035], h2o => [32.47411476] },
      "F11-1" => { acetonitrile => [1.64808218], h2o => [29.58963453] },
      "F111" => { acetonitrile => [1.95840198], h2o => [30.24092207] },
      "h2o" => {
        "F002" => [4.66656083],
        "F011" => [26.13597035],
        "F101" => [32.47411476],
        "F11-1" => [29.58963453],
        "F111" => [30.24092207],
      },
      "h2o_F002" => 4.66656083,
      "h2o_F011" => 26.13597035,
      "h2o_F101" => 32.47411476,
      "h2o_F11-1" => 29.58963453,
      "h2o_F111" => 30.24092207,
    }
Re^2: Help with parsing a file by Odar (Novice) on May 31, 2022 at 21:10 UTC
This one also works perfectly. Thank you!
Re: Help with parsing a file (real input please) by LanX (Saint) on May 29, 2022 at 19:12 UTC
> Update ...
> ...
> but in the actual file there are three lines of text

Not many will write speculative code for this vague description, at least I won't.

You need to provide a real data snippet "from the actual file". E.g. it's completely unclear how to distinguish "text" from the rest.

And in general please show more effort to code it yourself with the techniques learned from this thread. Otherwise ppl will start to complain that this is not a "code writing service".

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery
Re^2: Help with parsing a file by Odar (Novice) on May 29, 2022 at 19:44 UTC
LanX, I completely understand and agree with you, my bad. The help I have received should be enough for me to continue. Could I ask one last question, as I couldn't find an explanation on the site? What is the function of the d/l and the select links on the right side of each reply? I assume I need to select an answer using the select link, right? Thank you!
Re^3: Help with parsing a file by LanX (Saint) on May 29, 2022 at 20:18 UTC
> The help I have received should be enough for me to continue.

First of all please provide a good example. Folks here love to help but hate to waste time.

> What is the function of the d/l and the select links

They are rarely used, it's about "downloading" `<code>` sections, i.e. displaying them in original format and isolation without HTML pollution.

I mostly just click the `download` beneath a code section.

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery

