Odar has asked for the wisdom of the Perl Monks concerning the following question:
Hello, I would like to ask a question about how to parse the file format given below:
F001 1.2
F101 3.2
solvent1 0
solvent2 3
F001 2.2
F101 7.2
solvent1 5
solvent2 0
From this file I would like to create a hash that looks like this:
solvent1_F001 => 1.2
solvent1_F101 => 3.2
solvent2_F001 => 2.2
solvent2_F101 => 7.2
The hash key is the name of the solvent whose value is zero, followed by an underscore and the string starting with F. The hash value is the number to the right of the string starting with F. I am new to Perl and programming and I am struggling to figure it out. I have tried to write the code below but cannot figure out what the indices of the %solvent_face hash should be. I think I may not be using the right approach. Thank you in advance for the help.
my @temp;
my @face_ac;
my $size;
my %solvent_face;
open(my $fh,'<', $file) || die "Can not open file: $!";
while (my $row = <$fh>) {
chomp $row;
if ($row=~/\d+ F/) {
@temp=split(' ',$row);
push @face_ac, @temp;
}
if ($row=~/\d+ [a-z]/) {
@temp=split(' ',$row);
if ($temp[2]==0) {
$size=@face_ac;
for (my $i = 1; $i < $size+1; $i++) {
$solvent_face{$temp[1]."_".$face_ac[???]}=$face_ac[???];
}
print "@temp\n";
}
}
}
close $fh;
Update: Thank you all for providing help with this. Reading some of the replies made me realise that I didn't do a good job of describing the structure of the data file, hence this update. Sorry about that. The example data provided above is just a representative example rather than the actual full data. In the real file the number of strings starting with F can vary and is usually in the range of 3 to 10. The number of solvents can also vary, up to ca. 60. Also, I am using placeholders for the solvent names (i.e. solvent1, solvent2), but the actual data file contains real solvent names that can start with either a letter or a number (e.g. hexane, 1-butanol, 1,3-dimethylbenzene, ch2cl2, n-methyl-2-pyrrolidinone). Finally, for simplicity I have added a new line between the two blocks in this example, but in the actual file there are three lines of text that are not important and can be skipped.
Update2 - Here is a proper representative example of the data structure showing 2 blocks only (the actual file has 61).
Property job 1 : Activity coefficients ln(gamma) ;
Settings job 1 : T= 298.15 K ; x(6)= 1.0000 ;
Units job 1 : Concentrations x : mole fraction ;
Nr Compound ln(gamma)
1 F002 4.66656083
2 F011 26.13597035
3 F101 32.47411476
4 F11-1 29.58963453
5 F111 30.24092207
6 h2o 0.00000000
7 acetonitrile 2.14102090
8 chlorobenzene 8.72282917
9 chcl3 6.98143674
10 cyclohexane 10.20251798
11 1,2-dichloroethane 6.32324557
12 ch2cl2 5.50767091
13 1,2-dimethoxyethane 2.56706253
14 n,n-dimethylacetamide -1.64673734
Property job 2 : Activity coefficients ln(gamma) ;
Settings job 2 : T= 298.15 K ; x(7)= 1.0000 ;
Units job 2 : Concentrations x : mole fraction ;
Nr Compound ln(gamma)
1 F002 1.69945785
2 F011 0.74578421
3 F101 2.67268035
4 F11-1 1.64808218
5 F111 1.95840198
6 h2o 2.08530828
7 acetonitrile 0.00000000
8 chlorobenzene 1.08379112
9 chcl3 0.46576330
10 cyclohexane 3.71606919
11 1,2-dichloroethane -0.02354847
12 ch2cl2 -0.23798262
13 1,2-dimethoxyethane 1.22044280
14 n,n-dimethylacetamide 0.44524110
Re: Help with parsing a file
by GrandFather (Saint) on May 28, 2022 at 23:59 UTC
There are a bunch of issues with your code that I'll mention in passing to help you toward Perlish style programming instead of C style. The first issue is an almost complete lack of optional white space, which I find hard to read. Use white space as you would for writing prose - that's probably what people read most of and what brains are trained to parse, so keep it simple for brains.
An immediate issue is that you don't show how you parse your input data so we can't tell what is in $row. That means we don't know what is in @face_ac, and the line pushing @temp into it looks dubious to me. So let's throw all of that away to start with and build something new.
First, we want this to be a small self contained correct example so we start off with strictures and some baked in data. There is a hint that you know this, but always use strictures (use strict; use warnings; - see The strictures, according to Seuss).
use strict;
use warnings;
my $fileStr = <<STR;
F001 1.2
F101 3.2
solvent1 0
solvent2 3
F001 2.2
F101 7.2
solvent1 5
solvent2 0
STR
open my $fIn, '<', \$fileStr or die "Couldn't open \$fileStr: $!\n";
This adds strictures, provides sample data as though it were in an external file and opens an input file handle to it. Now set up a loop to parse the input data. Perl allows us to tell it what constitutes an end of line character sequence so we take advantage of that to read the data one record at a time:
# Look for the empty line between records
local $/ = "\n\n";
while (defined (my $record = <$fIn>)) {
Parse the lines. Note that %recordData is declared inside the loop because we don't need it outside the loop or before the loop. Always declare variables in the smallest scope and initialize them when they are declared if appropriate (arrays and hashes are empty by default so usually they don't need to be initialized). You are familiar with split already, but grep and map may be new. Pop off and skim their documentation. In this case we are using grep to remove empty lines and map to generate a key value pair for each line. Then we use grep to build a list of solvents and a list of Fs:
my %recordData = map {split /\s+/, $_} grep {length $_} split "\n", $record;
my @solvents = grep {/^solvent\d+/} keys %recordData;
my @fractions = grep {/^F\d+/} keys %recordData;
Now we can find the solvent with the zero value. We assume there is one and only one. There could be error checking around this, but I'm skipping it for now. Note that grep operates on a list and generates a list, so $zeroSolvent needs to be in list context so that the first element of the list generated by grep is assigned to it:
my ($zeroSolvent) = grep {!$recordData{$_}} @solvents;
and now we can generate the report for the record:
print "${zeroSolvent}_$_ => $recordData{$_}\n" for @fractions;
}
That prints:
solvent1_F101 => 3.2
solvent1_F001 => 1.2
solvent2_F101 => 7.2
solvent2_F001 => 2.2
The code above concatenated together is:
use strict;
use warnings;
my $fileStr = <<STR;
F001 1.2
F101 3.2
solvent1 0
solvent2 3
F001 2.2
F101 7.2
solvent1 5
solvent2 0
STR
open my $fIn, '<', \$fileStr or die "Couldn't open \$fileStr: $!\n";
# Look for the empty line between records
local $/ = "\n\n";
while (defined (my $record = <$fIn>)) {
my %recordData = map {split /\s+/, $_} grep {length $_} split "\n", $record;
my @solvents = grep {/^solvent\d+/} keys %recordData;
my @fractions = grep {/^F\d+/} keys %recordData;
my ($zeroSolvent) = grep {!$recordData{$_}} @solvents;
print "${zeroSolvent}_$_ => $recordData{$_}\n" for @fractions;
}
There may be follow up questions. :-D
This is not the solution that a person with experience in other programming languages might come up with first off, but it's worth exploring in detail because tools such as grep and map can clean up code something wonderful (they can also obscure code something dreadful).
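For readers new to grep and map, the building blocks used above can be tried in isolation. This is only a minimal sketch: the @lines array and the variable names are made up for illustration and are not part of the original code.

```perl
use strict;
use warnings;

# Hypothetical input lines, including one empty line.
my @lines = ('F001 1.2', '', 'solvent1 0', 'solvent2 3');

# grep keeps the elements for which the block is true (here: non-empty lines).
my @nonEmpty = grep { length $_ } @lines;

# map turns each element into a list; split yields a (key, value) pair per
# line, and assigning the flattened list to a hash pairs them up.
my %data = map { split ' ', $_ } @nonEmpty;

# List context: the parentheses make grep return its matches,
# and the first match is assigned.
my ($zero) = grep { $data{$_} == 0 } sort keys %data;

# Scalar context: without the parentheses grep returns the number of matches.
my $count = grep { $data{$_} == 0 } sort keys %data;

print "zero entry: $zero, number of zero entries: $count\n";
```

The difference between the last two assignments is the point made earlier about list context: the same grep gives the matching key in one case and a count in the other.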
Update: I should note that "${zeroSolvent}_$_ => $recordData{$_}\n" uses variable interpolation. Perl expands the contents of variables used inside double quoted strings. The ${zeroSolvent} bit lets us use the variable zeroSolvent with an underscore character following it in the string without Perl seeing zeroSolvent_ as the variable name instead.
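A small stand-alone sketch of the brace syntax, with hypothetical values in place of the parsed data:

```perl
use strict;
use warnings;

# Hypothetical values standing in for the parsed data.
my $zeroSolvent = 'solvent1';
my $face        = 'F001';

# The braces delimit the variable name, so the underscore is literal text.
my $key = "${zeroSolvent}_$face";

# Equivalent spelling using explicit concatenation.
my $same = $zeroSolvent . '_' . $face;

print "$key\n";
print "both spellings agree\n" if $key eq $same;

# Without the braces Perl would look for a variable named $zeroSolvent_,
# which is not declared, so "use strict" would refuse to compile the script.
```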
Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
Thank you very much for your help and for pointing out my mistakes and omissions; I am learning a lot from this type of feedback. I have added some additional description of the format of the data file (e.g. the data blocks are not separated by an empty line but by three lines of text with an empty line above and below) and also updated my code to show how I was trying to parse the file.
Re: Help with parsing a file
by Marshall (Canon) on May 28, 2022 at 23:04 UTC
It is rare for array indices to appear in Perl code for this sort of problem. Here is another way.
Perhaps helpful or not to you, this was my general thought process:
1. I started by writing "while(){" without filling in the condition yet.
2. I saw that you had blank-line-separated records.
So, I just coded a line to get that record and coded the subroutine.
There are many ways to write this sub; I just picked an obvious one.
3. Then I applied your rules to get the solvent name from that record.
4. Then I wrote a loop to iterate over the F values.
5. Then I decided to end on eof and filled in the while condition with an eof check.
So, that is how I got to draft #1. Now I see that I could move getting the record into the while condition and stop looping when an empty hash comes back. All sorts of improvements could be made. I wanted to demo iterating over the keys of the record and getting a subset of matching keys with grep. This is not perfect code, but I hope it is easy for you to understand.
use strict;
use warnings;
use Data::Dumper;
my %results; #pick a better name for this!!
my $eof_seen = 0;
while (!$eof_seen)
{
my %record = get_record();
my ($solvent) = grep {/^solvent/ and $record{$_}==0}keys %record;
foreach my $F (grep {/^F/}keys %record)
{
$results{$solvent."_".$F}= $record{$F};
}
$eof_seen=1 if (eof(DATA));
}
print Dumper \%results;
sub get_record #blank line separated records
{
my %record;
my $line;
while (defined ($line = <DATA>) and $line !~ /^\s*$/)
{
my ($key, $value) = split ' ',$line;
$record{$key} = $value;
}
return %record;
}
=Prints
$VAR1 = {
'solvent1_F101' => '3.2',
'solvent1_F001' => '1.2',
'solvent2_F101' => '7.2',
'solvent2_F001' => '2.2'
};
=cut
__DATA__
F001 1.2
F101 3.2
solvent1 0
solvent2 3
F001 2.2
F101 7.2
solvent1 5
solvent2 0
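The refinement mentioned in the text, fetching the record in the while condition and stopping on an empty hash, might look roughly like the sketch below. It is not the poster's code: the in-memory $text, the lexical filehandle passed to get_record, and the "next unless defined" guard are assumptions added for illustration.

```perl
use strict;
use warnings;

# In-memory stand-in for the input file (hypothetical sample data).
my $text = "F001 1.2\nF101 3.2\nsolvent1 0\nsolvent2 3\n\n"
         . "F001 2.2\nF101 7.2\nsolvent1 5\nsolvent2 0\n";
open my $in, '<', \$text or die "Cannot open in-memory file: $!";

my %results;

# A hash assignment in boolean context is true while the record is non-empty,
# so the loop ends as soon as get_record returns an empty record.
while (my %record = get_record($in)) {
    my ($solvent) = grep { /^solvent/ and $record{$_} == 0 } keys %record;
    next unless defined $solvent;    # skip blocks without a zero solvent
    $results{"${solvent}_$_"} = $record{$_} for grep { /^F/ } keys %record;
}

print "$_ => $results{$_}\n" for sort keys %results;

sub get_record {    # read lines up to the next blank line or end of file
    my ($fh) = @_;
    my %record;
    while (defined(my $line = <$fh>)) {
        last if $line !~ /\S/;
        my ($key, $value) = split ' ', $line;
        $record{$key} = $value;
    }
    return %record;
}
```

One limitation of this shape: two consecutive blank lines produce an empty record and end the loop early, so it only suits input with exactly one blank line between records.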
Thank you for helping with this Marshall, I am very grateful. Based on the feedback and solutions provided I have realised that I left out a key piece of information in my attempt to strip the problem down to its most basic form (I have updated the question). The key information is that the data blocks are actually not separated by an empty line but by three lines of text with an empty line at the top and bottom, and there are more than two of them. Apologies for the confusion.
Re: Help with parsing a file (updated)
by LanX (Saint) on May 28, 2022 at 23:03 UTC
use strict;
use warnings;
use Data::Dump qw/pp dd/;
local $/ = "";                          # $INPUT_RECORD_SEPARATOR to split paragraphs °
my %res; # result-set
while (my $block = <DATA>) { # DATA as file-handle
my (%f_num, $prefix);
for my $line ( split /\n/, $block ) {
my ($k,$v) = split /\s+/, $line; # key=value
$f_num{$k} = $v # collect F<num>
if $k =~ /^F\d+$/;
$prefix = $k # catch <solvent...> = 0
if $k =~ /^solvent/ and $v == 0; # *
}
$res{"${prefix}_$_"} = $f_num{$_} # copy with prefix
for keys %f_num;
}
pp \%res; # display
__DATA__
F001 1.2
F101 3.2
solvent1 0
solvent2 3
F001 2.2
F101 7.2
solvent1 5
solvent2 0
Output:
{
solvent1_F001 => 1.2,
solvent1_F101 => 3.2,
solvent2_F001 => 2.2,
solvent2_F101 => 7.2,
}
update
*) added $k =~ /^solvent/ and to make it more fault tolerant.
an if-elsif-else chain would be even better to catch errors with unexpected data
°) changed from "\n\n" to match longer gaps too
update
Added comments. If you need further explanation, feel free to ask.
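The if-elsif-else chain suggested in the first update could be sketched like this. The sample lines, the @rejected array and the warning text are assumptions for illustration only; they are not taken from the post.

```perl
use strict;
use warnings;

# Hypothetical sample block containing one unexpected line.
my @lines = ('F001 1.2', 'F101 3.2', 'solvent1 0', 'solvent2 3', 'garbage');

my (%f_num, $prefix, @rejected);
for my $line (@lines) {
    my ($k, $v) = split ' ', $line;
    if (!defined $v) {
        push @rejected, $line;          # not a key/value pair
    }
    elsif ($k =~ /^F\d+$/) {
        $f_num{$k} = $v;                # collect F<num>
    }
    elsif ($k =~ /^solvent/) {
        $prefix = $k if $v == 0;        # remember the solvent with value 0
    }
    else {
        push @rejected, $line;          # unexpected key
    }
}

warn "Skipped unexpected line: '$_'\n" for @rejected;
print "${prefix}_$_ => $f_num{$_}\n" for sort keys %f_num;
```

Every line now falls into exactly one branch, so anything that is neither an F entry nor a solvent is reported instead of being silently ignored.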
Thank you LanX, it works great, but I have realised that in my attempt to simplify the example I left out a key piece of information, i.e. the data blocks are not separated by an empty line but by three lines of text with an empty line above and below. Also, there can be more than two blocks. I have updated the question.
Re: Help with parsing a file
by jwkrahn (Abbot) on May 29, 2022 at 03:57 UTC
$ echo "F001 1.2
F101 3.2
solvent1 0
solvent2 3
F001 2.2
F101 7.2
solvent1 5
solvent2 0
" | perl -e'
$/ = "";
while ( <> ) {
my %x = split;
my ( $key ) = grep $x{ $_ } eq "0", keys %x;
for ( sort keys %x ) {
next if $x{ $_ } eq int $x{ $_ };
print "${key}_$_ => $x{$_}\n";
}
}
'
solvent1_F001 => 1.2
solvent1_F101 => 3.2
solvent2_F001 => 2.2
solvent2_F101 => 7.2
Re: Help with parsing a file
by GrandFather (Saint) on May 31, 2022 at 03:14 UTC
use strict;
use warnings;
my $fileStr = <<STR;
Property job 1 : Activity coefficients ln(gamma) ;
Settings job 1 : T= 298.15 K ; x(6)= 1.0000 ;
Units job 1 : Concentrations x : mole fraction ;
Nr Compound ln(gamma)
1 F002 4.66656083
2 F011 26.13597035
3 F101 32.47411476
4 F11-1 29.58963453
5 F111 30.24092207
6 h2o 0.00000000
7 acetonitrile 2.14102090
8 chlorobenzene 8.72282917
9 chcl3 6.98143674
10 cyclohexane 10.20251798
11 1,2-dichloroethane 6.32324557
12 ch2cl2 5.50767091
13 1,2-dimethoxyethane 2.56706253
14 n,n-dimethylacetamide -1.64673734
Property job 2 : Activity coefficients ln(gamma) ;
Settings job 2 : T= 298.15 K ; x(7)= 1.0000 ;
Units job 2 : Concentrations x : mole fraction ;
Nr Compound ln(gamma)
1 F002 1.69945785
2 F011 0.74578421
3 F101 2.67268035
4 F11-1 1.64808218
5 F111 1.95840198
6 h2o 2.08530828
7 acetonitrile 0.00000000
8 chlorobenzene 1.08379112
9 chcl3 0.46576330
10 cyclohexane 3.71606919
11 1,2-dichloroethane -0.02354847
12 ch2cl2 -0.23798262
13 1,2-dimethoxyethane 1.22044280
14 n,n-dimethylacetamide 0.44524110
STR
open my $fIn, '<', \$fileStr or die "Couldn't open \$fileStr: $!\n";
# Split the input into records at each "Nr Compound" header line
local $/ = "Nr Compound";
while (defined (my $record = <$fIn>)) {
my @lines = grep {/^\s*\d+\s+\S+\s+-?\d+\.\d+/} split "\n", $record;
next if !@lines;
my %recordData = map{/\d+\s+(\S+)\s+(\S+)/; ($1, $2)} @lines;
my @solvents = grep {!/^F\d+/} keys %recordData;
my @fractions = grep {/^F\d+/} keys %recordData;
my ($zeroSolvent) = grep {0.0 == $recordData{$_}} @solvents;
print "${zeroSolvent}_$_ => $recordData{$_}\n" for @fractions;
}
Prints:
h2o_F101 => 32.47411476
h2o_F011 => 26.13597035
h2o_F002 => 4.66656083
h2o_F11-1 => 29.58963453
h2o_F111 => 30.24092207
acetonitrile_F111 => 1.95840198
acetonitrile_F101 => 2.67268035
acetonitrile_F11-1 => 1.64808218
acetonitrile_F002 => 1.69945785
acetonitrile_F011 => 0.74578421
The key differences are choosing a different string to recognize records and only keeping interesting lines for processing from each record.
Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
Re: Help with parsing a file
by tybalt89 (Monsignor) on May 30, 2022 at 13:51 UTC
#!/usr/bin/perl
use strict; # https://perlmonks.org/?node_id=11144249
use warnings;
open my $fh, '<', \<<END; # FIXME with normal open
Property job 1 : Activity coefficients ln(gamma) ;
Settings job 1 : T= 298.15 K ; x(6)= 1.0000 ;
Units job 1 : Concentrations x : mole fraction ;
Nr Compound ln(gamma)
1 F002 4.66656083
2 F011 26.13597035
3 F101 32.47411476
4 F11-1 29.58963453
5 F111 30.24092207
6 h2o 0.00000000
7 acetonitrile 2.14102090
8 chlorobenzene 8.72282917
9 chcl3 6.98143674
10 cyclohexane 10.20251798
11 1,2-dichloroethane 6.32324557
12 ch2cl2 5.50767091
13 1,2-dimethoxyethane 2.56706253
14 n,n-dimethylacetamide -1.64673734
Property job 2 : Activity coefficients ln(gamma) ;
Settings job 2 : T= 298.15 K ; x(7)= 1.0000 ;
Units job 2 : Concentrations x : mole fraction ;
Nr Compound ln(gamma)
1 F002 1.69945785
2 F011 0.74578421
3 F101 2.67268035
4 F11-1 1.64808218
5 F111 1.95840198
6 h2o 2.08530828
7 acetonitrile 0.00000000
8 chlorobenzene 1.08379112
9 chcl3 0.46576330
10 cyclohexane 3.71606919
11 1,2-dichloroethane -0.02354847
12 ch2cl2 -0.23798262
13 1,2-dimethoxyethane 1.22044280
14 n,n-dimethylacetamide 0.44524110
END
local $_ = do{ local $/; <$fh> }; # slurp entire file
my %solvent_face;
$solvent_face{ "$3_$1" } = $2 while
/\b (F\S+) \h+ ([-\d.]+) (?=.*? (\S+) \h+ 0.00000000 )/gsx;
use Data::Dump 'dd'; dd \%solvent_face;
Outputs:
{
"acetonitrile_F002" => 1.69945785,
"acetonitrile_F011" => 0.74578421,
"acetonitrile_F101" => 2.67268035,
"acetonitrile_F11-1" => 1.64808218,
"acetonitrile_F111" => 1.95840198,
"h2o_F002" => 4.66656083,
"h2o_F011" => 26.13597035,
"h2o_F101" => 32.47411476,
"h2o_F11-1" => 29.58963453,
"h2o_F111" => 30.24092207,
}
Correct ?
Wow, it works brilliantly on the full file, thank you tybalt89. I will study your solution to make sure I understand it fully. I hope you wouldn't mind if I get stuck and ask a clarification question, please.
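For anyone studying the regex in the solution above, the same pattern can be spread over several lines with comments. This is a sketch on a shortened, hypothetical sample; it uses m{...} delimiters and escaped dots in the zero value, which are small deviations from the posted code.

```perl
use strict;
use warnings;

# Hypothetical two-block sample in the layout of the real file.
my $text = <<'END';
Nr Compound ln(gamma)
1 F002 4.66656083
2 F011 26.13597035
3 h2o 0.00000000
4 acetonitrile 2.14102090
Nr Compound ln(gamma)
1 F002 1.69945785
2 F011 0.74578421
3 h2o 2.08530828
4 acetonitrile 0.00000000
END

my %solvent_face;
while ($text =~ m{
        \b (F\S+)            # capture 1: a word starting with F
        \h+ ([-\d.]+)        # capture 2: its number, after horizontal space
        (?=                  # lookahead: peek forward without consuming text
            .*?              #   skip as little as possible (dot matches newline)
            (\S+)            #   capture 3: a name ...
            \h+ 0\.00000000  #   ... whose value is exactly zero
        )
    }gsx)
{
    $solvent_face{"${3}_$1"} = $2;
}

print "$_ => $solvent_face{$_}\n" for sort keys %solvent_face;
```

Because the lookahead consumes nothing, the next match resumes right after the F line, so every F entry is paired with the nearest following zero-valued name. A consequence worth knowing: if a block had no zero-valued entry, its F lines would be paired with the zero solvent of a later block.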
Re: Help with parsing a file (hash keys are unique)
by LanX (Saint) on May 30, 2022 at 17:19 UTC
This looks very different to what you presented first, especially "F11-1" doesn't fit any description yet.
Furthermore, I have to express great doubt that the desired output is really clever.
> showing 2 blocks only (the actual file has 61).
61 blocks and 9 solvents, means you will have many cases where the same "solvent" is chosen.
Hence your projected hash-keys <solvent>_<F...> will be overwritten each time you have such a collision. (hash keys are unique!!!)
Do you really want the data of max 9 last different solvents only?
I'd say what you really need is an AoH = array of hashes
@res = (
{
solvent => 'h2o',
F001 => 1.2,
F101 => 3.2,
...
},
{
solvent => 'acetonitrile',
F001 => 2.2,
F101 => 7.2,
...
},
...
);
And of course you have to be sure that there is only one solvent = 0 in each block, otherwise ...
edit
or a HoAoH (hash of arrays of hashes)
$res{'h2o'}[0]{F001}= 2.2;
...
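A rough sketch of how such an array of hashes could be built, including the check that each block has exactly one zero-valued solvent. The sample data, the split on the "Nr Compound" header and the rule that any non-F name is a solvent are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample: two blocks in which the same solvent is the zero entry.
my $text = <<'END';
Nr Compound ln(gamma)
1 F002 4.66656083
2 F011 26.13597035
3 h2o 0.00000000
4 acetonitrile 2.14102090
Nr Compound ln(gamma)
1 F002 1.69945785
2 F011 0.74578421
3 h2o 0.00000000
4 acetonitrile 0.44524110
END

my @res;    # array of hashes, one hash per block
for my $block (split /^Nr\s+Compound.*\n/m, $text) {
    my (%faces, @zero);
    for my $line (split /\n/, $block) {
        my ($nr, $name, $value) = split ' ', $line;
        next unless defined $value and $nr =~ /^\d+$/;   # data lines only
        if ($name =~ /^F/) { $faces{$name} = $value }
        elsif ($value == 0) { push @zero, $name }
    }
    next unless %faces;    # nothing before the first header
    die "Expected exactly one zero solvent, found " . @zero . "\n"
        if @zero != 1;
    push @res, { solvent => $zero[0], %faces };
}

for my $rec (@res) {
    my @faces = map { "$_=$rec->{$_}" } grep { /^F/ } sort keys %$rec;
    print "$rec->{solvent}: @faces\n";
}
```

Both blocks survive here even though they share the solvent h2o; with flat keys of the form solvent_F... the second block would have overwritten the first.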
Re: Help with parsing a file
by tybalt89 (Monsignor) on May 29, 2022 at 14:59 UTC
#!/usr/bin/perl
use strict; # https://perlmonks.org/?node_id=11144249
use warnings;
open my $fh, '<', \<<END; # FIXME with normal open
F001 1.2
F101 3.2
solvent1 0
solvent2 3
three lines of text
that are not important
and can be skipped
F001 2.2
F101 7.2
solvent1 5
solvent2 0
END
local $_ = do{ local $/; <$fh> }; # slurp entire file
my %solvent_face;
$solvent_face{ "$3_$1" } = $2 while
/\b (F\w+) \h+ ([\d.]+) (?=.*? ^ (\S+) \h+ 0 \h* \n )/gmsx;
use Data::Dump 'dd'; dd \%solvent_face;
Outputs:
{
solvent1_F001 => 1.2,
solvent1_F101 => 3.2,
solvent2_F001 => 2.2,
solvent2_F101 => 7.2,
}
Re: Help with parsing a file
by tybalt89 (Monsignor) on May 30, 2022 at 18:32 UTC
#!/usr/bin/perl
use strict; # https://perlmonks.org/?node_id=11144249
use warnings;
my $file = 'data.11144249';
open my $fh, '<', $file or die "$! opening $file";
local $_ = do{ local $/; <$fh> }; # slurp entire file
my %solvent_face;
# as requested
$solvent_face{ "$3_$1" } = $2 while
/\b (F\S+) \h+ ([-\d.]+) (?=.*? (\S+) \h+ 0.00000000 )/gsx;
# by solvent, then arrays of each F***
push @{ $solvent_face{ $3 }{ $1 } }, $2 while
/\b (F\S+) \h+ ([-\d.]+) (?=.*? (\S+) \h+ 0.00000000 )/gsx;
# by F***, then arrays of solvent
push @{ $solvent_face{ $1 }{ $3 } }, $2 while
/\b (F\S+) \h+ ([-\d.]+) (?=.*? (\S+) \h+ 0.00000000 )/gsx;
use Data::Dump 'dd'; dd \%solvent_face;
Outputs:
{
"acetonitrile" => {
"F002" => [1.69945785],
"F011" => [0.74578421],
"F101" => [2.67268035],
"F11-1" => [1.64808218],
"F111" => [1.95840198],
},
"acetonitrile_F002" => 1.69945785,
"acetonitrile_F011" => 0.74578421,
"acetonitrile_F101" => 2.67268035,
"acetonitrile_F11-1" => 1.64808218,
"acetonitrile_F111" => 1.95840198,
"F002" => { acetonitrile => [1.69945785], h2o => [4.66656083] },
"F011" => { acetonitrile => [0.74578421], h2o => [26.13597035] },
"F101" => { acetonitrile => [2.67268035], h2o => [32.47411476] },
"F11-1" => { acetonitrile => [1.64808218], h2o => [29.58963453] },
"F111" => { acetonitrile => [1.95840198], h2o => [30.24092207] },
"h2o" => {
"F002" => [4.66656083],
"F011" => [26.13597035],
"F101" => [32.47411476],
"F11-1" => [29.58963453],
"F111" => [30.24092207],
},
"h2o_F002" => 4.66656083,
"h2o_F011" => 26.13597035,
"h2o_F101" => 32.47411476,
"h2o_F11-1" => 29.58963453,
"h2o_F111" => 30.24092207,
}
Re: Help with parsing a file (real input please)
by LanX (Saint) on May 29, 2022 at 19:12 UTC
> Update ...
> ...
> but in the actual file there are three lines of text
Not many will write speculative code for this vague description; at least I won't.
You need to provide a real data snippet "from the actual file".
E.g. it's completely unclear how to distinguish "text" from the rest.
And in general, please show more effort to code it yourself with the techniques learned from this thread.
Otherwise people will start to complain that this is not a "code writing service".
LanX, I completely understand and agree with you, my bad. The help I have received should be enough for me to continue. Could I ask one last question, please, as I couldn't find an explanation on the site? What is the function of the d/l and the select links on the right side of each reply? I assume I need to select an answer using the select link, right? Thank you!
> The help I have received should be enough for me to continue.
first of all please provide a good example.
Folks here love to help but hate to waste time.
> What is the function of the d/l and the select links
They are rarely used, it's about "downloading" <code> sections, i.e. displaying them in original format and isolation without HTML pollution.
I mostly just click the download beneath a code section.