Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses
 
PerlMonks  

Re: comparing file contents

by Util (Priest)
on Nov 22, 2011 at 23:09 UTC ( [id://939555]=note: print w/replies, xml ) Need Help??


in reply to comparing file contents

The problems are numerous. A line-by-line walkthrough is needed to cover them. Since you have persevered, and desire to improve your Perl skills, here is that walkthrough:
  • #!/usr/bin/perl
    This line needs to be followed with these two lines:
        use strict;
        use warnings;
    The "strict" and "warnings" lines turn on error checking for common mistakes, and force you to declare your variables.
  • print "reading 'ansvarig_vardenhet_utf8.txt'\n";
    Google says that "Ansvarig vardenhet" is Swedish for "Responsible care unit". If that is true, then this print statement (and its twin further down) is lying. They are not reading 'ansvarig_vardenhet_utf8.txt'; they are reading 'icd-10-codes.txt' and 'female1.txt'. If you just need reassurance about what point your program has reached, then use something like print "P1\n"; and print "P2\n";. Otherwise, include the actual file names. Do not lie; it will just confuse the next person who works on the program.
  • open(WIN, "icd-10-codes.txt");
    As written, if the ICD file is missing, then your program will seem to run correctly, but will silently fail to do any replacements. Silent failure is not acceptable. die() is used to halt a program on a critical error; it is often coupled with or so that the die happens when the first part of a statement fails. The $! variable contains the reason for the failure.
         open(WIN, "icd-10-codes.txt") or die "Cannot open file: $!"; .
    Some other changes that would be improvements, but are not mandatory:
    • You are not telling Perl whether to read or write the file, so it defaults to read. It is better to be explicit.
          open( WIN, '<', "icd-10-codes.txt" )
    • WIN is a "bareword" filehandle. We now encourage the use of a real variable for the filehandle.
          open( my $win_fh, '<', "icd-10-codes.txt" )
    • If you use a variable to hold the filename, then you can locate the actual filename at the top of the program, where such configurable info is usually located. It will also allow you to put the filename in the "die" error message, and in the "reading ansvarig_vardenhet_utf8.txt" line already discussed, without repeating yourself.
          my $icd_file = 'icd-10-codes.txt';
          #print "Reading '$icd_file'\n";
          open( my $win_fh, '<', $icd_file ) or die "Cannot open '$icd_file': $!";
    By the way, the missing "strict" and "warnings" lines, and the missing '<' in "open", tell me that you are learning Perl from outdated material. I highly recommend Learning Perl or Modern Perl (or its free PDF).
  • while($line = <WIN>) {
    Since we added "use strict", we now have to declare all variables, such as $line. Also, $line needs to have its newline removed, otherwise the ICD description will have the newline embedded in it, which will mis-format your replacements. The chomp() function removes a trailing newline if one is present.
        while ( my $line = <WIN> ) {
            chomp $line;
  • @columns = split(/\t/, $line);
    Since each line has exactly two fields, you should name them. You can split() any extra fields into a generic array, to help find any future problems with the file formatting.
        my ($lookup_code, $icd_code_and_text, @rest) = split /\t/, $line;
        warn "Extra tab in file '$icd_file at line $.\n" if @rest;
  • #print "$columns[1]";
    That line is clearly for debugging, but if you use it, the output will be indistinct from its neighbors. You need a terminating newline.
        #print "$columns[1]\n";
  • foreach my $value (@columns) { #print "$value\n" }
    When you need to inspect more than a single simple variable, use Data::Dumper. In fact, use it to inspect even single simple variables, too.
        # This "use" line goes at the top, just after "use warnings".
        use Data::Dumper; $Data::Dumper::Useqq=1;

        print Dumper \@columns;
    Also, your closing brace is not lined up with the 'f' in 'foreach'. The Perl compiler does not care, but alignment will help any human trying to read your code.
  • close(WIN);
    This would need "or die" if it was anything more complex than *reading* a file on a *local* filesystem. It is fine as-is.

Note that you have not preserved *any* data from the ICD file, so no ICD lookups can be done below this point! We will discuss how to fix this later, by copying the needed info to a hash or array.

(OK, before I added the my before @columns, you did technically preserve *some* ICD data, but since it was only the data from the last line in the file, it hardly counts.)

  • print "reading 'ansvarig_vardenhet_utf8.txt'\n"; open(IN, "female1.txt"); while($line = <IN>) {
    Fix as described above, including chomp().
  • %col = split(/ /, $line);
    No. This is nonsense. It would only make sense if you had a file with a single line, which contained all the keys and values separated by spaces. Instead of a hash, we need an array (declared, of course).
        my @cols = split / /, $line;
  • #print "$columns[1]\n";
    Use Data::Dumper.
  • foreach my $val (%col) {
    "for" is easier to type than "foreach", and either of them work. We have already changed from hash to array.
        for my $val (@cols) {
  • if($val == $columns[0]){ $val =~ s/$val/$columns[1]/g;
    The == operator compares two vars *numerically*. The eq operator compares two vars as strings, so it is the one you want. In other words, "3" == "3.0", but "3" ne "3.0". Also, if you have just determined that $val is an exact match, then s/// is overkill; just replace $var by assigning to it.
        if ( $val eq $columns[0] ) {
            $val = $columns[1];
    Note that, even with those changes, this code will not work, because @columns does not contain the whole table of ICD translations. Even if it did contain them, you would need *two* loops, nested, for this method to work. That would be needlessly complex and inefficient. Later, we will use a simple hash instead of @columns and two loops.
  • print IN ">>$val\n";
    This looks confused. print IN $anything_at_all will write to the IN file. Are you trying to overwrite the data in 'female1.txt'? Luckily, Perl will not let you, because you opened the file as read-only. Your original description says "...should be replaced with ICD-code files corresponding...", but you are not outputting the entire line. Is the '>>' a debugging marker, or does it indicate that you are trying to append to a file? Going by your original description, I think you need:
         print join( ' ', @columns ), "\n";
  • close(IN);
    Again, fine when reading locally.
  • I am a beginner in PERL.
    You are a beginner in Perl. (See PERL as shibboleth and the Perl community.)
    Welcome to the Perl Community!

The problem with your *approach* is just the failure to preserve the ICD data from the first half of the program so that the second half can use it. This can be done by declaring %lookup_icd above the first while() loop, and then within the while() loop, use:
    $lookup_icd{$icd_code} = $icd_code_and_text;

With a few more additions, it would look like this:

#!/usr/bin/perl use strict; use warnings; use Data::Dumper; $Data::Dumper::Useqq = 1; my $icd_path = 'icd-10-codes.txt'; my $in_path = 'female1.txt'; my $out_path = 'female1_corrected.txt'; open my $icd_fh, '<', $icd_path or die "Cannot open '$icd_path': $!"; #print "Reading '$icd_path'\n"; my %lookup_icd; while ( my $line = <$icd_fh> ) { chomp $line; my ( $lookup_code, $icd_code_and_text ) = split /\t/, $line; # print Dumper $lookup_code, $icd_code_and_text; if ( exists $lookup_icd{$lookup_code} ) { warn "Replacing $lookup_code;" . " was '$lookup_icd{$lookup_code}'," . " now '$icd_code_and_text'\n"; } $lookup_icd{$lookup_code} = $icd_code_and_text; } close $icd_fh; #print Dumper \%lookup_icd; #print "Reading '$in_path'\n"; #print "Writing '$out_path'\n"; open my $in_fh, '<', $in_path or die "Cannot open '$in_path': $!"; open my $out_fh, '>', $out_path or die "Cannot open '$out_path': $!"; while ( my $line = <$in_fh> ) { chomp $line; my @cols = split / /, $line; for my $possible_icd (@cols) { my $replacement_icd = $lookup_icd{$possible_icd}; if ($replacement_icd) { $possible_icd = $replacement_icd; } } print {$out_fh} join( ' ', @cols ), "\n"; } close $in_fh; close $out_fh or warn "Cannot close '$out_path': $!\nSome data may not have been + written.";

I strongly recommend spending some time with either of the books I mentioned above. You have ventured a little past "baby Perl", and so now will be better served by tutorial than by blind exploration.

Replies are listed 'Best First'.
Re^2: comparing file contents
by kumaar1986 (Initiate) on Nov 23, 2011 at 00:33 UTC

    Thanks for your step by step explanation of the code and how to approach to make it right. the explanation is so informative and it helped me to learn insight of PERL. Thanks a lot.

Re^2: comparing file contents
by kumaar1986 (Initiate) on Nov 23, 2011 at 01:27 UTC

    This looks confused. print IN $anything_at_all will write to the IN file. Are you trying to overwrite the data in 'female1.txt'? Luckily, Perl will not let you, because you opened the file as read-only. Your original description says "...should be replaced with ICD-code files corresponding...", but you are not outputting the entire line. Is the '>>' a debugging marker, or does it indicate that you are trying to append to a file? Going by your original description, I think you need:

    I need to overwrite the result in 'female1.txt' instead of writing in a new output file

    .

    can that be achieved by changing the 'female1.txt' in to read and write mode

    print {$out_fh} join( ' ', @cols ), "\n"

    to

    print {$in_path} join( ' ', @cols ), "\n"

    Let me know if there is any other way to do this. thanks for your help in advance. your explanations are really good. I appreciate your time and effort for helping

      Yes, it could be achieved by opening the file in read+write mode, or by re-opening the file in write mode after its read-only filehandle was closed. However, either of these methods would require other code changes, to hold all the lines of output until after all the input had occurred. The simplest way is to append this code to the bottom of my previous code:
      unlink $in_path or die "Could not delete '$in_path': $!"; use File::Copy; move( $out_path, $in_path ) or die "Could not move file '$out_path' to '$in_path': $!";

        Final help with this program please. Instead of processing just the female1.txt how to process all the files in the directory and its sub directories and compare it with the ICD-codes-text file.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://939555]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others admiring the Monastery: (7)
As of 2024-04-23 18:26 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found