2 Hash Tables, 4 Keys...what to do?

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, Perl Masters,

Okay, I am resubmitting a question I posted yesterday. (I rushed to get yesterday's posting out and it looked awful--sorry about that!) Here is what I need to do:-

1) Two files exist, and each one includes lines formatted like this:

word1 word2 word3 word4 num1 num2 num3 word5 word6 ...

Note: the first 4 words and 3 nums will always exist, but the last string of words (word5 word6, etc.) changes from one line to the next--sometimes it's a list that goes up to word30, sometimes it only goes to (minimum) word6.

2) I need to grab the first 4 entries of each line in file1 and find their match in file2 (e.g. "word1 word2 word3 word4" eq "word1 word2 word3 word4"). If there is a match, print each line from each file!

3) Once a match is made, jump to word5 on the same line and check to see if the string of words at the end (e.g. after num3) is equal. (I don't care about the number of words after num3--if there are 20 words in both files, the 20 words must match and be in the same order.) If unequal, print it!

I have included the fledgling beginnings of some code I have. If someone could recommend what to put in the commented areas, that would be GREAT!

NOTE: doing linear scans over associative arrays is not an option--the files are way too big. That is why I am trying to get hash tables and multiple keys working.

Thank You!

open (IN1s,"$ARGV[0].sum");
open (IN2s,"$ARGV[1].sum");
open (XLOUT,">pt.forxl");
open (ASTONLY,">ast.only");
open (PTONLY,">pt.only");

@in1s = <IN1s>;
@in2s = <IN2s>;

%AstContent = ();

$ln = 0;
while ($in1s[$ln] ne "") {
    chop ($in1s[$ln]);
    @in1sal = split(/\s+/,$in1s[$ln]);
    $astlength = @in1sal; 
    $astlast = $astlength - 1;
    for ($i = 0; $i <= 3; $i++) {
        $AstStartEndWithClocks = join (" ",$AstStartEndWithClocks,$in1sal[$i]);
    }
    for ($i = 7; $i <= $astlast; $i++) {
        $AstMasterList = join (" ",$AstMasterList,$in1sal[$i]);
    }
    $AstContent{$AstStartEndWithClocks,$AstMasterList} = @in1sal;
        #
        # I know the above hash table is wrong, but I don't
        # know how to create a table with 2 keys.  In short,
        # take the current list (@in1sal) and assign 2 keys
        # to it.
        #
    $AstStartEndWithClocks = (); # undef in case the
                                 # same pattern of 4
                                 # words comes up again

    $AstMasterList = ();         # undef this because
                                 # its length can change
                                 # from one line to the
                                 # next...
    undef (@in1sal);
    $ln++;
}

%PTContent = ();

$ln = 0;
while ($in2s[$ln] ne "") {
    chop ($in2s[$ln]);
    @in2sal = split(/\s+/,$in2s[$ln]);
    $ptlength = @in2sal; 
    $ptlast = $ptlength - 1;
    for ($i = 0; $i <= 3; $i++) {
        $PTStartEndWithClocks = join (" ",$PTStartEndWithClocks,$in2sal[$i]);
    }
    for ($i = 7; $i <= $ptlast; $i++) {
        $PTMasterList = join (" ",$PTMasterList,$in2sal[$i]);
    }
    $PTContent{$PTStartEndWithClocks,$PTMasterList} = @in2sal;
        #
        # Same deal as above: I know this is wrong but I
        # don't know how to assign 2 keys to the current
        # list (@in2sal)...
        #
    $PTStartEndWithClocks = ();
    $PTMasterList = ();
    undef (@in2sal);
    $ln++;
}

# Parse each hash table (AstContent and PTContent)--when
# AstStartEndWithClocks and PTStartEndWithClocks match,
# print the result to file XLOUT.

# Now, if there was an 
# AstStartEndWithClocks/PTStartEndWithClocks match, check
# to see of $AstMasterList and $PTMasterList match.  If they
# do NOT, print the line to the screen.

# If AstContent's AstStartEndWithClocks cannot be matched
# in PTContent, write the line to the file ASTONLY.

# If PTContent's PTStartEndWithClocks cannot be matched in
# AstContent, write the line to the file PTONLY.

janitored by ybiC: Balanced <readmore> tags around longish codeblock, to avoid/reduce vertical scrolling


Re: 2 Hash Tables, 4 Keys...what to do? by Roy Johnson (Monsignor) on Jun 03, 2005 at 15:37 UTC
In file1, is word1..word4 unique (does only one line have any specific word1..word4)? Is it feasible to read in the whole file? If so, my approach would be something like:

    my %file1;
    while (<FILE1>) {
        # Capture first four words, and everything after the next 3 words (numbers)
        my ($k1, $k2) = /(\w+(?:\s+\w+){3})(?:\s+\w+){3}(.*)/;
        # Index by k1, store k2 and line
        $file1{$k1} = [$k2, $_];
    }
    while (<FILE2>) {
        my ($k1, $k2) = .... # same regex as before
        if ($file1{$k1}) {
            # print file1 line and file2 line
            print $file1{$k1}->[1];
            print;
            # Check if k2's are equal
            if ($file1{$k1}->[0] ne $k2) {
                print "K2's are not equal! (see lines printed above)\n";
            }
        }
    }

I've probably misunderstood some of what you want, but I think the parsing of the keys could be a significant help to you.

Update: fixed regex as per nobull's note.

Caution: Contents may have been coded under pressure.
Re^2: 2 Hash Tables, 4 Keys...what to do? by nobull (Friar) on Jun 03, 2005 at 16:43 UTC
    /((\w+(?:\s+\w+){4})(?:\s+\w+){3}(.*)/;

I think you meant

    /(\w+(?:\s+\w+){3})(?:\s+\w+){3}(.*)/;
Re: 2 Hash Tables, 4 Keys...what to do? by ikegami (Patriarch) on Jun 03, 2005 at 17:34 UTC
If the input files are sorted (or if you use an external tool to sort them before running the Perl script), a very memory-efficient algorithm can be used:

    my ($f1_k1, $f1_k2);
    my ($f2_k1, $f2_k2);
    my $f1_line;
    my $f2_line;

    my $regexp = qr/^(\w+(?:\s+\w+){3})(?:\s+\w+){3}(.*)/;

    for (;;) {
        # Read from FILE1 if appropriate.
        if (not defined $f1_line) {
            $f1_line = <FILE1>;
            last unless defined $f1_line;

            # Extract the keys for comparison.
            ($f1_k1, $f1_k2) = $f1_line =~ $regexp;

            # Make sure the regexp matched.
            if (not defined $f1_k1) {
                undef $f1_line;
                next;
            }
        }

        # Read from FILE2 if appropriate.
        if (not defined $f2_line) {
            $f2_line = <FILE2>;
            last unless defined $f2_line;

            # Extract the keys for comparison.
            ($f2_k1, $f2_k2) = $f2_line =~ $regexp;

            # Make sure the regexp matched.
            if (not defined $f2_k1) {
                undef $f2_line;
                next;
            }
        }

        # Do the comparisons.
        my $cmp = $f1_k1 cmp $f2_k1;
        if ($cmp == 0) {
            print("Key 1 match:\n");
            print("$f1_line\n");
            print("$f2_line\n");
            if ($f1_k2 eq $f2_k2) {
                print("...and key 2 matches.\n");
            } else {
                print("...but key 2 doesn't match!\n");
            }

            # Only read from one file in case of duplicate keys.
            undef $f2_line;
        }

        # Need to read from FILE1.
        undef $f1_line if $cmp < 0;

        # Need to read from FILE2.
        undef $f2_line if $cmp > 0;
    }

(The code isn't tested, but the idea is sound.)
Re: 2 Hash Tables, 4 Keys...what to do? by Jaap (Curate) on Jun 03, 2005 at 15:16 UTC
What is the size of the files you are typically working with? How fast does it need to be? How much memory does your system have? You could (and should imho) have simplified the problem by making the files have only 2 "columns" each. That column one has 4 words in it doesn't matter, nor does it matter that there are numbers between columns 1 and 2.
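To illustrate the two-"column" view, here is a minimal sketch, assuming whitespace-separated fields laid out as in the question (four words, three numbers, then the trailing words). The helper name `two_columns` is made up for this example and does not come from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce a line to two "columns".
# Column 1 is the first four words; column 2 is every word after the
# three numbers that follow them. The numbers themselves are dropped.
sub two_columns {
    my ($line) = @_;
    my @f = split ' ', $line;          # also discards the trailing newline
    return if @f < 7;                  # malformed line: too few fields
    my $key  = join ' ', @f[0 .. 3];
    my $tail = join ' ', @f[7 .. $#f]; # empty string if nothing follows
    return ($key, $tail);
}

my ($key, $tail) = two_columns("w1 w2 w3 w4 1 2 3 w5 w6\n");
print "$key | $tail\n";
```

Once every line is reduced this way, column 1 can serve as a hash key and column 2 as the value to compare.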
Re: 2 Hash Tables, 4 Keys...what to do? by GrandFather (Saint) on Jun 04, 2005 at 01:22 UTC
The main thing here is that there is only one key: the first four words.

Raise your eyes and look out of the rut.
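GrandFather's own code is not reproduced here. The following is only a rough sketch of the one-key idea, not his implementation. It assumes the first four words are unique within file 1 and that lines are well formed; the names `key_and_tail` and `compare_lines` and the bucket names are invented for illustration. The four buckets correspond loosely to the XLOUT, screen, ASTONLY and PTONLY outputs in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: split a line into its key (first four words)
# and its tail (every word after the three numbers).
sub key_and_tail {
    my ($line) = @_;
    my @f = split ' ', $line;
    return (join(' ', @f[0 .. 3]), join(' ', @f[7 .. $#f]));
}

# Compare two lists of lines using a single hash keyed on the first
# four words. Returns a hash ref with four buckets.
sub compare_lines {
    my ($lines1, $lines2) = @_;
    my (%file1, %matched);
    my %out = (both => [], differ => [], only1 => [], only2 => []);

    # Index file 1 by its single key (assumes keys are unique in file 1).
    for my $line (@$lines1) {
        my ($key, $tail) = key_and_tail($line);
        $file1{$key} = [$tail, $line];
    }

    # Look each file 2 line up by the same key.
    for my $line (@$lines2) {
        my ($key, $tail) = key_and_tail($line);
        my $hit = $file1{$key};
        if ($hit) {
            $matched{$key} = 1;
            push @{ $out{both} }, [$hit->[1], $line];
            push @{ $out{differ} }, $key if $hit->[0] ne $tail;
        }
        else {
            push @{ $out{only2} }, $line;
        }
    }

    # Whatever was never matched exists only in file 1.
    for my $key (sort keys %file1) {
        push @{ $out{only1} }, $file1{$key}[1] unless $matched{$key};
    }
    return \%out;
}

my $result = compare_lines(
    ["a b c d 1 2 3 x y\n", "e f g h 1 2 3 x y\n"],
    ["a b c d 4 5 6 x z\n", "q r s t 1 2 3 x y\n"],
);
printf "%d matched, %d with different tails\n",
    scalar @{ $result->{both} }, scalar @{ $result->{differ} };
```

Every lookup is a hash access, so no linear scan over either file's contents is needed beyond reading each file once.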
Re^2: 2 Hash Tables, 4 Keys...what to do? by duff (Parson) on Jun 04, 2005 at 01:57 UTC
If the input files can be reordered (sorted), then they don't have to be read entirely into memory but rather one line from each at a time. It seems at first blush that this could be another one of those "simplifying assumptions".

duff
Re^3: 2 Hash Tables, 4 Keys...what to do? by GrandFather (Saint) on Jun 06, 2005 at 20:26 UTC
Bows humbly, "Yes master". Mumbles "figured that out for me-self, but it was too late then".

Food for thought, not fuel for flames
Re: 2 Hash Tables, 4 Keys...what to do? by nobull (Friar) on Jun 03, 2005 at 16:37 UTC
"I am resubmitting a question I posted yesterday. (I rushed to get yesterday's posting out and it looked awful--sorry about that!)"

You should enroll in the monastery yourself. It's quick and painless. Then you'll be able to update your nodes.
Re: 2 Hash Tables, 4 Keys...what to do? by TedPride (Priest) on Jun 03, 2005 at 17:17 UTC
Just use the first four fields, plus delimiters, as your hash key. Put the contents of one file in an array and associate line numbers with hash key:

    for (@file1) {
        $h{join ' ',(split / /)[0..3]} = $c++;
    }

Then run through the other file, matching key and printing if rest isn't equal:

    while (<DATA>) {
        $key = join ' ',(split / /)[0..3];
        if (exists $h{$key}) {
            print if $_ ne $file1[$h{$key}];
        }
    }

You'll want to change the delimiter to something other than spaces, but the following is a working section of code for demonstration purposes:

    use strict;
    use warnings;

    my ($c, %h, $key);
    my @file1 = (
        "A B C D E F G\n",
        "A C B D E F G\n",
        "D B C A E F G\n",
        "B B C D E F G\n"
    );

    for (@file1) {
        $h{join ' ',(split / /)[0..3]} = $c++;
    }

    while (<DATA>) {
        $key = join ' ',(split / /)[0..3];
        if (exists $h{$key}) {
            print if $_ ne $file1[$h{$key}];
        }
    }

    __DATA__
    A B C D E F G
    A C B D E F H
    D B C A I F G
    B B C D E F G
Re: 2 Hash Tables, 4 Keys...what to do? by TedPride (Priest) on Jun 03, 2005 at 17:44 UTC
Well yes, but what are the chances the files are sorted? You could also assume they aren't sorted, make a pass through both to find out which lines match between the two, pass through one again to retrieve just those lines, and pass through the other to match. This would only require storing lines with a key match in memory instead of an entire file.
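The first passes of that multi-pass idea might look roughly like the sketch below. It is an illustration under assumptions, not TedPride's code: the function names are invented, lines are assumed to be well formed, and the set of file 1 keys (though not the full lines) still has to fit in memory. The demo passes references to strings so it runs without real files; ordinary file names work with the same `open` calls.

```perl
use strict;
use warnings;

# First four whitespace-separated words of a line.
sub key_of {
    my ($line) = @_;
    return join ' ', (split ' ', $line)[0 .. 3];
}

# Passes 1 and 2: collect the keys of file 1, then note which of them
# also occur in file 2. Only keys are stored, not whole lines.
sub common_keys {
    my ($path1, $path2) = @_;
    my (%in1, %common);

    open my $fh1, '<', $path1 or die "Cannot open first file: $!";
    while (my $line = <$fh1>) {
        $in1{ key_of($line) } = 1;
    }
    close $fh1;

    open my $fh2, '<', $path2 or die "Cannot open second file: $!";
    while (my $line = <$fh2>) {
        my $key = key_of($line);
        $common{$key} = 1 if $in1{$key};
    }
    close $fh2;

    return \%common;
}

# Pass 3: re-read one file and keep only the lines whose key is shared.
sub lines_for_keys {
    my ($path, $keys) = @_;
    my %line_for;
    open my $fh, '<', $path or die "Cannot open file: $!";
    while (my $line = <$fh>) {
        my $key = key_of($line);
        $line_for{$key} = $line if $keys->{$key};
    }
    close $fh;
    return \%line_for;
}

# Demo with in-memory "files" (a reference to a string opens as a file).
my $file1  = "a b c d 1 2 3 x\ne f g h 1 2 3 y\n";
my $file2  = "e f g h 9 9 9 z\nq r s t 1 2 3 y\n";
my $common = common_keys(\$file1, \$file2);
my $kept   = lines_for_keys(\$file1, $common);
print "kept: $_" for values %$kept;
```

A final pass would stream the second file and compare each line whose key is in `$kept` against the stored line.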
Re^2: 2 Hash Tables, 4 Keys...what to do? by ikegami (Patriarch) on Jun 03, 2005 at 17:53 UTC
"Well yes, but what are the chances the files are sorted?"

100% if the application that produced them produces sorted files. And if it doesn't, you could use a speed- and memory-efficient program (possibly written in C) to do the sorting beforehand.

