PerlMonks  

Re: compare files by words

by Zaxo (Archbishop)
on May 31, 2007 at 06:48 UTC


in reply to compare files by words

Different in sequence, or not appearing in the other file? Any sort of uniqueness problem looks like it needs a hash, but is that really the problem you have?
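If the question is the "not appearing in the other file" one, the hash idea could look roughly like this. It is only a sketch: `only_in_first` is a made-up helper name, and the word lists are assumed to be already split out of the two files.

```perl
use strict;
use warnings;

# Hypothetical helper: return the distinct words of @$words1 that never
# occur in @$words2, in order of first appearance.
sub only_in_first {
    my ($words1, $words2) = @_;
    my %seen = map { $_ => 1 } @$words2;   # lookup table for the second file
    my %reported;                          # avoids listing a word twice
    return grep { !$seen{$_} && !$reported{$_}++ } @$words1;
}

my @missing = only_in_first([qw(alpha beta gamma beta)], [qw(alpha gamma)]);
print "only in first: @missing\n";
```

This ignores word order entirely, so it answers a different question from a position-by-position comparison.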

Algorithm::Diff may be helpful, but you haven't really said what you need.

After Compline,
Zaxo

Replies are listed 'Best First'.
Re^2: compare files by words
by sanPerl (Friar) on May 31, 2007 at 10:38 UTC
    Zaxo is correct. I also use Algorithm::Diff to a great extent. It is simple to use (once you understand the nested array structure) and acts like Unix diff.
    To increase the speed, I suggest the following method.
    1) First compare the lines with a plain string comparison; if they are the same, just move ahead to the next pair of lines.
    2) If the lines are NOT the same, then use Algorithm::Diff to find the difference.
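The two steps above might be sketched as follows. This assumes the two files line up line by line, and it does not call Algorithm::Diff itself: `compare_lines` is a hypothetical helper that takes the expensive comparison as a code ref (`$detail`), so a wrapper around Algorithm::Diff could be passed in instead of the simple word counter used here.

```perl
use strict;
use warnings;

# Hypothetical helper: walk two line lists in parallel. Identical lines are
# skipped with a cheap string comparison (step 1); only differing pairs are
# handed to $detail, a caller-supplied code ref (step 2).
sub compare_lines {
    my ($lines1, $lines2, $detail) = @_;
    my $last = @$lines1 > @$lines2 ? $#$lines1 : $#$lines2;
    my @reports;
    for my $i (0 .. $last) {
        my ($x, $y) = ($lines1->[$i], $lines2->[$i]);
        next if defined $x && defined $y && $x eq $y;   # fast path
        push @reports, [ $i, $detail->($x, $y) ];
    }
    return @reports;
}

# Stand-in for the detailed diff: count word positions that differ.
my $count_bad_words = sub {
    my ($x, $y) = @_;
    my @x = split(' ', $x // '');
    my @y = split(' ', $y // '');
    my $last = @x > @y ? $#x : $#y;
    return scalar grep { ($x[$_] // '') ne ($y[$_] // '') } 0 .. $last;
};

my @reports = compare_lines(
    [ "one two three", "four five" ],
    [ "one two three", "four FIVE" ],
    $count_bad_words,
);
print "line $_->[0]: $_->[1] word(s) differ\n" for @reports;
```

The fast path only pays off when most lines are identical, which is the case described in this thread.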
    Regards,
    SanPerl
Re^2: compare files by words
by jajaja (Initiate) on May 31, 2007 at 07:19 UTC
    They are almost the same 2 files. They differ in diacritics only. And I just need to know how many words have different diacritics. I don't need to know details.
      They are almost the same 2 files. They differ in diacritics only. And I just need to know how many words have different diacritics. I don't need to know details.

      In this case your approach above seems fine. Did you try it? Did it fail somehow? One thing you "have" to do is to make it strict-safe. Then, for the word comparison I'd write:

      no warnings 'uninitialized';
      ($words1[$_] eq $words2[$_] ? $good : $bad)++
          for 0 .. (@words1 > @words2 ? $#words1 : $#words2);

      (I suppose you want to count a word as bad if it has no counterpart at all. Otherwise you should change > into <. In the latter case the no warnings line wouldn't be necessary.)
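For reference, a strict-safe version of that one-liner might look like the sketch below. The sub name `count_words` and the sample strings are just for illustration; the counting logic is the same conditional-lvalue trick.

```perl
use strict;
use warnings;

# Hypothetical helper: count the positions where two word lists agree or
# differ. A word with no counterpart in the shorter list counts as bad.
sub count_words {
    my ($words1, $words2) = @_;
    my ($good, $bad) = (0, 0);
    my $last = @$words1 > @$words2 ? $#$words1 : $#$words2;
    no warnings 'uninitialized';   # the shorter list yields undef past its end
    ($words1->[$_] eq $words2->[$_] ? $good : $bad)++ for 0 .. $last;
    return ($good, $bad);
}

my ($good, $bad) = count_words([ split ' ', "a b c d" ], [ split ' ', "a x c" ]);
print "good: $good, bad: $bad\n";
```

Here "b" versus "x" and the unmatched "d" are the two bad positions.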

      Update: you also probably don't want to split on / /, but on ' ' which is more likely to do what you mean, and in fact is also the default.
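To see why that matters, here is a small comparison of the two forms on a made-up line with leading and doubled spaces:

```perl
use strict;
use warnings;

my $line = "  foo  bar baz\n";

# / / splits on every single space character, so leading and doubled
# spaces produce empty fields, and the newline stays glued to the last word.
my @by_regex = split / /, $line;

# ' ' is the special awk-style form: leading whitespace is skipped and
# any run of whitespace (including the newline) acts as one separator.
my @by_space = split ' ', $line;

printf "%d fields with / /\n", scalar @by_regex;
printf "%d fields with ' '\n", scalar @by_space;
```

With / /, the empty fields would be compared as "words" and throw the counts off.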

        How about
        use List::Util "min";
        ...
        # compare the array lengths, not the words themselves
        my $words = min(scalar @words1, scalar @words2);
        $total += $words;
        $bad += grep $words1[$_] ne $words2[$_], 0 .. ($words - 1);
        ...
        print "good:", $total - $bad;
        possibly switching max for min.
