Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"
 
PerlMonks  

Dumb SQL Question

by pileofrogs (Priest)
on Mar 02, 2010 at 19:44 UTC ( [id://826243]=perlquestion: print w/replies, xml ) Need Help??

pileofrogs has asked for the wisdom of the Perl Monks concerning the following question:

Blessings upon ye monks!

First, this question is total Monk abuse. It's an SQL question, not a perl question (though I will be using the advice generated by this question in a perl script.) However, the Monks are the best people on the net, the smartest, the best able to parse my stupidly-worded questions, sexiest, best dressed etc... If I wanted advice buying a car, I would probably ask on PerlMonks...

If you know a good, more appropriate venue for this kind of question, please feel free to include it with your downvote.

Achem.. the question...

I want to get all the rows out of a table where one particular field has the same value in at least two rows. Er... I want all the rows with matching values in a particular field, and I don't know what the values are, I just want the ones where that value matches.

Here's the background:

I'm looking for duplicate files on my computer.

I have two tables in sqlite like so:

create table files ( id integer primary key, path text unique, md5 int +eger); create table md5s ( id integer primary key, md5 text unique);

A process (writen in perl!) is presently running that is filling up these tables such that each file's md5 integer is the id integer in the md5s table. I want to get the names of files that have the same md5sum, and I'm almost positive I can do that with a select statement but I have no idea how to start.

Did I mention I suck at SQL? I suck at SQL.

Again, any links to good resource for a loser like me would be greatly appreciated and might even lead to fewer stupid questions like this wasting your time.

Let the downvotes begin!

--Pileofrogs

Replies are listed 'Best First'.
Re: Dumb SQL Question
by bart (Canon) on Mar 02, 2010 at 20:32 UTC
    This works in SQLite:
    select * from files where md5 in (select md5 from files group by md5 having count(*)>1)

    No need for a second table.

      Woo Hoo! That did it!

      Correct, once the values are in the DB, I only would need the md5s table to print the md5s, which I really don't need to do. The md5 id in the files table is adequate for my purpose.

      Using your suggestion as a jumping off place, I made this final query which does eveything, including the unnecessary printing of the MD5 sum.

      select files.path,md5s.md5 from files,md5s where files.md5 in ( select files.md5 from files group by files.md5 having count(*)>1 ) AND md5s.id=files.md5 order by files.md5;

      Thank You!
      --Pileofrogs

      Update: forgot the "order by". It was in the query when I ran it but fell out when I re-typed it here... D'oh! Thanks bart!

        You can print results in groups of the same files, if you add an "ORDER BY" clause. Personally I prefer to sort by file size, if you store it, but you can use "MD5" as well.
Re: Dumb SQL Question
by zwon (Abbot) on Mar 02, 2010 at 20:18 UTC

    In PostgreSQL you can do it like this:

    SELECT path, mds.md5 FROM ( SELECT md5, COUNT(*) AS cnt FROM files GROUP BY md5) AS mds LEFT JOIN files ON files.md5=mds.md5 WHERE mds.cnt > 1
Re: Dumb SQL Question
by fod (Friar) on Mar 02, 2010 at 20:20 UTC
    try SELECT path FROM files, md5s WHERE files.md5 = md5s.md5

      files.md5 is the integer id corresponding to the text md5s.md5, so the above SQL will return nothing.

      Said another way...

      select files.path,md5s.md5 where files.md5=md5s.id;

      returns a list of every file with it's MD5 sum.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://826243]
Approved by ww
Front-paged by ww
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others meditating upon the Monastery: (7)
As of 2024-04-24 11:31 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found