I use a database often, mostly for structuring and holding the data at EnergyTeachers.org. I usually use MySQL, but also sometimes sqlite. Here are some of my tricks.
It's one thing to identify data that is duplicate, and you see hints how to do it by counting members in a group grouped by what field(s) should be duplicate and showing those groupings that have a count larger than one. But we often want to do something to the duplicate, like delete it. A better method in SQL is to create two tables from one and find pairings across them. Here's the code for showing all duplicate rows where two fields match exactly:
SELECT t1.* FROM tablename as t1, tablename as t2 WHERE t1.uniqueid<>t2.uniqueid AND t1.field1=t2.field1 AND t1.field2=t2.field2 ORDER BY t1.field1, t1.field2
The first WHERE condition ensures we don't match rows to themselves. This only works on tables where there is a unique id.
We use ORDER BY to show the duplicates in adjacent rows in the result.
Finding multiple matches in a matching table
In the following example we search for pics with tags 1 and 3 but not 2 or 4:
SELECT picid, GROUP_CONCAT(tagid ORDER BY tagid) AS thispicstagids FROM matches GROUP BY `picid` HAVING thispicstagids REGEXP '(^|,)1(,[0-9]*)*,3(,[0-9]*)*,73(,|$)' AND thispicstagids NOT REGEXP '(^|,)2(,|$)' AND thispicstagids NOT REGEXP '(^|,)4(,|$)'
We create an ordered list of all the tags of each pic with the GROUP_CONCAT function.
Then we use a regular expression to search that list. Here's a breakdown of that expression: '^' starts us at the beginning of the string. If this wasn't here, due to the nature of the next part, we might match numbers that ended with the figures (like allowing '24' when you only wanted '4').
'([0-9]*,)*' searches for any amount (the last asterisk) of the group in parenthesis, being any amount of digits (the class declared by the square brackets) followed by a comma.
'1' searches for our first search id. Replace with the first figure in your array.
'(,[0-9]*)*' searches for any amount of the group of a comma and any number of digits. This is where the search should end if you are searching for only one tag.
',' then there should be at least one comma between the first tag and the second. Only insert this comma if you're looking for more than one tag.
'3' searches for our second search id. Replace with the second and following figures in your array.
'(,[0-9]*)*' searches for any amount of the group of a comma and any number of digits, allowing for tags to appear after your last search figure. If this wasn't here, we'd match 3 when 33 was present.
'$' matches the end of the string, for the same reason we matched the beginning.
The negative searches use NOT REGEXP, and search for our figures 2 and 4 preceded exclusively either by the beginning of the string or a comma, and followed exclusively either by a comma or the end of the string