Tuesday, February 24, 2009

Duplicate Entries in a Mapping Table with All Guid/Uniqueidentifier Columns

I recently ran into a case where a custom .net membership provider was incorrectly adding a record to a map table every time a user was updated. This wouldn't normally be a problem, except SharePoint's user security has a couple hang-ups when the number of groups doesn't match the distinct group list.

Needless to say, there was a LOT of duplicate records in the table; more than I was willing to weed through by hand. After performing the obligated web search, I couldn't find anything that fit my situation. Common Table Expressions looked promising, but had several limitations that I wasn't able to work around.

The problem could be prevented from the beginning with the correct keys/constraints, but that was not the case and I needed to clean up the table before making the correction.

The first bit of information that tells me how I am going to solve the issue is the DBMS. In this case it is an SQL Server 2005 instance.

Lets setup an example, you have a table with 3 "uniqueidentifier" columns: "ID", "FromMap", "ToMap".
CREATE TABLE [dbo].[DuplicateGuidMap](
 [ID] [uniqueidentifier] NOT NULL,
 [FromMap] [uniqueidentifier] NOT NULL,
 [ToMap] [uniqueidentifier] NOT NULL
) ON [PRIMARY]
Now lets populate the table. You can skip this part, nothing unusual here.
INSERT INTO [Test].[dbo].[DuplicateGuidMap]
           ([ID]
           ,[FromMap]
           ,[ToMap])
SELECT '0f6f2339-4597-4933-b8b0-de04804f4c04','e0bb5dad-f30e-4bcb-a4ec-12363af0e6cd','8a15fde9-f9f3-4ecb-acd4-9c78274f9d95' UNION ALL SELECT
'1a0d0991-6a12-4621-901d-0ca58a4fbf5a','88c3d3aa-663c-41e9-bdcf-f9022604b5fc','d4757f79-3452-4a3e-a4d8-1f2ea3a26c2a' UNION ALL SELECT
'14bc9408-9537-4d2a-b917-6e1b44099799','778cffba-a14b-447f-8648-96d640ae1b93','dd20127d-33ab-4d91-9202-8196a2ca90d6' UNION ALL SELECT
'3801a162-cdef-4ffc-b4b5-928f345fe68f','77121133-8c4f-4e28-b9fa-048707c41a7a','faae6e93-e91d-4729-8b69-367401130a42' UNION ALL SELECT
'e8386333-0e89-4ec0-8b2a-e16ae9544290','d385ea64-68f9-421c-bda0-3c0d9fd785e8','442fd8b3-a5d3-4074-9886-e9104d0f2421' UNION ALL SELECT
'a4b5dacb-3f21-43ea-b516-e7055c3831d1','10b7ec11-0cb8-4128-a146-2ca2ef0a9594','c09aa0e7-96fd-41fa-8fce-105a4ce908ee' UNION ALL SELECT
'2c340b29-cd5d-4036-92af-7ae24bdc47d9','dde6379c-6626-46c0-ac70-b7ddd0377e62','5ac17542-fc1d-4581-a616-5dd602e5f34f' UNION ALL SELECT
'd6d56501-98e0-4662-a72c-116b6e2a2e4c','08d0200d-375a-4022-af9e-bdc5ff4a57fe','4e001ae6-d8a0-45b5-a30a-1ab95b66dae4' UNION ALL SELECT
'ee392830-d077-48c0-a817-5a2169dc8bb6','abd1ed2d-3f4d-4f8d-9240-7aebc46ff021','3fdc2214-e11d-411d-b524-7cdd57037f6c' UNION ALL SELECT
'a37b664d-1735-444b-a326-2d23f8bf0b55','c0a1a8a0-efdd-4740-807a-f5f0aaba90ed','f95a288d-3e6c-43a4-9b59-54d382825e99' UNION ALL SELECT
'1d8a2558-0f50-4765-98a9-dd2cf2127bcf','e0bb5dad-f30e-4bcb-a4ec-12363af0e6cd','8a15fde9-f9f3-4ecb-acd4-9c78274f9d95' UNION ALL SELECT
'577a05de-4bbc-4b8d-93dc-22f9dd838e43','e446232a-02bb-47a0-b9a1-e5fab5c0aba3','faae6e93-e91d-4729-8b69-367401130a42'
We will assume that the relationship of "FromMap" to "ToMap" is 1 to Many, meaning that each "ToMap" should be unique. A good place to start would be to find out how many "ToMap" duplicates there are.
SELECT DISTINCT ToMap
  FROM DuplicateGuidMap
GROUP BY ToMap
HAVING Count(ToMap) > 1
There are 2 potential sets of duplicate records in the table now. ToMap:
  • FAAE6E93-E91D-4729-8B69-367401130A42
  • 8A15FDE9-F9F3-4ECB-ACD4-9C78274F9D95
Getting the list of all of the duplicated records is a trivial join, but I want a query that would return only the duplicate rows (leaving one of the duplicated records). Many DBA's who have worked for years would use cursors and possibly create functions. This is all well and good, but they are rather slow and consume a lot of CPU time. The idea of one SQL statement to rule them all doesn't come to mind. I tend to think in sets. I haven't found too many things I can't do in one Select statement. Knowing this, I banked heavily on my understanding of how SQL works and created the following query.
SELECT 
        dgm.ToMap
   , dgm.FromMap
   , dgm.ID
  FROM DuplicateGuidMap dgm
WHERE ToMap IN (
 SELECT DISTINCT ToMap
   FROM DuplicateGuidMap
 GROUP BY ToMap
 HAVING Count(ToMap) > 1
) AND ID NOT IN (
 SELECT TOP 1 ID
   FROM DuplicateGuidMap
 WHERE ToMap = dgm.ToMap
)
ORDER BY ToMap
The meat of the query lies in the Where clause.
ID NOT IN (
 SELECT TOP 1 ID
   FROM DuplicateGuidMap
 WHERE ToMap = dgm.ToMap
)
What's happening here? Since we are currently in the Where clause, we are banking on the fact that it is evaluated for every row that passes the first condition. So in the subquery, we are selecting the record we want to keep. The Where clause "ToMap = dgm.ToMap" grabs the "ToMap" in the main query for use in the subquery. This functionality works almost the same way as a SQL function without having to load and unload the function from memory every time. Having this included in the statement makes the query run faster. In a table with ~10,000 records, ~79 unique duplicates, and ~196 duplicates to remove the query took "00:00" (min:sec).

There is a couple points to make before we proceed. The Guid 'FAAE6E93-E91D-4729-8B69-367401130A42' case is not a valid dupicate and never occurred in the scenario. I added this case to dive farther into controlling which duplicates are actually valid. The only valid duplicate is '8A15FDE9-F9F3-4ECB-ACD4-9C78274F9D95' because the "FromMap"'s are the same. This is the solution I developed for originally.

We need to look at the "ID" Where clause. We can refine it to be: "FromMap = dgm.FromMap AND ToMap = dgm.ToMap".

The final query looks like the following. It returns a list of the duplicate records ready to delete leaving one of the duplicate records.
SELECT 
        dgm.ToMap
   , dgm.FromMap
   , dgm.ID
  FROM DuplicateGuidMap dgm
WHERE ToMap IN (
 SELECT DISTINCT ToMap
   FROM DuplicateGuidMap
 GROUP BY ToMap
 HAVING Count(ToMap) > 1
) AND ID NOT IN (
 SELECT TOP 1 ID
   FROM DuplicateGuidMap
 WHERE FromMap = dgm.FromMap AND ToMap = dgm.ToMap
)
ORDER BY ToMap
Never stop thinking outside the box. Most of the time you can find a simple answer with a little extra thought.

No comments: