# Excel semi-duplicates



## tech313 (Sep 2, 2008)

I've got an excel file with 1 sheet. It has 11 columns and about 8000 rows. There are a lot of duplicates in here that I need to remove. The problem is that they are not EXACT duplicates and so the easy way doesn't work. There may be an extra space here and there or a period or something like that. Does anyone have a macro that could go through and compare each row, taking into consideration these slight differences and kill the duplicates. It seems like it would be hard to code such a macro but somebody else in the world must have come across this problem and been able to take care of it without going through the entire file (multiple files actually) MANUALLY!

Please, thank you.


----------



## RSpecianjr (Jan 20, 2010)

Hey tech313,

Are you wanting to just compare the only the alpha-numeric strings to each other? ie not taking into account spacing, special characters, commas, periods etc

What do you want to do with the duplicates? Do you want to clear the cells that they are in or do you want to delete the cell and shift the rest of the data?

Regards,

Robert Specian Jr.


----------



## tech313 (Sep 2, 2008)

Yes basically that is what I want. It is a list of customer names, addresses and phone numbers. There is also a customer ID as the first field but I want to ignore that because it will be unique for each entry even though it may be the same customer data. If there is a dupe I want the entire row deleted and everything moved up.

Thanks!


----------



## RSpecianjr (Jan 20, 2010)

Hey tech313,

If you post an example workbook I can see what I can do for you.

Just to clarify, if any data other than CustomerID matches any other data, you want to remove the entire row of duplicate data, including the unique CustomerID.

Regards,

Robert Specian Jr.


----------



## tech313 (Sep 2, 2008)

Thanks for the reply. I am attaching an example workbook. To answer your question, yes. But the tricky part is this: notice in the test file, customers 1 through 3, and customer 10. They are the same customer, but they have unique 'custno' and different 'id' and also, within their name and address may be an extra space or period. On those that are essentially dupes, I need to delete any duplicate rows and leave 1 of each. There may be 3 or 4 of the same customer in my actual file. Thanks again for taking a look at this.


----------



## RSpecianjr (Jan 20, 2010)

Hey tech313,

I have the following code for you. See if this is what your looking for:


```
Function NewString(ByVal StartString As String) As String
'set variables
Dim TextLen, n As Integer
Dim sChar As String

TextLength = Len(StartString)
sChar = ""
        
        'loop through each character in string
        For n = 1 To TextLength
            'find what the current character is
            sChar = Mid(StartString, n, 1)
            'if 'IsAlphaNumeric' function returns true, then it will add the new alphanumeric
            'characters to a new string
            If IsAlphaNumeric(sChar) Then
                NewString = NewString + sChar
            End If
        Next
        
End Function

Private Function IsAlphaNumeric(ByVal sChr As String) As Boolean
    'Returns true if the Character is Uppercase, Lowercase, or a number
    IsAlphaNumeric = sChr Like "[A-Za-z0-9]"
End Function



Sub DeleteDuplicates()
'set variables
Dim EndRw, NumOfDups, C, DupRw As Long

'find last used row
EndRw = Worksheets("tcst10").Range("A65536").End(xlUp).Row

'uses 'NewString' function to remove all characters except alpha-numeric for "cname", "caddress1"
'"caddress2", "caddress3", "cphone"
With Worksheets("tcst10").Range("M2:P" & EndRw)
.FormulaR1C1 = "=NewString(RC[-11])"
.Value = .Value
End With

With Worksheets("tcst10").Range("Q2:Q" & EndRw)
.FormulaR1C1 = "=NewString(RC9)"
.Value = .Value
End With

'if the alphanumeric information has a duplicate, it will produce a 1 value, otherwise ""
With Worksheets("tcst10").Range("R2:R" & EndRw)
.FormulaR1C1 = "=if((Countif(C[-3],RC[-3])+Countif(C[-2],RC[-2])+Countif(C[-1],RC[-1])+Countif(C[-4],RC[-4])+Countif(C[-5],RC[-5])-5+COUNTBLANK(RC[-5]:RC[-1]))>0,1," & Chr(34) & Chr(34) & ")"
End With

'find the row of the first duplication
Worksheets("tcst10").Range("S1").Value = "=MATCH(1,R:R,0)"

'count the total number of duplications
NumOfDups = Application.WorksheetFunction.CountIf(Worksheets("tcst10").Range("R:R"), 1)

'as we remove duplications, the total number will decrease, error handling is needed
'for when there are no duplications left
On Error Resume Next

'loop through duplications deleting the first instance of the duplication
For C = 1 To NumOfDups
'Get first duplication row
DupRw = Worksheets("tcst10").Range("S1").Value
'if there are no duplications left, goto cleanup
If Err = 13 Then GoTo cleanup
'delete duplicate information
Worksheets("tcst10").Rows(DupRw).EntireRow.Delete
Next C

cleanup:
'clear data accumulated by macro
Worksheets("tcst10").Range("M:S").ClearContents


End Sub
```
If you have any questions, or if this isn't quite what your looking for let me know. It works wonders with your test book = ). I am using 2007 though, so let me know if that turns out to be an issue.

Hope this helps,

Robert Specian Jr.


----------



## tech313 (Sep 2, 2008)

Thank you. When I run the macro on my test sheet it deletes every row except 1. It should only delete 1 of each duplicate. Also when I run the macro on my actual production files it freezes up. They do have thousands of rows.


----------



## RSpecianjr (Jan 20, 2010)

Hey tech313,

The macro works fine for me. I am attaching the test book that I was working with. Try opening it up and see if it works. 

If it doesn't, and your using 2003, then its probably a slight difference in coding. I don't have 2003 anymore =/, so I wont be able to test it. Perhaps someone else can?

As far as it freezing up on your other one, it is probably just thinking very hard with the amount of data you have. It will probably speed up a bit if we switch it to manual calculation while running the macro.

Let me know what happens when you try the attached file.

Regards,

Robert Specian Jr.


----------



## Zazula (Apr 27, 2006)

RSpecianjr said:


> I don't have 2003 anymore =/, so I wont be able to test it. Perhaps someone else can?


It does work in my Excel 2003.


----------



## tech313 (Sep 2, 2008)

OK I guess I hadn't put it in mine correctly. I tried yours and it worked so I copied and pasted the macro again into my test copy and it also worked. Now I have put the macro in my big file. It finished the calculating and now excel is frozen but I assume, as you said, it is still working so I will leave it alone for a bit and see what happens.
Thank you very much for assisting with this!


----------



## RSpecianjr (Jan 20, 2010)

Hey tech313,

You can add this code:


```
Application.Calculation = xlCalculationManual
```
Right after:


> EndRw = Worksheets("tcst10").Range("A65536").End(xlUp).Row


and then we can set it back to automatic at the end of the code with this:


```
Application.Calculation =xlCalculationAutomatic
```
Right before:


> End Sub


Hope this helps,

Robert Specian Jr.


----------



## tech313 (Sep 2, 2008)

That speeds up the calculation definitely. But now it goes through and starts deleting every single line. Could you show me how to edit this function to only look at 1 column? This might help me out.


----------



## RSpecianjr (Jan 20, 2010)

Hey tech313,

My bad haha, we need to turn the calculations back on before it goes through the sequence to delete the rows, as it is based on a formula calculating. Lets just move the:


```
Application.Calculation =xlCalculationAutomatic
```
Line of code right before:


```
For C = 1 To NumOfDups
```
That should do it = P

Let me know how this works for you,

Robert Specian Jr.


----------

