31
SOUNDEX y FUZZY VLOOKUP FOR VBA EXCEL 201107 VBA] Function Str_Comp(st1 As String, st2 As String) As Double ‘ returns a number showing % comparison between two names ‘ i.e. =Str_Comp(A1, B1) ‘ Format cell as Percentage to make it look pretty!! Dim MtchTbl(100, 100) Dim MyMax As Double, ThisMax As Double Dim i As Integer, j As Integer, ii As Integer, jj As Integer With WorksheetFunction st1$ = Trim$(.Proper(st1$)) st2$ = Trim$(.Proper(st2$)) End With MyMax# = 0 For i% = Len(st1$) To 1 Step -1 For j% = Len(st2$) To 1 Step -1 If Mid$(st1$, i%, 1) = Mid$(st2$, j%, 1) Then ThisMax# = 0 For ii% = (i% + 1) To Len(st1$) For jj% = (j% + 1) To Len(st2$) If MtchTbl(ii%, jj%) > ThisMax# Then ThisMax# = MtchTbl(ii%, jj%) End If Next jj% Next ii% MtchTbl(i%, j%) = ThisMax# + 1 If (ThisMax# + 1) > ThisMax# Then MyMax# = ThisMax# + 1 End If End If Next j% Next i% Str_Comp = MyMax# / ((Len(st1$) + Len(st2$)) / 2) End Function [/VBA] Here's the details:

SOUNDEX y FUZZY VLOOKUP FOR VBA EXCEL 201107.docx

Embed Size (px)

Citation preview

Page 1: SOUNDEX y FUZZY VLOOKUP FOR VBA EXCEL 201107.docx

SOUNDEX y FUZZY VLOOKUP FOR VBA EXCEL 201107

VBA]Function Str_Comp(st1 As String, st2 As String) As Double‘‘ returns a number showing % comparison between two names‘‘ i.e. =Str_Comp(A1, B1)‘‘ Format cell as Percentage to make it look pretty!!‘‘Dim MtchTbl(100, 100)Dim MyMax As Double, ThisMax As DoubleDim i As Integer, j As Integer, ii As Integer, jj As Integer

With WorksheetFunctionst1$ = Trim$(.Proper(st1$))st2$ = Trim$(.Proper(st2$))End With

MyMax# = 0

For i% = Len(st1$) To 1 Step -1For j% = Len(st2$) To 1 Step -1If Mid$(st1$, i%, 1) = Mid$(st2$, j%, 1) ThenThisMax# = 0For ii% = (i% + 1) To Len(st1$)For jj% = (j% + 1) To Len(st2$)If MtchTbl(ii%, jj%) > ThisMax# ThenThisMax# = MtchTbl(ii%, jj%)End IfNext jj%Next ii%MtchTbl(i%, j%) = ThisMax# + 1If (ThisMax# + 1) > ThisMax# ThenMyMax# = ThisMax# + 1End IfEnd IfNext j%Next i%Str_Comp = MyMax# / ((Len(st1$) + Len(st2$)) / 2)

End Function[/VBA]

Here's the details:

Public Function VFuzzyLookup(Lookup_Value As String, Table_Array As Variant)

Find the best match for a given string in column 1 of an array of data obtained from an Excel rangeThis is functionally similar to VLookup, but it returns the best match, not the first exact matchThis function is not case-sensitive.

Page 2: SOUNDEX y FUZZY VLOOKUP FOR VBA EXCEL 201107.docx

If your data quality is poor, you are advised to display the retrieved index value from column 1 and use the MatchWord function on this index value to reveal the fuzzy-matching 'score' and discard all results below a threshold score.

Use VFuzzyLookup_Phrase if you are trying to match a phrase or sentence (a sequence of words separated by spaces) as that function is faster, and has additional logic for the word order.

Use vFuzzyLookup_Address if you are looking up names and addresses: that function has additional logic to normalise the common abbreviations and word-order conventions used in British addresses.

Yes, you read that correctly: you have a best-match lookup, vFuzzyLookup_Address() for lists of names and addresses. It could do with some performance-tuning (I wouldn't try it on lists exceeding 1024 members if you're in any kind of hurry) but it's usable.

The functions which will be of most interest to other programmers are probably MatchWord(), the simple string-comparison function which returns a percentage score based on Levenshtein edit distance; and MatchPhrase(), which uses MatchWord to create a comparison grid for the words of two sentences, and compares the constructed sequence of 'best match' words in the first sentence with their actual occurrence in the second.

This is the full list of functions , with internal page links:

VfuzzyLookup() VFuzzyLookup_Phrase VFuzzyLookup_Address MatchPhrase MatchWord Levenshtein NormaliseAddress StripChars Substitute

In another post I have coded up an alternative approach to the Levenshtein edit distance: a 'sum of common strings' score that may give better results for longer phrases and passages of text.

Option Explicit

Public Function VFuzzyLookup(Lookup_Value As String, Table_Array As Variant, Optional Col_Index_Num As Integer = 1)

' Find the best match for a given string in column 1 of an array of data obtained from an Excel range' This is functionally similar to VLookup, but it returns the best match, not the first exact match' This function is not case-sensitive.

' If your data quality is poor, you are advised to display the retrieved index value from column 1

Page 3: SOUNDEX y FUZZY VLOOKUP FOR VBA EXCEL 201107.docx

' and use the MatchWord() function on this index value to reveal the fuzzy-matching 'score' and' discard all results below a threshold score.

' Use VFuzzyLookup_Phrases if you are trying to match a phrase or sentence (a sequence of words' separated by spaces) as that function is faster, and has additional logic for the word order.

' Use vFuzzyLookup_Address if you are looking up names and addresses: that function has additional' logic to normalise the common abbreviations and word-order conventions used in British addresses.

' THIS CODE IS IN THE PUBLIC DOMAIN

Application.Volatile False

Dim dblBestMatch As Double

Dim iRowBest    As IntegerDim dblMatch    As DoubleDim iRow        As IntegerDim strTest     As StringDim strInput    As String

Dim iStartCol   As IntegerDim iEndCol     As IntegerDim iOffset     As Integer

If InStr(TypeName(Table_Array), "(") + InStr(1, TypeName(Table_Array), "Range", vbTextCompare) < 1 Then    'Table_Array is not an array     VFuzzyLookup = "#VALUE"    Exit FunctionEnd If

If InStr(1, TypeName(Table_Array), "Range", vbTextCompare) > 0 Then    Table_Array = Table_Array.ValueEnd If

' If you get a subscript-out-of-bounds error here, you're using a vector instead' of the 2-dimensional array that is the default 'Value' property of an Excel range.

iStartCol = LBound(Table_Array, 2)iEndCol = UBound(Table_Array, 2)iOffset = 1 - iStartCol

Col_Index_Num = Col_Index_Num - iOffset

If Col_Index_Num > iEndCol Or Col_Index_Num < iStartCol Then    'Out-of-bounds     VFuzzyLookup = "#VALUE"    Exit Function

Page 4: SOUNDEX y FUZZY VLOOKUP FOR VBA EXCEL 201107.docx

End If

    strInput = UCase(Lookup_Value)

    iRowBest = -1    dblBestMatch = 0

    For iRow = LBound(Table_Array, 1) To UBound(Table_Array, 1)

        strTest = ""        strTest = Table_Array(iRow, iStartCol)

        dblMatch = 0        dblMatch = MatchWord(strInput, strTest)

        If dblMatch = 1 Then ' Bail out on finding an exact match             iRowBest = iRow            Exit For        End If

        If dblMatch > dblBestMatch Then            dblBestMatch = dblMatch            iRowBest = iRow        End If

    Next iRow

    If iRowBest = -1 Then        VFuzzyLookup = "#NO MATCH"        Exit Function    End If

    VFuzzyLookup = Table_Array(iRowBest, Col_Index_Num)

End Function

Public Function MatchWord(ByVal str1 As String, ByVal str2 As String, Optional Compare As VbCompareMethod = vbTextCompare) As Double

' Returns a percentage estimate of how closely word 1 matches word 2' Edit distances exceeding the length of str1 are discarded, returning a percentage match of zero

' THIS CODE IS IN THE PUBLIC DOMAIN

Application.Volatile False

Dim maxLen As Integer

Page 5: SOUNDEX y FUZZY VLOOKUP FOR VBA EXCEL 201107.docx

Dim minLen As Integer

If Compare = vbTextCompare Then    str1 = UCase(str1)    str2 = UCase(str2)End If

    If str1 = str2 Then        MatchWord = 1        Exit Function    End If

    If Len(str1) > Len(str2) Then        maxLen = Len(str1)        minLen = Len(str2)    Else        maxLen = Len(str2)        minLen = Len(str1)    End If

    MatchWord = 0    MatchWord = Levenshtein(str1, str2)

    If MatchWord >= minLen Then        MatchWord = 0    Else        MatchWord = (maxLen - MatchWord) / maxLen    End If

End Function

Public Function VFuzzyLookup_Phrase(Lookup_Phrase As String, Table_Array As Variant, Optional Col_Index_Num As Integer = 1)

' Find the best match for a given phrase in column 1 of an array of data obtained from an Excel range

' Use this function to match a sentence (a sequence of words separated by spaces). Returns a score' based on matching word order weighted by the string-matching score of each individual word.

' This is functionally similar to VLookup, but it returns the best match, not the first exact match' This function is not case-sensitive.

' If your data quality is poor, you are advised to display the retrieved index value from column 1' and use the MatchPhrase() function on this index value to reveal the fuzzy-matching 'score'; consider' discarding all results below a threshold score.

' Use VFuzzyLookup for simple string comparisons.

Page 6: SOUNDEX y FUZZY VLOOKUP FOR VBA EXCEL 201107.docx

' Use vFuzzyLookup_Address if you are looking up names and addresses: that function has additional' logic to normalise the common abbreviations and word-order conventions used in British addresses.

' THIS CODE IS IN THE PUBLIC DOMAIN

Application.Volatile False

Dim dblBestMatch As Double

Dim iRowBest    As IntegerDim dblMatch    As DoubleDim iRow        As IntegerDim strTest     As StringDim strInput    As String

Dim iStartCol   As IntegerDim iEndCol     As IntegerDim iOffset     As Integer

If InStr(TypeName(Table_Array), "(") + InStr(1, TypeName(Table_Array), "Range", vbTextCompare) < 1 Then    'Table_Array is not an array     VFuzzyLookup_Phrase = "#VALUE"    Exit FunctionEnd If

If InStr(1, TypeName(Table_Array), "Range", vbTextCompare) > 0 Then   Table_Array = Table_Array.ValueEnd If

' If you get a subscript-out-of-bounds error here, you're using a vector instead' of the 2-dimensional array that is the default 'Value' property of an Excel range.

iStartCol = LBound(Table_Array, 2)iEndCol = UBound(Table_Array, 2)iOffset = 1 - iStartCol

Col_Index_Num = Col_Index_Num - iOffset

If Col_Index_Num > iEndCol Or Col_Index_Num < iStartCol Then    'Out-of-bounds     VFuzzyLookup_Phrase = "#VALUE"    Exit FunctionEnd If

    strInput = UCase(Lookup_Phrase)

    iRowBest = -1

Page 7: SOUNDEX y FUZZY VLOOKUP FOR VBA EXCEL 201107.docx

    dblBestMatch = 0

    For iRow = LBound(Table_Array, 1) To UBound(Table_Array, 1)

        strTest = ""        strTest = Table_Array(iRow, iStartCol)

        dblMatch = 0        dblMatch = MatchPhrase(strInput, strTest)    ' Consider coding up a  MatchPhrase_Express() function, with the preprocessing                                                    ' (StripChars, Split) of strInput done here, rather than repeatedly

        If dblMatch = 1 Then ' Bail out on finding an exact match            iRowBest = iRow            Exit For        End If

        If dblMatch > dblBestMatch Then            dblBestMatch = dblMatch            iRowBest = iRow        End If

    Next iRow

    If iRowBest = -1 Then        VFuzzyLookup_Phrase = "#NO MATCH"        Exit Function    End If

    VFuzzyLookup_Phrase = Table_Array(iRowBest, Col_Index_Num)

End Function

Public Function VFuzzyLookup_Address(Lookup_Address As String, Table_Array As Variant, Optional Col_Index_Num As Integer = 1)

' Find the best match for a given postal address in column 1 of a table of addresses obtained from an Excel range

' This is functionally similar to VLookup, but it returns the best match, not the first exact match' This function is not case-sensitive and ignores common abbreviations eg: 'St' for 'Street'

' If your data quality is poor, you are advised to display the retrieved index value from column 1' and use the MatchPhrase() function on this index value to reveal the fuzzy-matching 'score'; consider' discarding all results below a threshold score.

Page 8: SOUNDEX y FUZZY VLOOKUP FOR VBA EXCEL 201107.docx

' Use VFuzzyLookup for simple string comparisons.

' Use vFuzzyLookup_Phrase if you are looking up phrases or sentences that are not addresses: this address lookup' function discards a lot of common words like 'Street' and expands abbreviations like 'Ave' and 'Blvd'

' THIS CODE IS IN THE PUBLIC DOMAIN

Application.Volatile False

Dim dblBestMatch As Double

Dim iRowBest    As IntegerDim dblMatch    As DoubleDim iRow        As IntegerDim strTest     As StringDim strInput    As String

Dim iStartCol   As IntegerDim iEndCol     As IntegerDim iOffset     As Integer

If InStr(TypeName(Table_Array), "(") + InStr(1, TypeName(Table_Array), "Range", vbTextCompare) < 1 Then    'Table_Array is not an array     VFuzzyLookup_Address = "#VALUE"    Exit FunctionEnd If

If InStr(1, TypeName(Table_Array), "Range", vbTextCompare) > 0 Then    Table_Array = Table_Array.ValueEnd If

' If you get a subscript-out-of-bounds error here, you're using a vector instead' of the 2-dimensional array that is the default 'Value' property of an Excel range.

iStartCol = LBound(Table_Array, 2)iEndCol = UBound(Table_Array, 2)iOffset = 1 - iStartCol

Col_Index_Num = Col_Index_Num - iOffset

If Col_Index_Num > iEndCol Or Col_Index_Num < iStartCol Then    'Out-of-bounds     VFuzzyLookup_Address = "#VALUE"    Exit FunctionEnd If

    strInput = Lookup_Address    strInput = NormaliseAddress(strInput)

Page 9: SOUNDEX y FUZZY VLOOKUP FOR VBA EXCEL 201107.docx

If strInput = "" Then    'Out-of-bounds     VFuzzyLookup_Address = "#CANNOT READ ADDRESS"    Exit FunctionEnd If

    iRowBest = -1    dblBestMatch = 0

    For iRow = LBound(Table_Array, 1) To UBound(Table_Array, 1)

        strTest = ""        strTest = Table_Array(iRow, iStartCol)        strTest = NormaliseAddress(strTest)

        If strTest <> "" Then

            dblMatch = 0            dblMatch = MatchPhrase(strInput, strTest)   ' Consider coding up a  MatchPhrase_Express() function, with the preprocessing                                                        ' (StripChars, Split) of strInput done here, rather than repeatedly  

            If dblMatch = 1 Then ' Bail out on finding an exact match                iRowBest = iRow                Exit For            End If

            If dblMatch > dblBestMatch Then                dblBestMatch = dblMatch                iRowBest = iRow            End If

        End If ' strTest <> ""

    Next iRow

    If iRowBest = -1 Then        VFuzzyLookup_Address = "#NO MATCH"        Exit Function    End If

    VFuzzyLookup_Address = Table_Array(iRowBest, Col_Index_Num)

End Function

Page 10: SOUNDEX y FUZZY VLOOKUP FOR VBA EXCEL 201107.docx

Public Function MatchPhrase(ByVal Phrase1 As String, ByVal Phrase2 As String, Optional Compare As VbCompareMethod = vbTextCompare) As Double

' Function to compare two sentences. A version of this will be released to cater for the' specific needs of matching addresses, where we can make some assumptions about common' word-substitutions and abbreviations.

' THIS CODE IS IN THE PUBLIC DOMAIN

' This function consists of six processes:

' 1  Break out the phrases into arrays of words using the space character as the delimiter' 2  Populate a grid of word-matching scores for each word in Phrase 1 against Phrase 2;' 3  For each word in Phrase 1, identify the 'best match' from the words in Phrase 2' 4  Resolve 'collisions' - two or more words in phrase 1 matching the same word in phrase 2' 5  Compare the actual sequence of words in P1 with the positions of the matched words in P2' 6  Weight this comparison by the degree of matching measured at the level of individual words

' Process 4, resolving collisions, is an iterative loop inside process 3' Process 1 has an addditional step to check for deleted spaces

Dim arr1() As String            ' Phrase 1, broken out into individual wordsDim arr2() As String

Dim arrScores()    As Double    ' an array of percentage matches of each word in p1 against each word in p2

                                ' These two vectors are redundant in the sense that they hold information which                                ' can be extracted from arrScores(). However, using them saves a lot of looping:

Dim arrPositions() As Integer   ' For each word in p1, the position of the best-matching word in p2Dim arrSequence()  As Double    ' For each word in p1, a score for its concordance with a constructed sequence of matching words in P2

Dim n As Double                 ' should be an integer, but it will be used in floating-point                                ' division and I prefer to avoid casting in VBADim s1 As StringDim s2 As String

Page 11: SOUNDEX y FUZZY VLOOKUP FOR VBA EXCEL 201107.docx

Dim i As IntegerDim j As IntegerDim k As Integer

Dim iOffset As IntegerDim iShift As IntegerDim iDelete As Integer

Dim iPos As IntegerDim jPos As IntegerDim kPos As Integer

Dim iTotalLen As Integer

Dim dScore As DoubleDim dBest As DoubleDim dPenalty As Double

Dim d1 As DoubleDim d2 As Double

If Compare = vbTextCompare Then    Phrase1 = UCase(Phrase1)    Phrase2 = UCase(Phrase2)End If

If Phrase1 = Phrase2 Then    MatchPhrase = 1    Exit FunctionEnd If

' The line labels SplitSpace1 and SplitSpace2 are resynchronisation points for' restarting the process after restoring a deleted space in Phase1 or Phrase2.

Phrase1 = StripChars(Phrase1, " ")SplitSpace1:arr1 = Split(Phrase1, " ")

Phrase2 = StripChars(Phrase2, " ")SplitSpace2:arr2 = Split(Phrase2, " ")

ReDim arrScores(LBound(arr1) To UBound(arr1), LBound(arr2) To UBound(arr2))ReDim arrPositions(LBound(arr1) To UBound(arr1))ReDim arrSequence(LBound(arr1) To UBound(arr1))

' Test for deleted spaces. This is a lot of work, but a missing space is a' common error and the effects are out of all proportion to the size of the' error: so much so that I'm prepared to risk the occasional 'false alarm'.' It may even be worth repeating these two loops using fuzzy-matching with' Levenshtein scores rather than the simple string-comparisons shown below:

For i = LBound(arr1) To UBound(arr1) - 1

Page 12: SOUNDEX y FUZZY VLOOKUP FOR VBA EXCEL 201107.docx

    If arr1(i) <> "" And arr1(i + 1) <> "" Then

        s1 = arr1(i) & arr1(i + 1)

        For j = LBound(arr2) To UBound(arr2)            If UCase(arr2(j)) = UCase(s1) Then                Phrase2 = Substitute(Phrase2, arr2(j), arr1(i) & " " & arr1(i + 1), 1, Compare)                GoTo SplitSpace2            End If        Next j

    End If ' arr(i) = "" Or arr(i + 1) = "" Then

Next i

For j = LBound(arr2) To UBound(arr2) - 1

    If arr2(j) <> "" And arr2(j + 1) <> "" Then

        s2 = arr2(j) & arr2(j + 1)

        For i = LBound(arr1) To UBound(arr1)            If UCase(arr1(i)) = UCase(s2) Then                Phrase1 = Substitute(Phrase1, arr1(i), arr2(j) & " " & arr2(j + 1), 1, Compare)                GoTo SplitSpace1            End If         Next i

    End If

Next j

' Initialise the positions array with a negative value denoting 'not found'

For i = LBound(arr1) To UBound(arr1)    arrPositions(i) = -1    iTotalLen = iTotalLen + Len(arr1(i))Next i

' For each word in Phrase 1, identify the closest matching in Phrase 2 and record its position.

For i = LBound(arr1) To UBound(arr1)

    s1 = arr1(i)    dBest = 0    iPos = -1

    For j = LBound(arr2) To UBound(arr2)

        s2 = arr2(j)        dScore = 0

Page 13: SOUNDEX y FUZZY VLOOKUP FOR VBA EXCEL 201107.docx

        dScore = MatchWord(s1, s2, Compare)

        arrScores(i, j) = dScore        If dScore > dBest Then            dBest = dScore            iPos = j        End If

    Next j

    If iPos >= 0 Then        arrPositions(i) = iPos    End If

Next i

' Resolve collisions - two or more words in P1 that have 'best match' scores on the same word in p2' In theory this could be done without using the positions vector, as the information is in arrScores' In practice, arrPositions saves processing steps

For i = LBound(arrPositions) To UBound(arrPositions)

    iPos = arrPositions(i)

    For j = i + 1 To UBound(arrPositions)

        If iPos = arrPositions(j) And iPos >= 0 Then            ' Collision detected: which word has the best score?             d1 = arrScores(i, iPos)            d2 = arrScores(j, iPos)

            If d2 > d1 Then

                 'discard this recorded 'best match' position:                 arrScores(i, iPos) = -1

                'find the second-best score for d1                dBest = 0                kPos = -1                For k = LBound(arrScores, 2) To UBound(arrScores, 2)                    dScore = 0                    dScore = arrScores(i, k)                    If dScore > dBest Then                        dBest = dScore                        kPos = k                    End If                Next k                                 ' reset this conflicting position as word (i)'s match in phrase 2:                 arrPositions(j) = kPos                                 ' There is now a possibility that we have caused                ' a collision with a previous word in Phrase 1:

Page 14: SOUNDEX y FUZZY VLOOKUP FOR VBA EXCEL 201107.docx

                If k < i Then                    For k = LBound(arrPositions) To k - 1                        If arrPositions(k) = kPos Then                             'restart the loop at the colliding value                            i = k                            j = UBound(arr1) + 1                            Exit For                        End If                    Next k                End If ' k<1

            Else

                 ' discard this recorded 'best match' position:                 arrScores(j, iPos) = -1

                 'find the second-best score for d2 *after* the current position                dBest = 0                kPos = -1                For k = j + 1 To UBound(arr2)                    dScore = 0                    dScore = arrScores(j, k)                    If dScore > dBest Then                        dBest = dScore                        kPos = k                    End If                Next k

                arrPositions(j) = kPos

            End If ' d2 > d1

        End If

    Next j

Next i

' Constructing a sequence-matching score:

' If we were scoring jumbled sentences of unaltered words, we'd use an edit distance algorithm;' several are available, including replicating the Levenshtein distance at the word level. I've' chosen a crude single-pass algorithm with a forward bias, that 'expects' the word sequence to' resynchronise after each out-of-sequence word. It's quick, and the bias is valid - word-order' is not neutral in real-life examples, and the heavy penalty for word transpositions reflects' my belief that this is a more significant 'edit' than character transpositions in a word. A' more rigorous treatment would venture into the realms of natural-language

Page 15: SOUNDEX y FUZZY VLOOKUP FOR VBA EXCEL 201107.docx

processing; that is' out-of-scope for this application and far too ambitious for a self-contained function in VBA.

' Worked example:

' Compare two Phrases:'  "ABC DEF GHI JKL MNO PQR STU VWX",  "ABC DEF JKL STU MNO PQR VWX"

' Variable arrPositions records the placement of each word in phrase 1 in phrase 2:

' Phrase 1            "ABC DEF GHI JKL MNO PQR STU VWX"' Expected positions:   0   1   2   3   4   5   6   7' Actual position in p2 0   1  -1   2   4   5   3   6

' The variable arrSequence will capture the scores        ' Run the sequence-scoring loop:

' ABC   expected in position 0      found in 0                      Score 1/8' DEF   expected in position 1      found in 1                      Score 1/8' GHI   expected in position 2      DELETION     * frame shift -1 * Score NIL' JKL   expected in position 3-1    found in 2                      Score 1/8' MNO   expected in position 4-1    found in 4   * frame shift +1 * Score 1/8 * 7/8' PQR   expected in position 5      found in 5                      Score 1/8' STU   expected in position 6      found in 3   * frame shift -3 * Score 1/8 * (7/8)^3' VWX   expected in position 7-3    found in 6   * frame shift +2 * Score 1/8 * (7/8)^2

' Edit distance is 7: the out-of-sequence penalty of 7/8 will be applied seven times

' However, we do not deal with perfectly-matched words in real life, so we cannot apply' these penalties at the level of the entire phrase; we apply them at the level of the' individual word, where we can apply a weighting based on each word's Levenshtein score

' The exception is deleted words; we could consider the 'word match' weighting of zero' to be sufficient penalty but a more consistent result is obtained by applying a penalty' to the entire phrase

' Sanity check; run the function in reverse, testing Phrase 2 against phrase 1:

' Phrase 2            "ABC DEF JKL STU MNO PQR VWX"' Expected positions:   0   1   2   3   4   5   6' Actual position in p1 0   1   3   6   4   5   7

' ABC   expected in position 0      found in 0                      Score 1/8' DEF   expected in position 1      found in 1                      Score 1/8' JKL   expected in position 2      found in 3   * frame shift +1 * Score 1/8 * 7/8' STU   expected in position 3+1    found in 6   * frame shift +2 * Score 1/8 *

Page 16: SOUNDEX y FUZZY VLOOKUP FOR VBA EXCEL 201107.docx

(7/8)^2' MNO   expected in position 4+3    found in 4   * frame shift -3 * Score 1/8 * (7/8)^3' PQR   expected in position 5      found in 5                      Score 1/8' VWX   expected in position 6      found in 7   * frame shift -1 * Score 1/8 * 7/8

' Edit distance is 7: the out-of-sequence penalty of 7/8 will be applied seven times

' "But wasn't there an insertion, too? Phrase 1 has an extra word that isn't in Phrase 2!"

' Note that our choice of denominator (8, the longer of the two wordcounts) has the effect of' imputing a score of zero to the inserted word and applying a penalty of 7/8 to the entire phrase.

' A note on identifying the 'inserted word': actually, it's the word in Phrase 1 which didn't' score as 'best match' against any word in Phrase 2. It could've come a close second to any or' all of them.

If UBound(arr1) >= UBound(arr2) Then    n = UBound(arr1) + 1Else    n = UBound(arr2) + 1End If

dPenalty = 1 - (1 / n)iShift = 0       ' Sequence distance for out-of-place wordsiOffset = 0     ' Running total of this 'shift' variableiDelete = 0     ' Count the number of deletions

For i = LBound(arrPositions) To UBound(arrPositions)

    s1 = arr1(i)

    iPos = arrPositions(i)    iShift = iPos - i - iOffset

    Select Case iPos    Case Is < 0     'DELETION: no matching word was found in S2

        iShift = -1        arrSequence(i) = 0        iDelete = iDelete + 1

    Case Is = i + iOffset ' matched word is in the expected position

        iShift = 0        arrSequence(i) = 1 / n

Page 17: SOUNDEX y FUZZY VLOOKUP FOR VBA EXCEL 201107.docx

    Case Else

        arrSequence(i) = (dPenalty ^ Abs(iShift)) / n

    End Select

    iOffset = iOffset + iShift

Next i

MatchPhrase = 0

For i = LBound(arrPositions) To UBound(arrPositions)    dScore = 0    If arrPositions(i) > -1 Then        dScore = arrScores(i, arrPositions(i))        dScore = dScore * arrSequence(i)    Else         'apply a deletion penalty - this isn't as arbitrary as it might seem: it is a equivalent to the        '                           effect of an insertion, which acts by increasing the denominator        dScore = -Len(arr1(i)) / iTotalLen / n    End If    MatchPhrase = MatchPhrase + dScoreNext i

ExitFunction:

    Erase arrScores    Erase arrSequence    Erase arr1    Erase arr2

End Function

Private Function Minimum(ByVal a As Integer, _                         ByVal b As Integer, _                         ByVal c As Integer) As IntegerDim min As Integer

  min = a

  If b < min Then        min = b  End If

  If c < min Then        min = c  End If

Page 18: SOUNDEX y FUZZY VLOOKUP FOR VBA EXCEL 201107.docx

  Minimum = min

End Function

Private Function Levenshtein(ByVal s1 As String, ByVal s2 As String) As Integer

' Levenshtein Distance  - edit distance between two strings

' THIS CODE IS IN THE PUBLIC DOMAIN

Dim arr() As Integer    ' Scoring matrixDim n As Integer        ' length of s1Dim m As Integer        ' length of s2Dim i As Integer        ' iterates through s1Dim j As Integer        ' iterates through s2Dim s1_i As String      ' ith character of s1Dim s2_j As String      ' jth character of s2Dim cost As Integer     ' cost

n = Len(s1)m = Len(s2)

If n = 0 Then    Levenshtein = m    Exit FunctionEnd If

If m = 0 Then    Levenshtein = n    Exit FunctionEnd If

ReDim arr(0 To n, 0 To m) As Integer

For i = 0 To n    arr(i, 0) = iNext i

For j = 0 To m    arr(0, j) = jNext j

For i = 1 To n

    s1_i = Mid$(s1, i, 1)

    For j = 1 To m

        s2_j = Mid$(s2, j, 1)

Page 19: SOUNDEX y FUZZY VLOOKUP FOR VBA EXCEL 201107.docx

            If s1_i = s2_j Then                cost = 0            Else                cost = 1            End If

            arr(i, j) = Minimum(arr(i - 1, j) + 1, arr(i, j - 1) + 1, arr(i - 1, j - 1) + cost)

    Next j

Next i

  ' Step 7

  Levenshtein = arr(n, m)

ExitSub:    Erase arrEnd Function

Public Function NormaliseAddress(ByVal strAddress As String) As String

' This function is intended to remove or standardise common phrases' and abbreviations used in British postal addresses, allowing the use' of string-comparison algorithms in lists of names and addresses.

' Developers in other countries should review the word list used here,' as conventions probably differ in your local language or dialect.

strAddress = " " & UCase(strAddress) & " "

strAddress = Substitute(strAddress, ",", " ")strAddress = Substitute(strAddress, ".", " ")strAddress = Substitute(strAddress, "-", " ")strAddress = Substitute(strAddress, vbCrLf, " ")strAddress = Substitute(strAddress, " BLVD ", " BOULEVARD ")strAddress = Substitute(strAddress, " BVD ", " BOULEVARD ")strAddress = Substitute(strAddress, " AV ", " AVENUE ")strAddress = Substitute(strAddress, " AVE ", " AVENUE ")strAddress = Substitute(strAddress, " RD ", " ROAD ")strAddress = Substitute(strAddress, " WY ", " WAY ")strAddress = Substitute(strAddress, " EST ", " ESTATE ")strAddress = Substitute(strAddress, " PL ", " PLACE ")strAddress = Substitute(strAddress, " PK ", " PARK ")strAddress = Substitute(strAddress, " HSE ", " HOUSE ")strAddress = Substitute(strAddress, " H0 ", " HOUSE ")strAddress = Substitute(strAddress, " GDNS ", " GARDENS ")

strAddress = Substitute(strAddress, "&", "AND")strAddress = Substitute(strAddress, " LIMITED ", " LTD ")strAddress = Substitute(strAddress, " COMPANY ", " CO ")strAddress = Substitute(strAddress, " CORPORATION ", " CORP ")strAddress = Substitute(strAddress, " T/A ", " TA ")

Page 20: SOUNDEX y FUZZY VLOOKUP FOR VBA EXCEL 201107.docx

strAddress = Substitute(strAddress, " TRADING AS ", " TA ")

' Common personal titles: these are often applied inconsistently or' omitted, and must therefore be removed. Specific applications may' require additional titles and their abbreviations - military rank,' academic titles and degrees, courtesy titles of the aristocracy,' knighthoods and honours (particularly for lists of civil servants)

strAddress = Substitute(strAddress, " ESQ ", " ")strAddress = Substitute(strAddress, " MR ", " ")strAddress = Substitute(strAddress, " MRS ", " ")strAddress = Substitute(strAddress, " MISS ", " ")strAddress = Substitute(strAddress, " MS ", " ")strAddress = Substitute(strAddress, " MESSRS ", " ")strAddress = Substitute(strAddress, " SIR ", " ")strAddress = Substitute(strAddress, " OF ", " ")strAddress = Substitute(strAddress, " DR ", " ")strAddress = Substitute(strAddress, " OR ", " ")strAddress = Substitute(strAddress, " IN ", " ")strAddress = Substitute(strAddress, " THE ", " ")strAddress = Substitute(strAddress, " REVEREND ", " REV ")strAddress = Substitute(strAddress, " REVERENT ", " REV ")strAddress = Substitute(strAddress, " HONOURABLE ", " HON ")strAddress = Substitute(strAddress, " BROS ", " BROTHERS ")strAddress = Substitute(strAddress, " ASSOC ", " ASSOCIATION ")strAddress = Substitute(strAddress, " ASSN ", " ASSOCIATION ")

' Standardising 'St.', 'St', and 'Street'. Note that there are over 40 English' towns and place names that contain or consist entirely of the word 'Street'.' In addition, 'St' is a common abbreviation for 'Saint' in addresses.

' I have never seen a list of addresses where 'Street' and 'St' were used in a' consistent way, and the only workable solution is to delete them all:

strAddress = Substitute(strAddress, " STREET ", " ")strAddress = Substitute(strAddress, " ST ", " ")strAddress = Substitute(strAddress, " STR ", " ")

Do While InStr(strAddress, "  ") > 0    strAddress = Substitute(strAddress, "  ", " ")Loop

strAddress = Trim(strAddress)

NormaliseAddress = strAddress

End Function

Public Function StripChars(myString As String, ParamArray Exceptions()) As String

' Strip out all non-alphanumeric characters from a string in a single pass' Exceptions parameters allow you to retain specific characters (eg: spaces)

' THIS CODE IS IN THE PUBLIC DOMAIN

Page 21: SOUNDEX y FUZZY VLOOKUP FOR VBA EXCEL 201107.docx

Application.Volatile False

Dim i As IntegerDim iLen As IntegerDim chrA As String * 1Dim intA As IntegerDim j As IntegerDim iStart As IntegerDim iEnd As Integer

If Not IsEmpty(Exceptions()) Then    iStart = LBound(Exceptions)    iEnd = UBound(Exceptions)End If

iLen = Len(myString)

For i = 1 To iLen    chrA = Mid(myString, i, 1)    intA = Asc(chrA)    Select Case intA    Case 48 To 57, 65 To 90, 97 To 122        StripChars = StripChars & chrA    Case Else        If Not IsEmpty(Exceptions()) Then            For j = iStart To iEnd                If chrA = Exceptions(j) Then                    StripChars = StripChars & chrA                    Exit For ' j                End If            Next j        End If    End SelectNext i

End Function

Private Function Substitute(ByVal Text As String, _                           ByVal Old_Text As String, _                           ByVal New_Text As String, _                           Optional Instance As Integer = 0, _                           Optional Compare As VbCompareMethod = vbTextCompare _                            ) As String

'Replace all instances (or the nth instance ) of 'Old' text with 'New'' Unlike VB.Mid$ this method is not sensitive to length and can replace ALL instances' This is not exposed as a Public function because there is an Excel Worksheet function called Substitute()

' THIS CODE IS IN THE PUBLIC DOMAIN

Page 22: SOUNDEX y FUZZY VLOOKUP FOR VBA EXCEL 201107.docx

Dim iStart As IntegerDim iEnd As IntegerDim iLen As IntegerDim iInstance As IntegerDim strOut As String

iLen = Len(Old_Text)

If iLen = 0 Then    Substitute = Text    Exit FunctionEnd If

iEnd = 0iStart = 1

iEnd = InStr(iStart, Text, Old_Text, Compare)

If iEnd = 0 Then    Substitute = Text    Exit FunctionEnd If

strOut = ""

Do Until iEnd = 0

    strOut = strOut & Mid$(Text, iStart, iEnd - iStart)    iInstance = iInstance + 1

    If Instance = 0 Or Instance = iInstance Then        strOut = strOut & New_Text    Else        strOut = strOut & Mid$(Text, iEnd, Len(Old_Text))    End If

    iStart = iEnd + iLen    iEnd = InStr(iStart, Text, Old_Text, Compare)

Loop

iLen = Len(Text)strOut = strOut & Mid$(Text, iStart, iLen - iEnd)

Substitute = strOut

End Function

Public Function Contains(ByVal MainString As String, ParamArray SeekString()) As Boolean

' A version of Instr() that returns TRUE if any 'seek' string is a substring of the

Page 23: SOUNDEX y FUZZY VLOOKUP FOR VBA EXCEL 201107.docx

main string

' THIS CODE IS IN THE PUBLIC DOMAIN

Application.Volatile False

Dim i As IntegerDim j As IntegerDim arrX As VariantDim strTest As StringContains = False

If MainString = "" Then    Exit FunctionEnd If

For i = LBound(SeekString) To UBound(SeekString)

    If IsArray(SeekString(i)) Then        arrX = SeekString(i)        For j = LBound(arrX) To UBound(arrX)

            strTest = ""            strTest = Trim(arrX(j))

            If Len(strTest) <= Len(MainString) And Len(strTest) > 0 Then

                If InStr(1, MainString, strTest, vbTextCompare) > 0 Then                    Contains = True                    Exit Function                End If

            End If

        Next j        Erase arrX

    Else

        strTest = ""        strTest = Trim(SeekString(i))

        If Len(strTest) > 0 Then

            If Len(strTest) < Len(MainString) And Len(strTest) > 0 Then

                If InStr(1, MainString, strTest, vbTextCompare) > 0 Then                    Contains = True                    Exit Function                End If

            End If

Page 24: SOUNDEX y FUZZY VLOOKUP FOR VBA EXCEL 201107.docx

        End If

    End If

Next i

End Function

There's a particular point about the Phrase-matching function that could probably be improved:

' For each word in Phrase 1, identify the closest matching in Phrase 2 and record its position.

Access can incorporate globally-declared VBA functions in queries, so try this out:

    SELECT TOP 3

    "Buckingham Palace, LONDON SW1" AS SeekAddress,    MatchPhrase(            NormaliseAddress("Buckingham Palace, LONDON SW1"),            NormaliseAddress([tblAddress].[FullAddress])                ) AS MatchScore,    tblAddress.*

FROM     tblAddress

ORDER BY    MatchPhrase(            NormaliseAddress("Buckingham Palace, LONDON SW1"),            NormaliseAddress([tblAddress].[FullAddress])                 ) DESC

Function SOUNDEX(Surname As String) As String' The Custom Function "SOUNDEX" used in this macro was Developed by Richard J. Yanco.' This function follows the Soundex rules given at' http://home.utah-inter.net/kinsearch/Soundex.html' The VB Application "Soundex_Codes; By: Joe Was & John Walkenbach.

Dim Result As String, c As String * 1Dim Location As Integer

Page 25: SOUNDEX y FUZZY VLOOKUP FOR VBA EXCEL 201107.docx

Surname = UCase(Surname)

' First character must be a letterIf Asc(Left(Surname, 1)) < 65 Or Asc(Left(Surname, 1)) > 90 ThenSOUNDEX = ""Exit FunctionElse' St. is converted to SaintIf Left(Surname, 3) = "ST." ThenSurname = "SAINT" & Mid(Surname, 4)End If

' Convert to Soundex: letters to their appropriate digit,' A,E,I,O,U,Y ("slash letters") to slashes' H,W, and everything else to zero-length string

Result = Left(Surname, 1)For Location = 2 To Len(Surname)Result = Result & Category(Mid(Surname, Location, 1))Next Location

' Remove double lettersLocation = 2Do While Location < Len(Result)If Mid(Result, Location, 1) = Mid(Result, Location + 1, 1) ThenResult = Left(Result, Location) & Mid(Result, Location + 2)ElseLocation = Location + 1End IfLoop

' If category of 1st letter equals 2nd character, remove 2nd characterIf Category(Left(Result, 1)) = Mid(Result, 2, 1) ThenResult = Left(Result, 1) & Mid(Result, 3)End If

' Remove slashesFor Location = 2 To Len(Result)If Mid(Result, Location, 1) = "/" ThenResult = Left(Result, Location - 1) & Mid(Result, Location + 1)End IfNext

' Trim or pad with zeroes as necessarySelect Case Len(Result)Case 4SOUNDEX = ResultCase Is < 4SOUNDEX = Result & String(4 - Len(Result), "0")Case Is > 4SOUNDEX = Left(Result, 4)End SelectEnd IfEnd Function

Page 26: SOUNDEX y FUZZY VLOOKUP FOR VBA EXCEL 201107.docx

Private Function Category(c) As String' Returns a Soundex code for a letterSelect Case TrueCase c Like "[AEIOUY]"Category = "/"Case c Like "[BPFV]"Category = "1"Case c Like "[CSKGJQXZ]"Category = "2"Case c Like "[DT]"Category = "3"Case c = "L"Category = "4"Case c Like "[MN]"Category = "5"Case c = "R"Category = "6"Case Else 'This includes H and W, spaces, punctuation, etc.Category = ""End SelectEnd Function