How to compare 2 strings to get a similarity percent in result


Hello,

How can I compare two strings and get their similarity as a percentage?

Example:

String 1: "Hello Worlds !"

String 2: "Hello my World !!!"

I need a percentage as the result, for example: 70% similar...

Many thanks! :-)

Hi,

You can do this with the StringCompare function and a little arithmetic.

Here you go:

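Local Const $s1 = "toto"
Local Const $s2 = "tata"

; Split both strings into single characters; element [0] holds the character count.
Local Const $a1 = StringSplit($s1, ""), $a2 = StringSplit($s2, "")

; Only the first $iMin characters (the length of the shorter string) are compared.
; A plain If replaces _Iif(), which is no longer shipped in Misc.au3 in recent AutoIt versions.
Local $iMin = $a1[0]
If $a2[0] < $iMin Then $iMin = $a2[0]

Local $iDiffCount = 0

For $i = 1 To $iMin
    ; Flag 2 = case-insensitive comparison
    If StringCompare($a1[$i], $a2[$i], 2) <> 0 Then $iDiffCount += 1
Next

ConsoleWrite("Diff: " & $iDiffCount / $iMin * 100 & "%" & @CRLF)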

Br, FireFox.


Wish I could cite the source:

Func _Typos(Const $st1, Const $st2, $anychar = '_', $anytail = '%') ; Get amount of typos between two strings
    Local $s1, $s2, $pen, $del, $ins, $subst
    If Not IsString($st1) Then Return SetError(-1, -1, -1)
    If Not IsString($st2) Then Return SetError(-2, -2, -1)
    If $st2 = '' Then Return StringLen($st1)
    If $st2 == $anytail Then Return 0
    If $st1 = '' Then
        Return (StringInStr($st2 & $anytail, $anytail, 1) - 1)
    EndIf
    ;~ $s1 = StringSplit(_LowerUnaccent($st1)), "", 2) ;; _LowerUnaccent() addon function not available here
    ;~ $s2 = StringSplit(_LowerUnaccent($st2)), "", 2) ;; _LowerUnaccent() addon function not available here
    $s1 = StringSplit(StringLower($st1), "", 2)
    $s2 = StringSplit(StringLower($st2), "", 2)
    Local $l1 = UBound($s1), $l2 = UBound($s2)
    Local $r[$l1 + 1][$l2 + 1]
    For $x = 0 To $l2 - 1
        Switch $s2[$x]
            Case $anychar
                If $x < $l1 Then
                    $s2[$x] = $s1[$x]
                EndIf
            Case $anytail
                $l2 = $x
                If $l1 > $l2 Then
                    $l1 = $l2
                EndIf
                ExitLoop
        EndSwitch
        $r[0][$x] = $x
    Next
    $r[0][$l2] = $l2
    For $x = 0 To $l1
        $r[$x][0] = $x
    Next
    For $x = 1 To $l1
        For $y = 1 To $l2
            $pen = Not ($s1[$x - 1] == $s2[$y - 1])
            $del = $r[$x - 1][$y] + 1
            $ins = $r[$x][$y - 1] + 1
            $subst = $r[$x - 1][$y - 1] + $pen
            If $del > $ins Then $del = $ins
            If $del > $subst Then $del = $subst
            $r[$x][$y] = $del
            If ($pen And $x > 1 And $y > 1 And $s1[$x - 1] == $s2[$y - 2] And $s1[$x - 2] == $s2[$y - 1]) Then
                If $r[$x][$y] >= $r[$x - 2][$y - 2] Then $r[$x][$y] = $r[$x - 2][$y - 2] + 1
                $r[$x - 1][$y - 1] = $r[$x][$y]
            EndIf
        Next
    Next
    Return ($r[$l1][$l2])
    ;~ ; usage
    ;~ Local $reference = "lexicographically"
    ;~ Local $Words[11][2] = [ _
    ;~ [$reference], _
    ;~ ["Lexicôgraphicaly"], _
    ;~ ["lexkographicaly"], _
    ;~ ["Lexico9raphically"], _
    ;~ ["lexioo9asdasraphically"], _
    ;~ ["Lexicographical"], _
    ;~ ["lexicographlcally"], _
    ;~ ["Lex1cogr@phically"], _
    ;~ ["lexic0graphïca1yl"], _
    ;~ ["lexIcOgraphically"], _
    ;~ ["Lexlcographically"] _
    ;~ ]
    ;~ For $i = 0 To UBound($Words) - 1
    ;~ $Words[$i][1] = _Typos($Words[$i][0], $reference)
    ;~ Next
    ;~ _ArrayDisplay($Words, "Number of typos")
    ;~ ConsoleWrite("Usage of '_' and '%' wildcards in pattern:" & @LF & @TAB & "_Typos('lex1c0gr@fhlâofznho', 'LEx_c_gr%') = " & _Typos('lex1c0gr@fhlofznho', 'lex_c_gr%') & @LF)
    ;~ ConsoleWrite("Does not always return the absolute minimum edit distance:" & @LF & @TAB & "_Typos('bdac', 'abcd') = " & _Typos('bdac', 'abcd') & @LF)
    ;~
EndFunc

got it, jchd:


Thanks, FireFox and jdelaney, for your help.

When I try this in your script (FireFox):

Local Const $s1 = "pizza service"

Local Const $s2 = "Pizza Service"

the result is 0% (perfect for me).

But with:

Local Const $s1 = "pizza service"

Local Const $s2 = "the pizza Service"

the result is 100% (not good; I would like about 20% difference).


If you need it to be case sensitive, just change this line in the example FireFox provided:

If StringCompare($a1[$i], $a2[$i], 2) <> 0 Then $iDiffCount += 1
to this:
If StringCompare($a1[$i], $a2[$i], 1) <> 0 Then $iDiffCount += 1


Thanks, water.

My problem was not with case sensitivity.

The problem is:

Local Const $s1 = "pizza service"

Local Const $s2 = "the pizza Service"

The result is 100% difference (not good for me; I would like about 20% difference).


The result is 100% difference (not good for me; I would like about 20% difference).

Yes, because it compares character by character from left to right, so the extra "the " at the start shifts every position. I don't know which algorithm would best fit your needs.

Maybe run a second check from the opposite direction and take the smaller difference?
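For what it's worth, here is a minimal sketch of that two-direction idea. The function names _PosDiffPercent and _TwoWayDiffPercent are made up for this example, and counting the extra characters of the longer string as differences is an assumption of the sketch, not something the script above does. It compares position by position from the left, then again from the right (by reversing both strings), and keeps the smaller difference.

```autoit
; Sketch only: _PosDiffPercent() and _TwoWayDiffPercent() are hypothetical helper names.

; Position-by-position difference in percent, case-insensitive.
; Characters of the longer string that have no partner count as differences.
Func _PosDiffPercent($s1, $s2)
    Local $iLen1 = StringLen($s1), $iLen2 = StringLen($s2)
    Local $iMin = $iLen1, $iMax = $iLen2
    If $iLen2 < $iLen1 Then
        $iMin = $iLen2
        $iMax = $iLen1
    EndIf
    If $iMax = 0 Then Return 0 ; two empty strings: nothing differs
    Local $iDiff = $iMax - $iMin
    For $i = 1 To $iMin
        If StringCompare(StringMid($s1, $i, 1), StringMid($s2, $i, 1)) <> 0 Then $iDiff += 1
    Next
    Return $iDiff / $iMax * 100
EndFunc

; Compare from the left and from the right, keep the smaller difference.
Func _TwoWayDiffPercent($s1, $s2)
    Local $fLeft = _PosDiffPercent($s1, $s2)
    Local $fRight = _PosDiffPercent(StringReverse($s1), StringReverse($s2))
    If $fRight < $fLeft Then Return $fRight
    Return $fLeft
EndFunc

ConsoleWrite(Round(_TwoWayDiffPercent("pizza service", "the pizza Service"), 1) & "%" & @CRLF)
```

For "pizza service" against "the pizza Service" this should report about 23.5% (the 4 extra characters out of 17). It only helps when the extra text sits at one end of the string; an insertion in the middle still throws off both passes.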

Br, FireFox.


My best bet is: Search for an algorithm written in Visual Basic and then translate it to AutoIt.


Thanks for your link, water. Like I said, there are different algorithms for checking the similarity of strings, and from this search I'm coming up with the link below.

@sambalec

Can you choose an algorithm from this page? I or someone else will be glad to translate it for you ;)

Br, FireFox.


Local $reference = "pizza service"
Local $Words[4][4] = [ _
  [$reference], _
  ["the pizza service"], _
  ["tha piza service"], _
  ["pitza sarvace"]]
For $i = 0 To UBound($Words) - 1
 $Words[$i][1] = _Typos($Words[$i][0], $reference)
 $Words[$i][2] = (StringLen($reference) - $Words[$i][1]) / StringLen($reference)
 $Words[$i][3] = Abs(1-(StringLen($reference) - $Words[$i][1]) / StringLen($reference))
Next
_ArrayDisplay($Words, "Number of typos")
Exit
Func _Typos(Const $st1, Const $st2, $anychar = '_', $anytail = '%') ; Get amount of typos between two strings
 Local $s1, $s2, $pen, $del, $ins, $subst
 If Not IsString($st1) Then Return SetError(-1, -1, -1)
 If Not IsString($st2) Then Return SetError(-2, -2, -1)
 If $st2 = '' Then Return StringLen($st1)
 If $st2 == $anytail Then Return 0
 If $st1 = '' Then
  Return (StringInStr($st2 & $anytail, $anytail, 1) - 1)
 EndIf
;~ $s1 = StringSplit(_LowerUnaccent($st1)), "", 2) ;; _LowerUnaccent() addon function not available here
;~ $s2 = StringSplit(_LowerUnaccent($st2)), "", 2) ;; _LowerUnaccent() addon function not available here
 $s1 = StringSplit(StringLower($st1), "", 2)
 $s2 = StringSplit(StringLower($st2), "", 2)
 Local $l1 = UBound($s1), $l2 = UBound($s2)
 Local $r[$l1 + 1][$l2 + 1]
 For $x = 0 To $l2 - 1
  Switch $s2[$x]
   Case $anychar
    If $x < $l1 Then
     $s2[$x] = $s1[$x]
    EndIf
   Case $anytail
    $l2 = $x
    If $l1 > $l2 Then
     $l1 = $l2
    EndIf
    ExitLoop
  EndSwitch
  $r[0][$x] = $x
 Next
 $r[0][$l2] = $l2
 For $x = 0 To $l1
  $r[$x][0] = $x
 Next
 For $x = 1 To $l1
  For $y = 1 To $l2
   $pen = Not ($s1[$x - 1] == $s2[$y - 1])
   $del = $r[$x - 1][$y] + 1
   $ins = $r[$x][$y - 1] + 1
   $subst = $r[$x - 1][$y - 1] + $pen
   If $del > $ins Then $del = $ins
   If $del > $subst Then $del = $subst
   $r[$x][$y] = $del
   If ($pen And $x > 1 And $y > 1 And $s1[$x - 1] == $s2[$y - 2] And $s1[$x - 2] == $s2[$y - 1]) Then
    If $r[$x][$y] >= $r[$x - 2][$y - 2] Then $r[$x][$y] = $r[$x - 2][$y - 2] + 1
    $r[$x - 1][$y - 1] = $r[$x][$y]
   EndIf
  Next
 Next
 Return ($r[$l1][$l2])
EndFunc   ;==>_Typos

output: (against the expected)

Row|String|Count wrong|Percent correct|Percent wrong
[0]|pizza service|0|1|0
[1]|the pizza service|4|0.692307692307692|0.307692307692308
[2]|tha piza service|5|0.615384615384615|0.384615384615385
[3]|pitza sarvace|3|0.769230769230769|0.230769230769231

Or switch the comparison so that the percentage is relative to the length of the actual string instead of the reference, using:

Local $reference = "pizza service"
Local $Words[4][4] = [ _
  [$reference], _
  ["the pizza service"], _
  ["tha piza service"], _
  ["pitza sarvace"]]
For $i = 0 To UBound($Words) - 1
 $Words[$i][1] = _Typos($Words[$i][0], $reference)
 $Words[$i][2] = (StringLen($Words[$i][0]) - $Words[$i][1]) / StringLen($Words[$i][0])
 $Words[$i][3] = Abs(1-(StringLen($Words[$i][0]) - $Words[$i][1]) / StringLen($Words[$i][0]))
Next
_ArrayDisplay($Words, "Number of typos")

output:

[0]|pizza service|0|1|0
[1]|the pizza service|4|0.764705882352941|0.235294117647059
[2]|tha piza service|5|0.6875|0.3125
[3]|pitza sarvace|3|0.769230769230769|0.230769230769231


sambalec,

What percent similar are the following two sets of strings (by your definition of similar)?

abcd

acbd

and

z

zzz

kylomas

also: these strings

the boy

the boy


A percentage is not necessarily the most informative measure, since it depends on the length of the strings. My function returns the number of edits required to change string1 into string2.
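If a percentage is still wanted, one common way to get it from an edit count is to divide by the length of the longer string, which keeps the result between 0 and 100. The sketch below is only an illustration of that idea: _EditDistance and _SimilarityPercent are hypothetical names, and the distance used here is a plain Levenshtein distance (no transpositions, no wildcards), so it is not the same function as _Typos.

```autoit
; Sketch only: _EditDistance() and _SimilarityPercent() are hypothetical names,
; not part of the standard UDFs. Plain Levenshtein distance, case-insensitive.
Func _EditDistance($sA, $sB)
    $sA = StringLower($sA)
    $sB = StringLower($sB)
    Local $iLenA = StringLen($sA), $iLenB = StringLen($sB)
    Local $aDist[$iLenA + 1][$iLenB + 1]
    For $i = 0 To $iLenA
        $aDist[$i][0] = $i ; delete the first $i characters of $sA
    Next
    For $j = 0 To $iLenB
        $aDist[0][$j] = $j ; insert the first $j characters of $sB
    Next
    Local $iCost, $iBest
    For $i = 1 To $iLenA
        For $j = 1 To $iLenB
            $iCost = 1
            If StringMid($sA, $i, 1) == StringMid($sB, $j, 1) Then $iCost = 0
            $iBest = $aDist[$i - 1][$j - 1] + $iCost ; substitution (or match)
            If $aDist[$i - 1][$j] + 1 < $iBest Then $iBest = $aDist[$i - 1][$j] + 1 ; deletion
            If $aDist[$i][$j - 1] + 1 < $iBest Then $iBest = $aDist[$i][$j - 1] + 1 ; insertion
            $aDist[$i][$j] = $iBest
        Next
    Next
    Return $aDist[$iLenA][$iLenB]
EndFunc

; Similarity in percent, relative to the longer of the two strings.
Func _SimilarityPercent($sA, $sB)
    Local $iMax = StringLen($sA)
    If StringLen($sB) > $iMax Then $iMax = StringLen($sB)
    If $iMax = 0 Then Return 100 ; two empty strings are identical
    Return (1 - _EditDistance($sA, $sB) / $iMax) * 100
EndFunc

ConsoleWrite(Round(_SimilarityPercent("pizza service", "the pizza Service"), 1) & "%" & @CRLF)
```

Here the distance is 4 (the inserted "the ") and the longer string has 17 characters, so the line should print 76.5%. Whether to divide by the longer string, the shorter one, or a fixed reference is a design choice; each gives a different percentage for the same edit count.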


the Pizza Service

Service Pizza the

I need to get same result :-)

These wouldn't return the same result... these would:

the Pizza Service

Pizza Service the

Local $reference = "Pizza Service"
Local $Words[4][4] = [ _
  [$reference], _
  ["the Pizza Service"], _
  ["Pizza Service the"], _
  ["Service Pizza the"]]
For $i = 0 To UBound($Words) - 1
 $Words[$i][1] = _Typos($Words[$i][0], $reference)
 $Words[$i][2] = (StringLen($Words[$i][0]) - $Words[$i][1]) / StringLen($Words[$i][0])
 $Words[$i][3] = Abs(1-(StringLen($Words[$i][0]) - $Words[$i][1]) / StringLen($Words[$i][0]))
Next
_ArrayDisplay($Words, "Number of typos")

output:

[0]|Pizza Service|0|1|0
[1]|the Pizza Service|4|0.764705882352941|0.235294117647059
[2]|Pizza Service the|4|0.764705882352941|0.235294117647059
[3]|Service Pizza the|14|0.176470588235294|0.823529411764706
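If word order really should not matter, one possible approach (an assumption about what is wanted, not something established in this thread) is to put the words of each string into a canonical order before measuring the distance. The helper name _NormalizeWords below is made up for this sketch; it lowercases the text, splits it on spaces, sorts the words, and joins them again.

```autoit
#include <Array.au3>

; Sketch only: _NormalizeWords() is a hypothetical helper name.
Func _NormalizeWords($sText)
    ; Lowercase, strip leading/trailing whitespace and collapse double spaces (flags 1 + 2 + 4 = 7)
    $sText = StringStripWS(StringLower($sText), 7)
    If $sText = "" Then Return ""
    Local $aWords = StringSplit($sText, " ", 2) ; flag 2: zero-based array without a count element
    _ArraySort($aWords)
    Return _ArrayToString($aWords, " ")
EndFunc

ConsoleWrite(_NormalizeWords("the Pizza Service") & @CRLF)
ConsoleWrite(_NormalizeWords("Service Pizza the") & @CRLF)
```

Both lines should print "pizza service the", so passing the normalized strings to _Typos (or any other distance function) gives the same result for both word orders. The trade-off is that genuinely reordered phrases are then treated as identical.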


A percentage is not necessarily the most informative measure, since it depends on the length of the strings. My function returns the number of edits required to change string1 into string2.

I know, I'm trying to understand the OP's rules...

sambalec,

Try this:

local $str1 = 'the pizaa service', $init_len1 = stringlen($str1)
local $str2 = 'pizaa service the', $init_len2 = stringlen($str2)

for $1 = 1 to stringlen($str1)
    for $2 = 1 to stringlen($str2)
        if stringmid($str1,$1,1) = stringmid($str2,$2,1) then
            $str2 = stringreplace($str2,stringmid($str2,$2,1),'_')
            $str1 = stringreplace($str1,stringmid($str1,$1,1),'_')
        endif
    next
next

$str1 = stringreplace($str1,'_','')
$str2 = stringreplace($str2,'_','')

ConsoleWrite('String1 is ' & round( (stringlen($str1)/$init_len2)*100,2 ) & '% different from string2' & @LF)
ConsoleWrite('String2 is ' & round( (stringlen($str2)/$init_len1)*100,2 ) & '% different from string1' & @LF)

kylomas


My remark was not towards you kylomas.

Fuzzy question, fuzzy answer.


My remark was not towards you kylomas.

Fuzzy question, fuzzy answer.

Yes, I know, been trying to get specifications.

@sambalec,

Please define exactly what you want.

kylomas


sambalec, (follow up from 04/02/2013)

The code that I posted simply eliminates matching letters from both strings, every occurrence at once. As a result, the following reports a difference of 0 percent:

;local $str1 = 'the pizaa service', $init_len1 = stringlen($str1)
;local $str2 = 'pizaa service the', $init_len2 = stringlen($str2)
local $str1 = 'zzzzzz', $init_len1 = stringlen($str1)
local $str2 = 'z', $init_len2 = stringlen($str2)

for $1 = 1 to stringlen($str1)
    for $2 = 1 to stringlen($str2)
        if stringmid($str1,$1,1) = stringmid($str2,$2,1) then
            $str2 = stringreplace($str2,stringmid($str2,$2,1),'_')
            $str1 = stringreplace($str1,stringmid($str1,$1,1),'_')
        endif
    next
next

$str1 = stringreplace($str1,'_','')
$str2 = stringreplace($str2,'_','')

ConsoleWrite('String1 is ' & round( (stringlen($str1)/$init_len2)*100,2 ) & '% different from string2' & @LF)
ConsoleWrite('String2 is ' & round( (stringlen($str2)/$init_len1)*100,2 ) & '% different from string1' & @LF)

Do you see why we are asking for further specifications?

kylomas

Edit: additional info

This version replaces only one occurrence per match, so duplicate characters are left over: "zzzzzz" compared to "z" is 500% different (because there are five "z" characters left over).

;local $str1 = 'the pizaa service', $init_len1 = stringlen($str1)
;local $str2 = 'pizaa service the', $init_len2 = stringlen($str2)
local $str1 = 'zzzzzz', $init_len1 = stringlen($str1)
local $str2 = 'z', $init_len2 = stringlen($str2)

for $1 = 1 to stringlen($str1)
    for $2 = 1 to stringlen($str2)
        if stringmid($str1,$1,1) = stringmid($str2,$2,1) then
            $str2 = stringreplace($str2,stringmid($str2,$2,1),'_',1)
            $str1 = stringreplace($str1,stringmid($str1,$1,1),'_',1)
        endif
    next
next

$str1 = stringreplace($str1,'_','')
$str2 = stringreplace($str2,'_','')

ConsoleWrite('String1 is ' & round( (stringlen($str1)/$init_len2)*100,3 ) & '% different from string2' & @LF)
ConsoleWrite('String2 is ' & round( (stringlen($str2)/$init_len1)*100,3 ) & '% different from string1' & @LF)