compare strings

fgthhhh · April 26, 2010

I have a main string and several sub-strings, and I want to find the sub-string that is most like the main string. I don't know how to do it.

example: main-string: aqwert

sub-strings:

+qwerb

+gfdgf

+qwbt

result:

+qwerb : 80% ( have same qwer)

+gfdgf: 0% ( nothing like)

+qwbt: 20% ( have same qw)

pls help me, thx

water · April 26, 2010

I think "StringRegExp" will do what you need. I'm no expert so the following example checks for any character in the pattern and therefore doesn't give the result you need:

#include <array.au3>
$string = "aqwert"
$pattern = "qwbt"
$R = StringRegExp($string,"[" & $pattern & "]",3)
If IsArray($R) Then
    MsgBox(0,"",UBound($R)*100/StringLen($pattern) & "% match")
Else
    MsgBox(0,"","0% match")
EndIf

Maybe some RegExpr Guru can jump in and give you the correct expression.
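
For reference, the character class in the example above counts every character of the main string that occurs anywhere in the pattern ("q", "w" and "t" for "qwbt"), regardless of order. A minimal sketch of one alternative follows, assuming that "alike" means the longest run of consecutive characters the two strings share, divided by the length of the candidate. Both that definition and the helper name _LongestCommonRun are assumptions made for illustration, not something from this thread.

```autoit
; Hypothetical helper, not from this thread: length of the longest run of
; consecutive characters shared by both strings (case-sensitive).
Func _LongestCommonRun($sA, $sB)
    Local $iBest = 0, $iLen
    Local $iLenA = StringLen($sA), $iLenB = StringLen($sB)
    For $i = 1 To $iLenA
        For $j = 1 To $iLenB
            $iLen = 0
            ; Extend the run for as long as the characters keep matching.
            While $i + $iLen <= $iLenA And $j + $iLen <= $iLenB _
                    And StringMid($sA, $i + $iLen, 1) == StringMid($sB, $j + $iLen, 1)
                $iLen += 1
            WEnd
            If $iLen > $iBest Then $iBest = $iLen
        Next
    Next
    Return $iBest
EndFunc   ;==>_LongestCommonRun

Local $sMain = "aqwert"
Local $aSubs[3] = ["qwerb", "gfdgf", "qwbt"]
For $sSub In $aSubs
    ; Assumed definition of "percent alike": shared run length / candidate length.
    ConsoleWrite($sSub & ": " & Round(_LongestCommonRun($sMain, $sSub) * 100 / StringLen($sSub)) & "%" & @CRLF)
Next
```

Under this definition "qwerb" scores 80% and "gfdgf" 0%, as in the first post, while "qwbt" scores 50% because "qw" covers half of it. A different definition of similarity would give different numbers.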

whim · April 26, 2010

This might help as well

wim

fgthhhh · April 26, 2010

StringRegExp worked like magic, but I still don't understand how it works.

Approximate string matching looks really complicated to me.

Anyway, thank you both so much. I will need to research this more.

jchd · April 26, 2010

You can use my Typos() fuzzy comparison function: Typos.au3

It computes the edit distance between two strings, that is, the number of omissions, insertions, changes or swaps of adjacent letters necessary to transform one string into the other. If you compare several strings in succession and keep the one with the fewest errors (typos), you'll be home.

Optionally, you can use two distinct wildcards in the second string: _ and % (the same characters as in SQL LIKE).

_ is a single-character joker, much like ? in Windows filename patterns.

% may represent zero or more characters, like Windows * (but % may only appear at the end of the second parameter).

Try it and post again if you have problems using it.
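
A short sketch of the two wildcards in use. It assumes the Typos.au3 file linked above has been saved next to the script and that it behaves like the copy of Typos() quoted later in this thread; the sample names are made up for illustration.

```autoit
#include "Typos.au3" ; assumption: jchd's UDF saved in the script directory

; '_' stands in for exactly one character of the first string.
ConsoleWrite(Typos("smith", "sm_th") & @CRLF)    ; 0: the joker absorbs the "i"

; A trailing '%' stops the comparison there, so only the prefix counts.
ConsoleWrite(Typos("smithson", "smi%") & @CRLF)  ; 0: everything after "smi" is ignored

; Without wildcards every difference is counted.
ConsoleWrite(Typos("smith", "smyth") & @CRLF)    ; 1: one changed letter
```

In the quoted copy the joker is resolved by position (it takes the character at the same index in the first string), so it is most reliable when no insertion or omission occurs before it.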

fgthhhh · April 29, 2010

Hi jchd, you wrote an awesome script,

but I don't understand what the function returns.

Does 0 mean the strings are the same?

Does a higher number mean more mistakes?

I tried:

$asd = _Typos("aqwert", "qwertb")

MsgBox(0,"",$asd)

It returns 2.

What does that mean?

Edited April 29, 2010 by fgthhhh

czardas · April 29, 2010

I took a quick look at jchd's code. It seems that the return value 2 means that there are two changes needed to convert one string to the other. The changes are as follows:

1. Delete the first character => a

2. Add a character on the end => b.

This converts one string to the other in 2 steps. jchd will be able to tell you if I'm wrong about this.

Edited April 29, 2010 by czardas

jchd · April 29, 2010

That's correct.

If typos($str1, $str2) = 0 Then MsgBox(0, '', $str1 & ' and ' & $str2 & ' are identical (ignoring case).')

; Computes the number of typos (Damerau-Levenshtein distance) between two strings.
; Four types of differences are counted:
;   insertion of a character,      abcd    ab#cd
;   deletion of a character,       abcd    acd
;   exchange of a character        abcd    ab$d
;   inversion of adjacent chars    abcd    acbd
;
; This function does NOT satisfy the so-called "triangle inequality", which means
; more simply that it makes NO attempt to compute the MINIMUM edit distance in all
; cases. If you need that, you should use more complex algorithms.
;
; This simple function allows a fuzzy compare for e.g. recovering from typical
; human typos in short strings like names, address, cities... while getting rid of
; minor scripting differences (accents, ligatures).
;
; Strings are unaccented then lowercased.
; String $st2 can be used as a pattern similar to the SQL 'LIKE' operator:
; '_' and trailing '%' act as in LIKE. These wildcards can be passed as parameters
; but % should appear at most once for the function to work properly.

Another comment, taken from the C version I use as an SQLite extension:

** TYPOS($str1, $str2)
** returns the "Damerau-Levenshtein distance" between StringLower(str1) and
** StringLower(str2). This is the number of insertions, omissions, changes
** and transpositions (of adjacent letters only).
**
** If the reference string is 'abcdef', it will return 1 (one typo) for
**     'abdef'     missing c
**     'abcudef'   u inserted
**     'abzef'     c changed into z
**     'abdcef'    c & d exchanged
**
** Only one level of "typo" is considered, e.g. the function will
** consider the following transformations to be 3 typos:
**     'abcdef'    reference
**     'abdcef'    c & d exchanged
**     'abdzcef'   z inserted inside (c & d exchanged)
** In this case, it will return 3. Technically, it does not
** always return the minimum edit distance and doesn't satisfy
** the "triangle inequality" in all cases. It is nonetheless
** very useful to anyone having to look up simple entries subject to
** user typos (e.g. name or city name).
**
** It will also accept '_' and a trailing '%' in str2, both acting
** as in the SQL LIKE operator.
**
** You can use it this way:
**     $str = "Leiwenschtein"
**     If typos($str, 'leivencht%') <= 2;
** or this way:
**     $nbErrors = typos($str1, $str2)
**
** NOTE: the implementation may seem naive but is open to several
**       evolutions. Due to the complexity in O(n*m) you
**       should reserve its use to _short_ fields only. There
**       are much better algorithms for large fields (most of
**       which are terrible for small strings.) The choice made
**       reflects the typical need to match names, surnames,
**       street addresses, cities or such data prone to typos
**       in user input. Flexibility has been chosen over mere
**       performance, because fuzzy search is _slow_ anyway.
**       So you better have a 380% slower algo that retrieves
**       the data you're after, than a 100% slow algo that misses
**       them most of the time.
**
** | DO NOT use TYPOS in case StringInStr would do! For instance, if
** | your data contains a fixed substring (without typo),
** | then use:
** |     If StringInStr($cityname, 'angel') Then
** | It will match 'Los Angeles' without question. If you try:
** |     If typos($cityname, 'angel%') <= 4 Then
** | you will be overwhelmed with data from everywhere, since up
** | to 4 typos allows for typically _many_ values (cities, here).

Hope this clears some mud. If you still have practical problems using it in the real world, post here.
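
To relate the typo count back to the percentages asked for in the first post, one option is to scale it by the length of the longer string. This is only a sketch under that assumption: the helper name _SimilarityPercent and the scaling rule are not part of Typos.au3, and the include line assumes the UDF sits next to the script.

```autoit
#include <Math.au3>
#include "Typos.au3" ; assumption: jchd's UDF saved in the script directory

; Hypothetical helper: map a typo count onto a 0..100 similarity score.
; Intended for plain strings; wildcards in $sB would distort the lengths.
Func _SimilarityPercent($sA, $sB)
    Local $iLongest = _Max(StringLen($sA), StringLen($sB))
    If $iLongest = 0 Then Return 100 ; two empty strings are identical
    Return 100 * ($iLongest - Typos($sA, $sB)) / $iLongest
EndFunc   ;==>_SimilarityPercent

ConsoleWrite(Round(_SimilarityPercent("aqwert", "qwerb")) & "%" & @CRLF)
ConsoleWrite(Round(_SimilarityPercent("aqwert", "gfdgf")) & "%" & @CRLF)
```

Identical strings score 100 and the score drops as typos accumulate. The exact figures will not match the hand-made percentages in the first post, since those were based on a different notion of similarity.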

fgthhhh · April 30, 2010

Thanks for the answer, mate.

I want to ask a question.

Can I use your script to auto-correct words?

If yes, can you show me an example?

For example, how can "thraa" be auto-corrected to "three"?

Can I compare "thraa" with some possible words and choose the best one?

Edited April 30, 2010 by fgthhhh

jchd · April 30, 2010

You may have some (relative) success in doing so, but mostly for limited cases. For instance, this function works well for selecting words from a list whose spelling is close to a given word. It was designed for this purpose as an extension to a database engine.

In your example, only a human brain or a really "smart" program can choose which of threw, three, tharm (for instance) should be the replacement for thraa. To make the (right) correction by program, you have to identify the context, the grammar, the partial semantics and devise a target global semantics to infer the right correction.

For spelling or grammar correction, you'll have a much better time using one of the available libraries specialized in those tasks.

fgthhhh · April 30, 2010

All my words are limited to the numbers from one to twenty (1->20), so there will be no threw or tharm.

Can you show me a way to correct the word?

I really need an example to understand the code.

Edited April 30, 2010 by fgthhhh

jchd · April 30, 2010

Do you mean the numbers 1 to 20 in plain text?

If so, place the text in an array $A and find the minimum of typos($A[$i], $word), if any.

Try to come up with something of your own.

fgthhhh · April 30, 2010

Please check whether this is OK:

Dim $numeros[20]
$numeros[0]="one"
$numeros[1]="two"
$numeros[2]="three"
$numeros[3]="four"
$numeros[4]="five"
$numeros[5]="six"
$numeros[6]="seven"
$numeros[7]="eight"
$numeros[8]="nine"
$numeros[9]="ten"
$numeros[10]="eleven"
$numeros[11]="twelve"
$numeros[12]="thirteen"
$numeros[13]="fourteen"
$numeros[14]="fifteen"
$numeros[15]="sixteen"
$numeros[16]="seventeen"
$numeros[17]="eighteen"
$numeros[18]="nineteen"
$numeros[19]="twenty"

$test_word = "thraa"
dim $result[20]
for $k = 0 to 19
    $result[$k] = typos($numeros[$k], $test_word)
next
_ArraySort($result) ; or _arraymin($result)

; Then I can get the lowest result, but I can't get the matching word.

I'm stuck here. I don't know how to get the correct answer.

Edited April 30, 2010 by fgthhhh
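
The difficulty described above is that sorting $result throws away the link between each score and its word. A minimal sketch of one way around it uses _ArrayMinIndex() from Array.au3, which returns the position of the smallest value rather than the value. The three scores below are stand-in values chosen for illustration; in the real script they would come from typos().

```autoit
#include <Array.au3>

; Stand-in data: in the real script $result would hold the values returned by
; typos() for each entry of $numeros, in the same order.
Local $numeros[3] = ["one", "two", "three"]
Local $result[3] = [5, 4, 2]

; The second argument (1) asks for a numeric rather than a string comparison.
Local $iBest = _ArrayMinIndex($result, 1)
ConsoleWrite("Closest word: " & $numeros[$iBest] & " (" & $result[$iBest] & " typos)" & @CRLF)
```

Because the index is shared by both arrays, $numeros[$iBest] is the word that produced the lowest score.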

jchd · April 30, 2010

Hey, calm down. There is no need to brag like you do!

Use something along this line:

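#include <String.au3>
#include "Typos.au3" ; assumed to be the Typos.au3 file from the earlier post, saved next to this script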

Local Const $numeros[20] = [ _
    "one", _
    "two", _
    "three", _
    "four", _
    "five", _
    "six", _
    "seven", _
    "eight", _
    "nine", _
    "ten", _
    "eleven", _
    "twelve", _
    "thirteen", _
    "fourteen", _
    "fifteen", _
    "sixteen", _
    "seventeen", _
    "eighteen", _
    "nineteen", _
    "twenty" _
]

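Local $test_word = "thraa"
; Start from the worst acceptable score; -1 means no candidate has been found yet.
Local $bestMatch = StringLen($test_word), $bestMatchIdx = -1, $typos
For $k = 0 To UBound($numeros) - 1
    $typos = Typos($numeros[$k], $test_word)
    If $typos < $bestMatch Then
        $bestMatch = $typos
        $bestMatchIdx = $k
    EndIf
Next
If $bestMatchIdx >= 0 Then
    ConsoleWrite(StringFormat("Best match for '%s' is %s (%u) with %u spelling errors.\n", $test_word, $numeros[$bestMatchIdx], $bestMatchIdx + 1, $bestMatch))
Else
    ConsoleWrite(StringFormat("No word is close enough to '%s'.\n", $test_word))
EndIf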

fgthhhh · April 30, 2010

great

You are my hero, that's exactly what I need.

Malkey · April 30, 2010

The _EditDistance() function from here appears to be another version of the Typos() function from post #5 of this thread.

#include <String.au3>
#include <Array.au3>
#include <Math.au3>

Local Const $numeros[21] = ["zero", "one", "two", "three", "four", "five", "six", _
        "seven", "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", _
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen", "twenty"]

Local $test_word = "thraa"
Local $bestMatch = StringLen($test_word), $bestMatchIdx, $typos
For $k = 0 To UBound($numeros) - 1
    $typos1 = Typos($numeros[$k], $test_word)
    ConsoleWrite("Typos => " & $typos1 & " ")
    $typos = _EditDistance($numeros[$k], $test_word)
    ConsoleWrite($typos & " <= _EditDistance" & @CRLF)
    If $typos < $bestMatch Then
        $bestMatch = $typos
        $bestMatchIdx = $k
    EndIf
Next

ConsoleWrite(StringFormat("Best match for '%s' is '%s' with %u different, non-matching characters.\n", $test_word, $numeros[$bestMatchIdx], $bestMatch))


Func _EditDistance($s1, $s2)
    Local $m[StringLen($s1) + 1][StringLen($s2) + 1], $i, $j
    $m[0][0] = 0; boundary conditions
    For $j = 1 To StringLen($s2)
        $m[0][$j] = $m[0][$j - 1] + 1; boundary conditions
    Next
    For $i = 1 To StringLen($s1)
        $m[$i][0] = $m[$i - 1][0] + 1; boundary conditions
    Next
    For $j = 1 To StringLen($s2);   outer loop
        For $i = 1 To StringLen($s1) ;  inner loop
            If (StringMid($s1, $i, 1) = StringMid($s2, $j, 1)) Then
                $diag = 0;
            Else
                $diag = 1
            EndIf
            $m[$i][$j] = _Min($m[$i - 1][$j] + 1, _ ; insertion
                    (_Min($m[$i][$j - 1] + 1, _ ;   deletion
                    $m[$i - 1][$j - 1] + $diag))) ; substitution
        Next
    Next
    Return $m[StringLen($s1)][StringLen($s2)] ; $m ;
EndFunc ;==>_EditDistance

Func Typos(Const $st1, Const $st2, $anychar = '_', $anytail = '%')
    Local $s1, $s2, $pen, $del, $ins, $subst
    If Not IsString($st1) Then Return SetError(-1, -1, -1)
    If Not IsString($st2) Then Return SetError(-2, -2, -1)
    If $st2 = '' Then Return StringLen($st1)
    If $st2 == $anytail Then Return 0
    If $st1 = '' Then
        Return (StringInStr($st2 & $anytail, $anytail, 1) - 1)
    EndIf
;~  $s1 = StringSplit(_LowerUnaccent($st1)), "", 2)     ;; _LowerUnaccent() addon function not available here
;~  $s2 = StringSplit(_LowerUnaccent($st2)), "", 2)     ;; _LowerUnaccent() addon function not available here
    $s1 = StringSplit(StringLower($st1), "", 2)
    $s2 = StringSplit(StringLower($st2), "", 2)
    Local $l1 = UBound($s1), $l2 = UBound($s2)
    Local $r[$l1 + 1][$l2 + 1]
    For $x = 0 To $l2 - 1
        Switch $s2[$x]
            Case $anychar
                If $x < $l1 Then
                    $s2[$x] = $s1[$x]
                EndIf
            Case $anytail
                $l2 = $x
                If $l1 > $l2 Then
                    $l1 = $l2
                EndIf
                ExitLoop
        EndSwitch
        $r[0][$x] = $x
    Next
    $r[0][$l2] = $l2
    For $x = 0 To $l1
        $r[$x][0] = $x
    Next
    For $x = 1 To $l1
        For $y = 1 To $l2
            $pen = Not ($s1[$x - 1] == $s2[$y - 1])
            $del = $r[$x - 1][$y] + 1
            $ins = $r[$x][$y - 1] + 1
            $subst = $r[$x - 1][$y - 1] + $pen
            If $del > $ins Then $del = $ins
            If $del > $subst Then $del = $subst
            $r[$x][$y] = $del
            If ($pen And $x > 1 And $y > 1 And $s1[$x - 1] == $s2[$y - 2] And $s1[$x - 2] == $s2[$y - 1]) Then
                If $r[$x][$y] >= $r[$x - 2][$y - 2] Then $r[$x][$y] = $r[$x - 2][$y - 2] + 1
                $r[$x - 1][$y - 1] = $r[$x][$y]
            EndIf
        Next
    Next
    Return ($r[$l1][$l2])
EndFunc ;==>Typos
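
The two functions above do not always return the same number. Reading the code, _EditDistance() counts only insertions, deletions and substitutions, while Typos() also treats a swap of two adjacent letters as a single typo. The sketch below shows the difference. It assumes both functions have been saved to a file named EditDistances.au3, a hypothetical name used only for this example.

```autoit
#include <Math.au3>             ; _EditDistance() relies on _Min()
#include "EditDistances.au3"    ; hypothetical file holding _EditDistance() and Typos() from the post above

; "form" and "from" differ only by two adjacent letters being swapped.
ConsoleWrite("Typos:         " & Typos("form", "from") & @CRLF)          ; 1: one transposition
ConsoleWrite("_EditDistance: " & _EditDistance("form", "from") & @CRLF)  ; 2: counted as two substitutions
```

For data where swapped letters are a common mistake, the lower count from Typos() is likely to rank the intended word higher.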
