Comparing hundreds of sentences (Levenshtein etc.)


leuce

Hello everyone

I'm trying to flag similar sentences in a large text file (so as to remove all similar sentences except one). I would like your input on better methods of performing the comparisons.

I found three implementations of Levenshtein and one implementation of Sift2 in the AutoIt forums. (I have no way of knowing if those scripts do what they claim to do, but the three Levenshtein methods consistently yield mostly the same results, and the sentences that they flag are truly similar, so that's okay.)  In my non-real-world tests, the Sift2 comparison method is consistently faster than any of the Levenshtein ones.

My problem is not so much which comparison method or which implementation of it to use, but how to optimize the way I'm using it.

The method I currently use to flag similar sentences is this:
1. Take a sentence (starting from the top) and compare it to all remaining sentences individually;
2. If a sentence is flagged as "similar", remove it from the list of sentences and write the result to a file;
3. Then repeat, starting with the next sentence from the top.
(If a list has many similar sentences, then the process goes quicker towards the end of the list because previously flagged sentences are not re-checked.)

Below is the code for the method described above (excluding the third-party function that does the actual comparison).
- I could not figure out how to make _ArrayDelete work, so instead of deleting a flagged sentence from the array, I add "!!!!!" to the entry and then ignore entries that contain "!!!!!" (see the footnote after the code below). 😛
I attach the four scripts I'm currently testing on various lists of sentences, plus some sample results, for full disclosure.
 

#include <File.au3>
#include <Math.au3>
#include <Array.au3>
#include <Timers.au3>

$corrcount = 0

$fod = FileOpenDialog("Select files containing segments", @ScriptDir, "All (*.*)")
$fo = FileOpen($fod, 128) ; 128 = read in UTF-8 mode
$savefile = FileOpen($fod & "_output.txt", 129) ; 129 = append + UTF-8

; For some comparison methods, higher is better
$sthreshold = InputBox("", "Threshold? Lower is more similar.", "7")
$methodused = "fuzzy trim, type 1 (leven)"

$fr = FileRead($fo)

$lines = StringSplit($fr, @CRLF, 1) ; flag 1: split on the whole @CRLF pair; $lines[0] holds the line count

For $j = 1 To $lines[0] - 1

    $firstline = $lines[$j]

    ToolTip("Now processing line " & $j & @CRLF & $firstline)

    ; Lines already flagged as similar to an earlier line are skipped;
    ; unflagged lines are written to the output as "keepers".
    If StringInStr($lines[$j], "!!!!!") Then
        ContinueLoop
    Else
        FileWrite($savefile, $firstline & @CRLF)
    EndIf

    For $k = $j + 1 To $lines[0]

        If StringInStr($lines[$k], "!!!!!") Then
            ContinueLoop
        EndIf

        $simsim = Int(_Levenshtein($firstline, $lines[$k])) ; <<<--- Third-party comparator function (not included in this post)

        If $simsim < $sthreshold Then
            FileWrite($savefile, "{{" & $lines[$k] & "::" & $simsim & "}}" & @CRLF)
            $lines[$k] = $lines[$k] & "::!!!!!" ; mark as flagged so it is never compared or written again
            $corrcount = $corrcount + 1
        EndIf

    Next

Next

FileWrite($savefile, @CRLF & @CRLF & "Method used: " & $methodused & @CRLF & "Threshold: " & $sthreshold & @CRLF & "Total # of segments: " & $lines[0] & @CRLF & "Segments removed: " & $corrcount & @CRLF & @CRLF)
MsgBox(0, "Done!", "Method used: " & $methodused & @CRLF & "Threshold: " & $sthreshold & @CRLF & "Total # of segments: " & $lines[0] & @CRLF & "Segments removed: " & $corrcount & @CRLF & @CRLF & "Output file: " & $fod & "_output.txt", 0)

FileClose($fo)
FileClose($savefile)

Exit
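
(A footnote to the "!!!!!" workaround mentioned above: _ArrayDelete itself is straightforward once the array is passed correctly, as in the minimal demonstration below, which assumes a reasonably recent Array.au3. The real snag is that deleting entries while looping over $lines makes the count in $lines[0] and the For...Next bounds go stale, because the "To" expression is evaluated only once, so the tagging trick is a defensible workaround.)

#include <Array.au3>

; Minimal demonstration of _ArrayDelete (assumes a recent Array.au3, where the
; array is passed ByRef and the second parameter is the index to remove).
Local $aTest[4] = ["alpha", "bravo", "charlie", "delta"]
_ArrayDelete($aTest, 2)              ; removes "charlie"; later elements shift down one slot
_ArrayDisplay($aTest, "After delete")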

This method of flagging similar sentences works fine for lists with few sentences, short sentences, and lots of flagged sentences.  But I'm hoping to use it on longer lists eventually (e.g. 10 000 sentences of more normal length, with fewer than 10% similar sentences), so if there is a better way of finding (flagging) similar sentences, or speeding things up, it would be nice to know.

The reason for all of this is, well, I'm a translator, and the translation tools we use give financial discounts to clients for similar sentences, but it's typically not possible to temporarily remove similar sentences from a text.  The tools we use do count how many similar sentences there are and show them on a case-by-case basis, but they do not have the ability to output them to a list.

Any ideas would be appreciated.

Samuel

==

Added:

Since writing the draft of this post, I added two additional checks (a length check and a number-of-spaces check) before the actual similarity check, in order to avoid sending hopeless candidates to the similarity checking function.  It speeds things up tremendously.
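
Roughly, the two pre-checks look like the sketch below (simplified, not the exact code in the attached scripts; the function name, the 30% length tolerance and the 3-word tolerance are just examples):

; Simplified sketch of the two pre-checks (not the exact code in the attached
; scripts; the 30 % length tolerance and the 3-word tolerance are only examples).
; Both checks are very cheap compared with a Levenshtein call, so hopeless
; candidate pairs are rejected before the expensive comparison is ever made.
Func _WorthComparing($sA, $sB)
    ; 1. Length check: skip pairs whose lengths differ by more than ~30 %.
    Local $iLenA = StringLen($sA), $iLenB = StringLen($sB)
    If Abs($iLenA - $iLenB) > 0.3 * ($iLenA > $iLenB ? $iLenA : $iLenB) Then Return False

    ; 2. Number-of-spaces check, as a cheap stand-in for a word count.
    ;    @extended holds the number of replacements StringReplace performed.
    StringReplace($sA, " ", "")
    Local $iWordsA = @extended
    StringReplace($sB, " ", "")
    Local $iWordsB = @extended
    If Abs($iWordsA - $iWordsB) > 3 Then Return False

    Return True
EndFunc

; In the main loop, _Levenshtein() is then only called when
; _WorthComparing($firstline, $lines[$k]) returns True.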

Curiously, sorting the list of sentences by character length before running the script doesn't seem to help things, for example:
- Sentences pre-sorted by length: 2500 sentences, 630 flagged, 845 seconds
- Sentences NOT pre-sorted by length: 2500 sentences, 630 flagged, 750 seconds

I also tried a completely different method instead:
1. Take a sentence (starting from the top) and compare it to all remaining sentences individually UNTIL there is a match;
2. Then, flag the original sentence (not the matching one that was found) (and write the results to a file);
3. Then repeat, starting with the next sentence from the top.
In a single test on the 2500-sentence file mentioned above, this method took 3 times as long to complete; but in multiple tests on shorter files (e.g. 250 sentences), the alternative method was significantly faster (although obviously it selects other sentences as the ones to keep).

I can't tell from the similarity checking code whether the comparison is case-sensitive, but for my purposes it doesn't have to be.  Identical punctuation isn't necessary either, so I could lower-case the list and remove all punctuation before running the script; that will increase the number of matches, but I don't see how it will improve overall performance except by reducing the number of lines to check somewhat.
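
If I do go that route, the normalization pass would be something simple like the sketch below (the function name and the regular expression are only an example of lower-casing and stripping punctuation, not code from the attached scripts):

; Sketch of a simple normalization pass run over every line before comparing,
; so that case and punctuation differences no longer count against a match.
Func _Normalize($sText)
    $sText = StringLower($sText)
    ; remove anything that is not a letter, digit or whitespace (example pattern only)
    $sText = StringRegExpReplace($sText, "[^\p{L}\p{N}\s]", "")
    ; collapse any runs of whitespace left behind and trim both ends
    Return StringStripWS(StringRegExpReplace($sText, "\s+", " "), 3)
EndFunc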

trim internal fuzzies.zip

Edited by leuce

Let's take another viewpoint: besides the fact that you delete a sentence once a similar one is found instead of exchanging them, the basic algorithm is a repeated pairwise scan (much like nested bubble-sort passes), of global complexity O(n²) in the number of sentences, which grows rapidly as n grows. Add to that the basic property of the Damerau-Levenshtein distance, which is itself essentially O(m²) for a pair of sentences of average length m, and the whole algorithm is inherently slow.

I don't believe there exists a magic silver bullet to kill this bird, but actual run times highly depend on sentence complexity (length, among other considerations) and on how you decide whether two sentences are similar. I mean: typos are one possible source of difference; synonyms, added words or omitted words are another; "Yoda talk" (phrase rearrangement) is yet another, completely different one.

How would you compare and classify these sentences?
In Spain, the rain falls mainly on the plains.
In Spain, the rain mostly falls on the plains.
The rain in Spain falls mainly on the vast plains.
The rain in Spain predominantly falls on the plains.
Most of the autumn rain falls on the plains in Spain.

I think it would be very helpful if you could provide a sufficiently large real-world sample, along with a few examples of which sentences (line numbers) you'd regard as similar and a few you'd consider "not similar enough, but close". Be sure to include both typical and pathological cases. The samples you have included in the zip file are most probably too far from real-world data.

It might prove beneficial to adopt a completely different approach, most probably a mix of several techniques. Depending on the language used, reducing phrases to their Soundex (for English) or the equivalent for your language can be one good idea. Limiting comparisons to phrases of "similar enough" length can also be a great help. Counting words (or whitespace) can also be a good clue. Comparing the distribution of letters (think StringToASCIIArray) can be another criterion. I believe a clever mix of these techniques can be globally faster than a brute-force distance comparison, but that highly depends on real-world data, your degree of acceptance of false positives and false negatives, and more.
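
As a raw sketch of the letter-distribution idea (an illustration only; the function name is mine and the threshold still has to be tuned against real data): every single edit operation changes at most two histogram bins by one, so a pair whose histograms differ by much more than twice your distance threshold can be rejected without ever calling the expensive distance function.

; Raw sketch of a letter-distribution pre-filter (illustration only).  Each edit
; operation (insert, delete, substitute) changes at most two histogram bins by 1,
; so the edit distance is at least half the value returned here: pairs scoring
; well above 2 * threshold can be rejected without running Levenshtein at all.
Func _HistogramDifference($sA, $sB)
    Local $aBinsA[256]
    Local $aBinsB[256]
    Local $aCodes = StringToASCIIArray($sA)
    For $i = 0 To UBound($aCodes) - 1
        If $aCodes[$i] < 256 Then $aBinsA[$aCodes[$i]] += 1
    Next
    $aCodes = StringToASCIIArray($sB)
    For $i = 0 To UBound($aCodes) - 1
        If $aCodes[$i] < 256 Then $aBinsB[$aCodes[$i]] += 1
    Next
    Local $iDiff = 0
    For $i = 0 To 255
        $iDiff += Abs($aBinsA[$i] - $aBinsB[$i])
    Next
    Return $iDiff
EndFunc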

Any usable criterion that allows you to split the source into several distinct "candidate similar" buckets of sentences, each to be processed separately later, is worth the pain once the source gets large enough. This follows from the quadratic growth in the number of comparisons as the input grows.
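
As a crude illustration of that bucket idea, using the word count as the key (a Scripting.Dictionary is merely one convenient container; in practice you would probably also compare each bucket against its immediate neighbours):

; Crude illustration of splitting the input into buckets keyed by word count,
; so that only sentences in the same (or a neighbouring) bucket are ever compared.
Func _BucketByWordCount(ByRef $aLines)
    Local $oBuckets = ObjCreate("Scripting.Dictionary")
    For $i = 1 To $aLines[0]
        Local $aWords = StringSplit($aLines[$i], " ")
        Local $iKey = $aWords[0]        ; number of words in this sentence
        If Not $oBuckets.Exists($iKey) Then $oBuckets.Add($iKey, "")
        ; each bucket collects a "|"-separated list of line indices
        $oBuckets.Item($iKey) = $oBuckets.Item($iKey) & $i & "|"
    Next
    Return $oBuckets
EndFunc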

Edited by jchd


Thanks for your input, JCHD.

Let me explain (if you're interested) why I'm trying to do this.

Suppose we have a file of 1000 sentences in which these four sentences occur somewhere in the file (near the top, about a third of the way down, two thirds of the way down, and near the bottom of the file):

Click the link at the top of the page to jump to the bottom of the page.
Click the link at the bottom of the page to jump to the top of the page.

Press the green button at the bottom of the page to jump to the next page.
Press the red button at the bottom of the page to jump to the previous page.

These four sentences will each take about 20 seconds to translate by themselves, but if a translator could get access to the translations of previous sentences during translation, he can do the second sentence in 9 seconds, the third one in 15 seconds, the fourth one in 6 seconds.  And clients know this.  Clients want a discount for sentences 2, 3 and 4, based on the fact that they are similar to sentence 1. Translators don't mind giving such discounts, because we use special programs (called "CAT tools", computer assisted translation) that automatically find sentences that we translated previously and automatically suggest those sentences' translations to us. (How the discounts are calculated is not relevant here.)

But sometimes the job is too big for just one translator (given the specific deadline), and the file gets shared between e.g. four translators working independently.  Given the location of these four sentences in the file, it is likely that each translator will get just one of them to translate without the benefit of having translated the others previously.  So, from the four translators' perspective, they are all "new", non-discountable sentences, payable in full.  From the client's perspective, sentence 2, 3 and 4 should be discounted.

The four translators may be able to continue to give the client a discount without doing extra work, if they can group the similar sentences together before splitting up the file among themselves.  (Or, alternatively, if they can create a file that contains only the similar sentences (or only the first instance of each group of similar sentences), so that one translator can translate them first, before distributing the translations to his colleagues, so that they can re-use those translations or translate the other sentences faster.)  This is what I'm trying to achieve, then: to find similar sentences and group them together, and flag (or mark for deletion) all instances except one.

All CAT tools have unique (and often secret) fancy ways of calculating fuzzy-match percentages, but if I can catch most of the fuzzy matches that the client's CAT tool will catch, I'll be more than happy.  The precise match percentages are not relevant either, although being able to tinker with the threshold is necessary.  Many modern CAT tools improve matching in various ways, e.g. stemming or tokenizing, but such single-percentage-point tweaks do not matter if the match threshold is a coarse 75%.  For this reason, any matching system that would cause the four sentences to be regarded as "similar" to each other (or at least sentences 1 & 2 and sentences 3 & 4) would do just fine.

As a matter of interest, here is how similar two CAT tools think these sentences are to each other:

Wordfast Classic 6
sentence 2 vs. sentence 1 = 82% similar
sentence 3 vs. sentence 2 = 66% similar
sentence 3 vs. sentence 1 = 61% similar
sentence 4 vs. sentence 3 = 78% similar
sentence 4 vs. sentence 2 = 65% similar
sentence 4 vs. sentence 1 = 61% similar

OmegaT 5 (freeware)
sentence 2 vs. sentence 1 = 88% similar
sentence 3 vs. sentence 2 = 64% similar
sentence 3 vs. sentence 1 = 58% similar
sentence 4 vs. sentence 3 = 87% similar
sentence 4 vs. sentence 2 = 64% similar
sentence 4 vs. sentence 1 = 58% similar

Different translators charge different rates, and different translators/clients calculate the discounts differently, but for illustrative purposes (and based on the above calculations) sentence 1 may have cost $2.00, sentence 2 $1.00, sentence 3 $1.80 and sentence 4 $1.00 (giving a total discount of $2.20 on $8.00).  But as I said, for the script's purposes, if the threshold is set to 60% or even 70%, we'll catch enough similars to make it worth the effort.
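
(In case it helps anyone reading along: the Levenshtein functions return a raw edit distance rather than a percentage, so if one wanted a percentage threshold instead, a conversion along the lines of the sketch below would do; it is only an approximation and of course won't match any CAT tool's secret formula.)

; Simple sketch: turn a raw Levenshtein distance into a similarity percentage,
; so the threshold can be entered as e.g. 75 instead of an absolute distance.
; This is only an approximation and matches no CAT tool's own formula.
Func _SimilarityPercent($sA, $sB, $iDistance)
    Local $iMaxLen = StringLen($sA)
    If StringLen($sB) > $iMaxLen Then $iMaxLen = StringLen($sB)
    If $iMaxLen = 0 Then Return 100     ; two empty strings are identical
    Return Round(100 * (1 - $iDistance / $iMaxLen))
EndFunc

; Usage in the main loop (sketch):
; $simsim = Int(_Levenshtein($firstline, $lines[$k]))
; If _SimilarityPercent($firstline, $lines[$k], $simsim) >= $sthreshold Then ...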

The big problem, as you know, is speed.  In real-world texts, I've had 1000-sentence files take 1 hour or 2 hours.  I'm dreaming of 10 minutes.

So, in the final versions of my scripts, I added these options:
- ignore sentences that are not roughly the same length
- option to skip sentences with too few spaces (because more words is better)
- option to skip sentences with too few or too many characters

But the comparison is still one by one, starting from the top of the file, and then skipping lines that have been found to match something else previously.

W.r.t. my previous post: Of the four comparison functions, WideBoyDixon's implementation of Levenshtein is the fastest on my real-world texts of over 500 sentences.  The "alternative" method I mentioned in my first post appeared to work well at first, but it turns out that it only works at all if the similar sentences are already grouped.

Attached is the latest version of my scripts, plus a real-world example (clinical trial consent forms) and another real-world example (MuseScore 3's user manual).

Anyway, thanks for your helpful comments.

Samuel

MuseScore 3's user manual.zip clinical trial consent forms.zip the four scripts, v 3a.zip


It will take some time before I can spare enough time to dig into your welcome samples and hopefully provide useful ideas.

Clearly AutoIt won't be the right tool for a practical result on large input, but it can be used as a prototyping bench to test various approaches.

Having done some work both in comparing texts and in translating books before, I'm sure stemming and removing stop words are mandatory (language-dependent) steps. Then phrase length and word count are good criteria for grouping phrases into several potentially similar subsets.
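
To make the stop-word step concrete, a trivial sketch (the word list is deliberately tiny and obviously incomplete; a real list is language-dependent and much longer):

#include <Array.au3>

; Trivial sketch of a stop-word removal pass (English; the word list is
; deliberately tiny and obviously incomplete).
Func _RemoveStopWords($sText)
    Local $aStop[10] = ["the", "a", "an", "of", "to", "in", "on", "at", "and", "or"]
    Local $aWords = StringSplit(StringLower($sText), " ")
    Local $sOut = ""
    For $i = 1 To $aWords[0]
        If _ArraySearch($aStop, $aWords[$i]) = -1 Then $sOut &= $aWords[$i] & " "
    Next
    Return StringStripWS($sOut, 2)   ; drop the trailing space
EndFunc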

Next, there exist a number of ways to quantify the differences among the reduced number of phrases in each subset.


Please don't feel obliged: I did not upload the samples because I wanted you to look at them.  I have achieved what I wanted to achieve (which was mostly a proof of concept, in the hope that I can show some other people what the possibilities are).  You have already been helpful.

Edited by leuce
