MenzAgitat Op

Joined: 04 Jul 2006 Posts: 118 Location: France
|
Posted: Sat Jul 25, 2009 5:22 pm Post subject: [script/library] Levenshtein's distance v1.0 |
|
|
This script provides package Levenshtein : | Code: | | Package provide Levenshtein 1.0 |
Description:
In information theory and computer science, the Levenshtein distance is a metric for measuring the amount of difference between two sequences (i.e., the so called edit distance). The Levenshtein distance between two strings is given by the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character. A generalization of the Levenshtein distance (Damerau–Levenshtein distance) allows the transposition of two characters as an operation.
( see http://en.wikipedia.org/wiki/Levenshtein_distance )
Interest:
- Allows an orthographical corrector to suggest alternate words with a low Levenshtein distance.
- Allows pseudo-AI to have an orthographical tolerance.
- ...
Syntax:
levenshtein::distance <string 1> <string 2>
you can also use a public command if you want to test things :
!test_levenshtein <string 1> <string 2>
Examples (in french, sorry):
| Code: | levenshtein::distance "bonjour" "bougeoir"
-> 4 | you must manipulate 4 characters to transform the word "bonjour" into the word "bougeoir" :
- BONJOUR
- BOUJOUR -> we replace N by U
- BOUGOUR -> we replace J by G
- BOUGEOUR -> we insert E
- BOUGEOIR -> we replace U by I
| Code: | levenshtein::distance "antiquaire" "antikaire"
-> 2 | We can conclude from this 10 letters long example that it is very similar to the second word, with a distance of only 2.
You must keep in mind that a distance of 2 between two words of 10 letters means they are very similar, while a distance of 2 between two words of 3 letters means they are very different as you can see in the following example : | Code: | levenshtein::distance "pin" "pas"
-> 2 | As you can see, a distance of 2 doesnt mean much difference in a 10 letters word but represents important modifications in a 3 letters one.
In order to preserve relevance of results, you'll take care to always link the tolerance to the length, proportionately.
| Code: | levenshtein::distance "antiquaire" "dimanche"
-> 8 | In this last example, we can see that the distance between the first word and the second is 8. They are very different words.
Download:
Levenshtein's distance v1.0 |
|