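SOUNDEX returns a four-character code (the SOUNDEX code) used to evaluate how similar two strings sound. The first character of the code is the first letter of the input string, and the second through fourth characters are digits.
The soundex code is as follows: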
def soundex(name, len=4):
    """ soundex module conforming to Knuth's algorithm
        implementation 2000-12-24 by Gregory Jorgensen
        public domain
    """
    # digits holds the soundex values for the alphabet (A..Z)
    digits = '01230120022455012623010202'
    sndx = ''
    fc = ''
    # translate alpha chars in name to soundex digits
    for c in name.upper():
        # only ASCII letters have an entry in digits; other characters are skipped
        if 'A' <= c <= 'Z':
            if not fc:
                fc = c  # remember first letter
            d = digits[ord(c) - ord('A')]
            # duplicate consecutive soundex digits are skipped
            if not sndx or (d != sndx[-1]):
                sndx += d
    # replace first digit with first alpha character
    sndx = fc + sndx[1:]
    # remove all 0s from the soundex code
    sndx = sndx.replace('0', '')
    # return soundex code padded to len characters
    return (sndx + (len * '0'))[:len]
Note that the code is designed to handle English names.
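As a rough illustration of how the code can be used to compare two strings, the sketch below restates the routine in compact Python 3 form so that it runs on its own, then checks a few English names. The helper `same_sound` and the sample names are assumptions added for this example and are not part of the original code.

```python
def soundex(name, len=4):
    """Compact Python 3 restatement of the soundex routine (illustrative sketch)."""
    digits = '01230120022455012623010202'  # soundex digit for each letter A..Z
    sndx = ''
    fc = ''
    for c in name.upper():
        if 'A' <= c <= 'Z':  # ASCII letters only; everything else is ignored
            if not fc:
                fc = c  # remember first letter
            d = digits[ord(c) - ord('A')]
            if not sndx or d != sndx[-1]:  # skip consecutive duplicate digits
                sndx += d
    # swap the first digit for the first letter, then drop the zeros
    sndx = (fc + sndx[1:]).replace('0', '')
    # pad with zeros and cut to the requested length
    return (sndx + len * '0')[:len]


def same_sound(a, b):
    """Hypothetical helper: True when both names map to the same soundex code."""
    return soundex(a) == soundex(b)


pairs = [('Robert', 'Rupert'), ('Smith', 'Smyth'), ('Robert', 'Smith')]
for first, second in pairs:
    print(first, soundex(first), second, soundex(second), same_sound(first, second))
```

With this sketch, 'Robert' and 'Rupert' both map to R163 and 'Smith' and 'Smyth' both map to S530, so each pair is reported as similar, while 'Robert' and 'Smith' are not.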