Hello! This is my third article on Habr. Earlier I wrote about the ALM language model; now I want to present ASC, a typo-correction system implemented on top of ALM.
Yes, there are a great many typo-correction systems, and they all have their strengths and weaknesses. Among the open-source ones I would single out JamSpell as one of the most promising, and it is the one we will compare against. There is also a similar system from DeepPavlov that may come to mind, but I never managed to get along with it.
Features:
- Corrects misspelled words that differ from the correct form by up to 4 units of Levenshtein distance (a textbook sketch of this metric follows the list).
- Corrects typos in words: insertion, deletion, substitution and transposition of characters.
- Ё-fication: restores the letter «ё», taking the context into account.
- Capitalizes the first letter of words (proper names and titles), taking the context into account.
- Splits words that were typed together into separate words, taking the context into account.
- Analyzes a text without correcting the original.
- Searches a text for errors, typos and incorrect context.
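For reference, here is a textbook dynamic-programming implementation of the Levenshtein distance that the first feature bounds at 4. This is an illustration only, not ASC's internal code.

def levenshtein(a: str, b: str) -> int:
    # Classic edit distance: counts insertions, deletions and substitutions
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

# A candidate correction is only considered if it is within 4 edits
assert levenshtein("коректор", "корректор") == 1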
Supported operating systems:
- Mac OS X
- FreeBSD
- Linux
The system is written in C++11; there is a port for Python3.
Ready-made dictionaries
Name | Size (GB) | RAM (GB) | N-gram size | Language |
---|---|---|---|---|
wittenbell-3-big.asc | 1.97 | 15.6 | 3 | RU |
wittenbell-3-middle.asc | 1.24 | 9.7 | 3 | RU |
mkneserney-3-middle.asc | 1.33 | 9.7 | 3 | RU |
wittenbell-3-single.asc | 0.772 | 5.14 | 3 | RU |
wittenbell-5-single.asc | 1.37 | 10.7 | 5 | RU |
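To try a ready-made dictionary without any training, it can be loaded directly from the Python3 port. A minimal sketch, assuming the asc module is installed and a dictionary file from the table has been downloaded (the full set of loadIndex/spell options is shown later in this article):

import asc

def status(text, status):
    print(text, status)

# Load a pre-built binary dictionary (the path is an example)
asc.loadIndex("./wittenbell-3-middle.asc", "", status)
# spell() returns the corrected text as the first element of the result
res = asc.spell("сегодня хорошоя погода")
print(res[0])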
Testing
Data from the 2016 Dialog21 «typo correction» competition were used to test the system. The trained binary dictionary wittenbell-3-middle.asc was used for the tests:
Test performed | Precision | Recall | F-measure |
---|---|---|---|
Typo correction mode | 76.97 | 62.71 | 69.11 |
Error correction mode | 73.72 | 60.53 | 66.48 |
I don't think any further data are needed; anyone who wishes can repeat the test, and all the materials used for testing are attached below.
Materials used for testing
- test.txt — the text to be tested
- correct.txt — the text with the correct variants
- evaluate.py — a Python3 script to compute the correction results (a simplified sketch of these metrics follows)
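The metrics in the tables are the standard ones. Below is a deliberately simplified, word-level illustration of how precision, recall and F-measure can be computed for a spellchecker; it is not the actual evaluate.py script, and the real competition scoring is more involved.

def metrics(sources, outputs, references):
    # src: original (possibly misspelled) word, out: corrector output,
    # ref: the correct variant
    tp = fp = fn = 0
    for src, out, ref in zip(sources, outputs, references):
        if src != ref and out == ref:
            tp += 1  # a needed correction, made correctly
        if src == ref and out != ref:
            fp += 1  # a correct word was spoiled
        if src != ref and out != ref:
            fn += 1  # an error was missed or fixed incorrectly
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f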
Now it is interesting to compare how the typo-correction systems themselves perform under equal conditions: we will train two different typo correctors on the same text data and run the test.
For the comparison, let's take the system I mentioned above, JamSpell.
ASC vs JamSpell
Installation
ASC
$ git clone --recursive https://github.com/anyks/asc.git
$ cd ./asc
$ mkdir ./build
$ cd ./build
$ cmake ..
$ make
JamSpell
$ git clone https://github.com/bakwc/JamSpell.git
$ cd ./JamSpell
$ mkdir ./build
$ cd ./build
$ cmake ..
$ make
Training
ASC
train.json
{
"ext": "txt",
"size": 3,
"alter": {"":""},
"debug": 1,
"threads": 0,
"method": "train",
"allow-unk": true,
"reset-unk": true,
"confidence": true,
"interpolate": true,
"mixed-dicts": true,
"only-token-words": true,
"locale": "en_US.UTF-8",
"smoothing": "wittenbell",
"pilots": ["","","","","","","","","","","a","i","o","e","g"],
"corpus": "./texts/correct.txt",
"w-bin": "./dictionary/3-middle.asc",
"w-vocab": "./train/lm.vocab",
"w-arpa": "./train/lm.arpa",
"mix-restwords": "./similars/letters.txt",
"alphabet": "abcdefghijklmnopqrstuvwxyz",
"bin-code": "ru",
"bin-name": "Russian",
"bin-author": "You name",
"bin-copyright": "You company LLC",
"bin-contacts": "site: https://example.com, e-mail: info@example.com",
"bin-lictype": "MIT",
"bin-lictext": "... License text ...",
"embedding-size": 28,
"embedding": {
"": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
"": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
"": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
"": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
"": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
"": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
"-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
"%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
"\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
"5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
"b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
"h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
"n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
"t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}
}
$ ./asc -r-json ./train.json
Python3
import asc
# Set the N-gram size to 3
asc.setSize(3)
# Use the ALM v2 algorithm
asc.setAlmV2()
# Use all available threads
asc.setThreads(0)
# Set the environment locale
asc.setLocale("en_US.UTF-8")
# Auto-capitalize the first letter of sentences
asc.setOption(asc.options_t.uppers)
# Allow the <unk> token in the language model
asc.setOption(asc.options_t.allowUnk)
# Reset the frequency of the <unk> token
asc.setOption(asc.options_t.resetUnk)
# Allow words with mixed alphabets
asc.setOption(asc.options_t.mixDicts)
# Keep only words in the N-grams, skip the other tokens
asc.setOption(asc.options_t.tokenWords)
# Trust the N-gram weights as-is
asc.setOption(asc.options_t.confidence)
# Interpolate the weights during computation
asc.setOption(asc.options_t.interpolate)
# Alphabet (Russian letters plus Latin)
asc.setAlphabet("абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz")
# Single-letter words
asc.setPilots(["а","у","в","о","с","к","б","и","я","э","a","i","o","e","g"])
# Similar-looking letters from different alphabets
asc.setSubstitutes({'p':'р','c':'с','o':'о','t':'т','k':'к','e':'е','a':'а','h':'н','x':'х','b':'в','m':'м'})
def statusArpa1(status):
    print("Build arpa", status)
def statusArpa2(status):
    print("Write arpa", status)
def statusVocab(status):
    print("Write vocab", status)
def statusIndex(text, status):
    print(text, status)
def status(text, status):
    print(text, status)
# Collect the corpus
asc.collectCorpus("./texts/correct.txt", asc.smoothing_t.wittenBell, 0.0, False, False, status)
# Build the ARPA model
asc.buildArpa(statusArpa1)
# Write the ARPA file
asc.writeArpa("./train/lm.arpa", statusArpa2)
# Write the vocabulary
asc.writeVocab("./train/lm.vocab", statusVocab)
# Dictionary metadata
asc.setCode("RU")
asc.setLictype("MIT")
asc.setName("Russian")
asc.setAuthor("Your name")
asc.setCopyright("Your company LLC")
asc.setLictext("... License text ...")
asc.setContacts("site: https://example.com, e-mail: info@example.com")
# Character embedding (characters with a similar sound or shape share an index)
asc.setEmbedding({
"а": 0, "б": 1, "в": 2, "г": 3, "д": 4, "е": 5,
"ё": 5, "ж": 6, "з": 7, "и": 8, "й": 8, "к": 9,
"л": 10, "м": 11, "н": 12, "о": 0, "п": 13, "р": 14,
"с": 15, "т": 16, "у": 17, "ф": 18, "х": 19, "ц": 20,
"ч": 21, "ш": 21, "щ": 21, "ъ": 22, "ы": 23, "ь": 22,
"э": 5, "ю": 24, "я": 25, "<": 26, ">": 26, "~": 26,
"-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
"%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
"\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
"5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
"b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
"h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
"n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
"t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}, 28)
# Save the binary dictionary
asc.saveIndex("./dictionary/3-middle.asc", "", 128, statusIndex)
JamSpell
$ ./main/jamspell train ../test_data/alphabet_ru.txt ../test_data/correct.txt ./model.bin
Testing
ASC
spell.json
{
"debug": 1,
"threads": 0,
"method": "spell",
"spell-verbose": true,
"confidence": true,
"mixed-dicts": true,
"asc-split": true,
"asc-alter": true,
"asc-esplit": true,
"asc-rsplit": true,
"asc-uppers": true,
"asc-hyphen": true,
"asc-wordrep": true,
"r-text": "./texts/test.txt",
"w-text": "./texts/output.txt",
"r-bin": "./dictionary/3-middle.asc"
}
$ ./asc -r-json ./spell.json
Python3
import asc
asc.setAlmV2()
asc.setThreads(0)
asc.setOption(asc.options_t.uppers)
asc.setOption(asc.options_t.ascSplit)
asc.setOption(asc.options_t.ascAlter)
asc.setOption(asc.options_t.ascESplit)
asc.setOption(asc.options_t.ascRSplit)
asc.setOption(asc.options_t.ascUppers)
asc.setOption(asc.options_t.ascHyphen)
asc.setOption(asc.options_t.ascWordRep)
asc.setOption(asc.options_t.mixDicts)
asc.setOption(asc.options_t.confidence)
def status(text, status):
    print(text, status)
asc.loadIndex("./dictionary/3-middle.asc", "", status)
f1 = open('./texts/test.txt')
f2 = open('./texts/output.txt', 'w')
for line in f1.readlines():
    res = asc.spell(line)
    f2.write("%s\n" % res[0])
f2.close()
f1.close()
JamSpell
JamSpell can be used from Python or C++; the C++ API is used below.
#include <fstream>
#include <iostream>
#include <jamspell/spell_corrector.hpp>
// Use Boost for the UTF-8 conversion if requested
#ifdef USE_BOOST_CONVERT
#include <boost/locale/encoding_utf.hpp>
// Otherwise fall back to the standard library
#else
#include <codecvt>
#endif

using namespace std;

/**
 * convert Converts a wide string to a UTF-8 string
 * @param str the wide string to convert
 * @return the resulting UTF-8 string
 */
const string convert(const wstring & str){
	// Resulting string
	string result = "";
	// If the source string is not empty
	if(!str.empty()){
		// Boost variant
		#ifdef USE_BOOST_CONVERT
			using boost::locale::conv::utf_to_utf;
			// Perform the conversion to UTF-8
			result = utf_to_utf <char> (str.c_str(), str.c_str() + str.size());
		// Standard library variant
		#else
			// Declare the UTF-8 converter
			using convert_type = codecvt_utf8 <wchar_t, 0x10ffff, little_endian>;
			wstring_convert <convert_type, wchar_t> conv;
			// wstring_convert <codecvt_utf8 <wchar_t>> conv;
			// Perform the conversion to UTF-8
			result = conv.to_bytes(str);
		#endif
	}
	// Return the result
	return result;
}
/**
 * convert Converts a UTF-8 string to a wide string
 * @param str the UTF-8 string to convert
 * @return the resulting wide string
 */
const wstring convert(const string & str){
	// Resulting wide string
	wstring result = L"";
	// If the source string is not empty
	if(!str.empty()){
		// Boost variant
		#ifdef USE_BOOST_CONVERT
			using boost::locale::conv::utf_to_utf;
			// Perform the conversion from UTF-8
			result = utf_to_utf <wchar_t> (str.c_str(), str.c_str() + str.size());
		// Standard library variant
		#else
			// Declare the UTF-8 converter
			// wstring_convert <codecvt_utf8 <wchar_t>> conv;
			wstring_convert <codecvt_utf8_utf16 <wchar_t, 0x10ffff, little_endian>> conv;
			// Perform the conversion from UTF-8
			result = conv.from_bytes(str);
		#endif
	}
	// Return the result
	return result;
}
/**
 * safeGetline Reads one line from a stream, handling \n, \r\n and \r line endings
 * @param is the input stream
 * @param t  the string the line is stored in
 * @return   the input stream
 */
istream & safeGetline(istream & is, string & t){
	// Clear the target string
	t.clear();
	istream::sentry se(is, true);
	streambuf * sb = is.rdbuf();
	for(;;){
		int c = sb->sbumpc();
		switch(c){
			case '\n': return is;
			case '\r':
				// Swallow the '\n' of a "\r\n" pair
				if(sb->sgetc() == '\n') sb->sbumpc();
				return is;
			case streambuf::traits_type::eof():
				// If nothing was read, mark the stream as exhausted
				if(t.empty()) is.setstate(ios::eofbit);
				return is;
			default: t += (char) c;
		}
	}
}
/**
 * main Entry point
 */
int main(){
	// Create the corrector object
	NJamSpell::TSpellCorrector corrector;
	// Load the trained language model
	corrector.LoadLangModel("model.bin");
	// Open the file with the test text
	ifstream file1("./test_data/test.txt", ios::in);
	if(file1.is_open()){
		// Strings for the source line and the corrected result
		string line = "", res = "";
		// Open the file for the corrected output
		ofstream file2("./test_data/output.txt", ios::out);
		if(file2.is_open()){
			// Read the test file line by line
			while(file1.good()){
				safeGetline(file1, line);
				// If the line is not empty
				if(!line.empty()){
					// Run the correction and convert the result back to UTF-8
					res = convert(corrector.FixFragment(convert(line)));
					// If the result is not empty, write it out
					if(!res.empty()){
						res.append("\n");
						file2.write(res.c_str(), res.size());
					}
				}
			}
			file2.close();
		}
		file1.close();
	}
	return 0;
}
$ g++ -std=c++11 -I../JamSpell -L./build/jamspell -L./build/contrib/cityhash -L./build/contrib/phf -ljamspell_lib -lcityhash -lphf ./test.cpp -o ./bin/test
$ ./bin/test
Results
To get the results, run:
$ python3 evaluate.py ./texts/test.txt ./texts/correct.txt ./texts/output.txt
ASC
Precision | Recall | F-measure |
---|---|---|
92.13 | 82.51 | 87.05 |
JamSpell
Precision | Recall | F-measure |
---|---|---|
77.87 | 63.36 | 69.87 |
One of the main features of ASC is that it learns from dirty data. It is practically impossible to find freely available text corpora without errors and typos, and a lifetime is not enough to fix terabytes of data by hand, yet you still have to work with them.
The training approach I propose (summarized as commands right after this list):
- Assemble a language model from the dirty data
- Remove all rare words and N-grams from the assembled language model
- Add single words so the typo-correction system works more correctly
- Build the binary dictionary
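In terms of the CLI calls detailed in the sections below, the whole pipeline comes down to five commands (each JSON config is described in its own section):

$ ./alm -r-json ./collect.json   # 1. assemble the corpus
$ ./alm -r-json ./prune.json     # 2. prune rare words and N-grams
$ ./alm -r-json ./merge.json     # 3. merge the collected data
$ ./alm -r-json ./train.json     # 4. train the language model
$ ./asc -r-json ./train.json     # 5. build the binary ASC dictionary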
Let's begin
Suppose we have several corpora on different topics; it makes more sense to train them separately and then merge the results.
Assembling the corpus with ALM
collect.json
{
"size": 3,
"debug": 1,
"threads": 0,
"ext": "txt",
"method": "train",
"allow-unk": true,
"mixed-dicts": true,
"only-token-words": true,
"smoothing": "wittenbell",
"locale": "en_US.UTF-8",
"w-abbr": "./output/alm.abbr",
"w-map": "./output/alm.map",
"w-vocab": "./output/alm.vocab",
"w-words": "./output/words.txt",
"corpus": "./texts/corpus",
"abbrs": "./abbrs/abbrs.txt",
"goodwords": "./texts/whitelist/words.txt",
"badwords": "./texts/blacklist/garbage.txt",
"mix-restwords": "./texts/similars/letters.txt",
"alphabet": "abcdefghijklmnopqrstuvwxyz"
}
$ ./alm -r-json ./collect.json
- size — N-gram size (here 3)
- debug — display the data-collection progress
- threads — number of threads to use (0 means all available cores)
- ext — file extension of the texts in the corpus directory
- allow-unk — allow the 〈unk〉 token in the language model
- mixed-dicts — allow words with mixed alphabets (see the illustration after this list)
- only-token-words — keep only words in the N-grams, skipping the other tokens
- smoothing — smoothing algorithm: wittenbell (it performed best in my tests)
- locale — environment locale
- w-abbr — file to write the collected word suffixes (digital abbreviations) to
- w-map — file to write the corpus map to
- w-vocab — file to write the vocabulary to
- w-words — file to write the list of collected words to (optional)
- corpus — directory with the text corpus
- abbrs — file with abbreviations that must be detected as single words (USA, USSR, ...)
- goodwords — whitelist of words
- badwords — blacklist of words
- mix-restwords — file with similar-looking letters from different alphabets
- alphabet — letters of the alphabet used (non-standard letters can be added)
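The mixed-dicts and mix-restwords options deal with words typed with a mix of alphabets, i.e. visually identical Latin and Cyrillic letters. A toy illustration of this normalization, assuming the same substitution map that setSubstitutes receives in the full scripts below:

# Latin letters that look identical to Cyrillic ones
SUBSTITUTES = {'p': 'р', 'c': 'с', 'o': 'о', 't': 'т', 'k': 'к',
               'e': 'е', 'a': 'а', 'h': 'н', 'x': 'х', 'b': 'в', 'm': 'м'}

def normalize(word: str) -> str:
    # Replace lookalike Latin characters with their Cyrillic twins,
    # so a word typed with mixed alphabets maps to one vocabulary entry
    return ''.join(SUBSTITUTES.get(ch, ch) for ch in word.lower())

print(normalize("пoхoд"))  # Latin "o" inside a Cyrillic word -> "поход"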
Python
import alm
# Set the N-gram size to 3
alm.setSize(3)
# Use all available threads
alm.setThreads(0)
# Set the environment locale
alm.setLocale("en_US.UTF-8")
# Alphabet (Russian letters plus Latin)
alm.setAlphabet("абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz")
# Similar-looking letters from different alphabets
alm.setSubstitutes({'p':'р','c':'с','o':'о','t':'т','k':'к','e':'е','a':'а','h':'н','x':'х','b':'в','m':'м'})
# Allow the <unk> token in the language model
alm.setOption(alm.options_t.allowUnk)
# Allow words with mixed alphabets
alm.setOption(alm.options_t.mixDicts)
# Keep only words in the N-grams, skip the other tokens
alm.setOption(alm.options_t.tokenWords)
# Smoothing algorithm: wittenbell
alm.init(alm.smoothing_t.wittenBell)
# Abbreviations that must be detected as single words (USA, USSR, ...)
f = open('./abbrs/abbrs.txt')
for abbr in f.readlines():
    abbr = abbr.replace("\n", "")
    alm.addAbbr(abbr)
f.close()
# Whitelist of words
f = open('./texts/whitelist/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    alm.addGoodword(word)
f.close()
# Blacklist of words
f = open('./texts/blacklist/garbage.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    alm.addBadword(word)
f.close()
def status(text, status):
    print(text, status)
def statusWords(status):
    print("Write words", status)
def statusVocab(status):
    print("Write vocab", status)
def statusMap(status):
    print("Write map", status)
def statusSuffix(status):
    print("Write suffix", status)
# Collect the corpus
alm.collectCorpus("./texts/corpus", status)
# Write the list of collected words
alm.writeWords("./output/words.txt", statusWords)
# Write the vocabulary
alm.writeVocab("./output/alm.vocab", statusVocab)
# Write the corpus map
alm.writeMap("./output/alm.map", statusMap)
# Write the word suffixes (digital abbreviations)
alm.writeSuffix("./output/alm.abbr", statusSuffix)
Pruning the assembled corpus with ALM
prune.json
{
"size": 3,
"debug": 1,
"allow-unk": true,
"method": "vprune",
"vprune-wltf": -15.0,
"locale": "en_US.UTF-8",
"smoothing": "wittenbell",
"r-map": "./corpus1/alm.map",
"r-vocab": "./corpus1/alm.vocab",
"w-map": "./output/alm.map",
"w-vocab": "./output/alm.vocab",
"goodwords": "./texts/whitelist/words.txt",
"badwords": "./texts/blacklist/garbage.txt",
"alphabet": "abcdefghijklmnopqrstuvwxyz"
}
$ ./alm -r-json ./prune.json
- size — N-gram size (here 3)
- debug — display the pruning progress
- allow-unk — allow the 〈unk〉 token in the language model
- vprune-wltf — pruning threshold: words whose log-likelihood (wltf) is below this value are removed
- locale — environment locale
- smoothing — smoothing algorithm: wittenbell
- r-map — file to read the previously collected corpus map from
- r-vocab — file to read the previously collected vocabulary from
- w-map — file to write the pruned corpus map to
- w-vocab — file to write the pruned vocabulary to
- goodwords — whitelist of words
- badwords — blacklist of words
- alphabet — letters of the alphabet used
Python
import alm
# Set the N-gram size to 3
alm.setSize(3)
# Use all available threads
alm.setThreads(0)
# Set the environment locale
alm.setLocale("en_US.UTF-8")
# Alphabet (Russian letters plus Latin)
alm.setAlphabet("абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz")
# Allow the <unk> token in the language model
alm.setOption(alm.options_t.allowUnk)
# Smoothing algorithm: wittenbell
alm.init(alm.smoothing_t.wittenBell)
# Whitelist of words
f = open('./texts/whitelist/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    alm.addGoodword(word)
f.close()
# Blacklist of words
f = open('./texts/blacklist/garbage.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    alm.addBadword(word)
f.close()
def statusPrune(status):
    print("Prune data", status)
def statusReadVocab(text, status):
    print("Read vocab", text, status)
def statusWriteVocab(status):
    print("Write vocab", status)
def statusReadMap(text, status):
    print("Read map", text, status)
def statusWriteMap(status):
    print("Write map", status)
# Read the previously collected vocabulary
alm.readVocab("./corpus1/alm.vocab", statusReadVocab)
# Read the previously collected corpus map
alm.readMap("./corpus1/alm.map", statusReadMap)
# Prune rare words (wltf threshold -15.0)
alm.pruneVocab(-15.0, 0, 0, statusPrune)
# Write the pruned vocabulary
alm.writeVocab("./output/alm.vocab", statusWriteVocab)
# Write the pruned corpus map
alm.writeMap("./output/alm.map", statusWriteMap)
Merging the collected data with ALM
merge.json
{
"size": 3,
"debug": 1,
"allow-unk": true,
"method": "merge",
"mixed-dicts": "true",
"locale": "en_US.UTF-8",
"smoothing": "wittenbell",
"r-words": "./texts/words",
"r-map": "./corpus1",
"r-vocab": "./corpus1",
"w-map": "./output/alm.map",
"w-vocab": "./output/alm.vocab",
"goodwords": "./texts/whitelist/words.txt",
"badwords": "./texts/blacklist/garbage.txt",
"mix-restwords": "./texts/similars/letters.txt",
"alphabet": "abcdefghijklmnopqrstuvwxyz"
}
$ ./alm -r-json ./merge.json
- size — N-gram size (here 3)
- debug — display the merging progress
- allow-unk — allow the 〈unk〉 token in the language model
- mixed-dicts — allow words with mixed alphabets
- locale — environment locale
- smoothing — smoothing algorithm: wittenbell
- r-words — directory with word lists to add to the model
- r-map — directory with the previously collected corpus maps
- r-vocab — directory with the previously collected vocabularies
- w-map — file to write the merged corpus map to
- w-vocab — file to write the merged vocabulary to
- goodwords — whitelist of words
- badwords — blacklist of words
- alphabet — letters of the alphabet used
Python
import alm
# Set the N-gram size to 3
alm.setSize(3)
# Use all available threads
alm.setThreads(0)
# Set the environment locale
alm.setLocale("en_US.UTF-8")
# Alphabet (Russian letters plus Latin)
alm.setAlphabet("абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz")
# Similar-looking letters from different alphabets
alm.setSubstitutes({'p':'р','c':'с','o':'о','t':'т','k':'к','e':'е','a':'а','h':'н','x':'х','b':'в','m':'м'})
# Allow the <unk> token in the language model
alm.setOption(alm.options_t.allowUnk)
# Allow words with mixed alphabets
alm.setOption(alm.options_t.mixDicts)
# Smoothing algorithm: wittenbell
alm.init(alm.smoothing_t.wittenBell)
# Whitelist of words
f = open('./texts/whitelist/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    alm.addGoodword(word)
f.close()
# Blacklist of words
f = open('./texts/blacklist/garbage.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    alm.addBadword(word)
f.close()
# Additional words to add to the model
f = open('./texts/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    alm.addWord(word)
f.close()
def statusReadVocab(text, status):
    print("Read vocab", text, status)
def statusWriteVocab(status):
    print("Write vocab", status)
def statusReadMap(text, status):
    print("Read map", text, status)
def statusWriteMap(status):
    print("Write map", status)
# Read all previously collected vocabularies from the directory
alm.readVocab("./corpus1", statusReadVocab)
# Read all previously collected corpus maps from the directory
alm.readMap("./corpus1", statusReadMap)
# Write the merged vocabulary
alm.writeVocab("./output/alm.vocab", statusWriteVocab)
# Write the merged corpus map
alm.writeMap("./output/alm.map", statusWriteMap)
Training the language model with ALM
train.json
{
"size": 3,
"debug": 1,
"allow-unk": true,
"reset-unk": true,
"interpolate": true,
"method": "train",
"locale": "en_US.UTF-8",
"smoothing": "wittenbell",
"r-map": "./output/alm.map",
"r-vocab": "./output/alm.vocab",
"w-arpa": "./output/alm.arpa",
"w-words": "./output/words.txt",
"alphabet": "abcdefghijklmnopqrstuvwxyz"
}
$ ./alm -r-json ./train.json
- size — N-gram size (here 3)
- debug — display the training progress
- allow-unk — allow the 〈unk〉 token in the language model
- reset-unk — reset the frequency of the 〈unk〉 token
- interpolate — interpolate the weights during computation
- locale — environment locale
- smoothing — smoothing algorithm: wittenbell
- r-map — file to read the previously collected corpus map from
- r-vocab — file to read the previously collected vocabulary from
- w-arpa — file to write the resulting ARPA model to
- w-words — file to write the list of words to (optional)
- alphabet — letters of the alphabet used
Python
import alm
# Set the N-gram size to 3
alm.setSize(3)
# Use all available threads
alm.setThreads(0)
# Set the environment locale
alm.setLocale("en_US.UTF-8")
# Alphabet (Russian letters plus Latin)
alm.setAlphabet("абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz")
# Similar-looking letters from different alphabets
alm.setSubstitutes({'p':'р','c':'с','o':'о','t':'т','k':'к','e':'е','a':'а','h':'н','x':'х','b':'в','m':'м'})
# Allow the <unk> token in the language model
alm.setOption(alm.options_t.allowUnk)
# Reset the frequency of the <unk> token
alm.setOption(alm.options_t.resetUnk)
# Allow words with mixed alphabets
alm.setOption(alm.options_t.mixDicts)
# Interpolate the weights during computation
alm.setOption(alm.options_t.interpolate)
# Smoothing algorithm: wittenbell
alm.init(alm.smoothing_t.wittenBell)
def statusReadVocab(text, status):
    print("Read vocab", text, status)
def statusReadMap(text, status):
    print("Read map", text, status)
def statusBuildArpa(status):
    print("Build ARPA", status)
def statusWriteMap(status):
    print("Write map", status)
def statusWriteArpa(status):
    print("Write ARPA", status)
def statusWords(status):
    print("Write words", status)
# Read the previously collected vocabulary
alm.readVocab("./output/alm.vocab", statusReadVocab)
# Read the previously collected corpus map
alm.readMap("./output/alm.map", statusReadMap)
# Build the ARPA model
alm.buildArpa(statusBuildArpa)
# Write the ARPA file
alm.writeArpa("./output/alm.arpa", statusWriteArpa)
# Write the list of words
alm.writeWords("./output/words.txt", statusWords)
Training the ASC spellchecker
train.json
{
"size": 3,
"debug": 1,
"threads": 0,
"confidence": true,
"mixed-dicts": true,
"method": "train",
"alter": {"":""},
"locale": "en_US.UTF-8",
"smoothing": "wittenbell",
"pilots": ["","","","","","","","","","","a","i","o","e","g"],
"w-bin": "./dictionary/3-single.asc",
"r-abbr": "./output/alm.abbr",
"r-vocab": "./output/alm.vocab",
"r-arpa": "./output/alm.arpa",
"abbrs": "./texts/abbrs/abbrs.txt",
"goodwords": "./texts/whitelist/words.txt",
"badwords": "./texts/blacklist/garbage.txt",
"alters": "./texts/alters/yoficator.txt",
"upwords": "./texts/words/upp",
"mix-restwords": "./texts/similars/letters.txt",
"alphabet": "abcdefghijklmnopqrstuvwxyz",
"bin-code": "ru",
"bin-name": "Russian",
"bin-author": "You name",
"bin-copyright": "You company LLC",
"bin-contacts": "site: https://example.com, e-mail: info@example.com",
"bin-lictype": "MIT",
"bin-lictext": "... License text ...",
"embedding-size": 28,
"embedding": {
"": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
"": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
"": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
"": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
"": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
"": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
"-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
"%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
"\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
"5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
"b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
"h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
"n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
"t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}
}
$ ./asc -r-json ./train.json
- size — N-gram size (here 3)
- debug — display the training progress
- threads — number of threads to use (0 means all available cores)
- confidence — load the ARPA file as-is, trusting its N-gram weights without re-checking
- mixed-dicts — allow words with mixed alphabets
- alter — alternative letters (letters written in place of others, here «е» instead of «ё»)
- locale — environment locale
- smoothing — smoothing algorithm: wittenbell
- pilots — single-letter words
- w-bin — file to write the binary dictionary to
- r-abbr — file with the previously collected word suffixes (digital abbreviations)
- r-vocab — file with the previously collected vocabulary
- r-arpa — previously built ARPA file to read
- abbrs — file with abbreviations that must be detected as single words (USA, USSR, ...)
- goodwords — whitelist of words
- badwords — blacklist of words
- alters — file with words containing alternative letters (the yoficator dictionary)
- upwords — directory with words that must always be capitalized (names, cities, ...)
- mix-restwords — file with similar-looking letters from different alphabets
- alphabet — letters of the alphabet used
- bin-code — language code of the dictionary
- bin-name — name of the dictionary
- bin-author — author of the dictionary
- bin-copyright — copyright holder of the dictionary
- bin-contacts — author's contact details
- bin-lictype — license type of the dictionary
- bin-lictext — license text of the dictionary
- embedding-size — size of the character embedding
- embedding — character embedding map (characters with a similar sound or shape may share an index; see the illustration after this list)
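To see what the embedding map does: characters with a similar sound or shape share an index, so words that differ only in easily confused characters collapse to the same code sequence. A toy illustration using a few entries from the map above (not ASC's internal lookup):

# "а"/"о" (and Latin "a"/"o") share index 0; "е"/"ё"/"э" share index 5
EMBEDDING = {"а": 0, "о": 0, "a": 0, "o": 0, "е": 5, "ё": 5, "э": 5,
             "п": 13, "р": 14, "д": 4}

def encode(word: str):
    # Confusable characters produce identical index sequences
    return [EMBEDDING.get(ch, -1) for ch in word.lower()]

print(encode("пара"))  # [13, 0, 14, 0]
print(encode("пора"))  # [13, 0, 14, 0], the same sequence as "пара"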
Python
import asc
# Set the N-gram size to 3
asc.setSize(3)
# Use all available threads
asc.setThreads(0)
# Set the environment locale
asc.setLocale("en_US.UTF-8")
# Auto-capitalize the first letter of sentences
asc.setOption(asc.options_t.uppers)
# Allow the <unk> token in the language model
asc.setOption(asc.options_t.allowUnk)
# Reset the frequency of the <unk> token
asc.setOption(asc.options_t.resetUnk)
# Allow words with mixed alphabets
asc.setOption(asc.options_t.mixDicts)
# Load the ARPA data as-is, trusting its N-gram weights
asc.setOption(asc.options_t.confidence)
# Alphabet (Russian letters plus Latin)
asc.setAlphabet("абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz")
# Single-letter words
asc.setPilots(["а","у","в","о","с","к","б","и","я","э","a","i","o","e","g"])
# Similar-looking letters from different alphabets
asc.setSubstitutes({'p':'р','c':'с','o':'о','t':'т','k':'к','e':'е','a':'а','h':'н','x':'х','b':'в','m':'м'})
# Whitelist of words
f = open('./texts/whitelist/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addGoodword(word)
f.close()
# Blacklist of words
f = open('./texts/blacklist/garbage.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addBadword(word)
f.close()
# Previously collected word suffixes (digital abbreviations)
f = open('./output/alm.abbr')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addSuffix(word)
f.close()
# Abbreviations that must be detected as single words (USA, USSR, ...)
f = open('./texts/abbrs/abbrs.txt')
for abbr in f.readlines():
    abbr = abbr.replace("\n", "")
    asc.addAbbr(abbr)
f.close()
# Words that must always be capitalized (names, cities, ...)
f = open('./texts/words/upp/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addUWord(word)
f.close()
# Alternative letter: «е» may stand for «ё»
asc.addAlt("е", "ё")
# Words containing alternative letters (the yoficator dictionary)
f = open('./texts/alters/yoficator.txt')
for words in f.readlines():
    words = words.replace("\n", "")
    words = words.split('\t')
    asc.addAlt(words[0], words[1])
f.close()
def statusIndex(text, status):
    print(text, status)
def statusBuildIndex(status):
    print("Build index", status)
def statusArpa(status):
    print("Read arpa", status)
def statusVocab(status):
    print("Read vocab", status)
# Read the ARPA model
asc.readArpa("./output/alm.arpa", statusArpa)
# Read the vocabulary
asc.readVocab("./output/alm.vocab", statusVocab)
# Dictionary metadata
asc.setCode("RU")
asc.setLictype("MIT")
asc.setName("Russian")
asc.setAuthor("Your name")
asc.setCopyright("Your company LLC")
asc.setLictext("... License text ...")
asc.setContacts("site: https://example.com, e-mail: info@example.com")
# Character embedding (characters with a similar sound or shape share an index)
asc.setEmbedding({
"а": 0, "б": 1, "в": 2, "г": 3, "д": 4, "е": 5,
"ё": 5, "ж": 6, "з": 7, "и": 8, "й": 8, "к": 9,
"л": 10, "м": 11, "н": 12, "о": 0, "п": 13, "р": 14,
"с": 15, "т": 16, "у": 17, "ф": 18, "х": 19, "ц": 20,
"ч": 21, "ш": 21, "щ": 21, "ъ": 22, "ы": 23, "ь": 22,
"э": 5, "ю": 24, "я": 25, "<": 26, ">": 26, "~": 26,
"-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
"%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
"\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
"5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
"b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
"h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
"n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
"t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}, 28)
# Build the index
asc.buildIndex(statusBuildIndex)
# Save the binary dictionary
asc.saveIndex("./dictionary/3-middle.asc", "", 128, statusIndex)
I understand that not everyone will be able to train their own binary dictionary: it requires text corpora and considerable computing resources. That is why ASC can also work with a plain ARPA file as its main dictionary.
Usage example
spell.json
{
"ad": 13,
"cw": 38120,
"debug": 1,
"threads": 0,
"method": "spell",
"alter": {"":""},
"asc-split": true,
"asc-alter": true,
"confidence": true,
"asc-esplit": true,
"asc-rsplit": true,
"asc-uppers": true,
"asc-hyphen": true,
"mixed-dicts": true,
"asc-wordrep": true,
"spell-verbose": true,
"r-text": "./texts/test.txt",
"w-text": "./texts/output.txt",
"upwords": "./texts/words/upp",
"r-arpa": "./dictionary/alm.arpa",
"r-abbr": "./dictionary/alm.abbr",
"abbrs": "./texts/abbrs/abbrs.txt",
"alters": "./texts/alters/yoficator.txt",
"mix-restwords": "./similars/letters.txt",
"goodwords": "./texts/whitelist/words.txt",
"badwords": "./texts/blacklist/garbage.txt",
"pilots": ["","","","","","","","","","","a","i","o","e","g"],
"alphabet": "abcdefghijklmnopqrstuvwxyz",
"embedding-size": 28,
"embedding": {
"": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
"": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
"": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
"": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
"": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
"": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
"-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
"%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
"\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
"5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
"b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
"h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
"n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
"t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}
}
$ ./asc -r-json ./spell.json
Python
import asc
# Use all available threads
asc.setThreads(0)
# Auto-capitalize the first letter of sentences
asc.setOption(asc.options_t.uppers)
# Allow splitting of merged words
asc.setOption(asc.options_t.ascSplit)
# Allow replacement of alternative letters
asc.setOption(asc.options_t.ascAlter)
# Allow splitting of misspelled merged words
asc.setOption(asc.options_t.ascESplit)
# Allow combining words incorrectly separated by a space
asc.setOption(asc.options_t.ascRSplit)
# Allow correction of the case of words
asc.setOption(asc.options_t.ascUppers)
# Allow splitting of hyphenated words
asc.setOption(asc.options_t.ascHyphen)
# Allow removal of duplicated words
asc.setOption(asc.options_t.ascWordRep)
# Allow words with mixed alphabets
asc.setOption(asc.options_t.mixDicts)
# Load the ARPA data as-is, trusting its N-gram weights
asc.setOption(asc.options_t.confidence)
# Alphabet (Russian letters plus Latin)
asc.setAlphabet("абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz")
# Single-letter words
asc.setPilots(["а","у","в","о","с","к","б","и","я","э","a","i","o","e","g"])
# Similar-looking letters from different alphabets
asc.setSubstitutes({'p':'р','c':'с','o':'о','t':'т','k':'к','e':'е','a':'а','h':'н','x':'х','b':'в','m':'м'})
# Whitelist of words
f = open('./texts/whitelist/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addGoodword(word)
f.close()
# Blacklist of words
f = open('./texts/blacklist/garbage.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addBadword(word)
f.close()
# Previously collected word suffixes (digital abbreviations)
f = open('./output/alm.abbr')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addSuffix(word)
f.close()
# Abbreviations that must be detected as single words (USA, USSR, ...)
f = open('./texts/abbrs/abbrs.txt')
for abbr in f.readlines():
    abbr = abbr.replace("\n", "")
    asc.addAbbr(abbr)
f.close()
# Words that must always be capitalized (names, cities, ...)
f = open('./texts/words/upp/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addUWord(word)
f.close()
# Alternative letter: «е» may stand for «ё»
asc.addAlt("е", "ё")
# Words containing alternative letters (the yoficator dictionary)
f = open('./texts/alters/yoficator.txt')
for words in f.readlines():
    words = words.replace("\n", "")
    words = words.split('\t')
    asc.addAlt(words[0], words[1])
f.close()
def statusArpa(status):
    print("Read arpa", status)
def statusIndex(status):
    print("Build index", status)
# Read the ARPA model
asc.readArpa("./dictionary/alm.arpa", statusArpa)
# Set the dictionary characteristics (cw: 38120 words, ad: 13 documents)
asc.setAdCw(38120, 13)
# Character embedding (characters with a similar sound or shape share an index)
asc.setEmbedding({
"а": 0, "б": 1, "в": 2, "г": 3, "д": 4, "е": 5,
"ё": 5, "ж": 6, "з": 7, "и": 8, "й": 8, "к": 9,
"л": 10, "м": 11, "н": 12, "о": 0, "п": 13, "р": 14,
"с": 15, "т": 16, "у": 17, "ф": 18, "х": 19, "ц": 20,
"ч": 21, "ш": 21, "щ": 21, "ъ": 22, "ы": 23, "ь": 22,
"э": 5, "ю": 24, "я": 25, "<": 26, ">": 26, "~": 26,
"-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
"%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
"\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
"5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
"b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
"h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
"n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
"t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}, 28)
# Build the index
asc.buildIndex(statusIndex)
# Correct the test file line by line
f1 = open('./texts/test.txt')
f2 = open('./texts/output.txt', 'w')
for line in f1.readlines():
    res = asc.spell(line)
    f2.write("%s\n" % res[0])
f2.close()
f1.close()
P.S. For those who don't want to build or train anything at all, I have put up a web version of ASC. Keep in mind that a typo-correction system is not omniscient: the whole Russian language cannot be fed into it. ASC will not correct just any text; it has to be trained separately for each topic.