Analyzing a paper on how to extract meaning from embeddings

tl;dr: A simplified walkthrough of the paper, in which the author proposes two interesting theorems and, building on them, a way to extract hidden sense vectors from the embedding matrix. A guide to reproducing the results is included. The notebook is available on github.



Introduction



In this article I want to talk about an impressive thing that the researcher Sanjeev Arora found in the linear algebraic structure of word senses, with applications to polysemy. It is part of a series of papers in which he tries to provide a theoretical foundation for the properties of word embeddings. In this particular work, Arora hypothesizes that simple embeddings such as word2vec or GloVe actually blend several senses of a word into a single vector, and he suggests a way to reconstruct them. I will try to stick to the original examples throughout the article.



More formally, let $v_{tie}$ denote the embedding vector of the word tie, which can mean a knot or a necktie, or can be the verb "to tie". Arora suggests that this vector can be written as the following linear combination:



$$v_{tie} \approx \alpha_1 v_{tie_1} + \alpha_2 v_{tie_2} + \alpha_3 v_{tie_3} + \dots$$



where $v_{tie_n}$ is one of the possible senses of tie and $\alpha_n$ is a coefficient. Let's try to understand where this comes from.



Theory



Disclaimer

Written by a non-mathematician; please report any mistakes, especially in the mathematical terminology.



A short note on Arora's earlier theory



Since Arora's earlier work is considerably more involved than this paper, I have not yet prepared a full review of it. Still, let's briefly see what it is about.



So, Arora starts from the idea that any text is produced by a generative model. At each time step $t$ the model emits one word $w$. The model consists of a context vector $c_t$ and of embedding vectors $v_w$, one per vocabulary word, all of the same dimension. The context vector represents what is currently being talked about, and it is the only thing that changes over time.



The context vector performs a slow random walk: at each step it receives a small random displacement, so consecutive words are emitted from nearly identical contexts. Intuitively, while the text talks about one topic the context stays in one region of the space, and as the topic drifts, the context vector drifts with it.



Of course, this model is a deliberate simplification: it knows nothing about syntax or word order. Nevertheless, it turns out to be enough to explain several empirical properties of word embeddings.

The key modelling assumption is log-linear word production: the probability that the word $w$ is emitted at step $t$ given the context $c_t$ is



$$P(w \mid c_t) = \frac{1}{Z_{c_t}} \exp\langle c_t, v_w \rangle$$



where $c_t$ is the context vector at step $t$, $v_w$ is the embedding of the word $w$, and $Z_c = \sum_{w} \exp\langle c, v_w \rangle$ is the partition function. A key technical fact proved in the paper is that, under the model's assumptions, $Z_c$ is almost the same for every context $c$, so the normalization can be treated as a constant.
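To make the production model concrete, here is a toy numpy sketch on random vectors (the sizes and the vectors themselves are made up; only the formula is from the paper):

import numpy as np

# Toy log-linear production model: P(w | c) = exp(<c, v_w>) / Z_c
np.random.seed(0)
d, vocab_size = 300, 1000
V = np.random.randn(vocab_size, d) / np.sqrt(d)  # embedding matrix, one row per word
c_t = np.random.randn(d) / np.sqrt(d)            # current context vector

logits = V @ c_t                  # <c_t, v_w> for every word w
Z_c = np.exp(logits).sum()        # partition function for this context
p = np.exp(logits) / Z_c          # P(w | c_t) over the whole vocabulary
print(p.argmax(), p.max())        # the word this context makes most likely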



The second assumption concerns the geometry of the embeddings: the word vectors are assumed to be distributed isotropically, that is, spread roughly uniformly in all directions of the space, with no preferred subspace.



From these two ingredients, the random walk and the log-linear production, the authors derive closed-form expressions for word co-occurrence probabilities, which in turn explain the linear structure that word2vec- and GloVe-style embeddings are famous for.



Now, where does polysemy enter the picture? Under this model, a polysemous word is simply a word that gets emitted by several very different contexts. Take "tie" again:



, ", , , ". , , , ", , , " , " " .





One more observation we will need: since the context vector moves slowly, all the words inside a small window (a few words on either side, say) are generated by approximately the same context. This is the fact the first theorem builds on.



Theorem 1



Suppose a window $s$ contains $n$ words. Then there exists a linear transformation $A$ such that



$$v_w \approx A\,\mathbb{E}\left[\frac{1}{n}\sum_{w_i \in s} v_{w_i} \,\middle|\, w \in s\right]$$



In words: take all windows $S$ that contain the word $w$; embed each window $s \in S$ by averaging the vectors of its words, giving $v_s$; then average these window embeddings into a single vector $u$. The theorem says that multiplying $u$ by a fixed matrix $A$ (the same one for every word in the vocabulary) approximately recovers the original embedding $v_w$. Besides being surprising in itself, this gives a practical recipe for embedding out-of-vocabulary words: collect the contexts in which the new word occurs, average them, and map the result through $A$.
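As an illustration, here is a hypothetical sketch of that recipe; the helpers `word2idx` and `embedds` and the fitted matrix `A` are assumptions, not code from the paper (the fitting of A is sketched a bit further below):

import numpy as np

# Induce a vector for an out-of-vocabulary word from the windows it occurs in.
def induce_embedding(windows, word2idx, embedds, A):
    window_means = []
    for window in windows:
        idxs = [word2idx[w] for w in window if w in word2idx]
        if idxs:
            window_means.append(embedds[idxs].mean(axis=0))
    u = np.mean(window_means, axis=0)  # average of the window averages
    return A @ u                       # Theorem 1: v_w ≈ A u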



One practical refinement: plain averaging treats all words equally, although frequent words say little about their neighbours. The original papers therefore weight the average with the SIF (smooth inverse frequency) scheme, which downweights frequent words. Here, following the same idea, the SIF embedding $v_{SIF}$ of a window of $k$ words $w_n$ is approximated by a TF-IDF weighted average:



$$v_{SIF} = \frac{1}{k}\sum_{n=1}^{k} v_{n} \cdot \mathrm{tf\_idf}(w_n)$$



Note that, in terms of Theorem 1, such a window embedding serves as an estimate of the context vector $c$ that generated the window, which is exactly why a single linear map can carry it back to the word embedding.
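A minimal sketch of this weighting, with a hypothetical precomputed tf_idf lookup (for instance built from the corpus with scikit-learn's TfidfVectorizer); the helper names are assumptions, not code from the paper:

import numpy as np

# Weighted window embedding: (1/k) * sum_n v_n * tf_idf(w_n)
def window_embedding(window, word2idx, embedds, tf_idf):
    vecs = [embedds[word2idx[w]] * tf_idf.get(w, 1.0)
            for w in window if w in word2idx]
    return np.mean(vecs, axis=0)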



Now the experiment plan. For every word $w$ we already know its true embedding $v_w$, and from the corpus we can build its induced vector; what remains is to fit the matrix $A$. The procedure (a fitting sketch follows the list):



  1. Take a text corpus and a vocabulary $V$.
  2. For each word $w \in V$, compute SIF embeddings of windows of 20 words around the occurrences of $w$ in the corpus. This yields, for each $w \in V$, a set of window vectors $(\nu_{w1}, \nu_{w2}, \dots, \nu_{wn})$, where $n$ is the number of windows containing $w$.
  3. Average them into the induced vector $u_w = \frac{1}{n}\sum_{t=1}^{n} \nu_{wt}$.
  4. Fit the matrix by solving $\arg\min_A \sum_w \|A u_w - v_w\|_2^2$.
  5. The induced SIF embedding is then $\hat{v}_w = A u_w$.
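Step 4 is an ordinary multivariate least-squares problem. A sketch on placeholder data (in a real run, U and V would stack the induced vectors $u_w$ and the true embeddings $v_w$ row by row):

import numpy as np

# Solve argmin_A sum_w ||A u_w - v_w||^2 in the row-vector convention:
# stack the u_w into U and the v_w into V, then find A with U A ≈ V.
U = np.random.randn(5000, 300)             # placeholder induced vectors
V = np.random.randn(5000, 300)             # placeholder true embeddings
A, *_ = np.linalg.lstsq(U, V, rcond=None)  # A has shape (300, 300)
induced = U @ A                            # step 5: each row is an induced v_w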


The authors check this on Wikipedia text: the matrix A is fitted on one third of the corpus and evaluated on the remaining two thirds by measuring the cosine similarity between induced and true embeddings. The agreement is high and grows with the amount of text:



#paragraphs        250k   500k   750k   1 million
cosine similarity  0.94   0.95   0.96   0.96


Theorem 2



Suppose the word $w$ has two distinct senses $s_1$ and $s_2$. Imagine that we relabel every occurrence of $w$ in the corpus according to its sense, obtaining pseudo-words like tie_1 and tie_2 (say, tie_1 for the garment and tie_2 for the drawn match), and train embeddings for these pseudo-words separately, obtaining $v_{s_1}$ and $v_{s_2}$. The theorem states that the embedding of the original, undisambiguated word is then close to their linear combination:



$$v_w = \frac{f_1}{f_1+f_2}\, v_{s_1} + \frac{f_2}{f_1+f_2}\, v_{s_2} = \alpha\, v_{s_1} + \beta\, v_{s_2}$$



where $f_1$ and $f_2$ are the frequencies with which the senses $s_1$ and $s_2$ occur in the corpus. The equality holds approximately, up to a small error term.
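A toy check of this statement on made-up sense vectors and frequencies (nothing here comes from real data):

import numpy as np

# Mix two random unit "sense" vectors with frequency weights and see
# which sense the resulting single vector leans towards.
np.random.seed(1)
v_s1 = np.random.randn(300); v_s1 /= np.linalg.norm(v_s1)  # e.g. the garment sense
v_s2 = np.random.randn(300); v_s2 /= np.linalg.norm(v_s2)  # e.g. the drawn-match sense

f1, f2 = 800, 200                      # hypothetical corpus frequencies
alpha, beta = f1 / (f1 + f2), f2 / (f1 + f2)
v_w = alpha * v_s1 + beta * v_s2       # the single vector the word ends up with

cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos(v_w, v_s1), cos(v_w, v_s2))  # the frequent sense dominates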



Can this decomposition be run backwards, recovering the sense vectors from $v_w$ alone? At first glance, no: the coefficients $\alpha$ are unknown, and any vector admits infinitely many decompositions. Worse, when one sense is much rarer than the other, its term in the sum is tiny, and one would expect it to drown in noise. The surprising experimental finding is that it does not: even with a substantial imbalance between the sense frequencies, a vector with a large inner product with the true $v_{tie_1}$ can still be extracted from the mixture! The tool that makes this possible is sparse coding.



Here is the idea. The embeddings live in $\mathbb{R}^d$. We posit a dictionary of vectors $A_1, A_2, \dots, A_m$ such that every word embedding can be written as a sparse linear combination of them:



$$v_w = \sum_{j=1}^{m} \alpha_{w,j} A_j + \mu_w$$



where at most $k$ of the coefficients $\alpha_{w,j}$ are nonzero and $\mu_w$ is a noise vector. The dictionary and the coefficients are found by minimizing



$$\sum_w \left\| v_w - \sum_{j=1}^{m} \alpha_{w,j} A_j \right\|_2^2$$



subject to at most $k$ nonzero coefficients per word ($k$ is the sparsity parameter); $m$ is the size of the dictionary. The problem is computationally hard, and in practice it is solved approximately, for example with the k-SVD algorithm. The authors call the recovered vectors $A_j$ atoms of discourse: each atom captures one coherent theme, and the handful of atoms with nonzero coefficients for a given word $v_w$ turn out to be exactly its senses. In the experiments the dictionary size $m$ is about 2000, and $k$ is small, around 5.
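Before turning to the real data, here is a minimal sketch of this machinery on random vectors, using the same ksvd package as in the experiments below:

import numpy as np
from ksvd import ApproximateKSVD

# Learn an overcomplete dictionary and check that each vector is rebuilt
# from its few (here 5) active atoms.
X = np.random.randn(1000, 50)                    # stand-in "embeddings"
aksvd = ApproximateKSVD(n_components=200, transform_n_nonzero_coefs=5)
dictionary = aksvd.fit(X).components_            # (200, 50): the atoms A_j
gamma = aksvd.transform(X)                       # (1000, 200): sparse alphas
reconstruction = gamma @ dictionary              # v_w ≈ sum_j alpha_w,j * A_j
print(np.linalg.norm(X - reconstruction) / np.linalg.norm(X))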





Now, armed with the theory, let's try to reproduce the results.



import numpy as np

from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
from scipy.spatial.distance import cosine
import warnings
warnings.filterwarnings('ignore')


1. Loading the embeddings with Gensim

We will work with GloVe.

Namely, with the 300-dimensional glove.6B vectors.



tmp_file = get_tmpfile("test_word2vec.txt")
_ = glove2word2vec("/home/astromis/Embeddings/glove.6B.300d.txt", tmp_file)
model = KeyedVectors.load_word2vec_format(tmp_file)


embeddings = model.wv

index2word = embeddings.index2word
embedds = embeddings.vectors


print(embedds.shape)


(400000, 300)


There are 400,000 words in the vocabulary, each with a 300-dimensional vector.



2. Running k-SVD

To learn the dictionary we need an implementation of the k-SVD algorithm; I use the ksvd package.



!pip install ksvd
from ksvd import ApproximateKSVD


Requirement already satisfied: ksvd in /home/astromis/anaconda3/lib/python3.6/site-packages (0.0.3)
Requirement already satisfied: numpy in /home/astromis/anaconda3/lib/python3.6/site-packages (from ksvd) (1.14.5)
Requirement already satisfied: scikit-learn in /home/astromis/anaconda3/lib/python3.6/site-packages (from ksvd) (0.19.1)


As in the paper, I take 2000 atoms and a sparsity of 5.

A caveat: running k-SVD over the full matrix takes a long time, so precomputed results are loaded below; the dictionary used here was computed from the first 10,000 vectors (the commented-out filename is the full-matrix version).



%time
# (note: %time on its own line times an empty statement, hence the
# microsecond readings below; %%time would time the whole cell)
aksvd = ApproximateKSVD(n_components=2000, transform_n_nonzero_coefs=5)
embedding_trans = embeddings.vectors                 # the full (400000, 300) matrix
dictionary = aksvd.fit(embedding_trans).components_  # the atoms, shape (2000, 300)
gamma = aksvd.transform(embedding_trans)             # sparse codes, one row per word


CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 9.54 µs


#gamma = np.load('./data/mats/.npz')
# dictionary_glove6b_300d.np.npz - whole matrix file
dictionary = np.load('./data/mats/dictionary_glove6b_300d_10000.np.npz')
dictionary = dictionary[list(dictionary.keys())[0]]  # unpack the single array from the npz archive


#print(gamma.shape)
print(dictionary.shape)


(2000, 300)


#np.savez_compressed('gamma_glove6b_300d.npz', gamma)
#np.savez_compressed('dictionary_glove6b_300d.npz', dictionary)


3. Examining the atoms

Let's look at a few atoms by listing the words whose embeddings are closest to them.



embeddings.similar_by_vector(dictionary[1354,:])


[('slave', 0.8417330980300903),
 ('slaves', 0.7482961416244507),
 ('plantation', 0.6208109259605408),
 ('slavery', 0.5356900095939636),
 ('enslaved', 0.4814416170120239),
 ('indentured', 0.46423888206481934),
 ('fugitive', 0.4226764440536499),
 ('laborers', 0.41914862394332886),
 ('servitude', 0.41276970505714417),
 ('plantations', 0.4113745093345642)]


embeddings.similar_by_vector(dictionary[1350,:])


[('transplant', 0.7767853736877441),
 ('marrow', 0.699995219707489),
 ('transplants', 0.6998592615127563),
 ('kidney', 0.6526087522506714),
 ('transplantation', 0.6381147503852844),
 ('tissue', 0.6344675421714783),
 ('liver', 0.6085026860237122),
 ('blood', 0.5676015615463257),
 ('heart', 0.5653558969497681),
 ('cells', 0.5476219058036804)]


embeddings.similar_by_vector(dictionary[1546,:])


[('commons', 0.7160810828208923),
 ('house', 0.6588335037231445),
 ('parliament', 0.5054076910018921),
 ('capitol', 0.5014163851737976),
 ('senate', 0.4895153343677521),
 ('hill', 0.48859673738479614),
 ('inn', 0.4566132128238678),
 ('congressional', 0.4341348707675934),
 ('congress', 0.42997264862060547),
 ('parliamentary', 0.4264637529850006)]


embeddings.similar_by_vector(dictionary[1850,:])


[('okano', 0.2669774889945984),
 ('erythrocytes', 0.25755012035369873),
 ('windir', 0.25621023774147034),
 ('reapportionment', 0.2507009208202362),
 ('qurayza', 0.2459488958120346),
 ('taschen', 0.24417680501937866),
 ('pfaffenbach', 0.2437630295753479),
 ('boldt', 0.2394050508737564),
 ('frucht', 0.23922981321811676),
 ('rulebook', 0.23821482062339783)]


It worked! The atoms are interpretable: one is clearly about slavery, another about organ transplantation, another about parliament, while the last one looks like noise; not every atom is meaningful. Now let's reproduce the example from the original paper: find the atoms closest to the embeddings of "tie" and "spring".



itie = index2word.index('tie')
ispring = index2word.index('spring')

tie_emb = embedds[itie]
spring_emb = embedds[ispring]


simlist = []

# rank all atoms by cosine distance to the embedding of 'tie'
for i, vector in enumerate(dictionary):
    simlist.append((cosine(vector, tie_emb), i))

simlist = sorted(simlist, key=lambda x: x[0])
top_atoms_ind = [ins[1] for ins in simlist[:15]]

# print the ten nearest words for each of the 15 closest atoms
for atoms_idx in top_atoms_ind:
    nearest_words = embeddings.similar_by_vector(dictionary[atoms_idx, :])
    nearest_words = [word[0] for word in nearest_words]
    print("Atom #{}: {}".format(atoms_idx, ' '.join(nearest_words)))


Atom #162: win victory winning victories wins won 2-1 scored 3-1 scoring
Atom #58: game play match matches games played playing tournament players stadium
Atom #237: 0-0 1-1 2-2 3-3 draw 0-1 4-4 goalless 1-0 1-2
Atom #622: wrapped wrap wrapping holding placed attached tied hold plastic held
Atom #1899: struggles tying tied inextricably fortunes struggling tie intertwined redefine define
Atom #1941: semifinals quarterfinals semifinal quarterfinal finals semis semi-finals berth champions quarter-finals
Atom #1074: qualifier quarterfinals semifinal semifinals semi finals quarterfinal champion semis champions
Atom #1914: wearing wore jacket pants dress wear worn trousers shirt jeans
Atom #281: black wearing man pair white who girl young woman big
Atom #1683: overtime extra seconds ot apiece 20-17 turnovers 3-2 halftime overtimes
Atom #369: snap picked snapped pick grabbed picks knocked picking bounced pulled
Atom #98: first team start final second next time before test after
Atom #1455: after later before when then came last took again but
Atom #1203: competitions qualifying tournaments finals qualification matches qualifiers champions competition competed
Atom #1602: hat hats mask trick wearing wears sunglasses trademark wig wore


simlist = []

# the same ranking, now for 'spring'
for i, vector in enumerate(dictionary):
    simlist.append((cosine(vector, spring_emb), i))

simlist = sorted(simlist, key=lambda x: x[0])
top_atoms_ind = [ins[1] for ins in simlist[:15]]

for atoms_idx in top_atoms_ind:
    nearest_words = embeddings.similar_by_vector(dictionary[atoms_idx, :])
    nearest_words = [word[0] for word in nearest_words]
    print("Atom #{}: {}".format(atoms_idx, ' '.join(nearest_words)))


Atom #528: autumn spring summer winter season rainy seasons fall seasonal during
Atom #1070: start begin beginning starting starts begins next coming day started
Atom #931: holiday christmas holidays easter thanksgiving eve celebrate celebrations weekend festivities
Atom #1455: after later before when then came last took again but
Atom #754: but so not because even only that it this they
Atom #688: yankees yankee mets sox baseball braves steinbrenner dodgers orioles torre
Atom #1335: last ago year months years since month weeks week has
Atom #252: upcoming scheduled preparations postponed slated forthcoming planned delayed preparation preparing
Atom #619: cold cool warm temperatures dry cooling wet temperature heat moisture
Atom #1775: garden gardens flower flowers vegetable ornamental gardeners gardening nursery floral
Atom #21: dec. nov. oct. feb. jan. aug. 27 28 29 june
Atom #84: celebrations celebration marking festivities occasion ceremonies celebrate celebrated celebrating ceremony
Atom #98: first team start final second next time before test after
Atom #606: vacation lunch hour spend dinner hours time ramadan brief workday
Atom #384: golden moon hemisphere mars twilight millennium dark dome venus magic


Excellent! The atoms nearest to "tie" read as its senses: winning and match results, clothing, knots; the ones nearest to "spring" cover the season, holidays, and gardening.

Note that some atoms (for example #1455 and #98) appear in both lists; they capture generic temporal contexts rather than a sense of either word. Now let's try the same for Russian.



For Russian I take the fastText embeddings from RusVectores, also of dimension 300.



fasttext_model = KeyedVectors.load('/home/astromis/Embeddings/fasttext/model.model')


embeddings = fasttext_model.wv

index2word = embeddings.index2word
embedds = embeddings.vectors


embedds.shape


(164996, 300)


%time
# (same caveat as above about %time vs %%time)
aksvd = ApproximateKSVD(n_components=2000, transform_n_nonzero_coefs=5)
embedding_trans = embeddings.vectors[:10000]         # only the first 10,000 vectors
dictionary = aksvd.fit(embedding_trans).components_
gamma = aksvd.transform(embedding_trans)


CPU times: user 1 µs, sys: 2 µs, total: 3 µs
Wall time: 6.2 µs


dictionary = np.load('./data/mats/dictionary_rus_fasttext_300d.npz')
dictionary = dictionary[list(dictionary.keys())[0]]


embeddings.similar_by_vector(dictionary[1024,:], 20)


[20 nearest Russian words, cosine similarities from 0.685 down to 0.507; the Cyrillic words themselves were lost from this translation of the post]


embeddings.similar_by_vector(dictionary[1582,:], 20)


[('', 0.45191124081611633),
 ('', 0.4515378475189209),
 ('', 0.4478364586830139),
 ('', 0.4280813932418823),
 ('', 0.41220104694366455),
 ('', 0.40772825479507446),
 ('', 0.4047147035598755),
 ('', 0.4030646085739136),
 ('', 0.39368513226509094),
 ('', 0.39012178778648376),
 ('', 0.3866344690322876),
 ('', 0.37968817353248596),
 ('', 0.3728911876678467),
 ('', 0.3663109242916107),
 ('', 0.3640827238559723),
 ('', 0.3474290072917938),
 ('', 0.3473641574382782),
 ('', 0.3468908369541168),
 ('', 0.34586742520332336),
 ('', 0.34555742144584656)]


embeddings.similar_by_vector(dictionary[500,:], 20)


[('', 0.6874514222145081),
 ('-', 0.5172050595283508),
 ('', 0.46720415353775024),
 ('', 0.44713956117630005),
 ('', 0.4144558310508728),
 ('', 0.40545403957366943),
 ('', 0.4030636250972748),
 ('-', 0.4016447067260742),
 ('', 0.38331469893455505),
 ('', 0.37292781472206116),
 ('', 0.3625457286834717),
 ('', 0.35121074318885803),
 ('', 0.3504621088504791),
 ('', 0.34097471833229065),
 ('', 0.33320850133895874),
 ('', 0.3277249336242676),
 ('', 0.3266661763191223),
 ('', 0.31865227222442627),
 ('::', 0.30150306224823),
 ('', 0.2975207567214966)]


# the Cyrillic arguments were lost from this translation; judging by the
# variable names, they were the Russian words for 'tie' and 'spring'
itie = index2word.index('галстук')   # assumed: Russian for 'tie'
ispring = index2word.index('весна')  # assumed: Russian for 'spring'

tie_emb = embedds[itie]
spring_emb = embedds[ispring]


simlist = []

# rank atoms by distance to the Russian 'spring'
for i, vector in enumerate(dictionary):
    simlist.append((cosine(vector, spring_emb), i))

simlist = sorted(simlist, key=lambda x: x[0])
top_atoms_ind = [ins[1] for ins in simlist[:10]]

for atoms_idx in top_atoms_ind:
    nearest_words = embeddings.similar_by_vector(dictionary[atoms_idx, :])
    nearest_words = [word[0] for word in nearest_words]
    print("Atom #{}: {}".format(atoms_idx, ' '.join(nearest_words)))


[atoms #185, #1217, #1213, #1978, #1796, #839, #989, #414, #1140, #878; their Russian word lists were lost from this translation]


simlist = []

# the same ranking for the Russian 'tie'
for i, vector in enumerate(dictionary):
    simlist.append((cosine(vector, tie_emb), i))

simlist = sorted(simlist, key=lambda x: x[0])
top_atoms_ind = [ins[1] for ins in simlist[:10]]

for atoms_idx in top_atoms_ind:
    nearest_words = embeddings.similar_by_vector(dictionary[atoms_idx, :])
    nearest_words = [word[0] for word in nearest_words]
    print("Atom #{}: {}".format(atoms_idx, ' '.join(nearest_words)))


[atoms #883, #40, #215, #688, #386, #676, #414, #127, #592, #703; their Russian word lists were lost from this translation]


#np.savez_compressed('./data/mats/gamma_rus_fasttext_300d.npz', gamma)
#np.savez_compressed('./data/mats/dictionary_rus_fasttext_300d.npz', dictionary)


Conclusion





The technique suggests practical applications such as word sense induction: a word's active atoms enumerate its candidate senses, and Theorem 1 additionally offers a cheap way to embed out-of-vocabulary words from their contexts. The English results reproduce the paper convincingly; the Russian ones look noisier, though interpretable atoms are still there. Either way, it is remarkable how much structure ordinary embeddings, trained with no notion of polysemy, turn out to carry.



UPD: thanks to knagaev.



