Clustering et classification des données Big Text avec M.O. en Java. Article # 3 - Architecture / Résultats

Bonjour, Habr! Aujourd'hui sera la dernière partie du sujet Clustering et classification des données Big Text à l'aide de l'apprentissage automatique en Java. Cet article est une continuation des  premier et deuxième articles .









L'article décrit l'architecture du système, l'algorithme et les résultats visuels. Tous les détails de la théorie et des algorithmes se trouvent dans les deux premiers articles.









Les architectures système peuvent être divisées en deux parties principales: une application Web et un logiciel de clustering et de classification de données









L'algorithme du logiciel d'apprentissage automatique se compose de 3 parties principales:





  1. traitement du langage naturel;





    1. tokenisation;





    2. lemmatisation;





    3. arrêter la liste;





    4. fréquence des mots;





  2. méthodes de regroupement;





    1. TF-IDF;





    2. SVD;





    3. trouver des groupes de grappes;





  3. méthodes de classification - API Aylien.





Traitement du langage naturel

L'algorithme commence par lire les données textuelles. Puisque notre système est une bibliothèque électronique, les livres sont pour la plupart au format pdf. Vous pouvez lire la mise en œuvre et les détails du traitement NLP ici .





Voici une comparaison lors de l'exécution des algorithmes de Lemmatisation et de Stemmitisation:





  : 4173415
    : 88547
    : 82294
      
      











, , , . , :





characterize, design, space, render, robot, face, alisa, kalegina, university, washington, seattle, washington, grace, schroeder, university, washington, seattle, washington, aidan, allchin, lakeside, also, il, school, seattle, washington, keara, berlin, macalester, college, saint, paul, minnesota, kearaberlingmailcom, maya, cakmak, university, washington, seattle, washington, abstract, face, critical, establish, agency, social, robot, building, expressive, mechanical, face, costly, difficult, robot, build, year, face, ren, der, screen, great, flexibility, robot, face, open, design, space, tablish, robot, character, perceive, property, despite, prevalence, robot, render, face, systematic, exploration, design, space, work, aim, fill, gap, conduct, survey, identify, robot, render, face, code, term, property, statistics
      
      



, :





character, design, space, render, robot, face, alisa, kalegina, univers, washington, seattl, washington, grace, schroeder, univers, washington, seattl, washington, grsuwedu, aidan, allchin, lakesid, also, il, school, seattl, washington, keara, berlin, macalest, colleg, saint, paul, minnesota, kearaberlingmailcom, maya, cakmak, univers, washington, seattl, washington, abstract, face, critic, establish, agenc, social, robot, build, express, mechan, face, cost, difficult, mani, robot, built, year, face, ren, dere, screen, great, flexibl, robot, face, open, design, space, tablish, robot, charact, perceiv, properti, despit, preval, robot, render, face, systemat, explor, design, space, work, aim, fill, gap, conduct, survey, identifi, robot, render, face, code, term, properti, statist, common, pattern, observ, data, set, face, conduct, survey, understand, peopl, percep, tion, render, robot, face, identifi, impact, differ, face, featur, survey, result, indic, prefer, vari, level, realism, detail, robot, facecharacter, design, space, render, robot, face, alisa, kalegina, univers, washington, seattl, washington, grace, schroeder, univers, washington, seattl, washington, grsuwedu, aidan, allchin, lakesid, also, il, school, seattl, washington, keara, berlin, macalest, colleg, saint, paul, minnesota, kearaberlingmailcom, maya, cakmak, univers, washington, seattl, washington, abstract, face, critic, establish, agenc, social, robot, build, express, mechan, face, cost, difficult, mani, robot, built, year, face, ren, dere, screen, great, flexibl, robot, face, open, design, space, tablish, robot, charact, perceiv, properti, despit, preval, robot, render, face, systemat, explor, design, space, work, aim, fill, gap, conduct, survey, identifi, robot, render, face, code, term, properti, statist, common, pattern, observ, data, set, face, conduct, survey, understand, peopl, percep, tion, render, robot, face, identifi, impact, differ, face, featur, survey, result, indic, prefer, vari, level, realism, detail, robot, face
      
      











tf-idf . HashMap, - , - -.





-:





tf-idf:









, , tf-idf. :





-0.0031139399383999997 0.023330604746 -1.3650204652799997E-4
-0.038380206566 0.00104373247064 0.056140327901
-0.006980774822399999 0.073057418689 -0.0035209342337999996
-0.0047152503238 0.0017397257449 0.024816828582999998
-0.005195951771999999 0.03189764447 -5.9991080912E-4
-0.008568593700999999 0.114337675179 -0.0088221197958
-0.00337365927 0.022604474721999997 -1.1457816390099999E-4
-0.03938283525 -0.0012682796482399999 0.0023486548592
-0.034341362795999995 -0.00111758118864 0.0036010404917
-0.0039026609385999994 0.0016699372352999998 0.021206653766000002
-0.0079418490394 0.003116062838 0.072380311755
-0.007021828444599999 0.0036496566028 0.07869801528199999
-0.0030219410092 0.018637386319 0.00102082843809
-0.0042041069026 0.023621439238999998 0.0022947637053
-0.0061050946438 0.00114796066823 0.018477825284
-0.0065708646563999995 0.0022944737838999996 0.035902813761
-0.037790461814 -0.0015372596281999999 0.008878823611899999
-0.13264545848599998 -0.0144908102251 -0.033606397957999995
-0.016229093174 1.41831464625E-4 0.005181988760999999
-0.024075296507999996 -8.708131965899999E-4 0.0034344653516999997

      
      











SVD   .





, .  – , . OrientDB , OrientDB . OrientDB , , , . . .





, .









– . , , DBSCAN. . . r=0.007. 562 80.000 , . , .





r = max (D) / n









   max(D)  ‒ , . n -













, . – , –









, . 4-. ( > nt)





nt = N / S

N‒ - , S ‒ .









, .





– Aylien API





Aylien API . API json , . API . 9 , . POST API:





String queryText = "select  DocText from documents where clusters = '" + cluster + "'";
   OResultSet resultSet = database.query(queryText);
   while (resultSet.hasNext()) {
   OResult result = resultSet.next();

   String textDoc = result.toString().replaceAll("[\\<||\\>||\\{||\\}]", "").replaceAll("doctext:", "")
   .toLowerCase();
   keywords.add(textDoc.replaceAll("\\n", ""));
   }

   ClassifyByTaxonomyParams.Builder classifyByTaxonomybuilder    = ClassifyByTaxonomyParams.newBuilder();
   classifyByTaxonomybuilder.setText(keywords.toString());
   classifyByTaxonomybuilder.setTaxonomy(ClassifyByTaxonomyParams.StandardTaxonomy.IAB_QAG);
   TaxonomyClassifications response = client.classifyByTaxonomy(classifyByTaxonomybuilder.build());
   for (TaxonomyCategory c : response.getCategories()) {
   clusterUpdate.add(c.getLabel());
   }

      
      







GET, :









. .













. . , . . , . , :









-





- – . , . - , . Vaadin Flow:









:





  • , .





  • .





  • -.





  • , , , , -.





  • .













“Technology & Computing”:









:









:









, . . , , . . . . : .





, , , -, tf-idf, . , . DBSCAN . . , , . , , , , ..





, NoSQL , OrinetDB, 4 NoSQL. , . OrientDB , .





Aylien API, . , 100 . , , , k-, . , .








All Articles