My bumpy road to automatic text categorization with Perl modules / algorithms - Part I
If I manage to walk this road to the end successfully, I might follow up with a more understandable explanation in the form of a summary. Maybe, possibly. If this can already serve as inspiration for anyone: please, help yourselves! (Everyone else, please just ignore it!)
Näggl mit Köppn - Probelauf 1 - Hembelz Om(x)
(easier to read via this link)
Others use other people's algorithms successfully:
And now it's time for me, too, to try my luck with ready-made algorithms. Stealing, as in the example above, isn't my intention anyway; rather, I'll lay my cards on the table, as usual. So then: okay - green light! (says my conscience department)
Link collection
My tsquery_tsranks program is busy collecting sublinks and ranks on the topic right now. In the meantime, I already have a few manually gathered finds to offer:
Algorithm::Kmeanspp - perl implementation of K-means++ - metacpan.org
python - Wie die aussagekräftigen Wortes zu finden, jedes k-means-Cluster aus word2vec Vektoren abgeleitet darzustellen? - FrageIT.de
Microsoft PowerPoint - KDD2-7-MultiInstanzDataMining.ppt [Kompatibilitätsmodus] - KDD2-7-MultiInstanzDataMining.pdf
k-means und wortvektoren - Google-Suche
K-Means
k-Means-Clustering: Big Data am Beispiel von Hemdgrößen - Micromata
Was ist der k-Means-Algorithmus?
And with this I'll simply get started - rough & casual & above all: simple:
sudo perl -MCPAN -e shell
cpan[1]> install Algorithm::Kmeanspp
...
...............................................................DONE
Fetching with LWP:
http://www.cpan.org/modules/03modlist.data.gz
Reading '/home/zarko/.local/share/.cpan/sources/modules/03modlist.data.gz'
DONE
Writing /home/zarko/.local/share/.cpan/Metadata
Running install for module 'Algorithm::Kmeanspp'
Fetching with LWP:
http://www.cpan.org/authors/id/F/FU/FUJISAWA/Algorithm-Kmeanspp-0.03.tar.gz
Fetching with LWP:
http://www.cpan.org/authors/id/F/FU/FUJISAWA/CHECKSUMS
Checksum for /home/zarko/.local/share/.cpan/sources/authors/id/F/FU/FUJISAWA/Algorithm-Kmeanspp-0.03.tar.gz ok
Scanning cache /home/zarko/.local/share/.cpan/build for sizes
............................................................................DONE
'YAML' not installed, will not store persistent state
Configuring F/FU/FUJISAWA/Algorithm-Kmeanspp-0.03.tar.gz with Makefile.PL
Bareword "use_test_base" not allowed while "strict subs" in use at Makefile.PL line 13.
Execution of Makefile.PL aborted due to compilation errors.
Warning: No success on command[/usr/bin/perl Makefile.PL INSTALLDIRS=site]
FUJISAWA/Algorithm-Kmeanspp-0.03.tar.gz
/usr/bin/perl Makefile.PL INSTALLDIRS=site -- NOT OK
Failed during this command:
FUJISAWA/Algorithm-Kmeanspp-0.03.tar.gz : writemakefile NO '/usr/bin/perl Makefile.PL INSTALLDIRS=site' returned status 65280
YAML is missing - whatever that is.
Yep, that was it! Easiest! After an "install YAML" in the CPAN shell, the installation of the desired algorithm now runs through smoothly.
Since the algorithm is apparently quite elaborately programmed and the installation takes a while, I'll quickly write a draft of a program - later to become a subroutine or module - that transforms my tssearch word vectors into hash-form vectors for/in Perl.
Code
#!/usr/bin/perl
# tsvector2perlhash.pl
use strict;
use warnings;
use DBI;
use ZugangsDaten_postgresql qw($DB_USER $DB_PASSWD);
use Encode qw(is_utf8 decode encode);
# Main program
## Ask for the vector ID
print "\nBitte die Wortvektor-ID (link_id) eingeben!\n";
my $link_id = <STDIN>;
chomp $link_id;
## Print the vector to the screen as a string
connect_db;
my $vector = vector2hash($link_id);
disconnect_db;
print "\nDer ermittelte Wortvektor sieht so aus:\n\n";
print $vector;
print "\nZufrieden mit dem Zwischenergebnis?\n";
###########################################################
############### Subroutines ####################
###########################################################
# Subroutines for export
sub connect_db {
## Establish the DB connection
$dbh = DBI->connect("DBI:Pg:dbname=links;host=localhost", "$DB_USER", "$DB_PASSWD");
}
sub disconnect_db {
## Disconnect from the DB
$dbh->disconnect();
}
sub vector2hash {
my $link_id = shift;
my $vector_select = $dbh->prepare("SELECT vector FROM wordvectors WHERE link_id = $link_id;");
$vector_select->execute();
my $vector_string = $vector_select->fetchrow;
return $vector_string
}
Hier meldet sich die "ZugangsDaten_postgres.pm": Huhuhu!
Global symbol "$dbh" requires explicit package name (did you forget to declare "my $dbh"?) at tsvector2perlhash.pl line 39.
Global symbol "$dbh" requires explicit package name (did you forget to declare "my $dbh"?) at tsvector2perlhash.pl line 44.
Global symbol "$dbh" requires explicit package name (did you forget to declare "my $dbh"?) at tsvector2perlhash.pl line 49.
Bareword "connect_db" not allowed while "strict subs" in use at tsvector2perlhash.pl line 21.
Bareword "disconnect_db" not allowed while "strict subs" in use at tsvector2perlhash.pl line 23.
Execution of tsvector2perlhash.pl aborted due to compilation errors.
#!/usr/bin/perl
# tsvector2perlhash.pl
use strict;
use warnings;
use DBI;
use ZugangsDaten_postgresql qw($DB_USER $DB_PASSWD);
use Encode qw(is_utf8 decode encode);
# Variables
my $dbh;
# Main program
## Ask for the vector ID
print "\nBitte die Wortvektor-ID (link_id) eingeben!\n";
my $link_id = <STDIN>;
chomp $link_id;
## Print the vector to the screen as a string
connect_db();
my $vector = vector2hash($link_id);
disconnect_db();
print "\nDer ermittelte Wortvektor sieht so aus:\n\n";
print $vector;
print "\nZufrieden mit dem Zwischenergebnis?\n";
###########################################################
############### Subroutines ####################
###########################################################
# Subroutines for export
sub connect_db {
## Establish the DB connection
$dbh = DBI->connect("DBI:Pg:dbname=links;host=localhost", "$DB_USER", "$DB_PASSWD");
}
sub disconnect_db {
## Disconnect from the DB
$dbh->disconnect();
}
sub vector2hash {
my $link_id = shift;
my $vector_select = $dbh->prepare("SELECT vector FROM wordvectors WHERE link_id = $link_id;");
$vector_select->execute();
my $vector_string = $vector_select->fetchrow;
return $vector_string
}
Hier meldet sich die "ZugangsDaten_postgres.pm": Huhuhu!
Bitte die Wortvektor-ID (link_id) eingeben!
55555
DBD::Pg::st execute failed: ERROR: column "vector" does not exist
LINE 1: SELECT vector FROM wordvectors WHERE link_id = 55555;
^ at tsvector2perlhash.pl line 55, <STDIN> line 1.
DBD::Pg::st fetchrow failed: no statement executing at tsvector2perlhash.pl line 56, <STDIN> line 1.
Der ermittelte Wortvektor sieht so aus:
Use of uninitialized value $vector in print at tsvector2perlhash.pl line 31, <STDIN> line 1.
Zufrieden mit dem Zwischenergebnis?
my $vector_select = $dbh->prepare("SELECT wordvector FROM wordvectors WHERE link_id = $link_id;");
Hier meldet sich die "ZugangsDaten_postgres.pm": Huhuhu!
Bitte die Wortvektor-ID (link_id) eingeben!
55555
Der ermittelte Wortvektor sieht so aus:
Wide character in print at tsvector2perlhash.pl line 31, <STDIN> line 1.
'-0':16383
'-00':16383 '-010':16383 '-0168':16383 '-02':16383 '-02049':16383
'-0404':16383 '-0481':16383 '-06':62,16383 '-0716':16383 '-0822':16383
'-09':16383 '-1':16383 '-11':16383 '-11482':16383
'-12':6610,6611,15687,15688 '-125':16383 '-127':16383 '-1614':16383
'-17':16383 '-1746':16383 '-177446':16383 '-18':63 '-19':16383
'-1976':16383 '-2':16383 '-20130127':16383 '-20735':16383 '-237':16383
'-239':16383 '-24':16383 '-25680':16383 '-269':16383 '-28':8044,16383
'-3':16383 '-304':16383 '-306':16383 '-307':16383 '-3077':16383
'-312':16383 '-313':16383 '-316':16383 '-333':16383 '-33874':16383
'-345':16383 '-34969':16383 '-35338':16383 '-36':16383 '-3636':16383
'-37':16383 '-393':16383 '-4':16383 '-4000':16383 '-4165':16383
'-451':16383 '-5':16383 '-512941':16383 '-516921':16383 '-5248':16383
'-525':16383 '-531':16383 '-534':16383 '-5494':16383 '-553':16383
'-55652':16383 '-56025':16383 '-56976':16383 '-57':16383 '-57215':16383
'-59240':16383 '-59256':16383 '-6':16383 '-600':16383 '-60398':16383
'-61':16383 '-61613':16383 '-6209':16383 '-7':16383 '-705':16383
'-7119':16383 '-73':16383 '-731':16383 '-733':16383 '-7432'
...
-yo':16383
'yogi':11883 'yoko':8881,9460,10541,13640,16383
'york':5185,5688,6211,6293,6304,6357,6385,6833,9753,10745,13083,13103,13176,14356,15890,16383
'you':989,2566,3218,3231,3242,3263,4227,4375,4677,7954,9218,10673,10783,11684,12060,12320,12330,13257,13562,13587,13659,13711,13887,13964,14061,14206,14340,14787,15817,16301,16383
'young':775,912,16383 'your':8654,16306,16383 'yourself':16383
'youssou':16383 'youth':5278,5618,11968,16383 'youtub':16383
'yvonn':16383 'zenith':1210,1274,16383 'zeppelin':16383 'zoo':16383
'zubin':16383 'à':10569 'ádám':16383 'álvaro':16383 'íslenska':16383
'čeština':16383 'ελληνικά':16383 'беларуская':16383 'български':16383
'в':4842,16383 'македонски':16383 'монгол':16383 'нохчийн':16383
'русиньскый':16383 'русский':16383 'снова':4841,16383 'српски':16383
'српскохрватски':16383 'ссср':4843,16383 'тарашкевіца':16383
'українська':16383 'ўзбекча':16383 'қазақша':16383 'հայերեն':16383
'ייִדיש':16383 'עברית':16383 'اردو':16383 'العربية':16383 'فارسی':16383
'مصرى':16383 'कोंकणी':16383 'गोंयची':16383 'नेपाली':16383 'मराठी':16383
'हिन्दी':16383 'বাংলা':16383 'മലയാളം':16383 'ไทย':16383
'მარგალური':16383 'ქართული':16383 '中文':16383 '日本語':16383 '粵語':16383
'한국어':16383
Zufrieden mit dem Zwischenergebnis?
Yes, satisfied.
(I had to do this in rush-rush mode because someone here is pawing the ground to get into the living room to the telly, after I had asked to be allowed to shut the door behind me for at least half an hour so I could concentrate on this thing - which matters to me. Genuine understanding is not to be expected in this life anymore, oh well.)
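Before moving on, a small aside to myself about the SELECT inside vector2hash: interpolating $link_id straight into the SQL string works here, but a DBI placeholder would be the more robust (and injection-proof) variant. A minimal sketch of how that could look - an idea for later, not what the script above actually does (it assumes the same global $dbh as above):
# Hypothetical variant of vector2hash with a bound placeholder
# instead of interpolating $link_id into the SQL string.
sub vector2hash_bound {
    my $link_id = shift;
    my $sth = $dbh->prepare('SELECT wordvector FROM wordvectors WHERE link_id = ?');
    $sth->execute($link_id);                    # DBI binds the value safely
    my ($vector_string) = $sth->fetchrow_array;
    return $vector_string;
}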
#!/usr/bin/perl
# kmeanspp-demo.pl
use Algorithm::Kmeanspp;
# input documents
my %documents = (
Alex => { 'Pop' => 10, 'R&B' => 6, 'Rock' => 4 },
Bob => { 'Jazz' => 8, 'Reggae' => 9 },
Dave => { 'Classic' => 4, 'World' => 4 },
Ted => { 'Jazz' => 9, 'Metal' => 2, 'Reggae' => 6 },
Fred => { 'Hip-hop' => 3, 'Rock' => 3, 'Pop' => 3 },
Sam => { 'Classic' => 8, 'Rock' => 1 },
);
my $kmp = Algorithm::Kmeanspp->new;
foreach my $id (keys %documents) {
$kmp->add_document($id, $documents{$id});
}
my $num_cluster = 3;
my $num_iter = 20;
$kmp->do_clustering($num_cluster, $num_iter);
# show clustering result
foreach my $cluster (@{ $kmp->clusters }) {
print join "\t", @{ $cluster };
print "\n";
}
# show cluster centroids
foreach my $centroid (@{ $kmp->centroids }) {
print join "\t", map { sprintf "%s:%.4f", $_, $centroid->{$_} }
keys %{ $centroid };
print "\n";
}
Can't
locate Algorithm/Kmeanspp.pm in @INC (you may need to install the
Algorithm::Kmeanspp module) (@INC contains: /etc/perl
/usr/local/lib/x86_64-linux-gnu/perl/5.26.1 /usr/local/share/perl/5.26.1
/usr/lib/x86_64-linux-gnu/perl5/5.26 /usr/share/perl5
/usr/lib/x86_64-linux-gnu/perl/5.26 /usr/share/perl/5.26
/usr/local/lib/site_perl /usr/local/lib/x86_64-linux-gnu/perl/5.26.0
/usr/local/share/perl/5.26.0 /usr/lib/x86_64-linux-gnu/perl-base) at
kmeanspp-demo.pl line 6.
BEGIN failed--compilation aborted at kmeanspp-demo.pl line 6.
So, too bad. The time window has closed now. I'll have to finish this at home. But I got quite far on short notice. Great!
Here at my domicile I get a strange error message:
cpan install Algorithm::Kmeanspp
Loading internal logger. Log::Log4perl recommended for better logging
CPAN: Storable loaded ok (v2.53_01)
Reading '/home/zarko/.cpan/Metadata'
Database was generated on Sun, 06 Jan 2019 18:29:02 GMT
Running install for module 'Algorithm::Kmeanspp'
CPAN: Digest::SHA loaded ok (v5.95)
CPAN: Compress::Zlib loaded ok (v2.068)
Checksum for /home/zarko/.cpan/sources/authors/id/F/FU/FUJISAWA/Algorithm-Kmeanspp-0.03.tar.gz ok
CPAN: YAML loaded ok (v1.27)
CPAN: CPAN::Meta::Requirements loaded ok (v2.132)
CPAN: Parse::CPAN::Meta loaded ok (v1.4414)
CPAN: CPAN::Meta loaded ok (v2.150001)
CPAN: Module::CoreList loaded ok (v5.20151213)
Configuring F/FU/FUJISAWA/Algorithm-Kmeanspp-0.03.tar.gz with Makefile.PL
Bareword "use_test_base" not allowed while "strict subs" in use at Makefile.PL line 13.
Execution of Makefile.PL aborted due to compilation errors.
Warning: No success on command[/usr/bin/perl Makefile.PL INSTALLDIRS=site]
FUJISAWA/Algorithm-Kmeanspp-0.03.tar.gz
/usr/bin/perl Makefile.PL INSTALLDIRS=site -- NOT OK
Doing some naive back-of-the-envelope thinking, I'll now take a look at the Makefile.PL ...
use inc::Module::Install;
name 'Algorithm-Kmeanspp';
all_from 'lib/Algorithm/Kmeanspp.pm';
requires 'Carp';
requires 'Class::Accessor::Fast';
requires 'List::Util';
tests 't/*.t';
author_tests 'xt';
build_requires 'Test::More';
use_test_base;
auto_include;
WriteAll;
Checked all the requirements. Two were missing. Still an error. Now only this one is left:
use_test_base;
After commenting out that line (the bareword is presumably supplied by the Module::Install::TestBase extension, which apparently isn't available here) and trying it manually:
perl Makefile.PL
Cannot determine perl version info from lib/Algorithm/Kmeanspp.pm
Checking if your kit is complete...
Looks good
Generating a Unix-style Makefile
Writing Makefile for Algorithm::Kmeanspp
Unable to open MakeMaker.tmp: Permission denied at /usr/share/perl/5.22/ExtUtils/MakeMaker.pm line 1173.
Somehow this seems to be missing here:
ExtUtils::MakeMaker
It's being installed in the CPAN shell right now. Though by this point I've pretty much lost track of what's going on ;-) .
Permission denied at /usr/local/share/perl/5.22.1/ExtUtils/MakeMaker.pm line 1227.
Why?
...
Maybe there's another KMeans module ... (a poor solution, really, but let's have a look ...)
Not a good idea.
newbie, problem in installing module
On Tue, 30 Jan 2001 09:15:27 GMT, Rafael Garcia-Suarez
Quote:
>Pradeep Sethi wrote in comp.lang.perl.misc:
>> Writing Makefile for XML::XPath
>> Unable to open MakeMaker.tmp: Permission denied at
>> /usr/lib/perl5/5.6.0/ExtUtils/MakeMaker.pm line 747.
>(Strange error to occur when you run perl as root.) But this error comes
>from the system, not from perl.
Yes, and it could be a NFS file system mounted without root
permissions.
Probably not the best idea to install Perl modules as root anyway.
--
Garry Williams
So, try installing the module as non-root?
Nope, also the wrong track.
Paul Yachnes wrote:
> Now I get the following error:
>
> Writing Makefile for koha
> Unable to open MakeMaker.tmp: Permission denied at
> /usr/share/perl/5.8/ExtUtils/MakeMaker.pm line 878.
I fixed by changing permissions on the koha folder.
Paul
Koha mailing list
http://lists.katipo.co.nz/mailman/listinfo/koha
Finally, one step further!
perl Makefile.PL
Bareword "use_test_base" not allowed while "strict subs" in use at Makefile.PL line 13.
Simply deleted that line from the file, and:
perl Makefile.PL
Cannot determine perl version info from lib/Algorithm/Kmeanspp.pm
Checking if your kit is complete...
Looks good
Generating a Unix-style Makefile
Writing Makefile for Algorithm::Kmeanspp
Writing MYMETA.yml and MYMETA.json
$ perl Makefile.PL
$ make
$ make test
$ make install
https://www.perlmonks.org/?node_id=128077
make
cp lib/Algorithm/Kmeanspp.pm blib/lib/Algorithm/Kmeanspp.pm
Manifying 1 pod document
make test
PERL_DL_NONLAZY=1 "/usr/bin/perl" "-MExtUtils::Command::MM" "-MTest::Harness" "-e" "undef *Test::Harness::Switches; test_harness(0, 'inc', 'blib/lib', 'blib/arch')" t/*.t
t/00_compile.t ..... ok
t/01_basic.t ....... ok
t/02_clustering.t .. ok
All tests successful.
Files=3, Tests=318, 1 wallclock secs ( 0.06 usr 0.02 sys + 0.34 cusr 0.01 csys = 0.43 CPU)
Result: PASS
make install
Manifying 1 pod document
Installing /home/zarko/perl5/lib/perl5/Algorithm/Kmeanspp.pm
Installing /home/zarko/perl5/man/man3/Algorithm::Kmeanspp.3pm
Appending installation info to /home/zarko/perl5/lib/perl5/i686-linux-gnu-thread-multi-64int/perllocal.pod
Well then. And now?
cpan[1]> install Algorithm::Kmeanspp
Reading '/home/zarko/.cpan/Metadata'
Database was generated on Sun, 06 Jan 2019 18:29:02 GMT
Algorithm::Kmeanspp is up to date (0.03).
Almost looks like that was it. Funny. Let's give it a quick test!
Output
perl kmeanspp-demo.pl
Ted Bob
Dave Sam
Fred Alex
Metal:0.6667 Jazz:5.6667 Reggae:5.0000
World:1.3333 Rock:0.3333 Classic:4.0000
Pop:3.2500 R&B:1.5000 Hip-hop:0.7500 Rock:1.7500
So. Great, great. This is how unspectacularly this troubleshooting hunt comes to an end.
Next step:
Converting my word vectors into hashes.
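A first stab at that conversion: judging by the output further up, the wordvector string has the typical tsvector layout 'word':pos1,pos2,... . Here is a minimal sketch - my own assumption about the format, no finished module code - that turns such a string into a Perl hash mapping each word to the number of its positions (a crude term frequency):
#!/usr/bin/perl
# tsvector_string2hash-sketch.pl - hypothetical helper, not part of the scripts above
use strict;
use warnings;

# Parse a tsvector-style string like "'york':5185,5688 'you':989,2566"
# into a hash: word => number of listed positions.
# (Words containing embedded quotes are not handled in this sketch.)
sub tsvector_string_to_hash {
    my $tsvector_string = shift;
    my %vector;
    while ( $tsvector_string =~ /'([^']+)':([0-9A-D,]+)/g ) {
        my ( $word, $positions ) = ( $1, $2 );
        my @positions = split /,/, $positions;
        $vector{$word} = scalar @positions;
    }
    return \%vector;
}

my $demo   = q{'york':5185,5688,6211 'you':989,2566 'young':775,912};
my $vector = tsvector_string_to_hash($demo);
print "$_ => $vector->{$_}\n" for sort keys %{$vector};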
The real next step
I'm still looking for the right question. The answer I want to find merely needs to enable me to take the next sensible step with my data, with the help of the K-means++ algorithm module.
But how to find it?
This offers a lot, but apparently also way too much. I'm not even at that point yet. Or am I?
D:/Uni/dipl-Arbeit/Ausarbeitung/Verschriftlichung/DA.dvi - hennig_2005a.pdf
Before an analysis, one has to decide with respect to which variables the objects are to be compared. Then a measure has to be chosen that expresses the similarity or dissimilarity between the objects numerically. Since variables are usually stored as numeric codes, each object is represented as a point in a finite-dimensional space; its dimension equals the number of analysis variables. As measures of dissimilarity, metrics on finite-dimensional real spaces, or quantities derived from them such as the Euclidean metric or its squared value, are used.
This finally seems to be it:
Decide on similarity measures! (A small sketch follows right after the search links below.)
https://www.google.com/search?client=ubuntu&channel=fs&q=%C3%84hnlichkeitsma%C3%9Fe+textanalyse&ie=utf-8&oe=utf-8
probe.pdf
Multimedia Retrieval im WS 2011/2012 6. Ähnlichkeitsmaße - MMR06.pdf
Clusteranalyse
Microsoft PowerPoint - M3_Vorlesung_6_ CA_mit_PVL - M3_Vorlesung_6_-CA.pdf
Ähnlichkeitsmaße
clusteranalyse.fm - clusteranalyse.pdf
skript_clusteranalyse_sose2011.pdf
Microsoft PowerPoint - meth11 - meth11.pdf
Ähnlichkeitsmaße für Vektoren - Haenelt_VektorAehnlichkeit.pdf
Ähnlichkeitsanalyse – Wikipedia
A lot to look through.
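To make the thesis excerpt above concrete for my hash-shaped word vectors, here is a minimal toy sketch - my own helper, nothing from any of the modules mentioned - of the Euclidean distance and the cosine similarity between two vectors stored as Perl hashes:
#!/usr/bin/perl
# similarity-sketch.pl - hypothetical toy example for hash-based word vectors
use strict;
use warnings;

# Euclidean distance: sqrt of the sum of squared differences over all words.
sub euclidean_distance {
    my ( $v1, $v2 ) = @_;
    my %words = map { $_ => 1 } keys %$v1, keys %$v2;
    my $sum = 0;
    for my $word ( keys %words ) {
        my $diff = ( $v1->{$word} // 0 ) - ( $v2->{$word} // 0 );
        $sum += $diff * $diff;
    }
    return sqrt $sum;
}

# Cosine similarity: dot product divided by the product of the vector lengths.
sub cosine_similarity {
    my ( $v1, $v2 ) = @_;
    my ( $dot, $len1, $len2 ) = ( 0, 0, 0 );
    for my $word ( keys %$v1 ) {
        $dot  += $v1->{$word} * ( $v2->{$word} // 0 );
        $len1 += $v1->{$word}**2;
    }
    $len2 += $_**2 for values %$v2;
    return 0 unless $len1 && $len2;
    return $dot / ( sqrt($len1) * sqrt($len2) );
}

my %doc1 = ( humpty => 2, dumpty => 2, wall => 1 );
my %doc2 = ( remember => 2, november => 1, humpty => 1 );
printf "Euclidean distance: %.4f\n", euclidean_distance( \%doc1, \%doc2 );
printf "Cosine similarity:  %.4f\n", cosine_similarity( \%doc1, \%doc2 );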
TF-IDF
That the road somehow has to go via tf-idf is something that, as a RapidMiner user, I really could/should have figured out earlier.
tf–idf - Wikipedia
Text::TFIDF - Perl extension for computing the TF-IDF measure - metacpan.org
Vorlesung Wissensentdeckung in Datenbanken - SVM -- Textkategorisierung - svm3.pdf
Tf-idf-Maß – Wikipedia
tensorflow - Warum tf.mul im Word2vec Trainingsprozess verwenden?
And - as far as I can currently guess - these values will have to be converted into a vector value (e.g. a value between 0 and 1). Somehow. But I'm gradually getting closer and closer to that "somehow".
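To make that hunch a little more tangible: a minimal sketch - my own assumption about how it could be done, not anything taken from Lingua::TFIDF - that scales a tf-idf hash to unit length, so that every component ends up between 0 and 1:
#!/usr/bin/perl
# tfidf-normalize-sketch.pl - hypothetical helper for later experiments
use strict;
use warnings;

# Divide every tf-idf weight by the Euclidean length of the whole vector.
# Since tf-idf weights are non-negative, each component then lies in [0, 1].
sub normalize_tfidf {
    my $tfidf  = shift;    # hashref: word => tf-idf weight
    my $length = 0;
    $length += $_**2 for values %$tfidf;
    $length = sqrt $length;
    return {} unless $length;
    return { map { $_ => $tfidf->{$_} / $length } keys %$tfidf };
}

my %tfidf = ( jazz => 1.386, reggae => 0.693, metal => 0.693 );
my $unit  = normalize_tfidf( \%tfidf );
printf "%-8s %.4f\n", $_, $unit->{$_} for sort keys %$unit;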
Path to mecab config? [/usr/bin/mecab-config]
install Text::MeCab
Running install for module 'Text::MeCab'
DMAKI/Text-MeCab-0.20016.tar.gz
Has already been unwrapped into directory /home/zarko/.cpan/build/Text-MeCab-0.20016-0
DMAKI/Text-MeCab-0.20016.tar.gz
No 'Makefile' created
, not re-running
cpan[3]> install Lingua::TFIDF
Running install for module 'Lingua::TFIDF'
SEKIA/Lingua-TFIDF-0.01.tar.gz
Has already been unwrapped into directory /home/zarko/.cpan/build/Lingua-TFIDF-0.01-0
SEKIA/Lingua-TFIDF-0.01.tar.gz
Has already been prepared
SEKIA/Lingua-TFIDF-0.01.tar.gz
Has already been made
Running make test for SEKIA/Lingua-TFIDF-0.01.tar.gz
PERL_DL_NONLAZY=1 "/usr/bin/perl" "-MExtUtils::Command::MM" "-MTest::Harness" "-e" "undef *Test::Harness::Switches; test_harness(0, 'blib/lib', 'blib/arch')" t/Lingua/*.t t/Lingua/TFIDF/WordSegmenter/*.t t/Lingua/TFIDF/WordSegmenter/JA/*.t
t/Lingua/TFIDF.t ............................. ok
t/Lingua/TFIDF/WordSegmenter/JA/MeCab.t ...... 1/?
# Failed test 'use Lingua::TFIDF::WordSegmenter::JA::MeCab;'
# at t/Lingua/TFIDF/WordSegmenter/JA/MeCab.t line 6.
# Tried to use 'Lingua::TFIDF::WordSegmenter::JA::MeCab'.
# Error: Can't locate Text/MeCab.pm in @INC (you may need to install the Text::MeCab module) (@INC contains: /home/zarko/.cpan/build/Lingua-TFIDF-0.01-0/blib/lib /home/zarko/.cpan/build/Lingua-TFIDF-0.01-0/blib/arch /etc/perl /usr/local/lib/i386-linux-gnu/perl/5.22.1 /usr/local/share/perl/5.22.1 /usr/lib/i386-linux-gnu/perl5/5.22 /usr/share/perl5 /usr/lib/i386-linux-gnu/perl/5.22 /usr/share/perl/5.22 /usr/local/lib/site_perl /usr/lib/i386-linux-gnu/perl-base .) at /home/zarko/.cpan/build/Lingua-TFIDF-0.01-0/blib/lib/Lingua/TFIDF/WordSegmenter/JA/MeCab.pm line 9.
# BEGIN failed--compilation aborted at /home/zarko/.cpan/build/Lingua-TFIDF-0.01-0/blib/lib/Lingua/TFIDF/WordSegmenter/JA/MeCab.pm line 9.
# Compilation failed in require at t/Lingua/TFIDF/WordSegmenter/JA/MeCab.t line 6.
# BEGIN failed--compilation aborted at t/Lingua/TFIDF/WordSegmenter/JA/MeCab.t line 6.
# Failed test 'Lingua::TFIDF::WordSegmenter::JA::MeCab->new() died'
# at t/Lingua/TFIDF/WordSegmenter/JA/MeCab.t line 8.
# Error was: Can't locate object method "new" via package "Lingua::TFIDF::WordSegmenter::JA::MeCab" at /usr/local/share/perl/5.22.1/Test/More.pm line 717.
Can't call method "segment" on an undefined value at t/Lingua/TFIDF/WordSegmenter/JA/MeCab.t line 17.
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 255 just after 2.
t/Lingua/TFIDF/WordSegmenter/JA/MeCab.t ...... Dubious, test returned 255 (wstat 65280, 0xff00)
Failed 2/2 subtests
t/Lingua/TFIDF/WordSegmenter/LetterNgram.t ... ok
t/Lingua/TFIDF/WordSegmenter/SplitBySpace.t .. ok
Test Summary Report
-------------------
t/Lingua/TFIDF/WordSegmenter/JA/MeCab.t (Wstat: 65280 Tests: 2 Failed: 2)
Failed tests: 1-2
Non-zero exit status: 255
Parse errors: No plan found in TAP output
Files=4, Tests=16, 1 wallclock secs ( 0.05 usr 0.00 sys + 0.44 cusr 0.05 csys = 0.54 CPU)
Result: FAIL
Failed 1/4 test programs. 2/16 subtests failed.
Makefile:890: die Regel für Ziel „test_dynamic“ scheiterte
make: *** [test_dynamic] Fehler 255
SEKIA/Lingua-TFIDF-0.01.tar.gz
/usr/bin/make test -- NOT OK
//hint// to see the cpan-testers results for installing this module, try:
reports SEKIA/Lingua-TFIDF-0.01.tar.gz
Failed during this command:
SEKIA/Lingua-TFIDF-0.01.tar.gz : make_test NO
Errors that the world loves :-)
sudo apt install libtext-mecab-perl
cpan[8]> install Text::MeCab
Text::MeCab is up to date (0.20016).
That probably just means the tests are bad, rather than the code itself, and you can do force install Thread::Conveyor::Monitored
to bypass the testing.
...
https://superuser.com/questions/145601/what-steps-to-take-when-cpan-installation-fails
I tried doing this from source, and when I run make test, I get the same diagnostic messages. The make itself is fine - in fact, I think this is a pure perl module, so there's nothing to make. The issue is that the tests fail.
– pythonic metaphor, May 26 '10 at 19:43
cpan[1]> force install Lingua::TFIDF
Reading '/home/zarko/.cpan/Metadata'
Database was generated on Tue, 08 Jan 2019 05:17:02 GMT
Running install for module 'Lingua::TFIDF'
Checksum for /home/zarko/.cpan/sources/authors/id/S/SE/SEKIA/Lingua-TFIDF-0.01.tar.gz ok
Scanning cache /home/zarko/.cpan/build for sizes
............................................................................DONE
Configuring S/SE/SEKIA/Lingua-TFIDF-0.01.tar.gz with Makefile.PL
Checking if your kit is complete...
Looks good
Generating a Unix-style Makefile
Writing Makefile for Lingua::TFIDF
Writing MYMETA.yml and MYMETA.json
SEKIA/Lingua-TFIDF-0.01.tar.gz
/usr/bin/perl Makefile.PL INSTALLDIRS=site -- OK
Running make for S/SE/SEKIA/Lingua-TFIDF-0.01.tar.gz
cp lib/Lingua/TFIDF.pm blib/lib/Lingua/TFIDF.pm
cp lib/Lingua/TFIDF/WordSegmenter/JA/MeCab.pm blib/lib/Lingua/TFIDF/WordSegmenter/JA/MeCab.pm
cp lib/Lingua/TFIDF/Types.pm blib/lib/Lingua/TFIDF/Types.pm
cp lib/Lingua/TFIDF/WordCounter/Simple.pm blib/lib/Lingua/TFIDF/WordCounter/Simple.pm
cp lib/Lingua/TFIDF/WordSegmenter/SplitBySpace.pm blib/lib/Lingua/TFIDF/WordSegmenter/SplitBySpace.pm
cp lib/Lingua/TFIDF/WordSegmenter/LetterNgram.pm blib/lib/Lingua/TFIDF/WordSegmenter/LetterNgram.pm
cp lib/Lingua/TFIDF/WordCounter/Lossy.pm blib/lib/Lingua/TFIDF/WordCounter/Lossy.pm
Manifying 7 pod documents
SEKIA/Lingua-TFIDF-0.01.tar.gz
/usr/bin/make -- OK
Running make test for SEKIA/Lingua-TFIDF-0.01.tar.gz
PERL_DL_NONLAZY=1
"/usr/bin/perl" "-MExtUtils::Command::MM" "-MTest::Harness" "-e" "undef
*Test::Harness::Switches; test_harness(0, 'blib/lib', 'blib/arch')"
t/Lingua/*.t t/Lingua/TFIDF/WordSegmenter/*.t
t/Lingua/TFIDF/WordSegmenter/JA/*.t
t/Lingua/TFIDF.t ............................. ok
t/Lingua/TFIDF/WordSegmenter/JA/MeCab.t ...... 1/?
# Failed test 'Lingua::TFIDF::WordSegmenter::JA::MeCab->new() died'
# at t/Lingua/TFIDF/WordSegmenter/JA/MeCab.t line 8.
# Error was: Failed to create mecab instance at /usr/lib/i386-linux-gnu/perl5/5.22/Text/MeCab.pm line 64.
Can't call method "segment" on an undefined value at t/Lingua/TFIDF/WordSegmenter/JA/MeCab.t line 17.
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 255 just after 2.
t/Lingua/TFIDF/WordSegmenter/JA/MeCab.t ...... Dubious, test returned 255 (wstat 65280, 0xff00)
Failed 1/2 subtests
t/Lingua/TFIDF/WordSegmenter/LetterNgram.t ... ok
t/Lingua/TFIDF/WordSegmenter/SplitBySpace.t .. ok
Test Summary Report
-------------------
t/Lingua/TFIDF/WordSegmenter/JA/MeCab.t (Wstat: 65280 Tests: 2 Failed: 1)
Failed test: 2
Non-zero exit status: 255
Parse errors: No plan found in TAP output
Files=4, Tests=16, 0 wallclock secs ( 0.03 usr 0.01 sys + 0.42 cusr 0.04 csys = 0.50 CPU)
Result: FAIL
Failed 1/4 test programs. 1/16 subtests failed.
Makefile:890: die Regel für Ziel „test_dynamic“ scheiterte
make: *** [test_dynamic] Fehler 255
SEKIA/Lingua-TFIDF-0.01.tar.gz
/usr/bin/make test -- NOT OK
//hint// to see the cpan-testers results for installing this module, try:
reports SEKIA/Lingua-TFIDF-0.01.tar.gz
Running make install for SEKIA/Lingua-TFIDF-0.01.tar.gz
Manifying 7 pod documents
Installing /usr/local/share/perl/5.22.1/Lingua/TFIDF.pm
Installing /usr/local/share/perl/5.22.1/Lingua/TFIDF/Types.pm
Installing /usr/local/share/perl/5.22.1/Lingua/TFIDF/WordSegmenter/LetterNgram.pm
Installing /usr/local/share/perl/5.22.1/Lingua/TFIDF/WordSegmenter/SplitBySpace.pm
Installing /usr/local/share/perl/5.22.1/Lingua/TFIDF/WordSegmenter/JA/MeCab.pm
Installing /usr/local/share/perl/5.22.1/Lingua/TFIDF/WordCounter/Lossy.pm
Installing /usr/local/share/perl/5.22.1/Lingua/TFIDF/WordCounter/Simple.pm
Installing /usr/local/man/man3/Lingua::TFIDF::WordCounter::Lossy.3pm
Installing /usr/local/man/man3/Lingua::TFIDF::WordSegmenter::JA::MeCab.3pm
Installing /usr/local/man/man3/Lingua::TFIDF::Types.3pm
Installing /usr/local/man/man3/Lingua::TFIDF.3pm
Installing /usr/local/man/man3/Lingua::TFIDF::WordSegmenter::SplitBySpace.3pm
Installing /usr/local/man/man3/Lingua::TFIDF::WordSegmenter::LetterNgram.3pm
Installing /usr/local/man/man3/Lingua::TFIDF::WordCounter::Simple.3pm
Appending installation info to /usr/lib/i386-linux-gnu/perl/5.22/perllocal.pod
SEKIA/Lingua-TFIDF-0.01.tar.gz
/usr/bin/make install -- OK
Failed during this command:
SEKIA/Lingua-TFIDF-0.01.tar.gz : make_test NO but failure ignored because 'force' in effect
So, let's have a look ...
#!/usr/bin/perl
# tfidf-demo.pl
use Lingua::TFIDF;
use Lingua::TFIDF::WordSegmenter::SplitBySpace;
my $tf_idf_calc = Lingua::TFIDF->new(
# Use a simple split-by-whitespace word segmenter.
word_segmenter => Lingua::TFIDF::WordSegmenter::SplitBySpace->new,
);
my $document1 = 'Humpty Dumpty sat on a wall...';
my $document2 = 'Remember, remember, the fifth of November...';
my $tf = $tf_idf_calc->tf(document => $document1);
# TF of word "Dumpty" in $document1.
say "Say 1: ", $tf->{'Dumpty'}; # 2, if you are referring same text as mine.
my $idf = $tf_idf_calc->idf(documents => [$document1, $document2]);
say "Say 2: ", $idf->{'Dumpty'}; # log(2/1) ≒ 0.693147
my $tf_idfs = $tf_idf_calc->tf_idf(documents => [$document1, $document2]);
# TF-IDF of word "Dumpty" in $document1.
say "Say 3: ", $tf_idfs->[0]{'Dumpty'}; # 2 log(2/1) ≒ 1.386294
# Ditto. But in $document2.
say "Say 4: ", $tf_idfs->[1]{'Dumpty'}; # 0
Can't call method "say" on unblessed reference at tfidf-demo.pl line 19.
...
# tfidf-demo.pl
use Lingua::TFIDF;
use Lingua::TFIDF::WordSegmenter::SplitBySpace;
use feature qw(say);
# Main program
...
Works. Great.
Does it really work?
Code
#!/usr/bin/perl
# tfidf-demo.pl
use strict;
use warnings;
use Lingua::TFIDF;
use Lingua::TFIDF::WordSegmenter::SplitBySpace;
use feature qw(say);
# Main program
my $tf_idf_calc = Lingua::TFIDF->new(
# Use a simple split-by-whitespace word segmenter.
word_segmenter => Lingua::TFIDF::WordSegmenter::SplitBySpace->new,
);
my $document1 = 'Humpty Dumpty sat on a wall Honky Dory Donkey';
my $document2 = 'Remember remember the fifth of November Humpty Donkey Fireday';
my @document1_token = split ( " ", $document1 );
my @document2_token = split ( " ", $document2 );
my $tf = $tf_idf_calc->tf(document => $document1);
my $idf = $tf_idf_calc->idf(documents => [$document1, $document2]);
my $tf_idfs = $tf_idf_calc->tf_idf(documents => [$document1, $document2]);
foreach ( @document1_token ) {
# TF-IDF of word $_ in $document1.
say "Say $_, doc1: ", $tf_idfs->[0]{$_};
# Ditto. But in $document2.
say "Say $_, doc2: ", $tf_idfs->[1]{$_};
}
Output
perl tfidf-demo.pl
Say Humpty, doc1: 0
Say Humpty, doc2: 0
Say Dumpty, doc1: 0.693147180559945
Use of uninitialized value in say at tfidf-demo.pl line 34.
Say Dumpty, doc2:
Say sat, doc1: 0.693147180559945
Use of uninitialized value in say at tfidf-demo.pl line 34.
Say sat, doc2:
Say on, doc1: 0.693147180559945
Use of uninitialized value in say at tfidf-demo.pl line 34.
Say on, doc2:
Say a, doc1: 0.693147180559945
Use of uninitialized value in say at tfidf-demo.pl line 34.
Say a, doc2:
Say wall, doc1: 0.693147180559945
Use of uninitialized value in say at tfidf-demo.pl line 34.
Say wall, doc2:
Say Honky, doc1: 0.693147180559945
Use of uninitialized value in say at tfidf-demo.pl line 34.
Say Honky, doc2:
Say Dory, doc1: 0.693147180559945
Use of uninitialized value in say at tfidf-demo.pl line 34.
Say Dory, doc2:
Say Donkey, doc1: 0
Say Donkey, doc2: 0
#!/usr/bin/perl
# tfidf-demo.pl
use strict;
use warnings;
use DBI;
use ZugangsDaten_postgresql qw($DB_USER $DB_PASSWD);
use Lingua::TFIDF;
use Lingua::TFIDF::WordSegmenter::SplitBySpace;
use feature qw(say);
# Variables
our $dbh;
# Main program
my $tf_idf_calc = Lingua::TFIDF->new(
# Use a simple split-by-whitespace word segmenter.
word_segmenter => Lingua::TFIDF::WordSegmenter::SplitBySpace->new,
);
connect_db();
my $document1 = document_token_select('11111');
my $document2 = document_token_select('44444');
disconnect_db();
print "\nToken von Dokument 1:\n";
print $document1, "\n";
print "\nToken von Dokument 1, Ende:\n";
sleep 11;
print "\nToken von Dokument 2:\n";
print $document2, "\n";
print "\nToken von Dokument 2, Ende:\n";
sleep 11;
my @document1_token = split ( " ", $document1 );
my @document2_token = split ( " ", $document2 );
my $tf = $tf_idf_calc->tf(document => $document1);
my $idf = $tf_idf_calc->idf(documents => [$document1, $document2]);
my $tf_idfs = $tf_idf_calc->tf_idf(documents => [$document1, $document2]);
foreach ( @document1_token ) {
# TF-IDF of word $_ in $document1.
say "Say $_, doc1: ", $tf_idfs->[0]{$_};
# Ditto. But in $document2.
say "Say $_, doc2: ", $tf_idfs->[1]{$_};
}
###########################################################
############### Subroutines ####################
###########################################################
# Subroutines
sub connect_db {
## Establish the DB connection
$dbh = DBI->connect("DBI:Pg:dbname=links;host=localhost", "$DB_USER", "$DB_PASSWD");
}
sub disconnect_db {
$dbh->disconnect();
}
# document_token_select statement
sub document_token_select {
my $link_id = shift;
my $document_token_select = $dbh->prepare("SELECT token FROM (SELECT
token(ts_debug(text)) FROM texts WHERE link_id = $link_id) AS token;");
$document_token_select->execute();
my @document_token;
while ( my $token = $document_token_select->fetchrow() ) {
if ( $token =~ /[a-zA-ZäöüÄÖÜß]+/ ) {
push @document_token, $token;
}
}
my $document_token_string = join ( " ", map { $_ } @document_token );
return $document_token_string
}
Output
...
Say TV-Programm, doc2:
Say TV, doc1: 2.77258872223978
Use of uninitialized value in say at tfidf-demo.pl line 53.
Say TV, doc2:
Say Programm, doc1: 1.38629436111989
Use of uninitialized value in say at tfidf-demo.pl line 53.
Say Programm, doc2:
Say Themen, doc1: 0
Say Themen, doc2: 0
Say Autoren, doc1: 0.693147180559945
Use of uninitialized value in say at tfidf-demo.pl line 53.
Say Autoren, doc2:
Say Spiele, doc1: 0
Say Spiele, doc2: 0
Say Newsletter, doc1: 0
Say Newsletter, doc2: 0
Say WELTPLUS, doc1: 0.693147180559945
Use of uninitialized value in say at tfidf-demo.pl line 53.
Say WELTPLUS, doc2:
Say BUTTON, doc1: 0
Say BUTTON, doc2: 0
Say Politik, doc1: 0
Say Politik, doc2: 0
Say Wirtschaft, doc1: 0
Say Wirtschaft, doc2: 0
Say Finanzen, doc1: 0.693147180559945
Use of uninitialized value in say at tfidf-demo.pl line 53.
Say Finanzen, doc2:
Say Sport, doc1: 0
Say Sport, doc2: 0
Say Panorama, doc1: 0
Say Panorama, doc2: 0
Say Wissen, doc1: 0
Say Wissen, doc2: 0
Say Gesundheit, doc1: 0
Say Gesundheit, doc2: 0
Say Kultur, doc1: 0
Say Kultur, doc2: 0
Say Meinung, doc1: 1.38629436111989
Use of uninitialized value in say at tfidf-demo.pl line 53.
Say Meinung, doc2:
Say Geschichte, doc1: 0
Say Geschichte, doc2: 0
Say Reise, doc1: 0
Say Reise, doc2: 0
Say PS, doc1: 1.38629436111989
Use of uninitialized value in say at tfidf-demo.pl line 53.
Say PS, doc2:
...
Say Bayern, doc1: 0
Say Bayern, doc2: 0
Say Baden-W�rttemberg, doc1: 0.693147180559945
Use of uninitialized value in say at tfidf-demo.pl line 53.
Say Baden-W�rttemberg, doc2:
Say Baden, doc1: 0.693147180559945
Use of uninitialized value in say at tfidf-demo.pl line 53.
Say Baden, doc2:
Say W�rttemberg, doc1: 0.693147180559945
Use of uninitialized value in say at tfidf-demo.pl line 53.
Say W�rttemberg, doc2:
Say Niedersachsen, doc1: 0.693147180559945
Use of uninitialized value in say at tfidf-demo.pl line 53.
Say Niedersachsen, doc2:
Say Bremen, doc1: 0.693147180559945
Use of uninitialized value in say at tfidf-demo.pl line 53.
Say Bremen, doc2:
Say Hessen, doc1: 0.693147180559945
Use of uninitialized value in say at tfidf-demo.pl line 53.
Say Hessen, doc2:
Say Rheinland-Pfalz, doc1: 0.693147180559945
Use of uninitialized value in say at tfidf-demo.pl line 53.
Say Rheinland-Pfalz, doc2:
...
By and large it seems to work well & fast. Two blemishes are still to be fixed: the UTF-8 problem and the uninitialized-value problem. Shouldn't be a big deal (a first idea for the UTF-8 part follows a bit further down).
I've earned a little break now, even though I haven't actually been working for that long ;-) .
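For the UTF-8 blemish, my current suspicion is that the tokens come out of DBD::Pg as raw bytes and then hit a STDOUT that isn't set to UTF-8, hence the � characters. A minimal sketch of what I intend to try - assumptions, not yet verified on my setup; the uninitialized-value blemish is handled directly below:
# Hypothetical UTF-8 handling for tfidf-demo.pl (still to be verified):
use utf8;                               # the source file itself is UTF-8
binmode STDOUT, ':encoding(UTF-8)';     # encode decoded strings on output

# ... and ask DBD::Pg to hand back decoded character strings:
$dbh = DBI->connect(
    "DBI:Pg:dbname=links;host=localhost",
    $DB_USER, $DB_PASSWD,
    { pg_enable_utf8 => 1 },
);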
...
foreach ( @document1_token ) {
# TF-IDF of word $_ in $document1.
if ( not defined $tf_idfs->[0]{$_} ) {
say "Say $_, doc1: undef";
} else { say "Say $_, doc1: ", $tf_idfs->[0]{$_} }
# Ditto. But in $document2.
if ( not defined $tf_idfs->[1]{$_} ) {
say "Say $_, doc2: undef";
} else { say "Say $_, doc2: ", $tf_idfs->[1]{$_} }
}
...
...
Say Premium, doc1: 0.693147180559945
Say Premium, doc2: undef
Say Aromen, doc1: 5.54517744447956
Say Aromen, doc2: undef
Say aus, doc1: 0
Say aus, doc2: 0
Say dem, doc1: 0.693147180559945
Say dem, doc2: undef
Say Hause, doc1: 0.693147180559945
Say Hause, doc2: undef
Say German, doc1: 0.693147180559945
Say German, doc2: undef
Say Liquid, doc1: 4.15888308335967
Say Liquid, doc2: undef
Say s, doc1: 0.693147180559945
Say s, doc2: undef
Say Anzeigen, doc1: 0.693147180559945
Say Anzeigen, doc2: undef
Say Kacheln, doc1: 0.693147180559945
Say Kacheln, doc2: undef
Say Liste, doc1: 0.693147180559945
Say Liste, doc2: undef
...
This shows me that all tokens contained in both documents get assigned the value 0. Calculated values only show up where one of the two (???) is "undef" ... and right there I spot an error in my program!
Code
...
my %vector_token;
foreach ( @document1_token ) {
if ( not exists $vector_token{$_} ) { $vector_token{$_} = 1 }
}
foreach ( @document2_token ) {
if ( not exists $vector_token{$_} ) { $vector_token{$_} = 1 }
}
my $tf = $tf_idf_calc->tf(document => $document1);
my $idf = $tf_idf_calc->idf(documents => [$document1, $document2]);
my $tf_idfs = $tf_idf_calc->tf_idf(documents => [$document1, $document2]);
foreach ( sort { $a cmp $b } keys %vector_token ) {
# TF-IDF of word $_ in $document1.
if ( not defined $tf_idfs->[0]{$_} ) {
say "Say $_, doc1: undef";
} else { say "Say $_, doc1: ", $tf_idfs->[0]{$_} }
# Ditto. But in $document2.
if ( not defined $tf_idfs->[1]{$_} ) {
say "Say $_, doc2: undef";
} else { say "Say $_, doc2: ", $tf_idfs->[1]{$_} }
}
...
Output
...
Say Batterieentsorgung, doc1: 0.693147180559945
Say Batterieentsorgung, doc2: undef
Say Beginn, doc1: undef
Say Beginn, doc2: 0.693147180559945
Say Benzinpreis, doc1: undef
Say Benzinpreis, doc2: 1.38629436111989
Say Bereitstellung, doc1: 0.693147180559945
Say Bereitstellung, doc2: undef
Say Bestseller, doc1: undef
Say Bestseller, doc2: 0.693147180559945
Say Bettmann, doc1: undef
Say Bettmann, doc2: 0.693147180559945
Say BeyondTomorrow, doc1: undef
Say BeyondTomorrow, doc2: 0.693147180559945
Say Big, doc1: 6.23832462503951
Say Big, doc2: undef
Say Brutto, doc1: undef
Say Brutto, doc2: 1.38629436111989
Say Brutto-Netto-Rechner, doc1: undef
Say Brutto-Netto-Rechner, doc2: 1.38629436111989
Say Buchrezensionen, doc1: undef
Say Buchrezensionen, doc2: 0.693147180559945
Say Bull, doc1: 1.38629436111989
Say Bull, doc2: undef
Say Bundesliga, doc1: undef
Say Bundesliga, doc2: 0.693147180559945
Say Burner, doc1: 2.77258872223978
Say Burner, doc2: undef
Say Business, doc1: undef
Say Business, doc2: 1.38629436111989
Say Bu�geldrechner, doc1: undef
Say Bu�geldrechner, doc2: 1.38629436111989
Say B�rse, doc1: undef
Say B�rse, doc2: 2.07944154167984
...
Code for TF
...
my $tf1 = $tf_idf_calc->tf(document => $document1);
my $tf2 = $tf_idf_calc->tf(document => $document2);
my $idf = $tf_idf_calc->idf(documents => [$document1, $document2]);
my $tf_idfs = $tf_idf_calc->tf_idf(documents => [$document1, $document2]);
foreach ( sort { $a cmp $b } keys %vector_token ) {
# TF of word $_ in $document1.
if ( not defined $tf1->{$_} ) {
say "Say $_, doc1: undef";
} else { say "Say $_, doc1: ", $tf1->{$_} }
# Ditto. But in $document2.
if ( not defined $tf2->{$_} ) {
say "Say $_, doc2: undef";
} else { say "Say $_, doc2: ", $tf2->{$_} }
}
print "\nPause!\n";
sleep 11;
...
...
Say Bestseller, doc1: undef
Say Bestseller, doc2: 1
Say Bettmann, doc1: undef
Say Bettmann, doc2: 1
Say BeyondTomorrow, doc1: undef
Say BeyondTomorrow, doc2: 1
Say Big, doc1: 9
Say Big, doc2: undef
Say Brutto, doc1: undef
Say Brutto, doc2: 2
Say Brutto-Netto-Rechner, doc1: undef
Say Brutto-Netto-Rechner, doc2: 2
Say Buchrezensionen, doc1: undef
Say Buchrezensionen, doc2: 1
Say Bull, doc1: 2
Say Bull, doc2: undef
Say Bundesliga, doc1: undef
Say Bundesliga, doc2: 1
Say Burner, doc1: 4
Say Burner, doc2: undef
Say Business, doc1: undef
Say Business, doc2: 2
Say Bu�geldrechner, doc1: undef
Say Bu�geldrechner, doc2: 2
Say B�rse, doc1: undef
Say B�rse, doc2: 3
Say B�cher, doc1: undef
Say B�cher, doc2: 2
Say CHRONIK, doc1: undef
Say CHRONIK, doc2: 1
Say Champions, doc1: undef
Say Champions, doc2: 1
Say Clark, doc1: 2
Say Clark, doc2: undef
Say Coils, doc1: 1
Say Coils, doc2: undef
Say Coilstore, doc1: 2
Say Coilstore, doc2: undef
...
Code for IDF
...
foreach ( sort { $a cmp $b } keys %vector_token ) {
# IDF of word $_ in $document1.
if ( not defined $idf->{$_} ) {
say "Say $_, doc1: undef";
} else { say "Say $_, doc1: ", $idf->{$_} }
# Ditto. But in $document2.
if ( not defined $idf->{$_} ) {
say "Say $_, doc2: undef";
} else { say "Say $_, doc2: ", $idf->{$_} }
}
print "\nPause!\n";
sleep 11;
...
Output
...
Say Apps, doc1: 0.693147180559945
Say Apps, doc2: 0.693147180559945
Say Archiv, doc1: 0.693147180559945
Say Archiv, doc2: 0.693147180559945
Say Archive, doc1: 0.693147180559945
Say Archive, doc2: 0.693147180559945
Say Aroma, doc1: 0.693147180559945
Say Aroma, doc2: 0.693147180559945
Say Aromen, doc1: 0.693147180559945
Say Aromen, doc2: 0.693147180559945
Say Artikel, doc1: 0
Say Artikel, doc2: 0
Say Arztsuche, doc1: 0.693147180559945
Say Arztsuche, doc2: 0.693147180559945
Say Aspire, doc1: 0.693147180559945
Say Aspire, doc2: 0.693147180559945
...
This tells me that I don't understand the IDF output yet ;-) . Time will tell. Easy does it.
"Lesen, verstehen.", heißt der Zauberspruch!
...............................................................................................................................
A world of super-dupers, and - unfortunately - of bad formatting. For that, at least, I have to apologize ;-)
...............................................................................................................................
TO BE CONTINUED / CONSIDER YOURSELF WARNED!