My Bumpy Road to Automatic Text Categorization with Perl Modules / Algorithms - Part I




If I manage to walk this road to the end, I may follow up with a more understandable explanation in the form of a summary. Maybe, possibly. If this already serves as inspiration for anyone, please help yourself! ( Everyone else, please just ignore it! )


Näggl mit Köppn - Probelauf 1 - Hembelz Om(x) 

( easier to read at this link )



Others successfully use other people's algorithms:


And now it is time for me, too, to try my luck with ready-made algorithms. Stealing, as in the example above, is not my intention anyway; I would rather lay my cards on the table, as usual. So: okay, green light! (says my conscience department)

And with that I will simply get started, roughly and casually, and above all simply:


First steps

Installing the module

 

sudo perl -MCPAN -e shell

cpan[1]> install Algorithm::Kmeanspp


...
...............................................................DONE
Fetching with LWP:
http://www.cpan.org/modules/03modlist.data.gz
Reading '/home/zarko/.local/share/.cpan/sources/modules/03modlist.data.gz'
DONE
Writing /home/zarko/.local/share/.cpan/Metadata
Running install for module 'Algorithm::Kmeanspp'
Fetching with LWP:
http://www.cpan.org/authors/id/F/FU/FUJISAWA/Algorithm-Kmeanspp-0.03.tar.gz
Fetching with LWP:
http://www.cpan.org/authors/id/F/FU/FUJISAWA/CHECKSUMS
Checksum for /home/zarko/.local/share/.cpan/sources/authors/id/F/FU/FUJISAWA/Algorithm-Kmeanspp-0.03.tar.gz ok
Scanning cache /home/zarko/.local/share/.cpan/build for sizes
............................................................................DONE
'YAML' not installed, will not store persistent state
Configuring F/FU/FUJISAWA/Algorithm-Kmeanspp-0.03.tar.gz with Makefile.PL
Bareword "use_test_base" not allowed while "strict subs" in use at Makefile.PL line 13.
Execution of Makefile.PL aborted due to compilation errors.
Warning: No success on command[/usr/bin/perl Makefile.PL INSTALLDIRS=site]
  FUJISAWA/Algorithm-Kmeanspp-0.03.tar.gz
  /usr/bin/perl Makefile.PL INSTALLDIRS=site -- NOT OK
Failed during this command:
 FUJISAWA/Algorithm-Kmeanspp-0.03.tar.gz      : writemakefile NO '/usr/bin/perl Makefile.PL INSTALLDIRS=site' returned status 65280

It seems to me that something went wrong there.
YAML is missing, whatever that is.
Yep, that was it! Easiest! After "install YAML" in the CPAN shell, the installation of the desired algorithm now appears to run through smoothly.



Since the algorithm is apparently very elaborately programmed and the installation takes a while, I quickly write a draft of a program, later to become a subroutine or module, that transforms my tsearch word vectors (tsvector) into vectors in hash form for Perl.


Code

#!/usr/bin/perl

# tsvector2perlhash.pl

use strict;
use warnings;
use DBI;
use ZugangsDaten_postgresql qw($DB_USER $DB_PASSWD);
use Encode qw(is_utf8 decode encode);

# Programm

## Erfragen der Vektor-ID

print "\nBitte die Wortvektor-ID (link_id) eingeben!\n";
my $link_id = <STDIN>;
chomp $link_id;

## Ausgabe des Vektors am Bildschirm als String

connect_db;
my $vector = vector2hash($link_id);
disconnect_db;

print "\nDer ermittelte Wortvektor sieht so aus:\n\n";
print $vector;
print "\nZufrieden mit dem Zwischenergebnis?\n";



###########################################################
############### Subroutinen ####################
###########################################################

# Subroutinen für Export

sub connect_db {
    ## Verbindung zur DB herstellen
    $dbh = DBI->connect("DBI:Pg:dbname=links;host=localhost", "$DB_USER", "$DB_PASSWD");
}

sub disconnect_db {
    ## Verbindung zur DB trennen
    $dbh->disconnect();
}

sub vector2hash {
    my $link_id = shift;
    my $vector_select = $dbh->prepare("SELECT vector FROM wordvectors WHERE link_id = $link_id;");
    $vector_select->execute();
    my $vector_string = $vector_select->fetchrow;
    return $vector_string
}

 

Output

 

Hier meldet sich die "ZugangsDaten_postgres.pm": Huhuhu!

Global symbol "$dbh" requires explicit package name (did you forget to declare "my $dbh"?) at tsvector2perlhash.pl line 39.
Global symbol "$dbh" requires explicit package name (did you forget to declare "my $dbh"?) at tsvector2perlhash.pl line 44.
Global symbol "$dbh" requires explicit package name (did you forget to declare "my $dbh"?) at tsvector2perlhash.pl line 49.
Bareword "connect_db" not allowed while "strict subs" in use at tsvector2perlhash.pl line 21.
Bareword "disconnect_db" not allowed while "strict subs" in use at tsvector2perlhash.pl line 23.
Execution of tsvector2perlhash.pl aborted due to compilation errors.


Code change

 

#!/usr/bin/perl

# tsvector2perlhash.pl

use strict;
use warnings;
use DBI;
use ZugangsDaten_postgresql qw($DB_USER $DB_PASSWD);
use Encode qw(is_utf8 decode encode);

# Variablen

my $dbh;


# Programm

## Erfragen der Vektor-ID

print "\nBitte die Wortvektor-ID (link_id) eingeben!\n";
my $link_id = <STDIN>;
chomp $link_id;

## Ausgabe des Vektors am Bildschirm als String

connect_db();
my $vector = vector2hash($link_id);
disconnect_db();

print "\nDer ermittelte Wortvektor sieht so aus:\n\n";
print $vector;
print "\nZufrieden mit dem Zwischenergebnis?\n";



###########################################################
############### Subroutinen ####################
###########################################################

# Subroutinen für Export

sub connect_db {
    ## Verbindung zur DB herstellen
    $dbh = DBI->connect("DBI:Pg:dbname=links;host=localhost", "$DB_USER", "$DB_PASSWD");
}

sub disconnect_db {
    ## Verbindung zur DB trennen
    $dbh->disconnect();
}

sub vector2hash {
    my $link_id = shift;
    my $vector_select = $dbh->prepare("SELECT vector FROM wordvectors WHERE link_id = $link_id;");
    $vector_select->execute();
    my $vector_string = $vector_select->fetchrow;
    return $vector_string
}

 

Output

 

Hier meldet sich die "ZugangsDaten_postgres.pm": Huhuhu!


Bitte die Wortvektor-ID (link_id) eingeben!
55555
DBD::Pg::st execute failed: ERROR:  column "vector" does not exist
LINE 1: SELECT vector FROM wordvectors WHERE link_id = 55555;
               ^ at tsvector2perlhash.pl line 55, <STDIN> line 1.
DBD::Pg::st fetchrow failed: no statement executing at tsvector2perlhash.pl line 56, <STDIN> line 1.

Der ermittelte Wortvektor sieht so aus:

Use of uninitialized value $vector in print at tsvector2perlhash.pl line 31, <STDIN> line 1.

Zufrieden mit dem Zwischenergebnis?

 

Final draft ;-) (or rather, a change to one line)

my $vector_select = $dbh->prepare("SELECT wordvector FROM wordvectors WHERE link_id = $link_id;");

 

Final draft output

 

Hier meldet sich die "ZugangsDaten_postgres.pm": Huhuhu!


Bitte die Wortvektor-ID (link_id) eingeben!
55555

Der ermittelte Wortvektor sieht so aus:

Wide character in print at tsvector2perlhash.pl line 31, <STDIN> line 1.
'-0':16383 '-00':16383 '-010':16383 '-0168':16383 '-02':16383 '-02049':16383 '-0404':16383 '-0481':16383 '-06':62,16383 '-0716':16383 '-0822':16383 '-09':16383 '-1':16383 '-11':16383 '-11482':16383 '-12':6610,6611,15687,15688 '-125':16383 '-127':16383 '-1614':16383 '-17':16383 '-1746':16383 '-177446':16383 '-18':63 '-19':16383 '-1976':16383 '-2':16383 '-20130127':16383 '-20735':16383 '-237':16383 '-239':16383 '-24':16383 '-25680':16383 '-269':16383 '-28':8044,16383 '-3':16383 '-304':16383 '-306':16383 '-307':16383 '-3077':16383 '-312':16383 '-313':16383 '-316':16383 '-333':16383 '-33874':16383 '-345':16383 '-34969':16383 '-35338':16383 '-36':16383 '-3636':16383 '-37':16383 '-393':16383 '-4':16383 '-4000':16383 '-4165':16383 '-451':16383 '-5':16383 '-512941':16383 '-516921':16383 '-5248':16383 '-525':16383 '-531':16383 '-534':16383 '-5494':16383 '-553':16383 '-55652':16383 '-56025':16383 '-56976':16383 '-57':16383 '-57215':16383 '-59240':16383 '-59256':16383 '-6':16383 '-600':16383 '-60398':16383 '-61':16383 '-61613':16383 '-6209':16383 '-7':16383 '-705':16383 '-7119':16383 '-73':16383 '-731':16383 '-733':16383 '-7432'
...
-yo':16383 'yogi':11883 'yoko':8881,9460,10541,13640,16383 'york':5185,5688,6211,6293,6304,6357,6385,6833,9753,10745,13083,13103,13176,14356,15890,16383 'you':989,2566,3218,3231,3242,3263,4227,4375,4677,7954,9218,10673,10783,11684,12060,12320,12330,13257,13562,13587,13659,13711,13887,13964,14061,14206,14340,14787,15817,16301,16383 'young':775,912,16383 'your':8654,16306,16383 'yourself':16383 'youssou':16383 'youth':5278,5618,11968,16383 'youtub':16383 'yvonn':16383 'zenith':1210,1274,16383 'zeppelin':16383 'zoo':16383 'zubin':16383 'à':10569 'ádám':16383 'álvaro':16383 'íslenska':16383 'čeština':16383 'ελληνικά':16383 'беларуская':16383 'български':16383 'в':4842,16383 'македонски':16383 'монгол':16383 'нохчийн':16383 'русиньскый':16383 'русский':16383 'снова':4841,16383 'српски':16383 'српскохрватски':16383 'ссср':4843,16383 'тарашкевіца':16383 'українська':16383 'ўзбекча':16383 'қазақша':16383 'հայերեն':16383 'ייִדיש':16383 'עברית':16383 'اردو':16383 'العربية':16383 'فارسی':16383 'مصرى':16383 'कोंकणी':16383 'गोंयची':16383 'नेपाली':16383 'मराठी':16383 'हिन्दी':16383 'বাংলা':16383 'മലയാളം':16383 'ไทย':16383 'მარგალური':16383 'ქართული':16383 '中文':16383 '日本語':16383 '粵語':16383 '한국어':16383

 
Zufrieden mit dem Zwischenergebnis?
Yes, satisfied.
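The "Wide character in print" warning in the output above most likely appears because the vector contains decoded characters (for example 'čeština' or '日本語') while STDOUT has no encoding layer. A minimal sketch of two common remedies, assuming a UTF-8 terminal; the sample string is just a fragment in the style of the vector above, not the real data:

```perl
use strict;
use warnings;
use utf8;                 # the string literal below contains non-ASCII characters
use Encode qw(encode);

# Remedy 1: set the output layer once, near the top of the script.
# (Calling binmode with the same layer repeatedly would stack layers.)
binmode(STDOUT, ':encoding(UTF-8)');

my $vector = "'čeština':16383 '日本語':16383";
print $vector, "\n";      # characters are now encoded on the way out

# Remedy 2: encode explicitly to bytes, e.g. before writing to a socket or file.
my $bytes = encode('UTF-8', $vector);
printf "%d characters, %d bytes\n", length $vector, length $bytes;
```

In tsvector2perlhash.pl the single binmode line after the use statements would presumably be enough.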

( I had to do this in a quick-and-dirty fashion because someone here is shuffling their feet, wanting to get to the TV in the living room, after I had asked to be allowed to close the door behind me for at least half an hour so that I can concentrate on this matter, which is important to me. Real understanding is no longer to be expected in this life, oh well. )



 

Trying out the demo program

 

Code

 

#!/usr/bin/perl

# kmeanspp-demo.pl


use Algorithm::Kmeanspp;
 
# input documents
my %documents = (
    Alex => { 'Pop'     => 10, 'R&B'    => 6, 'Rock'   => 4 },
    Bob  => { 'Jazz'    => 8,  'Reggae' => 9                },
    Dave => { 'Classic' => 4,  'World'  => 4                },
    Ted  => { 'Jazz'    => 9,  'Metal'  => 2, 'Reggae' => 6 },
    Fred => { 'Hip-hop' => 3,  'Rock'   => 3, 'Pop'    => 3 },
    Sam  => { 'Classic' => 8,  'Rock'   => 1                },
);
 
my $kmp = Algorithm::Kmeanspp->new;
 
foreach my $id (keys %documents) {
    $kmp->add_document($id, $documents{$id});
}
 
my $num_cluster = 3;
my $num_iter    = 20;
$kmp->do_clustering($num_cluster, $num_iter);            
 
# show clustering result
foreach my $cluster (@{ $kmp->clusters }) {
    print join "\t", @{ $cluster };
    print "\n";
}
# show cluster centroids
foreach my $centroid (@{ $kmp->centroids }) {
    print join "\t", map { sprintf "%s:%.4f", $_, $centroid->{$_} }
        keys %{ $centroid };
    print "\n";
}

 

Output

 

Can't locate Algorithm/Kmeanspp.pm in @INC (you may need to install the Algorithm::Kmeanspp module) (@INC contains: /etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.26.1 /usr/local/share/perl/5.26.1 /usr/lib/x86_64-linux-gnu/perl5/5.26 /usr/share/perl5 /usr/lib/x86_64-linux-gnu/perl/5.26 /usr/share/perl/5.26 /usr/local/lib/site_perl /usr/local/lib/x86_64-linux-gnu/perl/5.26.0 /usr/local/share/perl/5.26.0 /usr/lib/x86_64-linux-gnu/perl-base) at kmeanspp-demo.pl line 6.
BEGIN failed--compilation aborted at kmeanspp-demo.pl line 6.
Well, too bad. The time window is closed now. I will have to finish this at home. But I got quite far for such a quick session. Great!
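The "Can't locate Algorithm/Kmeanspp.pm in @INC" message above means that Perl searched every directory in @INC and found no such file, so the earlier installation cannot have completed. A small sketch for checking this without running the whole demo; the helper name module_path is made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: return the file a module was loaded from,
# or undef if it cannot be loaded from any directory in @INC.
sub module_path {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;   # Algorithm::Kmeanspp -> Algorithm/Kmeanspp.pm
    return eval { require $file; 1 } ? $INC{$file} : undef;
}

for my $module (qw(List::Util Algorithm::Kmeanspp)) {
    my $path = module_path($module);
    print "$module: ", defined $path ? $path : "not found in \@INC", "\n";
}
```

If a module was installed into a private directory, that directory has to be part of @INC as well, for example via the PERL5LIB environment variable or a `use lib` line.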


Here at my place I get a strange error message:

cpan install Algorithm::Kmeanspp
Loading internal logger. Log::Log4perl recommended for better logging
CPAN: Storable loaded ok (v2.53_01)
Reading '/home/zarko/.cpan/Metadata'
  Database was generated on Sun, 06 Jan 2019 18:29:02 GMT
Running install for module 'Algorithm::Kmeanspp'
CPAN: Digest::SHA loaded ok (v5.95)
CPAN: Compress::Zlib loaded ok (v2.068)
Checksum for /home/zarko/.cpan/sources/authors/id/F/FU/FUJISAWA/Algorithm-Kmeanspp-0.03.tar.gz ok
CPAN: YAML loaded ok (v1.27)
CPAN: CPAN::Meta::Requirements loaded ok (v2.132)
CPAN: Parse::CPAN::Meta loaded ok (v1.4414)
CPAN: CPAN::Meta loaded ok (v2.150001)
CPAN: Module::CoreList loaded ok (v5.20151213)
Configuring F/FU/FUJISAWA/Algorithm-Kmeanspp-0.03.tar.gz with Makefile.PL
Bareword "use_test_base" not allowed while "strict subs" in use at Makefile.PL line 13.
Execution of Makefile.PL aborted due to compilation errors.
Warning: No success on command[/usr/bin/perl Makefile.PL INSTALLDIRS=site]
  FUJISAWA/Algorithm-Kmeanspp-0.03.tar.gz
  /usr/bin/perl Makefile.PL INSTALLDIRS=site -- NOT OK
Naively, I will now take a look at the Makefile.PL ...


use inc::Module::Install;
name 'Algorithm-Kmeanspp';
all_from 'lib/Algorithm/Kmeanspp.pm';

requires 'Carp';
requires 'Class::Accessor::Fast';
requires 'List::Util';

tests 't/*.t';
author_tests 'xt';

build_requires 'Test::More';
use_test_base;
auto_include;
WriteAll;
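Checked all the requirements. Two were missing. Still the same error. That leaves only this line (presumably a Module::Install extension command that the bundled inc/ directory does not provide):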
use_test_base;

After commenting it out and trying manually:

perl Makefile.PL
Cannot determine perl version info from lib/Algorithm/Kmeanspp.pm
Checking if your kit is complete...
Looks good
Generating a Unix-style Makefile
Writing Makefile for Algorithm::Kmeanspp
Unable to open MakeMaker.tmp: Permission denied at /usr/share/perl/5.22/ExtUtils/MakeMaker.pm line 1173.
Somehow this seems to be missing:

ExtUtils::MakeMaker

It is being installed in the CPAN shell right now. But I can hardly follow what is going on here anymore ;-) .

Permission denied at /usr/local/share/perl/5.22.1/ExtUtils/MakeMaker.pm line 1227.

Why?


...

Maybe there is another KMeans module ... (a poor solution, really, but let's see ...)

Not a good idea.


 newbie, problem in installing module
On Tue, 30 Jan 2001 09:15:27 GMT, Rafael Garcia-Suarez

Quote:

>Pradeep Sethi wrote in comp.lang.perl.misc:
>> Writing Makefile for XML::XPath
>> Unable to open MakeMaker.tmp: Permission denied at
>> /usr/lib/perl5/5.6.0/ExtUtils/MakeMaker.pm line 747.
>(Strange error to occur when you run perl as root.) But this error comes
>from the system, not from perl.
Yes, and it could be a NFS file system mounted without root
permissions.  
Probably not the best idea to install Perl modules as root anyway.  
--
Garry Williams


So, try installing the module as non-root?

No, that is the wrong track too.

Jul 16, 2008; 11:18pm

Re: Apology and install problem

Paul Yachnes

Paul Yachnes wrote:
> Now I get the following error:
>
> Writing Makefile for koha
> Unable to open MakeMaker.tmp: Permission denied at
> /usr/share/perl/5.8/ExtUtils/MakeMaker.pm line 878.
I fixed by changing permissions on the koha folder.

Paul
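The quoted fix suggests that "Unable to open MakeMaker.tmp: Permission denied" is about the directory in which Makefile.PL runs, not about MakeMaker itself. A hedged sketch for checking that, assuming a Unix-like system; dir_status is a made-up helper name, and the idea that the build directory ended up owned by root after an earlier sudo run is only a guess:

```perl
use strict;
use warnings;

# Hypothetical helper: report owner and writability of a directory
# from the point of view of the current user.
sub dir_status {
    my ($dir) = @_;
    my @stat = stat $dir or return "$dir: cannot stat ($!)";
    my $owner = getpwuid($stat[4]) // $stat[4];   # fall back to the numeric uid
    return sprintf "%s: owner=%s, writable=%s",
        $dir, $owner, (-w $dir ? 'yes' : 'no');
}

# Run this from the unpacked distribution directory before "perl Makefile.PL".
print dir_status('.'), "\n";
```

If the answer is "writable=no", changing the owner or the permissions of the folder, as in the quoted mail, would be the thing to try.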



Finally one step further!

perl Makefile.PL
Bareword "use_test_base" not allowed while "strict subs" in use at Makefile.PL line 13.

Simply deleted the line in the file, and:



perl Makefile.PL
Cannot determine perl version info from lib/Algorithm/Kmeanspp.pm
Checking if your kit is complete...
Looks good
Generating a Unix-style Makefile
Writing Makefile for Algorithm::Kmeanspp
Writing MYMETA.yml and MYMETA.json
$ perl Makefile.PL
$ make
$ make test
$ make install
 
https://www.perlmonks.org/?node_id=128077


make
cp lib/Algorithm/Kmeanspp.pm blib/lib/Algorithm/Kmeanspp.pm
Manifying 1 pod document

make test
PERL_DL_NONLAZY=1 "/usr/bin/perl" "-MExtUtils::Command::MM" "-MTest::Harness" "-e" "undef *Test::Harness::Switches; test_harness(0, 'inc', 'blib/lib', 'blib/arch')" t/*.t
t/00_compile.t ..... ok
t/01_basic.t ....... ok
t/02_clustering.t .. ok
All tests successful.
Files=3, Tests=318, 1 wallclock secs ( 0.06 usr 0.02 sys + 0.34 cusr 0.01 csys = 0.43 CPU)
Result: PASS

make install
Manifying 1 pod document
Installing /home/zarko/perl5/lib/perl5/Algorithm/Kmeanspp.pm
Installing /home/zarko/perl5/man/man3/Algorithm::Kmeanspp.3pm
Appending installation info to /home/zarko/perl5/lib/perl5/i686-linux-gnu-thread-multi-64int/perllocal.pod

Well. And now?

cpan[1]> install Algorithm::Kmeanspp
Reading '/home/zarko/.cpan/Metadata'
  Database was generated on Sun, 06 Jan 2019 18:29:02 GMT
Algorithm::Kmeanspp is up to date (0.03).

Looks almost as if that was it. Funny. Let's test it!

 

Output

 

perl kmeanspp-demo.pl
Ted    Bob
Dave    Sam
Fred    Alex
Metal:0.6667    Jazz:5.6667    Reggae:5.0000
World:1.3333    Rock:0.3333    Classic:4.0000
Pop:3.2500    R&B:1.5000    Hip-hop:0.7500    Rock:1.7500
 
So. Great, great. This is how unspectacularly this troubleshooting hunt comes to an end.
Next step:
Converting my word vectors to hashes.
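One possible shape for that conversion, as a sketch rather than the final subroutine: it parses the text form of a tsvector as printed above ('lexeme':pos,pos ...) and counts the positions per lexeme, which yields the kind of feature => value hash that the demo passes to add_document. The name tsvector_to_hash is made up here, and using the position count as the value is an assumption. Note that PostgreSQL caps positions at 16383, so for long texts this count is only a rough stand-in for the term frequency.

```perl
use strict;
use warnings;

# Hypothetical helper: text form of a tsvector -> { lexeme => number of positions }.
sub tsvector_to_hash {
    my ($tsvector) = @_;
    my %vector;
    # Each entry looks like 'lexeme':pos,pos,... ; a quote inside a lexeme is doubled ('').
    # Positions may carry a weight letter A-D; an entry may also have no positions at all.
    while ($tsvector =~ /'((?:[^']|'')+)'(?::([0-9A-D,]+))?/g) {
        my ($lexeme, $positions) = ($1, $2);
        $lexeme =~ s/''/'/g;
        my @positions = defined $positions ? split(/,/, $positions) : ();
        $vector{$lexeme} = @positions ? scalar @positions : 1;
    }
    return \%vector;
}

my $hash = tsvector_to_hash("'-06':62,16383 'york':5185,5688,6211 'zoo':16383");
print "$_ => $hash->{$_}\n" for sort keys %$hash;
```

A hash like this could then be handed to $kmp->add_document($link_id, $hash), in the same way the demo does with its music-taste hashes.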


 

The real next step

 

I am still looking for the right question. The answer I want to find should merely enable me to take the next sensible step with my data, using the K-Means++ algorithm module.
But how do I find it?
This here offers a lot, but apparently far too much. I am not that far along yet. Or am I?

D:/Uni/dipl-Arbeit/Ausarbeitung/Verschriftlichung/DA.dvi - hennig_2005a.pdf
 
Before an analysis, it must be determined with respect to which variables the objects are to be compared. Then a measure must be chosen that numerically expresses the similarity or dissimilarity between the objects. Since variables are usually stored as numeric codes, each object is represented as a point in a finite-dimensional space. Its dimension equals the number of analysis variables. As measures of dissimilarity, metrics on finite-dimensional real spaces, or quantities derived from them, are used, such as the Euclidean metric or its squared value.
Microsoft Word - HowToDoCA_011023.doc - how-to10mwcz.pdf
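To make the quoted passage concrete: with documents stored as hashes, the Euclidean metric and its squared value can be computed as below. This is only a sketch of the measure named in the quote, using two of the demo's sample users as data; the function names are made up, and it is not meant to describe how Algorithm::Kmeanspp measures distance internally.

```perl
use strict;
use warnings;

# Squared Euclidean distance between two sparse vectors given as hash refs.
# A key missing from one vector counts as 0 in that dimension.
sub squared_euclidean {
    my ($u, $v) = @_;
    my %keys = map { $_ => 1 } keys %$u, keys %$v;
    my $sum = 0;
    for my $k (keys %keys) {
        my $diff = ($u->{$k} // 0) - ($v->{$k} // 0);
        $sum += $diff ** 2;
    }
    return $sum;
}

# Euclidean distance: square root of the sum above.
sub euclidean { return sqrt squared_euclidean(@_) }

my %bob = (Jazz => 8, Reggae => 9);
my %ted = (Jazz => 9, Metal => 2, Reggae => 6);
printf "d(Bob, Ted) = %.4f\n", euclidean(\%bob, \%ted);
```

The number of distinct keys across all hashes plays the role of the dimension mentioned in the quote.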
 


That finally seems to be it:

Define similarity measures!

 

https://www.google.com/search?client=ubuntu&channel=fs&q=%C3%84hnlichkeitsma%C3%9Fe+textanalyse&ie=utf-8&oe=utf-8
probe.pdf Multimedia Retrieval im WS 2011/2012 6. Ähnlichkeitsmaße - MMR06.pdf Clusteranalyse Microsoft PowerPoint - M3_Vorlesung_6_ CA_mit_PVL - M3_Vorlesung_6_-CA.pdf Ähnlichkeitsmaße clusteranalyse.fm - clusteranalyse.pdf skript_clusteranalyse_sose2011.pdf Microsoft PowerPoint - meth11 - meth11.pdf Ähnlichkeitsmaße für Vektoren - Haenelt_VektorAehnlichkeit.pdf Ähnlichkeitsanalyse – Wikipedia
 
A lot to look through.

 

TF-IDF

 

That the path somehow has to go via tf-idf is something I, as a RapidMiner user, could and should really have figured out earlier.


And, as far as I can sense it at the moment, these values have to be converted into a vector value (e.g. a value between 0 and 1). Somehow. But I am gradually getting closer and closer to that "somehow".
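One common reading of "a value between 0 and 1" is to scale each document vector to unit Euclidean length, so that non-negative weights such as tf-idf scores all end up in the range 0 to 1. A minimal sketch under that assumption; l2_normalize is a made-up name, and whether this is the right normalisation for the clustering step is still open:

```perl
use strict;
use warnings;

# Scale a sparse vector (hash ref) to unit Euclidean length; returns a new hash ref.
sub l2_normalize {
    my ($vec) = @_;
    my $norm = 0;
    $norm += $_ ** 2 for values %$vec;
    $norm = sqrt $norm;
    return +{ %$vec } if $norm == 0;     # all-zero vector: nothing to scale
    return +{ map { $_ => $vec->{$_} / $norm } keys %$vec };
}

my $unit = l2_normalize({ jazz => 3, reggae => 4 });
printf "%s => %.2f\n", $_, $unit->{$_} for sort keys %$unit;
```

After this step long and short documents become comparable, because only the direction of the vector is kept, not its length.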

 

Next Perl module to try out

 

 

Path to mecab config? [/usr/bin/mecab-config]
 
install Text::MeCab
Running install for module 'Text::MeCab'
  DMAKI/Text-MeCab-0.20016.tar.gz
  Has already been unwrapped into directory /home/zarko/.cpan/build/Text-MeCab-0.20016-0
  DMAKI/Text-MeCab-0.20016.tar.gz
  No 'Makefile' created, not re-running

cpan[3]> install Lingua::TFIDF
Running install for module 'Lingua::TFIDF'
  SEKIA/Lingua-TFIDF-0.01.tar.gz
  Has already been unwrapped into directory /home/zarko/.cpan/build/Lingua-TFIDF-0.01-0
  SEKIA/Lingua-TFIDF-0.01.tar.gz
  Has already been prepared
  SEKIA/Lingua-TFIDF-0.01.tar.gz
  Has already been made
Running make test for SEKIA/Lingua-TFIDF-0.01.tar.gz
PERL_DL_NONLAZY=1 "/usr/bin/perl" "-MExtUtils::Command::MM" "-MTest::Harness" "-e" "undef *Test::Harness::Switches; test_harness(0, 'blib/lib', 'blib/arch')" t/Lingua/*.t t/Lingua/TFIDF/WordSegmenter/*.t t/Lingua/TFIDF/WordSegmenter/JA/*.t
t/Lingua/TFIDF.t ............................. ok
t/Lingua/TFIDF/WordSegmenter/JA/MeCab.t ...... 1/?
#   Failed test 'use Lingua::TFIDF::WordSegmenter::JA::MeCab;'
#   at t/Lingua/TFIDF/WordSegmenter/JA/MeCab.t line 6.
#     Tried to use 'Lingua::TFIDF::WordSegmenter::JA::MeCab'.
#     Error:  Can't locate Text/MeCab.pm in @INC (you may need to install the Text::MeCab module) (@INC contains: /home/zarko/.cpan/build/Lingua-TFIDF-0.01-0/blib/lib /home/zarko/.cpan/build/Lingua-TFIDF-0.01-0/blib/arch /etc/perl /usr/local/lib/i386-linux-gnu/perl/5.22.1 /usr/local/share/perl/5.22.1 /usr/lib/i386-linux-gnu/perl5/5.22 /usr/share/perl5 /usr/lib/i386-linux-gnu/perl/5.22 /usr/share/perl/5.22 /usr/local/lib/site_perl /usr/lib/i386-linux-gnu/perl-base .) at /home/zarko/.cpan/build/Lingua-TFIDF-0.01-0/blib/lib/Lingua/TFIDF/WordSegmenter/JA/MeCab.pm line 9.
# BEGIN failed--compilation aborted at /home/zarko/.cpan/build/Lingua-TFIDF-0.01-0/blib/lib/Lingua/TFIDF/WordSegmenter/JA/MeCab.pm line 9.
# Compilation failed in require at t/Lingua/TFIDF/WordSegmenter/JA/MeCab.t line 6.
# BEGIN failed--compilation aborted at t/Lingua/TFIDF/WordSegmenter/JA/MeCab.t line 6.
#   Failed test 'Lingua::TFIDF::WordSegmenter::JA::MeCab->new() died'
#   at t/Lingua/TFIDF/WordSegmenter/JA/MeCab.t line 8.
#     Error was:  Can't locate object method "new" via package "Lingua::TFIDF::WordSegmenter::JA::MeCab" at /usr/local/share/perl/5.22.1/Test/More.pm line 717.
Can't call method "segment" on an undefined value at t/Lingua/TFIDF/WordSegmenter/JA/MeCab.t line 17.
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 255 just after 2.
t/Lingua/TFIDF/WordSegmenter/JA/MeCab.t ...... Dubious, test returned 255 (wstat 65280, 0xff00)
Failed 2/2 subtests
t/Lingua/TFIDF/WordSegmenter/LetterNgram.t ... ok
t/Lingua/TFIDF/WordSegmenter/SplitBySpace.t .. ok

Test Summary Report
-------------------
t/Lingua/TFIDF/WordSegmenter/JA/MeCab.t    (Wstat: 65280 Tests: 2 Failed: 2)
  Failed tests:  1-2
  Non-zero exit status: 255
  Parse errors: No plan found in TAP output
Files=4, Tests=16, 1 wallclock secs ( 0.05 usr 0.00 sys + 0.44 cusr 0.05 csys = 0.54 CPU)
Result: FAIL
Failed 1/4 test programs. 2/16 subtests failed.
Makefile:890: die Regel für Ziel „test_dynamic“ scheiterte
make: *** [test_dynamic] Fehler 255
  SEKIA/Lingua-TFIDF-0.01.tar.gz
  /usr/bin/make test -- NOT OK
//hint// to see the cpan-testers results for installing this module, try:
  reports SEKIA/Lingua-TFIDF-0.01.tar.gz
Failed during this command:
 SEKIA/Lingua-TFIDF-0.01.tar.gz : make_test NO
 
Errors the world loves :-)

sudo apt install libtext-mecab-perl
 

cpan[8]> install Text::MeCab
Text::MeCab is up to date (0.20016).


  • That probably just means the tests are bad, rather than the code itself, and you can do force install Thread::Conveyor::Monitored to bypass the testing.
    answered May 27 '10 at 0:39
    Steve Simms


  • I tried doing this from source, and when I run make test, I get the same diagnostic messages. The make itself is fine - in fact, I think this is a pure perl module, so there's nothing to make. The issue is that the tests fail. – pythonic metaphor May 26 '10 at 19:43
  • Steve was right I think. The tests are poorly written. – pythonic metaphor May 28 '10 at 19:49
  • @pythonic metaphor: Cool. – Satanicpuppy May 28 '10 at 21:25
  • https://superuser.com/questions/145601/what-steps-to-take-when-cpan-installation-fails


    cpan[1]> force install Lingua::TFIDF
    Reading '/home/zarko/.cpan/Metadata'
      Database was generated on Tue, 08 Jan 2019 05:17:02 GMT
    Running install for module 'Lingua::TFIDF'
    Checksum for /home/zarko/.cpan/sources/authors/id/S/SE/SEKIA/Lingua-TFIDF-0.01.tar.gz ok
    Scanning cache /home/zarko/.cpan/build for sizes
    ............................................................................DONE
    Configuring S/SE/SEKIA/Lingua-TFIDF-0.01.tar.gz with Makefile.PL
    Checking if your kit is complete...
    Looks good
    Generating a Unix-style Makefile
    Writing Makefile for Lingua::TFIDF
    Writing MYMETA.yml and MYMETA.json
      SEKIA/Lingua-TFIDF-0.01.tar.gz
      /usr/bin/perl Makefile.PL INSTALLDIRS=site -- OK
    Running make for S/SE/SEKIA/Lingua-TFIDF-0.01.tar.gz
    cp lib/Lingua/TFIDF.pm blib/lib/Lingua/TFIDF.pm
    cp lib/Lingua/TFIDF/WordSegmenter/JA/MeCab.pm blib/lib/Lingua/TFIDF/WordSegmenter/JA/MeCab.pm
    cp lib/Lingua/TFIDF/Types.pm blib/lib/Lingua/TFIDF/Types.pm
    cp lib/Lingua/TFIDF/WordCounter/Simple.pm blib/lib/Lingua/TFIDF/WordCounter/Simple.pm
    cp lib/Lingua/TFIDF/WordSegmenter/SplitBySpace.pm blib/lib/Lingua/TFIDF/WordSegmenter/SplitBySpace.pm
    cp lib/Lingua/TFIDF/WordSegmenter/LetterNgram.pm blib/lib/Lingua/TFIDF/WordSegmenter/LetterNgram.pm
    cp lib/Lingua/TFIDF/WordCounter/Lossy.pm blib/lib/Lingua/TFIDF/WordCounter/Lossy.pm
    Manifying 7 pod documents
      SEKIA/Lingua-TFIDF-0.01.tar.gz
      /usr/bin/make -- OK
    Running make test for SEKIA/Lingua-TFIDF-0.01.tar.gz
    PERL_DL_NONLAZY=1 "/usr/bin/perl" "-MExtUtils::Command::MM" "-MTest::Harness" "-e" "undef *Test::Harness::Switches; test_harness(0, 'blib/lib', 'blib/arch')" t/Lingua/*.t t/Lingua/TFIDF/WordSegmenter/*.t t/Lingua/TFIDF/WordSegmenter/JA/*.t
    t/Lingua/TFIDF.t ............................. ok  
    t/Lingua/TFIDF/WordSegmenter/JA/MeCab.t ...... 1/?
    #   Failed test 'Lingua::TFIDF::WordSegmenter::JA::MeCab->new() died'
    #   at t/Lingua/TFIDF/WordSegmenter/JA/MeCab.t line 8.
    #     Error was:  Failed to create mecab instance at /usr/lib/i386-linux-gnu/perl5/5.22/Text/MeCab.pm line 64.
    Can't call method "segment" on an undefined value at t/Lingua/TFIDF/WordSegmenter/JA/MeCab.t line 17.
    # Tests were run but no plan was declared and done_testing() was not seen.
    # Looks like your test exited with 255 just after 2.
    t/Lingua/TFIDF/WordSegmenter/JA/MeCab.t ...... Dubious, test returned 255 (wstat 65280, 0xff00)
    Failed 1/2 subtests
    t/Lingua/TFIDF/WordSegmenter/LetterNgram.t ... ok  
    t/Lingua/TFIDF/WordSegmenter/SplitBySpace.t .. ok  

    Test Summary Report
    -------------------
    t/Lingua/TFIDF/WordSegmenter/JA/MeCab.t    (Wstat: 65280 Tests: 2 Failed: 1)
      Failed test:  2
      Non-zero exit status: 255
      Parse errors: No plan found in TAP output
    Files=4, Tests=16,  0 wallclock secs ( 0.03 usr  0.01 sys +  0.42 cusr  0.04 csys =  0.50 CPU)
    Result: FAIL
    Failed 1/4 test programs. 1/16 subtests failed.
    Makefile:890: die Regel für Ziel „test_dynamic“ scheiterte
    make: *** [test_dynamic] Fehler 255
      SEKIA/Lingua-TFIDF-0.01.tar.gz
      /usr/bin/make test -- NOT OK
    //hint// to see the cpan-testers results for installing this module, try:
      reports SEKIA/Lingua-TFIDF-0.01.tar.gz
    Running make install for SEKIA/Lingua-TFIDF-0.01.tar.gz
    Manifying 7 pod documents
    Installing /usr/local/share/perl/5.22.1/Lingua/TFIDF.pm
    Installing /usr/local/share/perl/5.22.1/Lingua/TFIDF/Types.pm
    Installing /usr/local/share/perl/5.22.1/Lingua/TFIDF/WordSegmenter/LetterNgram.pm
    Installing /usr/local/share/perl/5.22.1/Lingua/TFIDF/WordSegmenter/SplitBySpace.pm
    Installing /usr/local/share/perl/5.22.1/Lingua/TFIDF/WordSegmenter/JA/MeCab.pm
    Installing /usr/local/share/perl/5.22.1/Lingua/TFIDF/WordCounter/Lossy.pm
    Installing /usr/local/share/perl/5.22.1/Lingua/TFIDF/WordCounter/Simple.pm
    Installing /usr/local/man/man3/Lingua::TFIDF::WordCounter::Lossy.3pm
    Installing /usr/local/man/man3/Lingua::TFIDF::WordSegmenter::JA::MeCab.3pm
    Installing /usr/local/man/man3/Lingua::TFIDF::Types.3pm
    Installing /usr/local/man/man3/Lingua::TFIDF.3pm
    Installing /usr/local/man/man3/Lingua::TFIDF::WordSegmenter::SplitBySpace.3pm
    Installing /usr/local/man/man3/Lingua::TFIDF::WordSegmenter::LetterNgram.3pm
    Installing /usr/local/man/man3/Lingua::TFIDF::WordCounter::Simple.3pm
    Appending installation info to /usr/lib/i386-linux-gnu/perl/5.22/perllocal.pod
      SEKIA/Lingua-TFIDF-0.01.tar.gz
      /usr/bin/make install  -- OK
    Failed during this command:
     SEKIA/Lingua-TFIDF-0.01.tar.gz               : make_test NO but failure ignored because 'force' in effect


    Right, let's have a look ...

    #!/usr/bin/perl

    # tfidf-demo.pl


    use Lingua::TFIDF;
    use Lingua::TFIDF::WordSegmenter::SplitBySpace;
     
    my $tf_idf_calc = Lingua::TFIDF->new(
      # Split documents into words on whitespace.
      word_segmenter => Lingua::TFIDF::WordSegmenter::SplitBySpace->new,
    );
     
    my $document1 = 'Humpty Dumpty sat on a wall...';
    my $document2 = 'Remember, remember, the fifth of November...';
     
    my $tf = $tf_idf_calc->tf(document => $document1);
    # TF of word "Dumpty" in $document1.
    say "Say 1: ", $tf->{'Dumpty'};  # 2, if you are referring same text as mine.

    my $idf = $tf_idf_calc->idf(documents => [$document1, $document2]);
    say "Say 2: ", $idf->{'Dumpty'};  # log(2/1) ≒ 0.693147

    my $tf_idfs = $tf_idf_calc->tf_idf(documents => [$document1, $document2]);
    # TF-IDF of word "Dumpty" in $document1.
    say "Say 3: ", $tf_idfs->[0]{'Dumpty'};  # 2 log(2/1) ≒ 1.386294
    # Ditto. But in $document2.
    say "Say 4: ", $tf_idfs->[1]{'Dumpty'};  # 0

    Can't call method "say" on unblessed reference at tfidf-demo.pl line 19.

    ...
    # tfidf-demo.pl


    use Lingua::TFIDF;
    use Lingua::TFIDF::WordSegmenter::SplitBySpace;
    use feature qw(say);


    # Programm
    ...

     

    Output

     

    perl tfidf-demo.pl
    Say 1: 1
    Say 2: 0.693147180559945
    Say 3: 0.693147180559945
    Say 4:

    Works. Great.
    Does it really work?

     

    Code

     

    #!/usr/bin/perl

    # tfidf-demo.pl

    use strict;
    use warnings;
    use Lingua::TFIDF;
    use Lingua::TFIDF::WordSegmenter::SplitBySpace;
    use feature qw(say);


    # Programm
     
    my $tf_idf_calc = Lingua::TFIDF->new(
      # Split documents into words on whitespace.
      word_segmenter => Lingua::TFIDF::WordSegmenter::SplitBySpace->new,
    );
     
    my $document1 = 'Humpty Dumpty sat on a wall Honky Dory Donkey';
    my $document2 = 'Remember remember the fifth of November Humpty Donkey Fireday';
    my @document1_token = split ( " ", $document1 );
    my @document2_token = split ( " ", $document2 );
     

    my $tf = $tf_idf_calc->tf(document => $document1);
    my $idf = $tf_idf_calc->idf(documents => [$document1, $document2]);
    my $tf_idfs = $tf_idf_calc->tf_idf(documents => [$document1, $document2]);


    foreach ( @document1_token ) {
        # TF-IDF of word $_ in $document1.
        say "Say $_, doc1: ", $tf_idfs->[0]{$_};
        # Ditto. But in $document2.
        say "Say $_, doc2: ", $tf_idfs->[1]{$_};
    }

     

    Output

     

    perl tfidf-demo.pl
    Say Humpty, doc1: 0
    Say Humpty, doc2: 0
    Say Dumpty, doc1: 0.693147180559945
    Use of uninitialized value in say at tfidf-demo.pl line 34.
    Say Dumpty, doc2:
    Say sat, doc1: 0.693147180559945
    Use of uninitialized value in say at tfidf-demo.pl line 34.
    Say sat, doc2:
    Say on, doc1: 0.693147180559945
    Use of uninitialized value in say at tfidf-demo.pl line 34.
    Say on, doc2:
    Say a, doc1: 0.693147180559945
    Use of uninitialized value in say at tfidf-demo.pl line 34.
    Say a, doc2:
    Say wall, doc1: 0.693147180559945
    Use of uninitialized value in say at tfidf-demo.pl line 34.
    Say wall, doc2:
    Say Honky, doc1: 0.693147180559945
    Use of uninitialized value in say at tfidf-demo.pl line 34.
    Say Honky, doc2:
    Say Dory, doc1: 0.693147180559945
    Use of uninitialized value in say at tfidf-demo.pl line 34.
    Say Dory, doc2:
    Say Donkey, doc1: 0
    Say Donkey, doc2: 0
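The numbers above can be reproduced by hand, which helps to check that the module really does what is expected. The sketch below assumes tf = raw word count and idf = log(N / df), which matches the log(2/1) ≒ 0.693147 comment in the demo; tf_idf_by_hand is a made-up name and not part of Lingua::TFIDF. Words that occur in both documents (Humpty, Donkey) get log(2/2) = 0, and words that do not occur in a document have no entry at all, which is presumably where the "uninitialized value" warnings come from.

```perl
use strict;
use warnings;

# Hypothetical helper: one { word => tf-idf } hash ref per document,
# with tf = raw count and idf = log(N / df).
sub tf_idf_by_hand {
    my @documents = @_;
    my (@tf, %df);
    for my $doc (@documents) {
        my %count;
        $count{$_}++ for split ' ', $doc;
        $df{$_}++ for keys %count;     # in how many documents does the word occur?
        push @tf, \%count;
    }
    my $n = @documents;
    return map {
        my $count = $_;
        +{ map { $_ => $count->{$_} * log($n / $df{$_}) } keys %$count };
    } @tf;
}

my @scores = tf_idf_by_hand(
    'Humpty Dumpty sat on a wall Honky Dory Donkey',
    'Remember remember the fifth of November Humpty Donkey Fireday',
);
for my $word (qw(Humpty Dumpty)) {
    # "// 0" supplies a default for words missing from a document.
    printf "%s: doc1=%.6f doc2=%.6f\n",
        $word, $scores[0]{$word} // 0, $scores[1]{$word} // 0;
}
```

The same "// 0" default would also silence the warnings in tfidf-demo.pl.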
    Let's feed it more text.
    #!/usr/bin/perl

    # tfidf-demo.pl

    use strict;
    use warnings;
    use DBI;
    use ZugangsDaten_postgresql qw($DB_USER $DB_PASSWD);
    use Lingua::TFIDF;
    use Lingua::TFIDF::WordSegmenter::SplitBySpace;
    use feature qw(say);

    # Variables

    our $dbh;


    # Program
     
    my $tf_idf_calc = Lingua::TFIDF->new(
      # SplitBySpace splits on whitespace, which suits space-delimited text such as German.
      word_segmenter => Lingua::TFIDF::WordSegmenter::SplitBySpace->new,
    );

    connect_db();
    my $document1 = document_token_select('11111');
    my $document2 = document_token_select('44444');
    disconnect_db();

    print "\nTokens of document 1:\n";
    print $document1, "\n";
    print "\nTokens of document 1, end:\n";
    sleep 11;

    print "\nTokens of document 2:\n";
    print $document2, "\n";
    print "\nTokens of document 2, end:\n";
    sleep 11;


    my @document1_token = split ( " ", $document1 );
    my @document2_token = split ( " ", $document2 );
     

    my $tf = $tf_idf_calc->tf(document => $document1);
    my $idf = $tf_idf_calc->idf(documents => [$document1, $document2]);
    my $tf_idfs = $tf_idf_calc->tf_idf(documents => [$document1, $document2]);


    foreach ( @document1_token ) {
        # TF-IDF of word $_ in $document1.
        say "Say $_, doc1: ", $tf_idfs->[0]{$_};
        # Ditto. But in $document2.
        say "Say $_, doc2: ", $tf_idfs->[1]{$_};
    }





    ###########################################################
    ############### Subroutines ####################
    ###########################################################

    # Subroutines

    sub connect_db {
        ## Connect to the database; RaiseError makes DBI die on errors instead of failing silently
        $dbh = DBI->connect("DBI:Pg:dbname=links;host=localhost", $DB_USER, $DB_PASSWD, { RaiseError => 1 });
    }

    sub disconnect_db {
        $dbh->disconnect();
    }

    # Select the tokens of one document and return them as a space-separated string

    sub document_token_select {
        my $link_id = shift;
        # Bind $link_id through a placeholder instead of interpolating it into the SQL
        my $document_token_select = $dbh->prepare("SELECT token FROM (SELECT token(ts_debug(text)) FROM texts WHERE link_id = ?) AS token;");
        $document_token_select->execute($link_id);
        my @document_token;
        # List assignment keeps the loop running even if a token is "0" or empty
        while ( my ($token) = $document_token_select->fetchrow_array() ) {
            # Keep only tokens that contain at least one letter
            if ( defined $token && $token =~ /[a-zA-ZäöüÄÖÜß]+/ ) {
                push @document_token, $token;
            }
        }
        return join ( " ", @document_token );
    }

     

    Output

     

    ...
    Say TV-Programm, doc2:
    Say TV, doc1: 2.77258872223978
    Use of uninitialized value in say at tfidf-demo.pl line 53.
    Say TV, doc2:
    Say Programm, doc1: 1.38629436111989
    Use of uninitialized value in say at tfidf-demo.pl line 53.
    Say Programm, doc2:
    Say Themen, doc1: 0
    Say Themen, doc2: 0
    Say Autoren, doc1: 0.693147180559945
    Use of uninitialized value in say at tfidf-demo.pl line 53.
    Say Autoren, doc2:
    Say Spiele, doc1: 0
    Say Spiele, doc2: 0
    Say Newsletter, doc1: 0
    Say Newsletter, doc2: 0
    Say WELTPLUS, doc1: 0.693147180559945
    Use of uninitialized value in say at tfidf-demo.pl line 53.
    Say WELTPLUS, doc2:
    Say BUTTON, doc1: 0
    Say BUTTON, doc2: 0
    Say Politik, doc1: 0
    Say Politik, doc2: 0
    Say Wirtschaft, doc1: 0
    Say Wirtschaft, doc2: 0
    Say Finanzen, doc1: 0.693147180559945
    Use of uninitialized value in say at tfidf-demo.pl line 53.
    Say Finanzen, doc2:
    Say Sport, doc1: 0
    Say Sport, doc2: 0
    Say Panorama, doc1: 0
    Say Panorama, doc2: 0
    Say Wissen, doc1: 0
    Say Wissen, doc2: 0
    Say Gesundheit, doc1: 0
    Say Gesundheit, doc2: 0
    Say Kultur, doc1: 0
    Say Kultur, doc2: 0
    Say Meinung, doc1: 1.38629436111989
    Use of uninitialized value in say at tfidf-demo.pl line 53.
    Say Meinung, doc2:
    Say Geschichte, doc1: 0
    Say Geschichte, doc2: 0
    Say Reise, doc1: 0
    Say Reise, doc2: 0
    Say PS, doc1: 1.38629436111989
    Use of uninitialized value in say at tfidf-demo.pl line 53.
    Say PS, doc2:
    ...
    Say Bayern, doc1: 0
    Say Bayern, doc2: 0
    Say Baden-W�rttemberg, doc1: 0.693147180559945
    Use of uninitialized value in say at tfidf-demo.pl line 53.
    Say Baden-W�rttemberg, doc2:
    Say Baden, doc1: 0.693147180559945
    Use of uninitialized value in say at tfidf-demo.pl line 53.
    Say Baden, doc2:
    Say W�rttemberg, doc1: 0.693147180559945
    Use of uninitialized value in say at tfidf-demo.pl line 53.
    Say W�rttemberg, doc2:
    Say Niedersachsen, doc1: 0.693147180559945
    Use of uninitialized value in say at tfidf-demo.pl line 53.
    Say Niedersachsen, doc2:
    Say Bremen, doc1: 0.693147180559945
    Use of uninitialized value in say at tfidf-demo.pl line 53.
    Say Bremen, doc2:
    Say Hessen, doc1: 0.693147180559945
    Use of uninitialized value in say at tfidf-demo.pl line 53.
    Say Hessen, doc2:
    Say Rheinland-Pfalz, doc1: 0.693147180559945
    Use of uninitialized value in say at tfidf-demo.pl line 53.
    Say Rheinland-Pfalz, doc2:
    ...
    By and large it seems to work well and fast. Two blemishes still need fixing: the UTF-8 problem and the uninitialized-value problem. Shouldn't be a big deal.

    I've earned a little break now, even though I haven't been working for long at all ;-) .
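
    The two blemishes mentioned above could be tackled roughly like this. This is only a minimal sketch, not the fix used in this post: the helper name format_score and the sample data are made up, and it assumes the garbled umlauts come from printing without an output encoding layer. If the strings already arrive undecoded from the database, the DBD::Pg connection attribute pg_enable_utf8 would be the next thing to check.

```perl
use strict;
use warnings;
use utf8;                 # this source file contains UTF-8 string literals
use feature qw(say);

# Encode everything printed to STDOUT as UTF-8.
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helper: return the score, or the string 'undef' when the
# token does not occur in that document (defined-or needs Perl 5.10+).
sub format_score {
    my ($score) = @_;
    return $score // 'undef';
}

# Made-up stand-in for the tf_idf() result: one hash reference per document.
my $tf_idfs = [
    { 'Baden-Württemberg' => 0.693147180559945, 'Bayern' => 0 },
    { 'Bayern' => 0 },
];

for my $token (sort keys %{ $tf_idfs->[0] }) {
    say "Say $token, doc1: ", format_score($tf_idfs->[0]{$token});
    say "Say $token, doc2: ", format_score($tf_idfs->[1]{$token});
}
```

    The defined-or operator keeps a legitimate score of 0 intact and only replaces missing values, so no "uninitialized value" warning is triggered.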


    ...
    foreach ( @document1_token ) {
        # TF-IDF of word $_ in $document1.
        if ( not defined $tf_idfs->[0]{$_} ) {
            say "Say $_, doc1: undef";
        } else { say "Say $_, doc1: ", $tf_idfs->[0]{$_} }
        # Ditto. But in $document2.
        if ( not defined $tf_idfs->[1]{$_} ) {
            say "Say $_, doc2: undef";
        } else { say "Say $_, doc2: ", $tf_idfs->[1]{$_} }
    }
    ...

    Output

    ...
    Say Premium, doc1: 0.693147180559945
    Say Premium, doc2: undef
    Say Aromen, doc1: 5.54517744447956
    Say Aromen, doc2: undef
    Say aus, doc1: 0
    Say aus, doc2: 0
    Say dem, doc1: 0.693147180559945
    Say dem, doc2: undef
    Say Hause, doc1: 0.693147180559945
    Say Hause, doc2: undef
    Say German, doc1: 0.693147180559945
    Say German, doc2: undef
    Say Liquid, doc1: 4.15888308335967
    Say Liquid, doc2: undef
    Say s, doc1: 0.693147180559945
    Say s, doc2: undef
    Say Anzeigen, doc1: 0.693147180559945
    Say Anzeigen, doc2: undef
    Say Kacheln, doc1: 0.693147180559945
    Say Kacheln, doc2: undef
    Say Liste, doc1: 0.693147180559945
    Say Liste, doc2: undef
    ...


    This shows me that every token contained in both documents is assigned the value 0. Computed values only appear where one of the two (???) is "undef" ... and right there I spot a bug in my little program: the loop only walks the tokens of document 1, so tokens that occur only in document 2 never show up!
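
    Why shared tokens end up at 0 can be reproduced by hand. The sketch below is not the Lingua::TFIDF implementation; it assumes tf is the raw count and idf = ln(N / df), which is consistent with the numbers printed above (0.693147180559945 is ln 2). The function names term_counts and tf_idf_sketch are made up for illustration.

```perl
use strict;
use warnings;
use feature qw(say);

# Raw term counts for one whitespace-separated document.
sub term_counts {
    my ($document) = @_;
    my %count;
    $count{$_}++ for split ' ', $document;
    return \%count;
}

# TF-IDF per document, ASSUMING tf = raw count and idf = ln(N / df).
sub tf_idf_sketch {
    my @documents = @_;
    my @tf = map { term_counts($_) } @documents;

    # df: number of documents that contain the token at least once.
    my %df;
    for my $counts (@tf) {
        $df{$_}++ for keys %$counts;
    }

    my $n = @documents;
    my @scores;
    for my $counts (@tf) {
        my %score;
        $score{$_} = $counts->{$_} * log($n / $df{$_}) for keys %$counts;
        push @scores, \%score;
    }
    return \@scores;
}

my $scores = tf_idf_sketch(
    'Humpty Dumpty sat on a wall Honky Dory Donkey',
    'Remember remember the fifth of November Humpty Donkey Fireday',
);

for my $token (qw(Humpty Dumpty Donkey)) {
    say "$token, doc1: ", $scores->[0]{$token} // 'undef';
    say "$token, doc2: ", $scores->[1]{$token} // 'undef';
}
```

    With only two documents, a token found in both has df = 2, so ln(2 / 2) = 0 wipes out its score no matter how often it occurs. A token found in one document only gets its count multiplied by ln 2.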

    Code

     

    ...
    my %vector_token;
    foreach ( @document1_token ) {
        if ( not exists $vector_token{$_} ) { $vector_token{$_} = 1 }
    }
    foreach ( @document2_token ) {
        if ( not exists $vector_token{$_} ) { $vector_token{$_} = 1 }
    }


    my $tf = $tf_idf_calc->tf(document => $document1);
    my $idf = $tf_idf_calc->idf(documents => [$document1, $document2]);
    my $tf_idfs = $tf_idf_calc->tf_idf(documents => [$document1, $document2]);


    foreach ( sort { $a cmp $b } keys %vector_token ) {
        # TF-IDF of word $_ in $document1.
        if ( not defined $tf_idfs->[0]{$_} ) {
            say "Say $_, doc1: undef";
        } else { say "Say $_, doc1: ", $tf_idfs->[0]{$_} }
        # Ditto. But in $document2.
        if ( not defined $tf_idfs->[1]{$_} ) {
            say "Say $_, doc2: undef";
        } else { say "Say $_, doc2: ", $tf_idfs->[1]{$_} }
    }
    ...

     

    Output

     

    ...
    Say Batterieentsorgung, doc1: 0.693147180559945
    Say Batterieentsorgung, doc2: undef
    Say Beginn, doc1: undef
    Say Beginn, doc2: 0.693147180559945
    Say Benzinpreis, doc1: undef
    Say Benzinpreis, doc2: 1.38629436111989
    Say Bereitstellung, doc1: 0.693147180559945
    Say Bereitstellung, doc2: undef
    Say Bestseller, doc1: undef
    Say Bestseller, doc2: 0.693147180559945
    Say Bettmann, doc1: undef
    Say Bettmann, doc2: 0.693147180559945
    Say BeyondTomorrow, doc1: undef
    Say BeyondTomorrow, doc2: 0.693147180559945
    Say Big, doc1: 6.23832462503951
    Say Big, doc2: undef
    Say Brutto, doc1: undef
    Say Brutto, doc2: 1.38629436111989
    Say Brutto-Netto-Rechner, doc1: undef
    Say Brutto-Netto-Rechner, doc2: 1.38629436111989
    Say Buchrezensionen, doc1: undef
    Say Buchrezensionen, doc2: 0.693147180559945
    Say Bull, doc1: 1.38629436111989
    Say Bull, doc2: undef
    Say Bundesliga, doc1: undef
    Say Bundesliga, doc2: 0.693147180559945
    Say Burner, doc1: 2.77258872223978
    Say Burner, doc2: undef
    Say Business, doc1: undef
    Say Business, doc2: 1.38629436111989
    Say Bu�geldrechner, doc1: undef
    Say Bu�geldrechner, doc2: 1.38629436111989
    Say B�rse, doc1: undef
    Say B�rse, doc2: 2.07944154167984
    ...

     

    Code for TF

     

    ...
    my $tf1 = $tf_idf_calc->tf(document => $document1);
    my $tf2 = $tf_idf_calc->tf(document => $document2);
    my $idf = $tf_idf_calc->idf(documents => [$document1, $document2]);
    my $tf_idfs = $tf_idf_calc->tf_idf(documents => [$document1, $document2]);


    foreach ( sort { $a cmp $b } keys %vector_token ) {
        # TF of word $_ in $document1.
        if ( not defined $tf1->{$_} ) {
            say "Say $_, doc1: undef";
        } else { say "Say $_, doc1: ", $tf1->{$_} }
        # Ditto. But in $document2.
        if ( not defined $tf2->{$_} ) {
            say "Say $_, doc2: undef";
        } else { say "Say $_, doc2: ", $tf2->{$_} }
    }

    print "\nPause!\n";
    sleep 11;
    ...

     

    Output

     

    ...
    Say Bestseller, doc1: undef
    Say Bestseller, doc2: 1
    Say Bettmann, doc1: undef
    Say Bettmann, doc2: 1
    Say BeyondTomorrow, doc1: undef
    Say BeyondTomorrow, doc2: 1
    Say Big, doc1: 9
    Say Big, doc2: undef
    Say Brutto, doc1: undef
    Say Brutto, doc2: 2
    Say Brutto-Netto-Rechner, doc1: undef
    Say Brutto-Netto-Rechner, doc2: 2
    Say Buchrezensionen, doc1: undef
    Say Buchrezensionen, doc2: 1
    Say Bull, doc1: 2
    Say Bull, doc2: undef
    Say Bundesliga, doc1: undef
    Say Bundesliga, doc2: 1
    Say Burner, doc1: 4
    Say Burner, doc2: undef
    Say Business, doc1: undef
    Say Business, doc2: 2
    Say Bu�geldrechner, doc1: undef
    Say Bu�geldrechner, doc2: 2
    Say B�rse, doc1: undef
    Say B�rse, doc2: 3
    Say B�cher, doc1: undef
    Say B�cher, doc2: 2
    Say CHRONIK, doc1: undef
    Say CHRONIK, doc2: 1
    Say Champions, doc1: undef
    Say Champions, doc2: 1
    Say Clark, doc1: 2
    Say Clark, doc2: undef
    Say Coils, doc1: 1
    Say Coils, doc2: undef
    Say Coilstore, doc1: 2
    Say Coilstore, doc2: undef
    ...

     

    Code for IDF

     

    ...

    foreach ( sort { $a cmp $b } keys %vector_token ) {
        # IDF of word $_; $idf is a single hash for the whole document set.
        if ( not defined $idf->{$_} ) {
            say "Say $_, doc1: undef";
        } else { say "Say $_, doc1: ", $idf->{$_} }
        # Same lookup again, only labelled doc2.
        if ( not defined $idf->{$_} ) {
            say "Say $_, doc2: undef";
        } else { say "Say $_, doc2: ", $idf->{$_} }
    }

    print "\nPause!\n";
    sleep 11;

    ...

     

    Output

     

    ...

    Say Apps, doc1: 0.693147180559945
    Say Apps, doc2: 0.693147180559945
    Say Archiv, doc1: 0.693147180559945
    Say Archiv, doc2: 0.693147180559945
    Say Archive, doc1: 0.693147180559945
    Say Archive, doc2: 0.693147180559945
    Say Aroma, doc1: 0.693147180559945
    Say Aroma, doc2: 0.693147180559945
    Say Aromen, doc1: 0.693147180559945
    Say Aromen, doc2: 0.693147180559945
    Say Artikel, doc1: 0
    Say Artikel, doc2: 0
    Say Arztsuche, doc1: 0.693147180559945
    Say Arztsuche, doc2: 0.693147180559945
    Say Aspire, doc1: 0.693147180559945
    Say Aspire, doc2: 0.693147180559945

    ...




    This tells me that I don't understand the IDF output yet ;-) . Time will tell. No rush.

    "Read, understand." is the magic spell!
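
    One possible reading of the output above: IDF describes a token across the whole document set, not within a single document, so looking it up twice naturally prints the same number for "doc1" and "doc2". The sketch below assumes idf = ln(N / df), which matches the values 0 and 0.693147180559945 seen above but is not taken from the module's source. The name idf_sketch and the mini corpus are made up.

```perl
use strict;
use warnings;
use feature qw(say);

# Document frequency and IDF for whitespace-separated documents,
# ASSUMING idf = ln(N / df).
sub idf_sketch {
    my @documents = @_;
    my %df;
    for my $document (@documents) {
        my %seen;
        $seen{$_} = 1 for split ' ', $document;
        $df{$_}++ for keys %seen;    # count each token once per document
    }
    my $n = @documents;
    my %idf;
    $idf{$_} = log($n / $df{$_}) for keys %df;
    return (\%idf, \%df);
}

# Made-up mini corpus built from tokens that appear in the output above.
my ($idf, $df) = idf_sketch('Apps Artikel Aroma', 'Artikel Archiv');

# One line per token is enough: there is no per-document IDF.
for my $token (sort keys %$idf) {
    say "Say $token, df: $df->{$token}, idf: $idf->{$token}";
}
```

    Printing the document frequency next to the IDF makes the pattern visible: tokens present in both documents get 0, tokens present in only one get ln 2.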





    ...............................................................................................................................
    A world of SupiDupis and, unfortunately, of bad formatting. For that, at the very least, I have to apologize ;-)
    ...............................................................................................................................
    TO BE CONTINUED (CONSIDER IT A THREAT)!

