spaCy, I love you! - part 1

fim simply has to love spaCy; there is just no way around it.



And this nice gentleman has written a very fine, informative tutorial that is helping me a great deal here:

Taranjeet Singh is a software engineer with experience in Django, NLP, and search, having built search engines for K-12 students (featured in Google I/O 2019) and for children with autism.

https://realpython.com/natural-language-processing-spacy-python/


Thank you very much, Mr. Singh!




Credit where credit is due.

Here are three SVG2PNG outputs and the lines of code from the tutorial that I have tried so far:

[Images: three displaCy dependency visualizations, exported below as displacy_example-1.png, displacy_example-2.png, and displacy_example-3.png]


#!/usr/bin/env python
# coding: utf-8

# In[ ]:


#### !!!
### fim simply has to love spaCy!
#### !!!


# In[ ]:


# Install a pip package in the current Jupyter kernel

import sys

get_ipython().system('{sys.executable} -m pip install spacy')


# In[7]:


get_ipython().system('{sys.executable} -m spacy download en')


# In[8]:


get_ipython().system('{sys.executable} -m spacy download de')


# In[9]:


import spacy


# In[10]:


get_ipython().system('{sys.executable} -m spacy validate')


# In[ ]:


# https://stackoverflow.com/questions/56310481/how-to-fix-oserror-e050-cant-find-model-en-when-en-is-already-download
# Problem: error message despite a successful installation/download
# Unfortunately this did not help much
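

# In[ ]:


# A possible workaround (my own sketch, not from the tutorial): when
# spacy.load('en') raises E050 even though the download succeeded,
# loading the model by its full package name usually works.

import spacy

nlp = spacy.load('en_core_web_sm')  # full package name instead of the 'en' shortcut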


# In[8]:


# This post really ++ finally ++ helped:
# https://jakevdp.github.io/blog/2017/12/05/installing-python-packages-from-jupyter/

## High praise to this author!!!


# In[ ]:


# Tutorial:
# https://realpython.com/natural-language-processing-spacy-python/


# In[ ]:


import spacy


# In[17]:


# In English

nlp = spacy.load('en')
introduction_text = ('This tutorial is about Natural Language Processing in Spacy.')
introduction_doc = nlp(introduction_text)

# Extract tokens for the given doc

print([token.text for token in introduction_doc])


# In[19]:


# In German

nlp = spacy.load('de')
einfuehrung_text = ('Dieses Tutorial behandelt Natural Language Processing mit Spacy.')
einfuehrung_doc = nlp(einfuehrung_text)

# Extract tokens for the given document

print([token.text for token in einfuehrung_doc])


# In[20]:


# How to Read a Text File

nlp = spacy.load('en')
file_name = '/home/heide/Dropbox/Jupyter-Python/NLP_mit_spaCy/beispiel.txt'
introduction_file_text = open(file_name).read()
introduction_file_doc = nlp(introduction_file_text)

# Extract tokens for the given doc

print([token.text for token in introduction_file_doc])


# In[21]:


# Sentence Detection

file_name = '/home/heide/Dropbox/Jupyter-Python/NLP_mit_spaCy/beispiel.txt'
about_text = open(file_name).read()
about_doc = nlp(about_text)
sentences = list(about_doc.sents)
len(sentences)


# In[22]:


for sentence in sentences:
    print(sentence)


# In[23]:


# In the above example, spaCy is correctly able to identify sentences
# in the English language, using a full stop (.) as the sentence delimiter.
# You can also customize the sentence detection to detect sentences
# on custom delimiters.

# Here’s an example, where an ellipsis (...) is used as the delimiter:


# In[25]:


def set_custom_boundaries(doc):
    # Adds support to use `...` as the delimiter for sentence detection
    for token in doc[:-1]:
        if token.text == '...':
            doc[token.i+1].is_sent_start = True
    return doc

ellipsis_text = ('Gus, can you, ... never mind, I forgot'
                 ' what I was saying. So, do you think'
                 ' we should ...')
                
# Load a new model instance

custom_nlp = spacy.load('en')
custom_nlp.add_pipe(set_custom_boundaries, before='parser')
custom_ellipsis_doc = custom_nlp(ellipsis_text)
custom_ellipsis_sentences = list(custom_ellipsis_doc.sents)

for sentence in custom_ellipsis_sentences:
    print(sentence)


# In[26]:


# Sentence Detection with no customization

ellipsis_doc = nlp(ellipsis_text)
ellipsis_sentences = list(ellipsis_doc.sents)

for sentence in ellipsis_sentences:
    print(sentence)


# In[27]:


# Tokenization in spaCy


# In[28]:


for token in about_doc:
    print(token, token.idx)


# In[29]:


# Note how spaCy preserves the starting index of the tokens.
# It’s useful for in-place word replacement, as in the sketch below.
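

# In[ ]:


# A minimal sketch (my addition, not from the tutorial) of in-place word
# replacement: token.idx points into the original string, so a token can
# be spliced out without re-tokenizing.

token = about_doc[0]
replaced_text = (about_text[:token.idx] + 'REPLACED'
                 + about_text[token.idx + len(token.text):])
print(replaced_text[:80])


# In[ ]:


# spaCy provides various attributes for the Token class: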


# In[30]:


for token in about_doc:
    print(token, token.idx, token.text_with_ws,
          token.is_alpha, token.is_punct, token.is_space,
          token.shape_, token.is_stop)


# In[31]:


# In this example, some of the commonly required attributes are accessed:

#    text_with_ws prints token text with trailing space (if present).
#    is_alpha detects if the token consists of alphabetic characters or not.
#    is_punct detects if the token is a punctuation symbol or not.
#    is_space detects if the token is a space or not.
#    shape_ prints out the shape of the word.
#    is_stop detects if the token is a stop word or not.


# In[32]:


# spaCy allows you to customize tokenization by updating the tokenizer property on the nlp object:


# In[33]:


import re
import spacy
from spacy.tokenizer import Tokenizer
custom_nlp = spacy.load('en')

prefix_re = spacy.util.compile_prefix_regex(custom_nlp.Defaults.prefixes)
suffix_re = spacy.util.compile_suffix_regex(custom_nlp.Defaults.suffixes)
infix_re = re.compile(r'''[-~]''')

def customize_tokenizer(nlp):
    # Adds support to use `-` as the delimiter for tokenization
    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=None
                     )

custom_nlp.tokenizer = customize_tokenizer(custom_nlp)
custom_tokenizer_about_doc = custom_nlp(about_text)

print([token.text for token in custom_tokenizer_about_doc])


# In[34]:


# To customize tokenization, you can pass various parameters to the Tokenizer class:

#    nlp.vocab is a storage container for special cases and is used to handle cases like contractions and emoticons.
#    prefix_search is the function that is used to handle preceding punctuation, such as opening parentheses.
#    infix_finditer is the function that is used to handle non-whitespace separators, such as hyphens.
#    suffix_search is the function that is used to handle succeeding punctuation, such as closing parentheses.
#    token_match is an optional boolean function that is used to match strings that should never be split.
#                It overrides the previous rules and is useful for entities like URLs or numbers
#                (a sketch follows after the note below).

# Note: spaCy already detects hyphenated words as individual tokens.
# The above code is just an example to show how tokenization
# can be customized. It can be used for any other character.


# In[36]:


# Stop Words

spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
len(spacy_stopwords)


# In[38]:


for stop_word in list(spacy_stopwords)[:11]:
    print(stop_word)


# In[41]:


about_no_stopword_doc = [token for token in about_doc if not token.is_stop]
print(about_no_stopword_doc)


# In[42]:


# about_no_stopword_doc can be joined with spaces
# to form a sentence with no stop words.
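

# In[ ]:


# A minimal sketch (my addition) of exactly that:

print(' '.join(token.text for token in about_no_stopword_doc))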


# In[43]:


# Lemmatization


# In[44]:


conference_help_text = ('Gus is helping organize a developer'
    ' conference on Applications of Natural Language'
    ' Processing. He keeps organizing local Python meetups'
    ' and several internal talks at his workplace.')

conference_help_doc = nlp(conference_help_text)

for token in conference_help_doc:
    print(token, token.lemma_)
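

# In[ ]:


# A small variation (my addition, not from the tutorial): show only the
# tokens whose lemma differs from the surface form.

print([(token.text, token.lemma_) for token in conference_help_doc
       if token.text != token.lemma_])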


# In[45]:


# Word Frequency


# In[47]:


from collections import Counter

complete_text = ('Gus Proto is a Python developer currently'
    ' working for a London-based Fintech company. He is'
    ' interested in learning Natural Language Processing.'
    ' There is a developer conference happening on 21 July'
    ' 2019 in London. It is titled "Applications of Natural'
    ' Language Processing". There is a helpline number'
    ' available at +1-1234567891. Gus is helping organize it.'
    ' He keeps organizing local Python meetups and several'
    ' internal talks at his workplace. Gus is also presenting'
    ' a talk. The talk will introduce the reader about "Use'
    ' cases of Natural Language Processing in Fintech".'
    ' Apart from his work, he is very passionate about music.'
    ' Gus is learning to play the Piano. He has enrolled'
    ' himself in the weekend batch of Great Piano Academy.'
    ' Great Piano Academy is situated in Mayfair or the City'
    ' of London and has world-class piano instructors.')

complete_doc = nlp(complete_text)

# Remove stop words and punctuation symbols

words = [token.text for token in complete_doc
         if not token.is_stop and not token.is_punct]
word_freq = Counter(words)

# 5 commonly occurring words with their frequencies

common_words = word_freq.most_common(5)
print(common_words)


# In[48]:


# Unique words

unique_words = [word for (word, freq) in word_freq.items() if freq == 1]
print(unique_words)


# In[49]:


# Less informative when no stop words are removed:

words_all = [token.text for token in complete_doc if not token.is_punct]
word_freq_all = Counter(words_all)

# 5 commonly occurring words with their frequencies

common_words_all = word_freq_all.most_common(5)
print(common_words_all)


# In[50]:


# With 'about_text'

complete_doc = nlp(about_text)

# Remove stop words and punctuation symbols

words = [token.text for token in complete_doc
         if not token.is_stop and not token.is_punct]
word_freq = Counter(words)

# 5 commonly occurring words with their frequencies

common_words = word_freq.most_common(5)
print(common_words)


# In[51]:


# Part of Speech Tagging


# In[52]:


for token in about_doc:
    print(token, token.tag_, token.pos_, spacy.explain(token.tag_))


# In[ ]:


nouns = []
adjectives = []

for token in about_doc:
    if token.pos_ == 'NOUN':
        nouns.append(token)
    if token.pos_ == 'ADJ':
        adjectives.append(token)


# In[54]:


nouns


# In[55]:


adjectives
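

# In[ ]:


# A quick extension (my addition, not from the tutorial): count how often
# each coarse-grained POS tag occurs in the document.

from collections import Counter

pos_counts = Counter(token.pos_ for token in about_doc)
print(pos_counts.most_common())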


# In[56]:


# Visualization: Using displaCy


# In[58]:


# ...anaconda3/lib/python3.7/runpy.py:193: UserWarning:
# [W011] It looks like you're calling displacy.serve
# from within a Jupyter notebook or a similar environment.
# This likely means you're already running a local web server,
# so there's no need to make displaCy start another one.
# Instead, you should be able to replace displacy.serve with
# displacy.render to show the visualization.


from spacy import displacy

about_interest_text = ('He is interested in learning'
    ' Natural Language Processing.')
   
about_interest_doc = nlp(about_interest_text)
displacy.render(about_interest_doc, style='dep')
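

# In[ ]:


# Optional (my addition, not from the tutorial): displacy.render also
# accepts an `options` dict, e.g. for a more compact layout.

displacy.render(about_interest_doc, style='dep',
                options={'compact': True, 'distance': 90})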


# In[59]:


# In German:

nlp = spacy.load('de')
about_interest_text = ('Er/Sie/Es interessiert sich brennend'
    ' für das Erlernen von Natural Language Processing.')

about_interest_doc = nlp(about_interest_text)
displacy.render(about_interest_doc, style='dep')


# In[61]:


get_ipython().system('{sys.executable} -m pip install cairosvg')


# In[62]:


from cairosvg import svg2png


# In[74]:


# Again in German, output as 'png'

# To get the output as SVG, the Jupyter notebook
# rendering must be turned off

nlp = spacy.load('de')

about_interest_text = ('Er/Sie/Es interessiert sich brennend'
    ' für das Erlernen von Natural Language Processing.')

about_interest_doc = nlp(about_interest_text)
svg_code = displacy.render(about_interest_doc, style='dep', jupyter=False)


# In[77]:


svg2png(bytestring=svg_code, write_to='/home/heide/Dropbox/Jupyter-Python/NLP_mit_spaCy/displacy_example-1.png')


# In[ ]:


## Helpful links for the last example:

# https://github.com/explosion/spaCy/issues/1058
# https://stackoverflow.com/questions/6589358/convert-svg-to-png-in-python/13320450#13320450
# https://spacy.io/usage/visualizers


# In[79]:


# One more sentence in English
# https://www.thesun.co.uk/news/10150703/erdogan-mocks-trump-tweets-syria/

nlp = spacy.load('en')

about_interest_text = ('Poking fun at Trump’s love of social media,'
    ' Erdogan told reporters: “When we take a look at Mr Trump\'s'
    ' Twitter posts, we can no longer follow them.”')

about_interest_doc = nlp(about_interest_text)
svg = displacy.render(about_interest_doc, style='dep', jupyter=False)
svg2png(bytestring=svg, write_to='/home/heide/Dropbox/Jupyter-Python/NLP_mit_spaCy/displacy_example-2.png')


# In[82]:


# Again in German, output as 'png'
# For the one who cannot tell friend from foe

# To get the output as SVG, the Jupyter notebook
# rendering must be turned off

nlp = spacy.load('de')

about_interest_text = ('Eines nicht mehr fernen Tages werden ihn nicht mehr'
    ' nur die Freunde, sondern auch noch die Feinde auslachen.')

about_interest_doc = nlp(about_interest_text)
svg_code = displacy.render(about_interest_doc, style='dep', jupyter=False)
svg2png(bytestring=svg_code, write_to='/home/heide/Dropbox/Jupyter-Python/NLP_mit_spaCy/displacy_example-3.png')
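

# In[ ]:


# A small helper (my own consolidation, not from the tutorial) for the
# parse -> render -> PNG pattern used three times above.

def render_to_png(nlp, text, path):
    # Parse the text, render the dependency tree as SVG with the Jupyter
    # output turned off, and write the result to a PNG file
    doc = nlp(text)
    svg = displacy.render(doc, style='dep', jupyter=False)
    svg2png(bytestring=svg, write_to=path)

# Hypothetical usage:
# render_to_png(nlp, 'Noch ein Beispielsatz.', '/home/heide/Dropbox/Jupyter-Python/NLP_mit_spaCy/displacy_example-4.png')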


# In[ ]:




This really makes me happy.
And it made my day, today.




~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

11th

Weekend Seminar

in

Inter- & Un-
practical Philosophy

for

Carrion
-
Vultures
&
Others




In Bad Grabeseck, October 18-20.


Come ye gladly, one and all
!


