Today's test rocket launch - Stage 7

Something really nice

Scraping headlines from Spiegel Online | christianemester.de
HTML2PDF: Downloading URLs in bulk | christianemester.de

Jupyter notebook code

# Environment: ~.virtualenvs/.python3_test
# Kernel: Python 3

import csv
import datetime
from bs4 import BeautifulSoup
import requests

website = 'http://spiegel.de/schlagzeilen'
r = requests.get(website)
r.raise_for_status()  # stop early if the page could not be fetched
soup = BeautifulSoup(r.content, "lxml")
articles = []

for a in soup.select(".schlagzeilen-content.schlagzeilen-overview a[title]"):
    category, published_at = a.find_next_sibling(class_="headline-date").get_text().split(",")
    articles.append({
        "Title": a.get_text(),
        "URL": a.get('href'),
        "Category": category.strip(" ()"),
        "PublishedAt": published_at.strip(" ()")
    })
    
filename = 'SPON_Schlagzeilen.csv'

# newline='' lets the csv module control line endings itself
with open(filename, 'w', encoding='utf-8', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=["Title", "URL", "Category", "PublishedAt"], delimiter="|")
    writer.writeheader()
    writer.writerows(articles)
    
# Mester, Christiane (2018): Spiegel Online Scraper. Available at: https://www.christianemester.de/spiegel-online-scraper-python/
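
The scraper unpacks the result of split(",") into exactly two names, so a date text with no comma or with more than one comma would raise a ValueError. Here is a minimal sketch of a more forgiving variant. The helper name parse_headline_meta and the sample strings are my own assumptions, guessed from the strip(" ()") calls above, and are not part of the original scraper.

```python
def parse_headline_meta(text):
    """Split a string such as '(Panorama, 12:34)' into (category, published_at).

    Hypothetical helper; the input format is an assumption based on the
    strip(" ()") calls in the scraper.
    """
    cleaned = text.strip(" ()\n\t")
    # rpartition splits at the last comma only, so a category that itself
    # contains a comma does not break the two-value result
    category, sep, published_at = cleaned.rpartition(",")
    if not sep:
        # No comma at all: treat the whole text as the category
        return cleaned, ""
    return category.strip(), published_at.strip()


print(parse_headline_meta("(Panorama, 12:34)"))
print(parse_headline_meta("(Sport)"))
```

Inside the loop, the helper would take the place of the direct split(",") call.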

# Installations that may be required

## Jupyter notebook
import sys
!{sys.executable} -m pip install --user pdfkit

## Ubuntu
## See the installation instructions at: https://wkhtmltopdf.org
### Note: sudo apt install ~/Downloads/wkhtmltox_0.12.5-1.bionic_amd64.deb

Collecting pdfkit
  Using cached https://files.pythonhosted.org/packages/57/da/48fdd627794cde49f4ee7854d219f3a65714069b722b8d0e3599cd066185/pdfkit-0.6.1-py3-none-any.whl
Installing collected packages: pdfkit
Successfully installed pdfkit-0.6.1
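
pdfkit is only a wrapper around the wkhtmltopdf binary, so installing the Python package alone is not enough. Here is a small sketch for checking whether the binary can be found on the PATH before starting a long run. The function name find_wkhtmltopdf is my own choice; shutil.which is part of the standard library.

```python
import shutil


def find_wkhtmltopdf(binary_name="wkhtmltopdf"):
    """Return the full path of the binary, or None if it is not on the PATH."""
    return shutil.which(binary_name)


path = find_wkhtmltopdf()
if path is None:
    print("wkhtmltopdf not found on PATH; install it first")
else:
    print("wkhtmltopdf found at", path)
```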

import pdfkit
import re

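with open('SPON_Schlagzeilen.csv', 'r', encoding='utf-8') as file:
    # Line 0 is the CSV header, so conversion starts at line 1
    for line_no, line in enumerate(file):
        if line_no > 11:  # arbitrary limit: stop after 11 PDFs
            break
        if line_no > 0:
            # The URL is the second pipe-delimited field
            url = re.sub(r'^[^|]+\|([^|]+)\|.*$', r'\1', line.strip())
            # Relative links need the site prefix
            if url.startswith('/'):
                url = 'http://spiegel.de' + url
            print(line_no, url)
            pdfkit.from_url(url, '{}.pdf'.format(line_no), options={'disable-javascript': ''})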
        
# Mester, Christiane (2018): HTML2PDF mit Python. Available at:
# https://www.christianemester.de/html2pdf-converter-python/
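
The loop pulls the URL out of each line with a regular expression and prepends the site prefix to relative links. As an alternative sketch, the same step can be done with the csv module and urllib.parse.urljoin, which also copes with quoted fields. The function name extract_urls and the two sample rows are made up for illustration; only the column names and the pipe delimiter come from the scraper.

```python
import csv
import io
from urllib.parse import urljoin

BASE_URL = "http://spiegel.de"  # same prefix the loop prepends to relative links


def extract_urls(csv_text):
    """Return absolute URLs from pipe-delimited CSV text with a 'URL' column."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter="|")
    # urljoin leaves absolute URLs untouched and resolves relative ones
    return [urljoin(BASE_URL, row["URL"]) for row in reader]


# Invented sample data in the same layout as SPON_Schlagzeilen.csv
sample = (
    "Title|URL|Category|PublishedAt\n"
    "Example A|/panorama/example-a.html|Panorama|12:00\n"
    "Example B|https://www.spiegel.de/sport/example-b.html|Sport|12:05\n"
)
print(extract_urls(sample))
```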

1 http://spiegel.de/panorama/gesellschaft/russland-versuch-43-fuer-ein-gesetz-gegen-haeusliche-gewalt-a-1301393.html

Loading pages (1/6)
Counting pages (2/6)                                               
Resolving links (4/6)                                                       
Loading headers and footers (5/6)                                           
Printing pages (6/6)
Done                                                                      
2 http://spiegel.de/sport/fussball/mesut-oezil-aus-chinesischer-version-von-pro-evolution-soccer-entfernt-a-1302012.html

Loading pages (1/6)
Counting pages (2/6)                                               
Resolving links (4/6)                                                       
Loading headers and footers (5/6)                                           
Printing pages (6/6)
Done                                                                      
3 http://spiegel.de/wissenschaft/mensch/israel-aeltestes-bollwerk-gegen-meeresspiegelanstieg-entdeckt-a-1301972.html

Loading pages (1/6)
Counting pages (2/6)                                               
Resolving links (4/6)                                                       
Loading headers and footers (5/6)                                           
Printing pages (6/6)
Done                                                                      
4 http://spiegel.de/sport/fussball/klub-wm-fc-liverpool-schlaegt-monterrey-fc-und-steht-im-finale-a-1302008.html

Loading pages (1/6)
Counting pages (2/6)                                               
Resolving links (4/6)                                                       
Loading headers and footers (5/6)                                           
Printing pages (6/6)
Done                                                                      
5 https://www.spiegel.de/sport/fussball/bundesliga-live-ticker-tabelle-ergebnisse-spielplan-statistik-a-842988.html#contest=bl1&matchday=16&match=4588794

Loading pages (1/6)
Counting pages (2/6)                                               
Resolving links (4/6)                                                       
Loading headers and footers (5/6)                                           
Printing pages (6/6)
Done                                                                      
6 http://spiegel.de/sport/fussball/bundesliga-hertha-bsc-berlin-schlaegt-bayer-leverkusen-a-1301995.html

Loading pages (1/6)
Counting pages (2/6)                                               
Resolving links (4/6)                                                       
Loading headers and footers (5/6)                                           
Printing pages (6/6)
Done                                                                      
7 http://spiegel.de/sport/sonst/darts-wm-2020-im-liveticker-a-1299968.html

Loading pages (1/6)
Counting pages (2/6)                                               
Resolving links (4/6)                                                       
Loading headers and footers (5/6)                                           
Printing pages (6/6)
Done                                                                      
8 https://www.spiegel.de/sport/fussball/la-liga-tabelle-live-ticker-spielplan-a-851219.html#contest=ligen-spa&matchday=10&match=4592642

Loading pages (1/6)
Counting pages (2/6)                                               
Resolving links (4/6)                                                       
Loading headers and footers (5/6)                                           
Printing pages (6/6)
Done                                                                      
9 http://spiegel.de/panorama/justiz/prepper-prozess-in-schwerin-der-irrweg-des-sek-manns-a-1302001.html

Loading pages (1/6)
Counting pages (2/6)                                               
Resolving links (4/6)                                                       
Loading headers and footers (5/6)                                           
Printing pages (6/6)
Done                                                                      
10 http://spiegel.de/panorama/schweiz-hirsch-mit-sechs-kilo-muell-im-magen-entdeckt-a-1302009.html

Loading pages (1/6)
Counting pages (2/6)                                               
Resolving links (4/6)                                                       
Loading headers and footers (5/6)                                           
Printing pages (6/6)
Done                                                                      
11 http://spiegel.de/politik/deutschland/afd-und-bundestag-polizist-lars-herrmann-tritt-aus-partei-und-fraktion-aus-a-1301991.html

Loading pages (1/6)
Counting pages (2/6)                                               
Resolving links (4/6)                                                       
Loading headers and footers (5/6)                                           
Printing pages (6/6)
Done

Here it is again in a nicer PDF view:
https://drive.google.com/open?id=1axwPIpC58_8ZBafmAszjUnNFmrPEnnqN

And an example of the generated PDFs:
https://drive.google.com/open?id=1sUVpT6vXQr63jYFOOD8QGo_bd7U-AsQs

Please note (!)

In case it is not obvious: I stop the run after 11 PDFs have been printed, a limit I picked arbitrarily. There are of course far more headlines on a given day (in this case exactly 315).
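
The cut-off can also be written without a counter and a break statement. This is a sketch using itertools.islice from the standard library. The helper name first_n is my own, and the list holds placeholders for the 315 headlines, not real data.

```python
from itertools import islice


def first_n(items, n):
    """Return at most n items from any iterable, without reading the rest."""
    return list(islice(items, n))


# 315 placeholder entries standing in for the day's headlines
headlines = ["headline {}".format(i) for i in range(1, 316)]
batch = first_n(headlines, 11)
print(len(batch), batch[0], batch[-1])
```

Changing the second argument is then all it takes to convert more or fewer pages.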

Closing remarks

Really cool work by https://www.christianemester.de!
Great modules (pdfkit, wkhtmltopdf)!

Other links I used:

Python Regex: re.match(), re.search(), re.findall() with Example
Troubleshooting › Package management › Wiki › ubuntuusers.de
apt › apt › Wiki › ubuntuusers.de
DEB package installation › Wiki › ubuntuusers.de
wkhtmltox_0.12.5-1.bionic_amd64.deb
wkhtmltopdf
wkhtmltopdf/src at master · wkhtmltopdf/wkhtmltopdf · GitHub
Python 3 - break statement - Tutorialspoint
Installing wkhtmltopdf · JazzCore/python-pdfkit Wiki · GitHub
python-pdfkit/before-script.sh at master · JazzCore/python-pdfkit · GitHub
pdfkit · PyPI
