bd_www.scrape
Functions to scrape www.belastingdienst.nl.
This module is used to execute a complete scrape of the
www.belastingdienst.nl site. It can be executed directly or by importing and
calling the scrape_site function from this bd_www package.
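For example, a complete scrape can be run unattended with a few lines of Python. The sketch below is illustrative only; it imports scrape_site from this bd_www.scrape module and relies on nothing beyond the documented return value (the timestamp of the scrape, or an empty result when another scrape is already in progress).

from bd_www.scrape import scrape_site

ts = scrape_site()   # blocks until the scrape and its post-processing finish
if ts:
    print(f'scrape {ts} stored in the scrapes database')
else:
    print('no scrape done: another scrape was already in progress')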
This module provides two maintenance functions:
- check_storage_integrity: to execute some basic tests on the scrapes storage
- reprocess_scrapes: to reprocess earlier scrapes
The latter function should be used after changing the content of the manually
administered non_scrape_ids and non_orphan_ids tables, passing a from_ts
starting timestamp from which these changes should impact other data.
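A typical maintenance session might therefore look like the sketch below. Both function names are documented above, but their exact signatures are not shown in this excerpt, so the bare call to check_storage_integrity and the from_ts keyword argument are assumptions; the timestamp value is purely hypothetical.

from bd_www.scrape import check_storage_integrity, reprocess_scrapes

# 1. Verify that the scrapes storage is still consistent.
check_storage_integrity()          # assumed to take no arguments

# 2. After editing the non_scrape_ids / non_orphan_ids tables, reprocess the
#    scrapes from the timestamp at which the changes should take effect.
reprocess_scrapes(from_ts='240101-0400')   # hypothetical timestamp [yymmdd-hhmm]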
Apart from these maintenance functions, all other functions in this module implement parts of the main functionality and are not meant to be used separately.
Note: The main design consideration for this module is that scraping should be able to be scheduled and run unattended. Therefore no unhandled exceptions should occur, and the basic scrape, i.e. getting and storing all pages together with the redirections that occur while retrieving them, must continue under all circumstances.
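In line with that consideration, a scheduler (cron or similar) would typically invoke the module through a small wrapper like the hypothetical one below, so that even a failure outside the scraper itself never surfaces as an unhandled exception in the scheduled job.

import logging
from bd_www.scrape import scrape_site

def scheduled_run() -> None:
    # The scraper is designed not to raise, but a scheduled job should be
    # defensive about anything that happens around it (imports, disk, etc.).
    try:
        ts = scrape_site()
        logging.info('scrape run finished: %s', ts or 'skipped')
    except Exception:
        logging.exception('unexpected failure around the scraper')

if __name__ == '__main__':
    scheduled_run()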
""" ***Functions to scrape www.belastingdienst.nl.*** This module is used to execute a complete scrape of the www.belastingdienst.nl site. It can be executed directly or by importing and calling the `scrape_site` function from this `bd_www` package. This module provides two maintenance functions: - `check_storage_integrity`: to execute some basic tests on the scrapes storage - `reprocess_scrapes`: to reprocess earlier scrapes This latter function should be used after changing the content of the manually administered *non_scrape_ids* and *non_orphan_ids* tables, with a `from_ts` starting timestamp after which these changes impact other data. Apart from these maintenance functions, all other functions of this module implement parts of the main functionalities and are not meant to be used separately. ***Note:*** *The main design consideration for all this module is that scraping should be able to be scheduled and run unattended. Therefore no unhandled exceptions should occur and the basic scrape, i.e. getting and storing all pages together with the redirections that occur while retrieving them, should continue above all.* """ import copy import datetime as dt import logging import re import requests import shutil import time import urllib3 from bs4 import BeautifulSoup from bs4.element import NavigableString, Tag, Comment, Stylesheet from pathlib import Path from typing import Set, Union from urllib.parse import urljoin, urlparse, unquote from bd_www import scrape_conf, report_conf, mst_conn, mst_dir, PageSourceZip from bd_www.constants import ROOT_URL, SCRAPER_VERSION, WWW_SITE from bd_www.report import REPORT_SUBDIRS from bd_www.matomo import matomo_available, urls_last_days # Non-public constants EXTRACTED_FIELDS = [ ('title', 'TEXT'), ('description', 'TEXT'), ('num_h1s', 'INTEGER'), ('first_h1', 'TEXT'), ('language', 'TEXT'), ('modified', 'DATE'), ('pagetype', 'TEXT'), ('classes', 'TEXT'), ('ed_text', 'TEXT'), ('aut_text', 'TEXT') ] "Field definitions for the *his_pages_info* and *scr_pages_info* table " \ "(part 1 of 2)." DERIVED_FIELDS = [ ('business', 'TEXT'), ('category', 'TEXT') ] "Field definitions for the *his_pages_info* and *scr_pages_info* table " \ "(part 2 of 2)." DV_TYPES = {'bld-campagne', 'bld-dv-content', 'bld-filter'} "Set of page-types that belong to the `dv` category." BIB_TYPES = {'bld-bd', 'bld-cluster', 'bld-concept', 'bld-direction', 'bld-faq', 'bld-landing', 'bld-overview', 'bld-sitemap', 'bld-target', 'bld-targetGroup'} "Set of page-types that belong to the `bib` category." ALG_TYPES = {'bld-error', 'bld-iahWrapper', 'bld-newsItem', 'bld-outage', 'bld-response', 'bld-search'} "Set of page-types that belong to the `alg` category (excluding " \ "'bld_wrapper', which is handled specifically)." _README_TXT = "IMPORTANT\n" \ "=========\n" \ "A scrape is validated by the availability of a log file " \ "containing the text 'scrape finished'.\n" \ "Only delete a log when a scrape needs to be completely " \ "discarded. Do this with great care!" "Text for the README file in the directory with the scrape logs." # Under the hood, the requests package uses the urllib3 package. While # checking links in the scrape_page function, this package can give some # warnings that are not relevant in the context of this scraper. 
urllib3.disable_warnings() logger = logging.getLogger() def create_new_data_store() -> None: """ **Create a new skeleton data store.** ***Note:*** *Since this function will not be used very often, it is mainly meant as means to document the storage concept.* Driven by the configuration file `bd_www.constants.CONFIG_NAME`, (sub)directories are created in the master directory to contain the various scraped data, and an empty scrapes as well as a metrics database is created with all necessary tables. The next data will be stored as files in separate **subdirectories** of the master directory: - page sources in zip files per scrape - 'robots.txt' files when used for finding unscraped url's and if different from the previous version - 'sitemap.xml' files when used for finding unscraped url's and if different from the previous version - log files for all scrapes - site reports per report variation, with subdirectories per reporting period Next **files** are stored directly in the master directory: - web analytics log file - reporting log file - scrapes database - metrics database Various names of above mentioned files and (sub)directories are determined by the actual parameters of the configuration file. Please refer to the `bd_www.Config` class for further information. The ***scrapes database*** consists of the following tables and views: **History tables** - contain all basic, extracted and derived data from all scrapes. All stored data is labeled with the timestamp of the scrape. With some pragmatic exceptions the data is only saved if different from the previous scrape (time delta storage). - *his_pages_life* - time delta register when a page was alive or not (*alive* is FALSE when a page was not present) - *his_pages_info* - time delta register of extracted and derived data per page (full record saved when one or more values differ from previous scrape) - *his_ed_links* - time delta register with usage of editorial links (*present* is 0 when an editorial link was not used) - *his_links_status* - time delta register with type and status of (editorial) links - *his_redirs* - time delta register with redirects that occurred while scraping (*redir_type* is 'absent' when it did not occur) - *his_unlinked* - time delta register with pages that were not connected via editorial links (*unl_type* is 'absent' when no longer unlinked or not present) - *his_short_urls* - time delta register when short url's were active - *his_status_figures* - figures representing the status of the site for each scrape **Master tables** - contain unique and primary data - *mst_paths* - unique relation between *path* and *page_id* - *mst_links* - unique relation between *url* or *page_id*/*anchor* and *link_id* - *mst_scrapes* - primary reference for available scrapes and unique source for *scraper* and *primers* fields **Non tables/views** - collections of manually administered *page_id*'s to force special handling of these pages. - *non_scrape_ids* - table with manually administered *page_id*'s that will not be scraped - *non_orphan_ids* - table with manually administered *page_id*'s that will not be considered orphans when becoming unlinked (*unl_type* will be set to 'special' in those cases) - *non_orphan_paths* - view of table *non_scrape_ids* with explicit *path* - *non_scrape_paths* - view of table *non_orphan_ids* with explicit *path* **Scrape tables** - operational stores for scrape results; purged when starting a new scrape. 
- *scr_page_ids* - id's of pages that were scraped - *scr_redirs* - redirects that occurred while scraping - *scr_pages_info* - extracted and derived data per page - *scr_links* - register of fully qualified (including *link_type* and *status*) links in the editorial content of the scraped pages - *scr_ed_links* - use of links in the editorial content each page - *scr_unlinked* - pages that can not be reached from other pages - *scr_status_figures* - figures representing the status of the site **Timestamped table and views** - after saving the timestamp in (the only record of) the *tsd* table, the timestamped views represent a dataset that is comparable to the scrape tables after the scrape with the specified timestamp finished. - *tsd* - scalar store to set the timestamp for the next timestamped views - *tsd_redirs* - identical to equivalent scrape table - *tsd_redirs_to_wcm* - all redirects related to the canonical url of the page they finally redirected to (not available as scrape table) - *tsd_pages_info* - identical to equivalent scrape table - *tsd_links* - identical to equivalent scrape table - *tsd_ed_links* - identical to equivalent scrape table - *tsd_unlinked* - identical to equivalent scrape table - *tsd_int_ref_cnt* - number of times that a pages is linked to from the editorial content of other pages (not available as scrape table) The ***metrics database*** consists of the following tables and views: **Basic tables** - contain unique and primary data - *shd_paths* - unique relation between *path* and *page_id*; acts as shadow subset of the *mst_paths* table in the scrape database as substitute for an inter-db relation that is not possible with SQLite - *shd_links* - unique relation between *link_id* and *url*; acts as shadow subset of the *mst_links* table in the scrape database as substitute for an inter-db relation that is not possible with SQLite - *dates* - unique relation between *date* and *date_id* **Analytics tables** - contain all historic analytics data - *daily* - numeric analytics data per date and page - *feedback* - numeric and textual feedback per date and page - *downloads* - numeric download data per date and download link **Analytics views** - views for the analytics tables with explicit relations - *daily_expl* - daily data table with explicit date and page path - *feedback_expl* - feedback data table with explicit date and page path - *downloads_expl* - downloads data table with explicit date and link path """ # Create directories and database file for dir_name in (scrape_conf.src_dir_name, scrape_conf.robots_dir_name, scrape_conf.log_dir_name, scrape_conf.sitemap_dir_name): (mst_dir / dir_name).mkdir(exist_ok=True) for report in report_conf.reports: report_conf.spec_report(report) report_dir = mst_dir / report_conf.report_dir_name report_dir.mkdir(exist_ok=True) for subdir_name in set(REPORT_SUBDIRS.values()): (report_dir / subdir_name).mkdir(exist_ok=True) log_readme = mst_dir / scrape_conf.log_dir_name / 'README.txt' with open(log_readme, 'w') as f: f.write(_README_TXT) # Scrape db: create all tables to contain the historical scrape data mst_conn.execute(''' CREATE TABLE mst_scrapes ( timestamp TEXT PRIMARY KEY, date AS ("20" || substr(timestamp,1,2) || "-" || substr(timestamp,3,2) || "-" || substr(timestamp,5,2)), scraper TEXT, primers TEXT ) ''') mst_conn.execute(''' CREATE TABLE mst_paths ( page_id INTEGER PRIMARY KEY AUTOINCREMENT, path TEXT NOT NULL UNIQUE ) ''') mst_conn.execute(''' CREATE TABLE mst_links ( link_id INTEGER PRIMARY KEY 
AUTOINCREMENT, url TEXT, anchor TEXT, page_id INTEGER, UNIQUE (url), UNIQUE (page_id, anchor), FOREIGN KEY (page_id) REFERENCES mst_paths (page_id) ) ''') mst_conn.execute(''' CREATE TABLE his_pages_life ( timestamp TEXT NOT NULL, page_id INTEGER NOT NULL, alive INTEGER NOT NULL, PRIMARY KEY (timestamp, page_id), FOREIGN KEY (page_id) REFERENCES mst_paths (page_id) ) ''') fields = EXTRACTED_FIELDS + DERIVED_FIELDS info_columns = ', '.join([f'{f[0]} {f[1]}' for f in fields]) mst_conn.execute(f''' CREATE TABLE his_pages_info ( timestamp TEXT NOT NULL, page_id INTEGER NOT NULL, {info_columns}, PRIMARY KEY (timestamp, page_id), FOREIGN KEY (page_id) REFERENCES mst_paths (page_id) ) ''') mst_conn.execute(''' CREATE TABLE his_links_status ( timestamp TEXT NOT NULL, link_id INTEGER NOT NULL, link_type TEXT, status INTEGER, PRIMARY KEY (timestamp, link_id), FOREIGN KEY (link_id) REFERENCES mst_links (link_id) ) ''') mst_conn.execute(''' CREATE TABLE his_ed_links ( timestamp TEXT NOT NULL, page_id INTEGER NOT NULL, text TEXT NOT NULL, link_id INTEGER NOT NULL, present INTEGER NOT NULL, PRIMARY KEY (timestamp, page_id, text, link_id) FOREIGN KEY (page_id) REFERENCES mst_paths (page_id), FOREIGN KEY (link_id) REFERENCES mst_links (link_id) ) ''') mst_conn.execute(''' CREATE TABLE his_redirs ( timestamp TEXT NOT NULL, req_id INTEGER, req_url TEXT, redir_id INTEGER, redir_url TEXT, redir_type TEXT NOT NULL, PRIMARY KEY (timestamp, req_id, req_url), UNIQUE (timestamp, req_id), UNIQUE (timestamp, req_url), FOREIGN KEY (req_id) REFERENCES mst_paths (page_id), FOREIGN KEY (redir_id) REFERENCES mst_paths (page_id) ) ''') mst_conn.execute(''' CREATE TABLE his_unlinked ( timestamp TEXT NOT NULL, page_id INTEGER, url TEXT, unl_type TEXT, PRIMARY KEY (timestamp, page_id, url), UNIQUE (timestamp, page_id), UNIQUE (timestamp, url), FOREIGN KEY (page_id) REFERENCES mst_paths (page_id) ) ''') mst_conn.execute(''' CREATE TABLE his_short_urls ( timestamp TEXT NOT NULL, url TEXT NOT NULL, active INTEGER NOT NULL, PRIMARY KEY (timestamp, url) ) ''') mst_conn.execute(''' CREATE TABLE his_status_figures ( timestamp TEXT NOT NULL, name TEXT NOT NULL, value INTEGER NOT NULL, PRIMARY KEY (timestamp, name) ) ''') mst_conn.execute(''' CREATE TABLE non_orphan_ids ( page_id INTEGER NOT NULL UNIQUE, FOREIGN KEY (page_id) REFERENCES mst_paths (page_id) ) ''') mst_conn.execute(''' CREATE TABLE non_scrape_ids ( page_id INTEGER NOT NULL UNIQUE, FOREIGN KEY (page_id) REFERENCES mst_paths (page_id) ) ''') # Scrape db: create tables to contain operational results during a scrape mst_conn.execute(''' CREATE TABLE scr_page_ids ( page_id INTEGER NOT NULL UNIQUE, FOREIGN KEY (page_id) REFERENCES mst_paths (page_id) ) ''') mst_conn.execute(''' CREATE TABLE scr_redirs ( req_id INTEGER UNIQUE, req_url TEXT UNIQUE, redir_id INTEGER, redir_url TEXT, redir_type TEXT NOT NULL, PRIMARY KEY (req_id, req_url), FOREIGN KEY (req_id) REFERENCES mst_paths (page_id), FOREIGN KEY (redir_id) REFERENCES mst_paths (page_id) ) ''') mst_conn.execute(''' CREATE TABLE scr_links ( link_nr INTEGER PRIMARY KEY AUTOINCREMENT, url TEXT, page_id INTEGER, anchor TEXT, link_type TEXT, status INTEGER, link_id INTEGER UNIQUE, UNIQUE (url, anchor), UNIQUE (page_id, anchor), FOREIGN KEY (page_id) REFERENCES mst_paths (page_id) ) ''') mst_conn.execute(''' CREATE TABLE scr_ed_links ( page_id INTEGER NOT NULL, text TEXT NOT NULL, link_nr INTEGER NOT NULL, FOREIGN KEY (page_id) REFERENCES mst_paths (page_id), FOREIGN KEY (link_nr) REFERENCES scr_links (link_nr) ) ''') 
mst_conn.execute(''' CREATE TABLE scr_unlinked ( page_id INTEGER UNIQUE, url TEXT UNIQUE, unl_type TEXT, FOREIGN KEY (page_id) REFERENCES mst_paths (page_id) ) ''') fields = EXTRACTED_FIELDS + DERIVED_FIELDS info_columns = ', '.join([f'{f[0]} {f[1]}' for f in fields]) mst_conn.execute(f''' CREATE TABLE scr_pages_info ( page_id INTEGER PRIMARY KEY, {info_columns}, FOREIGN KEY (page_id) REFERENCES mst_paths (page_id) ) ''') mst_conn.execute(''' CREATE TABLE scr_status_figures ( name TEXT PRIMARY KEY, value INTEGER NOT NULL ) ''') # Scrape db: create views for non-orphan and non-scrape tables with added # paths. mst_conn.execute(''' CREATE VIEW non_orphan_paths AS SELECT * FROM non_orphan_ids NATURAL JOIN mst_paths ''') mst_conn.execute(''' CREATE VIEW non_scrape_paths AS SELECT * FROM non_scrape_ids NATURAL JOIN mst_paths ''') # Scrape db: create a simple table to contain the timestamp for a # specific scrape, together with the views to get the data for that scrape. mst_conn.executescript('CREATE TABLE tsd (timestamp TEXT)') mst_conn.execute(''' CREATE VIEW tsd_unlinked AS WITH last_unlinked_during_scrape AS ( SELECT DISTINCT page_id, url, last_value(unl_type) OVER win AS unl_type FROM his_unlinked AS u, tsd AS s WHERE u.timestamp <= s.timestamp WINDOW win AS ( PARTITION BY page_id, url ORDER BY u.timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) ) SELECT * FROM last_unlinked_during_scrape WHERE unl_type != 'absent' ''') mst_conn.execute(''' CREATE VIEW tsd_redirs AS WITH last_redirs_during_scrape AS ( SELECT DISTINCT req_id, req_url, last_value(redir_id) OVER win AS redir_id, last_value(redir_url) OVER win AS redir_url, last_value(redir_type) OVER win AS redir_type FROM his_redirs AS r, tsd AS s WHERE r.timestamp <= s.timestamp WINDOW win AS ( PARTITION BY req_id, req_url ORDER BY r.timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) ) SELECT * FROM last_redirs_during_scrape WHERE redir_type != 'absent' ''') mst_conn.execute(''' CREATE VIEW tsd_redirs_to_wcm (req_url, wcm_id, final_type) AS WITH -- build url-based redir chains scr_chains (org_url, prev_url, next_id, next_url, next_type) AS ( SELECT req_url, NULL, redir_id, redir_url, redir_type FROM tsd_redirs UNION ALL SELECT org_url, next_url, r.redir_id, r.redir_url, r.redir_type FROM scr_chains AS c LEFT JOIN tsd_redirs AS r ON c.next_url = r.req_url -- avoid endless loops by excluding redirs of an alias WHERE c.next_type <> 'alias' ) -- get all redir chains ending in a wcm url SELECT org_url, next_id, next_type FROM scr_chains WHERE next_id IS NOT NULL AND org_url IS NOT NULL ''') mst_conn.execute(''' CREATE VIEW tsd_ed_links AS WITH RECURSIVE last_his_ed_links_records_during_scrape AS ( SELECT DISTINCT page_id, text, link_id, last_value(present) OVER win AS present FROM his_ed_links AS e, tsd AS s WHERE e.timestamp <= s.timestamp WINDOW win AS ( PARTITION BY page_id, text, link_id ORDER BY e.timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) ), -- Degroup same links per page scr_ed_links_singles AS ( SELECT page_id, text, link_id, present - 1 AS present FROM last_his_ed_links_records_during_scrape WHERE present > 0 UNION ALL SELECT page_id, text, link_id, present - 1 AS present FROM scr_ed_links_singles WHERE present > 0 ) SELECT page_id, text, link_id FROM scr_ed_links_singles ''') mst_conn.execute(''' CREATE VIEW tsd_links AS WITH tsd_links_status AS ( SELECT DISTINCT link_id, last_value(link_type) OVER win AS link_type, last_value(status) OVER win AS status FROM his_links_status 
AS l, tsd AS s WHERE l.timestamp <= s.timestamp WINDOW win AS ( PARTITION BY link_id ORDER BY l.timestamp RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) ) SELECT link_id, url, page_id, anchor, link_type, status FROM mst_links LEFT JOIN tsd_links_status USING (link_id) WHERE link_id IN (SELECT DISTINCT link_id FROM tsd_ed_links) ''') mst_conn.execute(''' CREATE VIEW tsd_int_ref_cnt (page_id, referral_cnt) AS WITH -- wcm id's of the internal links per page int_page_links (referring_page_id, referral_cnt, linked_page_id) AS ( SELECT e.page_id, count(*), ifnull(l.page_id, s.wcm_id) FROM tsd_ed_links AS e LEFT JOIN mst_links AS l USING (link_id) LEFT JOIN tsd_redirs_to_wcm AS s ON iif(instr(l.url, '#'), substr(l.url, 1, instr(l.url, '#') - 1), l.url) = req_url WHERE (l.page_id OR s.wcm_id) GROUP BY e.page_id, e.text, e.link_id ) SELECT linked_page_id, sum(referral_cnt) FROM int_page_links -- next clause removes some spurious in-page links WHERE referring_page_id != linked_page_id GROUP BY linked_page_id ''') mst_conn.execute(''' CREATE VIEW tsd_pages_info AS WITH scr_pages AS ( SELECT DISTINCT page_id, last_value(alive) OVER win AS alive FROM his_pages_life AS p, tsd AS s WHERE p.timestamp <= s.timestamp WINDOW win AS ( PARTITION BY page_id ORDER BY p.timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) ), scr_last_info_ts (page_id, timestamp) AS ( SELECT DISTINCT page_id, last_value(i.timestamp) OVER win FROM his_pages_info AS i, tsd AS s WHERE i.timestamp <= s.timestamp WINDOW win AS ( PARTITION BY page_id ORDER BY i.timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) ) SELECT pi.* FROM scr_pages LEFT JOIN scr_last_info_ts AS li USING (page_id) LEFT JOIN his_pages_info AS pi USING (page_id, timestamp) WHERE alive ''') # Metrics db: created basic tables mst_conn.execute(''' CREATE TABLE mtrx.shd_paths ( page_id INTEGER NOT NULL UNIQUE, path TEXT NOT NULL UNIQUE, PRIMARY KEY (page_id) ) ''') mst_conn.execute(''' CREATE TABLE mtrx.shd_links ( link_id INTEGER PRIMARY KEY, url TEXT NOT NULL ) ''') mst_conn.execute(''' CREATE TABLE mtrx.dates ( date_id INTEGER UNIQUE, date TEXT NOT NULL UNIQUE, PRIMARY KEY (date_id AUTOINCREMENT) ) ''') # Metrics db: created tables to store analytic data mst_conn.execute(''' CREATE TABLE mtrx.daily ( date_id INTEGER NOT NULL, page_id INTEGER NOT NULL, nb_visits INTEGER, nb_hits INTEGER, entry_nb_visits INTEGER, entry_bounce_count INTEGER, bounce_rate REAL AS (round(CAST(entry_bounce_count AS REAL) / entry_nb_visits, 3)), exit_nb_visits INTEGER, exit_rate REAL AS (round(CAST(exit_nb_visits AS REAL) / nb_visits, 3)), organic_entries INTEGER, organic_entry_rate REAL AS (round(CAST(organic_entries AS REAL) / nb_visits, 3)), call_visits INTEGER, call_rate REAL AS (round(CAST(call_visits AS REAL) / nb_visits, 3)), PRIMARY KEY (date_id, page_id), FOREIGN KEY (date_id) REFERENCES dates, FOREIGN KEY (page_id) REFERENCES shd_paths ) ''') mst_conn.execute(''' CREATE TABLE mtrx.feedback ( date_id INTEGER NOT NULL, page_id INTEGER NOT NULL, pos_cnt INTEGER, neg_cnt INTEGER, pos_txt TEXT, neg_txt TEXT, PRIMARY KEY (date_id, page_id), FOREIGN KEY (date_id) REFERENCES dates, FOREIGN KEY (page_id) REFERENCES shd_paths ) ''') mst_conn.execute(''' CREATE TABLE mtrx.downloads ( date_id INTEGER NOT NULL REFERENCES dates, link_id INTEGER NOT NULL REFERENCES shd_links, nb_visits INTEGER, nb_hits INTEGER, PRIMARY KEY (date_id, link_id) ) ''') # Metrics db: create views to explicitly show referenced data mst_conn.execute(''' CREATE VIEW 
mtrx.daily_expl AS SELECT date, path, d.* FROM daily AS d NATURAL JOIN dates NATURAL JOIN shd_paths ''') mst_conn.execute(''' CREATE VIEW mtrx.feedback_expl AS SELECT date, path, f.* FROM feedback AS f NATURAL JOIN dates NATURAL JOIN shd_paths ''') mst_conn.execute(''' CREATE VIEW mtrx.downloads_expl AS SELECT date, url, dl.* FROM downloads AS dl NATURAL JOIN dates NATURAL JOIN shd_links ''') def scrape_site() -> str | None: """ **Scrape www.belastingdienst.nl.** Returns: timestamp of the scrape [yymmdd-hhmm] or None in case a 'scrape_in_progress.flag' is encountered One complete scrape is done of `www.belastingdienst.nl` and stored in the scrape tables of the scrapes database (documented with the `create_scrapes_storage` function) and a zip-file with pages sources. The scrape tables will contain `page_id`'s of the scraped pages, redirects that occurred while retrieving the pages, detailed info per page, use of links from editorial content and figures representing the site status. After finishing the scrape and completing the scrape tables, the result of the scrape is added to the history tables of the scrapes database. Further process and status information of the scrape is retained in the next files: - log file with info, warnings and/or errors of the scrape - copies of the 'robots.txt' and 'sitemap.xml' files if they are used to find unscraped pages and if they differ from the previously saved version In case the scraper operates in an environment with normal external access, all editorial links will be checked. Otherwise only links to (a few configured) trusted sites are checked. The outcome of all checks, executed or not, will be reflected in the *status* field of the *scr_links* and *his_links_status* tables of the scrapes database. """ global logger def ext_int_available() -> bool: """ **Tests whether external internet is available.** Returns: Availability. Only tests if www.google.nl can be reached. """ # noinspection PyBroadException try: ok = requests.get( 'https://www.google.nl', timeout=5).status_code == 200 except Exception: ok = False return ok # Some preliminaries ts = time.strftime('%y%m%d-%H%M') logger = logging.getLogger('scrape') log_dir = mst_dir / scrape_conf.log_dir_name log_file = str(log_dir / f'{ts}_{scrape_conf.log_name}') logger.setLevel(logging.INFO) fh = logging.FileHandler(log_file) fh.setLevel(logging.INFO) ch = logging.StreamHandler() ch.setLevel(logging.INFO) formatter = logging.Formatter( fmt='[%(asctime)s] %(levelname)-8s - %(message)s', datefmt='%Y-%m-%d %H:%M:%S') fh.setFormatter(formatter) ch.setFormatter(formatter) logger.addHandler(fh) logger.addHandler(ch) logger.info(f'scrape {ts} started') # Check if another scrape is running flag = mst_dir / 'SCRAPE_IN_PROGRESS.flag' if flag.exists(): ct = dt.datetime.fromtimestamp(flag.stat().st_ctime) logger.warning(f'Another scrape in progress since ' f'{ct.strftime("%a %d %b %Y / %H:%M")}') return '' else: flag.touch() # Load url's that should not be scraped qry_result = mst_conn.execute( 'SELECT path FROM non_scrape_paths').fetchall() urls_done = {ROOT_URL + path for (path,) in qry_result} # Initialisations empty_scr_tables() api_ok = matomo_available() ext_ok = ext_int_available() if not ext_ok: logger.warning('www.google.nl not available: ' 'assuming limited external access') urls_todo = {WWW_SITE} urls_scraped = 0 pages_saved = 0 unlinked_urls_checked = False mst_conn.execute(''' INSERT INTO mst_scrapes (timestamp, scraper, primers) VALUES (?, ?, ?) 
''', [ts, SCRAPER_VERSION, 'site url only']) start_time = time.time() # Do the scrape max_nr = 100_000 if scrape_conf.max_urls == 'all' else scrape_conf.max_urls src_zip = PageSourceZip(ts) while urls_todo and urls_scraped < max_nr: req_url = urls_todo.pop() wcm_url = scrape_page(req_url, urls_done, urls_todo, src_zip, ext_ok) urls_scraped += 1 if wcm_url: pages_saved += 1 # After initial link crawling, check any url's were missed if not urls_todo and not unlinked_urls_checked: logger.info(f'initial link crawling done: ' f'{urls_scraped} urls\'s visited / ' f'{pages_saved} pages saved') logger.info('finding url\'s that have not been crawled') urls_todo = find_unscraped_urls(ts, urls_done, historic=True, requested=api_ok) unlinked_urls_checked = True # Time cycles and print progress and prognosis num_todo = min(len(urls_todo), max_nr - urls_scraped) if urls_scraped % 25 == 0: page_time = (time.time() - start_time) / urls_scraped togo_time = int(num_todo * page_time) print(f'{urls_scraped:4} url\'s done, ' f'{page_time:.2f} secs per url / ' f'{num_todo:4} found todo, {togo_time // 60}:' f'{togo_time % 60:02} min to scrape them') elapsed = int(time.time() - start_time) # Important: do not alter the next logging text; the presence of this # text in the log file validates the scrape (ref. check_storage_integrity # function). logger.info(f'scrape finished in {elapsed // 60}:{elapsed % 60:02} min') logger.info(f'{urls_scraped} urls\'s visited / {pages_saved} pages saved') # Processing the scraped data post_process_scrape() extract_pages_info(ts) derive_pages_info() status_figures() add_scraped_data(ts) elapsed = int(time.time() - start_time) logger.info(f'scraping and processing data finished in ' f'{elapsed // 60}:{elapsed % 60:02} min') flag.unlink(missing_ok=True) return ts def empty_scr_tables() -> None: """ **Delete all content from scrape tables.** """ qry = ''' SELECT name FROM sqlite_schema WHERE type = 'table' AND name LIKE 'scr_%' ''' for name in [r[0] for r in mst_conn.execute(qry)]: mst_conn.execute(f'DELETE FROM {name}') def scrape_page( req_url: str, urls_done: Set[str], urls_todo: Set[str], page_zip: PageSourceZip, check_all_links: bool = True) -> Union[str, None]: """ **Scrape an html page and save the results to the scrape tables.** Arguments: req_url: url of requested page urls_done: url's that have been scraped already urls_todo: url's that still need to be scraped page_zip: archive to store scraped page sources check_all_links: when false, only trusted links will be checked Returns: canonical wcm url (or None for mentioned cases) Get the final response from the requested url while collecting any redirect that occurs during the request. Register the page under the canonical wcm path and save the page source to the zip-archive. Further store page-related data, editorial links and redirects to the scrape tables of the scrapes database (documented with the `create_scrapes_storage` function). The `urls_done` and `urls_todo` sets are updated accordingly, driving the further scrape of the site. When `check_all_links` is `True`, all links within the page that are external (do not start with `bd_www.constants.ROOT_URL`) will be checked. When set to `False`, only those external links will be checked that go to a site in the `trusted_domains` configuration parameter (refer to `bd_www.Config`). This option can be used when this module runs on a platform that (for reasons of security) has limited external site access. 
All url's are absolute, also in the resulting *scr_links*, *scr_redirs* and *scr_unlinked* tables. The `post_process_scrape` should be used to normalise these url's and determine the type of the unlinked pages. The returned canonical wcm url is the content attribute of the `<meta name="DCTERMS.identifier">` tag. This is the url as generated by the WCM system. The returned value will be `None` when the requested url - was already scraped (and saved) - redirects to an out-of-scope page - gave an unexpected response (which is logged) """ def check_ext_link( lnk: str, trusted_only: bool = False) -> tuple[str, str]: """ **Return type and response status of an external link.** Arguments: lnk: url to be checked trusted_only: when true, only links to trusted sites will will be checked; the returned response status will reflect this Returns: response status, link type The returned `response status` is the status code from the primary response upon requesting the given url. By default the returned `link type` will be 'ext', but when the response is a redirect, depending on the destination url, '-to-int' or '-to-ext' is added to this type. When requesting the url hits an error, it is returned via the `response status`. The sites to be trusted are taken from the `trusted_domains` configuration parameter in the [SCRAPE] section of the configuration file (see `bd_www.Config`). """ l_type = 'ext' if trusted_only: re_result = re.match(r'^(.*\.)?(.*\..*)$', urlparse(lnk).netloc) link_site = re_result[2] if re_result else None if link_site not in scrape_conf.trusted_domains: return 'unable to check link', l_type try: response = requests.get( lnk, allow_redirects=False, verify=False, timeout=5) except Exception as err: resp_status = f'error while checking link: {err}' else: resp_status = response.status_code if resp_status in (301, 302, 303, 307): location = response.headers['Location'] if location.startswith(ROOT_URL): l_type += '-to-int' else: l_type += '-to-ext' return str(resp_status), l_type def save_page(path: str, doc: str) -> Union[int, None]: """ **Save scraped page to the page source zip.** Arguments: path: relative to the ROOT_URL of the scrape doc: page source Returns: page_id or None if already saved If the page was not saved before, it will be saved with the page_id requested from the `mst_paths` table. A new page_id wil be created if needed. """ qry_res = mst_conn.execute( 'SELECT page_id FROM mst_paths WHERE path = ?', [path]).fetchone() if qry_res: p_id = qry_res[0] else: # This is a brand new page p_id = mst_conn.execute( 'INSERT INTO mst_paths (path) VALUES (?)', [path]).lastrowid if p_id in page_zip.page_ids(): # Page already saved during this scrape return None else: # Page not saved yet during this scrape page_zip.add_page(p_id, doc) mst_conn.execute('INSERT INTO scr_page_ids VALUES (?)', [p_id]) return p_id # Setup some variables page_as_string = '' soup = None redir_qry = ''' INSERT OR IGNORE INTO scr_redirs (req_url, redir_url, redir_type) VALUES (?, ?, ?) ''' # Request loop while True: # Keep requesting until no further rewrites or redirects occur, # while saving redirects to the temporary table and updating the # urls_done admin. # noinspection PyBroadException try: resp = requests.get(req_url, allow_redirects=False) except Exception: # The requester failed on this url; consider it done and continue. 
logger.error(f'failed to request {req_url}') urls_done.add(req_url) return None urls_done.add(req_url) # Check and handle status codes if resp.status_code == 200: pass elif resp.status_code in (301, 302): # This concerns a redirect response redir_url = resp.headers['Location'] # If request or redir is in scope, save the redirect with full # url's to the temporary table. if (req_url.startswith(ROOT_URL) or redir_url.startswith(ROOT_URL)): mst_conn.execute( redir_qry, [req_url, redir_url, resp.status_code]) if redir_url.startswith(ROOT_URL): # restart Request loop with redirected url req_url = redir_url continue else: # Redirect url is out of scope. return None else: # Returned status code is not 200, 301 or 302, so getting the page # did not succeed. Consider it done and return a None result. logger.error(f'unexpected response from {req_url}; ' f'status code is {resp.status_code}.') return None # Read and parse the response into a soup document page_as_string = resp.text soup = BeautifulSoup(page_as_string, features='lxml') # Has this page a client-side redirect by means of header tag # <meta http-equiv="refresh" content="0;url=...">? meta_tag = soup.find('meta', attrs={'http-equiv': 'refresh'}) if meta_tag: redir_url = meta_tag['content'].split('url=')[1] redir_url = urljoin(req_url, redir_url) # If request or redir is in scope, save the redirect with full # url's to the temporary table. if (req_url.startswith(ROOT_URL) or redir_url.startswith(ROOT_URL)): mst_conn.execute(redir_qry, [req_url, redir_url, 'client']) if redir_url.startswith(ROOT_URL): # Restart Request loop with redirected url req_url = redir_url continue else: # Redirect url is out of scope. return None # End of the Request loop. # Reaching this point means we're done requesting this page. break # Strip potential parameters and anchor in the response url resp_url = resp.url.split('?')[0].split('#')[0] # Determine the wcm url if resp_url.startswith(ROOT_URL): # Response url is in scope. Now the wcm url should be in the header # meta tag named 'DCTERMS.identifier'. meta_tag = soup.find( 'meta', attrs={'name': 'DCTERMS.identifier'}) if meta_tag: wcm_url = meta_tag['content'] wcm_url = urljoin(resp_url, wcm_url) if wcm_url != resp_url: # Save this as an alias in the redirects table mst_conn.execute(redir_qry, [resp_url, wcm_url, 'alias']) # Consider also the wcm_url as done urls_done.add(wcm_url) else: logger.error(f'page without wcm url; ' f'falling back to: {resp_url}') wcm_url = resp_url else: # Response url is not in scope. # Because of the logic from the code above this situation should # never occur. Since this is not destructive, scraping continues. logger.error(f'out of scope response from {req_url}') return None # Save in-scope page under the wcm path. # It is possible that a page in the urls_todo set redirects to a wcm_url # of which the page/path combination is saved already. Especially some # short url's cause this situation. Those cases are not saved (again). wcm_path = wcm_url.split(ROOT_URL)[1] page_id = save_page(wcm_path, page_as_string) if not page_id: # Page was saved and handled earlier during this scrape return None # Fetch all in-page links, save the editorial links to the scrape db and # add links of in-scope pages to the urls_todo set. 
# Cycle over the editorial, automated en rest (partial) trees of the page for tree in split_tree(soup): editorial = True if tree.html['tree'] == 'editorial' else False # Cycle over all links in the tree for a_tag in tree.find_all('a', href=True): # Unquote the link to be sure (history taught this) link = unquote(a_tag['href'].strip()) link_text = a_tag.text.strip() # Some links are discarded if link.startswith('#'): # Plain #-links continue if link == '/': # Home page links continue if 'readspeaker' in link or 'adobe' in link or 'java' in link: # Non content links continue link = urljoin(resp_url, link) # Separate anchor from in-scope url if link.startswith(ROOT_URL) and '#' in link: link, _, anchor = link.partition('#') else: anchor = None if link == resp_url: # Discard in-page link (in case it was not a plain #-link) continue if link and editorial: link_type = None link_status = None # Check if the link is new for this scrape qry = f''' SELECT link_nr FROM scr_links WHERE url = '{link}' AND anchor IS {f"'{anchor}'" if anchor else 'NULL'} ''' qry_result = mst_conn.execute(qry).fetchone() if qry_result: # This link was already handled during this scrape link_nr = qry_result[0] else: if re.match('^https?://', link): # Get type and status of the link if link.startswith(ROOT_URL): link_type = 'int' else: # The link is external link_status, link_type = check_ext_link( link + '#' + anchor if anchor else link, not check_all_links) # If an external link redirects to an internal # page, it should be scraped too. This is true for # short url's, since these are strictly external # relative to the root url. if link_type == 'ext-to-int': if link not in urls_done: urls_todo.add(link) # Store the link for this scrape link_nr = mst_conn.execute(''' INSERT INTO scr_links (url, anchor, link_type, status) VALUES (?, ?, ?, ?) ''', [link, anchor, link_type, link_status]).lastrowid # Save link in relation to the page (can be a duplicate if # the link is used more than once with the same text in the # same page). mst_conn.execute(''' INSERT INTO scr_ed_links (page_id, text, link_nr) VALUES (?, ?, ?) ''', [page_id, link_text, link_nr]) if link: # Discard links that do not need to be scraped if link.endswith('.xml'): continue if not link.startswith(ROOT_URL): # Link not in scope continue # Strip potential parameters (anchor was separated from # in-scope url before). valid_scrape_url = link.split('?')[0] if valid_scrape_url in urls_done: # Already handled continue urls_todo.add(valid_scrape_url) else: # This condition is probably not possible logger.warning(f'empty link in page {page_id}') return wcm_url def split_tree( soup: BeautifulSoup) -> (BeautifulSoup, BeautifulSoup, BeautifulSoup): """ **Split a soup doc in three separate docs: editorial, automated and rest.** Arguments: soup: bs4 representation of the page to be split Returns: editorial doc, automated doc, rest doc The *editorial doc* is a copy of the originating doc with all tags removed that do not contain editorial content. The *automated doc* is a new html (root) tag containing all tags with automated content. The *rest doc* is an html (root) tag containing all tags that did not represent editorial or automated content. No tags are lost, so all three docs together contain the same tags as the originating doc, but only the hierarchical structure of the *editorial doc* remains intact. To identify the separate docs, a `tree` attribute is added to the `<html>` tag with values 'editorial', 'automated' and 'rest' respectively. 
More specifically, the three docs will contain the next tags from the originating doc. ***rest doc***: - `<head>` - `<header>` - `<footer>` - `<div id="bld-nojs">`: for situation that javascript is not active - `<div class="bld-subnavigatie">`: left side navigation of bib pages - `<div class="bld-feedback">`: bottom feedback of content page - `<div>` with readspeaker buttons - `<div>` with modal dialog for the virtual assistant - `<script>` en `<noscript>` - comments ***automated doc***: - `<div class="add_content">`: automatically added content - all remaining tags from pages with type bld-overview ***editorial doc***: - all tags that do not go to one of the other docs """ # Make working copy from soup doc, because of destructive tag removal tree = copy.copy(soup) # Create rest doc as html trunk rst_tree = BeautifulSoup('<html tree="rest"></html>', features='lxml') # All that is not needed for the editorial or automated trees goes to rest: # - head, header and footer if tree.head: rst_tree.html.append(tree.head.extract()) if tree.body.header: rst_tree.html.append(tree.body.header.extract()) if tree.body.footer: rst_tree.html.append(tree.body.footer.extract()) # - content for non active javascript div_nojs = tree.find('div', id='bld-nojs') if div_nojs: rst_tree.html.append(div_nojs.extract()) # - sub-navigation div_subnav = tree.find(class_='bld-subnavigatie') if div_subnav: rst_tree.html.append(div_subnav.extract()) # - feedback div_feedback = tree.find(class_='bld-feedback') if div_feedback: rst_tree.html.append(div_feedback.extract()) # - readspeaker buttons for tag in tree.find_all('div', class_='rs_skip'): rst_tree.html.append(tag.extract()) # - modal dialog for the virtual assistant for tag in tree.find_all('div', id='vaModal'): rst_tree.html.append(tag.extract()) # - comments for tag in tree.find_all(string=lambda text: isinstance(text, Comment)): rst_tree.html.append(tag.extract()) # - scripts for tag in tree.find_all(('script', 'noscript')): rst_tree.html.append(tag.extract()) # - no-support div for tag in tree.find_all(id='bld-nosupport'): rst_tree.html.append(tag.extract()) # Is the page generated without any editor intervention? if tree('body', attrs={'data-pagetype': 'bld-overview'}): # Then editor tree becomes trunk only ed_tree = BeautifulSoup( '<html tree="editorial"></html>', features='lxml') # And automated tree gets all remaining content aut_tree = tree aut_tree.html['tree'] = 'automated' else: # Make new html trunk and graft all automated content on it aut_tree = BeautifulSoup( '<html tree="automated"></html>', features='lxml') for tag in tree.find_all('div', class_='content_add'): aut_tree.html.append(tag.extract()) # Editorial tree gets all remaining content ed_tree = tree ed_tree.html['tree'] = 'editorial' return ed_tree, aut_tree, rst_tree def find_unscraped_urls( ts: str, urls_done: Set[str], historic: bool = True, requested: bool = False) -> Set[str]: """ **Get url's that were not found via link crawling.** Arguments: ts: timestamp of the scrape [yymmdd-hhmm] urls_done: url's that have been scraped already historic: check all url's that were ever scraped before requested: check all url's that were requested during the last 8 days Returns: originating url's that responded with internal url's that were not scraped yet When scraping via link crawling is exhausted, this function should be used to find url's of pages that do exist, but were not scraped yet. 
The next url sets will be verified (partially depending on parameters from the configuration file; see further `bd_www.Config`): - latest unlinked url's (depending on `use_unlinked_urls` configuration parameter) - url's from the sitemap(s), located via the 'robots.txt' file (depending on `use_sitemap` configuration parameter) - url's that were scraped during all previous scrapes - url's that were requested during the last eight days (including today) - url's encountered during earlier scrapes that redirected to internal pages (implicitly including short url's) The 'robots.txt' and 'sitemap.xml' files that are used, will be saved in case they differ from the previously saved version. The (fully qualified) response url's of the non scraped pages are stored in the *scr_unlinked* table of the scrapes database. """ def save_scrape_file(content: str, name: str, dest_dir: Path) -> None: """ **Save content to file if it differs from the most recent version.** Arguments: content: content to be saved name: name of the file to save to dest_dir: directory to save in The name with which the content is saved will be preceded by the timestamp and an underscore character. The content will also be saved if `dest_dir` refers to an empty directory. """ dir_files = sorted(dest_dir.iterdir()) if dir_files: with open(dir_files[-1]) as lf: last_content = lf.read() if not dir_files or content != last_content: with open(dest_dir / f'{ts}_{name}', 'w') as nf: nf.write(content) def living(all_urls: Set[str], set_name: str) -> Set[str]: """ **Return and register url's of living pages that were not scraped yet.** Arguments: all_urls: url's to check set_name: identification used for logging Returns: originating url's that responded with living pages Apart from returning the set of originating url's that responded with an active internal page that was not scraped yet, the respons url's of these pages are stored in the `scr_unlinked` table of the scrapes database. Duration and results be logged. 
""" start = time.time() missed_urls = set() for u in all_urls - urls_done: # noinspection PyBroadException try: rsp = requests.get(u) except Exception: continue if rsp.status_code // 100 != 4 and rsp.url.startswith(ROOT_URL): # Url is active and responds with an internal page missed_urls.add(u) if rsp.url not in urls_done: mst_conn.execute( 'INSERT OR IGNORE INTO scr_unlinked (url) ' 'VALUES (?)', [rsp.url]) logger.debug(f"unlinked {set_name} url's: {rsp.url}") # Wrapping up secs = int(time.time() - start) msg = 'link crawling ' if missed_urls: msg += f"missed {len(missed_urls)} {set_name} url's" else: msg += f"did not miss any {set_name} url's" msg += f'; checking took {secs // 60}:{secs % 60:02} min' logger.info(msg) return missed_urls still_to_scrape = set() # Verify previously unlinked url's if scrape_conf.use_unlinked_urls: qry = f''' WITH last_unlinked_pages AS ( SELECT DISTINCT page_id, last_value(unl_type) OVER ( PARTITION BY page_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) AS unl_type FROM his_unlinked ), last_unlinked_urls AS ( SELECT DISTINCT url, last_value(unl_type) OVER ( PARTITION BY url ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) AS unl_type FROM his_unlinked ) SELECT '{ROOT_URL}' || path FROM last_unlinked_pages left JOIN mst_paths USING (page_id) WHERE path NOTNULL AND unl_type != 'absent' UNION SELECT url FROM last_unlinked_urls WHERE url NOTNULL AND unl_type != 'absent' ''' unlinked_urls = {r[0] for r in mst_conn.execute(qry).fetchall()} still_to_scrape |= living(unlinked_urls, 'unlinked') # Verify url's from the sitemap(s) if scrape_conf.use_sitemap: # Find the sitemap via robots.txt robots_url = WWW_SITE + '/robots.txt' response = requests.get(robots_url) if response.status_code != 200: logger.info( f'getting {robots_url} returned status {response.status_code}') else: robots_txt = response.text robots_dir = mst_dir / scrape_conf.robots_dir_name save_scrape_file(robots_txt, 'robots.txt', robots_dir) sitemap_urls = re.findall(r'Sitemap:\s(https?://\S*)', robots_txt) if len(sitemap_urls) == 0: logger.info('no sitemap.xml declared in robots.txt') else: urls_from_sitemaps = set() sitemap_dir = mst_dir / scrape_conf.sitemap_dir_name for i, url in enumerate(sitemap_urls, start=1): response = requests.get(url) if response.status_code != 200: logger.info(f'getting {url} returned status ' f'{response.status_code}') else: logger.info(f'sitemap found at {url}') if len(sitemap_urls) == 1: filename = f'sitemap.xml' else: filename = f'sitemap({i}).xml' save_scrape_file(response.text, filename, sitemap_dir) soup = BeautifulSoup(response.text, features='lxml') urls_from_sitemaps |= {loc.text for loc in soup.findAll('loc')} still_to_scrape |= living(urls_from_sitemaps, 'sitemap') # Verify all historic url's if historic: qry_result = mst_conn.execute( 'SELECT path FROM mst_paths').fetchall() historic_urls = {ROOT_URL + path for (path,) in qry_result} still_to_scrape |= living(historic_urls, 'historic') # Verify url's that were requested during the last eight days if requested: # noinspection PyBroadException # This try clause is used since Matomo is very unpredictable try: requested_urls = urls_last_days() except Exception: logger.critical('using urls_last_days() failed') else: # Check which do exist but are not crawled yet still_to_scrape |= living(requested_urls, 'visited') # Verify all previously requested url's that initiated redirects qry_result = mst_conn.execute( 'SELECT DISTINCT req_url FROM 
his_redirs').fetchall() redirected_urls = {url for (url,) in qry_result} still_to_scrape |= living(redirected_urls, 'redirected') return still_to_scrape def post_process_scrape() -> None: """ **Update tables after finishing a scrape.** After scraping all pages using the `scrape_page` function, this `post_process_scrape` function needs to be used to updates some tables in the scrapes database. This is essential before further processing can be done of the scraped data. This post-processing consists of two steps: **Step 1** - Url normalisation Convert the full url's in the *scr_links*, *scr_redirs* and *scr_unlinked* tables to `page_id`'s where possible. ***Remark:*** *This can result in `page_id`'s of url's/pages that were not alive during the scrape (especially in the `scr_links` table).* **Step 2** - Typing unlinked pages Determine the type of the pages in the *scr_unlinked* table by setting *unl_type* to: - 'alias' - when the page is an alias to a wcm page - 'special' - when the id is in the *non_orphan_ids* table of the scrapes database - 'orphan' - in all other cases **Finally** Because some of the transactions impact the size of the scrapes database, it is vacuumed before returning. """ # Replace url's in the scr_links table with page_id's where possible and # combine separated anchors with remaining urls (during the scrape anchors # were separated form internal url's). mst_conn.execute(f''' UPDATE scr_links SET page_id = ( SELECT page_id FROM mst_paths WHERE url = '{ROOT_URL}' || path), url = NULL WHERE url IN (SELECT '{ROOT_URL}' || path FROM mst_paths) ''') mst_conn.execute(''' UPDATE scr_links SET url = url || '#' || anchor, anchor = NULL WHERE url NOTNULL AND anchor NOTNULL ''') # Replace request and redirect url's in the redirs table with page_id's # where possible. mst_conn.execute(f''' UPDATE scr_redirs SET req_id = ( SELECT page_id FROM mst_paths WHERE req_url = '{ROOT_URL}' || path), req_url = NULL WHERE req_url IN (SELECT '{ROOT_URL}' || path FROM mst_paths) ''') mst_conn.execute(f''' UPDATE scr_redirs SET redir_id = ( SELECT page_id FROM mst_paths WHERE redir_url = '{ROOT_URL}' || path), redir_url = NULL WHERE redir_url IN (SELECT '{ROOT_URL}' || path FROM mst_paths) ''') # Replace url's in the 'scr_unlinked' table with page_id's where possible # and qualify the unlinked url's. 
mst_conn.execute(f''' UPDATE scr_unlinked SET page_id = ( SELECT page_id FROM mst_paths WHERE url = '{ROOT_URL}' || path), url = NULL WHERE url IN (SELECT '{ROOT_URL}' || path FROM mst_paths) ''') mst_conn.execute(''' WITH -- Url-based redir chains redir_chains (org_url, prev_url, next_id, next_url, next_type) AS ( SELECT req_url, NULL, redir_id, redir_url, redir_type FROM scr_redirs UNION ALL SELECT org_url, next_url, r.redir_id, r.redir_url, r.redir_type FROM redir_chains AS c LEFT JOIN scr_redirs AS r ON c.next_url = r.req_url -- avoid endless loops by excluding redirs of an alias WHERE c.next_type <> 'alias' ), -- All redir chains ending in a wcm url scr_redirs_to_wcm (req_url, wcm_id, final_type) AS ( SELECT org_url, next_id, next_type FROM redir_chains WHERE next_id IS NOT NULL AND org_url IS NOT NULL ) UPDATE scr_unlinked SET unl_type = iif(page_id IN non_orphan_ids, 'special', iif((SELECT final_type FROM scr_redirs_to_wcm WHERE url = req_url) = 'alias', 'alias', 'orphan' ) ) ''') logger.info('scr_links, scr_redirs and scr_unlinked tables ' 'have been normalised') # Finally mst_conn.execute('VACUUM') def extract_pages_info(ts: str) -> None: """ **Renew content of *scr_pages_info* table with extracted information.** Arguments: ts: timestamp of the scrape [yymmdd-hhmm] After deleting any previous content from the *scr_pages_info*, this function fills the 'extracted' data fields of this table for each scraped page with the next content: - *title* - content of `<title>` tag - *description* - content of `<meta name="description" content="..." />` - *num_h1s* - number of `<h1>` tags - *first_h1* - content of first `<h1>` tag - *language* - content of `<meta name="language" content="xx" />` - *modified* - content of `<meta name="DCTERMS.modified" content="..." />` - *pagetype* - attribute value of `<body data-pageType="...">` - *classes* - attribute value of `<body class="...">` - *ed_text* - editorial text of the page - *aut_text* - automated text of the page The *scr_pages_info* table accommodates additional fields to contain derived information for each page. This is further detailed in the `derive_pages_info` function. It will be logged when tags or attributes are missing or values are invalid. ***Data note:*** *Extracted information concerns data that is readily available within the page source. Storing this data in a separate table is strictly redundant, but serves faster access.* ***Implementation note:*** *The fields of the* scr_pages_info *table are defined by the constants `EXTRACTED_FIELDS` and `DERIVED_FIELDS`. 
Although some groundwork is laid to adapt the number and content of these fields, the actual code does not fully support such an adaptation.* """ logger.info('extracting pages info started') src_zip = PageSourceZip(ts) num_pages = len(src_zip.page_ids()) start_time = time.time() mst_conn.execute('DELETE FROM scr_pages_info') # Cycle over all pages for page_num, (page_id, doc) in enumerate(src_zip.iter_pages(), start=1): qry = f'SELECT path FROM mst_paths WHERE page_id = {page_id}' path = mst_conn.execute(qry).fetchone()[0] soup = BeautifulSoup(doc, features='lxml') info = {'page_id': page_id} # Get title title = soup.title if not title: logger.warning(f'page has no <title> tag: {path}') title = None else: title = title.text if not title: logger.warning(f'page with empty title: {path}') info['title'] = title # Get description description = soup.find(attrs={'name': 'description'}) if not description: # There are very much occurrences of this situation logger.debug( f'page has no <meta name="description"/> tag: {path}') description = None else: description = description['content'] if not description: logger.warning(f'page with empty description: {path}') info['description'] = description # Get info from <h1> tags h1s = [] for h1 in soup.find_all('h1'): h1s.append(h1.text) if not h1s: # h1 tag can be hidden in a script (e.g. search page) if h1s := re.findall(r'<h1>(.*)</h1>', doc): logger.warning(f'page with hidden h1-tag: {path}') else: logger.warning(f'page without h1: {path}') info['num_h1s'] = len(h1s) info['first_h1'] = h1s[0] if h1s else None # Get language language = soup.find('meta', attrs={'name': 'language'}) if not language: logger.warning( f'page has no <meta name="language"/> tag: {path}') language = None else: language = language['content'] if not language: logger.warning(f'page with empty language: {path}') info['language'] = language # Get date modified modified = soup.find('meta', attrs={'name': 'DCTERMS.modified'}) if not modified: logger.warning( f'page has no tag <meta name="DCTERMS.modified"/>: {path}') modified = None else: try: modified = dt.date.fromisoformat(modified['content']) except ValueError: logger.warning( f'page with improper modification date: {path}') modified = None info['modified'] = modified # Get type of page if 'data-pagetype' not in soup.body.attrs: logger.warning('page has no data-pagetype attribute in the ' f'<body> tag: {path}') pagetype = None else: pagetype = soup.body['data-pagetype'] if not pagetype: logger.warning( f'page with empty pagetype in <body> tag: {path}') info['pagetype'] = pagetype # Get classes if 'class' not in soup.body.attrs: logger.warning( f'page has no class attribute in the <body> tag: {path}') classes = None else: classes = soup.body['class'] if not classes: logger.warning( f'page with empty class in <body> tag: {path}') info['classes'] = ' '.join(classes) if classes else None # Get editorial and automated texts info['ed_text'], info['aut_text'] = get_text(soup) # add info to the database fields = ', '.join(info) q_marks = ('?, ' * len(info))[:-2] mst_conn.execute( f'INSERT INTO scr_pages_info ({fields}) VALUES ({q_marks})', list(info.values())) # print progress and prognosis if page_num % 250 == 0: page_time = (time.time() - start_time) / page_num togo_time = int((num_pages - page_num) * page_time) print( f'extracting pages info ' f'- togo: {num_pages - page_num} pages / ' f'{togo_time // 60}:{togo_time % 60:02} min') logger.info('extracting pages info finished') def get_text(soup: BeautifulSoup) -> [str, str]: """ 
**Retrieve essential editorial and automated text content of a page.** Arguments: soup: bs4 representation of a page Returns: editorial text, automated text The editorial and automated text of the page content is returned together as a twofold tuple. Basically the relevant texts are retrieved from partial trees containing only tags with editorial or automated content respectively. Whitespace within these texts is normalised and coherent chunks are separated by newlines. """ result = [] for tree in split_tree(soup): if tree.html['tree'] == 'rest': continue flatten_tagbranch_to_navstring(tree.html) # Replace non-breaking spaces with normal ones txt = tree.text.replace(b'\xc2\xa0'.decode(), ' ') # Substitute one space for any cluster of whitespace chars (getting rid # of returns, newlines, tabs, spaces, etc.; this is html, you know!). txt = re.sub(r'\s+', ' ', txt) # Change #br# markers (introduced while flattening the branches) to # newlines, while reducing multiples separated by whitespace only. # The final strip() removes potential trailing newlines. txt = re.sub(r'\s*(#br#\s*)+\s*', r'\n', txt).strip() result.append(txt) tree.decompose() return result def flatten_tagbranch_to_navstring(tag: Tag) -> None: """ **Reduce a complete tag branch to one NavigableString.** Arguments: tag: part of BeautifulSoup structure that will be reduced Returns: None (tag is replaced in place with one NavigableString) The reduction is realised within the BeautifulSoup data structure that the tag is part of. This means that the function replaces the tag branch (in place) into a single NavigableString containing all text of the complete tag branch. The function uses a recursive tree traversal algorithm with a NavigableString as leaf. Each instance of the function will combine the text content of all children into one NavigableString. Within this string all `<br>` tags are replaced by `#br#` markers. The text content of all former `<p>`, `<h1>`, `<h2>`, `<h3>`, `<li>` and `<div>` tags in the tag branch is enclosed between two `#br#` markers in the resulting NavigableString. As such `#br#` markers act as separators between logical chunks of text. Due to the recursive flattening process the resulting NavigableString may contain more consecutive `#br#` markers. Since lay-out is of no concern, this has no significance however. """ # Final leaf cases; done with this branch if type(tag) in {NavigableString, Stylesheet}: return # Has this tag children other then NavigableStrings? tag_children = list(tag.children) child_types = {type(c) for c in tag_children} if tag_children and child_types != {NavigableString}: # Flatten recursively all child branches to NavigableStrings for c in tag_children: # noinspection PyTypeChecker flatten_tagbranch_to_navstring(c) # At this point all children (if any) of tag are NavigableStrings tag_name = tag.name if tag_name == 'br': tag.replace_with('#br#') elif tag_name == 'a': tag.replace_with(f' {tag.text}') # The leading space is significant elif tag_name in {'p', 'h1', 'h2', 'h3', 'li', 'div'}: tag_text = tag.text tag.replace_with(f'#br#{tag_text}#br#') else: tag.replace_with(tag.text) return def derive_pages_info() -> None: """ **Add derived information for all pages to the *scr_pages_info* table.** Derived information as such is not available within a page, but calculated or interpreted from other information. To derive this information, the extracted information should already be available in the *scr_pages_info* table. 
This can be accomplished by using the `extract_pages_info` function. After deleting any previous content from he *derived* data fields of the *scr_pages_info* table, next new content is added for each of these fields: - *business* - 'belastingen', 'toeslagen' or 'douane', depending on the extracted *classes* field - *category* - 'dv', 'bib' or 'alg', depending on the extracted *pagetype* field, mapped on basis of the constants `DV_TYPES`, `BIB_TYPES` and `ALG_TYPES`; the *category* of the 'bld_wrapper' *pagetype* however is determined on the *category* of the majority of the pages that refer to the specific wrapper page It will be logged when info can not be derived due to inconsistent or unavailable information. """ # Clear the derived field contents set_cols = ', '.join([f'{f[0]} = NULL' for f in DERIVED_FIELDS]) mst_conn.execute(f'UPDATE scr_pages_info SET {set_cols}') # Prepare some queries bus_qry = 'UPDATE scr_pages_info SET business = ? WHERE page_id = ?' cat_qry = 'UPDATE scr_pages_info SET category = ? WHERE page_id = ?' for_qry = ''' SELECT page_id, pagetype, classes FROM scr_pages_info ORDER BY CASE pagetype WHEN 'bld-wrapper' THEN 2 ELSE 1 END ''' cat_groups_qry = ''' SELECT category FROM scr_ed_links LEFT JOIN scr_links AS l USING (link_nr) LEFT JOIN scr_pages_info USING (page_id) WHERE l.page_id = ? GROUP BY category ''' wrappers_without_cat = set() logger.info('deriving pages info started') # Cycle over all pages, with wrapper pages after all others, because # the wrapper category is determined by the categories of pages # linking to that wrapper page. for page_id, pagetype, classes in mst_conn.execute(for_qry): # Determine business if classes: if 'toeslagen' in classes or 'tsl' in classes: business = 'toeslagen' elif 'dou' in classes: business = 'douane' else: business = 'belastingen' else: business = None mst_conn.execute(bus_qry, [business, page_id]) # Determine category: dv, bib or alg if pagetype in DV_TYPES: category = 'dv' elif pagetype in BIB_TYPES: category = 'bib' elif pagetype in ALG_TYPES: category = 'alg' elif pagetype == 'bld-wrapper': # Group categories of all pages that link to this wrapper page categories = mst_conn.execute( cat_groups_qry, [page_id]).fetchall() if len(categories) == 1: # All pages linking to this wrapper have the same category category = categories[0][0] if not category: # Probably because other uncategorized wrappers are # linking to this one. Save its page_id for # post-processing. wrappers_without_cat.add(page_id) else: category = 'alg' else: category = None mst_conn.execute(cat_qry, [category, page_id]) # The main reason that wrappers do not get a category in the main loop, # is that pages that link to these wrappers are wrappers themselves that # had no category yet when asked for in the main loop. Repeating the loop # for the category of these pages will resolve most of them. To avoid # endless looping when wrappers without categories are linking to each # other, the number of cycles is maximized (crude but effective). 
max_cycles = len(wrappers_without_cat) * 3 cycle = 0 while wrappers_without_cat and cycle < max_cycles: cycle += 1 page_id = wrappers_without_cat.pop() # Algorithm for category resolution is the same as in the main loop categories = mst_conn.execute(cat_groups_qry, [page_id]).fetchall() if len(categories) == 1: category = categories[0][0] if not category: wrappers_without_cat.add(page_id) continue else: category = 'alg' mst_conn.execute(cat_qry, [category, page_id]) # Log the remaining cases and set their category to 'alg' to avoid NULLs # in the database. for page_id in wrappers_without_cat: path = mst_conn.execute(f'SELECT path FROM mst_paths ' f'WHERE page_id = {page_id}').fetchone()[0] logger.warning( f"unresolved category of wrapper set to 'alg': {path}") mst_conn.execute(cat_qry, ['alg', page_id]) logger.info('deriving pages info finished') def status_figures() -> None: """ **Renew typical scrape status figures of the *scr_status_figures* table.** After deleting any previous content from the *scr_status_figures* table, next number of pages, redirs, url-aliases or ed-links will be added to this table: - pages: all pages - pages_lang_`<language>`: per *language* - pages_buss_`<business>`: per *business* - pages_cat_`<category>`: per *category* - pages_type_`<pagetype>`: per *pagetype* - pages_h1_multi: with more than one h1-tag - pages_h1_multi_`<pagetype>`: with more than one h1-tag per *pagetype* - pages_h1_no: without h1-tag - pages_title_no: without or with empty title-tag - pages_title_dupl: with non-unique title-tag - pages_descr_no: without or with empty description meta-tag - pages_descr_long: with description meta-tag longer than 160 characters - redirs: total number of all redirects - redirs_`<type>`: number of redirects per *redir_type* - redirs_`<type>`_slash: redirects per *redir_type* with only differing a slash - redirs_wcm-url: wcm url's that get redirected - url-aliases: url's that alias an wcm url - url-aliases_`<num>`x: url's with `<num>` aliases - ed-links_`<int|ext>`: internal|external editorial links - ed-links_`<int|ext>`_uniq: unique internal|external editorial links - ed-links_`<int|ext>`_avg: internal|external editorial links per page - ed-links_int_redir: redirected internal editorial links - ed-links_int_non-wcm: internal editorial links to non-wcm url's - ed-links_int_orphan-pages: number of unlinked (i.e. 
orphan) pages """ mst_conn.execute('DELETE FROM scr_status_figures') ins_qry = 'INSERT INTO scr_status_figures (name, value) VALUES (?, ?)' # Total pages qry = ''' SELECT count(*) FROM scr_pages_info ''' num_pages = mst_conn.execute(qry).fetchone()[0] mst_conn.execute(ins_qry, ['pages', num_pages]) # Pages per language qry = ''' SELECT language, count(*) FROM scr_pages_info GROUP BY language ORDER BY language DESC ''' for language, count in mst_conn.execute(qry).fetchall(): mst_conn.execute(ins_qry, [f'pages_lang_{language}', count]) # Pages per business qry = ''' SELECT business, count(*) FROM scr_pages_info GROUP BY business ORDER BY business ''' for business, count in mst_conn.execute(qry).fetchall(): mst_conn.execute(ins_qry, [f'pages_buss_{business}', count]) # Pages per category qry = ''' SELECT category, count(*) FROM scr_pages_info GROUP BY category ORDER BY business DESC ''' for category, count in mst_conn.execute(qry).fetchall(): mst_conn.execute(ins_qry, [f'pages_cat_{category}', count]) # Pages per type qry = ''' SELECT pagetype, count(*) FROM scr_pages_info GROUP BY pagetype ORDER BY category DESC, count(*) ASC ''' for pagetype, count in mst_conn.execute(qry).fetchall(): mst_conn.execute(ins_qry, [f'pages_type_{pagetype}', count]) # Pages with more than one h1's qry = ''' SELECT count(*) FROM scr_pages_info WHERE num_h1s > 1 ''' mst_conn.execute(ins_qry, ['pages_h1_multi', mst_conn.execute(qry).fetchone()[0]]) # Pages per type with more than one h1's qry = ''' SELECT pagetype, count(*) FROM scr_pages_info WHERE num_h1s > 1 GROUP BY pagetype ''' for pagetype, count in mst_conn.execute(qry).fetchall(): mst_conn.execute(ins_qry, [f'pages_h1_multi_{pagetype}', count]) # Pages with no h1 qry = ''' SELECT count(*) FROM scr_pages_info WHERE num_h1s = 0 ''' mst_conn.execute(ins_qry, ['pages_h1_no', mst_conn.execute(qry).fetchone()[0]]) # Pages without title qry = ''' SELECT count(*) FROM scr_pages_info WHERE title = '' or title is NULL ''' mst_conn.execute(ins_qry, ['pages_title_no', mst_conn.execute(qry).fetchone()[0]]) # Pages with non unique title qry = ''' WITH title_freq AS ( SELECT count(*) as c FROM scr_pages_info GROUP BY title ) SELECT ifnull(sum(c), 0) FROM title_freq WHERE c > 1 ''' mst_conn.execute(ins_qry, ['pages_title_dupl', mst_conn.execute(qry).fetchone()[0]]) # Pages without description qry = ''' SELECT count(*) FROM scr_pages_info WHERE description = '' OR description IS NULL ''' mst_conn.execute(ins_qry, ['pages_descr_no', mst_conn.execute(qry).fetchone()[0]]) # Pages with description longer than 160 characters qry = ''' SELECT count(*) FROM scr_pages_info WHERE length(description) > 160 ''' mst_conn.execute(ins_qry, ['pages_descr_long', mst_conn.execute(qry).fetchone()[0]]) # Total redirects (an alias is no redirect) qry = ''' SELECT count(*) FROM scr_redirs WHERE redir_type != 'alias' ''' mst_conn.execute(ins_qry, ['redirs', mst_conn.execute(qry).fetchone()[0]]) # Redirects per type qry = ''' SELECT redir_type, count(*) FROM scr_redirs WHERE redir_type != 'alias' GROUP BY redir_type ''' for redir_type, count in mst_conn.execute(qry).fetchall(): mst_conn.execute(ins_qry, [f'redirs_{redir_type}', count]) # Redirects per type that only add or loose the last slash qry = ''' SELECT redir_type, count(*) FROM scr_redirs WHERE req_url || '/' = redir_url OR req_url = redir_url || '/' GROUP BY redir_type ''' for redir_type, count in mst_conn.execute(qry).fetchall(): mst_conn.execute(ins_qry, [f'redirs_{redir_type}_slash', count]) # Scrape wcm url's that are redirected qry 
= f''' SELECT count(*) FROM scr_redirs WHERE req_id IN scr_page_ids ''' mst_conn.execute(ins_qry, ['redirs_wcm-url', mst_conn.execute(qry).fetchone()[0]]) # Total aliases qry = ''' SELECT count(*) FROM scr_redirs WHERE redir_type = 'alias' ''' mst_conn.execute(ins_qry, ['url-aliases', mst_conn.execute(qry).fetchone()[0]]) # Frequency of aliases qry = f''' WITH redirs_full_urls AS ( SELECT ifnull(req_url, '{ROOT_URL}' || p1.path) AS req_url, ifnull(redir_url, '{ROOT_URL}' || p2.path) AS redir_url, redir_type FROM scr_redirs LEFT JOIN mst_paths AS p1 ON req_id = p1.page_id LEFT JOIN mst_paths AS p2 ON redir_id = p2.page_id ), alias_freq AS ( SELECT redir_url, count(*) AS aliases_per_url FROM redirs_full_urls WHERE redir_type = 'alias' GROUP BY redir_url ) SELECT aliases_per_url, count(*) FROM alias_freq GROUP BY aliases_per_url ''' for alias_per_url, count in mst_conn.execute(qry).fetchall(): mst_conn.execute(ins_qry, [f'url-aliases_{alias_per_url}x', count]) # Total, unique and average number of internal links qry = ''' SELECT count(*) FROM scr_ed_links LEFT JOIN scr_links USING (link_nr) WHERE link_type = 'int' OR link_type = 'ext-to-int' ''' ed_links_int = mst_conn.execute(qry).fetchone()[0] qry = ''' SELECT count(*) FROM scr_links WHERE link_type = 'int' OR link_type = 'ext-to-int' ''' ed_links_int_uniq = mst_conn.execute(qry).fetchone()[0] mst_conn.execute(ins_qry, ['ed-links_int', ed_links_int]) mst_conn.execute(ins_qry, ['ed-links_int_uniq', ed_links_int_uniq]) mst_conn.execute(ins_qry, ['ed-links_int_avg', round(ed_links_int / num_pages, 3)]) # Number of internal editorial links that are redirected. This count # includes wcm url's that are no longer alive but get redirected instead. qry = ''' SELECT count(*) FROM scr_ed_links LEFT JOIN scr_links AS l USING (link_nr) LEFT JOIN scr_redirs ON l.url IS req_url AND l.page_id IS req_id WHERE (redir_id NOTNULL OR redir_url NOTNULL) AND redir_type != 'alias' ''' links_redirected = mst_conn.execute(qry).fetchone()[0] mst_conn.execute(ins_qry, ['ed-links_int_redir', links_redirected]) # Number of internal editorial links that do not refer to a wcm url qry = ''' SELECT count(*) FROM scr_ed_links LEFT JOIN scr_links AS l USING (link_nr) LEFT JOIN scr_redirs ON l.url = req_url WHERE redir_type = 'alias' ''' links_non_wcm = mst_conn.execute(qry).fetchone()[0] mst_conn.execute(ins_qry, ['ed-links_int_non-wcm', links_non_wcm]) # Number of orphan pages qry = 'SELECT count(*) FROM scr_unlinked WHERE unl_type = "orphan"' num_orphans = mst_conn.execute(qry).fetchone()[0] mst_conn.execute(ins_qry, ['ed-links_int_orphan-pages', num_orphans]) # Total, unique and average number of external links qry = ''' SELECT count(*) FROM scr_ed_links LEFT JOIN scr_links USING (link_nr) WHERE link_type = 'ext' OR link_type = 'ext-to-ext' ''' ed_links_ext = mst_conn.execute(qry).fetchone()[0] qry = ''' SELECT count(*) FROM scr_links WHERE link_type = 'ext' OR link_type = 'ext-to-ext' ''' ed_links_ext_uniq = mst_conn.execute(qry).fetchone()[0] mst_conn.execute(ins_qry, ['ed-links_ext', ed_links_ext]) mst_conn.execute(ins_qry, ['ed-links_ext_uniq', ed_links_ext_uniq]) mst_conn.execute(ins_qry, ['ed-links_ext_avg', round(ed_links_ext / num_pages, 3)]) logger.info(f'status figures saved to scr_status_figures table') def add_scraped_data(ts: str) -> bool: """ **Add a fresh scrape to the scrapes history pages.** Arguments: ts: timestamp of the scrape [yymmdd-hhmm] Returns: success of the operation While scraping all pages using the `scrape_page` function, new pages that had no 
earlier existence have been registered in the *mst_paths* table. Apart from the scrape results that are stored as files, the essential remaining results of the scrape reside in the next tables of the scrapes database: - *scr_redirs* - *scr_links* - *scr_ed_links* - *scr_unlinked* Before adding the scraped results to the history tables, the data in these tables has to be normalised and completed using next functions in succession: - `post_process_scrape` - to normalise the *scr_redirs*, *scr_links* and *scr_unlinked* tables and qualify unlinked pages - `extract_pages_info` - to extract information from the page sources to the *scr_pages_info* table - `derive_pages_info` - to complete the content of the *scr_pages_info* table - `status_figures` - to refresh the *scr_status_figures* table After using this function all data of the scrape (as stored in the scrape tables) has been merged into the various history tables and new links have been added to the *mst_links* table. The scrape tables remain unaltered afterwards, but can be safely deleted/overwritten. This function can only be used for a scrape that has a later timestamp than any other scrape in history. Otherwise it will log an error and return unsuccessful. """ # Check if this scrape is later then the youngest in the history tables. qry = ''' SELECT name FROM sqlite_schema WHERE type = 'table' AND name LIKE 'his_%' ''' last_ok = True for table in [r[0] for r in mst_conn.execute(qry)]: max_ts = mst_conn.execute( f'SELECT max(timestamp) FROM {table}').fetchone()[0] if max_ts and max_ts > ts: last_ok = False logger.error( f'scrape {ts} predates latest history in the {table} table') if not last_ok: logger.error(f'scrape {ts} not added to history') return False logger.info('started adding scrape results to master and history tables') # Add scrape redirs results to the his_redirs table cte = ''' WITH last_redirs AS ( SELECT DISTINCT req_id, req_url, last_value(redir_id) OVER win AS redir_id, last_value(redir_url) OVER win AS redir_url, last_value(redir_type) OVER win AS redir_type FROM his_redirs WINDOW win AS ( PARTITION BY req_id, req_url ORDER BY timestamp RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) ) ''' # Save redirs that are new or have been changed mst_conn.execute(cte + f''' INSERT INTO his_redirs (timestamp, req_id, req_url, redir_id, redir_url, redir_type) SELECT '{ts}', s.* FROM scr_redirs AS s LEFT JOIN last_redirs AS m ON s.req_id IS m.req_id AND s.req_url IS m.req_url WHERE s.redir_id IS NOT m.redir_id OR s.redir_url IS NOT m.redir_url OR s.redir_type IS NOT m.redir_type ''') # Save redirs that disappeared with redir_type 'absent' mst_conn.execute(cte + f''' INSERT INTO his_redirs (timestamp, req_id, req_url, redir_type) SELECT '{ts}', m.req_id, m.req_url, 'absent' FROM last_redirs AS m LEFT JOIN scr_redirs AS s ON m.req_id IS s.req_id AND m.req_url IS s.req_url WHERE m.redir_type != 'absent' AND s.req_id ISNULL AND s.req_url ISNULL ''') logger.info('his_redirs table updated') # Add new links to the mst_links table... mst_conn.executescript(''' INSERT INTO mst_links (url, anchor, page_id) SELECT s.url, s.anchor, s.page_id FROM scr_links AS s LEFT JOIN mst_links AS m ON s.url IS m.url AND s.anchor IS m.anchor AND s.page_id IS m.page_id WHERE m.link_id ISNULL ''') # ... and save the link_id to the scr_links table for later use. 
mst_conn.execute(''' UPDATE scr_links AS s SET link_id = ( SELECT link_id FROM mst_links AS m WHERE m.url IS s.url AND m.page_id IS s.page_id AND m.anchor IS s.anchor ) ''') logger.info('mst_links table updated') # Add status for new links or links of which the status changed mst_conn.execute(f''' WITH last_links_status AS ( SELECT DISTINCT link_id, last_value(link_type) OVER win AS link_type, last_value(status) OVER win AS status FROM his_links_status WINDOW win AS ( PARTITION BY link_id ORDER BY timestamp RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) ) INSERT INTO his_links_status (timestamp, link_id, link_type, status) SELECT '{ts}', link_id, s.link_type, s.status FROM scr_links AS s LEFT JOIN last_links_status AS l USING (link_id) WHERE (l.link_id ISNULL OR s.link_type IS NOT l.link_type OR s.status IS NOT l.status) AND NOT (l.status NOTNULL AND l.status LIKE 'error%' AND s.status NOTNULL AND s.status LIKE 'error%') ''') logger.info('his_links_status table updated') # Add records for new editorial links and links that have a different # number of appearances in a page. cte = ''' WITH scr_ed_links_normalised AS ( SELECT e.page_id, text, link_id, count(*) AS present FROM scr_ed_links AS e LEFT JOIN scr_links USING (link_nr) GROUP BY e.page_id, text, link_id ), last_ed_links AS ( SELECT DISTINCT page_id, text, link_id, last_value(present) OVER win AS present FROM his_ed_links WINDOW win AS ( PARTITION BY page_id, text, link_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) ) ''' mst_conn.execute(cte + f''' INSERT INTO his_ed_links (timestamp, page_id, text, link_id, present) SELECT '{ts}', page_id, text, link_id, s.present FROM scr_ed_links_normalised AS s LEFT JOIN last_ed_links AS l USING (page_id, text, link_id) WHERE l.present ISNULL OR s.present != l.present ''') # Add records for editorial links that disappeared mst_conn.execute(cte + f''' INSERT INTO his_ed_links (timestamp, page_id, text, link_id, present) SELECT '{ts}', page_id, text, link_id, 0 FROM last_ed_links AS l LEFT JOIN scr_ed_links_normalised AS s USING (page_id, text, link_id) WHERE l.present > 0 AND s.present ISNULL ''') logger.info('his_ed_links table updated') # Update his_unlinked table with the scrape results cte = f''' WITH last_unlinked AS ( SELECT DISTINCT page_id, url, last_value(unl_type) OVER win AS unl_type FROM his_unlinked WINDOW win AS ( PARTITION BY page_id, url ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) ) ''' # Save unlinked pages that are new or whose type changed mst_conn.execute(cte + f''' INSERT INTO his_unlinked (timestamp, page_id, url, unl_type) SELECT '{ts}', s.page_id, s.url, s.unl_type FROM scr_unlinked AS s LEFT JOIN last_unlinked AS l ON s.page_id IS l.page_id AND s.url IS l.url WHERE s.unl_type IS NOT l.unl_type -- covers new and changed ''') # Mark unlinked pages that disappeared with unl_type 'absent' mst_conn.execute(cte + f''' INSERT INTO his_unlinked (timestamp, page_id, url, unl_type) SELECT '{ts}', l.page_id, l.url, 'absent' FROM last_unlinked AS l LEFT JOIN scr_unlinked AS s ON l.page_id IS s.page_id AND l.url IS s.url WHERE l.unl_type != 'absent' AND s.unl_type ISNULL ''') logger.info('his_unlinked table updated') # Update his_pages_life table with the scrape results cte = ''' WITH last_pages_life AS ( SELECT DISTINCT page_id, last_value(alive) OVER ( PARTITION BY page_id ORDER BY timestamp RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) AS alive FROM his_pages_life ) ''' # Mark pages alive 
that are new or became alive again mst_conn.execute(cte + f''' INSERT INTO his_pages_life (timestamp, page_id, alive) SELECT '{ts}', page_id, TRUE FROM scr_page_ids LEFT JOIN last_pages_life AS l USING (page_id) WHERE l.page_id ISNULL OR NOT alive ''') # Register pages that are no longer alive mst_conn.execute(cte + f''' INSERT INTO his_pages_life (timestamp, page_id, alive) SELECT '{ts}', page_id, FALSE FROM last_pages_life LEFT JOIN scr_page_ids AS s USING (page_id) WHERE alive AND s.page_id ISNULL ''') logger.info('his_pages_life table updated') # Update his_short_urls table with the scrape results cte = f''' WITH last_short_urls AS ( SELECT DISTINCT url, last_value(active) OVER ( PARTITION BY url ORDER BY timestamp RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) AS active FROM his_short_urls ), scr_short_urls AS ( SELECT req_url AS url FROM scr_redirs WHERE req_url NOT LIKE '%{ROOT_URL.partition('.')[2]}%' ) ''' # Mark short-url's active that are new or became active again mst_conn.execute(cte + f''' INSERT INTO his_short_urls (timestamp, url, active) SELECT '{ts}', url, TRUE FROM scr_short_urls LEFT JOIN last_short_urls USING (url) WHERE active ISNULL OR NOT active ''') num_active = mst_conn.execute(f''' SELECT count(*) FROM his_short_urls WHERE timestamp = '{ts}' AND active ''').fetchone()[0] # Register short-url's that are no longer active mst_conn.execute(cte + f''' INSERT INTO his_short_urls (timestamp, url, active) SELECT '{ts}', url, FALSE FROM last_short_urls LEFT JOIN scr_short_urls AS s USING (url) WHERE active AND s.url ISNULL ''') num_inactive = mst_conn.execute(f''' SELECT count(*) FROM his_short_urls WHERE timestamp = '{ts}' AND NOT active ''').fetchone()[0] logger.info(f'his_short_urls table updated: ' f'registered {num_active} active and {num_inactive} inactive') # Update his_pages_info table for pages with changed aspects sql = 'SELECT name FROM pragma_table_info("his_pages_info")' page_aspects = [row[0] for row in mst_conn.execute(sql).fetchall() if row[0] != 'timestamp'] fields = ', '.join(page_aspects) mst_conn.execute(f''' WITH last_pages_info_timestamps AS ( SELECT DISTINCT page_id, last_value(timestamp) OVER ( PARTITION BY page_id ORDER BY timestamp RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) AS timestamp FROM his_pages_info ), last_pages_info AS ( SELECT {fields} FROM last_pages_info_timestamps LEFT JOIN his_pages_info USING (timestamp, page_id) ), changed_pages_info AS ( SELECT * FROM scr_pages_info EXCEPT SELECT * FROM last_pages_info ) INSERT INTO his_pages_info SELECT '{ts}', * FROM changed_pages_info ''') logger.info('his_pages_info table updated') # Add scrape status figures to the his_status_figures table mst_conn.execute(f''' INSERT INTO his_status_figures (timestamp, name, value) SELECT '{ts}', name, value FROM scr_status_figures ''') logger.info('his_status_figures table updated with scrape results') logger.info('concluded adding scrape results to master and history tables') return True def check_storage_integrity(check_page_ids: bool = False) -> None: """ **Check integrity of the scrapes storage.** Arguments: check_page_ids: check the page_id/path relation Next checks will be executed: - available zip files with page sources against valid scrapes - validity of timestamps of site reports - validity of timestamps of 'sitemap.xml' files - validity of timestamps of 'robots.txt' files - page_id/path relation of all zip sources (depending on actual value of `check_page_ids` parameter; this will take quite some time) All detected 
exceptions will be logged; no data will be altered. """ # Setup logging global logger logger = logging.getLogger('check') logger.setLevel(logging.INFO) fh = logging.FileHandler(str(mst_dir / 'check.log')) fh.setLevel(logging.INFO) ch = logging.StreamHandler() ch.setLevel(logging.INFO) formatter = logging.Formatter( fmt='[%(asctime)s] %(levelname)-8s - %(message)s', datefmt='%Y-%m-%d %H:%M:%S') fh.setFormatter(formatter) ch.setFormatter(formatter) logger.addHandler(fh) logger.addHandler(ch) # Check page sources against the valid scrapes valid_tss = set(valid_scrapes()) src_dir = mst_dir / scrape_conf.src_dir_name src_tss = set() for file in src_dir.iterdir(): if file.suffix == '.zip': ts = file.name[:11] src_tss.add(ts) if src_tss != valid_tss: if scrs_no_zip := valid_tss - src_tss: logger.warning( f'{len(scrs_no_zip)} scrapes without source zips: ' f'{", ".join(scrs_no_zip)}') if zips_no_scr := src_tss - valid_tss: logger.warning( f'{len(zips_no_scr)} source zips without scrape: ' f'{", ".join(zips_no_scr)}') logger.info('checked page source zip files') # Check site-reports against valid scrapes for report in report_conf.reports: report_conf.spec_report(report) rep_dir = mst_dir / report_conf.report_dir_name rep_tss = set() for item in rep_dir.iterdir(): if item.is_dir(): for file in item.iterdir(): ts = file.name[:11] if ts not in valid_tss: logger.warning(f'{report} report with invalid ' f'timestamp: {file.name}') else: rep_tss.add(ts) if scrs_no_rep := valid_tss - rep_tss: logger.warning( f'scrapes without {report} reports: {", ".join(scrs_no_rep)}') logger.info('checked site reports') # Check sitemaps against valid scrapes map_dir = mst_dir / scrape_conf.sitemap_dir_name for file in map_dir.iterdir(): ts = file.name[:11] if ts not in valid_tss: logger.warning(f'sitemap with invalid timestamp: {file.name}') logger.info('checked sitemaps files ') # Check robots files against valid scrapes robot_dir = mst_dir / scrape_conf.robots_dir_name for file in robot_dir.iterdir(): ts = file.name[:11] if ts not in valid_tss: logger.warning(f'robots file with invalid timestamp: {file.name}') logger.info('checked robots files') # Check page_id/path relations of all source zip files if check_page_ids: logger.info('started checking page_id/path relations of all scrapes; ' 'this will take some time') for ts in sorted(src_tss): src_zip = PageSourceZip(ts) for page_id, doc in src_zip.iter_pages(): soup = BeautifulSoup(doc, features='lxml') mst_path = mst_conn.execute( 'SELECT path FROM mst_paths WHERE page_id = ?', [page_id]).fetchone()[0] meta_tag = soup.find( 'meta', attrs={'name': 'DCTERMS.identifier'}) if meta_tag: full_src_path = meta_tag['content'] if full_src_path.startswith('/wps/wcm/connect'): src_path = full_src_path[16:] if src_path != mst_path: logger.critical( f'page {page_id} is registered as {mst_path}, ' f'but in source zip {src_zip.path.name} its ' f'path is {src_path}') else: # No consistency check possible logger.warning(f'strange DCTERMS.identifier in ' f'scrape {ts} / page {page_id}') else: # No consistency check possible logger.debug( f'page without wcm url: scrape {ts} / page {page_id}') logger.info(f'page_id/path relation checked of scrape {ts}') return def valid_scrapes( from_ts: str = '000101-0000', data_dir: Path = mst_dir) -> list[str]: """ **Return a sorted list of valid scrapes.** Arguments: from_ts: skip scrapes with an earlier timestamp data_dir: master directory to contain the (sub)directory with scrape logs Returns: sorted list of timestamps of valid scrapes Scrapes are 
assumed to be valid if the text 'scrape finished' appears in the content of the related scrape log. When this text appears in the scrape log the essential scrape was successful, even when further processing of the scraped data is not executed for some reason. Presence of log files with duplicate timestamps as well as log files that do not contain the discriminating text wil be logged. """ log_dir = data_dir / scrape_conf.log_dir_name valid_tss = set() for file in log_dir.iterdir(): if 'README' in file.name: continue ts = file.name[:11] if ts < from_ts: continue valid = False with open(file) as scr_log: for line in scr_log: if 'scrape finished' in line: valid = True break if valid: if ts in valid_tss: logger.warning(f'duplicate log for scrape {ts}') else: valid_tss.add(ts) else: logger.warning( f'scrape {ts} seems to be invalid c.q. unfinished') return sorted(valid_tss) def reprocess_scrapes( from_ts: str = '000101-0000', reprocess_links: bool = False, copy_db: bool = True) -> None: """ **Reprocess scraped data.** Arguments: from_ts: first timestamp of data to be reprocessed reprocess_links: reprocess content of his_ed_links table copy_db: work on a copy of the scrapes database ***Note:*** *The structure of the scrapes storage is detailed in the `create_scrapes_storage` function.* This function will reprocess the data in the history tables of all scrapes or starting from the scrape with timestamp `from_ts`. The reprocessing addresses all data that can be reproduced from the data that was gathered during the actual scrape. This 'basic' scrape data, which can not be reproduced itself, consists of: - page sources - available in source zip files - redirections - stored in the his_redirs table - unlinked pages - stored in the his_unlinked table (this information can be reproduced from link analysis of the the page sources, but since only editorial links are registered, it is treated as basic data) - status of links - stored in the his_links_status table Using this function assumes that the content of the master tables is complete and correct. In case there is some doubt if this assumption is correct, some consistency checks can be executed on these tables by using the `check_storage_integrity` function. While reprocessing, next tables are rebuilt with reprocessed data: - *his_ed_links* - using page sources (if `reprocess_links` is `True`) - *his_pages_info* - using page sources - *his_pages_life* - using page sources - *his_short_urls* - using his_redirs - *his_status_figures* - using his_pages_info Although the other history tables are rebuilt as well, it will be with the same (basic) data and as such produce tables with the same content. Since reprocessing can take quite some time, the reprocessing will by default be executed on a copy of the scrapes database in order to avoid a clash with a next scrape being added. Any scrape that is added after the execution of this function started, will be reprocessed as well. After reprocessing finishes, all original history tables will be saved with their original names prepended with 'svd_' and can be deleted from the database when the reprocessing was verified and successful. 
""" global logger logger = logging.getLogger('reprocess') logger.setLevel(logging.INFO) ch = logging.StreamHandler() ch.setLevel(logging.INFO) formatter = logging.Formatter( fmt='[%(asctime)s] %(levelname)-8s - %(message)s', datefmt='%Y-%m-%d %H:%M:%S') ch.setFormatter(formatter) logger.addHandler(ch) if copy_db: scrs_db = mst_dir / scrape_conf.scrs_db_name v = 0 while True: db_name = scrs_db.stem + '_repr' + (f'_{v}' if v else '') repr_db = scrs_db.with_stem(db_name) if repr_db.exists(): v += 1 continue else: break shutil.copy(scrs_db, repr_db) mst_conn.switch_to(repr_db) # Save copies of the history tables remove data that will be renewed qry = ''' SELECT name FROM sqlite_schema WHERE type = 'table' AND name LIKE 'his_%' ''' for name in [r[0] for r in mst_conn.execute(qry)]: sql = mst_conn.execute( f'SELECT sql FROM sqlite_schema WHERE name = ?', [name]).fetchone()[0] svd_name = 'svd_' + name mst_conn.execute(sql.replace(name, svd_name)) mst_conn.execute(f'INSERT INTO {svd_name} SELECT * FROM {name}') mst_conn.execute(f'DELETE FROM {name} WHERE timestamp >= "{from_ts}"') mst_conn.execute('VACUUM') # Cycle over all scrapes that need to be reprocessed. Then per scrape: # - reinstate all scrape tables as if that scrape just finished # - process these tables the same way as if the scrape was new scrapes = valid_scrapes(from_ts) while scrapes: ts = scrapes.pop(0) if not scrapes: # Add scrapes that were added in the meantime. scrapes = valid_scrapes(ts)[1:] logger.info(f'started reprocessing scrape {ts}') empty_scr_tables() src_zip = PageSourceZip(ts) # Save all page_id's to the scr_page_ids table mst_conn.execute('INSERT INTO scr_page_ids VALUES ' + ','.join(f'({pid})' for pid in src_zip.page_ids())) # Since redirections are determined at the moment of scraping, # the scr_redirs table can only be reinstated from the svd_his_redirs # table. mst_conn.execute(f''' WITH last_redirs_during_scrape AS ( SELECT DISTINCT req_id, req_url, last_value(redir_id) OVER win AS redir_id, last_value(redir_url) OVER win AS redir_url, last_value(redir_type) OVER win AS redir_type FROM svd_his_redirs WHERE timestamp <= '{ts}' WINDOW win AS ( PARTITION BY req_id, req_url ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) ) INSERT INTO scr_redirs SELECT * FROM last_redirs_during_scrape WHERE redir_type != 'absent' ''') # Re-analysing unlinked pages can only be done when all links are # registered. Since only editorial links are registered, # the scr_unliked will be reinstated from the svd_his_unlinked table. 
mst_conn.execute(f''' WITH last_unlinked_during_scrape AS ( SELECT DISTINCT page_id, url, last_value(unl_type) OVER win AS unl_type FROM svd_his_unlinked WHERE timestamp <= '{ts}' WINDOW win AS ( PARTITION BY page_id, url ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) ) INSERT INTO scr_unlinked SELECT * FROM last_unlinked_during_scrape WHERE unl_type != 'absent' ''') # Handle scr_links and scr_ed_links tables if reprocess_links: # Cycle over all pages for page_id, doc in src_zip.iter_pages(): soup = BeautifulSoup(doc, features='lxml') mst_path = mst_conn.execute( 'SELECT path FROM mst_paths WHERE page_id = ?', [page_id]).fetchone()[0] mst_url = ROOT_URL + mst_path # Re-process all editorial links ed_tree = split_tree(soup)[0] for a_tag in ed_tree.find_all('a', href=True): link = a_tag['href'].strip() # Some links are discarded if link.startswith('#'): # Plain #-links continue if link == '/': # Home page links continue if ('readspeaker' in link or 'adobe' in link or 'java' in link): # Non content links continue link = urljoin(mst_url, link) # Separate anchor from in-scope url if link.startswith(ROOT_URL) and '#' in link: link, _, anchor = link.partition('#') else: anchor = None if link == mst_url: # Discard in-page link (if it was not a plain #-link) continue if not link: print('Link is None or empty') continue # Check if the link is new for this scrape qry = f''' SELECT link_nr FROM scr_links WHERE url = '{link}' AND anchor IS {f"'{anchor}'" if anchor else 'NULL'} ''' qry_result = mst_conn.execute(qry).fetchone() if qry_result: # This link was already handled during this scrape link_nr = qry_result[0] else: # Store the link for this scrape link_nr = mst_conn.execute(''' INSERT INTO scr_links (url, anchor) VALUES (?, ?) ''', [link, anchor]).lastrowid # Save link in relation to the page (can be a duplicate if # the link is used more than once with the same text in the # same page). link_text = a_tag.text.strip() mst_conn.execute(''' INSERT INTO scr_ed_links (page_id, text, link_nr) VALUES (?, ?, ?) ''', [page_id, link_text, link_nr]) # Add link_type and status from saved history (since these can # not be reproduced otherwise). mst_conn.execute(f''' WITH svd_links_status AS ( SELECT DISTINCT link_id, last_value(link_type) OVER win AS link_type, last_value(status) OVER win AS status FROM svd_his_links_status WHERE timestamp <= {ts} WINDOW win AS ( PARTITION BY link_id ORDER BY timestamp RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) ) UPDATE scr_links AS s SET link_type = ( SELECT link_type FROM svd_links_status AS t WHERE t.link_id = s.link_id), status = ( SELECT status FROM svd_links_status AS t WHERE t.link_id = s.link_id) ''') post_process_scrape() else: # Link related data does not need to be renewed: reinstate the # scr_links and scr_ed_links tables from the saved history. 
cte = f''' WITH last_ed_links_during_scrape AS ( SELECT DISTINCT page_id, text, link_id, last_value(present) OVER win AS present FROM svd_his_ed_links WHERE timestamp <= '{ts}' WINDOW win AS ( PARTITION BY page_id, text, link_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) ), ''' mst_conn.execute(cte + f''' last_links_status_during_scrape AS ( SELECT DISTINCT link_id, last_value(link_type) OVER win AS link_type, last_value(status) OVER win AS status FROM svd_his_links_status WHERE timestamp <= '{ts}' WINDOW win AS ( PARTITION BY link_id ORDER BY timestamp RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) ) INSERT INTO scr_links (link_nr, url, page_id, anchor, link_type, status) SELECT link_id, url, page_id, anchor, link_type, status FROM mst_links LEFT JOIN last_links_status_during_scrape USING (link_id) WHERE link_id IN (SELECT DISTINCT link_id FROM last_ed_links_during_scrape WHERE present > 0) ''') mst_conn.execute(cte + f''' -- Degroup same links per page last_ed_links_ungrouped AS ( SELECT page_id, text, link_id, present - 1 AS present FROM last_ed_links_during_scrape WHERE present > 0 UNION ALL SELECT page_id, text, link_id, present - 1 AS present FROM last_ed_links_ungrouped WHERE present > 0 ) INSERT INTO scr_ed_links (page_id, text, link_nr) SELECT page_id, text, link_id FROM last_ed_links_ungrouped ''') extract_pages_info(ts) derive_pages_info() status_figures() add_scraped_data(ts) logger.info(f'concluded reprocessing scrape {ts}') if copy_db: mst_conn.switch_back() return if __name__ == '__main__': scrape_site()
Field definitions for the his_pages_info and scr_pages_info table (part 1 of 2).
Field definitions for the his_pages_info and scr_pages_info table (part 2 of 2).
Set of page-types that belong to the dv category.
Set of page-types that belong to the bib category.
Set of page-types that belong to the alg category (excluding 'bld_wrapper', which is handled specifically).
View Source
def create_new_data_store() -> None: """ **Create a new skeleton data store.** ***Note:*** *Since this function will not be used very often, it is mainly meant as means to document the storage concept.* Driven by the configuration file `bd_www.constants.CONFIG_NAME`, (sub)directories are created in the master directory to contain the various scraped data, and an empty scrapes as well as a metrics database is created with all necessary tables. The next data will be stored as files in separate **subdirectories** of the master directory: - page sources in zip files per scrape - 'robots.txt' files when used for finding unscraped url's and if different from the previous version - 'sitemap.xml' files when used for finding unscraped url's and if different from the previous version - log files for all scrapes - site reports per report variation, with subdirectories per reporting period Next **files** are stored directly in the master directory: - web analytics log file - reporting log file - scrapes database - metrics database Various names of above mentioned files and (sub)directories are determined by the actual parameters of the configuration file. Please refer to the `bd_www.Config` class for further information. The ***scrapes database*** consists of the following tables and views: **History tables** - contain all basic, extracted and derived data from all scrapes. All stored data is labeled with the timestamp of the scrape. With some pragmatic exceptions the data is only saved if different from the previous scrape (time delta storage). - *his_pages_life* - time delta register when a page was alive or not (*alive* is FALSE when a page was not present) - *his_pages_info* - time delta register of extracted and derived data per page (full record saved when one or more values differ from previous scrape) - *his_ed_links* - time delta register with usage of editorial links (*present* is 0 when an editorial link was not used) - *his_links_status* - time delta register with type and status of (editorial) links - *his_redirs* - time delta register with redirects that occurred while scraping (*redir_type* is 'absent' when it did not occur) - *his_unlinked* - time delta register with pages that were not connected via editorial links (*unl_type* is 'absent' when no longer unlinked or not present) - *his_short_urls* - time delta register when short url's were active - *his_status_figures* - figures representing the status of the site for each scrape **Master tables** - contain unique and primary data - *mst_paths* - unique relation between *path* and *page_id* - *mst_links* - unique relation between *url* or *page_id*/*anchor* and *link_id* - *mst_scrapes* - primary reference for available scrapes and unique source for *scraper* and *primers* fields **Non tables/views** - collections of manually administered *page_id*'s to force special handling of these pages. - *non_scrape_ids* - table with manually administered *page_id*'s that will not be scraped - *non_orphan_ids* - table with manually administered *page_id*'s that will not be considered orphans when becoming unlinked (*unl_type* will be set to 'special' in those cases) - *non_orphan_paths* - view of table *non_scrape_ids* with explicit *path* - *non_scrape_paths* - view of table *non_orphan_ids* with explicit *path* **Scrape tables** - operational stores for scrape results; purged when starting a new scrape. 
- *scr_page_ids* - id's of pages that were scraped - *scr_redirs* - redirects that occurred while scraping - *scr_pages_info* - extracted and derived data per page - *scr_links* - register of fully qualified (including *link_type* and *status*) links in the editorial content of the scraped pages - *scr_ed_links* - use of links in the editorial content each page - *scr_unlinked* - pages that can not be reached from other pages - *scr_status_figures* - figures representing the status of the site **Timestamped table and views** - after saving the timestamp in (the only record of) the *tsd* table, the timestamped views represent a dataset that is comparable to the scrape tables after the scrape with the specified timestamp finished. - *tsd* - scalar store to set the timestamp for the next timestamped views - *tsd_redirs* - identical to equivalent scrape table - *tsd_redirs_to_wcm* - all redirects related to the canonical url of the page they finally redirected to (not available as scrape table) - *tsd_pages_info* - identical to equivalent scrape table - *tsd_links* - identical to equivalent scrape table - *tsd_ed_links* - identical to equivalent scrape table - *tsd_unlinked* - identical to equivalent scrape table - *tsd_int_ref_cnt* - number of times that a pages is linked to from the editorial content of other pages (not available as scrape table) The ***metrics database*** consists of the following tables and views: **Basic tables** - contain unique and primary data - *shd_paths* - unique relation between *path* and *page_id*; acts as shadow subset of the *mst_paths* table in the scrape database as substitute for an inter-db relation that is not possible with SQLite - *shd_links* - unique relation between *link_id* and *url*; acts as shadow subset of the *mst_links* table in the scrape database as substitute for an inter-db relation that is not possible with SQLite - *dates* - unique relation between *date* and *date_id* **Analytics tables** - contain all historic analytics data - *daily* - numeric analytics data per date and page - *feedback* - numeric and textual feedback per date and page - *downloads* - numeric download data per date and download link **Analytics views** - views for the analytics tables with explicit relations - *daily_expl* - daily data table with explicit date and page path - *feedback_expl* - feedback data table with explicit date and page path - *downloads_expl* - downloads data table with explicit date and link path """ # Create directories and database file for dir_name in (scrape_conf.src_dir_name, scrape_conf.robots_dir_name, scrape_conf.log_dir_name, scrape_conf.sitemap_dir_name): (mst_dir / dir_name).mkdir(exist_ok=True) for report in report_conf.reports: report_conf.spec_report(report) report_dir = mst_dir / report_conf.report_dir_name report_dir.mkdir(exist_ok=True) for subdir_name in set(REPORT_SUBDIRS.values()): (report_dir / subdir_name).mkdir(exist_ok=True) log_readme = mst_dir / scrape_conf.log_dir_name / 'README.txt' with open(log_readme, 'w') as f: f.write(_README_TXT) # Scrape db: create all tables to contain the historical scrape data mst_conn.execute(''' CREATE TABLE mst_scrapes ( timestamp TEXT PRIMARY KEY, date AS ("20" || substr(timestamp,1,2) || "-" || substr(timestamp,3,2) || "-" || substr(timestamp,5,2)), scraper TEXT, primers TEXT ) ''') mst_conn.execute(''' CREATE TABLE mst_paths ( page_id INTEGER PRIMARY KEY AUTOINCREMENT, path TEXT NOT NULL UNIQUE ) ''') mst_conn.execute(''' CREATE TABLE mst_links ( link_id INTEGER PRIMARY KEY 
AUTOINCREMENT, url TEXT, anchor TEXT, page_id INTEGER, UNIQUE (url), UNIQUE (page_id, anchor), FOREIGN KEY (page_id) REFERENCES mst_paths (page_id) ) ''') mst_conn.execute(''' CREATE TABLE his_pages_life ( timestamp TEXT NOT NULL, page_id INTEGER NOT NULL, alive INTEGER NOT NULL, PRIMARY KEY (timestamp, page_id), FOREIGN KEY (page_id) REFERENCES mst_paths (page_id) ) ''') fields = EXTRACTED_FIELDS + DERIVED_FIELDS info_columns = ', '.join([f'{f[0]} {f[1]}' for f in fields]) mst_conn.execute(f''' CREATE TABLE his_pages_info ( timestamp TEXT NOT NULL, page_id INTEGER NOT NULL, {info_columns}, PRIMARY KEY (timestamp, page_id), FOREIGN KEY (page_id) REFERENCES mst_paths (page_id) ) ''') mst_conn.execute(''' CREATE TABLE his_links_status ( timestamp TEXT NOT NULL, link_id INTEGER NOT NULL, link_type TEXT, status INTEGER, PRIMARY KEY (timestamp, link_id), FOREIGN KEY (link_id) REFERENCES mst_links (link_id) ) ''') mst_conn.execute(''' CREATE TABLE his_ed_links ( timestamp TEXT NOT NULL, page_id INTEGER NOT NULL, text TEXT NOT NULL, link_id INTEGER NOT NULL, present INTEGER NOT NULL, PRIMARY KEY (timestamp, page_id, text, link_id) FOREIGN KEY (page_id) REFERENCES mst_paths (page_id), FOREIGN KEY (link_id) REFERENCES mst_links (link_id) ) ''') mst_conn.execute(''' CREATE TABLE his_redirs ( timestamp TEXT NOT NULL, req_id INTEGER, req_url TEXT, redir_id INTEGER, redir_url TEXT, redir_type TEXT NOT NULL, PRIMARY KEY (timestamp, req_id, req_url), UNIQUE (timestamp, req_id), UNIQUE (timestamp, req_url), FOREIGN KEY (req_id) REFERENCES mst_paths (page_id), FOREIGN KEY (redir_id) REFERENCES mst_paths (page_id) ) ''') mst_conn.execute(''' CREATE TABLE his_unlinked ( timestamp TEXT NOT NULL, page_id INTEGER, url TEXT, unl_type TEXT, PRIMARY KEY (timestamp, page_id, url), UNIQUE (timestamp, page_id), UNIQUE (timestamp, url), FOREIGN KEY (page_id) REFERENCES mst_paths (page_id) ) ''') mst_conn.execute(''' CREATE TABLE his_short_urls ( timestamp TEXT NOT NULL, url TEXT NOT NULL, active INTEGER NOT NULL, PRIMARY KEY (timestamp, url) ) ''') mst_conn.execute(''' CREATE TABLE his_status_figures ( timestamp TEXT NOT NULL, name TEXT NOT NULL, value INTEGER NOT NULL, PRIMARY KEY (timestamp, name) ) ''') mst_conn.execute(''' CREATE TABLE non_orphan_ids ( page_id INTEGER NOT NULL UNIQUE, FOREIGN KEY (page_id) REFERENCES mst_paths (page_id) ) ''') mst_conn.execute(''' CREATE TABLE non_scrape_ids ( page_id INTEGER NOT NULL UNIQUE, FOREIGN KEY (page_id) REFERENCES mst_paths (page_id) ) ''') # Scrape db: create tables to contain operational results during a scrape mst_conn.execute(''' CREATE TABLE scr_page_ids ( page_id INTEGER NOT NULL UNIQUE, FOREIGN KEY (page_id) REFERENCES mst_paths (page_id) ) ''') mst_conn.execute(''' CREATE TABLE scr_redirs ( req_id INTEGER UNIQUE, req_url TEXT UNIQUE, redir_id INTEGER, redir_url TEXT, redir_type TEXT NOT NULL, PRIMARY KEY (req_id, req_url), FOREIGN KEY (req_id) REFERENCES mst_paths (page_id), FOREIGN KEY (redir_id) REFERENCES mst_paths (page_id) ) ''') mst_conn.execute(''' CREATE TABLE scr_links ( link_nr INTEGER PRIMARY KEY AUTOINCREMENT, url TEXT, page_id INTEGER, anchor TEXT, link_type TEXT, status INTEGER, link_id INTEGER UNIQUE, UNIQUE (url, anchor), UNIQUE (page_id, anchor), FOREIGN KEY (page_id) REFERENCES mst_paths (page_id) ) ''') mst_conn.execute(''' CREATE TABLE scr_ed_links ( page_id INTEGER NOT NULL, text TEXT NOT NULL, link_nr INTEGER NOT NULL, FOREIGN KEY (page_id) REFERENCES mst_paths (page_id), FOREIGN KEY (link_nr) REFERENCES scr_links (link_nr) ) ''') 
mst_conn.execute(''' CREATE TABLE scr_unlinked ( page_id INTEGER UNIQUE, url TEXT UNIQUE, unl_type TEXT, FOREIGN KEY (page_id) REFERENCES mst_paths (page_id) ) ''') fields = EXTRACTED_FIELDS + DERIVED_FIELDS info_columns = ', '.join([f'{f[0]} {f[1]}' for f in fields]) mst_conn.execute(f''' CREATE TABLE scr_pages_info ( page_id INTEGER PRIMARY KEY, {info_columns}, FOREIGN KEY (page_id) REFERENCES mst_paths (page_id) ) ''') mst_conn.execute(''' CREATE TABLE scr_status_figures ( name TEXT PRIMARY KEY, value INTEGER NOT NULL ) ''') # Scrape db: create views for non-orphan and non-scrape tables with added # paths. mst_conn.execute(''' CREATE VIEW non_orphan_paths AS SELECT * FROM non_orphan_ids NATURAL JOIN mst_paths ''') mst_conn.execute(''' CREATE VIEW non_scrape_paths AS SELECT * FROM non_scrape_ids NATURAL JOIN mst_paths ''') # Scrape db: create a simple table to contain the timestamp for a # specific scrape, together with the views to get the data for that scrape. mst_conn.executescript('CREATE TABLE tsd (timestamp TEXT)') mst_conn.execute(''' CREATE VIEW tsd_unlinked AS WITH last_unlinked_during_scrape AS ( SELECT DISTINCT page_id, url, last_value(unl_type) OVER win AS unl_type FROM his_unlinked AS u, tsd AS s WHERE u.timestamp <= s.timestamp WINDOW win AS ( PARTITION BY page_id, url ORDER BY u.timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) ) SELECT * FROM last_unlinked_during_scrape WHERE unl_type != 'absent' ''') mst_conn.execute(''' CREATE VIEW tsd_redirs AS WITH last_redirs_during_scrape AS ( SELECT DISTINCT req_id, req_url, last_value(redir_id) OVER win AS redir_id, last_value(redir_url) OVER win AS redir_url, last_value(redir_type) OVER win AS redir_type FROM his_redirs AS r, tsd AS s WHERE r.timestamp <= s.timestamp WINDOW win AS ( PARTITION BY req_id, req_url ORDER BY r.timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) ) SELECT * FROM last_redirs_during_scrape WHERE redir_type != 'absent' ''') mst_conn.execute(''' CREATE VIEW tsd_redirs_to_wcm (req_url, wcm_id, final_type) AS WITH -- build url-based redir chains scr_chains (org_url, prev_url, next_id, next_url, next_type) AS ( SELECT req_url, NULL, redir_id, redir_url, redir_type FROM tsd_redirs UNION ALL SELECT org_url, next_url, r.redir_id, r.redir_url, r.redir_type FROM scr_chains AS c LEFT JOIN tsd_redirs AS r ON c.next_url = r.req_url -- avoid endless loops by excluding redirs of an alias WHERE c.next_type <> 'alias' ) -- get all redir chains ending in a wcm url SELECT org_url, next_id, next_type FROM scr_chains WHERE next_id IS NOT NULL AND org_url IS NOT NULL ''') mst_conn.execute(''' CREATE VIEW tsd_ed_links AS WITH RECURSIVE last_his_ed_links_records_during_scrape AS ( SELECT DISTINCT page_id, text, link_id, last_value(present) OVER win AS present FROM his_ed_links AS e, tsd AS s WHERE e.timestamp <= s.timestamp WINDOW win AS ( PARTITION BY page_id, text, link_id ORDER BY e.timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) ), -- Degroup same links per page scr_ed_links_singles AS ( SELECT page_id, text, link_id, present - 1 AS present FROM last_his_ed_links_records_during_scrape WHERE present > 0 UNION ALL SELECT page_id, text, link_id, present - 1 AS present FROM scr_ed_links_singles WHERE present > 0 ) SELECT page_id, text, link_id FROM scr_ed_links_singles ''') mst_conn.execute(''' CREATE VIEW tsd_links AS WITH tsd_links_status AS ( SELECT DISTINCT link_id, last_value(link_type) OVER win AS link_type, last_value(status) OVER win AS status FROM his_links_status 
AS l, tsd AS s WHERE l.timestamp <= s.timestamp WINDOW win AS ( PARTITION BY link_id ORDER BY l.timestamp RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) ) SELECT link_id, url, page_id, anchor, link_type, status FROM mst_links LEFT JOIN tsd_links_status USING (link_id) WHERE link_id IN (SELECT DISTINCT link_id FROM tsd_ed_links) ''') mst_conn.execute(''' CREATE VIEW tsd_int_ref_cnt (page_id, referral_cnt) AS WITH -- wcm id's of the internal links per page int_page_links (referring_page_id, referral_cnt, linked_page_id) AS ( SELECT e.page_id, count(*), ifnull(l.page_id, s.wcm_id) FROM tsd_ed_links AS e LEFT JOIN mst_links AS l USING (link_id) LEFT JOIN tsd_redirs_to_wcm AS s ON iif(instr(l.url, '#'), substr(l.url, 1, instr(l.url, '#') - 1), l.url) = req_url WHERE (l.page_id OR s.wcm_id) GROUP BY e.page_id, e.text, e.link_id ) SELECT linked_page_id, sum(referral_cnt) FROM int_page_links -- next clause removes some spurious in-page links WHERE referring_page_id != linked_page_id GROUP BY linked_page_id ''') mst_conn.execute(''' CREATE VIEW tsd_pages_info AS WITH scr_pages AS ( SELECT DISTINCT page_id, last_value(alive) OVER win AS alive FROM his_pages_life AS p, tsd AS s WHERE p.timestamp <= s.timestamp WINDOW win AS ( PARTITION BY page_id ORDER BY p.timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) ), scr_last_info_ts (page_id, timestamp) AS ( SELECT DISTINCT page_id, last_value(i.timestamp) OVER win FROM his_pages_info AS i, tsd AS s WHERE i.timestamp <= s.timestamp WINDOW win AS ( PARTITION BY page_id ORDER BY i.timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) ) SELECT pi.* FROM scr_pages LEFT JOIN scr_last_info_ts AS li USING (page_id) LEFT JOIN his_pages_info AS pi USING (page_id, timestamp) WHERE alive ''') # Metrics db: created basic tables mst_conn.execute(''' CREATE TABLE mtrx.shd_paths ( page_id INTEGER NOT NULL UNIQUE, path TEXT NOT NULL UNIQUE, PRIMARY KEY (page_id) ) ''') mst_conn.execute(''' CREATE TABLE mtrx.shd_links ( link_id INTEGER PRIMARY KEY, url TEXT NOT NULL ) ''') mst_conn.execute(''' CREATE TABLE mtrx.dates ( date_id INTEGER UNIQUE, date TEXT NOT NULL UNIQUE, PRIMARY KEY (date_id AUTOINCREMENT) ) ''') # Metrics db: created tables to store analytic data mst_conn.execute(''' CREATE TABLE mtrx.daily ( date_id INTEGER NOT NULL, page_id INTEGER NOT NULL, nb_visits INTEGER, nb_hits INTEGER, entry_nb_visits INTEGER, entry_bounce_count INTEGER, bounce_rate REAL AS (round(CAST(entry_bounce_count AS REAL) / entry_nb_visits, 3)), exit_nb_visits INTEGER, exit_rate REAL AS (round(CAST(exit_nb_visits AS REAL) / nb_visits, 3)), organic_entries INTEGER, organic_entry_rate REAL AS (round(CAST(organic_entries AS REAL) / nb_visits, 3)), call_visits INTEGER, call_rate REAL AS (round(CAST(call_visits AS REAL) / nb_visits, 3)), PRIMARY KEY (date_id, page_id), FOREIGN KEY (date_id) REFERENCES dates, FOREIGN KEY (page_id) REFERENCES shd_paths ) ''') mst_conn.execute(''' CREATE TABLE mtrx.feedback ( date_id INTEGER NOT NULL, page_id INTEGER NOT NULL, pos_cnt INTEGER, neg_cnt INTEGER, pos_txt TEXT, neg_txt TEXT, PRIMARY KEY (date_id, page_id), FOREIGN KEY (date_id) REFERENCES dates, FOREIGN KEY (page_id) REFERENCES shd_paths ) ''') mst_conn.execute(''' CREATE TABLE mtrx.downloads ( date_id INTEGER NOT NULL REFERENCES dates, link_id INTEGER NOT NULL REFERENCES shd_links, nb_visits INTEGER, nb_hits INTEGER, PRIMARY KEY (date_id, link_id) ) ''') # Metrics db: create views to explicitly show referenced data mst_conn.execute(''' CREATE VIEW 
mtrx.daily_expl AS SELECT date, path, d.* FROM daily AS d NATURAL JOIN dates NATURAL JOIN shd_paths ''') mst_conn.execute(''' CREATE VIEW mtrx.feedback_expl AS SELECT date, path, f.* FROM feedback AS f NATURAL JOIN dates NATURAL JOIN shd_paths ''') mst_conn.execute(''' CREATE VIEW mtrx.downloads_expl AS SELECT date, url, dl.* FROM downloads AS dl NATURAL JOIN dates NATURAL JOIN shd_links ''')
Create a new skeleton data store.
Note: Since this function will not be used very often, it mainly serves as a means to document the storage concept.
Driven by the configuration file bd_www.constants.CONFIG_NAME, (sub)directories are created in the master directory to contain the various scraped data, and empty scrapes and metrics databases are created with all necessary tables.
The following data will be stored as files in separate subdirectories of the master directory:
- page sources in zip files per scrape (see the read-back sketch after this list)
- 'robots.txt' files when used for finding unscraped url's and if different from the previous version
- 'sitemap.xml' files when used for finding unscraped url's and if different from the previous version
- log files for all scrapes
- site reports per report variation, with subdirectories per reporting period
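The page source zip files can be read back for later inspection or reprocessing. The sketch below is a minimal example, assuming the bd_www package is importable and that a source zip exists for the chosen (purely illustrative) timestamp; PageSourceZip and its page_ids and iter_pages methods are used here as they are elsewhere in this module.

from bd_www import PageSourceZip

ts = '201231-0300'           # illustrative scrape timestamp [yymmdd-hhmm]
src_zip = PageSourceZip(ts)  # source zips are stored per scrape timestamp
print(f'scrape {ts} holds {len(src_zip.page_ids())} page sources')
for page_id, doc in src_zip.iter_pages():
    # doc is the stored html source of the page with this page_id
    if 'DCTERMS.modified' not in doc:
        print(f'page {page_id} has no modification date meta tag')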
The following files are stored directly in the master directory:
- web analytics log file
- reporting log file
- scrapes database
- metrics database
The names of the files and (sub)directories mentioned above are determined by the parameters of the configuration file. Please refer to the bd_www.Config class for further information.
The scrapes database consists of the following tables and views:
History tables - contain all basic, extracted and derived data from all scrapes. All stored data is labeled with the timestamp of the scrape. With some pragmatic exceptions, data is only saved when it differs from the previous scrape (time delta storage); a query sketch of this pattern follows the list below.
- his_pages_life - time delta register when a page was alive or not (alive is FALSE when a page was not present)
- his_pages_info - time delta register of extracted and derived data per page (full record saved when one or more values differ from previous scrape)
- his_ed_links - time delta register with usage of editorial links (present is 0 when an editorial link was not used)
- his_links_status - time delta register with type and status of (editorial) links
- his_redirs - time delta register with redirects that occurred while scraping (redir_type is 'absent' when it did not occur)
- his_unlinked - time delta register with pages that were not connected via editorial links (unl_type is 'absent' when no longer unlinked or not present)
- his_short_urls - time delta register when short url's were active
- his_status_figures - figures representing the status of the site for each scrape
Master tables - contain unique and primary data
- mst_paths - unique relation between path and page_id
- mst_links - unique relation between url or page_id/anchor and link_id
- mst_scrapes - primary reference for available scrapes and unique source for scraper and primers fields
Non tables/views - collections of manually administered page_id's to force special handling of these pages.
- non_scrape_ids - table with manually administered page_id's that will not be scraped
- non_orphan_ids - table with manually administered page_id's that will not be considered orphans when becoming unlinked (unl_type will be set to 'special' in those cases)
- non_orphan_paths - view of table non_orphan_ids with explicit path
- non_scrape_paths - view of table non_scrape_ids with explicit path
Scrape tables - operational stores for scrape results; purged when starting a new scrape.
- scr_page_ids - id's of pages that were scraped
- scr_redirs - redirects that occurred while scraping
- scr_pages_info - extracted and derived data per page
- scr_links - register of fully qualified (including link_type and status) links in the editorial content of the scraped pages
- scr_ed_links - use of links in the editorial content of each page
- scr_unlinked - pages that cannot be reached from other pages
- scr_status_figures - figures representing the status of the site
Timestamped table and views - after saving the timestamp in (the only record of) the tsd table, the timestamped views represent a dataset that is comparable to the scrape tables after the scrape with the specified timestamp finished (a short usage sketch follows this list).
- tsd - scalar store to set the timestamp for the next timestamped views
- tsd_redirs - identical to equivalent scrape table
- tsd_redirs_to_wcm - all redirects related to the canonical url of the page they finally redirected to (not available as scrape table)
- tsd_pages_info - identical to equivalent scrape table
- tsd_links - identical to equivalent scrape table
- tsd_ed_links - identical to equivalent scrape table
- tsd_unlinked - identical to equivalent scrape table
- tsd_int_ref_cnt - number of times that a page is linked to from the editorial content of other pages (not available as scrape table)
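As an illustration of how these timestamped views are meant to be used, a minimal sketch, assuming an existing scrapes database and the mst_conn connection this module uses; the timestamp value is hypothetical:

    from bd_www import mst_conn

    # Point the timestamped views at a specific scrape by (re)setting the
    # single record of the tsd table.
    mst_conn.execute('DELETE FROM tsd')
    mst_conn.execute('INSERT INTO tsd (timestamp) VALUES (?)', ['230101-0300'])

    # The tsd_* views now reflect the dataset as it was after that scrape.
    pages_alive = mst_conn.execute(
        'SELECT count(*) FROM tsd_pages_info').fetchone()[0]
    print(f'pages alive after scrape 230101-0300: {pages_alive}')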
The metrics database consists of the following tables and views:
Basic tables - contain unique and primary data
- shd_paths - unique relation between path and page_id; acts as shadow subset of the mst_paths table in the scrape database as substitute for an inter-db relation that is not possible with SQLite
- shd_links - unique relation between link_id and url; acts as shadow subset of the mst_links table in the scrape database as substitute for an inter-db relation that is not possible with SQLite
- dates - unique relation between date and date_id
Analytics tables - contain all historic analytics data
- daily - numeric analytics data per date and page
- feedback - numeric and textual feedback per date and page
- downloads - numeric download data per date and download link
Analytics views - views for the analytics tables with explicit relations
- daily_expl - daily data table with explicit date and page path
- feedback_expl - feedback data table with explicit date and page path
- downloads_expl - downloads data table with explicit date and link url
View Source
def scrape_site() -> str | None: """ **Scrape www.belastingdienst.nl.** Returns: timestamp of the scrape [yymmdd-hhmm] or None in case a 'scrape_in_progress.flag' is encountered One complete scrape is done of `www.belastingdienst.nl` and stored in the scrape tables of the scrapes database (documented with the `create_scrapes_storage` function) and a zip-file with pages sources. The scrape tables will contain `page_id`'s of the scraped pages, redirects that occurred while retrieving the pages, detailed info per page, use of links from editorial content and figures representing the site status. After finishing the scrape and completing the scrape tables, the result of the scrape is added to the history tables of the scrapes database. Further process and status information of the scrape is retained in the next files: - log file with info, warnings and/or errors of the scrape - copies of the 'robots.txt' and 'sitemap.xml' files if they are used to find unscraped pages and if they differ from the previously saved version In case the scraper operates in an environment with normal external access, all editorial links will be checked. Otherwise only links to (a few configured) trusted sites are checked. The outcome of all checks, executed or not, will be reflected in the *status* field of the *scr_links* and *his_links_status* tables of the scrapes database. """ global logger def ext_int_available() -> bool: """ **Tests whether external internet is available.** Returns: Availability. Only tests if www.google.nl can be reached. """ # noinspection PyBroadException try: ok = requests.get( 'https://www.google.nl', timeout=5).status_code == 200 except Exception: ok = False return ok # Some preliminaries ts = time.strftime('%y%m%d-%H%M') logger = logging.getLogger('scrape') log_dir = mst_dir / scrape_conf.log_dir_name log_file = str(log_dir / f'{ts}_{scrape_conf.log_name}') logger.setLevel(logging.INFO) fh = logging.FileHandler(log_file) fh.setLevel(logging.INFO) ch = logging.StreamHandler() ch.setLevel(logging.INFO) formatter = logging.Formatter( fmt='[%(asctime)s] %(levelname)-8s - %(message)s', datefmt='%Y-%m-%d %H:%M:%S') fh.setFormatter(formatter) ch.setFormatter(formatter) logger.addHandler(fh) logger.addHandler(ch) logger.info(f'scrape {ts} started') # Check if another scrape is running flag = mst_dir / 'SCRAPE_IN_PROGRESS.flag' if flag.exists(): ct = dt.datetime.fromtimestamp(flag.stat().st_ctime) logger.warning(f'Another scrape in progress since ' f'{ct.strftime("%a %d %b %Y / %H:%M")}') return '' else: flag.touch() # Load url's that should not be scraped qry_result = mst_conn.execute( 'SELECT path FROM non_scrape_paths').fetchall() urls_done = {ROOT_URL + path for (path,) in qry_result} # Initialisations empty_scr_tables() api_ok = matomo_available() ext_ok = ext_int_available() if not ext_ok: logger.warning('www.google.nl not available: ' 'assuming limited external access') urls_todo = {WWW_SITE} urls_scraped = 0 pages_saved = 0 unlinked_urls_checked = False mst_conn.execute(''' INSERT INTO mst_scrapes (timestamp, scraper, primers) VALUES (?, ?, ?) 
''', [ts, SCRAPER_VERSION, 'site url only']) start_time = time.time() # Do the scrape max_nr = 100_000 if scrape_conf.max_urls == 'all' else scrape_conf.max_urls src_zip = PageSourceZip(ts) while urls_todo and urls_scraped < max_nr: req_url = urls_todo.pop() wcm_url = scrape_page(req_url, urls_done, urls_todo, src_zip, ext_ok) urls_scraped += 1 if wcm_url: pages_saved += 1 # After initial link crawling, check any url's were missed if not urls_todo and not unlinked_urls_checked: logger.info(f'initial link crawling done: ' f'{urls_scraped} urls\'s visited / ' f'{pages_saved} pages saved') logger.info('finding url\'s that have not been crawled') urls_todo = find_unscraped_urls(ts, urls_done, historic=True, requested=api_ok) unlinked_urls_checked = True # Time cycles and print progress and prognosis num_todo = min(len(urls_todo), max_nr - urls_scraped) if urls_scraped % 25 == 0: page_time = (time.time() - start_time) / urls_scraped togo_time = int(num_todo * page_time) print(f'{urls_scraped:4} url\'s done, ' f'{page_time:.2f} secs per url / ' f'{num_todo:4} found todo, {togo_time // 60}:' f'{togo_time % 60:02} min to scrape them') elapsed = int(time.time() - start_time) # Important: do not alter the next logging text; the presence of this # text in the log file validates the scrape (ref. check_storage_integrity # function). logger.info(f'scrape finished in {elapsed // 60}:{elapsed % 60:02} min') logger.info(f'{urls_scraped} urls\'s visited / {pages_saved} pages saved') # Processing the scraped data post_process_scrape() extract_pages_info(ts) derive_pages_info() status_figures() add_scraped_data(ts) elapsed = int(time.time() - start_time) logger.info(f'scraping and processing data finished in ' f'{elapsed // 60}:{elapsed % 60:02} min') flag.unlink(missing_ok=True) return ts
Scrape www.belastingdienst.nl.
Returns:
timestamp of the scrape [yymmdd-hhmm] or None in case a
'scrape_in_progress.flag' is encountered
One complete scrape is done of www.belastingdienst.nl and stored in the
scrape tables of the scrapes database (documented with the
create_scrapes_storage function) and a zip-file with pages sources.
The scrape tables will contain page_id's of the scraped pages,
redirects that occurred while retrieving the pages, detailed info per
page, use of links from editorial content and figures representing the
site status.
After finishing the scrape and completing the scrape tables, the result of the scrape is added to the history tables of the scrapes database.
Further process and status information of the scrape is retained in the following files:
- log file with info, warnings and/or errors of the scrape
- copies of the 'robots.txt' and 'sitemap.xml' files if they are used to find unscraped pages and if they differ from the previously saved version
In case the scraper operates in an environment with normal external access, all editorial links will be checked. Otherwise only links to (a few configured) trusted sites are checked. The outcome of all checks, executed or not, will be reflected in the status field of the scr_links and his_links_status tables of the scrapes database.
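A minimal usage sketch, assuming the storage has been created and the configuration is in place:

    from bd_www.scrape import scrape_site

    ts = scrape_site()
    if ts:
        print(f'scrape {ts} finished')
    else:
        print('no scrape done: another scrape was already in progress')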
View Source
def empty_scr_tables() -> None: """ **Delete all content from scrape tables.** """ qry = ''' SELECT name FROM sqlite_schema WHERE type = 'table' AND name LIKE 'scr_%' ''' for name in [r[0] for r in mst_conn.execute(qry)]: mst_conn.execute(f'DELETE FROM {name}')
Delete all content from scrape tables.
View Source
def scrape_page( req_url: str, urls_done: Set[str], urls_todo: Set[str], page_zip: PageSourceZip, check_all_links: bool = True) -> Union[str, None]: """ **Scrape an html page and save the results to the scrape tables.** Arguments: req_url: url of requested page urls_done: url's that have been scraped already urls_todo: url's that still need to be scraped page_zip: archive to store scraped page sources check_all_links: when false, only trusted links will be checked Returns: canonical wcm url (or None for mentioned cases) Get the final response from the requested url while collecting any redirect that occurs during the request. Register the page under the canonical wcm path and save the page source to the zip-archive. Further store page-related data, editorial links and redirects to the scrape tables of the scrapes database (documented with the `create_scrapes_storage` function). The `urls_done` and `urls_todo` sets are updated accordingly, driving the further scrape of the site. When `check_all_links` is `True`, all links within the page that are external (do not start with `bd_www.constants.ROOT_URL`) will be checked. When set to `False`, only those external links will be checked that go to a site in the `trusted_domains` configuration parameter (refer to `bd_www.Config`). This option can be used when this module runs on a platform that (for reasons of security) has limited external site access. All url's are absolute, also in the resulting *scr_links*, *scr_redirs* and *scr_unlinked* tables. The `post_process_scrape` should be used to normalise these url's and determine the type of the unlinked pages. The returned canonical wcm url is the content attribute of the `<meta name="DCTERMS.identifier">` tag. This is the url as generated by the WCM system. The returned value will be `None` when the requested url - was already scraped (and saved) - redirects to an out-of-scope page - gave an unexpected response (which is logged) """ def check_ext_link( lnk: str, trusted_only: bool = False) -> tuple[str, str]: """ **Return type and response status of an external link.** Arguments: lnk: url to be checked trusted_only: when true, only links to trusted sites will will be checked; the returned response status will reflect this Returns: response status, link type The returned `response status` is the status code from the primary response upon requesting the given url. By default the returned `link type` will be 'ext', but when the response is a redirect, depending on the destination url, '-to-int' or '-to-ext' is added to this type. When requesting the url hits an error, it is returned via the `response status`. The sites to be trusted are taken from the `trusted_domains` configuration parameter in the [SCRAPE] section of the configuration file (see `bd_www.Config`). 
""" l_type = 'ext' if trusted_only: re_result = re.match(r'^(.*\.)?(.*\..*)$', urlparse(lnk).netloc) link_site = re_result[2] if re_result else None if link_site not in scrape_conf.trusted_domains: return 'unable to check link', l_type try: response = requests.get( lnk, allow_redirects=False, verify=False, timeout=5) except Exception as err: resp_status = f'error while checking link: {err}' else: resp_status = response.status_code if resp_status in (301, 302, 303, 307): location = response.headers['Location'] if location.startswith(ROOT_URL): l_type += '-to-int' else: l_type += '-to-ext' return str(resp_status), l_type def save_page(path: str, doc: str) -> Union[int, None]: """ **Save scraped page to the page source zip.** Arguments: path: relative to the ROOT_URL of the scrape doc: page source Returns: page_id or None if already saved If the page was not saved before, it will be saved with the page_id requested from the `mst_paths` table. A new page_id wil be created if needed. """ qry_res = mst_conn.execute( 'SELECT page_id FROM mst_paths WHERE path = ?', [path]).fetchone() if qry_res: p_id = qry_res[0] else: # This is a brand new page p_id = mst_conn.execute( 'INSERT INTO mst_paths (path) VALUES (?)', [path]).lastrowid if p_id in page_zip.page_ids(): # Page already saved during this scrape return None else: # Page not saved yet during this scrape page_zip.add_page(p_id, doc) mst_conn.execute('INSERT INTO scr_page_ids VALUES (?)', [p_id]) return p_id # Setup some variables page_as_string = '' soup = None redir_qry = ''' INSERT OR IGNORE INTO scr_redirs (req_url, redir_url, redir_type) VALUES (?, ?, ?) ''' # Request loop while True: # Keep requesting until no further rewrites or redirects occur, # while saving redirects to the temporary table and updating the # urls_done admin. # noinspection PyBroadException try: resp = requests.get(req_url, allow_redirects=False) except Exception: # The requester failed on this url; consider it done and continue. logger.error(f'failed to request {req_url}') urls_done.add(req_url) return None urls_done.add(req_url) # Check and handle status codes if resp.status_code == 200: pass elif resp.status_code in (301, 302): # This concerns a redirect response redir_url = resp.headers['Location'] # If request or redir is in scope, save the redirect with full # url's to the temporary table. if (req_url.startswith(ROOT_URL) or redir_url.startswith(ROOT_URL)): mst_conn.execute( redir_qry, [req_url, redir_url, resp.status_code]) if redir_url.startswith(ROOT_URL): # restart Request loop with redirected url req_url = redir_url continue else: # Redirect url is out of scope. return None else: # Returned status code is not 200, 301 or 302, so getting the page # did not succeed. Consider it done and return a None result. logger.error(f'unexpected response from {req_url}; ' f'status code is {resp.status_code}.') return None # Read and parse the response into a soup document page_as_string = resp.text soup = BeautifulSoup(page_as_string, features='lxml') # Has this page a client-side redirect by means of header tag # <meta http-equiv="refresh" content="0;url=...">? meta_tag = soup.find('meta', attrs={'http-equiv': 'refresh'}) if meta_tag: redir_url = meta_tag['content'].split('url=')[1] redir_url = urljoin(req_url, redir_url) # If request or redir is in scope, save the redirect with full # url's to the temporary table. 
if (req_url.startswith(ROOT_URL) or redir_url.startswith(ROOT_URL)): mst_conn.execute(redir_qry, [req_url, redir_url, 'client']) if redir_url.startswith(ROOT_URL): # Restart Request loop with redirected url req_url = redir_url continue else: # Redirect url is out of scope. return None # End of the Request loop. # Reaching this point means we're done requesting this page. break # Strip potential parameters and anchor in the response url resp_url = resp.url.split('?')[0].split('#')[0] # Determine the wcm url if resp_url.startswith(ROOT_URL): # Response url is in scope. Now the wcm url should be in the header # meta tag named 'DCTERMS.identifier'. meta_tag = soup.find( 'meta', attrs={'name': 'DCTERMS.identifier'}) if meta_tag: wcm_url = meta_tag['content'] wcm_url = urljoin(resp_url, wcm_url) if wcm_url != resp_url: # Save this as an alias in the redirects table mst_conn.execute(redir_qry, [resp_url, wcm_url, 'alias']) # Consider also the wcm_url as done urls_done.add(wcm_url) else: logger.error(f'page without wcm url; ' f'falling back to: {resp_url}') wcm_url = resp_url else: # Response url is not in scope. # Because of the logic from the code above this situation should # never occur. Since this is not destructive, scraping continues. logger.error(f'out of scope response from {req_url}') return None # Save in-scope page under the wcm path. # It is possible that a page in the urls_todo set redirects to a wcm_url # of which the page/path combination is saved already. Especially some # short url's cause this situation. Those cases are not saved (again). wcm_path = wcm_url.split(ROOT_URL)[1] page_id = save_page(wcm_path, page_as_string) if not page_id: # Page was saved and handled earlier during this scrape return None # Fetch all in-page links, save the editorial links to the scrape db and # add links of in-scope pages to the urls_todo set. # Cycle over the editorial, automated en rest (partial) trees of the page for tree in split_tree(soup): editorial = True if tree.html['tree'] == 'editorial' else False # Cycle over all links in the tree for a_tag in tree.find_all('a', href=True): # Unquote the link to be sure (history taught this) link = unquote(a_tag['href'].strip()) link_text = a_tag.text.strip() # Some links are discarded if link.startswith('#'): # Plain #-links continue if link == '/': # Home page links continue if 'readspeaker' in link or 'adobe' in link or 'java' in link: # Non content links continue link = urljoin(resp_url, link) # Separate anchor from in-scope url if link.startswith(ROOT_URL) and '#' in link: link, _, anchor = link.partition('#') else: anchor = None if link == resp_url: # Discard in-page link (in case it was not a plain #-link) continue if link and editorial: link_type = None link_status = None # Check if the link is new for this scrape qry = f''' SELECT link_nr FROM scr_links WHERE url = '{link}' AND anchor IS {f"'{anchor}'" if anchor else 'NULL'} ''' qry_result = mst_conn.execute(qry).fetchone() if qry_result: # This link was already handled during this scrape link_nr = qry_result[0] else: if re.match('^https?://', link): # Get type and status of the link if link.startswith(ROOT_URL): link_type = 'int' else: # The link is external link_status, link_type = check_ext_link( link + '#' + anchor if anchor else link, not check_all_links) # If an external link redirects to an internal # page, it should be scraped too. This is true for # short url's, since these are strictly external # relative to the root url. 
if link_type == 'ext-to-int': if link not in urls_done: urls_todo.add(link) # Store the link for this scrape link_nr = mst_conn.execute(''' INSERT INTO scr_links (url, anchor, link_type, status) VALUES (?, ?, ?, ?) ''', [link, anchor, link_type, link_status]).lastrowid # Save link in relation to the page (can be a duplicate if # the link is used more than once with the same text in the # same page). mst_conn.execute(''' INSERT INTO scr_ed_links (page_id, text, link_nr) VALUES (?, ?, ?) ''', [page_id, link_text, link_nr]) if link: # Discard links that do not need to be scraped if link.endswith('.xml'): continue if not link.startswith(ROOT_URL): # Link not in scope continue # Strip potential parameters (anchor was separated from # in-scope url before). valid_scrape_url = link.split('?')[0] if valid_scrape_url in urls_done: # Already handled continue urls_todo.add(valid_scrape_url) else: # This condition is probably not possible logger.warning(f'empty link in page {page_id}') return wcm_url
Scrape an html page and save the results to the scrape tables.
Arguments:
req_url: url of requested page
urls_done: url's that have been scraped already
urls_todo: url's that still need to be scraped
page_zip: archive to store scraped page sources
check_all_links: when false, only trusted links will be checked
Returns:
canonical wcm url (or None for mentioned cases)
Get the final response from the requested url while collecting any
redirect that occurs during the request. Register the page under the
canonical wcm path and save the page source to the zip-archive. Further
store page-related data, editorial links and redirects to the scrape
tables of the scrapes database (documented with the
create_scrapes_storage function). The urls_done and urls_todo sets
are updated accordingly, driving the further scrape of the site.
When check_all_links is True, all links within the page that are
external (do not start with bd_www.constants.ROOT_URL) will be checked.
When set to False, only those external links will be checked that go to
a site in the trusted_domains configuration parameter (refer to
bd_www.Config). This option can be used when this module runs on a
platform that (for reasons of security) has limited external site access.
All url's are absolute, also in the resulting scr_links, scr_redirs
and scr_unlinked tables. The post_process_scrape should be used to
normalise these url's and determine the type of the unlinked pages.
The returned canonical wcm url is the content attribute of the <meta
name="DCTERMS.identifier"> tag. This is the url as generated by the WCM
system.
The returned value will be None when the requested url
- was already scraped (and saved)
- redirects to an out-of-scope page
- gave an unexpected response (which is logged)
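For illustration only, the crawl pattern with which scrape_site drives this function looks roughly like the sketch below; since the module's helpers are not meant to be used separately, the timestamp and initialisation are hypothetical:

    from bd_www import PageSourceZip
    from bd_www.constants import WWW_SITE
    from bd_www.scrape import scrape_page

    src_zip = PageSourceZip('230101-0300')          # hypothetical timestamp
    urls_done, urls_todo = set(), {WWW_SITE}
    while urls_todo:
        req_url = urls_todo.pop()
        wcm_url = scrape_page(req_url, urls_done, urls_todo, src_zip)
        if wcm_url:
            print(f'saved page with wcm url {wcm_url}')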
View Source
def split_tree( soup: BeautifulSoup) -> (BeautifulSoup, BeautifulSoup, BeautifulSoup): """ **Split a soup doc in three separate docs: editorial, automated and rest.** Arguments: soup: bs4 representation of the page to be split Returns: editorial doc, automated doc, rest doc The *editorial doc* is a copy of the originating doc with all tags removed that do not contain editorial content. The *automated doc* is a new html (root) tag containing all tags with automated content. The *rest doc* is an html (root) tag containing all tags that did not represent editorial or automated content. No tags are lost, so all three docs together contain the same tags as the originating doc, but only the hierarchical structure of the *editorial doc* remains intact. To identify the separate docs, a `tree` attribute is added to the `<html>` tag with values 'editorial', 'automated' and 'rest' respectively. More specifically, the three docs will contain the next tags from the originating doc. ***rest doc***: - `<head>` - `<header>` - `<footer>` - `<div id="bld-nojs">`: for situation that javascript is not active - `<div class="bld-subnavigatie">`: left side navigation of bib pages - `<div class="bld-feedback">`: bottom feedback of content page - `<div>` with readspeaker buttons - `<div>` with modal dialog for the virtual assistant - `<script>` en `<noscript>` - comments ***automated doc***: - `<div class="add_content">`: automatically added content - all remaining tags from pages with type bld-overview ***editorial doc***: - all tags that do not go to one of the other docs """ # Make working copy from soup doc, because of destructive tag removal tree = copy.copy(soup) # Create rest doc as html trunk rst_tree = BeautifulSoup('<html tree="rest"></html>', features='lxml') # All that is not needed for the editorial or automated trees goes to rest: # - head, header and footer if tree.head: rst_tree.html.append(tree.head.extract()) if tree.body.header: rst_tree.html.append(tree.body.header.extract()) if tree.body.footer: rst_tree.html.append(tree.body.footer.extract()) # - content for non active javascript div_nojs = tree.find('div', id='bld-nojs') if div_nojs: rst_tree.html.append(div_nojs.extract()) # - sub-navigation div_subnav = tree.find(class_='bld-subnavigatie') if div_subnav: rst_tree.html.append(div_subnav.extract()) # - feedback div_feedback = tree.find(class_='bld-feedback') if div_feedback: rst_tree.html.append(div_feedback.extract()) # - readspeaker buttons for tag in tree.find_all('div', class_='rs_skip'): rst_tree.html.append(tag.extract()) # - modal dialog for the virtual assistant for tag in tree.find_all('div', id='vaModal'): rst_tree.html.append(tag.extract()) # - comments for tag in tree.find_all(string=lambda text: isinstance(text, Comment)): rst_tree.html.append(tag.extract()) # - scripts for tag in tree.find_all(('script', 'noscript')): rst_tree.html.append(tag.extract()) # - no-support div for tag in tree.find_all(id='bld-nosupport'): rst_tree.html.append(tag.extract()) # Is the page generated without any editor intervention? 
if tree('body', attrs={'data-pagetype': 'bld-overview'}): # Then editor tree becomes trunk only ed_tree = BeautifulSoup( '<html tree="editorial"></html>', features='lxml') # And automated tree gets all remaining content aut_tree = tree aut_tree.html['tree'] = 'automated' else: # Make new html trunk and graft all automated content on it aut_tree = BeautifulSoup( '<html tree="automated"></html>', features='lxml') for tag in tree.find_all('div', class_='content_add'): aut_tree.html.append(tag.extract()) # Editorial tree gets all remaining content ed_tree = tree ed_tree.html['tree'] = 'editorial' return ed_tree, aut_tree, rst_tree
Split a soup doc in three separate docs: editorial, automated and rest.
Arguments:
soup: bs4 representation of the page to be split
Returns:
editorial doc, automated doc, rest doc
The editorial doc is a copy of the originating doc with all tags removed that do not contain editorial content. The automated doc is a new html (root) tag containing all tags with automated content. The rest doc is an html (root) tag containing all tags that did not represent editorial or automated content. No tags are lost, so all three docs together contain the same tags as the originating doc, but only the hierarchical structure of the editorial doc remains intact.
To identify the separate docs, a tree attribute is added to the <html>
tag with values 'editorial', 'automated' and 'rest' respectively.
More specifically, the three docs will contain the following tags from the originating doc.
rest doc:
- <head>
- <header>
- <footer>
- <div id="bld-nojs">: for situation that javascript is not active
- <div class="bld-subnavigatie">: left side navigation of bib pages
- <div class="bld-feedback">: bottom feedback of content page
- <div> with readspeaker buttons
- <div> with modal dialog for the virtual assistant
- <script> and <noscript>
- comments
automated doc:
<div class="add_content">: automatically added content- all remaining tags from pages with type bld-overview
editorial doc:
- all tags that do not go to one of the other docs
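A short sketch of how the returned docs can be inspected; the html source used here is just illustrative:

    from bs4 import BeautifulSoup
    from bd_www.scrape import split_tree

    html = '<html><body><header>menu</header><p>editorial text</p></body></html>'
    soup = BeautifulSoup(html, features='lxml')
    ed_doc, aut_doc, rest_doc = split_tree(soup)
    print(ed_doc.html['tree'], aut_doc.html['tree'], rest_doc.html['tree'])
    # -> editorial automated rest
    print(ed_doc.text.strip())    # 'editorial text'; the header moved to the rest doc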
View Source
def find_unscraped_urls( ts: str, urls_done: Set[str], historic: bool = True, requested: bool = False) -> Set[str]: """ **Get url's that were not found via link crawling.** Arguments: ts: timestamp of the scrape [yymmdd-hhmm] urls_done: url's that have been scraped already historic: check all url's that were ever scraped before requested: check all url's that were requested during the last 8 days Returns: originating url's that responded with internal url's that were not scraped yet When scraping via link crawling is exhausted, this function should be used to find url's of pages that do exist, but were not scraped yet. The next url sets will be verified (partially depending on parameters from the configuration file; see further `bd_www.Config`): - latest unlinked url's (depending on `use_unlinked_urls` configuration parameter) - url's from the sitemap(s), located via the 'robots.txt' file (depending on `use_sitemap` configuration parameter) - url's that were scraped during all previous scrapes - url's that were requested during the last eight days (including today) - url's encountered during earlier scrapes that redirected to internal pages (implicitly including short url's) The 'robots.txt' and 'sitemap.xml' files that are used, will be saved in case they differ from the previously saved version. The (fully qualified) response url's of the non scraped pages are stored in the *scr_unlinked* table of the scrapes database. """ def save_scrape_file(content: str, name: str, dest_dir: Path) -> None: """ **Save content to file if it differs from the most recent version.** Arguments: content: content to be saved name: name of the file to save to dest_dir: directory to save in The name with which the content is saved will be preceded by the timestamp and an underscore character. The content will also be saved if `dest_dir` refers to an empty directory. """ dir_files = sorted(dest_dir.iterdir()) if dir_files: with open(dir_files[-1]) as lf: last_content = lf.read() if not dir_files or content != last_content: with open(dest_dir / f'{ts}_{name}', 'w') as nf: nf.write(content) def living(all_urls: Set[str], set_name: str) -> Set[str]: """ **Return and register url's of living pages that were not scraped yet.** Arguments: all_urls: url's to check set_name: identification used for logging Returns: originating url's that responded with living pages Apart from returning the set of originating url's that responded with an active internal page that was not scraped yet, the respons url's of these pages are stored in the `scr_unlinked` table of the scrapes database. Duration and results be logged. 
""" start = time.time() missed_urls = set() for u in all_urls - urls_done: # noinspection PyBroadException try: rsp = requests.get(u) except Exception: continue if rsp.status_code // 100 != 4 and rsp.url.startswith(ROOT_URL): # Url is active and responds with an internal page missed_urls.add(u) if rsp.url not in urls_done: mst_conn.execute( 'INSERT OR IGNORE INTO scr_unlinked (url) ' 'VALUES (?)', [rsp.url]) logger.debug(f"unlinked {set_name} url's: {rsp.url}") # Wrapping up secs = int(time.time() - start) msg = 'link crawling ' if missed_urls: msg += f"missed {len(missed_urls)} {set_name} url's" else: msg += f"did not miss any {set_name} url's" msg += f'; checking took {secs // 60}:{secs % 60:02} min' logger.info(msg) return missed_urls still_to_scrape = set() # Verify previously unlinked url's if scrape_conf.use_unlinked_urls: qry = f''' WITH last_unlinked_pages AS ( SELECT DISTINCT page_id, last_value(unl_type) OVER ( PARTITION BY page_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) AS unl_type FROM his_unlinked ), last_unlinked_urls AS ( SELECT DISTINCT url, last_value(unl_type) OVER ( PARTITION BY url ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) AS unl_type FROM his_unlinked ) SELECT '{ROOT_URL}' || path FROM last_unlinked_pages left JOIN mst_paths USING (page_id) WHERE path NOTNULL AND unl_type != 'absent' UNION SELECT url FROM last_unlinked_urls WHERE url NOTNULL AND unl_type != 'absent' ''' unlinked_urls = {r[0] for r in mst_conn.execute(qry).fetchall()} still_to_scrape |= living(unlinked_urls, 'unlinked') # Verify url's from the sitemap(s) if scrape_conf.use_sitemap: # Find the sitemap via robots.txt robots_url = WWW_SITE + '/robots.txt' response = requests.get(robots_url) if response.status_code != 200: logger.info( f'getting {robots_url} returned status {response.status_code}') else: robots_txt = response.text robots_dir = mst_dir / scrape_conf.robots_dir_name save_scrape_file(robots_txt, 'robots.txt', robots_dir) sitemap_urls = re.findall(r'Sitemap:\s(https?://\S*)', robots_txt) if len(sitemap_urls) == 0: logger.info('no sitemap.xml declared in robots.txt') else: urls_from_sitemaps = set() sitemap_dir = mst_dir / scrape_conf.sitemap_dir_name for i, url in enumerate(sitemap_urls, start=1): response = requests.get(url) if response.status_code != 200: logger.info(f'getting {url} returned status ' f'{response.status_code}') else: logger.info(f'sitemap found at {url}') if len(sitemap_urls) == 1: filename = f'sitemap.xml' else: filename = f'sitemap({i}).xml' save_scrape_file(response.text, filename, sitemap_dir) soup = BeautifulSoup(response.text, features='lxml') urls_from_sitemaps |= {loc.text for loc in soup.findAll('loc')} still_to_scrape |= living(urls_from_sitemaps, 'sitemap') # Verify all historic url's if historic: qry_result = mst_conn.execute( 'SELECT path FROM mst_paths').fetchall() historic_urls = {ROOT_URL + path for (path,) in qry_result} still_to_scrape |= living(historic_urls, 'historic') # Verify url's that were requested during the last eight days if requested: # noinspection PyBroadException # This try clause is used since Matomo is very unpredictable try: requested_urls = urls_last_days() except Exception: logger.critical('using urls_last_days() failed') else: # Check which do exist but are not crawled yet still_to_scrape |= living(requested_urls, 'visited') # Verify all previously requested url's that initiated redirects qry_result = mst_conn.execute( 'SELECT DISTINCT req_url FROM 
his_redirs').fetchall() redirected_urls = {url for (url,) in qry_result} still_to_scrape |= living(redirected_urls, 'redirected') return still_to_scrape
Get url's that were not found via link crawling.
Arguments:
ts: timestamp of the scrape [yymmdd-hhmm]
urls_done: url's that have been scraped already
historic: check all url's that were ever scraped before
requested: check all url's that were requested during the last 8 days
Returns:
originating url's that responded with internal url's that were not
scraped yet
When scraping via link crawling is exhausted, this function should be
used to find url's of pages that do exist, but were not scraped yet. The
following url sets will be verified (partially depending on parameters from
the configuration file; see further bd_www.Config):
- latest unlinked url's (depending on use_unlinked_urls configuration parameter)
- url's from the sitemap(s), located via the 'robots.txt' file (depending on use_sitemap configuration parameter)
- url's that were scraped during all previous scrapes
- url's that were requested during the last eight days (including today)
- url's encountered during earlier scrapes that redirected to internal pages (implicitly including short url's)
The 'robots.txt' and 'sitemap.xml' files that are used, will be saved in case they differ from the previously saved version.
The (fully qualified) response url's of the non-scraped pages are stored in the scr_unlinked table of the scrapes database.
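The essence of each verification is a check whether a candidate url is 'living', i.e. responds with an internal, non-4xx page. A simplified sketch of that check (the timeout is an addition for illustration):

    import requests

    from bd_www.constants import ROOT_URL

    def is_living(url: str) -> bool:
        """Return True when the url responds with an internal, non-4xx page."""
        try:
            rsp = requests.get(url, timeout=10)
        except Exception:
            return False
        return rsp.status_code // 100 != 4 and rsp.url.startswith(ROOT_URL)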
View Source
def post_process_scrape() -> None: """ **Update tables after finishing a scrape.** After scraping all pages using the `scrape_page` function, this `post_process_scrape` function needs to be used to updates some tables in the scrapes database. This is essential before further processing can be done of the scraped data. This post-processing consists of two steps: **Step 1** - Url normalisation Convert the full url's in the *scr_links*, *scr_redirs* and *scr_unlinked* tables to `page_id`'s where possible. ***Remark:*** *This can result in `page_id`'s of url's/pages that were not alive during the scrape (especially in the `scr_links` table).* **Step 2** - Typing unlinked pages Determine the type of the pages in the *scr_unlinked* table by setting *unl_type* to: - 'alias' - when the page is an alias to a wcm page - 'special' - when the id is in the *non_orphan_ids* table of the scrapes database - 'orphan' - in all other cases **Finally** Because some of the transactions impact the size of the scrapes database, it is vacuumed before returning. """ # Replace url's in the scr_links table with page_id's where possible and # combine separated anchors with remaining urls (during the scrape anchors # were separated form internal url's). mst_conn.execute(f''' UPDATE scr_links SET page_id = ( SELECT page_id FROM mst_paths WHERE url = '{ROOT_URL}' || path), url = NULL WHERE url IN (SELECT '{ROOT_URL}' || path FROM mst_paths) ''') mst_conn.execute(''' UPDATE scr_links SET url = url || '#' || anchor, anchor = NULL WHERE url NOTNULL AND anchor NOTNULL ''') # Replace request and redirect url's in the redirs table with page_id's # where possible. mst_conn.execute(f''' UPDATE scr_redirs SET req_id = ( SELECT page_id FROM mst_paths WHERE req_url = '{ROOT_URL}' || path), req_url = NULL WHERE req_url IN (SELECT '{ROOT_URL}' || path FROM mst_paths) ''') mst_conn.execute(f''' UPDATE scr_redirs SET redir_id = ( SELECT page_id FROM mst_paths WHERE redir_url = '{ROOT_URL}' || path), redir_url = NULL WHERE redir_url IN (SELECT '{ROOT_URL}' || path FROM mst_paths) ''') # Replace url's in the 'scr_unlinked' table with page_id's where possible # and qualify the unlinked url's. mst_conn.execute(f''' UPDATE scr_unlinked SET page_id = ( SELECT page_id FROM mst_paths WHERE url = '{ROOT_URL}' || path), url = NULL WHERE url IN (SELECT '{ROOT_URL}' || path FROM mst_paths) ''') mst_conn.execute(''' WITH -- Url-based redir chains redir_chains (org_url, prev_url, next_id, next_url, next_type) AS ( SELECT req_url, NULL, redir_id, redir_url, redir_type FROM scr_redirs UNION ALL SELECT org_url, next_url, r.redir_id, r.redir_url, r.redir_type FROM redir_chains AS c LEFT JOIN scr_redirs AS r ON c.next_url = r.req_url -- avoid endless loops by excluding redirs of an alias WHERE c.next_type <> 'alias' ), -- All redir chains ending in a wcm url scr_redirs_to_wcm (req_url, wcm_id, final_type) AS ( SELECT org_url, next_id, next_type FROM redir_chains WHERE next_id IS NOT NULL AND org_url IS NOT NULL ) UPDATE scr_unlinked SET unl_type = iif(page_id IN non_orphan_ids, 'special', iif((SELECT final_type FROM scr_redirs_to_wcm WHERE url = req_url) = 'alias', 'alias', 'orphan' ) ) ''') logger.info('scr_links, scr_redirs and scr_unlinked tables ' 'have been normalised') # Finally mst_conn.execute('VACUUM')
Update tables after finishing a scrape.
After scraping all pages using the scrape_page function,
this post_process_scrape function needs to be used to update some
tables in the scrapes database. This is essential before further
processing can be done of the scraped data.
This post-processing consists of two steps:
Step 1 - Url normalisation
Convert the full url's in the scr_links, scr_redirs and
scr_unlinked tables to page_id's where possible.
Remark: This can result in page_id's of url's/pages that were
not alive during the scrape (especially in the scr_links table).
Step 2 - Typing unlinked pages
Determine the type of the pages in the scr_unlinked table by setting unl_type to:
- 'alias' - when the page is an alias to a wcm page
- 'special' - when the id is in the non_orphan_ids table of the scrapes database
- 'orphan' - in all other cases
Finally
Because some of the transactions impact the size of the scrapes database, it is vacuumed before returning.
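The normalisation of step 1 boils down to replacing absolute internal url's by page_id's via the mst_paths table. A sketch of that pattern for a single url; the helper name is hypothetical and not part of the module:

    from bd_www import mst_conn
    from bd_www.constants import ROOT_URL

    def url_to_page_id(url: str):
        """Return the page_id for an internal url, or None when it is unknown."""
        if not url.startswith(ROOT_URL):
            return None
        row = mst_conn.execute(
            'SELECT page_id FROM mst_paths WHERE path = ?',
            [url[len(ROOT_URL):]]).fetchone()
        return row[0] if row else None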
View Source
def extract_pages_info(ts: str) -> None: """ **Renew content of *scr_pages_info* table with extracted information.** Arguments: ts: timestamp of the scrape [yymmdd-hhmm] After deleting any previous content from the *scr_pages_info*, this function fills the 'extracted' data fields of this table for each scraped page with the next content: - *title* - content of `<title>` tag - *description* - content of `<meta name="description" content="..." />` - *num_h1s* - number of `<h1>` tags - *first_h1* - content of first `<h1>` tag - *language* - content of `<meta name="language" content="xx" />` - *modified* - content of `<meta name="DCTERMS.modified" content="..." />` - *pagetype* - attribute value of `<body data-pageType="...">` - *classes* - attribute value of `<body class="...">` - *ed_text* - editorial text of the page - *aut_text* - automated text of the page The *scr_pages_info* table accommodates additional fields to contain derived information for each page. This is further detailed in the `derive_pages_info` function. It will be logged when tags or attributes are missing or values are invalid. ***Data note:*** *Extracted information concerns data that is readily available within the page source. Storing this data in a separate table is strictly redundant, but serves faster access.* ***Implementation note:*** *The fields of the* scr_pages_info *table are defined by the constants `EXTRACTED_FIELDS` and `DERIVED_FIELDS`. Although some groundwork is laid to adapt the number and content of these fields, the actual code does not fully support such an adaptation.* """ logger.info('extracting pages info started') src_zip = PageSourceZip(ts) num_pages = len(src_zip.page_ids()) start_time = time.time() mst_conn.execute('DELETE FROM scr_pages_info') # Cycle over all pages for page_num, (page_id, doc) in enumerate(src_zip.iter_pages(), start=1): qry = f'SELECT path FROM mst_paths WHERE page_id = {page_id}' path = mst_conn.execute(qry).fetchone()[0] soup = BeautifulSoup(doc, features='lxml') info = {'page_id': page_id} # Get title title = soup.title if not title: logger.warning(f'page has no <title> tag: {path}') title = None else: title = title.text if not title: logger.warning(f'page with empty title: {path}') info['title'] = title # Get description description = soup.find(attrs={'name': 'description'}) if not description: # There are very much occurrences of this situation logger.debug( f'page has no <meta name="description"/> tag: {path}') description = None else: description = description['content'] if not description: logger.warning(f'page with empty description: {path}') info['description'] = description # Get info from <h1> tags h1s = [] for h1 in soup.find_all('h1'): h1s.append(h1.text) if not h1s: # h1 tag can be hidden in a script (e.g. 
search page) if h1s := re.findall(r'<h1>(.*)</h1>', doc): logger.warning(f'page with hidden h1-tag: {path}') else: logger.warning(f'page without h1: {path}') info['num_h1s'] = len(h1s) info['first_h1'] = h1s[0] if h1s else None # Get language language = soup.find('meta', attrs={'name': 'language'}) if not language: logger.warning( f'page has no <meta name="language"/> tag: {path}') language = None else: language = language['content'] if not language: logger.warning(f'page with empty language: {path}') info['language'] = language # Get date modified modified = soup.find('meta', attrs={'name': 'DCTERMS.modified'}) if not modified: logger.warning( f'page has no tag <meta name="DCTERMS.modified"/>: {path}') modified = None else: try: modified = dt.date.fromisoformat(modified['content']) except ValueError: logger.warning( f'page with improper modification date: {path}') modified = None info['modified'] = modified # Get type of page if 'data-pagetype' not in soup.body.attrs: logger.warning('page has no data-pagetype attribute in the ' f'<body> tag: {path}') pagetype = None else: pagetype = soup.body['data-pagetype'] if not pagetype: logger.warning( f'page with empty pagetype in <body> tag: {path}') info['pagetype'] = pagetype # Get classes if 'class' not in soup.body.attrs: logger.warning( f'page has no class attribute in the <body> tag: {path}') classes = None else: classes = soup.body['class'] if not classes: logger.warning( f'page with empty class in <body> tag: {path}') info['classes'] = ' '.join(classes) if classes else None # Get editorial and automated texts info['ed_text'], info['aut_text'] = get_text(soup) # add info to the database fields = ', '.join(info) q_marks = ('?, ' * len(info))[:-2] mst_conn.execute( f'INSERT INTO scr_pages_info ({fields}) VALUES ({q_marks})', list(info.values())) # print progress and prognosis if page_num % 250 == 0: page_time = (time.time() - start_time) / page_num togo_time = int((num_pages - page_num) * page_time) print( f'extracting pages info ' f'- togo: {num_pages - page_num} pages / ' f'{togo_time // 60}:{togo_time % 60:02} min') logger.info('extracting pages info finished')
Renew content of scr_pages_info table with extracted information.
Arguments:
ts: timestamp of the scrape [yymmdd-hhmm]
After deleting any previous content from the scr_pages_info table, this function fills the 'extracted' data fields of this table for each scraped page with the following content:
- title - content of <title> tag
- description - content of <meta name="description" content="..." />
- num_h1s - number of <h1> tags
- first_h1 - content of first <h1> tag
- language - content of <meta name="language" content="xx" />
- modified - content of <meta name="DCTERMS.modified" content="..." />
- pagetype - attribute value of <body data-pageType="...">
- classes - attribute value of <body class="...">
- ed_text - editorial text of the page
- aut_text - automated text of the page
The scr_pages_info table accommodates additional fields to contain
derived information for each page. This is further detailed in the
derive_pages_info function.
It will be logged when tags or attributes are missing or values are invalid.
Data note: Extracted information concerns data that is readily available within the page source. Storing this data in a separate table is strictly redundant, but serves faster access.
Implementation note: The fields of the scr_pages_info table are
defined by the constants EXTRACTED_FIELDS and DERIVED_FIELDS.
Although some groundwork is laid to adapt the number and content of these
fields, the actual code does not fully support such an adaptation.
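To illustrate the kind of extraction involved, a sketch for two of the fields (title and modified) from a single page source; the helper name is hypothetical:

    import datetime as dt

    from bs4 import BeautifulSoup

    def title_and_modified(doc: str):
        """Return (title, modified) from a page source; either can be None."""
        soup = BeautifulSoup(doc, features='lxml')
        title = soup.title.text if soup.title else None
        meta = soup.find('meta', attrs={'name': 'DCTERMS.modified'})
        modified = None
        if meta and meta.get('content'):
            try:
                modified = dt.date.fromisoformat(meta['content'])
            except ValueError:
                pass
        return title, modified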
View Source
def get_text(soup: BeautifulSoup) -> [str, str]: """ **Retrieve essential editorial and automated text content of a page.** Arguments: soup: bs4 representation of a page Returns: editorial text, automated text The editorial and automated text of the page content is returned together as a twofold tuple. Basically the relevant texts are retrieved from partial trees containing only tags with editorial or automated content respectively. Whitespace within these texts is normalised and coherent chunks are separated by newlines. """ result = [] for tree in split_tree(soup): if tree.html['tree'] == 'rest': continue flatten_tagbranch_to_navstring(tree.html) # Replace non-breaking spaces with normal ones txt = tree.text.replace(b'\xc2\xa0'.decode(), ' ') # Substitute one space for any cluster of whitespace chars (getting rid # of returns, newlines, tabs, spaces, etc.; this is html, you know!). txt = re.sub(r'\s+', ' ', txt) # Change #br# markers (introduced while flattening the branches) to # newlines, while reducing multiples separated by whitespace only. # The final strip() removes potential trailing newlines. txt = re.sub(r'\s*(#br#\s*)+\s*', r'\n', txt).strip() result.append(txt) tree.decompose() return result
Retrieve essential editorial and automated text content of a page.
Arguments:
soup: bs4 representation of a page
Returns:
editorial text, automated text
The editorial and automated text of the page content is returned together as a twofold tuple. Basically the relevant texts are retrieved from partial trees containing only tags with editorial or automated content respectively. Whitespace within these texts is normalised and coherent chunks are separated by newlines.
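The whitespace normalisation described above follows this pattern (the #br# markers are introduced while flattening the tag branches):

    import re

    def normalise_text(txt: str) -> str:
        """Collapse whitespace and turn #br# markers into newlines."""
        txt = txt.replace('\xa0', ' ')              # non-breaking spaces
        txt = re.sub(r'\s+', ' ', txt)              # collapse whitespace clusters
        return re.sub(r'\s*(#br#\s*)+\s*', r'\n', txt).strip()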
View Source
def derive_pages_info() -> None: """ **Add derived information for all pages to the *scr_pages_info* table.** Derived information as such is not available within a page, but calculated or interpreted from other information. To derive this information, the extracted information should already be available in the *scr_pages_info* table. This can be accomplished by using the `extract_pages_info` function. After deleting any previous content from he *derived* data fields of the *scr_pages_info* table, next new content is added for each of these fields: - *business* - 'belastingen', 'toeslagen' or 'douane', depending on the extracted *classes* field - *category* - 'dv', 'bib' or 'alg', depending on the extracted *pagetype* field, mapped on basis of the constants `DV_TYPES`, `BIB_TYPES` and `ALG_TYPES`; the *category* of the 'bld_wrapper' *pagetype* however is determined on the *category* of the majority of the pages that refer to the specific wrapper page It will be logged when info can not be derived due to inconsistent or unavailable information. """ # Clear the derived field contents set_cols = ', '.join([f'{f[0]} = NULL' for f in DERIVED_FIELDS]) mst_conn.execute(f'UPDATE scr_pages_info SET {set_cols}') # Prepare some queries bus_qry = 'UPDATE scr_pages_info SET business = ? WHERE page_id = ?' cat_qry = 'UPDATE scr_pages_info SET category = ? WHERE page_id = ?' for_qry = ''' SELECT page_id, pagetype, classes FROM scr_pages_info ORDER BY CASE pagetype WHEN 'bld-wrapper' THEN 2 ELSE 1 END ''' cat_groups_qry = ''' SELECT category FROM scr_ed_links LEFT JOIN scr_links AS l USING (link_nr) LEFT JOIN scr_pages_info USING (page_id) WHERE l.page_id = ? GROUP BY category ''' wrappers_without_cat = set() logger.info('deriving pages info started') # Cycle over all pages, with wrapper pages after all others, because # the wrapper category is determined by the categories of pages # linking to that wrapper page. for page_id, pagetype, classes in mst_conn.execute(for_qry): # Determine business if classes: if 'toeslagen' in classes or 'tsl' in classes: business = 'toeslagen' elif 'dou' in classes: business = 'douane' else: business = 'belastingen' else: business = None mst_conn.execute(bus_qry, [business, page_id]) # Determine category: dv, bib or alg if pagetype in DV_TYPES: category = 'dv' elif pagetype in BIB_TYPES: category = 'bib' elif pagetype in ALG_TYPES: category = 'alg' elif pagetype == 'bld-wrapper': # Group categories of all pages that link to this wrapper page categories = mst_conn.execute( cat_groups_qry, [page_id]).fetchall() if len(categories) == 1: # All pages linking to this wrapper have the same category category = categories[0][0] if not category: # Probably because other uncategorized wrappers are # linking to this one. Save its page_id for # post-processing. wrappers_without_cat.add(page_id) else: category = 'alg' else: category = None mst_conn.execute(cat_qry, [category, page_id]) # The main reason that wrappers do not get a category in the main loop, # is that pages that link to these wrappers are wrappers themselves that # had no category yet when asked for in the main loop. Repeating the loop # for the category of these pages will resolve most of them. To avoid # endless looping when wrappers without categories are linking to each # other, the number of cycles is maximized (crude but effective). 
max_cycles = len(wrappers_without_cat) * 3 cycle = 0 while wrappers_without_cat and cycle < max_cycles: cycle += 1 page_id = wrappers_without_cat.pop() # Algorithm for category resolution is the same as in the main loop categories = mst_conn.execute(cat_groups_qry, [page_id]).fetchall() if len(categories) == 1: category = categories[0][0] if not category: wrappers_without_cat.add(page_id) continue else: category = 'alg' mst_conn.execute(cat_qry, [category, page_id]) # Log the remaining cases and set their category to 'alg' to avoid NULLs # in the database. for page_id in wrappers_without_cat: path = mst_conn.execute(f'SELECT path FROM mst_paths ' f'WHERE page_id = {page_id}').fetchone()[0] logger.warning( f"unresolved category of wrapper set to 'alg': {path}") mst_conn.execute(cat_qry, ['alg', page_id]) logger.info('deriving pages info finished')
Add derived information for all pages to the scr_pages_info table.
Derived information as such is not available within a page,
but calculated or interpreted from other information. To derive this
information, the extracted information should already be available in the
scr_pages_info table. This can be accomplished by using the
extract_pages_info function.
After deleting any previous content from the derived data fields of the scr_pages_info table, the following new content is added for each of these fields:
- business - 'belastingen', 'toeslagen' or 'douane', depending on the extracted classes field
- category - 'dv', 'bib' or 'alg', depending on the extracted pagetype field, mapped on the basis of the constants DV_TYPES, BIB_TYPES and ALG_TYPES; the category of the 'bld_wrapper' pagetype however is determined by the category of the majority of the pages that refer to the specific wrapper page
It will be logged when info cannot be derived due to inconsistent or unavailable information.
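Leaving the 'bld-wrapper' special case aside, the pagetype-to-category mapping is a simple lookup in the three constant sets; a minimal sketch:

    from bd_www.scrape import ALG_TYPES, BIB_TYPES, DV_TYPES

    def pagetype_category(pagetype: str):
        """Map a pagetype to 'dv', 'bib' or 'alg'; None for unknown types."""
        if pagetype in DV_TYPES:
            return 'dv'
        if pagetype in BIB_TYPES:
            return 'bib'
        if pagetype in ALG_TYPES:
            return 'alg'
        return None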
View Source
def status_figures() -> None: """ **Renew typical scrape status figures of the *scr_status_figures* table.** After deleting any previous content from the *scr_status_figures* table, next number of pages, redirs, url-aliases or ed-links will be added to this table: - pages: all pages - pages_lang_`<language>`: per *language* - pages_buss_`<business>`: per *business* - pages_cat_`<category>`: per *category* - pages_type_`<pagetype>`: per *pagetype* - pages_h1_multi: with more than one h1-tag - pages_h1_multi_`<pagetype>`: with more than one h1-tag per *pagetype* - pages_h1_no: without h1-tag - pages_title_no: without or with empty title-tag - pages_title_dupl: with non-unique title-tag - pages_descr_no: without or with empty description meta-tag - pages_descr_long: with description meta-tag longer than 160 characters - redirs: total number of all redirects - redirs_`<type>`: number of redirects per *redir_type* - redirs_`<type>`_slash: redirects per *redir_type* with only differing a slash - redirs_wcm-url: wcm url's that get redirected - url-aliases: url's that alias an wcm url - url-aliases_`<num>`x: url's with `<num>` aliases - ed-links_`<int|ext>`: internal|external editorial links - ed-links_`<int|ext>`_uniq: unique internal|external editorial links - ed-links_`<int|ext>`_avg: internal|external editorial links per page - ed-links_int_redir: redirected internal editorial links - ed-links_int_non-wcm: internal editorial links to non-wcm url's - ed-links_int_orphan-pages: number of unlinked (i.e. orphan) pages """ mst_conn.execute('DELETE FROM scr_status_figures') ins_qry = 'INSERT INTO scr_status_figures (name, value) VALUES (?, ?)' # Total pages qry = ''' SELECT count(*) FROM scr_pages_info ''' num_pages = mst_conn.execute(qry).fetchone()[0] mst_conn.execute(ins_qry, ['pages', num_pages]) # Pages per language qry = ''' SELECT language, count(*) FROM scr_pages_info GROUP BY language ORDER BY language DESC ''' for language, count in mst_conn.execute(qry).fetchall(): mst_conn.execute(ins_qry, [f'pages_lang_{language}', count]) # Pages per business qry = ''' SELECT business, count(*) FROM scr_pages_info GROUP BY business ORDER BY business ''' for business, count in mst_conn.execute(qry).fetchall(): mst_conn.execute(ins_qry, [f'pages_buss_{business}', count]) # Pages per category qry = ''' SELECT category, count(*) FROM scr_pages_info GROUP BY category ORDER BY business DESC ''' for category, count in mst_conn.execute(qry).fetchall(): mst_conn.execute(ins_qry, [f'pages_cat_{category}', count]) # Pages per type qry = ''' SELECT pagetype, count(*) FROM scr_pages_info GROUP BY pagetype ORDER BY category DESC, count(*) ASC ''' for pagetype, count in mst_conn.execute(qry).fetchall(): mst_conn.execute(ins_qry, [f'pages_type_{pagetype}', count]) # Pages with more than one h1's qry = ''' SELECT count(*) FROM scr_pages_info WHERE num_h1s > 1 ''' mst_conn.execute(ins_qry, ['pages_h1_multi', mst_conn.execute(qry).fetchone()[0]]) # Pages per type with more than one h1's qry = ''' SELECT pagetype, count(*) FROM scr_pages_info WHERE num_h1s > 1 GROUP BY pagetype ''' for pagetype, count in mst_conn.execute(qry).fetchall(): mst_conn.execute(ins_qry, [f'pages_h1_multi_{pagetype}', count]) # Pages with no h1 qry = ''' SELECT count(*) FROM scr_pages_info WHERE num_h1s = 0 ''' mst_conn.execute(ins_qry, ['pages_h1_no', mst_conn.execute(qry).fetchone()[0]]) # Pages without title qry = ''' SELECT count(*) FROM scr_pages_info WHERE title = '' or title is NULL ''' mst_conn.execute(ins_qry, ['pages_title_no', 
mst_conn.execute(qry).fetchone()[0]]) # Pages with non unique title qry = ''' WITH title_freq AS ( SELECT count(*) as c FROM scr_pages_info GROUP BY title ) SELECT ifnull(sum(c), 0) FROM title_freq WHERE c > 1 ''' mst_conn.execute(ins_qry, ['pages_title_dupl', mst_conn.execute(qry).fetchone()[0]]) # Pages without description qry = ''' SELECT count(*) FROM scr_pages_info WHERE description = '' OR description IS NULL ''' mst_conn.execute(ins_qry, ['pages_descr_no', mst_conn.execute(qry).fetchone()[0]]) # Pages with description longer than 160 characters qry = ''' SELECT count(*) FROM scr_pages_info WHERE length(description) > 160 ''' mst_conn.execute(ins_qry, ['pages_descr_long', mst_conn.execute(qry).fetchone()[0]]) # Total redirects (an alias is no redirect) qry = ''' SELECT count(*) FROM scr_redirs WHERE redir_type != 'alias' ''' mst_conn.execute(ins_qry, ['redirs', mst_conn.execute(qry).fetchone()[0]]) # Redirects per type qry = ''' SELECT redir_type, count(*) FROM scr_redirs WHERE redir_type != 'alias' GROUP BY redir_type ''' for redir_type, count in mst_conn.execute(qry).fetchall(): mst_conn.execute(ins_qry, [f'redirs_{redir_type}', count]) # Redirects per type that only add or loose the last slash qry = ''' SELECT redir_type, count(*) FROM scr_redirs WHERE req_url || '/' = redir_url OR req_url = redir_url || '/' GROUP BY redir_type ''' for redir_type, count in mst_conn.execute(qry).fetchall(): mst_conn.execute(ins_qry, [f'redirs_{redir_type}_slash', count]) # Scrape wcm url's that are redirected qry = f''' SELECT count(*) FROM scr_redirs WHERE req_id IN scr_page_ids ''' mst_conn.execute(ins_qry, ['redirs_wcm-url', mst_conn.execute(qry).fetchone()[0]]) # Total aliases qry = ''' SELECT count(*) FROM scr_redirs WHERE redir_type = 'alias' ''' mst_conn.execute(ins_qry, ['url-aliases', mst_conn.execute(qry).fetchone()[0]]) # Frequency of aliases qry = f''' WITH redirs_full_urls AS ( SELECT ifnull(req_url, '{ROOT_URL}' || p1.path) AS req_url, ifnull(redir_url, '{ROOT_URL}' || p2.path) AS redir_url, redir_type FROM scr_redirs LEFT JOIN mst_paths AS p1 ON req_id = p1.page_id LEFT JOIN mst_paths AS p2 ON redir_id = p2.page_id ), alias_freq AS ( SELECT redir_url, count(*) AS aliases_per_url FROM redirs_full_urls WHERE redir_type = 'alias' GROUP BY redir_url ) SELECT aliases_per_url, count(*) FROM alias_freq GROUP BY aliases_per_url ''' for alias_per_url, count in mst_conn.execute(qry).fetchall(): mst_conn.execute(ins_qry, [f'url-aliases_{alias_per_url}x', count]) # Total, unique and average number of internal links qry = ''' SELECT count(*) FROM scr_ed_links LEFT JOIN scr_links USING (link_nr) WHERE link_type = 'int' OR link_type = 'ext-to-int' ''' ed_links_int = mst_conn.execute(qry).fetchone()[0] qry = ''' SELECT count(*) FROM scr_links WHERE link_type = 'int' OR link_type = 'ext-to-int' ''' ed_links_int_uniq = mst_conn.execute(qry).fetchone()[0] mst_conn.execute(ins_qry, ['ed-links_int', ed_links_int]) mst_conn.execute(ins_qry, ['ed-links_int_uniq', ed_links_int_uniq]) mst_conn.execute(ins_qry, ['ed-links_int_avg', round(ed_links_int / num_pages, 3)]) # Number of internal editorial links that are redirected. This count # includes wcm url's that are no longer alive but get redirected instead. 
qry = ''' SELECT count(*) FROM scr_ed_links LEFT JOIN scr_links AS l USING (link_nr) LEFT JOIN scr_redirs ON l.url IS req_url AND l.page_id IS req_id WHERE (redir_id NOTNULL OR redir_url NOTNULL) AND redir_type != 'alias' ''' links_redirected = mst_conn.execute(qry).fetchone()[0] mst_conn.execute(ins_qry, ['ed-links_int_redir', links_redirected]) # Number of internal editorial links that do not refer to a wcm url qry = ''' SELECT count(*) FROM scr_ed_links LEFT JOIN scr_links AS l USING (link_nr) LEFT JOIN scr_redirs ON l.url = req_url WHERE redir_type = 'alias' ''' links_non_wcm = mst_conn.execute(qry).fetchone()[0] mst_conn.execute(ins_qry, ['ed-links_int_non-wcm', links_non_wcm]) # Number of orphan pages qry = 'SELECT count(*) FROM scr_unlinked WHERE unl_type = "orphan"' num_orphans = mst_conn.execute(qry).fetchone()[0] mst_conn.execute(ins_qry, ['ed-links_int_orphan-pages', num_orphans]) # Total, unique and average number of external links qry = ''' SELECT count(*) FROM scr_ed_links LEFT JOIN scr_links USING (link_nr) WHERE link_type = 'ext' OR link_type = 'ext-to-ext' ''' ed_links_ext = mst_conn.execute(qry).fetchone()[0] qry = ''' SELECT count(*) FROM scr_links WHERE link_type = 'ext' OR link_type = 'ext-to-ext' ''' ed_links_ext_uniq = mst_conn.execute(qry).fetchone()[0] mst_conn.execute(ins_qry, ['ed-links_ext', ed_links_ext]) mst_conn.execute(ins_qry, ['ed-links_ext_uniq', ed_links_ext_uniq]) mst_conn.execute(ins_qry, ['ed-links_ext_avg', round(ed_links_ext / num_pages, 3)]) logger.info(f'status figures saved to scr_status_figures table')
Renew the typical scrape status figures in the scr_status_figures table.
After deleting any previous content of the scr_status_figures table, the following counts of pages, redirs, url-aliases and ed-links are added to this table (a minimal sketch follows the list):
- pages: all pages
- pages_lang_<language>: pages per language
- pages_buss_<business>: pages per business
- pages_cat_<category>: pages per category
- pages_type_<pagetype>: pages per pagetype
- pages_h1_multi: pages with more than one h1-tag
- pages_h1_multi_<pagetype>: pages with more than one h1-tag, per pagetype
- pages_h1_no: pages without an h1-tag
- pages_title_no: pages without or with an empty title-tag
- pages_title_dupl: pages with a non-unique title-tag
- pages_descr_no: pages without or with an empty description meta-tag
- pages_descr_long: pages with a description meta-tag longer than 160 characters
- redirs: total number of redirects
- redirs_<type>: number of redirects per redir_type
- redirs_<type>_slash: redirects per redir_type that differ only by a trailing slash
- redirs_wcm-url: wcm url's that get redirected
- url-aliases: url's that alias a wcm url
- url-aliases_<num>x: url's with <num> aliases
- ed-links_<int|ext>: internal|external editorial links
- ed-links_<int|ext>_uniq: unique internal|external editorial links
- ed-links_<int|ext>_avg: average number of internal|external editorial links per page
- ed-links_int_redir: redirected internal editorial links
- ed-links_int_non-wcm: internal editorial links to non-wcm url's
- ed-links_int_orphan-pages: number of unlinked (i.e. orphan) pages
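As an illustration, the sketch below recomputes only the first two figures of this list with plain sqlite3, assuming a scrapes database that contains the scr_pages_info and scr_status_figures tables; the database filename is a hypothetical placeholder, not the module's actual configuration.
import sqlite3

# Minimal sketch (not the module's own code): recompute two status figures.
con = sqlite3.connect('scrape_master.db')   # hypothetical database path

con.execute('DELETE FROM scr_status_figures')
ins_qry = 'INSERT INTO scr_status_figures (name, value) VALUES (?, ?)'

# pages: total number of pages
num_pages = con.execute('SELECT count(*) FROM scr_pages_info').fetchone()[0]
con.execute(ins_qry, ('pages', num_pages))

# pages_lang_<language>: number of pages per language
qry = 'SELECT language, count(*) FROM scr_pages_info GROUP BY language'
for language, count in con.execute(qry):
    con.execute(ins_qry, (f'pages_lang_{language}', count))

con.commit()
con.close()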
View Source
def add_scraped_data(ts: str) -> bool: """ **Add a fresh scrape to the scrapes history pages.** Arguments: ts: timestamp of the scrape [yymmdd-hhmm] Returns: success of the operation While scraping all pages using the `scrape_page` function, new pages that had no earlier existence have been registered in the *mst_paths* table. Apart from the scrape results that are stored as files, the essential remaining results of the scrape reside in the next tables of the scrapes database: - *scr_redirs* - *scr_links* - *scr_ed_links* - *scr_unlinked* Before adding the scraped results to the history tables, the data in these tables has to be normalised and completed using next functions in succession: - `post_process_scrape` - to normalise the *scr_redirs*, *scr_links* and *scr_unlinked* tables and qualify unlinked pages - `extract_pages_info` - to extract information from the page sources to the *scr_pages_info* table - `derive_pages_info` - to complete the content of the *scr_pages_info* table - `status_figures` - to refresh the *scr_status_figures* table After using this function all data of the scrape (as stored in the scrape tables) has been merged into the various history tables and new links have been added to the *mst_links* table. The scrape tables remain unaltered afterwards, but can be safely deleted/overwritten. This function can only be used for a scrape that has a later timestamp than any other scrape in history. Otherwise it will log an error and return unsuccessful. """ # Check if this scrape is later then the youngest in the history tables. qry = ''' SELECT name FROM sqlite_schema WHERE type = 'table' AND name LIKE 'his_%' ''' last_ok = True for table in [r[0] for r in mst_conn.execute(qry)]: max_ts = mst_conn.execute( f'SELECT max(timestamp) FROM {table}').fetchone()[0] if max_ts and max_ts > ts: last_ok = False logger.error( f'scrape {ts} predates latest history in the {table} table') if not last_ok: logger.error(f'scrape {ts} not added to history') return False logger.info('started adding scrape results to master and history tables') # Add scrape redirs results to the his_redirs table cte = ''' WITH last_redirs AS ( SELECT DISTINCT req_id, req_url, last_value(redir_id) OVER win AS redir_id, last_value(redir_url) OVER win AS redir_url, last_value(redir_type) OVER win AS redir_type FROM his_redirs WINDOW win AS ( PARTITION BY req_id, req_url ORDER BY timestamp RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) ) ''' # Save redirs that are new or have been changed mst_conn.execute(cte + f''' INSERT INTO his_redirs (timestamp, req_id, req_url, redir_id, redir_url, redir_type) SELECT '{ts}', s.* FROM scr_redirs AS s LEFT JOIN last_redirs AS m ON s.req_id IS m.req_id AND s.req_url IS m.req_url WHERE s.redir_id IS NOT m.redir_id OR s.redir_url IS NOT m.redir_url OR s.redir_type IS NOT m.redir_type ''') # Save redirs that disappeared with redir_type 'absent' mst_conn.execute(cte + f''' INSERT INTO his_redirs (timestamp, req_id, req_url, redir_type) SELECT '{ts}', m.req_id, m.req_url, 'absent' FROM last_redirs AS m LEFT JOIN scr_redirs AS s ON m.req_id IS s.req_id AND m.req_url IS s.req_url WHERE m.redir_type != 'absent' AND s.req_id ISNULL AND s.req_url ISNULL ''') logger.info('his_redirs table updated') # Add new links to the mst_links table... 
mst_conn.executescript(''' INSERT INTO mst_links (url, anchor, page_id) SELECT s.url, s.anchor, s.page_id FROM scr_links AS s LEFT JOIN mst_links AS m ON s.url IS m.url AND s.anchor IS m.anchor AND s.page_id IS m.page_id WHERE m.link_id ISNULL ''') # ... and save the link_id to the scr_links table for later use. mst_conn.execute(''' UPDATE scr_links AS s SET link_id = ( SELECT link_id FROM mst_links AS m WHERE m.url IS s.url AND m.page_id IS s.page_id AND m.anchor IS s.anchor ) ''') logger.info('mst_links table updated') # Add status for new links or links of which the status changed mst_conn.execute(f''' WITH last_links_status AS ( SELECT DISTINCT link_id, last_value(link_type) OVER win AS link_type, last_value(status) OVER win AS status FROM his_links_status WINDOW win AS ( PARTITION BY link_id ORDER BY timestamp RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) ) INSERT INTO his_links_status (timestamp, link_id, link_type, status) SELECT '{ts}', link_id, s.link_type, s.status FROM scr_links AS s LEFT JOIN last_links_status AS l USING (link_id) WHERE (l.link_id ISNULL OR s.link_type IS NOT l.link_type OR s.status IS NOT l.status) AND NOT (l.status NOTNULL AND l.status LIKE 'error%' AND s.status NOTNULL AND s.status LIKE 'error%') ''') logger.info('his_links_status table updated') # Add records for new editorial links and links that have a different # number of appearances in a page. cte = ''' WITH scr_ed_links_normalised AS ( SELECT e.page_id, text, link_id, count(*) AS present FROM scr_ed_links AS e LEFT JOIN scr_links USING (link_nr) GROUP BY e.page_id, text, link_id ), last_ed_links AS ( SELECT DISTINCT page_id, text, link_id, last_value(present) OVER win AS present FROM his_ed_links WINDOW win AS ( PARTITION BY page_id, text, link_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) ) ''' mst_conn.execute(cte + f''' INSERT INTO his_ed_links (timestamp, page_id, text, link_id, present) SELECT '{ts}', page_id, text, link_id, s.present FROM scr_ed_links_normalised AS s LEFT JOIN last_ed_links AS l USING (page_id, text, link_id) WHERE l.present ISNULL OR s.present != l.present ''') # Add records for editorial links that disappeared mst_conn.execute(cte + f''' INSERT INTO his_ed_links (timestamp, page_id, text, link_id, present) SELECT '{ts}', page_id, text, link_id, 0 FROM last_ed_links AS l LEFT JOIN scr_ed_links_normalised AS s USING (page_id, text, link_id) WHERE l.present > 0 AND s.present ISNULL ''') logger.info('his_ed_links table updated') # Update his_unlinked table with the scrape results cte = f''' WITH last_unlinked AS ( SELECT DISTINCT page_id, url, last_value(unl_type) OVER win AS unl_type FROM his_unlinked WINDOW win AS ( PARTITION BY page_id, url ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) ) ''' # Save unlinked pages that are new or whose type changed mst_conn.execute(cte + f''' INSERT INTO his_unlinked (timestamp, page_id, url, unl_type) SELECT '{ts}', s.page_id, s.url, s.unl_type FROM scr_unlinked AS s LEFT JOIN last_unlinked AS l ON s.page_id IS l.page_id AND s.url IS l.url WHERE s.unl_type IS NOT l.unl_type -- covers new and changed ''') # Mark unlinked pages that disappeared with unl_type 'absent' mst_conn.execute(cte + f''' INSERT INTO his_unlinked (timestamp, page_id, url, unl_type) SELECT '{ts}', l.page_id, l.url, 'absent' FROM last_unlinked AS l LEFT JOIN scr_unlinked AS s ON l.page_id IS s.page_id AND l.url IS s.url WHERE l.unl_type != 'absent' AND s.unl_type ISNULL ''') 
logger.info('his_unlinked table updated') # Update his_pages_life table with the scrape results cte = ''' WITH last_pages_life AS ( SELECT DISTINCT page_id, last_value(alive) OVER ( PARTITION BY page_id ORDER BY timestamp RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) AS alive FROM his_pages_life ) ''' # Mark pages alive that are new or became alive again mst_conn.execute(cte + f''' INSERT INTO his_pages_life (timestamp, page_id, alive) SELECT '{ts}', page_id, TRUE FROM scr_page_ids LEFT JOIN last_pages_life AS l USING (page_id) WHERE l.page_id ISNULL OR NOT alive ''') # Register pages that are no longer alive mst_conn.execute(cte + f''' INSERT INTO his_pages_life (timestamp, page_id, alive) SELECT '{ts}', page_id, FALSE FROM last_pages_life LEFT JOIN scr_page_ids AS s USING (page_id) WHERE alive AND s.page_id ISNULL ''') logger.info('his_pages_life table updated') # Update his_short_urls table with the scrape results cte = f''' WITH last_short_urls AS ( SELECT DISTINCT url, last_value(active) OVER ( PARTITION BY url ORDER BY timestamp RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) AS active FROM his_short_urls ), scr_short_urls AS ( SELECT req_url AS url FROM scr_redirs WHERE req_url NOT LIKE '%{ROOT_URL.partition('.')[2]}%' ) ''' # Mark short-url's active that are new or became active again mst_conn.execute(cte + f''' INSERT INTO his_short_urls (timestamp, url, active) SELECT '{ts}', url, TRUE FROM scr_short_urls LEFT JOIN last_short_urls USING (url) WHERE active ISNULL OR NOT active ''') num_active = mst_conn.execute(f''' SELECT count(*) FROM his_short_urls WHERE timestamp = '{ts}' AND active ''').fetchone()[0] # Register short-url's that are no longer active mst_conn.execute(cte + f''' INSERT INTO his_short_urls (timestamp, url, active) SELECT '{ts}', url, FALSE FROM last_short_urls LEFT JOIN scr_short_urls AS s USING (url) WHERE active AND s.url ISNULL ''') num_inactive = mst_conn.execute(f''' SELECT count(*) FROM his_short_urls WHERE timestamp = '{ts}' AND NOT active ''').fetchone()[0] logger.info(f'his_short_urls table updated: ' f'registered {num_active} active and {num_inactive} inactive') # Update his_pages_info table for pages with changed aspects sql = 'SELECT name FROM pragma_table_info("his_pages_info")' page_aspects = [row[0] for row in mst_conn.execute(sql).fetchall() if row[0] != 'timestamp'] fields = ', '.join(page_aspects) mst_conn.execute(f''' WITH last_pages_info_timestamps AS ( SELECT DISTINCT page_id, last_value(timestamp) OVER ( PARTITION BY page_id ORDER BY timestamp RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) AS timestamp FROM his_pages_info ), last_pages_info AS ( SELECT {fields} FROM last_pages_info_timestamps LEFT JOIN his_pages_info USING (timestamp, page_id) ), changed_pages_info AS ( SELECT * FROM scr_pages_info EXCEPT SELECT * FROM last_pages_info ) INSERT INTO his_pages_info SELECT '{ts}', * FROM changed_pages_info ''') logger.info('his_pages_info table updated') # Add scrape status figures to the his_status_figures table mst_conn.execute(f''' INSERT INTO his_status_figures (timestamp, name, value) SELECT '{ts}', name, value FROM scr_status_figures ''') logger.info('his_status_figures table updated with scrape results') logger.info('concluded adding scrape results to master and history tables') return True
Add a fresh scrape to the scrapes history tables.
Arguments:
ts: timestamp of the scrape [yymmdd-hhmm]
Returns:
success of the operation
While scraping all pages using the scrape_page function, new pages that had no earlier existence have been registered in the mst_paths table. Apart from the scrape results that are stored as files, the essential remaining results of the scrape reside in the following tables of the scrapes database:
- scr_redirs
- scr_links
- scr_ed_links
- scr_unlinked
Before adding the scrape results to the history tables, the data in these tables has to be normalised and completed using the following functions in succession (see the sketch after this list):
- post_process_scrape - to normalise the scr_redirs, scr_links and scr_unlinked tables and qualify unlinked pages
- extract_pages_info - to extract information from the page sources into the scr_pages_info table
- derive_pages_info - to complete the content of the scr_pages_info table
- status_figures - to refresh the scr_status_figures table
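A minimal sketch of this sequence for a freshly scraped timestamp, assuming these functions can be imported from this bd_www.scrape module; the timestamp value is a hypothetical example.
from bd_www.scrape import (post_process_scrape, extract_pages_info,
                           derive_pages_info, status_figures,
                           add_scraped_data)

ts = '240101-0300'          # hypothetical scrape timestamp [yymmdd-hhmm]

post_process_scrape()       # normalise scr_redirs, scr_links and scr_unlinked
extract_pages_info(ts)      # fill scr_pages_info from the page sources
derive_pages_info()         # complete scr_pages_info with derived fields
status_figures()            # refresh the scr_status_figures table
if not add_scraped_data(ts):
    print(f'scrape {ts} was not added to the history tables')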
After using this function, all data of the scrape (as stored in the scrape tables) has been merged into the various history tables and new links have been added to the mst_links table. The scrape tables remain unaltered afterwards, but can safely be deleted or overwritten.
This function can only be used for a scrape that has a later timestamp than any other scrape in history. Otherwise it will log an error and return False.
View Source
def check_storage_integrity(check_page_ids: bool = False) -> None: """ **Check integrity of the scrapes storage.** Arguments: check_page_ids: check the page_id/path relation Next checks will be executed: - available zip files with page sources against valid scrapes - validity of timestamps of site reports - validity of timestamps of 'sitemap.xml' files - validity of timestamps of 'robots.txt' files - page_id/path relation of all zip sources (depending on actual value of `check_page_ids` parameter; this will take quite some time) All detected exceptions will be logged; no data will be altered. """ # Setup logging global logger logger = logging.getLogger('check') logger.setLevel(logging.INFO) fh = logging.FileHandler(str(mst_dir / 'check.log')) fh.setLevel(logging.INFO) ch = logging.StreamHandler() ch.setLevel(logging.INFO) formatter = logging.Formatter( fmt='[%(asctime)s] %(levelname)-8s - %(message)s', datefmt='%Y-%m-%d %H:%M:%S') fh.setFormatter(formatter) ch.setFormatter(formatter) logger.addHandler(fh) logger.addHandler(ch) # Check page sources against the valid scrapes valid_tss = set(valid_scrapes()) src_dir = mst_dir / scrape_conf.src_dir_name src_tss = set() for file in src_dir.iterdir(): if file.suffix == '.zip': ts = file.name[:11] src_tss.add(ts) if src_tss != valid_tss: if scrs_no_zip := valid_tss - src_tss: logger.warning( f'{len(scrs_no_zip)} scrapes without source zips: ' f'{", ".join(scrs_no_zip)}') if zips_no_scr := src_tss - valid_tss: logger.warning( f'{len(zips_no_scr)} source zips without scrape: ' f'{", ".join(zips_no_scr)}') logger.info('checked page source zip files') # Check site-reports against valid scrapes for report in report_conf.reports: report_conf.spec_report(report) rep_dir = mst_dir / report_conf.report_dir_name rep_tss = set() for item in rep_dir.iterdir(): if item.is_dir(): for file in item.iterdir(): ts = file.name[:11] if ts not in valid_tss: logger.warning(f'{report} report with invalid ' f'timestamp: {file.name}') else: rep_tss.add(ts) if scrs_no_rep := valid_tss - rep_tss: logger.warning( f'scrapes without {report} reports: {", ".join(scrs_no_rep)}') logger.info('checked site reports') # Check sitemaps against valid scrapes map_dir = mst_dir / scrape_conf.sitemap_dir_name for file in map_dir.iterdir(): ts = file.name[:11] if ts not in valid_tss: logger.warning(f'sitemap with invalid timestamp: {file.name}') logger.info('checked sitemaps files ') # Check robots files against valid scrapes robot_dir = mst_dir / scrape_conf.robots_dir_name for file in robot_dir.iterdir(): ts = file.name[:11] if ts not in valid_tss: logger.warning(f'robots file with invalid timestamp: {file.name}') logger.info('checked robots files') # Check page_id/path relations of all source zip files if check_page_ids: logger.info('started checking page_id/path relations of all scrapes; ' 'this will take some time') for ts in sorted(src_tss): src_zip = PageSourceZip(ts) for page_id, doc in src_zip.iter_pages(): soup = BeautifulSoup(doc, features='lxml') mst_path = mst_conn.execute( 'SELECT path FROM mst_paths WHERE page_id = ?', [page_id]).fetchone()[0] meta_tag = soup.find( 'meta', attrs={'name': 'DCTERMS.identifier'}) if meta_tag: full_src_path = meta_tag['content'] if full_src_path.startswith('/wps/wcm/connect'): src_path = full_src_path[16:] if src_path != mst_path: logger.critical( f'page {page_id} is registered as {mst_path}, ' f'but in source zip {src_zip.path.name} its ' f'path is {src_path}') else: # No consistency check possible logger.warning(f'strange DCTERMS.identifier 
in ' f'scrape {ts} / page {page_id}') else: # No consistency check possible logger.debug( f'page without wcm url: scrape {ts} / page {page_id}') logger.info(f'page_id/path relation checked of scrape {ts}') return
Check integrity of the scrapes storage.
Arguments:
check_page_ids: check the page_id/path relation
The following checks will be executed (a sketch of the first check follows this description):
- available zip files with page sources against valid scrapes
- validity of timestamps of site reports
- validity of timestamps of 'sitemap.xml' files
- validity of timestamps of 'robots.txt' files
- page_id/path relation of all zip sources (depending on the actual value of the check_page_ids parameter; this will take quite some time)
All detected exceptions will be logged; no data will be altered.
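The sketch below illustrates the first check by comparing the timestamps of the page source zip files against the valid scrapes. The directory path is a hypothetical placeholder; valid_scrapes is the function documented further below.
from pathlib import Path
from bd_www.scrape import valid_scrapes

src_dir = Path('/data/scrape_master/page_sources')   # hypothetical location
valid_tss = set(valid_scrapes())
src_tss = {f.name[:11] for f in src_dir.iterdir() if f.suffix == '.zip'}

for ts in sorted(valid_tss - src_tss):
    print(f'scrape {ts} has no source zip')
for ts in sorted(src_tss - valid_tss):
    print(f'source zip with timestamp {ts} has no valid scrape')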
View Source
def valid_scrapes( from_ts: str = '000101-0000', data_dir: Path = mst_dir) -> list[str]: """ **Return a sorted list of valid scrapes.** Arguments: from_ts: skip scrapes with an earlier timestamp data_dir: master directory to contain the (sub)directory with scrape logs Returns: sorted list of timestamps of valid scrapes Scrapes are assumed to be valid if the text 'scrape finished' appears in the content of the related scrape log. When this text appears in the scrape log the essential scrape was successful, even when further processing of the scraped data is not executed for some reason. Presence of log files with duplicate timestamps as well as log files that do not contain the discriminating text wil be logged. """ log_dir = data_dir / scrape_conf.log_dir_name valid_tss = set() for file in log_dir.iterdir(): if 'README' in file.name: continue ts = file.name[:11] if ts < from_ts: continue valid = False with open(file) as scr_log: for line in scr_log: if 'scrape finished' in line: valid = True break if valid: if ts in valid_tss: logger.warning(f'duplicate log for scrape {ts}') else: valid_tss.add(ts) else: logger.warning( f'scrape {ts} seems to be invalid c.q. unfinished') return sorted(valid_tss)
Return a sorted list of valid scrapes.
Arguments:
from_ts: skip scrapes with an earlier timestamp
data_dir: master directory containing the (sub)directory with the scrape logs
Returns:
sorted list of timestamps of valid scrapes
A scrape is considered valid if the text 'scrape finished' appears in the content of the related scrape log (see the sketch below). When this text is present, the essential scrape was successful, even if further processing of the scraped data was not executed for some reason.
Log files with duplicate timestamps, as well as log files that do not contain the discriminating text, will be logged.
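The sketch below illustrates this validity test on a directory with scrape logs; the directory path and the *.log name pattern are hypothetical assumptions, not the module's actual configuration.
from pathlib import Path

log_dir = Path('/data/scrape_master/scrape_logs')   # hypothetical location
valid_tss = []
for log_file in sorted(log_dir.glob('*.log')):      # assumed name pattern
    ts = log_file.name[:11]                         # timestamp [yymmdd-hhmm]
    if 'scrape finished' in log_file.read_text():
        valid_tss.append(ts)
    else:
        print(f'scrape {ts} seems to be invalid or unfinished')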
View Source
def reprocess_scrapes( from_ts: str = '000101-0000', reprocess_links: bool = False, copy_db: bool = True) -> None: """ **Reprocess scraped data.** Arguments: from_ts: first timestamp of data to be reprocessed reprocess_links: reprocess content of his_ed_links table copy_db: work on a copy of the scrapes database ***Note:*** *The structure of the scrapes storage is detailed in the `create_scrapes_storage` function.* This function will reprocess the data in the history tables of all scrapes or starting from the scrape with timestamp `from_ts`. The reprocessing addresses all data that can be reproduced from the data that was gathered during the actual scrape. This 'basic' scrape data, which can not be reproduced itself, consists of: - page sources - available in source zip files - redirections - stored in the his_redirs table - unlinked pages - stored in the his_unlinked table (this information can be reproduced from link analysis of the the page sources, but since only editorial links are registered, it is treated as basic data) - status of links - stored in the his_links_status table Using this function assumes that the content of the master tables is complete and correct. In case there is some doubt if this assumption is correct, some consistency checks can be executed on these tables by using the `check_storage_integrity` function. While reprocessing, next tables are rebuilt with reprocessed data: - *his_ed_links* - using page sources (if `reprocess_links` is `True`) - *his_pages_info* - using page sources - *his_pages_life* - using page sources - *his_short_urls* - using his_redirs - *his_status_figures* - using his_pages_info Although the other history tables are rebuilt as well, it will be with the same (basic) data and as such produce tables with the same content. Since reprocessing can take quite some time, the reprocessing will by default be executed on a copy of the scrapes database in order to avoid a clash with a next scrape being added. Any scrape that is added after the execution of this function started, will be reprocessed as well. After reprocessing finishes, all original history tables will be saved with their original names prepended with 'svd_' and can be deleted from the database when the reprocessing was verified and successful. """ global logger logger = logging.getLogger('reprocess') logger.setLevel(logging.INFO) ch = logging.StreamHandler() ch.setLevel(logging.INFO) formatter = logging.Formatter( fmt='[%(asctime)s] %(levelname)-8s - %(message)s', datefmt='%Y-%m-%d %H:%M:%S') ch.setFormatter(formatter) logger.addHandler(ch) if copy_db: scrs_db = mst_dir / scrape_conf.scrs_db_name v = 0 while True: db_name = scrs_db.stem + '_repr' + (f'_{v}' if v else '') repr_db = scrs_db.with_stem(db_name) if repr_db.exists(): v += 1 continue else: break shutil.copy(scrs_db, repr_db) mst_conn.switch_to(repr_db) # Save copies of the history tables remove data that will be renewed qry = ''' SELECT name FROM sqlite_schema WHERE type = 'table' AND name LIKE 'his_%' ''' for name in [r[0] for r in mst_conn.execute(qry)]: sql = mst_conn.execute( f'SELECT sql FROM sqlite_schema WHERE name = ?', [name]).fetchone()[0] svd_name = 'svd_' + name mst_conn.execute(sql.replace(name, svd_name)) mst_conn.execute(f'INSERT INTO {svd_name} SELECT * FROM {name}') mst_conn.execute(f'DELETE FROM {name} WHERE timestamp >= "{from_ts}"') mst_conn.execute('VACUUM') # Cycle over all scrapes that need to be reprocessed. 
Then per scrape: # - reinstate all scrape tables as if that scrape just finished # - process these tables the same way as if the scrape was new scrapes = valid_scrapes(from_ts) while scrapes: ts = scrapes.pop(0) if not scrapes: # Add scrapes that were added in the meantime. scrapes = valid_scrapes(ts)[1:] logger.info(f'started reprocessing scrape {ts}') empty_scr_tables() src_zip = PageSourceZip(ts) # Save all page_id's to the scr_page_ids table mst_conn.execute('INSERT INTO scr_page_ids VALUES ' + ','.join(f'({pid})' for pid in src_zip.page_ids())) # Since redirections are determined at the moment of scraping, # the scr_redirs table can only be reinstated from the svd_his_redirs # table. mst_conn.execute(f''' WITH last_redirs_during_scrape AS ( SELECT DISTINCT req_id, req_url, last_value(redir_id) OVER win AS redir_id, last_value(redir_url) OVER win AS redir_url, last_value(redir_type) OVER win AS redir_type FROM svd_his_redirs WHERE timestamp <= '{ts}' WINDOW win AS ( PARTITION BY req_id, req_url ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) ) INSERT INTO scr_redirs SELECT * FROM last_redirs_during_scrape WHERE redir_type != 'absent' ''') # Re-analysing unlinked pages can only be done when all links are # registered. Since only editorial links are registered, # the scr_unliked will be reinstated from the svd_his_unlinked table. mst_conn.execute(f''' WITH last_unlinked_during_scrape AS ( SELECT DISTINCT page_id, url, last_value(unl_type) OVER win AS unl_type FROM svd_his_unlinked WHERE timestamp <= '{ts}' WINDOW win AS ( PARTITION BY page_id, url ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) ) INSERT INTO scr_unlinked SELECT * FROM last_unlinked_during_scrape WHERE unl_type != 'absent' ''') # Handle scr_links and scr_ed_links tables if reprocess_links: # Cycle over all pages for page_id, doc in src_zip.iter_pages(): soup = BeautifulSoup(doc, features='lxml') mst_path = mst_conn.execute( 'SELECT path FROM mst_paths WHERE page_id = ?', [page_id]).fetchone()[0] mst_url = ROOT_URL + mst_path # Re-process all editorial links ed_tree = split_tree(soup)[0] for a_tag in ed_tree.find_all('a', href=True): link = a_tag['href'].strip() # Some links are discarded if link.startswith('#'): # Plain #-links continue if link == '/': # Home page links continue if ('readspeaker' in link or 'adobe' in link or 'java' in link): # Non content links continue link = urljoin(mst_url, link) # Separate anchor from in-scope url if link.startswith(ROOT_URL) and '#' in link: link, _, anchor = link.partition('#') else: anchor = None if link == mst_url: # Discard in-page link (if it was not a plain #-link) continue if not link: print('Link is None or empty') continue # Check if the link is new for this scrape qry = f''' SELECT link_nr FROM scr_links WHERE url = '{link}' AND anchor IS {f"'{anchor}'" if anchor else 'NULL'} ''' qry_result = mst_conn.execute(qry).fetchone() if qry_result: # This link was already handled during this scrape link_nr = qry_result[0] else: # Store the link for this scrape link_nr = mst_conn.execute(''' INSERT INTO scr_links (url, anchor) VALUES (?, ?) ''', [link, anchor]).lastrowid # Save link in relation to the page (can be a duplicate if # the link is used more than once with the same text in the # same page). link_text = a_tag.text.strip() mst_conn.execute(''' INSERT INTO scr_ed_links (page_id, text, link_nr) VALUES (?, ?, ?) 
''', [page_id, link_text, link_nr]) # Add link_type and status from saved history (since these can # not be reproduced otherwise). mst_conn.execute(f''' WITH svd_links_status AS ( SELECT DISTINCT link_id, last_value(link_type) OVER win AS link_type, last_value(status) OVER win AS status FROM svd_his_links_status WHERE timestamp <= {ts} WINDOW win AS ( PARTITION BY link_id ORDER BY timestamp RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) ) UPDATE scr_links AS s SET link_type = ( SELECT link_type FROM svd_links_status AS t WHERE t.link_id = s.link_id), status = ( SELECT status FROM svd_links_status AS t WHERE t.link_id = s.link_id) ''') post_process_scrape() else: # Link related data does not need to be renewed: reinstate the # scr_links and scr_ed_links tables from the saved history. cte = f''' WITH last_ed_links_during_scrape AS ( SELECT DISTINCT page_id, text, link_id, last_value(present) OVER win AS present FROM svd_his_ed_links WHERE timestamp <= '{ts}' WINDOW win AS ( PARTITION BY page_id, text, link_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) ), ''' mst_conn.execute(cte + f''' last_links_status_during_scrape AS ( SELECT DISTINCT link_id, last_value(link_type) OVER win AS link_type, last_value(status) OVER win AS status FROM svd_his_links_status WHERE timestamp <= '{ts}' WINDOW win AS ( PARTITION BY link_id ORDER BY timestamp RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) ) INSERT INTO scr_links (link_nr, url, page_id, anchor, link_type, status) SELECT link_id, url, page_id, anchor, link_type, status FROM mst_links LEFT JOIN last_links_status_during_scrape USING (link_id) WHERE link_id IN (SELECT DISTINCT link_id FROM last_ed_links_during_scrape WHERE present > 0) ''') mst_conn.execute(cte + f''' -- Degroup same links per page last_ed_links_ungrouped AS ( SELECT page_id, text, link_id, present - 1 AS present FROM last_ed_links_during_scrape WHERE present > 0 UNION ALL SELECT page_id, text, link_id, present - 1 AS present FROM last_ed_links_ungrouped WHERE present > 0 ) INSERT INTO scr_ed_links (page_id, text, link_nr) SELECT page_id, text, link_id FROM last_ed_links_ungrouped ''') extract_pages_info(ts) derive_pages_info() status_figures() add_scraped_data(ts) logger.info(f'concluded reprocessing scrape {ts}') if copy_db: mst_conn.switch_back() return
Reprocess scraped data.
Arguments:
from_ts: first timestamp of data to be reprocessed
reprocess_links: reprocess content of his_ed_links table
copy_db: work on a copy of the scrapes database
Note: The structure of the scrapes storage is detailed in the create_scrapes_storage function.
This function will reprocess the data in the history tables of all scrapes, or starting from the scrape with timestamp from_ts. The reprocessing addresses all data that can be reproduced from the data that was gathered during the actual scrape. This 'basic' scrape data, which cannot itself be reproduced, consists of:
- page sources - available in source zip files
- redirections - stored in the his_redirs table
- unlinked pages - stored in the his_unlinked table (this information could be reproduced from link analysis of the page sources, but since only editorial links are registered, it is treated as basic data)
- status of links - stored in the his_links_status table
Using this function assumes that the content of the master tables is complete and correct. If there is any doubt about this assumption, some consistency checks can be executed on these tables with the check_storage_integrity function.
While reprocessing, the following tables are rebuilt with reprocessed data:
- his_ed_links - using page sources (only if reprocess_links is True)
- his_pages_info - using page sources
- his_pages_life - using page sources
- his_short_urls - using his_redirs
- his_status_figures - using his_pages_info
Although the other history tables are rebuilt as well, this happens with the same (basic) data and therefore yields tables with identical content.
Since reprocessing can take quite some time, it is by default executed on a copy of the scrapes database in order to avoid a clash with a next scrape being added. Any scrape that is added after execution of this function has started will be reprocessed as well.
After reprocessing finishes, all original history tables are saved under their original names prefixed with 'svd_'; they can be deleted from the database once the reprocessing has been verified as successful.
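A minimal usage sketch, assuming the function is imported from this bd_www.scrape module; the from_ts value is a hypothetical example.
from bd_www.scrape import reprocess_scrapes

# Reprocess all history data from a (hypothetical) scrape onwards, working
# on a copy of the scrapes database and keeping his_ed_links as saved.
reprocess_scrapes(from_ts='230601-0300',
                  reprocess_links=False,
                  copy_db=True)
# After verifying the result, the saved svd_* tables can be dropped manually.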