bd_www.scrape

"""
***Functions to scrape www.belastingdienst.nl.***

This module is used to execute a complete scrape of the
www.belastingdienst.nl site. It can be executed directly or by importing and
calling the `scrape_site` function from this `bd_www` package.
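
For example (a minimal sketch; depending on how the package exposes the
function, it may also be importable directly from `bd_www`):

    from bd_www.scrape import scrape_site

    ts = scrape_site()
    if ts:
        print(f'scrape {ts} finished and processed')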

This module provides two maintenance functions:

- `check_storage_integrity`: to execute some basic tests on the scrapes storage
- `reprocess_scrapes`: to reprocess earlier scrapes

This latter function should be used after changing the content of the
manually administered *non_scrape_ids* and *non_orphan_ids* tables, with a
`from_ts` starting timestamp after which these changes impact other data.
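
For example (a minimal sketch with a hypothetical timestamp; `from_ts` uses
the same [yymmdd-hhmm] format as the scrape timestamps):

    from bd_www.scrape import reprocess_scrapes

    reprocess_scrapes(from_ts='240101-0000')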

Apart from these maintenance functions, all other functions of this module
implement parts of the main functionalities and are not meant to be used
separately.

***Note:***
*The main design consideration for this module is that scraping should be
able to be scheduled and run unattended. Therefore, no unhandled exceptions
should occur and the basic scrape, i.e. getting and storing all pages
together with the redirections that occur while retrieving them, should
continue above all.*
"""

import copy
import datetime as dt
import logging
import re
import requests
import shutil
import time
import urllib3
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag, Comment, Stylesheet
from pathlib import Path
from typing import Set, Union
from urllib.parse import urljoin, urlparse, unquote

from bd_www import scrape_conf, report_conf, mst_conn, mst_dir, PageSourceZip
from bd_www.constants import ROOT_URL, SCRAPER_VERSION, WWW_SITE
from bd_www.report import REPORT_SUBDIRS
from bd_www.matomo import matomo_available, urls_last_days

# Non-public constants
EXTRACTED_FIELDS = [
    ('title', 'TEXT'),
    ('description', 'TEXT'),
    ('num_h1s', 'INTEGER'),
    ('first_h1', 'TEXT'),
    ('language', 'TEXT'),
    ('modified', 'DATE'),
    ('pagetype', 'TEXT'),
    ('classes', 'TEXT'),
    ('ed_text', 'TEXT'),
    ('aut_text', 'TEXT')
]
"Field definitions for the *his_pages_info* and *scr_pages_info* table " \
    "(part 1 of 2)."
DERIVED_FIELDS = [
    ('business', 'TEXT'),
    ('category', 'TEXT')
]
"Field definitions for the *his_pages_info* and *scr_pages_info* table " \
    "(part 2 of 2)."
DV_TYPES = {'bld-campagne', 'bld-dv-content', 'bld-filter'}
"Set of page-types that belong to the `dv` category."
BIB_TYPES = {'bld-bd', 'bld-cluster', 'bld-concept', 'bld-direction',
             'bld-faq', 'bld-landing', 'bld-overview', 'bld-sitemap',
             'bld-target', 'bld-targetGroup'}
"Set of page-types that belong to the `bib` category."
ALG_TYPES = {'bld-error', 'bld-iahWrapper', 'bld-newsItem', 'bld-outage',
             'bld-response', 'bld-search'}
"Set of page-types that belong to the `alg` category (excluding " \
    "'bld_wrapper', which is handled specifically)."
_README_TXT = "IMPORTANT\n" \
             "=========\n" \
             "A scrape is validated by the availability of a log file " \
             "containing the text 'scrape finished'.\n" \
             "Only delete a log when a scrape needs to be completely " \
             "discarded. Do this with great care!"
"Text for the README file in the directory with the scrape logs."

# Under the hood, the requests package uses the urllib3 package. While
# checking links in the scrape_page function, this package can give some
# warnings that are not relevant in the context of this scraper.
urllib3.disable_warnings()

logger = logging.getLogger()


def create_new_data_store() -> None:
    """
    **Create a new skeleton data store.**

    ***Note:*** *Since this function will not be used very often,
    it is mainly meant as a means to document the storage concept.*

    Driven by the configuration file `bd_www.constants.CONFIG_NAME`,
    (sub)directories are created in the master directory to contain the
    various scraped data, and empty scrapes and metrics databases are created
    with all necessary tables.

    The following data will be stored as files in separate **subdirectories**
    of the master directory:

    - page sources in zip files per scrape
    - 'robots.txt' files when used for finding unscraped url's and if different
      from the previous version
    - 'sitemap.xml' files when used for finding unscraped url's and if
      different from the previous version
    - log files for all scrapes
    - site reports per report variation, with subdirectories per reporting
      period

    The following **files** are stored directly in the master directory:

    - web analytics log file
    - reporting log file
    - scrapes database
    - metrics database

    The names of the above-mentioned files and (sub)directories are
    determined by the actual parameters of the configuration file. Please
    refer to the `bd_www.Config` class for further information.

    The ***scrapes database*** consists of the following tables and views:

    **History tables** - contain all basic, extracted and derived data from all
    scrapes. All stored data is labeled with the timestamp of the scrape.
    With some pragmatic exceptions the data is only saved if different from the
    previous scrape (time delta storage).

    - *his_pages_life* - time delta register when a page was alive or not
      (*alive* is FALSE when a page was not present)
    - *his_pages_info* - time delta register of extracted and derived data per
      page (full record saved when one or more values differ from previous
      scrape)
    - *his_ed_links* - time delta register with usage of editorial links
      (*present* is 0 when an editorial link was not used)
    - *his_links_status* - time delta register with type and status of
      (editorial) links
    - *his_redirs* - time delta register with redirects that occurred while
      scraping (*redir_type* is 'absent' when it did not occur)
    - *his_unlinked* - time delta register with pages that were not connected
      via editorial links (*unl_type* is 'absent' when no longer unlinked or
      not present)
    - *his_short_urls* - time delta register when short url's were active
    - *his_status_figures* - figures representing the status of the site for
      each scrape

    **Master tables** - contain unique and primary data

    - *mst_paths* - unique relation between *path* and *page_id*
    - *mst_links* - unique relation between *url* or *page_id*/*anchor* and
      *link_id*
    - *mst_scrapes* - primary reference for available scrapes and unique source
      for *scraper* and *primers* fields

    **Non tables/views** - collections of manually administered *page_id*'s
    to force special handling of these pages.

    - *non_scrape_ids* - table with manually administered *page_id*'s that will
      not be scraped
    - *non_orphan_ids* - table with manually administered *page_id*'s that will
      not be considered orphans when becoming unlinked (*unl_type* will be set
      to 'special' in those cases)
    - *non_orphan_paths* - view of table *non_orphan_ids* with explicit *path*
    - *non_scrape_paths* - view of table *non_scrape_ids* with explicit *path*

    **Scrape tables** - operational stores for scrape results; purged when
    starting a new scrape.

    - *scr_page_ids* - id's of pages that were scraped
    - *scr_redirs* - redirects that occurred while scraping
    - *scr_pages_info* - extracted and derived data per page
    - *scr_links* - register of fully qualified (including *link_type* and
      *status*) links in the editorial content of the scraped pages
    - *scr_ed_links* - use of links in the editorial content of each page
    - *scr_unlinked* - pages that can not be reached from other pages
    - *scr_status_figures* - figures representing the status of the site

    **Timestamped table and views** - after saving a timestamp in (the only
    record of) the *tsd* table, the timestamped views represent a dataset
    comparable to the scrape tables as they were when the scrape with that
    timestamp finished (see the example after this list).

    - *tsd* - scalar store to set the timestamp for the next timestamped views
    - *tsd_redirs* - identical to equivalent scrape table
    - *tsd_redirs_to_wcm* - all redirects related to the canonical url of the
      page they finally redirected to (not available as scrape table)
    - *tsd_pages_info* - identical to equivalent scrape table
    - *tsd_links* - identical to equivalent scrape table
    - *tsd_ed_links* - identical to equivalent scrape table
    - *tsd_unlinked* - identical to equivalent scrape table
    - *tsd_int_ref_cnt* - number of times that a page is linked to from the
      editorial content of other pages (not available as scrape table)
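
    As an example, the following sketch (with a hypothetical timestamp) points
    the timestamped views at an earlier scrape and reads the page info as it
    was at that moment:

        mst_conn.execute('DELETE FROM tsd')
        mst_conn.execute('INSERT INTO tsd (timestamp) VALUES (?)',
                         ['240101-0400'])
        rows = mst_conn.execute('SELECT * FROM tsd_pages_info').fetchall()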

    The ***metrics database*** consists of the following tables and views:

    **Basic tables** - contain unique and primary data

    - *shd_paths* - unique relation between *path* and *page_id*; acts as
      shadow subset of the *mst_paths* table in the scrape database as
      substitute for an inter-db relation that is not possible with SQLite
    - *shd_links* - unique relation between *link_id* and *url*; acts as
      shadow subset of the *mst_links* table in the scrape database as
      substitute for an inter-db relation that is not possible with SQLite
    - *dates* - unique relation between *date* and *date_id*

    **Analytics tables** - contain all historic analytics data

    - *daily* - numeric analytics data per date and page
    - *feedback* - numeric and textual feedback per date and page
    - *downloads* - numeric download data per date and download link

    **Analytics views** - views for the analytics tables with explicit relations

    - *daily_expl* - daily data table with explicit date and page path
    - *feedback_expl* - feedback data table with explicit date and page path
    - *downloads_expl* - downloads data table with explicit date and link path
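
    For example, daily figures can be inspected with explicit dates and paths
    (hypothetical date value; the actual format follows the *dates* table):

        mst_conn.execute('''
            SELECT date, path, nb_visits, nb_hits
            FROM mtrx.daily_expl
            WHERE date = '2024-01-01'
            ''').fetchall()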
    """

    # Create directories and database file
    for dir_name in (scrape_conf.src_dir_name,
                     scrape_conf.robots_dir_name,
                     scrape_conf.log_dir_name,
                     scrape_conf.sitemap_dir_name):
        (mst_dir / dir_name).mkdir(exist_ok=True)
    for report in report_conf.reports:
        report_conf.spec_report(report)
        report_dir = mst_dir / report_conf.report_dir_name
        report_dir.mkdir(exist_ok=True)
        for subdir_name in set(REPORT_SUBDIRS.values()):
            (report_dir / subdir_name).mkdir(exist_ok=True)
    log_readme = mst_dir / scrape_conf.log_dir_name / 'README.txt'
    with open(log_readme, 'w') as f:
        f.write(_README_TXT)

    # Scrape db: create all tables to contain the historical scrape data
    mst_conn.execute('''
        CREATE TABLE mst_scrapes (
            timestamp	TEXT PRIMARY KEY,
            date        AS ('20' || substr(timestamp,1,2)
                            || '-' || substr(timestamp,3,2)
                            || '-' || substr(timestamp,5,2)),
            scraper		TEXT,
            primers		TEXT
        )
        ''')
    mst_conn.execute('''
        CREATE TABLE mst_paths (
            page_id	    INTEGER PRIMARY KEY AUTOINCREMENT,
            path        TEXT NOT NULL UNIQUE
        )
        ''')
    mst_conn.execute('''
        CREATE TABLE mst_links (
            link_id		INTEGER PRIMARY KEY AUTOINCREMENT,
            url			TEXT,
            anchor      TEXT,
            page_id		INTEGER,
            UNIQUE (url),
            UNIQUE (page_id, anchor),
            FOREIGN KEY (page_id)
                REFERENCES mst_paths (page_id)
        )
        ''')
    mst_conn.execute('''
        CREATE TABLE his_pages_life (
            timestamp	TEXT NOT NULL,
            page_id     INTEGER NOT NULL,
            alive       INTEGER NOT NULL,
            PRIMARY KEY (timestamp, page_id),
            FOREIGN KEY (page_id)
                REFERENCES mst_paths (page_id)
        )
        ''')
    fields = EXTRACTED_FIELDS + DERIVED_FIELDS
    info_columns = ', '.join([f'{f[0]} {f[1]}' for f in fields])
    mst_conn.execute(f'''
        CREATE TABLE his_pages_info (
            timestamp	TEXT NOT NULL,
            page_id     INTEGER NOT NULL,
            {info_columns},
            PRIMARY KEY (timestamp, page_id),
            FOREIGN KEY (page_id)
                REFERENCES mst_paths (page_id)
        )
        ''')
    mst_conn.execute('''
        CREATE TABLE his_links_status (
            timestamp	TEXT NOT NULL,
            link_id		INTEGER NOT NULL,
            link_type	TEXT,
            status		INTEGER,
            PRIMARY KEY (timestamp, link_id),
            FOREIGN KEY (link_id)
                REFERENCES mst_links (link_id)
        )
        ''')
    mst_conn.execute('''
        CREATE TABLE his_ed_links (
            timestamp	TEXT NOT NULL,
            page_id	    INTEGER NOT NULL,
            text 	    TEXT NOT NULL,
            link_id	    INTEGER NOT NULL,
            present		INTEGER NOT NULL,
            PRIMARY KEY (timestamp, page_id, text, link_id),
            FOREIGN KEY (page_id)
                REFERENCES mst_paths (page_id),
            FOREIGN KEY (link_id)
                REFERENCES mst_links (link_id)
        )
        ''')
    mst_conn.execute('''
        CREATE TABLE his_redirs (
            timestamp	TEXT NOT NULL,
            req_id      INTEGER,
            req_url     TEXT,
            redir_id    INTEGER,
            redir_url   TEXT,
            redir_type  TEXT NOT NULL,
            PRIMARY KEY (timestamp, req_id, req_url),
            UNIQUE (timestamp, req_id),
            UNIQUE (timestamp, req_url),
            FOREIGN KEY (req_id)
                REFERENCES mst_paths (page_id),
            FOREIGN KEY (redir_id)
                REFERENCES mst_paths (page_id)
        )
        ''')
    mst_conn.execute('''
        CREATE TABLE his_unlinked (
            timestamp	TEXT NOT NULL,
            page_id	    INTEGER,
            url		    TEXT,
            unl_type	TEXT,
            PRIMARY KEY (timestamp, page_id, url),
            UNIQUE (timestamp, page_id),
            UNIQUE (timestamp, url),
            FOREIGN KEY (page_id)
                REFERENCES mst_paths (page_id)
        )
        ''')
    mst_conn.execute('''
        CREATE TABLE his_short_urls (
            timestamp	TEXT NOT NULL,
            url     	TEXT NOT NULL,
            active      INTEGER NOT NULL,
            PRIMARY KEY (timestamp, url)
        )
        ''')
    mst_conn.execute('''
        CREATE TABLE his_status_figures (
            timestamp	TEXT NOT NULL,
            name		TEXT NOT NULL,
            value		INTEGER NOT NULL,
            PRIMARY KEY (timestamp, name)
        )
        ''')
    mst_conn.execute('''
        CREATE TABLE non_orphan_ids (
            page_id	    INTEGER NOT NULL UNIQUE,
            FOREIGN KEY (page_id)
                REFERENCES mst_paths (page_id)
        )
        ''')
    mst_conn.execute('''
        CREATE TABLE non_scrape_ids (
            page_id	    INTEGER NOT NULL UNIQUE,
            FOREIGN KEY (page_id)
                REFERENCES mst_paths (page_id)
        )
        ''')

    # Scrape db: create tables to contain operational results during a scrape
    mst_conn.execute('''
        CREATE TABLE scr_page_ids (
            page_id	    INTEGER NOT NULL UNIQUE,
            FOREIGN KEY (page_id)
                REFERENCES mst_paths (page_id)
        )
        ''')
    mst_conn.execute('''
        CREATE TABLE scr_redirs (
            req_id      INTEGER UNIQUE,
            req_url     TEXT UNIQUE,
            redir_id    INTEGER,
            redir_url   TEXT,
            redir_type  TEXT NOT NULL,
            PRIMARY KEY (req_id, req_url),
            FOREIGN KEY (req_id)
                REFERENCES mst_paths (page_id),
            FOREIGN KEY (redir_id)
                REFERENCES mst_paths (page_id)
        )
        ''')
    mst_conn.execute('''
        CREATE TABLE scr_links (
            link_nr		INTEGER PRIMARY KEY AUTOINCREMENT,
            url			TEXT,
            page_id		INTEGER,
            anchor      TEXT,
            link_type	TEXT,
            status  	INTEGER,
            link_id	    INTEGER UNIQUE,
            UNIQUE (url, anchor),
            UNIQUE (page_id, anchor),
            FOREIGN KEY (page_id)
                REFERENCES mst_paths (page_id)
        )
        ''')
    mst_conn.execute('''
        CREATE TABLE scr_ed_links (
            page_id	    INTEGER NOT NULL,
            text 	    TEXT NOT NULL,
            link_nr	    INTEGER NOT NULL,
            FOREIGN KEY (page_id)
                REFERENCES mst_paths (page_id),
            FOREIGN KEY (link_nr)
                REFERENCES scr_links (link_nr)
        )
        ''')
    mst_conn.execute('''
        CREATE TABLE scr_unlinked (
            page_id	    INTEGER UNIQUE,
            url		    TEXT UNIQUE,
            unl_type    TEXT,
            FOREIGN KEY (page_id)
                REFERENCES mst_paths (page_id)
        )
        ''')
    fields = EXTRACTED_FIELDS + DERIVED_FIELDS
    info_columns = ', '.join([f'{f[0]} {f[1]}' for f in fields])
    mst_conn.execute(f'''
        CREATE TABLE scr_pages_info (
            page_id INTEGER PRIMARY KEY,
            {info_columns},
            FOREIGN KEY (page_id)
            REFERENCES mst_paths (page_id)
        )
        ''')
    mst_conn.execute('''
        CREATE TABLE scr_status_figures (
            name		TEXT PRIMARY KEY,
            value		INTEGER NOT NULL
        )
        ''')

    # Scrape db: create views for non-orphan and non-scrape tables with added
    # paths.
    mst_conn.execute('''
        CREATE VIEW non_orphan_paths AS
        SELECT * FROM non_orphan_ids NATURAL JOIN mst_paths
        ''')
    mst_conn.execute('''
        CREATE VIEW non_scrape_paths AS
        SELECT * FROM non_scrape_ids NATURAL JOIN mst_paths
        ''')

    # Scrape db: create a simple table to contain the timestamp for a
    # specific scrape, together with the views to get the data for that scrape.
    mst_conn.execute('CREATE TABLE tsd (timestamp TEXT)')
    mst_conn.execute('''
        CREATE VIEW tsd_unlinked AS
        WITH
            last_unlinked_during_scrape AS (
                SELECT DISTINCT
                    page_id, url,
                    last_value(unl_type) OVER win AS unl_type
                FROM his_unlinked AS u, tsd AS s
                WHERE u.timestamp <= s.timestamp
                WINDOW win AS (
                    PARTITION BY page_id, url
                    ORDER BY u.timestamp
                    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
                )
            )
        SELECT *
        FROM last_unlinked_during_scrape
        WHERE unl_type != 'absent'
        ''')
    mst_conn.execute('''
        CREATE VIEW tsd_redirs AS
        WITH
            last_redirs_during_scrape AS (
                SELECT DISTINCT
                    req_id, req_url,
                    last_value(redir_id) OVER win AS redir_id,
                    last_value(redir_url) OVER win AS redir_url,
                    last_value(redir_type) OVER win AS redir_type
                FROM his_redirs AS r, tsd AS s
                WHERE r.timestamp <= s.timestamp
                WINDOW win AS (
                    PARTITION BY req_id, req_url
                    ORDER BY r.timestamp
                    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
                )
            )
        SELECT *
        FROM last_redirs_during_scrape
        WHERE redir_type != 'absent'
        ''')
    mst_conn.execute('''
        CREATE VIEW tsd_redirs_to_wcm (req_url, wcm_id, final_type) AS
        WITH
            -- build url-based redir chains
            scr_chains (org_url, prev_url, next_id, next_url, next_type) AS (
                SELECT
                    req_url, NULL, redir_id, redir_url, redir_type
                FROM tsd_redirs
                    UNION ALL
                SELECT
                    org_url, next_url, r.redir_id, r.redir_url, r.redir_type
                FROM scr_chains AS c
                LEFT JOIN tsd_redirs AS r ON c.next_url = r.req_url
                -- avoid endless loops by excluding redirs of an alias
                WHERE c.next_type <> 'alias'
            )
        -- get all redir chains ending in a wcm url
        SELECT org_url, next_id, next_type
        FROM scr_chains
        WHERE next_id IS NOT NULL AND org_url IS NOT NULL
        ''')
    mst_conn.execute('''
        CREATE VIEW tsd_ed_links AS
        WITH RECURSIVE
            last_his_ed_links_records_during_scrape AS (
                SELECT DISTINCT
                    page_id, text, link_id,
                    last_value(present) OVER win AS present
                FROM his_ed_links AS e, tsd AS s
                WHERE e.timestamp <= s.timestamp
                WINDOW win AS (
                    PARTITION BY page_id, text, link_id
                    ORDER BY e.timestamp
                    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
                )
            ),
            -- Degroup same links per page
            scr_ed_links_singles AS (
                SELECT page_id, text, link_id, present - 1 AS present
                FROM last_his_ed_links_records_during_scrape
                WHERE present > 0
                    UNION ALL
                SELECT page_id, text, link_id, present - 1 AS present
                FROM scr_ed_links_singles
                WHERE present > 0
            )
        SELECT page_id, text, link_id
        FROM scr_ed_links_singles
        ''')
    mst_conn.execute('''
        CREATE VIEW tsd_links AS
        WITH
            tsd_links_status AS (
                SELECT DISTINCT
                    link_id,
                    last_value(link_type) OVER win AS link_type,
                    last_value(status) OVER win AS status
                FROM his_links_status AS l, tsd AS s
                WHERE l.timestamp <= s.timestamp
                WINDOW win AS (
                    PARTITION BY link_id
                    ORDER BY l.timestamp
                    RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
                )
            )
        SELECT link_id, url, page_id, anchor, link_type, status
        FROM mst_links
        LEFT JOIN tsd_links_status USING (link_id)
        WHERE link_id IN (SELECT DISTINCT link_id FROM tsd_ed_links)
        ''')
    mst_conn.execute('''
        CREATE VIEW tsd_int_ref_cnt (page_id, referral_cnt) AS
        WITH
            -- wcm id's of the internal links per page
            int_page_links (referring_page_id,
                            referral_cnt,
                            linked_page_id) AS (
                SELECT e.page_id, count(*), ifnull(l.page_id, s.wcm_id)
                FROM tsd_ed_links AS e
                LEFT JOIN mst_links AS l USING (link_id)
                LEFT JOIN tsd_redirs_to_wcm AS s 
                    ON iif(instr(l.url, '#'),
                           substr(l.url, 1, instr(l.url, '#') - 1),
                           l.url) = req_url
                WHERE (l.page_id OR s.wcm_id)
                GROUP BY e.page_id, e.text, e.link_id
            )
        SELECT linked_page_id, sum(referral_cnt)
        FROM int_page_links
        -- next clause removes some spurious in-page links
        WHERE referring_page_id != linked_page_id
        GROUP BY linked_page_id
        ''')
    mst_conn.execute('''
        CREATE VIEW tsd_pages_info AS
        WITH
            scr_pages AS (
                SELECT DISTINCT
                    page_id,
                    last_value(alive) OVER win AS alive
                FROM his_pages_life AS p, tsd AS s
                WHERE p.timestamp <= s.timestamp
                WINDOW win AS (
                    PARTITION BY page_id
                    ORDER BY p.timestamp
                    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
                )
            ),
            scr_last_info_ts (page_id, timestamp) AS (
                SELECT DISTINCT page_id, last_value(i.timestamp) OVER win
                FROM his_pages_info AS i, tsd AS s
                WHERE i.timestamp <= s.timestamp
                WINDOW win AS (
                    PARTITION BY page_id
                    ORDER BY i.timestamp
                    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
                )
            )
        SELECT pi.*
        FROM scr_pages
        LEFT JOIN scr_last_info_ts AS li USING (page_id)
        LEFT JOIN his_pages_info AS pi USING (page_id, timestamp)
        WHERE alive
        ''')

    # Metrics db: create basic tables
    mst_conn.execute('''
        CREATE TABLE mtrx.shd_paths (
            page_id	    INTEGER NOT NULL UNIQUE,
            path	    TEXT NOT NULL UNIQUE,
            PRIMARY KEY (page_id)
        )
        ''')
    mst_conn.execute('''
        CREATE TABLE mtrx.shd_links (
            link_id     INTEGER PRIMARY KEY,
            url			TEXT NOT NULL
        )
        ''')
    mst_conn.execute('''
        CREATE TABLE mtrx.dates (
            date_id	    INTEGER UNIQUE,
            date	    TEXT NOT NULL UNIQUE,
            PRIMARY KEY (date_id AUTOINCREMENT)
        )
        ''')

    # Metrics db: create tables to store analytics data
    mst_conn.execute('''
        CREATE TABLE mtrx.daily (
            date_id				INTEGER NOT NULL,
            page_id				INTEGER NOT NULL,
            nb_visits			INTEGER,
            nb_hits				INTEGER,
            entry_nb_visits		INTEGER,
            entry_bounce_count	INTEGER,
            bounce_rate			REAL AS
                (round(CAST(entry_bounce_count AS REAL) / entry_nb_visits, 3)),
            exit_nb_visits		INTEGER,
            exit_rate			REAL AS
                (round(CAST(exit_nb_visits AS REAL) / nb_visits, 3)),
            organic_entries		INTEGER,
            organic_entry_rate	REAL AS
                (round(CAST(organic_entries AS REAL) / nb_visits, 3)),
            call_visits			INTEGER,
            call_rate			REAL AS
                (round(CAST(call_visits AS REAL) / nb_visits, 3)),
            PRIMARY KEY (date_id, page_id),
            FOREIGN KEY (date_id) REFERENCES dates,
            FOREIGN KEY (page_id) REFERENCES shd_paths
        )
        ''')
    mst_conn.execute('''
        CREATE TABLE mtrx.feedback (
            date_id		INTEGER NOT NULL,
            page_id		INTEGER NOT NULL,
            pos_cnt		INTEGER,
            neg_cnt		INTEGER,
            pos_txt		TEXT,
            neg_txt		TEXT,
            PRIMARY KEY (date_id, page_id),
            FOREIGN KEY (date_id) REFERENCES dates,
            FOREIGN KEY (page_id) REFERENCES shd_paths
        )
        ''')
    mst_conn.execute('''
        CREATE TABLE mtrx.downloads (
            date_id	    INTEGER NOT NULL REFERENCES dates,
            link_id	    INTEGER NOT NULL REFERENCES shd_links,
            nb_visits	INTEGER,
            nb_hits		INTEGER,
            PRIMARY KEY (date_id, link_id)
        )
        ''')

    # Metrics db: create views to explicitly show referenced data
    mst_conn.execute('''
        CREATE VIEW mtrx.daily_expl AS
        SELECT date, path, d.*
        FROM daily AS d
        NATURAL JOIN dates
        NATURAL JOIN shd_paths
        ''')
    mst_conn.execute('''
        CREATE VIEW mtrx.feedback_expl AS
        SELECT date, path, f.*
        FROM feedback AS f
        NATURAL JOIN dates
        NATURAL JOIN shd_paths
        ''')
    mst_conn.execute('''
        CREATE VIEW mtrx.downloads_expl AS
        SELECT date, url, dl.*
        FROM downloads AS dl
        NATURAL JOIN dates
        NATURAL JOIN shd_links
        ''')


def scrape_site() -> str | None:
    """
    **Scrape www.belastingdienst.nl.**

    Returns:

        timestamp of the scrape [yymmdd-hhmm], or None in case a
            'SCRAPE_IN_PROGRESS.flag' file is encountered

    One complete scrape is done of `www.belastingdienst.nl` and stored in the
    scrape tables of the scrapes database (documented with the
    `create_new_data_store` function) and a zip-file with page sources.

    The scrape tables will contain `page_id`'s of the scraped pages,
    redirects that occurred while retrieving the pages, detailed info per
    page, use of links from editorial content and figures representing the
    site status.

    After finishing the scrape and completing the scrape tables, the result of
    the scrape is added to the history tables of the scrapes database.

    Further process and status information of the scrape is retained in the
    following files:

    - log file with info, warnings and/or errors of the scrape
    - copies of the 'robots.txt' and 'sitemap.xml' files if they are used to
      find unscraped pages and if they differ from the previously saved version

    In case the scraper operates in an environment with normal external
    access, all editorial links will be checked. Otherwise only links to (a
    few configured) trusted sites are checked. The outcome of all checks,
    executed or not, will be reflected in the *status* field of the
    *scr_links* and *his_links_status* tables of the scrapes database.
    """

    global logger

    def ext_int_available() -> bool:
        """
        **Tests whether external internet is available.**

        Returns:

            Availability.

        Only tests if www.google.nl can be reached.
        """
        # noinspection PyBroadException
        try:
            ok = requests.get(
                'https://www.google.nl', timeout=5).status_code == 200
        except Exception:
            ok = False
        return ok

    # Some preliminaries
    ts = time.strftime('%y%m%d-%H%M')
    logger = logging.getLogger('scrape')
    log_dir = mst_dir / scrape_conf.log_dir_name
    log_file = str(log_dir / f'{ts}_{scrape_conf.log_name}')
    logger.setLevel(logging.INFO)
    fh = logging.FileHandler(log_file)
    fh.setLevel(logging.INFO)
    ch = logging.StreamHandler()
    ch.setLevel(logging.INFO)
    formatter = logging.Formatter(
        fmt='[%(asctime)s] %(levelname)-8s - %(message)s',
        datefmt='%Y-%m-%d %H:%M:%S')
    fh.setFormatter(formatter)
    ch.setFormatter(formatter)
    logger.addHandler(fh)
    logger.addHandler(ch)
    logger.info(f'scrape {ts} started')

    # Check if another scrape is running
    flag = mst_dir / 'SCRAPE_IN_PROGRESS.flag'
    if flag.exists():
        ct = dt.datetime.fromtimestamp(flag.stat().st_ctime)
        logger.warning(f'Another scrape in progress since '
                       f'{ct.strftime("%a %d %b %Y / %H:%M")}')
        return None
    else:
        flag.touch()

    # Load url's that should not be scraped
    qry_result = mst_conn.execute(
        'SELECT path FROM non_scrape_paths').fetchall()
    urls_done = {ROOT_URL + path for (path,) in qry_result}

    # Initialisations
    empty_scr_tables()
    api_ok = matomo_available()
    ext_ok = ext_int_available()
    if not ext_ok:
        logger.warning('www.google.nl not available: '
                       'assuming limited external access')
    urls_todo = {WWW_SITE}
    urls_scraped = 0
    pages_saved = 0
    unlinked_urls_checked = False
    mst_conn.execute('''
        INSERT INTO mst_scrapes (timestamp, scraper, primers)
        VALUES (?, ?, ?)
        ''', [ts, SCRAPER_VERSION, 'site url only'])
    start_time = time.time()

    # Do the scrape
    max_nr = 100_000 if scrape_conf.max_urls == 'all' else scrape_conf.max_urls
    src_zip = PageSourceZip(ts)
    while urls_todo and urls_scraped < max_nr:
        req_url = urls_todo.pop()
        wcm_url = scrape_page(req_url, urls_done, urls_todo, src_zip, ext_ok)
        urls_scraped += 1
        if wcm_url:
            pages_saved += 1

        # After initial link crawling, check whether any url's were missed
        if not urls_todo and not unlinked_urls_checked:
            logger.info(f'initial link crawling done: '
                        f'{urls_scraped} url\'s visited / '
                        f'{pages_saved} pages saved')
            logger.info('finding url\'s that have not been crawled')
            urls_todo = find_unscraped_urls(ts, urls_done,
                                            historic=True, requested=api_ok)
            unlinked_urls_checked = True

        # Time cycles and print progress and prognosis
        num_todo = min(len(urls_todo), max_nr - urls_scraped)
        if urls_scraped % 25 == 0:
            page_time = (time.time() - start_time) / urls_scraped
            togo_time = int(num_todo * page_time)
            print(f'{urls_scraped:4} url\'s done, '
                  f'{page_time:.2f} secs per url / '
                  f'{num_todo:4} found todo, {togo_time // 60}:'
                  f'{togo_time % 60:02} min to scrape them')

    elapsed = int(time.time() - start_time)
    # Important: do not alter the next logging text; the presence of this
    # text in the log file validates the scrape (ref. check_storage_integrity
    # function).
    logger.info(f'scrape finished in {elapsed // 60}:{elapsed % 60:02} min')
    logger.info(f'{urls_scraped} url\'s visited / {pages_saved} pages saved')

    # Processing the scraped data
    post_process_scrape()
    extract_pages_info(ts)
    derive_pages_info()
    status_figures()
    add_scraped_data(ts)

    elapsed = int(time.time() - start_time)
    logger.info(f'scraping and processing data finished in '
                f'{elapsed // 60}:{elapsed % 60:02} min')
    flag.unlink(missing_ok=True)
    return ts


def empty_scr_tables() -> None:
    """
    **Delete all content from scrape tables.**
    """
    qry = '''
        SELECT name
        FROM sqlite_schema
        WHERE type = 'table' AND name LIKE 'scr_%'
        '''
    for name in [r[0] for r in mst_conn.execute(qry)]:
        mst_conn.execute(f'DELETE FROM {name}')


def scrape_page(
        req_url: str,
        urls_done: Set[str],
        urls_todo: Set[str],
        page_zip: PageSourceZip,
        check_all_links: bool = True) -> Union[str, None]:
    """
    **Scrape an html page and save the results to the scrape tables.**

    Arguments:

        req_url: url of requested page
        urls_done: url's that have been scraped already
        urls_todo: url's that still need to be scraped
        page_zip: archive to store scraped page sources
        check_all_links: when false, only trusted links will be checked

    Returns:

        canonical wcm url (or None in the cases mentioned below)

    Get the final response from the requested url while collecting any
    redirect that occurs during the request. Register the page under the
    canonical wcm path and save the page source to the zip-archive. Further
    store page-related data, editorial links and redirects to the scrape
    tables of the scrapes database (documented with the
    `create_new_data_store` function). The `urls_done` and `urls_todo` sets
    are updated accordingly, driving the further scrape of the site.
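
    A minimal driving loop, mirroring the one in `scrape_site` (the timestamp
    for the source zip is a hypothetical example):

        urls_done = set()
        urls_todo = {WWW_SITE}
        src_zip = PageSourceZip('240101-0400')
        while urls_todo:
            scrape_page(urls_todo.pop(), urls_done, urls_todo, src_zip)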

    When `check_all_links` is `True`, all links within the page that are
    external (do not start with `bd_www.constants.ROOT_URL`) will be checked.
    When set to `False`, only those external links will be checked that go to
    a site in the `trusted_domains` configuration parameter (refer to
    `bd_www.Config`). This option can be used when this module runs on a
    platform that (for reasons of security) has limited external site access.

    All url's are absolute, also in the resulting *scr_links*, *scr_redirs*
    and *scr_unlinked* tables. The `post_process_scrape` function should be
    used to normalise these url's and determine the type of the unlinked
    pages.

    The returned canonical wcm url is the content attribute of the `<meta
    name="DCTERMS.identifier">` tag. This is the url as generated by the WCM
    system.

    The returned value will be `None` when the requested url

    -   was already scraped (and saved)
    -   redirects to an out-of-scope page
    -   gave an unexpected response (which is logged)
    """

    def check_ext_link(
            lnk: str, trusted_only: bool = False) -> tuple[str, str]:
        """
        **Return response status and type of an external link.**

        Arguments:

            lnk: url to be checked
            trusted_only: when true, only links to trusted sites will be
                checked; the returned response status will reflect this

        Returns:

            response status, link type

        The returned `response status` is the status code from the primary
        response upon requesting the given url.

        By default the returned `link type` will be 'ext', but when the
        response is a redirect, depending on the destination url, '-to-int'
        or '-to-ext' is added to this type.

        When requesting the url raises an error, the error is returned via
        the `response status`.

        The sites to be trusted are taken from the `trusted_domains`
        configuration parameter in the [SCRAPE] section of the configuration
        file (see `bd_www.Config`).
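
        The comparison can be illustrated as follows (illustrative url; the
        reduction is done by the regular expression in this function):

            netloc = urlparse('https://download.belastingdienst.nl/x').netloc
            # netloc == 'download.belastingdienst.nl'; it is reduced to
            # 'belastingdienst.nl' and then looked up in
            # scrape_conf.trusted_domains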
        """
        l_type = 'ext'
        if trusted_only:
            re_result = re.match(r'^(.*\.)?(.*\..*)$', urlparse(lnk).netloc)
            link_site = re_result[2] if re_result else None
            if link_site not in scrape_conf.trusted_domains:
                return 'unable to check link', l_type
        try:
            response = requests.get(
                lnk, allow_redirects=False, verify=False, timeout=5)
        except Exception as err:
            resp_status = f'error while checking link: {err}'
        else:
            resp_status = response.status_code
            if resp_status in (301, 302, 303, 307):
                location = response.headers['Location']
                if location.startswith(ROOT_URL):
                    l_type += '-to-int'
                else:
                    l_type += '-to-ext'
        return str(resp_status), l_type

    def save_page(path: str, doc: str) -> Union[int, None]:
        """
        **Save scraped page to the page source zip.**

        Arguments:

            path: relative to the ROOT_URL of the scrape
            doc: page source

        Returns:

            page_id or None if already saved

        If the page was not saved before, it will be saved with the page_id
        requested from the `mst_paths` table. A new page_id will be created
        if needed.
        """
        qry_res = mst_conn.execute(
            'SELECT page_id FROM mst_paths WHERE path = ?',
            [path]).fetchone()
        if qry_res:
            p_id = qry_res[0]
        else:
            # This is a brand new page
            p_id = mst_conn.execute(
                'INSERT INTO mst_paths (path) VALUES (?)',
                [path]).lastrowid
        if p_id in page_zip.page_ids():
            # Page already saved during this scrape
            return None
        else:
            # Page not saved yet during this scrape
            page_zip.add_page(p_id, doc)
            mst_conn.execute('INSERT INTO scr_page_ids VALUES (?)', [p_id])
        return p_id

    # Setup some variables
    page_as_string = ''
    soup = None
    redir_qry = '''
        INSERT OR IGNORE INTO scr_redirs
            (req_url, redir_url, redir_type)
        VALUES
            (?, ?, ?)
        '''

    # Request loop
    while True:
        # Keep requesting until no further rewrites or redirects occur,
        # while saving redirects to the temporary table and updating the
        # urls_done admin.
        # noinspection PyBroadException
        try:
            resp = requests.get(req_url, allow_redirects=False)
        except Exception:
            # The requester failed on this url; consider it done and continue.
            logger.error(f'failed to request {req_url}')
            urls_done.add(req_url)
            return None
        urls_done.add(req_url)

        # Check and handle status codes
        if resp.status_code == 200:
            pass
        elif resp.status_code in (301, 302):
            # This concerns a redirect response
            redir_url = resp.headers['Location']
            # If request or redir is in scope, save the redirect with full
            # url's to the temporary table.
            if (req_url.startswith(ROOT_URL)
                    or redir_url.startswith(ROOT_URL)):
                mst_conn.execute(
                    redir_qry, [req_url, redir_url, resp.status_code])
            if redir_url.startswith(ROOT_URL):
                # restart Request loop with redirected url
                req_url = redir_url
                continue
            else:
                # Redirect url is out of scope.
                return None
        else:
            # Returned status code is not 200, 301 or 302, so getting the page
            # did not succeed. Consider it done and return a None result.
            logger.error(f'unexpected response from {req_url}; '
                         f'status code is {resp.status_code}.')
            return None

        # Read and parse the response into a soup document
        page_as_string = resp.text
        soup = BeautifulSoup(page_as_string, features='lxml')

        # Has this page a client-side redirect by means of header tag
        # <meta http-equiv="refresh" content="0;url=...">?
        meta_tag = soup.find('meta', attrs={'http-equiv': 'refresh'})
        if meta_tag:
            redir_url = meta_tag['content'].split('url=')[1]
            redir_url = urljoin(req_url, redir_url)
            # If request or redir is in scope, save the redirect with full
            # url's to the temporary table.
            if (req_url.startswith(ROOT_URL)
                    or redir_url.startswith(ROOT_URL)):
                mst_conn.execute(redir_qry, [req_url, redir_url, 'client'])
            if redir_url.startswith(ROOT_URL):
                # Restart Request loop with redirected url
                req_url = redir_url
                continue
            else:
                # Redirect url is out of scope.
                return None

        # End of the Request loop.
        # Reaching this point means we're done requesting this page.
        break

    # Strip potential parameters and anchor in the response url
    resp_url = resp.url.split('?')[0].split('#')[0]
    # Determine the wcm url
    if resp_url.startswith(ROOT_URL):
        # Response url is in scope. Now the wcm url should be in the header
        # meta tag named 'DCTERMS.identifier'.
        meta_tag = soup.find(
            'meta', attrs={'name': 'DCTERMS.identifier'})
        if meta_tag:
            wcm_url = meta_tag['content']
            wcm_url = urljoin(resp_url, wcm_url)
            if wcm_url != resp_url:
                # Save this as an alias in the redirects table
                mst_conn.execute(redir_qry, [resp_url, wcm_url, 'alias'])
                # Consider also the wcm_url as done
                urls_done.add(wcm_url)
        else:
            logger.error(f'page without wcm url; '
                         f'falling back to: {resp_url}')
            wcm_url = resp_url
    else:
        # Response url is not in scope.
        # Because of the logic from the code above this situation should
        # never occur. Since this is not destructive, scraping continues.
        logger.error(f'out of scope response from {req_url}')
        return None

    # Save in-scope page under the wcm path.
    # It is possible that a page in the urls_todo set redirects to a wcm_url
    # of which the page/path combination is saved already. Especially some
    # short url's cause this situation. Those cases are not saved (again).
    wcm_path = wcm_url.split(ROOT_URL)[1]
    page_id = save_page(wcm_path, page_as_string)
    if not page_id:
        # Page was saved and handled earlier during this scrape
        return None

    # Fetch all in-page links, save the editorial links to the scrape db and
    # add links of in-scope pages to the urls_todo set.
    # Cycle over the editorial, automated and rest (partial) trees of the page
    for tree in split_tree(soup):
        editorial = tree.html['tree'] == 'editorial'

        # Cycle over all links in the tree
        for a_tag in tree.find_all('a', href=True):
            # Unquote the link to be sure (history taught this)
            link = unquote(a_tag['href'].strip())
            link_text = a_tag.text.strip()
            # Some links are discarded
            if link.startswith('#'):
                # Plain #-links
                continue
            if link == '/':
                # Home page links
                continue
            if 'readspeaker' in link or 'adobe' in link or 'java' in link:
                # Non content links
                continue
            link = urljoin(resp_url, link)
            # Separate anchor from in-scope url
            if link.startswith(ROOT_URL) and '#' in link:
                link, _, anchor = link.partition('#')
            else:
                anchor = None
            if link == resp_url:
                # Discard in-page link (in case it was not a plain #-link)
                continue
            if link and editorial:
                link_type = None
                link_status = None
                # Check if the link is new for this scrape
                qry = '''
                    SELECT link_nr
                    FROM scr_links
                    WHERE url = ?
                        AND anchor IS ?
                    '''
                qry_result = mst_conn.execute(
                    qry, [link, anchor if anchor else None]).fetchone()
                if qry_result:
                    # This link was already handled during this scrape
                    link_nr = qry_result[0]
                else:
                    if re.match('^https?://', link):
                        # Get type and status of the link
                        if link.startswith(ROOT_URL):
                            link_type = 'int'
                        else:
                            # The link is external
                            link_status, link_type = check_ext_link(
                                link + '#' + anchor if anchor else link,
                                not check_all_links)
                            # If an external link redirects to an internal
                            # page, it should be scraped too. This is true for
                            # short url's, since these are strictly external
                            # relative to the root url.
                            if link_type == 'ext-to-int':
                                if link not in urls_done:
                                    urls_todo.add(link)
                    # Store the link for this scrape
                    link_nr = mst_conn.execute('''
                        INSERT INTO scr_links
                            (url, anchor, link_type, status)
                        VALUES
                            (?, ?, ?, ?)
                        ''', [link, anchor, link_type, link_status]).lastrowid
                # Save link in relation to the page (can be a duplicate if
                # the link is used more than once with the same text in the
                # same page).
                mst_conn.execute('''
                    INSERT INTO scr_ed_links (page_id, text, link_nr)
                    VALUES (?, ?, ?)
                    ''', [page_id, link_text, link_nr])
            if link:
                # Discard links that do not need to be scraped
                if link.endswith('.xml'):
                    continue
                if not link.startswith(ROOT_URL):
                    # Link not in scope
                    continue
                # Strip potential parameters (anchor was separated from
                # in-scope url before).
                valid_scrape_url = link.split('?')[0]
                if valid_scrape_url in urls_done:
                    # Already handled
                    continue
                urls_todo.add(valid_scrape_url)
            else:
                # This condition is probably not possible
                logger.warning(f'empty link in page {page_id}')

    return wcm_url


def split_tree(
        soup: BeautifulSoup) -> tuple[BeautifulSoup, BeautifulSoup,
                                      BeautifulSoup]:
    """
    **Split a soup doc in three separate docs: editorial, automated and rest.**

    Arguments:

        soup: bs4 representation of the page to be split

    Returns:

        editorial doc, automated doc, rest doc

    The *editorial doc* is a copy of the originating doc with all tags
    removed that do not contain editorial content. The *automated doc* is a
    new html (root) tag containing all tags with automated content. The *rest
    doc* is an html (root) tag containing all tags that did not represent
    editorial or automated content. No tags are lost, so all three docs
    together contain the same tags as the originating doc, but only the
    hierarchical structure of the *editorial doc* remains intact.

    To identify the separate docs, a `tree` attribute is added to the `<html>`
    tag with values 'editorial', 'automated' and 'rest' respectively.
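
    A minimal usage sketch (hypothetical page source):

        soup = BeautifulSoup(
            '<html><body><p>example</p></body></html>', features='lxml')
        ed_doc, aut_doc, rst_doc = split_tree(soup)
        ed_doc.html['tree']   # 'editorial'
        aut_doc.html['tree']  # 'automated'
        rst_doc.html['tree']  # 'rest'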

    More specifically, the three docs will contain the following tags from the
    originating doc.

    ***rest doc***:

    - `<head>`
    - `<header>`
    - `<footer>`
    - `<div id="bld-nojs">`: for situation that javascript is not active
    - `<div class="bld-subnavigatie">`: left side navigation of bib pages
    - `<div class="bld-feedback">`: bottom feedback of content page
    - `<div>` with readspeaker buttons
    - `<div>` with modal dialog for the virtual assistant
    - `<script>` and `<noscript>`
    - comments

    ***automated doc***:

    - `<div class="add_content">`: automatically added content
    - all remaining tags from pages with type bld-overview

    ***editorial doc***:

    - all tags that do not go to one of the other docs
    """

    # Make working copy from soup doc, because of destructive tag removal
    tree = copy.copy(soup)
    # Create rest doc as html trunk
    rst_tree = BeautifulSoup('<html tree="rest"></html>', features='lxml')

    # All that is not needed for the editorial or automated trees goes to rest:
    # - head, header and footer
    if tree.head:
        rst_tree.html.append(tree.head.extract())
    if tree.body.header:
        rst_tree.html.append(tree.body.header.extract())
    if tree.body.footer:
        rst_tree.html.append(tree.body.footer.extract())
    # - content for non active javascript
    div_nojs = tree.find('div', id='bld-nojs')
    if div_nojs:
        rst_tree.html.append(div_nojs.extract())
    # - sub-navigation
    div_subnav = tree.find(class_='bld-subnavigatie')
    if div_subnav:
        rst_tree.html.append(div_subnav.extract())
    # - feedback
    div_feedback = tree.find(class_='bld-feedback')
    if div_feedback:
        rst_tree.html.append(div_feedback.extract())
    # - readspeaker buttons
    for tag in tree.find_all('div', class_='rs_skip'):
        rst_tree.html.append(tag.extract())
    # - modal dialog for the virtual assistant
    for tag in tree.find_all('div', id='vaModal'):
        rst_tree.html.append(tag.extract())
    # - comments
    for tag in tree.find_all(string=lambda text: isinstance(text, Comment)):
        rst_tree.html.append(tag.extract())
    # - scripts
    for tag in tree.find_all(('script', 'noscript')):
        rst_tree.html.append(tag.extract())
    # - no-support div
    for tag in tree.find_all(id='bld-nosupport'):
        rst_tree.html.append(tag.extract())

    # Is the page generated without any editor intervention?
    if tree('body', attrs={'data-pagetype': 'bld-overview'}):
        # Then editor tree becomes trunk only
        ed_tree = BeautifulSoup(
            '<html tree="editorial"></html>', features='lxml')
        # And automated tree gets all remaining content
        aut_tree = tree
        aut_tree.html['tree'] = 'automated'
    else:
        # Make new html trunk and graft all automated content on it
        aut_tree = BeautifulSoup(
            '<html tree="automated"></html>', features='lxml')
        for tag in tree.find_all('div', class_='content_add'):
            aut_tree.html.append(tag.extract())
        # Editorial tree gets all remaining content
        ed_tree = tree
        ed_tree.html['tree'] = 'editorial'

    return ed_tree, aut_tree, rst_tree


def find_unscraped_urls(
        ts: str, urls_done: Set[str],
        historic: bool = True, requested: bool = False) -> Set[str]:
    """
    **Get url's that were not found via link crawling.**

    Arguments:

        ts: timestamp of the scrape [yymmdd-hhmm]
        urls_done: url's that have been scraped already
        historic: check all url's that were ever scraped before
        requested: check all url's that were requested during the last 8 days

    Returns:

        originating url's that responded with internal url's that were not
        scraped yet

    When scraping via link crawling is exhausted, this function should be
    used to find url's of pages that do exist, but were not scraped yet. The
    following url sets will be verified (partially depending on parameters
    from the configuration file; see further `bd_www.Config`):

    - latest unlinked url's (depending on `use_unlinked_urls` configuration
      parameter)
    - url's from the sitemap(s), located via the 'robots.txt' file (depending
      on `use_sitemap` configuration parameter)
    - url's that were scraped during all previous scrapes
    - url's that were requested during the last eight days (including today)
    - url's encountered during earlier scrapes that redirected to internal
      pages (implicitly including short url's)

    The 'robots.txt' and 'sitemap.xml' files that are used will be saved in
    case they differ from the previously saved version.

    The (fully qualified) response url's of the non-scraped pages are stored
    in the *scr_unlinked* table of the scrapes database.
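
    Typical use, mirroring the call in `scrape_site` (here with `requested`
    hard-coded instead of depending on the availability of web analytics):

        urls_todo = find_unscraped_urls(ts, urls_done,
                                        historic=True, requested=False)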
    """

    def save_scrape_file(content: str, name: str, dest_dir: Path) -> None:
        """
        **Save content to file if it differs from the most recent version.**

        Arguments:

            content: content to be saved
            name: name of the file to save to
            dest_dir: directory to save in

        The name with which the content is saved will be preceded by the
        timestamp and an underscore character.

        The content will also be saved if `dest_dir` refers to an empty
        directory.
        """
        dir_files = sorted(dest_dir.iterdir())
        if dir_files:
            with open(dir_files[-1]) as lf:
                last_content = lf.read()
        if not dir_files or content != last_content:
            with open(dest_dir / f'{ts}_{name}', 'w') as nf:
                nf.write(content)

    def living(all_urls: Set[str], set_name: str) -> Set[str]:
        """
        **Return and register url's of living pages that were not scraped yet.**

        Arguments:

            all_urls: url's to check
            set_name: identification used for logging

        Returns:

            originating url's that responded with living pages

        Apart from returning the set of originating url's that responded with
        an active internal page that was not scraped yet, the response url's
        of these pages are stored in the `scr_unlinked` table of the scrapes
        database.

        Duration and results are logged.
        """
        start = time.time()
        missed_urls = set()
        for u in all_urls - urls_done:
            # noinspection PyBroadException
            try:
                rsp = requests.get(u)
            except Exception:
                continue
            if rsp.status_code // 100 != 4 and rsp.url.startswith(ROOT_URL):
                # Url is active and responds with an internal page
                missed_urls.add(u)
                if rsp.url not in urls_done:
                    mst_conn.execute(
                        'INSERT OR IGNORE INTO scr_unlinked (url) '
                        'VALUES (?)', [rsp.url])
                    logger.debug(f"unlinked {set_name} url's: {rsp.url}")

        # Wrapping up
        secs = int(time.time() - start)
        msg = 'link crawling '
        if missed_urls:
            msg += f"missed {len(missed_urls)} {set_name} url's"
        else:
            msg += f"did not miss any {set_name} url's"
        msg += f'; checking took {secs // 60}:{secs % 60:02} min'
        logger.info(msg)
        return missed_urls

    still_to_scrape = set()

    # Verify previously unlinked url's
    if scrape_conf.use_unlinked_urls:
        qry = f'''
            WITH
                last_unlinked_pages AS (
                    SELECT DISTINCT
                        page_id,
                        last_value(unl_type) OVER (
                            PARTITION BY page_id
                            ORDER BY timestamp
                            ROWS BETWEEN UNBOUNDED PRECEDING
                                AND UNBOUNDED FOLLOWING
                        ) AS unl_type
                    FROM his_unlinked
                ),
                last_unlinked_urls AS (
                    SELECT DISTINCT
                        url,
                        last_value(unl_type) OVER (
                            PARTITION BY url
                            ORDER BY timestamp
                            ROWS BETWEEN UNBOUNDED PRECEDING
                                AND UNBOUNDED FOLLOWING
                        ) AS unl_type
                    FROM his_unlinked
                )
            SELECT '{ROOT_URL}' || path
            FROM last_unlinked_pages
            LEFT JOIN mst_paths USING (page_id)
            WHERE path NOTNULL AND unl_type != 'absent'
                UNION
            SELECT url
            FROM last_unlinked_urls
            WHERE url NOTNULL AND unl_type != 'absent'
            '''
        unlinked_urls = {r[0] for r in mst_conn.execute(qry).fetchall()}
        still_to_scrape |= living(unlinked_urls, 'unlinked')

    # Verify url's from the sitemap(s)
    if scrape_conf.use_sitemap:
        # Find the sitemap via robots.txt
        robots_url = WWW_SITE + '/robots.txt'
        response = requests.get(robots_url)
        if response.status_code != 200:
            logger.info(
                f'getting {robots_url} returned status {response.status_code}')
        else:
            robots_txt = response.text
            robots_dir = mst_dir / scrape_conf.robots_dir_name
            save_scrape_file(robots_txt, 'robots.txt', robots_dir)
            sitemap_urls = re.findall(r'Sitemap:\s(https?://\S*)', robots_txt)
            if len(sitemap_urls) == 0:
                logger.info('no sitemap.xml declared in robots.txt')
            else:
                urls_from_sitemaps = set()
                sitemap_dir = mst_dir / scrape_conf.sitemap_dir_name
                for i, url in enumerate(sitemap_urls, start=1):
                    response = requests.get(url)
                    if response.status_code != 200:
                        logger.info(f'getting {url} returned status '
                                    f'{response.status_code}')
                    else:
                        logger.info(f'sitemap found at {url}')
                        if len(sitemap_urls) == 1:
                            filename = 'sitemap.xml'
                        else:
                            filename = f'sitemap({i}).xml'
                        save_scrape_file(response.text, filename, sitemap_dir)
                        soup = BeautifulSoup(response.text, features='lxml')
                        urls_from_sitemaps |= {loc.text for loc
                                               in soup.findAll('loc')}
                still_to_scrape |= living(urls_from_sitemaps, 'sitemap')

    # Verify all historic url's
    if historic:
        qry_result = mst_conn.execute(
            'SELECT path FROM mst_paths').fetchall()
        historic_urls = {ROOT_URL + path for (path,) in qry_result}
        still_to_scrape |= living(historic_urls, 'historic')

    # Verify url's that were requested during the last eight days
    if requested:
        # noinspection PyBroadException
        # This try clause is used since Matomo is very unpredictable
        try:
            requested_urls = urls_last_days()
        except Exception:
            logger.critical('using urls_last_days() failed')
        else:
            # Check which do exist but are not crawled yet
            still_to_scrape |= living(requested_urls, 'visited')

    # Verify all previously requested url's that initiated redirects
    qry_result = mst_conn.execute(
        'SELECT DISTINCT req_url FROM his_redirs').fetchall()
    redirected_urls = {url for (url,) in qry_result}
    still_to_scrape |= living(redirected_urls, 'redirected')

    return still_to_scrape


def post_process_scrape() -> None:
    """
    **Update tables after finishing a scrape.**

    After scraping all pages using the `scrape_page` function,
    this `post_process_scrape` function needs to be used to update some
    tables in the scrapes database. This is essential before any further
    processing of the scraped data can be done.

    This post-processing consists of two steps:

    **Step 1** - Url normalisation

    Convert the full url's in the *scr_links*, *scr_redirs* and
    *scr_unlinked* tables to `page_id`'s where possible.

    ***Remark:*** *This can result in `page_id`'s of url's/pages that were
    not alive during the scrape (especially in the `scr_links` table).*

    **Step 2** - Typing unlinked pages

    Determine the type of the pages in the *scr_unlinked* table by setting
    *unl_type* to:

    - 'alias' - when the page is an alias to a wcm page
    - 'special' - when the id is in the *non_orphan_ids* table of the scrapes
      database
    - 'orphan' - in all other cases

    **Finally**

    Because some of the transactions impact the size of the scrapes database,
    it is vacuumed before returning.
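
    A minimal sketch of using this function directly after a scrape
    (illustrative only; the query merely illustrates that internal link
    url's have been replaced by `page_id`'s where possible):

        post_process_scrape()
        int_links = mst_conn.execute(
            'SELECT page_id, url, anchor FROM scr_links').fetchall()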
    """

    # Replace url's in the scr_links table with page_id's where possible and
    # combine separated anchors with remaining url's (during the scrape
    # anchors were separated from internal url's).
    mst_conn.execute(f'''
        UPDATE scr_links
        SET
            page_id = (
                SELECT page_id
                FROM mst_paths
                WHERE url = '{ROOT_URL}' || path),
            url = NULL
        WHERE url IN (SELECT '{ROOT_URL}' || path FROM mst_paths)
        ''')
    mst_conn.execute('''
        UPDATE scr_links
        SET url = url || '#' || anchor,
            anchor = NULL
        WHERE url NOTNULL AND anchor NOTNULL
        ''')

    # Replace request and redirect url's in the redirs table with page_id's
    # where possible.
    mst_conn.execute(f'''
        UPDATE scr_redirs
        SET
            req_id = (
                SELECT page_id
                FROM mst_paths
                WHERE req_url = '{ROOT_URL}' || path),
            req_url = NULL
        WHERE req_url IN
            (SELECT '{ROOT_URL}' || path FROM mst_paths)
        ''')
    mst_conn.execute(f'''
        UPDATE scr_redirs
        SET
            redir_id = (
                SELECT page_id
                FROM mst_paths
                WHERE redir_url = '{ROOT_URL}' || path),
            redir_url = NULL
        WHERE redir_url IN
            (SELECT '{ROOT_URL}' || path FROM mst_paths)
        ''')

    # Replace url's in the 'scr_unlinked' table with page_id's where possible
    # and qualify the unlinked url's.
    mst_conn.execute(f'''
        UPDATE scr_unlinked
        SET
            page_id = (
                SELECT page_id
                FROM mst_paths
                WHERE url = '{ROOT_URL}' || path),
            url = NULL
        WHERE url IN (SELECT '{ROOT_URL}' || path FROM mst_paths)
        ''')
    mst_conn.execute('''
        WITH
            -- Url-based redir chains
            redir_chains (org_url, prev_url, next_id, next_url, next_type) AS (
                SELECT
                    req_url, NULL, redir_id, redir_url, redir_type
                FROM scr_redirs
                    UNION ALL
                SELECT
                    org_url, next_url, r.redir_id, r.redir_url, r.redir_type
                FROM redir_chains AS c
                LEFT JOIN scr_redirs AS r ON c.next_url = r.req_url
                -- avoid endless loops by excluding redirs of an alias
                WHERE c.next_type <> 'alias'
            ),
            -- All redir chains ending in a wcm url
            scr_redirs_to_wcm (req_url, wcm_id, final_type) AS (
                SELECT org_url, next_id, next_type
                FROM redir_chains
                WHERE next_id IS NOT NULL AND org_url IS NOT NULL
            )
        UPDATE scr_unlinked
        SET unl_type =
            iif(page_id IN non_orphan_ids,
                'special',
                iif((SELECT final_type
                     FROM scr_redirs_to_wcm
                     WHERE url = req_url) = 'alias',
                    'alias',
                    'orphan'
                   )
               )
        ''')
    logger.info('scr_links, scr_redirs and scr_unlinked tables '
                'have been normalised')

    # Finally
    mst_conn.execute('VACUUM')


def extract_pages_info(ts: str) -> None:
    """
    **Renew content of *scr_pages_info* table with extracted information.**

    Arguments:

        ts: timestamp of the scrape [yymmdd-hhmm]

    After deleting any previous content from the *scr_pages_info* table,
    this function fills the 'extracted' data fields of this table for each
    scraped page with the following content:

    - *title* - content of `<title>` tag
    - *description* - content of `<meta name="description" content="..." />`
    - *num_h1s* - number of `<h1>` tags
    - *first_h1* - content of first `<h1>` tag
    - *language* - content of `<meta name="language" content="xx" />`
    - *modified* - content of `<meta name="DCTERMS.modified" content="..." />`
    - *pagetype* - attribute value of `<body data-pageType="...">`
    - *classes* - attribute value of `<body class="...">`
    - *ed_text* - editorial text of the page
    - *aut_text* - automated text of the page

    The *scr_pages_info* table accommodates additional fields to contain
    derived information for each page. This is further detailed in the
    `derive_pages_info` function.

    It will be logged when tags or attributes are missing or values are
    invalid.

    ***Data note:*** *Extracted information concerns data that is readily
    available within the page source. Storing this data in a separate table
    is strictly redundant, but serves faster access.*

    ***Implementation note:*** *The fields of the* scr_pages_info *table are
    defined by the constants `EXTRACTED_FIELDS` and `DERIVED_FIELDS`.
    Although some groundwork is laid to adapt the number and content of these
    fields, the actual code does not fully support such an adaptation.*
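
    A minimal sketch of inspecting the freshly extracted data (illustrative
    only; `ts` is the timestamp of the scrape):

        extract_pages_info(ts)
        pages_without_title = mst_conn.execute(
            'SELECT page_id FROM scr_pages_info '
            'WHERE title IS NULL').fetchall()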
    """

    logger.info('extracting pages info started')

    src_zip = PageSourceZip(ts)
    num_pages = len(src_zip.page_ids())
    start_time = time.time()
    mst_conn.execute('DELETE FROM scr_pages_info')

    # Cycle over all pages
    for page_num, (page_id, doc) in enumerate(src_zip.iter_pages(), start=1):
        qry = f'SELECT path FROM mst_paths WHERE page_id = {page_id}'
        path = mst_conn.execute(qry).fetchone()[0]
        soup = BeautifulSoup(doc, features='lxml')
        info = {'page_id': page_id}

        # Get title
        title = soup.title
        if not title:
            logger.warning(f'page has no <title> tag: {path}')
            title = None
        else:
            title = title.text
            if not title:
                logger.warning(f'page with empty title: {path}')
        info['title'] = title

        # Get description
        description = soup.find(attrs={'name': 'description'})
        if not description:
            # This situation occurs very often
            logger.debug(
                f'page has no <meta name="description"/> tag: {path}')
            description = None
        else:
            description = description['content']
            if not description:
                logger.warning(f'page with empty description: {path}')
        info['description'] = description

        # Get info from <h1> tags
        h1s = []
        for h1 in soup.find_all('h1'):
            h1s.append(h1.text)
        if not h1s:
            # h1 tag can be hidden in a script (e.g. search page)
            if h1s := re.findall(r'<h1>(.*)</h1>', doc):
                logger.warning(f'page with hidden h1-tag: {path}')
            else:
                logger.warning(f'page without h1: {path}')
        info['num_h1s'] = len(h1s)
        info['first_h1'] = h1s[0] if h1s else None

        # Get language
        language = soup.find('meta', attrs={'name': 'language'})
        if not language:
            logger.warning(
                f'page has no <meta name="language"/> tag: {path}')
            language = None
        else:
            language = language['content']
            if not language:
                logger.warning(f'page with empty language: {path}')
        info['language'] = language

        # Get date modified
        modified = soup.find('meta', attrs={'name': 'DCTERMS.modified'})
        if not modified:
            logger.warning(
                f'page has no tag <meta name="DCTERMS.modified"/>: {path}')
            modified = None
        else:
            try:
                modified = dt.date.fromisoformat(modified['content'])
            except ValueError:
                logger.warning(
                    f'page with improper modification date: {path}')
                modified = None
        info['modified'] = modified

        # Get type of page
        if 'data-pagetype' not in soup.body.attrs:
            logger.warning('page has no data-pagetype attribute in the '
                           f'<body> tag: {path}')
            pagetype = None
        else:
            pagetype = soup.body['data-pagetype']
            if not pagetype:
                logger.warning(
                    f'page with empty pagetype in <body> tag: {path}')
        info['pagetype'] = pagetype

        # Get classes
        if 'class' not in soup.body.attrs:
            logger.warning(
                f'page has no class attribute in the <body> tag: {path}')
            classes = None
        else:
            classes = soup.body['class']
            if not classes:
                logger.warning(
                    f'page with empty class in <body> tag: {path}')
        info['classes'] = ' '.join(classes) if classes else None

        # Get editorial and automated texts
        info['ed_text'], info['aut_text'] = get_text(soup)

        # add info to the database
        fields = ', '.join(info)
        q_marks = ('?, ' * len(info))[:-2]
        mst_conn.execute(
            f'INSERT INTO scr_pages_info ({fields}) VALUES ({q_marks})',
            list(info.values()))

        # print progress and prognosis
        if page_num % 250 == 0:
            page_time = (time.time() - start_time) / page_num
            togo_time = int((num_pages - page_num) * page_time)
            print(
                f'extracting pages info '
                f'- togo: {num_pages - page_num} pages / '
                f'{togo_time // 60}:{togo_time % 60:02} min')

    logger.info('extracting pages info finished')


def get_text(soup: BeautifulSoup) -> tuple[str, str]:
    """
    **Retrieve essential editorial and automated text content of a page.**

    Arguments:

        soup: bs4 representation of a page

    Returns:

        editorial text, automated text

    The editorial and automated text of the page content is returned together
    as a twofold tuple. Basically the relevant texts are retrieved from
    partial trees containing only tags with editorial or automated content
    respectively. Whitespace within these texts is normalised and coherent
    chunks are separated by newlines.
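
    A minimal usage sketch (illustrative only; `doc` is assumed to hold the
    html source of a page):

        soup = BeautifulSoup(doc, features='lxml')
        ed_text, aut_text = get_text(soup)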
    """

    result = []
    for tree in split_tree(soup):
        if tree.html['tree'] == 'rest':
            continue

        flatten_tagbranch_to_navstring(tree.html)

        # Replace non-breaking spaces with normal ones
        txt = tree.text.replace(b'\xc2\xa0'.decode(), ' ')

        # Substitute one space for any cluster of whitespace chars (getting rid
        # of returns, newlines, tabs, spaces, etc.; this is html, you know!).
        txt = re.sub(r'\s+', ' ', txt)

        # Change #br# markers (introduced while flattening the branches) to
        # newlines, while reducing multiples separated by whitespace only.
        # The final strip() removes potential trailing newlines.
        txt = re.sub(r'\s*(#br#\s*)+\s*', r'\n', txt).strip()

        result.append(txt)
        tree.decompose()

    return result


def flatten_tagbranch_to_navstring(tag: Tag) -> None:
    """
    **Reduce a complete tag branch to one NavigableString.**

    Arguments:

        tag: part of BeautifulSoup structure that will be reduced

    Returns:

        None (tag is replaced in place with one NavigableString)

    The reduction is realised within the BeautifulSoup data structure that
    the tag is part of. This means that the function replaces the tag branch
    (in place) into a single NavigableString containing all text of the
    complete tag branch.

    The function uses a recursive tree traversal algorithm with a
    NavigableString as leaf. Each instance of the function will combine the
    text content of all children into one NavigableString. Within this string
    all `<br>` tags are replaced by `#br#` markers. The text content of all
    former `<p>`, `<h1>`, `<h2>`, `<h3>`, `<li>` and `<div>` tags in the tag
    branch is enclosed between two `#br#` markers in the resulting
    NavigableString. As such `#br#` markers act as separators between logical
    chunks of text. Due to the recursive flattening process the resulting
    NavigableString may contain multiple consecutive `#br#` markers. Since
    layout is of no concern, however, this has no significance.
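
    A small illustration (hypothetical input; the resulting text shown in
    the comment is approximate):

        soup = BeautifulSoup('<div><p>Hi <a href="#">there</a></p></div>',
                             features='lxml')
        flatten_tagbranch_to_navstring(soup.html)
        # soup.text now reads something like '#br##br#Hi  there#br##br#'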
    """

    # Final leaf cases; done with this branch
    if type(tag) in {NavigableString, Stylesheet}:
        return

    # Does this tag have children other than NavigableStrings?
    tag_children = list(tag.children)
    child_types = {type(c) for c in tag_children}
    if tag_children and child_types != {NavigableString}:
        # Flatten recursively all child branches to NavigableStrings
        for c in tag_children:
            # noinspection PyTypeChecker
            flatten_tagbranch_to_navstring(c)

    # At this point all children (if any) of tag are NavigableStrings
    tag_name = tag.name
    if tag_name == 'br':
        tag.replace_with('#br#')
    elif tag_name == 'a':
        tag.replace_with(f' {tag.text}')  # The leading space is significant
    elif tag_name in {'p', 'h1', 'h2', 'h3', 'li', 'div'}:
        tag_text = tag.text
        tag.replace_with(f'#br#{tag_text}#br#')
    else:
        tag.replace_with(tag.text)

    return


def derive_pages_info() -> None:
    """
    **Add derived information for all pages to the *scr_pages_info* table.**

    Derived information as such is not available within a page,
    but calculated or interpreted from other information. To derive this
    information, the extracted information should already be available in the
    *scr_pages_info* table. This can be accomplished by using the
    `extract_pages_info` function.

    After deleting any previous content from the *derived* data fields of the
    *scr_pages_info* table, new content is added for each of these fields as
    follows:

    - *business* - 'belastingen', 'toeslagen' or 'douane', depending on the
      extracted *classes* field
    - *category* - 'dv', 'bib' or 'alg', depending on the extracted *pagetype*
      field, mapped on the basis of the constants `DV_TYPES`, `BIB_TYPES` and
      `ALG_TYPES`; the *category* of the 'bld-wrapper' *pagetype* however is
      determined by the common *category* of the pages that link to the
      specific wrapper page (falling back to 'alg' when those categories
      differ)

    It will be logged when info cannot be derived due to inconsistent or
    unavailable information.
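
    A minimal sketch of the intended call order (illustrative only; `ts` is
    the timestamp of the scrape):

        extract_pages_info(ts)
        derive_pages_info()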
    """

    # Clear the derived field contents
    set_cols = ', '.join([f'{f[0]} = NULL' for f in DERIVED_FIELDS])
    mst_conn.execute(f'UPDATE scr_pages_info SET {set_cols}')

    # Prepare some queries
    bus_qry = 'UPDATE scr_pages_info SET business = ? WHERE page_id = ?'
    cat_qry = 'UPDATE scr_pages_info SET category = ? WHERE page_id = ?'
    for_qry = '''
        SELECT page_id, pagetype, classes
        FROM scr_pages_info
        ORDER BY CASE pagetype WHEN 'bld-wrapper' THEN 2 ELSE 1 END
        '''
    cat_groups_qry = '''
        SELECT category
        FROM scr_ed_links
        LEFT JOIN scr_links AS l USING (link_nr)
        LEFT JOIN scr_pages_info USING (page_id)
        WHERE l.page_id = ?
        GROUP BY category
        '''
    wrappers_without_cat = set()
    logger.info('deriving pages info started')

    # Cycle over all pages, with wrapper pages after all others, because
    # the wrapper category is determined by the categories of pages
    # linking to that wrapper page.
    for page_id, pagetype, classes in mst_conn.execute(for_qry):

        # Determine business
        if classes:
            if 'toeslagen' in classes or 'tsl' in classes:
                business = 'toeslagen'
            elif 'dou' in classes:
                business = 'douane'
            else:
                business = 'belastingen'
        else:
            business = None
        mst_conn.execute(bus_qry, [business, page_id])

        # Determine category: dv, bib or alg
        if pagetype in DV_TYPES:
            category = 'dv'
        elif pagetype in BIB_TYPES:
            category = 'bib'
        elif pagetype in ALG_TYPES:
            category = 'alg'
        elif pagetype == 'bld-wrapper':
            # Group categories of all pages that link to this wrapper page
            categories = mst_conn.execute(
                cat_groups_qry, [page_id]).fetchall()
            if len(categories) == 1:
                # All pages linking to this wrapper have the same category
                category = categories[0][0]
                if not category:
                    # Probably because other uncategorized wrappers are
                    # linking to this one. Save its page_id for
                    # post-processing.
                    wrappers_without_cat.add(page_id)
            else:
                category = 'alg'
        else:
            category = None
        mst_conn.execute(cat_qry, [category, page_id])

    # The main reason that wrappers do not get a category in the main loop
    # is that pages that link to these wrappers are wrappers themselves that
    # had no category yet when asked for in the main loop. Repeating the loop
    # for the category of these pages will resolve most of them. To avoid
    # endless looping when wrappers without categories link to each other,
    # the number of cycles is capped (crude but effective).
    max_cycles = len(wrappers_without_cat) * 3
    cycle = 0
    while wrappers_without_cat and cycle < max_cycles:
        cycle += 1
        page_id = wrappers_without_cat.pop()
        # Algorithm for category resolution is the same as in the main loop
        categories = mst_conn.execute(cat_groups_qry, [page_id]).fetchall()
        if len(categories) == 1:
            category = categories[0][0]
            if not category:
                wrappers_without_cat.add(page_id)
                continue
        else:
            category = 'alg'
        mst_conn.execute(cat_qry, [category, page_id])

    # Log the remaining cases and set their category to 'alg' to avoid NULLs
    # in the database.
    for page_id in wrappers_without_cat:
        path = mst_conn.execute(f'SELECT path FROM mst_paths '
                                f'WHERE page_id = {page_id}').fetchone()[0]
        logger.warning(
            f"unresolved category of wrapper set to 'alg': {path}")
        mst_conn.execute(cat_qry, ['alg', page_id])

    logger.info('deriving pages info finished')


def status_figures() -> None:
    """
    **Renew typical scrape status figures of the *scr_status_figures* table.**

    After deleting any previous content from the *scr_status_figures* table,
    the following numbers of pages, redirs, url-aliases and ed-links will be
    added to this table:

    - pages: all pages
    - pages_lang_`<language>`: per *language*
    - pages_buss_`<business>`: per *business*
    - pages_cat_`<category>`: per *category*
    - pages_type_`<pagetype>`: per *pagetype*
    - pages_h1_multi: with more than one h1-tag
    - pages_h1_multi_`<pagetype>`: with more than one h1-tag per *pagetype*
    - pages_h1_no: without h1-tag
    - pages_title_no: without or with empty title-tag
    - pages_title_dupl: with non-unique title-tag
    - pages_descr_no: without or with empty description meta-tag
    - pages_descr_long: with description meta-tag longer than 160 characters
    - redirs: total number of all redirects
    - redirs_`<type>`: number of redirects per *redir_type*
    - redirs_`<type>`_slash: redirects per *redir_type* that only differ by a
      trailing slash
    - redirs_wcm-url: wcm url's that get redirected
    - url-aliases: url's that alias a wcm url
    - url-aliases_`<num>`x: url's with `<num>` aliases
    - ed-links_`<int|ext>`: internal|external editorial links
    - ed-links_`<int|ext>`_uniq: unique internal|external editorial links
    - ed-links_`<int|ext>`_avg: internal|external editorial links per page
    - ed-links_int_redir: redirected internal editorial links
    - ed-links_int_non-wcm: internal editorial links to non-wcm url's
    - ed-links_int_orphan-pages: number of unlinked (i.e. orphan) pages
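
    A minimal sketch of retrieving the renewed figures afterwards
    (illustrative only):

        status_figures()
        figures = dict(mst_conn.execute(
            'SELECT name, value FROM scr_status_figures').fetchall())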
    """

    mst_conn.execute('DELETE FROM scr_status_figures')
    ins_qry = 'INSERT INTO scr_status_figures (name, value) VALUES (?, ?)'

    # Total pages
    qry = '''
        SELECT count(*)
        FROM scr_pages_info
        '''
    num_pages = mst_conn.execute(qry).fetchone()[0]
    mst_conn.execute(ins_qry, ['pages', num_pages])

    # Pages per language
    qry = '''
        SELECT language, count(*)
        FROM scr_pages_info
        GROUP BY language
        ORDER BY language DESC
        '''
    for language, count in mst_conn.execute(qry).fetchall():
        mst_conn.execute(ins_qry, [f'pages_lang_{language}', count])

    # Pages per business
    qry = '''
        SELECT business, count(*)
        FROM scr_pages_info
        GROUP BY business
        ORDER BY business
        '''
    for business, count in mst_conn.execute(qry).fetchall():
        mst_conn.execute(ins_qry, [f'pages_buss_{business}', count])

    # Pages per category
    qry = '''
        SELECT category, count(*)
        FROM scr_pages_info
        GROUP BY category
        ORDER BY category DESC
        '''
    for category, count in mst_conn.execute(qry).fetchall():
        mst_conn.execute(ins_qry, [f'pages_cat_{category}', count])

    # Pages per type
    qry = '''
        SELECT pagetype, count(*)
        FROM scr_pages_info
        GROUP BY pagetype
        ORDER BY category DESC, count(*) ASC
        '''
    for pagetype, count in mst_conn.execute(qry).fetchall():
        mst_conn.execute(ins_qry, [f'pages_type_{pagetype}', count])

    # Pages with more than one h1
    qry = '''
        SELECT count(*)
        FROM scr_pages_info
        WHERE num_h1s > 1
        '''
    mst_conn.execute(ins_qry, ['pages_h1_multi',
                               mst_conn.execute(qry).fetchone()[0]])

    # Pages per type with more than one h1
    qry = '''
        SELECT pagetype, count(*)
        FROM scr_pages_info
        WHERE num_h1s > 1
        GROUP BY pagetype
        '''
    for pagetype, count in mst_conn.execute(qry).fetchall():
        mst_conn.execute(ins_qry, [f'pages_h1_multi_{pagetype}', count])

    # Pages with no h1
    qry = '''
        SELECT count(*)
        FROM scr_pages_info
        WHERE num_h1s = 0
        '''
    mst_conn.execute(ins_qry, ['pages_h1_no',
                               mst_conn.execute(qry).fetchone()[0]])

    # Pages without title
    qry = '''
        SELECT count(*)
        FROM scr_pages_info
        WHERE title = '' or title is NULL
        '''
    mst_conn.execute(ins_qry, ['pages_title_no',
                               mst_conn.execute(qry).fetchone()[0]])

    # Pages with non unique title
    qry = '''
        WITH
            title_freq AS (
                SELECT count(*) as c
                FROM scr_pages_info
                GROUP BY title
            )
        SELECT ifnull(sum(c), 0)
        FROM title_freq
        WHERE c > 1
        '''
    mst_conn.execute(ins_qry, ['pages_title_dupl',
                               mst_conn.execute(qry).fetchone()[0]])

    # Pages without description
    qry = '''
        SELECT count(*)
        FROM scr_pages_info
        WHERE description = '' OR description IS NULL
        '''
    mst_conn.execute(ins_qry, ['pages_descr_no',
                               mst_conn.execute(qry).fetchone()[0]])

    # Pages with description longer than 160 characters
    qry = '''
        SELECT count(*)
        FROM scr_pages_info
        WHERE length(description) > 160
        '''
    mst_conn.execute(ins_qry, ['pages_descr_long',
                               mst_conn.execute(qry).fetchone()[0]])

    # Total redirects (an alias is not a redirect)
    qry = '''
        SELECT count(*)
        FROM scr_redirs
        WHERE redir_type != 'alias'
        '''
    mst_conn.execute(ins_qry, ['redirs',
                               mst_conn.execute(qry).fetchone()[0]])

    # Redirects per type
    qry = '''
        SELECT redir_type, count(*)
        FROM scr_redirs
        WHERE redir_type != 'alias'
        GROUP BY redir_type
        '''
    for redir_type, count in mst_conn.execute(qry).fetchall():
        mst_conn.execute(ins_qry, [f'redirs_{redir_type}', count])

    # Redirects per type that only add or lose the trailing slash
    qry = '''
        SELECT redir_type, count(*)
        FROM scr_redirs
        WHERE req_url || '/' = redir_url OR req_url = redir_url || '/'
        GROUP BY redir_type
        '''
    for redir_type, count in mst_conn.execute(qry).fetchall():
        mst_conn.execute(ins_qry, [f'redirs_{redir_type}_slash', count])

    # Scrape wcm url's that are redirected
    qry = '''
        SELECT count(*)
        FROM scr_redirs
        WHERE req_id IN scr_page_ids
        '''
    mst_conn.execute(ins_qry, ['redirs_wcm-url',
                               mst_conn.execute(qry).fetchone()[0]])

    # Total aliases
    qry = '''
        SELECT count(*)
        FROM scr_redirs
        WHERE redir_type = 'alias'
        '''
    mst_conn.execute(ins_qry, ['url-aliases',
                               mst_conn.execute(qry).fetchone()[0]])

    # Frequency of aliases
    qry = f'''
        WITH
            redirs_full_urls AS (
                SELECT
                    ifnull(req_url, '{ROOT_URL}' || p1.path) AS req_url,
                    ifnull(redir_url, '{ROOT_URL}' || p2.path) AS redir_url,
                    redir_type
                FROM scr_redirs
                LEFT JOIN mst_paths AS p1 ON req_id = p1.page_id
                LEFT JOIN mst_paths AS p2 ON redir_id = p2.page_id
            ),
            alias_freq AS (
                SELECT redir_url, count(*) AS aliases_per_url
                FROM redirs_full_urls
                WHERE redir_type = 'alias'
                GROUP BY redir_url
            )
        SELECT aliases_per_url, count(*)
        FROM alias_freq
        GROUP BY aliases_per_url
        '''
    for alias_per_url, count in mst_conn.execute(qry).fetchall():
        mst_conn.execute(ins_qry, [f'url-aliases_{alias_per_url}x', count])

    # Total, unique and average number of internal links
    qry = '''
        SELECT count(*)
        FROM scr_ed_links
        LEFT JOIN scr_links USING (link_nr)
        WHERE link_type = 'int' OR link_type = 'ext-to-int'
        '''
    ed_links_int = mst_conn.execute(qry).fetchone()[0]
    qry = '''
        SELECT count(*)
        FROM scr_links
        WHERE link_type = 'int' OR link_type = 'ext-to-int'
        '''
    ed_links_int_uniq = mst_conn.execute(qry).fetchone()[0]
    mst_conn.execute(ins_qry, ['ed-links_int', ed_links_int])
    mst_conn.execute(ins_qry, ['ed-links_int_uniq', ed_links_int_uniq])
    mst_conn.execute(ins_qry, ['ed-links_int_avg',
                               round(ed_links_int / num_pages, 3)])

    # Number of internal editorial links that are redirected. This count
    # includes wcm url's that are no longer alive but get redirected instead.
    qry = '''
        SELECT count(*)
        FROM scr_ed_links
        LEFT JOIN scr_links AS l USING (link_nr)
        LEFT JOIN scr_redirs ON l.url IS req_url AND l.page_id IS req_id
        WHERE (redir_id NOTNULL OR redir_url NOTNULL) AND redir_type != 'alias'
        '''
    links_redirected = mst_conn.execute(qry).fetchone()[0]
    mst_conn.execute(ins_qry, ['ed-links_int_redir', links_redirected])

    # Number of internal editorial links that do not refer to a wcm url
    qry = '''
        SELECT count(*)
        FROM scr_ed_links
        LEFT JOIN scr_links AS l USING (link_nr)
        LEFT JOIN scr_redirs ON l.url = req_url
        WHERE redir_type = 'alias'
        '''
    links_non_wcm = mst_conn.execute(qry).fetchone()[0]
    mst_conn.execute(ins_qry, ['ed-links_int_non-wcm', links_non_wcm])

    # Number of orphan pages
    qry = 'SELECT count(*) FROM scr_unlinked WHERE unl_type = "orphan"'
    num_orphans = mst_conn.execute(qry).fetchone()[0]
    mst_conn.execute(ins_qry, ['ed-links_int_orphan-pages', num_orphans])

    # Total, unique and average number of external links
    qry = '''
        SELECT count(*)
        FROM scr_ed_links
        LEFT JOIN scr_links USING (link_nr)
        WHERE link_type = 'ext' OR link_type = 'ext-to-ext'
        '''
    ed_links_ext = mst_conn.execute(qry).fetchone()[0]
    qry = '''
        SELECT count(*)
        FROM scr_links
        WHERE link_type = 'ext' OR link_type = 'ext-to-ext'
        '''
    ed_links_ext_uniq = mst_conn.execute(qry).fetchone()[0]
    mst_conn.execute(ins_qry, ['ed-links_ext', ed_links_ext])
    mst_conn.execute(ins_qry, ['ed-links_ext_uniq', ed_links_ext_uniq])
    mst_conn.execute(ins_qry, ['ed-links_ext_avg',
                               round(ed_links_ext / num_pages, 3)])

    logger.info('status figures saved to scr_status_figures table')


def add_scraped_data(ts: str) -> bool:
    """
    **Add a fresh scrape to the scrapes history pages.**

    Arguments:

        ts: timestamp of the scrape [yymmdd-hhmm]

    Returns:

        success of the operation

    While scraping all pages using the `scrape_page` function, new pages that 
    had no earlier existence have been registered in the *mst_paths* table.
    Apart from the scrape results that are stored as files, the essential
    remaining results of the scrape reside in the following tables of the
    scrapes database:

    - *scr_redirs*
    - *scr_links*
    - *scr_ed_links*
    - *scr_unlinked*

    Before adding the scraped results to the history tables, the data in
    these tables has to be normalised and completed using the following
    functions in succession (as sketched at the end of this description):

    - `post_process_scrape` - to normalise the *scr_redirs*, *scr_links*
      and *scr_unlinked* tables and qualify unlinked pages
    - `extract_pages_info` - to extract information from the page sources to
      the *scr_pages_info* table
    - `derive_pages_info` - to complete the content of the *scr_pages_info*
      table
    - `status_figures` - to refresh the *scr_status_figures* table

    After using this function all data of the scrape (as stored in the scrape
    tables) has been merged into the various history tables and new links
    have been added to the *mst_links* table. The scrape tables remain
    unaltered afterwards, but can be safely deleted/overwritten.

    This function can only be used for a scrape that has a later timestamp
    than any other scrape in history. Otherwise it will log an error and
    return `False`.
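
    The complete post-scrape sequence referred to above, as a minimal sketch
    (illustrative only; `ts` is the timestamp of the scrape):

        post_process_scrape()
        extract_pages_info(ts)
        derive_pages_info()
        status_figures()
        success = add_scraped_data(ts)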
    """

    # Check if this scrape is later than the youngest in the history tables.
    qry = '''
        SELECT name
        FROM sqlite_schema
        WHERE type = 'table' AND name LIKE 'his_%'
        '''
    last_ok = True
    for table in [r[0] for r in mst_conn.execute(qry)]:
        max_ts = mst_conn.execute(
            f'SELECT max(timestamp) FROM {table}').fetchone()[0]
        if max_ts and max_ts > ts:
            last_ok = False
            logger.error(
                f'scrape {ts} predates latest history in the {table} table')
    if not last_ok:
        logger.error(f'scrape {ts} not added to history')
        return False

    logger.info('started adding scrape results to master and history tables')

    # Add scrape redirs results to the his_redirs table
    cte = '''
        WITH
            last_redirs AS (
                SELECT DISTINCT
                    req_id, req_url,
                    last_value(redir_id) OVER win AS redir_id,
                    last_value(redir_url) OVER win AS redir_url,
                    last_value(redir_type) OVER win AS redir_type
                FROM his_redirs
                WINDOW win AS (
                    PARTITION BY req_id, req_url
                    ORDER BY timestamp
                    RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
                )
            )
        '''
    # Save redirs that are new or have been changed
    mst_conn.execute(cte + f'''
        INSERT INTO his_redirs
            (timestamp, req_id, req_url, redir_id, redir_url, redir_type)
        SELECT '{ts}', s.*
        FROM scr_redirs AS s
        LEFT JOIN last_redirs AS m
            ON s.req_id IS m.req_id
            AND s.req_url IS m.req_url
        WHERE s.redir_id IS NOT m.redir_id
            OR s.redir_url IS NOT m.redir_url
            OR s.redir_type IS NOT m.redir_type
        ''')
    # Save redirs that disappeared with redir_type 'absent'
    mst_conn.execute(cte + f'''
        INSERT INTO his_redirs
            (timestamp, req_id, req_url, redir_type)
        SELECT
            '{ts}', m.req_id, m.req_url, 'absent'
        FROM last_redirs AS m
        LEFT JOIN scr_redirs AS s
            ON m.req_id IS s.req_id
            AND m.req_url IS s.req_url
        WHERE m.redir_type != 'absent'
            AND s.req_id ISNULL
            AND s.req_url ISNULL
        ''')
    logger.info('his_redirs table updated')

    # Add new links to the mst_links table...
    mst_conn.executescript('''
        INSERT INTO mst_links (url, anchor, page_id)
        SELECT s.url, s.anchor, s.page_id
        FROM scr_links AS s
        LEFT JOIN mst_links AS m
            ON s.url IS m.url
            AND s.anchor IS m.anchor
            AND s.page_id IS m.page_id
        WHERE m.link_id ISNULL
        ''')
    # ... and save the link_id to the scr_links table for later use.
    mst_conn.execute('''
        UPDATE scr_links AS s
        SET link_id = (
            SELECT link_id
            FROM mst_links AS m
            WHERE m.url IS s.url
                AND m.page_id IS s.page_id
                AND m.anchor IS s.anchor
            )
        ''')
    logger.info('mst_links table updated')

    # Add status for new links or links of which the status changed
    mst_conn.execute(f'''
        WITH
            last_links_status AS (
                SELECT DISTINCT
                    link_id,
                    last_value(link_type) OVER win AS link_type,
                    last_value(status) OVER win AS status
                FROM his_links_status
                WINDOW win AS (
                    PARTITION BY link_id
                    ORDER BY timestamp
                    RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
                )
            )
        INSERT INTO his_links_status (timestamp, link_id, link_type, status)
        SELECT '{ts}', link_id, s.link_type, s.status
        FROM scr_links AS s
        LEFT JOIN last_links_status AS l USING (link_id)
        WHERE (l.link_id ISNULL 
               OR s.link_type IS NOT l.link_type 
               OR s.status IS NOT l.status)
            AND NOT
                (l.status NOTNULL AND l.status LIKE 'error%' 
                 AND s.status NOTNULL AND s.status LIKE 'error%')
        ''')
    logger.info('his_links_status table updated')

    # Add records for new editorial links and links that have a different
    # number of appearances in a page.
    cte = '''
        WITH
            scr_ed_links_normalised AS (
                SELECT e.page_id, text, link_id, count(*) AS present
                FROM scr_ed_links AS e
                LEFT JOIN scr_links USING (link_nr)
                GROUP BY e.page_id, text, link_id
            ),
            last_ed_links AS (
                SELECT DISTINCT
                    page_id, text, link_id,
                    last_value(present) OVER win AS present
                FROM his_ed_links
                WINDOW win AS (
                    PARTITION BY page_id, text, link_id
                    ORDER BY timestamp
                    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
                )
            )
        '''
    mst_conn.execute(cte + f'''
        INSERT INTO his_ed_links
            (timestamp, page_id, text, link_id, present)
        SELECT '{ts}', page_id, text, link_id, s.present
        FROM scr_ed_links_normalised AS s
        LEFT JOIN last_ed_links AS l USING (page_id, text, link_id)
        WHERE l.present ISNULL
            OR s.present != l.present
        ''')
    # Add records for editorial links that disappeared
    mst_conn.execute(cte + f'''
        INSERT INTO his_ed_links
            (timestamp, page_id, text, link_id, present)
        SELECT '{ts}', page_id, text, link_id, 0
        FROM last_ed_links AS l
        LEFT JOIN scr_ed_links_normalised AS s USING (page_id, text, link_id)
        WHERE l.present > 0 AND s.present ISNULL
        ''')
    logger.info('his_ed_links table updated')

    # Update his_unlinked table with the scrape results
    cte = f'''
        WITH
            last_unlinked AS (
                SELECT DISTINCT
                    page_id, url,
                    last_value(unl_type) OVER win AS unl_type
                FROM his_unlinked
                WINDOW win AS (
                    PARTITION BY page_id, url
                    ORDER BY timestamp
                    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
                )
            )
        '''
    # Save unlinked pages that are new or whose type changed
    mst_conn.execute(cte + f'''
        INSERT INTO his_unlinked (timestamp, page_id, url, unl_type)
        SELECT '{ts}', s.page_id, s.url, s.unl_type
        FROM scr_unlinked AS s
        LEFT JOIN last_unlinked AS l
            ON s.page_id IS l.page_id
            AND s.url IS l.url
        WHERE s.unl_type IS NOT l.unl_type  -- covers new and changed
        ''')
    # Mark unlinked pages that disappeared with unl_type 'absent'
    mst_conn.execute(cte + f'''
        INSERT INTO his_unlinked (timestamp, page_id, url, unl_type)
        SELECT '{ts}', l.page_id, l.url, 'absent'
        FROM last_unlinked AS l
        LEFT JOIN scr_unlinked AS s
            ON l.page_id IS s.page_id
            AND l.url IS s.url
        WHERE l.unl_type != 'absent' AND s.unl_type ISNULL
        ''')
    logger.info('his_unlinked table updated')

    # Update his_pages_life table with the scrape results
    cte = '''
        WITH
            last_pages_life AS (
                SELECT DISTINCT
                    page_id,
                    last_value(alive) OVER (
                        PARTITION BY page_id
                        ORDER BY timestamp
                        RANGE BETWEEN UNBOUNDED PRECEDING
                            AND UNBOUNDED FOLLOWING
                    ) AS alive
                FROM his_pages_life
            )
        '''
    # Mark pages alive that are new or became alive again
    mst_conn.execute(cte + f'''
        INSERT INTO his_pages_life (timestamp, page_id, alive)
        SELECT '{ts}', page_id, TRUE
        FROM scr_page_ids
        LEFT JOIN last_pages_life AS l USING (page_id)
        WHERE l.page_id ISNULL OR NOT alive 
        ''')
    # Register pages that are no longer alive
    mst_conn.execute(cte + f'''
        INSERT INTO his_pages_life (timestamp, page_id, alive)
        SELECT '{ts}', page_id, FALSE
        FROM last_pages_life
        LEFT JOIN scr_page_ids AS s USING (page_id)
        WHERE alive AND s.page_id ISNULL
        ''')
    logger.info('his_pages_life table updated')

    # Update his_short_urls table with the scrape results
    cte = f'''
        WITH
            last_short_urls AS (
                SELECT DISTINCT
                    url,
                    last_value(active) OVER (
                        PARTITION BY url
                        ORDER BY timestamp
                        RANGE BETWEEN UNBOUNDED PRECEDING
                            AND UNBOUNDED FOLLOWING
                    ) AS active
                FROM his_short_urls
            ),
            scr_short_urls AS (
                SELECT req_url AS url
                FROM scr_redirs
                WHERE req_url NOT LIKE '%{ROOT_URL.partition('.')[2]}%'
            )
        '''
    # Mark short-url's active that are new or became active again
    mst_conn.execute(cte + f'''
        INSERT INTO his_short_urls (timestamp, url, active)
        SELECT '{ts}', url, TRUE
        FROM scr_short_urls
        LEFT JOIN last_short_urls USING (url)
        WHERE active ISNULL OR NOT active
        ''')
    num_active = mst_conn.execute(f'''
        SELECT count(*)
        FROM his_short_urls
        WHERE timestamp = '{ts}' AND active
        ''').fetchone()[0]
    # Register short-url's that are no longer active
    mst_conn.execute(cte + f'''
        INSERT INTO his_short_urls (timestamp, url, active)
        SELECT '{ts}', url, FALSE
        FROM last_short_urls
        LEFT JOIN scr_short_urls AS s USING (url)
        WHERE active AND s.url ISNULL
        ''')
    num_inactive = mst_conn.execute(f'''
        SELECT count(*)
        FROM his_short_urls
        WHERE timestamp = '{ts}' AND NOT active
        ''').fetchone()[0]
    logger.info(f'his_short_urls table updated: '
                f'registered {num_active} active and {num_inactive} inactive')

    # Update his_pages_info table for pages with changed aspects
    sql = 'SELECT name FROM pragma_table_info("his_pages_info")'
    page_aspects = [row[0] for row in mst_conn.execute(sql).fetchall()
                    if row[0] != 'timestamp']
    fields = ', '.join(page_aspects)
    mst_conn.execute(f'''
        WITH
            last_pages_info_timestamps AS (
                SELECT DISTINCT
                page_id,
                last_value(timestamp) OVER (
                    PARTITION BY page_id
                    ORDER BY timestamp
                    RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
                ) AS timestamp
                FROM his_pages_info
            ),
            last_pages_info AS (
                SELECT {fields}
                FROM last_pages_info_timestamps
                LEFT JOIN his_pages_info USING (timestamp, page_id)
            ),
            changed_pages_info AS (
                SELECT *
                FROM scr_pages_info
                    EXCEPT
                SELECT *
                FROM last_pages_info
            )
        INSERT INTO his_pages_info
        SELECT '{ts}', *
        FROM changed_pages_info
        ''')
    logger.info('his_pages_info table updated')

    # Add scrape status figures to the his_status_figures table
    mst_conn.execute(f'''
        INSERT INTO his_status_figures (timestamp, name, value)
        SELECT '{ts}', name, value
        FROM scr_status_figures
        ''')
    logger.info('his_status_figures table updated with scrape results')
    logger.info('concluded adding scrape results to master and history tables')
    return True


def check_storage_integrity(check_page_ids: bool = False) -> None:
    """
    **Check integrity of the scrapes storage.**

    Arguments:

        check_page_ids: check the page_id/path relation

    The following checks will be executed:

    - available zip files with page sources against valid scrapes
    - validity of timestamps of site reports
    - validity of timestamps of 'sitemap.xml' files
    - validity of timestamps of 'robots.txt' files
    - page_id/path relation of all zip sources (depending on the actual value
      of the `check_page_ids` parameter; this will take quite some time)

    All detected exceptions will be logged; no data will be altered.
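
    A minimal usage sketch (illustrative only; the page_id/path check is
    skipped here because it takes considerably longer):

        check_storage_integrity(check_page_ids=False)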
    """

    # Setup logging
    global logger
    logger = logging.getLogger('check')
    logger.setLevel(logging.INFO)
    fh = logging.FileHandler(str(mst_dir / 'check.log'))
    fh.setLevel(logging.INFO)
    ch = logging.StreamHandler()
    ch.setLevel(logging.INFO)
    formatter = logging.Formatter(
        fmt='[%(asctime)s] %(levelname)-8s - %(message)s',
        datefmt='%Y-%m-%d %H:%M:%S')
    fh.setFormatter(formatter)
    ch.setFormatter(formatter)
    logger.addHandler(fh)
    logger.addHandler(ch)
    
    # Check page sources against the valid scrapes
    valid_tss = set(valid_scrapes())
    src_dir = mst_dir / scrape_conf.src_dir_name
    src_tss = set()
    for file in src_dir.iterdir():
        if file.suffix == '.zip':
            ts = file.name[:11]
            src_tss.add(ts)
    if src_tss != valid_tss:
        if scrs_no_zip := valid_tss - src_tss:
            logger.warning(
                f'{len(scrs_no_zip)} scrapes without source zips: '
                f'{", ".join(scrs_no_zip)}')
        if zips_no_scr := src_tss - valid_tss:
            logger.warning(
                f'{len(zips_no_scr)} source zips without scrape: '
                f'{", ".join(zips_no_scr)}')
    logger.info('checked page source zip files')

    # Check site-reports against valid scrapes
    for report in report_conf.reports:
        report_conf.spec_report(report)
        rep_dir = mst_dir / report_conf.report_dir_name
        rep_tss = set()
        for item in rep_dir.iterdir():
            if item.is_dir():
                for file in item.iterdir():
                    ts = file.name[:11]
                    if ts not in valid_tss:
                        logger.warning(f'{report} report with invalid '
                                       f'timestamp: {file.name}')
                    else:
                        rep_tss.add(ts)
        if scrs_no_rep := valid_tss - rep_tss:
            logger.warning(
                f'scrapes without {report} reports: {", ".join(scrs_no_rep)}')
    logger.info('checked site reports')

    # Check sitemaps against valid scrapes
    map_dir = mst_dir / scrape_conf.sitemap_dir_name
    for file in map_dir.iterdir():
        ts = file.name[:11]
        if ts not in valid_tss:
            logger.warning(f'sitemap with invalid timestamp: {file.name}')
    logger.info('checked sitemap files')

    # Check robots files against valid scrapes
    robot_dir = mst_dir / scrape_conf.robots_dir_name
    for file in robot_dir.iterdir():
        ts = file.name[:11]
        if ts not in valid_tss:
            logger.warning(f'robots file with invalid timestamp: {file.name}')
    logger.info('checked robots files')

    # Check page_id/path relations of all source zip files
    if check_page_ids:
        logger.info('started checking page_id/path relations of all scrapes; '
                    'this will take some time')
        for ts in sorted(src_tss):
            src_zip = PageSourceZip(ts)
            for page_id, doc in src_zip.iter_pages():
                soup = BeautifulSoup(doc, features='lxml')
                mst_path = mst_conn.execute(
                    'SELECT path FROM mst_paths WHERE page_id = ?',
                    [page_id]).fetchone()[0]
                meta_tag = soup.find(
                    'meta', attrs={'name': 'DCTERMS.identifier'})
                if meta_tag:
                    full_src_path = meta_tag['content']
                    if full_src_path.startswith('/wps/wcm/connect'):
                        src_path = full_src_path[16:]
                        if src_path != mst_path:
                            logger.critical(
                                f'page {page_id} is registered as {mst_path}, '
                                f'but in source zip {src_zip.path.name} its '
                                f'path is {src_path}')
                    else:
                        # No consistency check possible
                        logger.warning(f'strange DCTERMS.identifier in '
                                       f'scrape {ts} / page {page_id}')
                else:
                    # No consistency check possible
                    logger.debug(
                        f'page without wcm url: scrape {ts} / page {page_id}')
            logger.info(f'page_id/path relation checked of scrape {ts}')

    return


def valid_scrapes(
        from_ts: str = '000101-0000', data_dir: Path = mst_dir) -> list[str]:
    """
    **Return a sorted list of valid scrapes.**

    Arguments:

        from_ts: skip scrapes with an earlier timestamp
        data_dir: master directory containing the (sub)directory with the
            scrape logs

    Returns:

        sorted list of timestamps of valid scrapes

    Scrapes are assumed to be valid if the text 'scrape finished' appears in
    the content of the related scrape log. When this text appears, the
    essential scrape was successful, even if further processing of the
    scraped data was not executed for some reason.

    The presence of log files with duplicate timestamps, as well as of log
    files that do not contain the discriminating text, will be logged.
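
    A minimal usage sketch (illustrative only; the timestamp is fictitious):

        recent_scrapes = valid_scrapes(from_ts='230101-0000')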
    """
    log_dir = data_dir / scrape_conf.log_dir_name
    valid_tss = set()
    for file in log_dir.iterdir():
        if 'README' in file.name:
            continue
        ts = file.name[:11]
        if ts < from_ts:
            continue
        valid = False
        with open(file) as scr_log:
            for line in scr_log:
                if 'scrape finished' in line:
                    valid = True
                    break
            if valid:
                if ts in valid_tss:
                    logger.warning(f'duplicate log for scrape {ts}')
                else:
                    valid_tss.add(ts)
            else:
                logger.warning(
                    f'scrape {ts} seems to be invalid or unfinished')
    return sorted(valid_tss)


def reprocess_scrapes(
        from_ts: str = '000101-0000',
        reprocess_links: bool = False,
        copy_db: bool = True) -> None:
    """
    **Reprocess scraped data.**

    Arguments:

        from_ts: first timestamp of data to be reprocessed
        reprocess_links: reprocess content of his_ed_links table
        copy_db: work on a copy of the scrapes database

    ***Note:*** *The structure of the scrapes storage is detailed in the
    `create_scrapes_storage` function.*

    This function will reprocess the data in the history tables of all
    scrapes or starting from the scrape with timestamp `from_ts`. The
    reprocessing addresses all data that can be reproduced from the data that
    was gathered during the actual scrape. This 'basic' scrape data,
    which can not be reproduced itself, consists of:

    - page sources - available in source zip files
    - redirections - stored in the his_redirs table
    - unlinked pages - stored in the his_unlinked table (this information
      can be reproduced from link analysis of the page sources, but since
      only editorial links are registered, it is treated as basic data)
    - status of links - stored in the his_links_status table

    Using this function assumes that the content of the master tables is
    complete and correct. In case there is some doubt whether this assumption
    holds, some consistency checks can be executed on these tables using the
    `check_storage_integrity` function.
    
    While reprocessing, next tables are rebuilt with reprocessed data:

    - *his_ed_links* - using page sources (if `reprocess_links` is `True`)
    - *his_pages_info* - using page sources
    - *his_pages_life* - using page sources
    - *his_short_urls* - using his_redirs
    - *his_status_figures* - using his_pages_info

    Although the other history tables are rebuilt as well, it will be with
    the same (basic) data and as such produce tables with the same content.

    Since reprocessing can take quite some time, the reprocessing will by
    default be executed on a copy of the scrapes database in order to avoid a
    clash with a new scrape being added. Any scrape that is added after the
    execution of this function started will be reprocessed as well.

    After reprocessing finishes, all original history tables will be saved
    with their original names prepended with 'svd_' and can be deleted from
    the database when the reprocessing was verified and successful.
    """

    global logger
    logger = logging.getLogger('reprocess')
    logger.setLevel(logging.INFO)
    ch = logging.StreamHandler()
    ch.setLevel(logging.INFO)
    formatter = logging.Formatter(
        fmt='[%(asctime)s] %(levelname)-8s - %(message)s',
        datefmt='%Y-%m-%d %H:%M:%S')
    ch.setFormatter(formatter)
    logger.addHandler(ch)

    if copy_db:
        scrs_db = mst_dir / scrape_conf.scrs_db_name
        v = 0
        while True:
            db_name = scrs_db.stem + '_repr' + (f'_{v}' if v else '')
            repr_db = scrs_db.with_stem(db_name)
            if repr_db.exists():
                v += 1
                continue
            else:
                break
        shutil.copy(scrs_db, repr_db)
        mst_conn.switch_to(repr_db)

    # Save copies of the history tables and remove data that will be renewed
    qry = '''
        SELECT name
        FROM sqlite_schema
        WHERE type = 'table' AND name LIKE 'his_%'
        '''
    for name in [r[0] for r in mst_conn.execute(qry)]:
        sql = mst_conn.execute(
            f'SELECT sql FROM sqlite_schema WHERE name = ?',
            [name]).fetchone()[0]
        svd_name = 'svd_' + name
        mst_conn.execute(sql.replace(name, svd_name))
        mst_conn.execute(f'INSERT INTO {svd_name} SELECT * FROM {name}')
        mst_conn.execute(f'DELETE FROM {name} WHERE timestamp >= "{from_ts}"')
    mst_conn.execute('VACUUM')

    # Cycle over all scrapes that need to be reprocessed. Then per scrape:
    # - reinstate all scrape tables as if that scrape just finished
    # - process these tables the same way as if the scrape was new
    scrapes = valid_scrapes(from_ts)
    while scrapes:
        ts = scrapes.pop(0)
        if not scrapes:
            # Add scrapes that were added in the meantime.
            scrapes = valid_scrapes(ts)[1:]

        logger.info(f'started reprocessing scrape {ts}')
        empty_scr_tables()
        src_zip = PageSourceZip(ts)

        # Save all page_id's to the scr_page_ids table
        mst_conn.execute('INSERT INTO scr_page_ids VALUES '
                         + ','.join(f'({pid})' for pid in src_zip.page_ids()))

        # Since redirections are determined at the moment of scraping,
        # the scr_redirs table can only be reinstated from the svd_his_redirs
        # table.
        mst_conn.execute(f'''
            WITH
                last_redirs_during_scrape AS (
                    SELECT DISTINCT
                        req_id, req_url,
                        last_value(redir_id) OVER win AS redir_id,
                        last_value(redir_url) OVER win AS redir_url,
                        last_value(redir_type) OVER win AS redir_type
                    FROM svd_his_redirs
                    WHERE timestamp <= '{ts}'
                    WINDOW win AS (
                        PARTITION BY req_id, req_url
                        ORDER BY timestamp
                        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
                    )
                )
            INSERT INTO scr_redirs
            SELECT *
            FROM last_redirs_during_scrape
            WHERE redir_type != 'absent'
            ''')

        # Re-analysing unlinked pages can only be done when all links are
        # registered. Since only editorial links are registered,
        # the scr_unlinked table is reinstated from the svd_his_unlinked table.
        mst_conn.execute(f'''
            WITH
                last_unlinked_during_scrape AS (
                    SELECT DISTINCT
                        page_id, url,
                        last_value(unl_type) OVER win AS unl_type
                    FROM svd_his_unlinked
                    WHERE timestamp <= '{ts}'
                    WINDOW win AS (
                        PARTITION BY page_id, url
                        ORDER BY timestamp
                        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
                    )
                )
            INSERT INTO scr_unlinked
            SELECT *
            FROM last_unlinked_during_scrape
            WHERE unl_type != 'absent'
            ''')

        # Handle scr_links and scr_ed_links tables
        if reprocess_links:
            # Cycle over all pages
            for page_id, doc in src_zip.iter_pages():
                soup = BeautifulSoup(doc, features='lxml')
                mst_path = mst_conn.execute(
                    'SELECT path FROM mst_paths WHERE page_id = ?',
                    [page_id]).fetchone()[0]
                mst_url = ROOT_URL + mst_path

                # Re-process all editorial links
                ed_tree = split_tree(soup)[0]
                for a_tag in ed_tree.find_all('a', href=True):
                    link = a_tag['href'].strip()
                    # Some links are discarded
                    if link.startswith('#'):
                        # Plain #-links
                        continue
                    if link == '/':
                        # Home page links
                        continue
                    if ('readspeaker' in link
                            or 'adobe' in link
                            or 'java' in link):
                        # Non content links
                        continue
                    link = urljoin(mst_url, link)
                    # Separate anchor from in-scope url
                    if link.startswith(ROOT_URL) and '#' in link:
                        link, _, anchor = link.partition('#')
                    else:
                        anchor = None
                    if link == mst_url:
                        # Discard in-page link (if it was not a plain #-link)
                        continue
                    if not link:
                        logger.warning(f'empty link in page {page_id}')
                        continue

                    # Check if the link is new for this scrape
                    qry = '''
                        SELECT link_nr
                        FROM scr_links
                        WHERE url = ?
                            AND anchor IS ?
                        '''
                    qry_result = mst_conn.execute(
                        qry, [link, anchor]).fetchone()
                    if qry_result:
                        # This link was already handled during this scrape
                        link_nr = qry_result[0]
                    else:
                        # Store the link for this scrape
                        link_nr = mst_conn.execute('''
                            INSERT INTO scr_links (url, anchor)
                            VALUES (?, ?)
                            ''', [link, anchor]).lastrowid
                    # Save link in relation to the page (can be a duplicate if
                    # the link is used more than once with the same text in the
                    # same page).
                    link_text = a_tag.text.strip()
                    mst_conn.execute('''
                        INSERT INTO scr_ed_links (page_id, text, link_nr)
                        VALUES (?, ?, ?)
                        ''', [page_id, link_text, link_nr])

            # Add link_type and status from saved history (since these can
            # not be reproduced otherwise).
            mst_conn.execute(f'''
                WITH
                    svd_links_status AS (
                        SELECT DISTINCT
                            link_id,
                            last_value(link_type) OVER win AS link_type,
                            last_value(status) OVER win AS status
                        FROM svd_his_links_status
                        WHERE timestamp <= '{ts}'
                        WINDOW win AS (
                            PARTITION BY link_id
                            ORDER BY timestamp
                            RANGE BETWEEN UNBOUNDED PRECEDING
                                AND UNBOUNDED FOLLOWING
                        )
                    )
                UPDATE scr_links AS s
                SET link_type = (
                        SELECT link_type
                        FROM svd_links_status AS t
                        WHERE t.link_id = s.link_id),
                    status = (
                        SELECT status
                        FROM svd_links_status AS t
                        WHERE t.link_id = s.link_id)
                ''')

            post_process_scrape()

        else:
            # Link related data does not need to be renewed: reinstate the
            # scr_links and scr_ed_links tables from the saved history.
            cte = f'''
                WITH RECURSIVE
                    last_ed_links_during_scrape AS (
                        SELECT DISTINCT
                            page_id, text, link_id,
                            last_value(present) OVER win AS present
                        FROM svd_his_ed_links
                        WHERE timestamp <= '{ts}'
                        WINDOW win AS (
                            PARTITION BY page_id, text, link_id
                            ORDER BY timestamp
                            ROWS BETWEEN UNBOUNDED PRECEDING
                                AND UNBOUNDED FOLLOWING
                        )
                    ),
                '''
            mst_conn.execute(cte + f'''
                    last_links_status_during_scrape AS (
                        SELECT DISTINCT
                            link_id,
                            last_value(link_type) OVER win AS link_type,
                            last_value(status) OVER win AS status
                        FROM svd_his_links_status
                        WHERE timestamp <= '{ts}'
                        WINDOW win AS (
                            PARTITION BY link_id
                            ORDER BY timestamp
                            RANGE BETWEEN UNBOUNDED PRECEDING
                                AND UNBOUNDED FOLLOWING
                        )
                    )
                INSERT INTO scr_links
                    (link_nr, url, page_id, anchor, link_type, status)
                SELECT
                    link_id, url, page_id, anchor, link_type, status
                FROM mst_links
                LEFT JOIN last_links_status_during_scrape USING (link_id)
                WHERE link_id IN (SELECT DISTINCT link_id
                                  FROM last_ed_links_during_scrape
                                  WHERE present > 0)
                ''')
            mst_conn.execute(cte + f'''
                    -- Degroup same links per page
                    last_ed_links_ungrouped AS (
                        SELECT page_id, text, link_id, present - 1 AS present
                        FROM last_ed_links_during_scrape
                        WHERE present > 0
                            UNION ALL
                        SELECT page_id, text, link_id, present - 1 AS present
                        FROM last_ed_links_ungrouped
                        WHERE present > 0
                    )
                INSERT INTO scr_ed_links (page_id, text, link_nr)
                SELECT page_id, text, link_id
                FROM last_ed_links_ungrouped
                ''')

        extract_pages_info(ts)
        derive_pages_info()
        status_figures()
        add_scraped_data(ts)
        logger.info(f'concluded reprocessing scrape {ts}')

    if copy_db:
        mst_conn.switch_back()

    return


if __name__ == '__main__':
    scrape_site()
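
The sketch below illustrates the maintenance scenario mentioned in the module description: reprocessing after the manually administered non_scrape_ids or non_orphan_ids tables have been changed. It is a minimal, hypothetical example; the from_ts value is an assumption.

# Hypothetical maintenance run: rebuild the history tables for all scrapes
# since the (assumed) timestamp from which the manual table changes take
# effect, working on a copy of the scrapes database.
from bd_www.scrape import reprocess_scrapes

reprocess_scrapes(from_ts='230101-0000', reprocess_links=False, copy_db=True)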
#   EXTRACTED_FIELDS = [('title', 'TEXT'), ('description', 'TEXT'), ('num_h1s', 'INTEGER'), ('first_h1', 'TEXT'), ('language', 'TEXT'), ('modified', 'DATE'), ('pagetype', 'TEXT'), ('classes', 'TEXT'), ('ed_text', 'TEXT'), ('aut_text', 'TEXT')]

Field definitions for the his_pages_info and scr_pages_info table (part 1 of 2).

#   DERIVED_FIELDS = [('business', 'TEXT'), ('category', 'TEXT')]

Field definitions for the his_pages_info and scr_pages_info table (part 2 of 2).

#   DV_TYPES = {'bld-campagne', 'bld-filter', 'bld-dv-content'}

Set of page-types that belong to the dv category.

#   BIB_TYPES = {'bld-landing', 'bld-bd', 'bld-overview', 'bld-direction', 'bld-cluster', 'bld-targetGroup', 'bld-target', 'bld-faq', 'bld-sitemap', 'bld-concept'}

Set of page-types that belong to the bib category.

#   ALG_TYPES = {'bld-outage', 'bld-response', 'bld-iahWrapper', 'bld-newsItem', 'bld-search', 'bld-error'}

Set of page-types that belong to the alg category (excluding 'bld_wrapper', which is handled specifically).
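
The following is a hypothetical sketch (not the module's actual code) of how these page-type sets could be used to derive the 'category' value mentioned in DERIVED_FIELDS; the function name categorize is made up for this illustration.

def categorize(pagetype: str) -> str:
    # Hypothetical helper, assuming DV_TYPES, BIB_TYPES and ALG_TYPES are in
    # scope: map a WCM pagetype onto the derived 'category' field. The
    # wrapper pagetype is left out here, because the module handles it
    # specifically.
    if pagetype in DV_TYPES:
        return 'dv'
    if pagetype in BIB_TYPES:
        return 'bib'
    if pagetype in ALG_TYPES:
        return 'alg'
    return 'unknown'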

#   def create_new_data_store() -> None:
View Source
def create_new_data_store() -> None:
    """
    **Create a new skeleton data store.**

    ***Note:*** *Since this function will not be used very often,
    it is mainly meant as a means to document the storage concept.*

    Driven by the configuration file `bd_www.constants.CONFIG_NAME`,
    (sub)directories are created in the master directory to contain the
    various scraped data, and empty scrapes and metrics databases are
    created with all necessary tables.

    The following data will be stored as files in separate **subdirectories**
    of the master directory:

    - page sources in zip files per scrape
    - 'robots.txt' files when used for finding unscraped url's and if different
      from the previous version
    - 'sitemap.xml' files when used for finding unscraped url's and if
      different from the previous version
    - log files for all scrapes
    - site reports per report variation, with subdirectories per reporting
      period

    The following **files** are stored directly in the master directory:

    - web analytics log file
    - reporting log file
    - scrapes database
    - metrics database

    The names of the above-mentioned files and (sub)directories are
    determined by the actual parameters of the configuration file. Please
    refer to the `bd_www.Config` class for further information.

    The ***scrapes database*** consists of the following tables and views:

    **History tables** - contain all basic, extracted and derived data from all
    scrapes. All stored data is labeled with the timestamp of the scrape.
    With some pragmatic exceptions the data is only saved if different from the
    previous scrape (time delta storage).

    - *his_pages_life* - time delta register when a page was alive or not
      (*alive* is FALSE when a page was not present)
    - *his_pages_info* - time delta register of extracted and derived data per
      page (full record saved when one or more values differ from previous
      scrape)
    - *his_ed_links* - time delta register with usage of editorial links
      (*present* is 0 when an editorial link was not used)
    - *his_links_status* - time delta register with type and status of
      (editorial) links
    - *his_redirs* - time delta register with redirects that occurred while
      scraping (*redir_type* is 'absent' when it did not occur)
    - *his_unlinked* - time delta register with pages that were not connected
      via editorial links (*unl_type* is 'absent' when no longer unlinked or
      not present)
    - *his_short_urls* - time delta register when short url's were active
    - *his_status_figures* - figures representing the status of the site for
      each scrape

    **Master tables** - contain unique and primary data

    - *mst_paths* - unique relation between *path* and *page_id*
    - *mst_links* - unique relation between *url* or *page_id*/*anchor* and
      *link_id*
    - *mst_scrapes* - primary reference for available scrapes and unique source
      for *scraper* and *primers* fields

    **Non tables/views** - collections of manually administered *page_id*'s
    to force special handling of these pages.

    - *non_scrape_ids* - table with manually administered *page_id*'s that will
      not be scraped
    - *non_orphan_ids* - table with manually administered *page_id*'s that will
      not be considered orphans when becoming unlinked (*unl_type*  will be set
      to 'special' in those cases)
    - *non_orphan_paths* - view of table *non_orphan_ids* with explicit *path*
    - *non_scrape_paths* - view of table *non_scrape_ids* with explicit *path*

    **Scrape tables** - operational stores for scrape results; purged when
    starting a new scrape.

    - *scr_page_ids* - id's of pages that were scraped
    - *scr_redirs* - redirects that occurred while scraping
    - *scr_pages_info* - extracted and derived data per page
    - *scr_links* - register of fully qualified (including *link_type* and
      *status*) links in the editorial content of the scraped pages
    - *scr_ed_links* - use of links in the editorial content of each page
    - *scr_unlinked* - pages that can not be reached from other pages
    - *scr_status_figures* - figures representing the status of the site

    **Timestamped table and views** - after saving the timestamp in (the only
    record of) the *tsd* table, the timestamped views represent a dataset
    that is comparable to the scrape tables after the scrape with the
    specified timestamp finished.

    - *tsd* - scalar store to set the timestamp for the next timestamped views
    - *tsd_redirs* - identical to equivalent scrape table
    - *tsd_redirs_to_wcm* - all redirects related to the canonical url of the
      page they finally redirected to (not available as scrape table)
    - *tsd_pages_info* - identical to equivalent scrape table
    - *tsd_links* - identical to equivalent scrape table
    - *tsd_ed_links* - identical to equivalent scrape table
    - *tsd_unlinked* - identical to equivalent scrape table
    - *tsd_int_ref_cnt* - number of times that a page is linked to from the
      editorial content of other pages (not available as scrape table)

    The ***metrics database*** consists of the following tables and views:

    **Basic tables** - contain unique and primary data

    - *shd_paths* - unique relation between *path* and *page_id*; acts as
      shadow subset of the *mst_paths* table in the scrape database as
      substitute for an inter-db relation that is not possible with SQLite
    - *shd_links* - unique relation between *link_id* and *url*; acts as
      shadow subset of the *mst_links* table in the scrape database as
      substitute for an inter-db relation that is not possible with SQLite
    - *dates* - unique relation between *date* and *date_id*

    **Analytics tables** - contain all historic analytics data

    - *daily* - numeric analytics data per date and page
    - *feedback* - numeric and textual feedback per date and page
    - *downloads* - numeric download data per date and download link

    **Analytics views** - views for the analytics tables with explicit relations

    - *daily_expl* - daily data table with explicit date and page path
    - *feedback_expl* - feedback data table with explicit date and page path
    - *downloads_expl* - downloads data table with explicit date and link path
    """

    # Create directories and database file
    for dir_name in (scrape_conf.src_dir_name,
                     scrape_conf.robots_dir_name,
                     scrape_conf.log_dir_name,
                     scrape_conf.sitemap_dir_name):
        (mst_dir / dir_name).mkdir(exist_ok=True)
    for report in report_conf.reports:
        report_conf.spec_report(report)
        report_dir = mst_dir / report_conf.report_dir_name
        report_dir.mkdir(exist_ok=True)
        for subdir_name in set(REPORT_SUBDIRS.values()):
            (report_dir / subdir_name).mkdir(exist_ok=True)
    log_readme = mst_dir / scrape_conf.log_dir_name / 'README.txt'
    with open(log_readme, 'w') as f:
        f.write(_README_TXT)

    # Scrape db: create all tables to contain the historical scrape data
    mst_conn.execute('''
        CREATE TABLE mst_scrapes (
            timestamp	TEXT PRIMARY KEY,
            date		AS ("20" || substr(timestamp,1,2)
                            || "-" || substr(timestamp,3,2)
                            || "-" || substr(timestamp,5,2)),
            scraper		TEXT,
            primers		TEXT
        )
        ''')
    mst_conn.execute('''
        CREATE TABLE mst_paths (
            page_id	    INTEGER PRIMARY KEY AUTOINCREMENT,
            path        TEXT NOT NULL UNIQUE
        )
        ''')
    mst_conn.execute('''
        CREATE TABLE mst_links (
            link_id		INTEGER PRIMARY KEY AUTOINCREMENT,
            url			TEXT,
            anchor      TEXT,
            page_id		INTEGER,
            UNIQUE (url),
            UNIQUE (page_id, anchor),
            FOREIGN KEY (page_id)
                REFERENCES mst_paths (page_id)
        )
        ''')
    mst_conn.execute('''
        CREATE TABLE his_pages_life (
            timestamp	TEXT NOT NULL,
            page_id     INTEGER NOT NULL,
            alive       INTEGER NOT NULL,
            PRIMARY KEY (timestamp, page_id),
            FOREIGN KEY (page_id)
                REFERENCES mst_paths (page_id)
        )
        ''')
    fields = EXTRACTED_FIELDS + DERIVED_FIELDS
    info_columns = ', '.join([f'{f[0]} {f[1]}' for f in fields])
    mst_conn.execute(f'''
        CREATE TABLE his_pages_info (
            timestamp	TEXT NOT NULL,
            page_id     INTEGER NOT NULL,
            {info_columns},
            PRIMARY KEY (timestamp, page_id),
            FOREIGN KEY (page_id)
                REFERENCES mst_paths (page_id)
        )
        ''')
    mst_conn.execute('''
        CREATE TABLE his_links_status (
            timestamp	TEXT NOT NULL,
            link_id		INTEGER NOT NULL,
            link_type	TEXT,
            status		INTEGER,
            PRIMARY KEY (timestamp, link_id),
            FOREIGN KEY (link_id)
                REFERENCES mst_links (link_id)
        )
        ''')
    mst_conn.execute('''
        CREATE TABLE his_ed_links (
            timestamp	TEXT NOT NULL,
            page_id	    INTEGER NOT NULL,
            text 	    TEXT NOT NULL,
            link_id	    INTEGER NOT NULL,
            present		INTEGER NOT NULL,
            PRIMARY KEY (timestamp, page_id, text, link_id),
            FOREIGN KEY (page_id)
                REFERENCES mst_paths (page_id),
            FOREIGN KEY (link_id)
                REFERENCES mst_links (link_id)
        )
        ''')
    mst_conn.execute('''
        CREATE TABLE his_redirs (
            timestamp	TEXT NOT NULL,
            req_id      INTEGER,
            req_url     TEXT,
            redir_id    INTEGER,
            redir_url   TEXT,
            redir_type  TEXT NOT NULL,
            PRIMARY KEY (timestamp, req_id, req_url),
            UNIQUE (timestamp, req_id),
            UNIQUE (timestamp, req_url),
            FOREIGN KEY (req_id)
                REFERENCES mst_paths (page_id),
            FOREIGN KEY (redir_id)
                REFERENCES mst_paths (page_id)
        )
        ''')
    mst_conn.execute('''
        CREATE TABLE his_unlinked (
            timestamp	TEXT NOT NULL,
            page_id	    INTEGER,
            url		    TEXT,
            unl_type	TEXT,
            PRIMARY KEY (timestamp, page_id, url),
            UNIQUE (timestamp, page_id),
            UNIQUE (timestamp, url),
            FOREIGN KEY (page_id)
                REFERENCES mst_paths (page_id)
        )
        ''')
    mst_conn.execute('''
        CREATE TABLE his_short_urls (
            timestamp	TEXT NOT NULL,
            url     	TEXT NOT NULL,
            active      INTEGER NOT NULL,
            PRIMARY KEY (timestamp, url)
        )
        ''')
    mst_conn.execute('''
        CREATE TABLE his_status_figures (
            timestamp	TEXT NOT NULL,
            name		TEXT NOT NULL,
            value		INTEGER NOT NULL,
            PRIMARY KEY (timestamp, name)
        )
        ''')
    mst_conn.execute('''
        CREATE TABLE non_orphan_ids (
            page_id	    INTEGER NOT NULL UNIQUE,
            FOREIGN KEY (page_id)
                REFERENCES mst_paths (page_id)
        )
        ''')
    mst_conn.execute('''
        CREATE TABLE non_scrape_ids (
            page_id	    INTEGER NOT NULL UNIQUE,
            FOREIGN KEY (page_id)
                REFERENCES mst_paths (page_id)
        )
        ''')

    # Scrape db: create tables to contain operational results during a scrape
    mst_conn.execute('''
        CREATE TABLE scr_page_ids (
            page_id	    INTEGER NOT NULL UNIQUE,
            FOREIGN KEY (page_id)
                REFERENCES mst_paths (page_id)
        )
        ''')
    mst_conn.execute('''
        CREATE TABLE scr_redirs (
            req_id      INTEGER UNIQUE,
            req_url     TEXT UNIQUE,
            redir_id    INTEGER,
            redir_url   TEXT,
            redir_type  TEXT NOT NULL,
            PRIMARY KEY (req_id, req_url),
            FOREIGN KEY (req_id)
                REFERENCES mst_paths (page_id),
            FOREIGN KEY (redir_id)
                REFERENCES mst_paths (page_id)
        )
        ''')
    mst_conn.execute('''
        CREATE TABLE scr_links (
            link_nr		INTEGER PRIMARY KEY AUTOINCREMENT,
            url			TEXT,
            page_id		INTEGER,
            anchor      TEXT,
            link_type	TEXT,
            status  	INTEGER,
            link_id	    INTEGER UNIQUE,
            UNIQUE (url, anchor),
            UNIQUE (page_id, anchor),
            FOREIGN KEY (page_id)
                REFERENCES mst_paths (page_id)
        )
        ''')
    mst_conn.execute('''
        CREATE TABLE scr_ed_links (
            page_id	    INTEGER NOT NULL,
            text 	    TEXT NOT NULL,
            link_nr	    INTEGER NOT NULL,
            FOREIGN KEY (page_id)
                REFERENCES mst_paths (page_id),
            FOREIGN KEY (link_nr)
                REFERENCES scr_links (link_nr)
        )
        ''')
    mst_conn.execute('''
        CREATE TABLE scr_unlinked (
            page_id	    INTEGER UNIQUE,
            url		    TEXT UNIQUE,
            unl_type    TEXT,
            FOREIGN KEY (page_id)
                REFERENCES mst_paths (page_id)
        )
        ''')
    fields = EXTRACTED_FIELDS + DERIVED_FIELDS
    info_columns = ', '.join([f'{f[0]} {f[1]}' for f in fields])
    mst_conn.execute(f'''
        CREATE TABLE scr_pages_info (
            page_id INTEGER PRIMARY KEY,
            {info_columns},
            FOREIGN KEY (page_id)
            REFERENCES mst_paths (page_id)
        )
        ''')
    mst_conn.execute('''
        CREATE TABLE scr_status_figures (
            name		TEXT PRIMARY KEY,
            value		INTEGER NOT NULL
        )
        ''')

    # Scrape db: create views for non-orphan and non-scrape tables with added
    # paths.
    mst_conn.execute('''
        CREATE VIEW non_orphan_paths AS
        SELECT * FROM non_orphan_ids NATURAL JOIN mst_paths
        ''')
    mst_conn.execute('''
        CREATE VIEW non_scrape_paths AS
        SELECT * FROM non_scrape_ids NATURAL JOIN mst_paths
        ''')

    # Scrape db: create a simple table to contain the timestamp for a
    # specific scrape, together with the views to get the data for that scrape.
    mst_conn.executescript('CREATE TABLE tsd (timestamp TEXT)')
    mst_conn.execute('''
        CREATE VIEW tsd_unlinked AS
        WITH
            last_unlinked_during_scrape AS (
                SELECT DISTINCT
                    page_id, url,
                    last_value(unl_type) OVER win AS unl_type
                FROM his_unlinked AS u, tsd AS s
                WHERE u.timestamp <= s.timestamp
                WINDOW win AS (
                    PARTITION BY page_id, url
                    ORDER BY u.timestamp
                    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
                )
            )
        SELECT *
        FROM last_unlinked_during_scrape
        WHERE unl_type != 'absent'
        ''')
    mst_conn.execute('''
        CREATE VIEW tsd_redirs AS
        WITH
            last_redirs_during_scrape AS (
                SELECT DISTINCT
                    req_id, req_url,
                    last_value(redir_id) OVER win AS redir_id,
                    last_value(redir_url) OVER win AS redir_url,
                    last_value(redir_type) OVER win AS redir_type
                FROM his_redirs AS r, tsd AS s
                WHERE r.timestamp <= s.timestamp
                WINDOW win AS (
                    PARTITION BY req_id, req_url
                    ORDER BY r.timestamp
                    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
                )
            )
        SELECT *
        FROM last_redirs_during_scrape
        WHERE redir_type != 'absent'
        ''')
    mst_conn.execute('''
        CREATE VIEW tsd_redirs_to_wcm (req_url, wcm_id, final_type) AS
        WITH
            -- build url-based redir chains
            scr_chains (org_url, prev_url, next_id, next_url, next_type) AS (
                SELECT
                    req_url, NULL, redir_id, redir_url, redir_type
                FROM tsd_redirs
                    UNION ALL
                SELECT
                    org_url, next_url, r.redir_id, r.redir_url, r.redir_type
                FROM scr_chains AS c
                LEFT JOIN tsd_redirs AS r ON c.next_url = r.req_url
                -- avoid endless loops by excluding redirs of an alias
                WHERE c.next_type <> 'alias'
            )
        -- get all redir chains ending in a wcm url
        SELECT org_url, next_id, next_type
        FROM scr_chains
        WHERE next_id IS NOT NULL AND org_url IS NOT NULL
        ''')
    mst_conn.execute('''
        CREATE VIEW tsd_ed_links AS
        WITH RECURSIVE
            last_his_ed_links_records_during_scrape AS (
                SELECT DISTINCT
                    page_id, text, link_id,
                    last_value(present) OVER win AS present
                FROM his_ed_links AS e, tsd AS s
                WHERE e.timestamp <= s.timestamp
                WINDOW win AS (
                    PARTITION BY page_id, text, link_id
                    ORDER BY e.timestamp
                    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
                )
            ),
            -- Degroup same links per page
            scr_ed_links_singles AS (
                SELECT page_id, text, link_id, present - 1 AS present
                FROM last_his_ed_links_records_during_scrape
                WHERE present > 0
                    UNION ALL
                SELECT page_id, text, link_id, present - 1 AS present
                FROM scr_ed_links_singles
                WHERE present > 0
            )
        SELECT page_id, text, link_id
        FROM scr_ed_links_singles
        ''')
    mst_conn.execute('''
        CREATE VIEW tsd_links AS
        WITH
            tsd_links_status AS (
                SELECT DISTINCT
                    link_id,
                    last_value(link_type) OVER win AS link_type,
                    last_value(status) OVER win AS status
                FROM his_links_status AS l, tsd AS s
                WHERE l.timestamp <= s.timestamp
                WINDOW win AS (
                    PARTITION BY link_id
                    ORDER BY l.timestamp
                    RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
                )
            )
        SELECT link_id, url, page_id, anchor, link_type, status
        FROM mst_links
        LEFT JOIN tsd_links_status USING (link_id)
        WHERE link_id IN (SELECT DISTINCT link_id FROM tsd_ed_links)
        ''')
    mst_conn.execute('''
        CREATE VIEW tsd_int_ref_cnt (page_id, referral_cnt) AS
        WITH
            -- wcm id's of the internal links per page
            int_page_links (referring_page_id,
                            referral_cnt,
                            linked_page_id) AS (
                SELECT e.page_id, count(*), ifnull(l.page_id, s.wcm_id)
                FROM tsd_ed_links AS e
                LEFT JOIN mst_links AS l USING (link_id)
                LEFT JOIN tsd_redirs_to_wcm AS s 
                    ON iif(instr(l.url, '#'),
                           substr(l.url, 1, instr(l.url, '#') - 1),
                           l.url) = req_url
                WHERE (l.page_id OR s.wcm_id)
                GROUP BY e.page_id, e.text, e.link_id
            )
        SELECT linked_page_id, sum(referral_cnt)
        FROM int_page_links
        -- next clause removes some spurious in-page links
        WHERE referring_page_id != linked_page_id
        GROUP BY linked_page_id
        ''')
    mst_conn.execute('''
        CREATE VIEW tsd_pages_info AS
        WITH
            scr_pages AS (
                SELECT DISTINCT
                    page_id,
                    last_value(alive) OVER win AS alive
                FROM his_pages_life AS p, tsd AS s
                WHERE p.timestamp <= s.timestamp
                WINDOW win AS (
                    PARTITION BY page_id
                    ORDER BY p.timestamp
                    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
                )
            ),
            scr_last_info_ts (page_id, timestamp) AS (
                SELECT DISTINCT page_id, last_value(i.timestamp) OVER win
                FROM his_pages_info AS i, tsd AS s
                WHERE i.timestamp <= s.timestamp
                WINDOW win AS (
                    PARTITION BY page_id
                    ORDER BY i.timestamp
                    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
                )
            )
        SELECT pi.*
        FROM scr_pages
        LEFT JOIN scr_last_info_ts AS li USING (page_id)
        LEFT JOIN his_pages_info AS pi USING (page_id, timestamp)
        WHERE alive
        ''')

    # Metrics db: create basic tables
    mst_conn.execute('''
        CREATE TABLE mtrx.shd_paths (
            page_id	    INTEGER NOT NULL UNIQUE,
            path	    TEXT NOT NULL UNIQUE,
            PRIMARY KEY (page_id)
        )
        ''')
    mst_conn.execute('''
        CREATE TABLE mtrx.shd_links (
            link_id     INTEGER PRIMARY KEY,
            url			TEXT NOT NULL
        )
        ''')
    mst_conn.execute('''
        CREATE TABLE mtrx.dates (
            date_id	    INTEGER UNIQUE,
            date	    TEXT NOT NULL UNIQUE,
            PRIMARY KEY (date_id AUTOINCREMENT)
        )
        ''')

    # Metrics db: create tables to store analytics data
    mst_conn.execute('''
        CREATE TABLE mtrx.daily (
            date_id				INTEGER NOT NULL,
            page_id				INTEGER NOT NULL,
            nb_visits			INTEGER,
            nb_hits				INTEGER,
            entry_nb_visits		INTEGER,
            entry_bounce_count	INTEGER,
            bounce_rate			REAL AS
                (round(CAST(entry_bounce_count AS REAL) / entry_nb_visits, 3)),
            exit_nb_visits		INTEGER,
            exit_rate			REAL AS
                (round(CAST(exit_nb_visits AS REAL) / nb_visits, 3)),
            organic_entries		INTEGER,
            organic_entry_rate	REAL AS
                (round(CAST(organic_entries AS REAL) / nb_visits, 3)),
            call_visits			INTEGER,
            call_rate			REAL AS
                (round(CAST(call_visits AS REAL) / nb_visits, 3)),
            PRIMARY KEY (date_id, page_id),
            FOREIGN KEY (date_id) REFERENCES dates,
            FOREIGN KEY (page_id) REFERENCES shd_paths
        )
        ''')
    mst_conn.execute('''
        CREATE TABLE mtrx.feedback (
            date_id		INTEGER NOT NULL,
            page_id		INTEGER NOT NULL,
            pos_cnt		INTEGER,
            neg_cnt		INTEGER,
            pos_txt		TEXT,
            neg_txt		TEXT,
            PRIMARY KEY (date_id, page_id),
            FOREIGN KEY (date_id) REFERENCES dates,
            FOREIGN KEY (page_id) REFERENCES shd_paths
        )
        ''')
    mst_conn.execute('''
        CREATE TABLE mtrx.downloads (
            date_id	    INTEGER NOT NULL REFERENCES dates,
            link_id	    INTEGER NOT NULL REFERENCES shd_links,
            nb_visits	INTEGER,
            nb_hits		INTEGER,
            PRIMARY KEY (date_id, link_id)
        )
        ''')

    # Metrics db: create views to explicitly show referenced data
    mst_conn.execute('''
        CREATE VIEW mtrx.daily_expl AS
        SELECT date, path, d.*
        FROM daily AS d
        NATURAL JOIN dates
        NATURAL JOIN shd_paths
        ''')
    mst_conn.execute('''
        CREATE VIEW mtrx.feedback_expl AS
        SELECT date, path, f.*
        FROM feedback AS f
        NATURAL JOIN dates
        NATURAL JOIN shd_paths
        ''')
    mst_conn.execute('''
        CREATE VIEW mtrx.downloads_expl AS
        SELECT date, url, dl.*
        FROM downloads AS dl
        NATURAL JOIN dates
        NATURAL JOIN shd_links
        ''')

Create a new skeleton data store.

Note: Since this function will not be used very often, it is mainly meant as a means to document the storage concept.

Driven by the configuration file bd_www.constants.CONFIG_NAME, (sub)directories are created in the master directory to contain the various scraped data, and empty scrapes and metrics databases are created with all necessary tables.

The following data will be stored as files in separate subdirectories of the master directory:

  • page sources in zip files per scrape
  • 'robots.txt' files when used for finding unscraped url's and if different from the previous version
  • 'sitemap.xml' files when used for finding unscraped url's and if different from the previous version
  • log files for all scrapes
  • site reports per report variation, with subdirectories per reporting period

The following files are stored directly in the master directory:

  • web analytics log file
  • reporting log file
  • scrapes database
  • metrics database

The names of the above-mentioned files and (sub)directories are determined by the actual parameters of the configuration file. Please refer to the bd_www.Config class for further information.

The scrapes database consists of the following tables and views:

History tables - contain all basic, extracted and derived data from all scrapes. All stored data is labeled with the timestamp of the scrape. With some pragmatic exceptions the data is only saved if different from the previous scrape (time delta storage).

  • his_pages_life - time delta register when a page was alive or not (alive is FALSE when a page was not present)
  • his_pages_info - time delta register of extracted and derived data per page (full record saved when one or more values differ from previous scrape)
  • his_ed_links - time delta register with usage of editorial links (present is 0 when an editorial link was not used)
  • his_links_status - time delta register with type and status of (editorial) links
  • his_redirs - time delta register with redirects that occurred while scraping (redir_type is 'absent' when it did not occur)
  • his_unlinked - time delta register with pages that were not connected via editorial links (unl_type is 'absent' when no longer unlinked or not present)
  • his_short_urls - time delta register when short url's were active
  • his_status_figures - figures representing the status of the site for each scrape

Master tables - contain unique and primary data

  • mst_paths - unique relation between path and page_id
  • mst_links - unique relation between url or page_id/anchor and link_id
  • mst_scrapes - primary reference for available scrapes and unique source for scraper and primers fields

Non tables/views - collections of manually administered page_id's to force special handling of these pages.

  • non_scrape_ids - table with manually administered page_id's that will not be scraped
  • non_orphan_ids - table with manually administered page_id's that will not be considered orphans when becoming unlinked (unl_type will be set to 'special' in those cases)
  • non_orphan_paths - view of table non_orphan_ids with explicit path
  • non_scrape_paths - view of table non_scrape_ids with explicit path

Scrape tables - operational stores for scrape results; purged when starting a new scrape.

  • scr_page_ids - id's of pages that were scraped
  • scr_redirs - redirects that occurred while scraping
  • scr_pages_info - extracted and derived data per page
  • scr_links - register of fully qualified (including link_type and status) links in the editorial content of the scraped pages
  • scr_ed_links - use of links in the editorial content of each page
  • scr_unlinked - pages that can not be reached from other pages
  • scr_status_figures - figures representing the status of the site

Timestamped table and views - after saving the timestamp in (the only record of) the tsd table, the timestamped views represent a dataset that is comparable to the scrape tables after the scrape with the specified timestamp finished.

  • tsd - scalar store to set the timestamp for the next timestamped views
  • tsd_redirs - identical to equivalent scrape table
  • tsd_redirs_to_wcm - all redirects related to the canonical url of the page they finally redirected to (not available as scrape table)
  • tsd_pages_info - identical to equivalent scrape table
  • tsd_links - identical to equivalent scrape table
  • tsd_ed_links - identical to equivalent scrape table
  • tsd_unlinked - identical to equivalent scrape table
  • tsd_int_ref_cnt - number of times that a page is linked to from the editorial content of other pages (not available as scrape table)

The metrics database consists of the following tables and views:

Basic tables - contain unique and primary data

  • shd_paths - unique relation between path and page_id; acts as shadow subset of the mst_paths table in the scrape database as substitute for an inter-db relation that is not possible with SQLite
  • shd_links - unique relation between link_id and url; acts as shadow subset of the mst_links table in the scrape database as substitute for an inter-db relation that is not possible with SQLite
  • dates - unique relation between date and date_id

Analytics tables - contain all historic analytics data

  • daily - numeric analytics data per date and page
  • feedback - numeric and textual feedback per date and page
  • downloads - numeric download data per date and download link

Analytics views - views for the analytics tables with explicit relations

  • daily_expl - daily data table with explicit date and page path
  • feedback_expl - feedback data table with explicit date and page path
  • downloads_expl - downloads data table with explicit date and link path
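
As a minimal sketch of how the timestamped views could be used (assuming direct use of the module's mst_conn connection; the timestamp is an assumed example of an existing scrape):

# Hypothetical query session: reconstruct the site state as it was after the
# scrape with the (assumed) timestamp '230101-0200'.
mst_conn.execute('DELETE FROM tsd')
mst_conn.execute('INSERT INTO tsd VALUES (?)', ['230101-0200'])
pages_alive = mst_conn.execute(
    'SELECT count(*) FROM tsd_pages_info').fetchone()[0]
unlinked = mst_conn.execute(
    'SELECT page_id, url, unl_type FROM tsd_unlinked').fetchall()
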
#   def scrape_site() -> str | None:
View Source
def scrape_site() -> str | None:
    """
    **Scrape www.belastingdienst.nl.**

    Returns:

        timestamp of the scrape [yymmdd-hhmm] or None in case a
            'SCRAPE_IN_PROGRESS.flag' is encountered

    One complete scrape is done of `www.belastingdienst.nl` and stored in the
    scrape tables of the scrapes database (documented with the
    `create_new_data_store` function) and a zip-file with page sources.

    The scrape tables will contain `page_id`'s of the scraped pages,
    redirects that occurred while retrieving the pages, detailed info per
    page, use of links from editorial content and figures representing the
    site status.

    After finishing the scrape and completing the scrape tables, the result of
    the scrape is added to the history tables of the scrapes database.

    Further process and status information of the scrape is retained in the
    following files:

    - log file with info, warnings and/or errors of the scrape
    - copies of the 'robots.txt' and 'sitemap.xml' files if they are used to
      find unscraped pages and if they differ from the previously saved version

    In case the scraper operates in an environment with normal external
    access, all editorial links will be checked. Otherwise only links to (a
    few configured) trusted sites are checked. The outcome of all checks,
    executed or not, will be reflected in the *status* field of the
    *scr_links* and *his_links_status* tables of the scrapes database.
    """

    global logger

    def ext_int_available() -> bool:
        """
        **Tests whether external internet is available.**

        Returns:

            Availability.

        Only tests if www.google.nl can be reached.
        """
        # noinspection PyBroadException
        try:
            ok = requests.get(
                'https://www.google.nl', timeout=5).status_code == 200
        except Exception:
            ok = False
        return ok

    # Some preliminaries
    ts = time.strftime('%y%m%d-%H%M')
    logger = logging.getLogger('scrape')
    log_dir = mst_dir / scrape_conf.log_dir_name
    log_file = str(log_dir / f'{ts}_{scrape_conf.log_name}')
    logger.setLevel(logging.INFO)
    fh = logging.FileHandler(log_file)
    fh.setLevel(logging.INFO)
    ch = logging.StreamHandler()
    ch.setLevel(logging.INFO)
    formatter = logging.Formatter(
        fmt='[%(asctime)s] %(levelname)-8s - %(message)s',
        datefmt='%Y-%m-%d %H:%M:%S')
    fh.setFormatter(formatter)
    ch.setFormatter(formatter)
    logger.addHandler(fh)
    logger.addHandler(ch)
    logger.info(f'scrape {ts} started')

    # Check if another scrape is running
    flag = mst_dir / 'SCRAPE_IN_PROGRESS.flag'
    if flag.exists():
        ct = dt.datetime.fromtimestamp(flag.stat().st_ctime)
        logger.warning(f'Another scrape in progress since '
                       f'{ct.strftime("%a %d %b %Y / %H:%M")}')
        return None
    else:
        flag.touch()

    # Load url's that should not be scraped
    qry_result = mst_conn.execute(
        'SELECT path FROM non_scrape_paths').fetchall()
    urls_done = {ROOT_URL + path for (path,) in qry_result}

    # Initialisations
    empty_scr_tables()
    api_ok = matomo_available()
    ext_ok = ext_int_available()
    if not ext_ok:
        logger.warning('www.google.nl not available: '
                       'assuming limited external access')
    urls_todo = {WWW_SITE}
    urls_scraped = 0
    pages_saved = 0
    unlinked_urls_checked = False
    mst_conn.execute('''
        INSERT INTO mst_scrapes (timestamp, scraper, primers)
        VALUES (?, ?, ?)
        ''', [ts, SCRAPER_VERSION, 'site url only'])
    start_time = time.time()

    # Do the scrape
    max_nr = 100_000 if scrape_conf.max_urls == 'all' else scrape_conf.max_urls
    src_zip = PageSourceZip(ts)
    while urls_todo and urls_scraped < max_nr:
        req_url = urls_todo.pop()
        wcm_url = scrape_page(req_url, urls_done, urls_todo, src_zip, ext_ok)
        urls_scraped += 1
        if wcm_url:
            pages_saved += 1

        # After initial link crawling, check if any url's were missed
        if not urls_todo and not unlinked_urls_checked:
            logger.info(f'initial link crawling done: '
                        f'{urls_scraped} url\'s visited / '
                        f'{pages_saved} pages saved')
            logger.info('finding url\'s that have not been crawled')
            urls_todo = find_unscraped_urls(ts, urls_done,
                                            historic=True, requested=api_ok)
            unlinked_urls_checked = True

        # Time cycles and print progress and prognosis
        num_todo = min(len(urls_todo), max_nr - urls_scraped)
        if urls_scraped % 25 == 0:
            page_time = (time.time() - start_time) / urls_scraped
            togo_time = int(num_todo * page_time)
            print(f'{urls_scraped:4} url\'s done, '
                  f'{page_time:.2f} secs per url / '
                  f'{num_todo:4} found todo, {togo_time // 60}:'
                  f'{togo_time % 60:02} min to scrape them')

    elapsed = int(time.time() - start_time)
    # Important: do not alter the next logging text; the presence of this
    # text in the log file validates the scrape (ref. check_storage_integrity
    # function).
    logger.info(f'scrape finished in {elapsed // 60}:{elapsed % 60:02} min')
    logger.info(f'{urls_scraped} url\'s visited / {pages_saved} pages saved')

    # Processing the scraped data
    post_process_scrape()
    extract_pages_info(ts)
    derive_pages_info()
    status_figures()
    add_scraped_data(ts)

    elapsed = int(time.time() - start_time)
    logger.info(f'scraping and processing data finished in '
                f'{elapsed // 60}:{elapsed % 60:02} min')
    flag.unlink(missing_ok=True)
    return ts

Scrape www.belastingdienst.nl.

Returns:

timestamp of the scrape [yymmdd-hhmm] or None in case a
    'SCRAPE_IN_PROGRESS.flag' is encountered

One complete scrape is done of www.belastingdienst.nl and stored in the scrape tables of the scrapes database (documented with the create_new_data_store function) and a zip-file with page sources.

The scrape tables will contain page_id's of the scraped pages, redirects that occurred while retrieving the pages, detailed info per page, use of links from editorial content and figures representing the site status.

After finishing the scrape and completing the scrape tables, the result of the scrape is added to the history tables of the scrapes database.

Further process and status information of the scrape is retained in the following files:

  • log file with info, warnings and/or errors of the scrape
  • copies of the 'robots.txt' and 'sitemap.xml' files if they are used to find unscraped pages and if they differ from the previously saved version

In case the scraper operates in an environment with normal external access, all editorial links will be checked. Otherwise only links to (a few configured) trusted sites are checked. The outcome of all checks, executed or not, will be reflected in the status field of the scr_links and his_links_status tables of the scrapes database.
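
For a scheduled, unattended run, a minimal usage sketch (assuming the function is imported from the bd_www package, as described in the module documentation):

# Hypothetical scheduled job: run one complete scrape and report the outcome.
from bd_www import scrape_site

ts = scrape_site()
if ts:
    print(f'scrape {ts} finished')
else:
    print('skipped: another scrape was already in progress')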

#   def empty_scr_tables() -> None:
View Source
def empty_scr_tables() -> None:
    """
    **Delete all content from scrape tables.**
    """
    qry = '''
        SELECT name
        FROM sqlite_schema
        WHERE type = 'table' AND name LIKE 'scr_%'
        '''
    for name in [r[0] for r in mst_conn.execute(qry)]:
        mst_conn.execute(f'DELETE FROM {name}')

Delete all content from scrape tables.

#   def scrape_page( req_url: str, urls_done: Set[str], urls_todo: Set[str], page_zip: bd_www.PageSourceZip, check_all_links: bool = True ) -> Optional[str]:
View Source
def scrape_page(
        req_url: str,
        urls_done: Set[str],
        urls_todo: Set[str],
        page_zip: PageSourceZip,
        check_all_links: bool = True) -> Union[str, None]:
    """
    **Scrape an html page and save the results to the scrape tables.**

    Arguments:

        req_url: url of requested page
        urls_done: url's that have been scraped already
        urls_todo: url's that still need to be scraped
        page_zip: archive to store scraped page sources
        check_all_links: when false, only trusted links will be checked

    Returns:

        canonical wcm url (or None for mentioned cases)

    Get the final response from the requested url while collecting any
    redirect that occurs during the request. Register the page under the
    canonical wcm path and save the page source to the zip-archive. Further
    store page-related data, editorial links and redirects to the scrape
    tables of the scrapes database (documented with the
    `create_new_data_store` function). The `urls_done` and `urls_todo` sets
    are updated accordingly, driving the further scrape of the site.

    When `check_all_links` is `True`, all links within the page that are
    external (do not start with `bd_www.constants.ROOT_URL`) will be checked.
    When set to `False`, only those external links will be checked that go to
    a site in the `trusted_domains` configuration parameter (refer to
    `bd_www.Config`). This option can be used when this module runs on a
    platform that (for reasons of security) has limited external site access.

    All url's are absolute, also in the resulting *scr_links*, *scr_redirs*
    and *scr_unlinked* tables. The `post_process_scrape` should be used to
    normalise these url's and determine the type of the unlinked pages.

    The returned canonical wcm url is the content attribute of the `<meta
    name="DCTERMS.identifier">` tag. This is the url as generated by the WCM
    system.

    The returned value will be `None` when the requested url

    -   was already scraped (and saved)
    -   redirects to an out-of-scope page
    -   gave an unexpected response (which is logged)
    """

    def check_ext_link(
            lnk: str, trusted_only: bool = False) -> tuple[str, str]:
        """
        **Return response status and type of an external link.**

        Arguments:

            lnk: url to be checked
            trusted_only: when true, only links to trusted sites will be
                checked; the returned response status will reflect this

        Returns:

            response status, link type

        The returned `response status` is the status code from the primary
        response upon requesting the given url.

        By default the returned `link type` will be 'ext', but when the
        response is a redirect, depending on the destination url, '-to-int'
        or '-to-ext' is added to this type.

        When requesting the url hits an error, it is returned via the
        `response status`.

        The sites to be trusted are taken from the `trusted_domains`
        configuration parameter in the [SCRAPE] section of the configuration
        file (see `bd_www.Config`).
        """
        l_type = 'ext'
        if trusted_only:
            re_result = re.match(r'^(.*\.)?(.*\..*)$', urlparse(lnk).netloc)
            link_site = re_result[2] if re_result else None
            if link_site not in scrape_conf.trusted_domains:
                return 'unable to check link', l_type
        try:
            response = requests.get(
                lnk, allow_redirects=False, verify=False, timeout=5)
        except Exception as err:
            resp_status = f'error while checking link: {err}'
        else:
            resp_status = response.status_code
            if resp_status in (301, 302, 303, 307):
                location = response.headers['Location']
                if location.startswith(ROOT_URL):
                    l_type += '-to-int'
                else:
                    l_type += '-to-ext'
        return str(resp_status), l_type

    def save_page(path: str, doc: str) -> Union[int, None]:
        """
        **Save scraped page to the page source zip.**

        Arguments:

            path: relative to the ROOT_URL of the scrape
            doc: page source

        Returns:

            page_id or None if already saved

        If the page was not saved before, it will be saved with the page_id
        requested from the `mst_paths` table. A new page_id wil be created if
        needed.
        """
        qry_res = mst_conn.execute(
            'SELECT page_id FROM mst_paths WHERE path = ?',
            [path]).fetchone()
        if qry_res:
            p_id = qry_res[0]
        else:
            # This is a brand new page
            p_id = mst_conn.execute(
                'INSERT INTO mst_paths (path) VALUES (?)',
                [path]).lastrowid
        if p_id in page_zip.page_ids():
            # Page already saved during this scrape
            return None
        else:
            # Page not saved yet during this scrape
            page_zip.add_page(p_id, doc)
            mst_conn.execute('INSERT INTO scr_page_ids VALUES (?)', [p_id])
        return p_id

    # Setup some variables
    page_as_string = ''
    soup = None
    redir_qry = '''
        INSERT OR IGNORE INTO scr_redirs
            (req_url, redir_url, redir_type)
        VALUES
            (?, ?, ?)
        '''

    # Request loop
    while True:
        # Keep requesting until no further rewrites or redirects occur,
        # while saving redirects to the temporary table and updating the
        # urls_done admin.
        # noinspection PyBroadException
        try:
            resp = requests.get(req_url, allow_redirects=False)
        except Exception:
            # The requester failed on this url; consider it done and continue.
            logger.error(f'failed to request {req_url}')
            urls_done.add(req_url)
            return None
        urls_done.add(req_url)

        # Check and handle status codes
        if resp.status_code == 200:
            pass
        elif resp.status_code in (301, 302):
            # This concerns a redirect response
            redir_url = resp.headers['Location']
            # If request or redir is in scope, save the redirect with full
            # url's to the temporary table.
            if (req_url.startswith(ROOT_URL)
                    or redir_url.startswith(ROOT_URL)):
                mst_conn.execute(
                    redir_qry, [req_url, redir_url, resp.status_code])
            if redir_url.startswith(ROOT_URL):
                # restart Request loop with redirected url
                req_url = redir_url
                continue
            else:
                # Redirect url is out of scope.
                return None
        else:
            # Returned status code is not 200, 301 or 302, so getting the page
            # did not succeed. Consider it done and return a None result.
            logger.error(f'unexpected response from {req_url}; '
                         f'status code is {resp.status_code}.')
            return None

        # Read and parse the response into a soup document
        page_as_string = resp.text
        soup = BeautifulSoup(page_as_string, features='lxml')

        # Has this page a client-side redirect by means of header tag
        # <meta http-equiv="refresh" content="0;url=...">?
        meta_tag = soup.find('meta', attrs={'http-equiv': 'refresh'})
        if meta_tag:
            redir_url = meta_tag['content'].split('url=')[1]
            redir_url = urljoin(req_url, redir_url)
            # If request or redir is in scope, save the redirect with full
            # url's to the temporary table.
            if (req_url.startswith(ROOT_URL)
                    or redir_url.startswith(ROOT_URL)):
                mst_conn.execute(redir_qry, [req_url, redir_url, 'client'])
            if redir_url.startswith(ROOT_URL):
                # Restart Request loop with redirected url
                req_url = redir_url
                continue
            else:
                # Redirect url is out of scope.
                return None

        # End of the Request loop.
        # Reaching this point means we're done requesting this page.
        break

    # Strip potential parameters and anchor in the response url
    resp_url = resp.url.split('?')[0].split('#')[0]
    # Determine the wcm url
    if resp_url.startswith(ROOT_URL):
        # Response url is in scope. Now the wcm url should be in the header
        # meta tag named 'DCTERMS.identifier'.
        meta_tag = soup.find(
            'meta', attrs={'name': 'DCTERMS.identifier'})
        if meta_tag:
            wcm_url = meta_tag['content']
            wcm_url = urljoin(resp_url, wcm_url)
            if wcm_url != resp_url:
                # Save this as an alias in the redirects table
                mst_conn.execute(redir_qry, [resp_url, wcm_url, 'alias'])
                # Consider also the wcm_url as done
                urls_done.add(wcm_url)
        else:
            logger.error(f'page without wcm url; '
                         f'falling back to: {resp_url}')
            wcm_url = resp_url
    else:
        # Response url is not in scope.
        # Because of the logic of the code above, this situation should
        # never occur. Since this is not destructive, scraping continues.
        logger.error(f'out of scope response from {req_url}')
        return None

    # Save in-scope page under the wcm path.
    # It is possible that a page in the urls_todo set redirects to a wcm_url
    # of which the page/path combination is saved already. Especially some
    # short url's cause this situation. Those cases are not saved (again).
    wcm_path = wcm_url.split(ROOT_URL)[1]
    page_id = save_page(wcm_path, page_as_string)
    if not page_id:
        # Page was saved and handled earlier during this scrape
        return None

    # Fetch all in-page links, save the editorial links to the scrape db and
    # add links of in-scope pages to the urls_todo set.
    # Cycle over the editorial, automated and rest (partial) trees of the page
    for tree in split_tree(soup):
        editorial = tree.html['tree'] == 'editorial'

        # Cycle over all links in the tree
        for a_tag in tree.find_all('a', href=True):
            # Unquote the link to be sure (history taught this)
            link = unquote(a_tag['href'].strip())
            link_text = a_tag.text.strip()
            # Some links are discarded
            if link.startswith('#'):
                # Plain #-links
                continue
            if link == '/':
                # Home page links
                continue
            if 'readspeaker' in link or 'adobe' in link or 'java' in link:
                # Non content links
                continue
            link = urljoin(resp_url, link)
            # Separate anchor from in-scope url
            if link.startswith(ROOT_URL) and '#' in link:
                link, _, anchor = link.partition('#')
            else:
                anchor = None
            if link == resp_url:
                # Discard in-page link (in case it was not a plain #-link)
                continue
            if link and editorial:
                link_type = None
                link_status = None
                # Check if the link is new for this scrape
                qry = f'''
                    SELECT link_nr
                    FROM scr_links
                    WHERE url = '{link}'
                        AND anchor IS {f"'{anchor}'" if anchor else 'NULL'}
                    '''
                qry_result = mst_conn.execute(qry).fetchone()
                if qry_result:
                    # This link was already handled during this scrape
                    link_nr = qry_result[0]
                else:
                    if re.match('^https?://', link):
                        # Get type and status of the link
                        if link.startswith(ROOT_URL):
                            link_type = 'int'
                        else:
                            # The link is external
                            link_status, link_type = check_ext_link(
                                link + '#' + anchor if anchor else link,
                                not check_all_links)
                            # If an external link redirects to an internal
                            # page, it should be scraped too. This is true for
                            # short url's, since these are strictly external
                            # relative to the root url.
                            if link_type == 'ext-to-int':
                                if link not in urls_done:
                                    urls_todo.add(link)
                    # Store the link for this scrape
                    link_nr = mst_conn.execute('''
                        INSERT INTO scr_links
                            (url, anchor, link_type, status)
                        VALUES
                            (?, ?, ?, ?)
                        ''', [link, anchor, link_type, link_status]).lastrowid
                # Save link in relation to the page (can be a duplicate if
                # the link is used more than once with the same text in the
                # same page).
                mst_conn.execute('''
                    INSERT INTO scr_ed_links (page_id, text, link_nr)
                    VALUES (?, ?, ?)
                    ''', [page_id, link_text, link_nr])
            if link:
                # Discard links that do not need to be scraped
                if link.endswith('.xml'):
                    continue
                if not link.startswith(ROOT_URL):
                    # Link not in scope
                    continue
                # Strip potential parameters (anchor was separated from
                # in-scope url before).
                valid_scrape_url = link.split('?')[0]
                if valid_scrape_url in urls_done:
                    # Already handled
                    continue
                urls_todo.add(valid_scrape_url)
            else:
                # This condition is probably not possible
                logger.warning(f'empty link in page {page_id}')

    return wcm_url

Scrape an html page and save the results to the scrape tables.

Arguments:

req_url: url of requested page
urls_done: url's that have been scraped already
urls_todo: url's that still need to be scraped
page_zip: archive to store scraped page sources
check_all_links: when false, only trusted links will be checked

Returns:

canonical wcm url (or None for mentioned cases)

Get the final response from the requested url while collecting any redirect that occurs during the request. Register the page under the canonical wcm path and save the page source to the zip-archive. Further store page-related data, editorial links and redirects to the scrape tables of the scrapes database (documented with the create_scrapes_storage function). The urls_done and urls_todo sets are updated accordingly, driving the further scrape of the site.

When check_all_links is True, all links within the page that are external (do not start with bd_www.constants.ROOT_URL) will be checked. When set to False, only those external links will be checked that go to a site in the trusted_domains configuration parameter (refer to bd_www.Config). This option can be used when this module runs on a platform that (for reasons of security) has limited external site access.

All url's are absolute, also in the resulting scr_links, scr_redirs and scr_unlinked tables. The post_process_scrape should be used to normalise these url's and determine the type of the unlinked pages.

The returned canonical wcm url is the content attribute of the <meta name="DCTERMS.identifier"> tag. This is the url as generated by the WCM system.

The returned value will be None when the requested url

  • was already scraped (and saved)
  • redirects to an out-of-scope page
  • gave an unexpected response (which is logged)
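
The fragment below is a minimal, hypothetical sketch of how this function drives a crawl. It is not part of the module (whose functions are not meant to be called separately) and it assumes a scrape timestamp ts, a PageSourceZip archive for that timestamp and freshly prepared scrape tables; the parameter order follows the documented argument list.

from bd_www import PageSourceZip
from bd_www.constants import ROOT_URL
from bd_www.scrape import scrape_page

ts = '240101-0400'              # hypothetical timestamp [yymmdd-hhmm]
page_zip = PageSourceZip(ts)    # archive for the page sources of this scrape
urls_done = set()
urls_todo = {ROOT_URL}          # assumption: seed the crawl with the root url

while urls_todo:
    req_url = urls_todo.pop()
    wcm_url = scrape_page(
        req_url, urls_done, urls_todo, page_zip, check_all_links=False)
    # wcm_url is None for pages that were saved before, for out-of-scope
    # redirects and for unexpected responses; the crawl just continues.
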
#   def split_tree(soup: bs4.BeautifulSoup) -> (bs4.BeautifulSoup, bs4.BeautifulSoup, bs4.BeautifulSoup):
View Source
def split_tree(
        soup: BeautifulSoup) -> (BeautifulSoup, BeautifulSoup, BeautifulSoup):
    """
    **Split a soup doc in three separate docs: editorial, automated and rest.**

    Arguments:

        soup: bs4 representation of the page to be split

    Returns:

        editorial doc, automated doc, rest doc

    The *editorial doc* is a copy of the originating doc with all tags
    removed that do not contain editorial content. The *automated doc* is a
    new html (root) tag containing all tags with automated content. The *rest
    doc* is an html (root) tag containing all tags that did not represent
    editorial or automated content. No tags are lost, so all three docs
    together contain the same tags as the originating doc, but only the
    hierarchical structure of the *editorial doc* remains intact.

    To identify the separate docs, a `tree` attribute is added to the `<html>`
    tag with values 'editorial', 'automated' and 'rest' respectively.

    More specifically, the three docs will contain the next tags from the
    originating doc.

    ***rest doc***:

    - `<head>`
    - `<header>`
    - `<footer>`
    - `<div id="bld-nojs">`: for the situation that javascript is not active
    - `<div class="bld-subnavigatie">`: left side navigation of bib pages
    - `<div class="bld-feedback">`: bottom feedback of content page
    - `<div>` with readspeaker buttons
    - `<div>` with modal dialog for the virtual assistant
    - `<script>` and `<noscript>`
    - comments

    ***automated doc***:

    - `<div class="add_content">`: automatically added content
    - all remaining tags from pages with type bld-overview

    ***editorial doc***:

    - all tags that do not go to one of the other docs
    """

    # Make working copy from soup doc, because of destructive tag removal
    tree = copy.copy(soup)
    # Create rest doc as html trunk
    rst_tree = BeautifulSoup('<html tree="rest"></html>', features='lxml')

    # All that is not needed for the editorial or automated trees goes to rest:
    # - head, header and footer
    if tree.head:
        rst_tree.html.append(tree.head.extract())
    if tree.body.header:
        rst_tree.html.append(tree.body.header.extract())
    if tree.body.footer:
        rst_tree.html.append(tree.body.footer.extract())
    # - content for non active javascript
    div_nojs = tree.find('div', id='bld-nojs')
    if div_nojs:
        rst_tree.html.append(div_nojs.extract())
    # - sub-navigation
    div_subnav = tree.find(class_='bld-subnavigatie')
    if div_subnav:
        rst_tree.html.append(div_subnav.extract())
    # - feedback
    div_feedback = tree.find(class_='bld-feedback')
    if div_feedback:
        rst_tree.html.append(div_feedback.extract())
    # - readspeaker buttons
    for tag in tree.find_all('div', class_='rs_skip'):
        rst_tree.html.append(tag.extract())
    # - modal dialog for the virtual assistant
    for tag in tree.find_all('div', id='vaModal'):
        rst_tree.html.append(tag.extract())
    # - comments
    for tag in tree.find_all(string=lambda text: isinstance(text, Comment)):
        rst_tree.html.append(tag.extract())
    # - scripts
    for tag in tree.find_all(('script', 'noscript')):
        rst_tree.html.append(tag.extract())
    # - no-support div
    for tag in tree.find_all(id='bld-nosupport'):
        rst_tree.html.append(tag.extract())

    # Is the page generated without any editor intervention?
    if tree('body', attrs={'data-pagetype': 'bld-overview'}):
        # Then editor tree becomes trunk only
        ed_tree = BeautifulSoup(
            '<html tree="editorial"></html>', features='lxml')
        # And automated tree gets all remaining content
        aut_tree = tree
        aut_tree.html['tree'] = 'automated'
    else:
        # Make new html trunk and graft all automated content on it
        aut_tree = BeautifulSoup(
            '<html tree="automated"></html>', features='lxml')
        for tag in tree.find_all('div', class_='content_add'):
            aut_tree.html.append(tag.extract())
        # Editorial tree gets all remaining content
        ed_tree = tree
        ed_tree.html['tree'] = 'editorial'

    return ed_tree, aut_tree, rst_tree

Split a soup doc in three separate docs: editorial, automated and rest.

Arguments:

soup: bs4 representation of the page to be split

Returns:

editorial doc, automated doc, rest doc

The editorial doc is a copy of the originating doc with all tags removed that do not contain editorial content. The automated doc is a new html (root) tag containing all tags with automated content. The rest doc is an html (root) tag containing all tags that did not represent editorial or automated content. No tags are lost, so all three docs together contain the same tags as the originating doc, but only the hierarchical structure of the editorial doc remains intact.

To identify the separate docs, a tree attribute is added to the <html> tag with values 'editorial', 'automated' and 'rest' respectively.

More specifically, the three docs will contain the next tags from the originating doc.

rest doc:

  • <head>
  • <header>
  • <footer>
  • <div id="bld-nojs">: for the situation that javascript is not active
  • <div class="bld-subnavigatie">: left side navigation of bib pages
  • <div class="bld-feedback">: bottom feedback of content page
  • <div> with readspeaker buttons
  • <div> with modal dialog for the virtual assistant
  • <script> and <noscript>
  • comments

automated doc:

  • <div class="add_content">: automatically added content
  • all remaining tags from pages with type bld-overview

editorial doc:

  • all tags that do not go to one of the other docs
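
As an illustration (a made-up fragment, not taken from the module or the live site), splitting a small page shows the three docs and their tree attribute; the actual tag distribution of course depends on the page content.

from bs4 import BeautifulSoup
from bd_www.scrape import split_tree

html = '''<html><head><title>Demo</title></head>
<body data-pagetype="bld-concept">
<header>site header</header>
<p>Editorial paragraph.</p>
<footer>site footer</footer>
</body></html>'''

for doc in split_tree(BeautifulSoup(html, features='lxml')):
    print(doc.html['tree'], '->', [t.name for t in doc.html.find_all(True)])
# expected output (with the lxml parser):
# editorial -> ['body', 'p']
# automated -> []
# rest -> ['head', 'title', 'header', 'footer']
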
#   def find_unscraped_urls( ts: str, urls_done: Set[str], historic: bool = True, requested: bool = False ) -> Set[str]:
View Source
def find_unscraped_urls(
        ts: str, urls_done: Set[str],
        historic: bool = True, requested: bool = False) -> Set[str]:
    """
    **Get url's that were not found via link crawling.**

    Arguments:

        ts: timestamp of the scrape [yymmdd-hhmm]
        urls_done: url's that have been scraped already
        historic: check all url's that were ever scraped before
        requested: check all url's that were requested during the last 8 days

    Returns:

        originating url's that responded with internal url's that were not
        scraped yet

    When scraping via link crawling is exhausted, this function should be
    used to find url's of pages that do exist, but were not scraped yet. The
    next url sets will be verified (partially depending on parameters from
    the configuration file; see further `bd_www.Config`):

    - latest unlinked url's (depending on `use_unlinked_urls` configuration
      parameter)
    - url's from the sitemap(s), located via the 'robots.txt' file (depending
      on `use_sitemap` configuration parameter)
    - url's that were scraped during all previous scrapes
    - url's that were requested during the last eight days (including today)
    - url's encountered during earlier scrapes that redirected to internal
      pages (implicitly including short url's)

    The 'robots.txt' and 'sitemap.xml' files that are used will be saved in
    case they differ from the previously saved version.

    The (fully qualified) response url's of the non scraped pages are stored
    in the *scr_unlinked* table of the scrapes database.
    """

    def save_scrape_file(content: str, name: str, dest_dir: Path) -> None:
        """
        **Save content to file if it differs from the most recent version.**

        Arguments:

            content: content to be saved
            name: name of the file to save to
            dest_dir: directory to save in

        The name with which the content is saved will be preceded by the
        timestamp and an underscore character.

        The content will also be saved if `dest_dir` refers to an empty
        directory.
        """
        dir_files = sorted(dest_dir.iterdir())
        if dir_files:
            with open(dir_files[-1]) as lf:
                last_content = lf.read()
        if not dir_files or content != last_content:
            with open(dest_dir / f'{ts}_{name}', 'w') as nf:
                nf.write(content)

    def living(all_urls: Set[str], set_name: str) -> Set[str]:
        """
        **Return and register url's of living pages that were not scraped yet.**

        Arguments:

            all_urls: url's to check
            set_name: identification used for logging

        Returns:

            originating url's that responded with living pages

        Apart from returning the set of originating url's that responded with
        an active internal page that was not scraped yet, the response url's
        of these pages are stored in the `scr_unlinked` table of the scrapes
        database.

        Duration and results will be logged.
        """
        start = time.time()
        missed_urls = set()
        for u in all_urls - urls_done:
            # noinspection PyBroadException
            try:
                rsp = requests.get(u)
            except Exception:
                continue
            if rsp.status_code // 100 != 4 and rsp.url.startswith(ROOT_URL):
                # Url is active and responds with an internal page
                missed_urls.add(u)
                if rsp.url not in urls_done:
                    mst_conn.execute(
                        'INSERT OR IGNORE INTO scr_unlinked (url) '
                        'VALUES (?)', [rsp.url])
                    logger.debug(f"unlinked {set_name} url's: {rsp.url}")

        # Wrapping up
        secs = int(time.time() - start)
        msg = 'link crawling '
        if missed_urls:
            msg += f"missed {len(missed_urls)} {set_name} url's"
        else:
            msg += f"did not miss any {set_name} url's"
        msg += f'; checking took {secs // 60}:{secs % 60:02} min'
        logger.info(msg)
        return missed_urls

    still_to_scrape = set()

    # Verify previously unlinked url's
    if scrape_conf.use_unlinked_urls:
        qry = f'''
            WITH
                last_unlinked_pages AS (
                    SELECT DISTINCT
                        page_id,
                        last_value(unl_type) OVER (
                            PARTITION BY page_id
                            ORDER BY timestamp
                            ROWS BETWEEN UNBOUNDED PRECEDING
                                AND UNBOUNDED FOLLOWING
                        ) AS unl_type
                    FROM his_unlinked
                ),
                last_unlinked_urls AS (
                    SELECT DISTINCT
                        url,
                        last_value(unl_type) OVER (
                            PARTITION BY url
                            ORDER BY timestamp
                            ROWS BETWEEN UNBOUNDED PRECEDING
                                AND UNBOUNDED FOLLOWING
                        ) AS unl_type
                    FROM his_unlinked
                )
            SELECT '{ROOT_URL}' || path
            FROM last_unlinked_pages
            LEFT JOIN mst_paths USING (page_id)
            WHERE path NOTNULL AND unl_type != 'absent'
                UNION
            SELECT url
            FROM last_unlinked_urls
            WHERE url NOTNULL AND unl_type != 'absent'
            '''
        unlinked_urls = {r[0] for r in mst_conn.execute(qry).fetchall()}
        still_to_scrape |= living(unlinked_urls, 'unlinked')

    # Verify url's from the sitemap(s)
    if scrape_conf.use_sitemap:
        # Find the sitemap via robots.txt
        robots_url = WWW_SITE + '/robots.txt'
        response = requests.get(robots_url)
        if response.status_code != 200:
            logger.info(
                f'getting {robots_url} returned status {response.status_code}')
        else:
            robots_txt = response.text
            robots_dir = mst_dir / scrape_conf.robots_dir_name
            save_scrape_file(robots_txt, 'robots.txt', robots_dir)
            sitemap_urls = re.findall(r'Sitemap:\s(https?://\S*)', robots_txt)
            if len(sitemap_urls) == 0:
                logger.info('no sitemap.xml declared in robots.txt')
            else:
                urls_from_sitemaps = set()
                sitemap_dir = mst_dir / scrape_conf.sitemap_dir_name
                for i, url in enumerate(sitemap_urls, start=1):
                    response = requests.get(url)
                    if response.status_code != 200:
                        logger.info(f'getting {url} returned status '
                                    f'{response.status_code}')
                    else:
                        logger.info(f'sitemap found at {url}')
                        if len(sitemap_urls) == 1:
                            filename = f'sitemap.xml'
                        else:
                            filename = f'sitemap({i}).xml'
                        save_scrape_file(response.text, filename, sitemap_dir)
                        soup = BeautifulSoup(response.text, features='lxml')
                        urls_from_sitemaps |= {loc.text for loc
                                               in soup.findAll('loc')}
                still_to_scrape |= living(urls_from_sitemaps, 'sitemap')

    # Verify all historic url's
    if historic:
        qry_result = mst_conn.execute(
            'SELECT path FROM mst_paths').fetchall()
        historic_urls = {ROOT_URL + path for (path,) in qry_result}
        still_to_scrape |= living(historic_urls, 'historic')

    # Verify url's that were requested during the last eight days
    if requested:
        # noinspection PyBroadException
        # This try clause is used since Matomo is very unpredictable
        try:
            requested_urls = urls_last_days()
        except Exception:
            logger.critical('using urls_last_days() failed')
        else:
            # Check which do exist but are not crawled yet
            still_to_scrape |= living(requested_urls, 'visited')

    # Verify all previously requested url's that initiated redirects
    qry_result = mst_conn.execute(
        'SELECT DISTINCT req_url FROM his_redirs').fetchall()
    redirected_urls = {url for (url,) in qry_result}
    still_to_scrape |= living(redirected_urls, 'redirected')

    return still_to_scrape

Get url's that were not found via link crawling.

Arguments:

ts: timestamp of the scrape [yymmdd-hhmm]
urls_done: url's that have been scraped already
historic: check all url's that were ever scraped before
requested: check all url's that were requested during the last 8 days

Returns:

originating url's that responded with internal url's that were not
scraped yet

When scraping via link crawling is exhausted, this function should be used to find url's of pages that do exist, but were not scraped yet. The next url sets will be verified (partially depending on parameters from the configuration file; see further bd_www.Config):

  • latest unlinked url's (depending on use_unlinked_urls configuration parameter)
  • url's from the sitemap(s), located via the 'robots.txt' file (depending on use_sitemap configuration parameter)
  • url's that were scraped during all previous scrapes
  • url's that were requested during the last eight days (including today)
  • url's encountered during earlier scrapes that redirected to internal pages (implicitly including short url's)

The 'robots.txt' and 'sitemap.xml' files that are used will be saved in case they differ from the previously saved version.

The (fully qualified) response url's of the non scraped pages are stored in the scr_unlinked table of the scrapes database.
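
A hypothetical continuation of the crawl sketched with scrape_page could look as follows; it assumes the ts, urls_done and urls_todo variables from that sketch.

from bd_www.scrape import find_unscraped_urls

# When link crawling is exhausted (urls_todo is empty), pick up what the
# crawl missed and continue the scrape_page loop with the extra url's.
extra_urls = find_unscraped_urls(ts, urls_done, historic=True, requested=False)
urls_todo |= extra_urls
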

#   def post_process_scrape() -> None:
View Source
def post_process_scrape() -> None:
    """
    **Update tables after finishing a scrape.**

    After scraping all pages using the `scrape_page` function,
    this `post_process_scrape` function needs to be used to update some
    tables in the scrapes database. This is essential before the scraped
    data can be processed further.

    This post-processing consists of two steps:

    **Step 1** - Url normalisation

    Convert the full url's in the *scr_links*, *scr_redirs* and
    *scr_unlinked* tables to `page_id`'s where possible.

    ***Remark:*** *This can result in `page_id`'s of url's/pages that were
    not alive during the scrape (especially in the `scr_links` table).*

    **Step 2** - Typing unlinked pages

    Determine the type of the pages in the *scr_unlinked* table by setting
    *unl_type* to:

    - 'alias' - when the page is an alias to a wcm page
    - 'special' - when the id is in the *non_orphan_ids* table of the scrapes
      database
    - 'orphan' - in all other cases

    **Finally**

    Because some of the transactions impact the size of the scrapes database,
    it is vacuumed before returning.
    """

    # Replace url's in the scr_links table with page_id's where possible and
    # combine separated anchors with remaining urls (during the scrape anchors
    # were separated from internal url's).
    mst_conn.execute(f'''
        UPDATE scr_links
        SET
            page_id = (
                SELECT page_id
                FROM mst_paths
                WHERE url = '{ROOT_URL}' || path),
            url = NULL
        WHERE url IN (SELECT '{ROOT_URL}' || path FROM mst_paths)
        ''')
    mst_conn.execute('''
        UPDATE scr_links
        SET url = url || '#' || anchor,
            anchor = NULL
        WHERE url NOTNULL AND anchor NOTNULL
        ''')

    # Replace request and redirect url's in the redirs table with page_id's
    # where possible.
    mst_conn.execute(f'''
        UPDATE scr_redirs
        SET
            req_id = (
                SELECT page_id
                FROM mst_paths
                WHERE req_url = '{ROOT_URL}' || path),
            req_url = NULL
        WHERE req_url IN
            (SELECT '{ROOT_URL}' || path FROM mst_paths)
        ''')
    mst_conn.execute(f'''
        UPDATE scr_redirs
        SET
            redir_id = (
                SELECT page_id
                FROM mst_paths
                WHERE redir_url = '{ROOT_URL}' || path),
            redir_url = NULL
        WHERE redir_url IN
            (SELECT '{ROOT_URL}' || path FROM mst_paths)
        ''')

    # Replace url's in the 'scr_unlinked' table with page_id's where possible
    # and qualify the unlinked url's.
    mst_conn.execute(f'''
        UPDATE scr_unlinked
        SET
            page_id = (
                SELECT page_id
                FROM mst_paths
                WHERE url = '{ROOT_URL}' || path),
            url = NULL
        WHERE url IN (SELECT '{ROOT_URL}' || path FROM mst_paths)
        ''')
    mst_conn.execute('''
        WITH
            -- Url-based redir chains
            redir_chains (org_url, prev_url, next_id, next_url, next_type) AS (
                SELECT
                    req_url, NULL, redir_id, redir_url, redir_type
                FROM scr_redirs
                    UNION ALL
                SELECT
                    org_url, next_url, r.redir_id, r.redir_url, r.redir_type
                FROM redir_chains AS c
                LEFT JOIN scr_redirs AS r ON c.next_url = r.req_url
                -- avoid endless loops by excluding redirs of an alias
                WHERE c.next_type <> 'alias'
            ),
            -- All redir chains ending in a wcm url
            scr_redirs_to_wcm (req_url, wcm_id, final_type) AS (
                SELECT org_url, next_id, next_type
                FROM redir_chains
                WHERE next_id IS NOT NULL AND org_url IS NOT NULL
            )
        UPDATE scr_unlinked
        SET unl_type =
            iif(page_id IN non_orphan_ids,
                'special',
                iif((SELECT final_type
                     FROM scr_redirs_to_wcm
                     WHERE url = req_url) = 'alias',
                    'alias',
                    'orphan'
                   )
               )
        ''')
    logger.info('scr_links, scr_redirs and scr_unlinked tables '
                'have been normalised')

    # Finally
    mst_conn.execute('VACUUM')

Update tables after finishing a scrape.

After scraping all pages using the scrape_page function, this post_process_scrape function needs to be used to update some tables in the scrapes database. This is essential before the scraped data can be processed further.

This post-processing consists of two steps:

Step 1 - Url normalisation

Convert the full url's in the scr_links, scr_redirs and scr_unlinked tables to page_id's where possible.

Remark: This can result in page_id's of url's/pages that were not alive during the scrape (especially in the scr_links table).

Step 2 - Typing unlinked pages

Determine the type of the pages in the scr_unlinked table by setting unl_type to:

  • 'alias' - when the page is an alias to a wcm page
  • 'special' - when the id is in the non_orphan_ids table of the scrapes database
  • 'orphan' - in all other cases

Finally

Because some of the transactions impact the size of the scrapes database, it is vacuumed before returning.
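
One plausible ordering of the post-scrape steps, pieced together from the docstrings in this module (a sketch, not a prescribed recipe; ts is the scrape timestamp):

from bd_www.scrape import (
    post_process_scrape, extract_pages_info, derive_pages_info, status_figures)

post_process_scrape()      # normalise url's and type the unlinked pages
extract_pages_info(ts)     # fill the extracted fields of scr_pages_info
derive_pages_info()        # add the derived business and category fields
status_figures()           # refresh the scr_status_figures table
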

#   def extract_pages_info(ts: str) -> None:
View Source
def extract_pages_info(ts: str) -> None:
    """
    **Renew content of *scr_pages_info* table with extracted information.**

    Arguments:

        ts: timestamp of the scrape [yymmdd-hhmm]

    After deleting any previous content from the *scr_pages_info* table,
    this function fills the 'extracted' data fields of this table for each
    scraped page with the following content:

    - *title* - content of `<title>` tag
    - *description* - content of `<meta name="description" content="..." />`
    - *num_h1s* - number of `<h1>` tags
    - *first_h1* - content of first `<h1>` tag
    - *language* - content of `<meta name="language" content="xx" />`
    - *modified* - content of `<meta name="DCTERMS.modified" content="..." />`
    - *pagetype* - attribute value of `<body data-pageType="...">`
    - *classes* - attribute value of `<body class="...">`
    - *ed_text* - editorial text of the page
    - *aut_text* - automated text of the page

    The *scr_pages_info* table accommodates additional fields to contain
    derived information for each page. This is further detailed in the
    `derive_pages_info` function.

    It will be logged when tags or attributes are missing or values are
    invalid.

    ***Data note:*** *Extracted information concerns data that is readily
    available within the page source. Storing this data in a separate table
    is strictly redundant, but allows faster access.*

    ***Implementation note:*** *The fields of the* scr_pages_info *table are
    defined by the constants `EXTRACTED_FIELDS` and `DERIVED_FIELDS`.
    Although some groundwork is laid to adapt the number and content of these
    fields, the actual code does not fully support such an adaptation.*
    """

    logger.info('extracting pages info started')

    src_zip = PageSourceZip(ts)
    num_pages = len(src_zip.page_ids())
    start_time = time.time()
    mst_conn.execute('DELETE FROM scr_pages_info')

    # Cycle over all pages
    for page_num, (page_id, doc) in enumerate(src_zip.iter_pages(), start=1):
        qry = f'SELECT path FROM mst_paths WHERE page_id = {page_id}'
        path = mst_conn.execute(qry).fetchone()[0]
        soup = BeautifulSoup(doc, features='lxml')
        info = {'page_id': page_id}

        # Get title
        title = soup.title
        if not title:
            logger.warning(f'page has no <title> tag: {path}')
            title = None
        else:
            title = title.text
            if not title:
                logger.warning(f'page with empty title: {path}')
        info['title'] = title

        # Get description
        description = soup.find(attrs={'name': 'description'})
        if not description:
            # This situation occurs very often
            logger.debug(
                f'page has no <meta name="description"/> tag: {path}')
            description = None
        else:
            description = description['content']
            if not description:
                logger.warning(f'page with empty description: {path}')
        info['description'] = description

        # Get info from <h1> tags
        h1s = []
        for h1 in soup.find_all('h1'):
            h1s.append(h1.text)
        if not h1s:
            # h1 tag can be hidden in a script (e.g. search page)
            if h1s := re.findall(r'<h1>(.*)</h1>', doc):
                logger.warning(f'page with hidden h1-tag: {path}')
            else:
                logger.warning(f'page without h1: {path}')
        info['num_h1s'] = len(h1s)
        info['first_h1'] = h1s[0] if h1s else None

        # Get language
        language = soup.find('meta', attrs={'name': 'language'})
        if not language:
            logger.warning(
                f'page has no <meta name="language"/> tag: {path}')
            language = None
        else:
            language = language['content']
            if not language:
                logger.warning(f'page with empty language: {path}')
        info['language'] = language

        # Get date modified
        modified = soup.find('meta', attrs={'name': 'DCTERMS.modified'})
        if not modified:
            logger.warning(
                f'page has no tag <meta name="DCTERMS.modified"/>: {path}')
            modified = None
        else:
            try:
                modified = dt.date.fromisoformat(modified['content'])
            except ValueError:
                logger.warning(
                    f'page with improper modification date: {path}')
                modified = None
        info['modified'] = modified

        # Get type of page
        if 'data-pagetype' not in soup.body.attrs:
            logger.warning('page has no data-pagetype attribute in the '
                           f'<body> tag: {path}')
            pagetype = None
        else:
            pagetype = soup.body['data-pagetype']
            if not pagetype:
                logger.warning(
                    f'page with empty pagetype in <body> tag: {path}')
        info['pagetype'] = pagetype

        # Get classes
        if 'class' not in soup.body.attrs:
            logger.warning(
                f'page has no class attribute in the <body> tag: {path}')
            classes = None
        else:
            classes = soup.body['class']
            if not classes:
                logger.warning(
                    f'page with empty class in <body> tag: {path}')
        info['classes'] = ' '.join(classes) if classes else None

        # Get editorial and automated texts
        info['ed_text'], info['aut_text'] = get_text(soup)

        # add info to the database
        fields = ', '.join(info)
        q_marks = ('?, ' * len(info))[:-2]
        mst_conn.execute(
            f'INSERT INTO scr_pages_info ({fields}) VALUES ({q_marks})',
            list(info.values()))

        # print progress and prognosis
        if page_num % 250 == 0:
            page_time = (time.time() - start_time) / page_num
            togo_time = int((num_pages - page_num) * page_time)
            print(
                f'extracting pages info '
                f'- togo: {num_pages - page_num} pages / '
                f'{togo_time // 60}:{togo_time % 60:02} min')

    logger.info('extracting pages info finished')

Renew content of scr_pages_info table with extracted information.

Arguments:

ts: timestamp of the scrape [yymmdd-hhmm]

After deleting any previous content from the scr_pages_info table, this function fills the 'extracted' data fields of this table for each scraped page with the following content:

  • title - content of <title> tag
  • description - content of <meta name="description" content="..." />
  • num_h1s - number of <h1> tags
  • first_h1 - content of first <h1> tag
  • language - content of <meta name="language" content="xx" />
  • modified - content of <meta name="DCTERMS.modified" content="..." />
  • pagetype - attribute value of <body data-pageType="...">
  • classes - attribute value of <body class="...">
  • ed_text - editorial text of the page
  • aut_text - automated text of the page

The scr_pages_info table accommodates additional fields to contain derived information for each page. This is further detailed in the derive_pages_info function.

It will be logged when tags or attributes are missing or values are invalid.

Data note: Extracted information concerns data that is readily available within the page source. Storing this data in a separate table is strictly redundant, but allows faster access.

Implementation note: The fields of the scr_pages_info table are defined by the constants EXTRACTED_FIELDS and DERIVED_FIELDS. Although some groundwork is laid to adapt the number and content of these fields, the actual code does not fully support such an adaptation.
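
After this function has run, the extracted fields can be inspected directly from the scr_pages_info table, for example with the module's mst_conn connection (a sketch):

from bd_www import mst_conn

rows = mst_conn.execute('''
    SELECT page_id, pagetype, language, title
    FROM scr_pages_info
    LIMIT 5
    ''').fetchall()
for page_id, pagetype, language, title in rows:
    print(page_id, pagetype, language, title)
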

#   def get_text(soup: bs4.BeautifulSoup) -> [str, str]:
View Source
def get_text(soup: BeautifulSoup) -> [str, str]:
    """
    **Retrieve essential editorial and automated text content of a page.**

    Arguments:

        soup: bs4 representation of a page

    Returns:

        editorial text, automated text

    The editorial and automated text of the page content is returned together
    as a twofold tuple. Basically the relevant texts are retrieved from
    partial trees containing only tags with editorial or automated content
    respectively. Whitespace within these texts is normalised and coherent
    chunks are separated by newlines.
    """

    result = []
    for tree in split_tree(soup):
        if tree.html['tree'] == 'rest':
            continue

        flatten_tagbranch_to_navstring(tree.html)

        # Replace non-breaking spaces with normal ones
        txt = tree.text.replace(b'\xc2\xa0'.decode(), ' ')

        # Substitute one space for any cluster of whitespace chars (getting rid
        # of returns, newlines, tabs, spaces, etc.; this is html, you know!).
        txt = re.sub(r'\s+', ' ', txt)

        # Change #br# markers (introduced while flattening the branches) to
        # newlines, while reducing multiples separated by whitespace only.
        # The final strip() removes potential trailing newlines.
        txt = re.sub(r'\s*(#br#\s*)+\s*', r'\n', txt).strip()

        result.append(txt)
        tree.decompose()

    return result

Retrieve essential editorial and automated text content of a page.

Arguments:

soup: bs4 representation of a page

Returns:

editorial text, automated text

The editorial and automated text of the page content is returned together as a twofold tuple. Basically the relevant texts are retrieved from partial trees containing only tags with editorial or automated content respectively. Whitespace within these texts is normalised and coherent chunks are separated by newlines.
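
A small, hypothetical illustration (the fragment below is not a real www.belastingdienst.nl page):

from bs4 import BeautifulSoup
from bd_www.scrape import get_text

html = '''<html><body data-pagetype="bld-concept">
<h1>Heading</h1>
<p>First   paragraph with <a href="/x">a link</a>.</p>
<p>Second paragraph.</p>
</body></html>'''

ed_text, aut_text = get_text(BeautifulSoup(html, features='lxml'))
print(ed_text)
# expected output:
# Heading
# First paragraph with a link.
# Second paragraph.
# aut_text is an empty string for this fragment
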

#   def flatten_tagbranch_to_navstring(tag: bs4.element.Tag) -> None:
View Source
def flatten_tagbranch_to_navstring(tag: Tag) -> None:
    """
    **Reduce a complete tag branch to one NavigableString.**

    Arguments:

        tag: part of BeautifulSoup structure that will be reduced

    Returns:

        None (tag is replaced in place with one NavigableString)

    The reduction is realised within the BeautifulSoup data structure that
    the tag is part of. This means that the function replaces the tag branch
    (in place) with a single NavigableString containing all text of the
    complete tag branch.

    The function uses a recursive tree traversal algorithm with a
    NavigableString as leaf. Each instance of the function will combine the
    text content of all children into one NavigableString. Within this string
    all `<br>` tags are replaced by `#br#` markers. The text content of all
    former `<p>`, `<h1>`, `<h2>`, `<h3>`, `<li>` and `<div>` tags in the tag
    branch is enclosed between two `#br#` markers in the resulting
    NavigableString. As such `#br#` markers act as separators between logical
    chunks of text. Due to the recursive flattening process the resulting
    NavigableString may contain multiple consecutive `#br#` markers. Since
    layout is of no concern, this has no significance however.
    """

    # Final leaf cases; done with this branch
    if type(tag) in {NavigableString, Stylesheet}:
        return

    # Does this tag have children other than NavigableStrings?
    tag_children = list(tag.children)
    child_types = {type(c) for c in tag_children}
    if tag_children and child_types != {NavigableString}:
        # Flatten recursively all child branches to NavigableStrings
        for c in tag_children:
            # noinspection PyTypeChecker
            flatten_tagbranch_to_navstring(c)

    # At this point all children (if any) of tag are NavigableStrings
    tag_name = tag.name
    if tag_name == 'br':
        tag.replace_with('#br#')
    elif tag_name == 'a':
        tag.replace_with(f' {tag.text}')  # The leading space is significant
    elif tag_name in {'p', 'h1', 'h2', 'h3', 'li', 'div'}:
        tag_text = tag.text
        tag.replace_with(f'#br#{tag_text}#br#')
    else:
        tag.replace_with(tag.text)

    return

Reduce a complete tag branch to one NavigableString.

Arguments:

tag: part of BeautifulSoup structure that will be reduced

Returns:

None (tag is replaced in place with one NavigableString)

The reduction is realised within the BeautifulSoup data structure that the tag is part of. This means that the function replaces the tag branch (in place) with a single NavigableString containing all text of the complete tag branch.

The function uses a recursive tree traversal algorithm with a NavigableString as leaf. Each instance of the function will combine the text content of all children into one NavigableString. Within this string all <br> tags are replaced by #br# markers. The text content of all former <p>, <h1>, <h2>, <h3>, <li> and <div> tags in the tag branch is enclosed between two #br# markers in the resulting NavigableString. As such #br# markers act as separators between logical chunks of text. Due to the recursive flattening process the resulting NavigableString may contain multiple consecutive #br# markers. Since layout is of no concern, this has no significance however.
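
For illustration (a made-up fragment, not module code), flattening a small branch shows the #br# markers that get_text later converts to newlines:

from bs4 import BeautifulSoup
from bd_www.scrape import flatten_tagbranch_to_navstring

soup = BeautifulSoup(
    '<div><h2>Title</h2><p>Some <a href="/x">link</a> text</p></div>',
    features='lxml')
flatten_tagbranch_to_navstring(soup.body)
print(soup.text)
# expected output:
# #br##br#Title#br##br#Some  link text#br##br#
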

#   def derive_pages_info() -> None:
View Source
def derive_pages_info() -> None:
    """
    **Add derived information for all pages to the *scr_pages_info* table.**

    Derived information as such is not available within a page,
    but calculated or interpreted from other information. To derive this
    information, the extracted information should already be available in the
    *scr_pages_info* table. This can be accomplished by using the
    `extract_pages_info` function.

    After deleting any previous content from the *derived* data fields of the
    *scr_pages_info* table, the following new content is added for each of
    these fields:

    - *business* - 'belastingen', 'toeslagen' or 'douane', depending on the
      extracted *classes* field
    - *category* - 'dv', 'bib' or 'alg', depending on the extracted *pagetype*
      field, mapped on the basis of the constants `DV_TYPES`, `BIB_TYPES` and
      `ALG_TYPES`; the *category* of the 'bld-wrapper' *pagetype* however is
      determined by the *category* of the pages that link to the specific
      wrapper page

    It will be logged when info cannot be derived due to inconsistent or
    unavailable information.
    """

    # Clear the derived field contents
    set_cols = ', '.join([f'{f[0]} = NULL' for f in DERIVED_FIELDS])
    mst_conn.execute(f'UPDATE scr_pages_info SET {set_cols}')

    # Prepare some queries
    bus_qry = 'UPDATE scr_pages_info SET business = ? WHERE page_id = ?'
    cat_qry = 'UPDATE scr_pages_info SET category = ? WHERE page_id = ?'
    for_qry = '''
        SELECT page_id, pagetype, classes
        FROM scr_pages_info
        ORDER BY CASE pagetype WHEN 'bld-wrapper' THEN 2 ELSE 1 END
        '''
    cat_groups_qry = '''
        SELECT category
        FROM scr_ed_links
        LEFT JOIN scr_links AS l USING (link_nr)
        LEFT JOIN scr_pages_info USING (page_id)
        WHERE l.page_id = ?
        GROUP BY category
        '''
    wrappers_without_cat = set()
    logger.info('deriving pages info started')

    # Cycle over all pages, with wrapper pages after all others, because
    # the wrapper category is determined by the categories of pages
    # linking to that wrapper page.
    for page_id, pagetype, classes in mst_conn.execute(for_qry):

        # Determine business
        if classes:
            if 'toeslagen' in classes or 'tsl' in classes:
                business = 'toeslagen'
            elif 'dou' in classes:
                business = 'douane'
            else:
                business = 'belastingen'
        else:
            business = None
        mst_conn.execute(bus_qry, [business, page_id])

        # Determine category: dv, bib or alg
        if pagetype in DV_TYPES:
            category = 'dv'
        elif pagetype in BIB_TYPES:
            category = 'bib'
        elif pagetype in ALG_TYPES:
            category = 'alg'
        elif pagetype == 'bld-wrapper':
            # Group categories of all pages that link to this wrapper page
            categories = mst_conn.execute(
                cat_groups_qry, [page_id]).fetchall()
            if len(categories) == 1:
                # All pages linking to this wrapper have the same category
                category = categories[0][0]
                if not category:
                    # Probably because other uncategorized wrappers are
                    # linking to this one. Save its page_id for
                    # post-processing.
                    wrappers_without_cat.add(page_id)
            else:
                category = 'alg'
        else:
            category = None
        mst_conn.execute(cat_qry, [category, page_id])

    # The main reason that wrappers do not get a category in the main loop
    # is that pages that link to these wrappers are wrappers themselves that
    # had no category yet when asked for in the main loop. Repeating the loop
    # for the category of these pages will resolve most of them. To avoid
    # endless looping when wrappers without categories are linking to each
    # other, the number of cycles is maximized (crude but effective).
    max_cycles = len(wrappers_without_cat) * 3
    cycle = 0
    while wrappers_without_cat and cycle < max_cycles:
        cycle += 1
        page_id = wrappers_without_cat.pop()
        # Algorithm for category resolution is the same as in the main loop
        categories = mst_conn.execute(cat_groups_qry, [page_id]).fetchall()
        if len(categories) == 1:
            category = categories[0][0]
            if not category:
                wrappers_without_cat.add(page_id)
                continue
        else:
            category = 'alg'
        mst_conn.execute(cat_qry, [category, page_id])

    # Log the remaining cases and set their category to 'alg' to avoid NULLs
    # in the database.
    for page_id in wrappers_without_cat:
        path = mst_conn.execute(f'SELECT path FROM mst_paths '
                                f'WHERE page_id = {page_id}').fetchone()[0]
        logger.warning(
            f"unresolved category of wrapper set to 'alg': {path}")
        mst_conn.execute(cat_qry, ['alg', page_id])

    logger.info('deriving pages info finished')

Add derived information for all pages to the scr_pages_info table.

Derived information as such is not available within a page, but calculated or interpreted from other information. To derive this information, the extracted information should already be available in the scr_pages_info table. This can be accomplished by using the extract_pages_info function.

After deleting any previous content from the derived data fields of the scr_pages_info table, the following new content is added for each of these fields:

  • business - 'belastingen', 'toeslagen' or 'douane', depending on the extracted classes field
  • category - 'dv', 'bib' or 'alg', depending on the extracted pagetype field, mapped on the basis of the constants DV_TYPES, BIB_TYPES and ALG_TYPES; the category of the 'bld-wrapper' pagetype however is determined by the category of the pages that link to the specific wrapper page

It will be logged when info cannot be derived due to inconsistent or unavailable information.
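
After this function has run, the derived fields can be inspected with a simple query (a sketch using the module's mst_conn connection):

from bd_www import mst_conn

qry = '''
    SELECT business, category, count(*)
    FROM scr_pages_info
    GROUP BY business, category
    '''
for business, category, count in mst_conn.execute(qry).fetchall():
    print(business, category, count)
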

#   def status_figures() -> None:
View Source
def status_figures() -> None:
    """
    **Renew typical scrape status figures of the *scr_status_figures* table.**

    After deleting any previous content from the *scr_status_figures* table,
    the following numbers of pages, redirs, url-aliases and ed-links will be
    added to this table:

    - pages: all pages
    - pages_lang_`<language>`: per *language*
    - pages_buss_`<business>`: per *business*
    - pages_cat_`<category>`: per *category*
    - pages_type_`<pagetype>`: per *pagetype*
    - pages_h1_multi: with more than one h1-tag
    - pages_h1_multi_`<pagetype>`: with more than one h1-tag per *pagetype*
    - pages_h1_no: without h1-tag
    - pages_title_no: without or with empty title-tag
    - pages_title_dupl: with non-unique title-tag
    - pages_descr_no: without or with empty description meta-tag
    - pages_descr_long: with description meta-tag longer than 160 characters
    - redirs: total number of all redirects
    - redirs_`<type>`: number of redirects per *redir_type*
    - redirs_`<type>`_slash: redirects per *redir_type* that differ only by a
      trailing slash
    - redirs_wcm-url: wcm url's that get redirected
    - url-aliases: url's that alias a wcm url
    - url-aliases_`<num>`x: url's with `<num>` aliases
    - ed-links_`<int|ext>`: internal|external editorial links
    - ed-links_`<int|ext>`_uniq: unique internal|external editorial links
    - ed-links_`<int|ext>`_avg: internal|external editorial links per page
    - ed-links_int_redir: redirected internal editorial links
    - ed-links_int_non-wcm: internal editorial links to non-wcm url's
    - ed-links_int_orphan-pages: number of unlinked (i.e. orphan) pages
    """

    mst_conn.execute('DELETE FROM scr_status_figures')
    ins_qry = 'INSERT INTO scr_status_figures (name, value) VALUES (?, ?)'

    # Total pages
    qry = '''
        SELECT count(*)
        FROM scr_pages_info
        '''
    num_pages = mst_conn.execute(qry).fetchone()[0]
    mst_conn.execute(ins_qry, ['pages', num_pages])

    # Pages per language
    qry = '''
        SELECT language, count(*)
        FROM scr_pages_info
        GROUP BY language
        ORDER BY language DESC
        '''
    for language, count in mst_conn.execute(qry).fetchall():
        mst_conn.execute(ins_qry, [f'pages_lang_{language}', count])

    # Pages per business
    qry = '''
        SELECT business, count(*)
        FROM scr_pages_info
        GROUP BY business
        ORDER BY business
        '''
    for business, count in mst_conn.execute(qry).fetchall():
        mst_conn.execute(ins_qry, [f'pages_buss_{business}', count])

    # Pages per category
    qry = '''
        SELECT category, count(*)
        FROM scr_pages_info
        GROUP BY category
        ORDER BY category DESC
        '''
    for category, count in mst_conn.execute(qry).fetchall():
        mst_conn.execute(ins_qry, [f'pages_cat_{category}', count])

    # Pages per type
    qry = '''
        SELECT pagetype, count(*)
        FROM scr_pages_info
        GROUP BY pagetype
        ORDER BY category DESC, count(*) ASC
        '''
    for pagetype, count in mst_conn.execute(qry).fetchall():
        mst_conn.execute(ins_qry, [f'pages_type_{pagetype}', count])

    # Pages with more than one h1
    qry = '''
        SELECT count(*)
        FROM scr_pages_info
        WHERE num_h1s > 1
        '''
    mst_conn.execute(ins_qry, ['pages_h1_multi',
                               mst_conn.execute(qry).fetchone()[0]])

    # Pages per type with more than one h1
    qry = '''
        SELECT pagetype, count(*)
        FROM scr_pages_info
        WHERE num_h1s > 1
        GROUP BY pagetype
        '''
    for pagetype, count in mst_conn.execute(qry).fetchall():
        mst_conn.execute(ins_qry, [f'pages_h1_multi_{pagetype}', count])

    # Pages with no h1
    qry = '''
        SELECT count(*)
        FROM scr_pages_info
        WHERE num_h1s = 0
        '''
    mst_conn.execute(ins_qry, ['pages_h1_no',
                               mst_conn.execute(qry).fetchone()[0]])

    # Pages without title
    qry = '''
        SELECT count(*)
        FROM scr_pages_info
        WHERE title = '' or title is NULL
        '''
    mst_conn.execute(ins_qry, ['pages_title_no',
                               mst_conn.execute(qry).fetchone()[0]])

    # Pages with non-unique title
    qry = '''
        WITH
            title_freq AS (
                SELECT count(*) as c
                FROM scr_pages_info
                GROUP BY title
            )
        SELECT ifnull(sum(c), 0)
        FROM title_freq
        WHERE c > 1
        '''
    mst_conn.execute(ins_qry, ['pages_title_dupl',
                               mst_conn.execute(qry).fetchone()[0]])

    # Pages without description
    qry = '''
        SELECT count(*)
        FROM scr_pages_info
        WHERE description = '' OR description IS NULL
        '''
    mst_conn.execute(ins_qry, ['pages_descr_no',
                               mst_conn.execute(qry).fetchone()[0]])

    # Pages with description longer than 160 characters
    qry = '''
        SELECT count(*)
        FROM scr_pages_info
        WHERE length(description) > 160
        '''
    mst_conn.execute(ins_qry, ['pages_descr_long',
                               mst_conn.execute(qry).fetchone()[0]])

    # Total redirects (an alias is not a redirect)
    qry = '''
        SELECT count(*)
        FROM scr_redirs
        WHERE redir_type != 'alias'
        '''
    mst_conn.execute(ins_qry, ['redirs',
                               mst_conn.execute(qry).fetchone()[0]])

    # Redirects per type
    qry = '''
        SELECT redir_type, count(*)
        FROM scr_redirs
        WHERE redir_type != 'alias'
        GROUP BY redir_type
        '''
    for redir_type, count in mst_conn.execute(qry).fetchall():
        mst_conn.execute(ins_qry, [f'redirs_{redir_type}', count])

    # Redirects per type that only add or lose a trailing slash
    qry = '''
        SELECT redir_type, count(*)
        FROM scr_redirs
        WHERE req_url || '/' = redir_url OR req_url = redir_url || '/'
        GROUP BY redir_type
        '''
    for redir_type, count in mst_conn.execute(qry).fetchall():
        mst_conn.execute(ins_qry, [f'redirs_{redir_type}_slash', count])

    # Scrape wcm url's that are redirected
    qry = '''
        SELECT count(*)
        FROM scr_redirs
        WHERE req_id IN scr_page_ids
        '''
    mst_conn.execute(ins_qry, ['redirs_wcm-url',
                               mst_conn.execute(qry).fetchone()[0]])

    # Total aliases
    qry = '''
        SELECT count(*)
        FROM scr_redirs
        WHERE redir_type = 'alias'
        '''
    mst_conn.execute(ins_qry, ['url-aliases',
                               mst_conn.execute(qry).fetchone()[0]])

    # Frequency of aliases
    qry = f'''
        WITH
            redirs_full_urls AS (
                SELECT
                    ifnull(req_url, '{ROOT_URL}' || p1.path) AS req_url,
                    ifnull(redir_url, '{ROOT_URL}' || p2.path) AS redir_url,
                    redir_type
                FROM scr_redirs
                LEFT JOIN mst_paths AS p1 ON req_id = p1.page_id
                LEFT JOIN mst_paths AS p2 ON redir_id = p2.page_id
            ),
            alias_freq AS (
                SELECT redir_url, count(*) AS aliases_per_url
                FROM redirs_full_urls
                WHERE redir_type = 'alias'
                GROUP BY redir_url
            )
        SELECT aliases_per_url, count(*)
        FROM alias_freq
        GROUP BY aliases_per_url
        '''
    for alias_per_url, count in mst_conn.execute(qry).fetchall():
        mst_conn.execute(ins_qry, [f'url-aliases_{alias_per_url}x', count])

    # Total, unique and average number of internal links
    qry = '''
        SELECT count(*)
        FROM scr_ed_links
        LEFT JOIN scr_links USING (link_nr)
        WHERE link_type = 'int' OR link_type = 'ext-to-int'
        '''
    ed_links_int = mst_conn.execute(qry).fetchone()[0]
    qry = '''
        SELECT count(*)
        FROM scr_links
        WHERE link_type = 'int' OR link_type = 'ext-to-int'
        '''
    ed_links_int_uniq = mst_conn.execute(qry).fetchone()[0]
    mst_conn.execute(ins_qry, ['ed-links_int', ed_links_int])
    mst_conn.execute(ins_qry, ['ed-links_int_uniq', ed_links_int_uniq])
    mst_conn.execute(ins_qry, ['ed-links_int_avg',
                               round(ed_links_int / num_pages, 3)])

    # Number of internal editorial links that are redirected. This count
    # includes wcm url's that are no longer alive but get redirected instead.
    qry = '''
        SELECT count(*)
        FROM scr_ed_links
        LEFT JOIN scr_links AS l USING (link_nr)
        LEFT JOIN scr_redirs ON l.url IS req_url AND l.page_id IS req_id
        WHERE (redir_id NOTNULL OR redir_url NOTNULL) AND redir_type != 'alias'
        '''
    links_redirected = mst_conn.execute(qry).fetchone()[0]
    mst_conn.execute(ins_qry, ['ed-links_int_redir', links_redirected])

    # Number of internal editorial links that do not refer to a wcm url
    qry = '''
        SELECT count(*)
        FROM scr_ed_links
        LEFT JOIN scr_links AS l USING (link_nr)
        LEFT JOIN scr_redirs ON l.url = req_url
        WHERE redir_type = 'alias'
        '''
    links_non_wcm = mst_conn.execute(qry).fetchone()[0]
    mst_conn.execute(ins_qry, ['ed-links_int_non-wcm', links_non_wcm])

    # Number of orphan pages
    qry = 'SELECT count(*) FROM scr_unlinked WHERE unl_type = "orphan"'
    num_orphans = mst_conn.execute(qry).fetchone()[0]
    mst_conn.execute(ins_qry, ['ed-links_int_orphan-pages', num_orphans])

    # Total, unique and average number of external links
    qry = '''
        SELECT count(*)
        FROM scr_ed_links
        LEFT JOIN scr_links USING (link_nr)
        WHERE link_type = 'ext' OR link_type = 'ext-to-ext'
        '''
    ed_links_ext = mst_conn.execute(qry).fetchone()[0]
    qry = '''
        SELECT count(*)
        FROM scr_links
        WHERE link_type = 'ext' OR link_type = 'ext-to-ext'
        '''
    ed_links_ext_uniq = mst_conn.execute(qry).fetchone()[0]
    mst_conn.execute(ins_qry, ['ed-links_ext', ed_links_ext])
    mst_conn.execute(ins_qry, ['ed-links_ext_uniq', ed_links_ext_uniq])
    mst_conn.execute(ins_qry, ['ed-links_ext_avg',
                               round(ed_links_ext / num_pages, 3)])

    logger.info('status figures saved to scr_status_figures table')

Renew typical scrape status figures of the scr_status_figures table.

After deleting any previous content from the scr_status_figures table, the following numbers of pages, redirs, url-aliases and ed-links will be added to this table:

  • pages: all pages
  • pages_lang_<language>: per language
  • pages_buss_<business>: per business
  • pages_cat_<category>: per category
  • pages_type_<pagetype>: per pagetype
  • pages_h1_multi: with more than one h1-tag
  • pages_h1_multi_<pagetype>: with more than one h1-tag per pagetype
  • pages_h1_no: without h1-tag
  • pages_title_no: without or with empty title-tag
  • pages_title_dupl: with non-unique title-tag
  • pages_descr_no: without or with empty description meta-tag
  • pages_descr_long: with description meta-tag longer than 160 characters
  • redirs: total number of all redirects
  • redirs_<type>: number of redirects per redir_type
  • redirs_<type>_slash: redirects per redir_type that differ only by a trailing slash
  • redirs_wcm-url: wcm url's that get redirected
  • url-aliases: url's that alias a wcm url
  • url-aliases_<num>x: url's with <num> aliases
  • ed-links_<int|ext>: internal|external editorial links
  • ed-links_<int|ext>_uniq: unique internal|external editorial links
  • ed-links_<int|ext>_avg: internal|external editorial links per page
  • ed-links_int_redir: redirected internal editorial links
  • ed-links_int_non-wcm: internal editorial links to non-wcm url's
  • ed-links_int_orphan-pages: number of unlinked (i.e. orphan) pages
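
A minimal usage sketch (not part of the module): after status_figures() has run, the renewed figures can be read back from the scr_status_figures table via the mst_conn connection that this module uses. The figure names follow the list above; the helper function below is illustrative only.

# Illustrative sketch: read the renewed status figures into a dict.
# Assumes status_figures() has just been run against the same database.
from bd_www import mst_conn

def read_status_figures() -> dict:
    qry = 'SELECT name, value FROM scr_status_figures ORDER BY name'
    return {name: value for name, value in mst_conn.execute(qry).fetchall()}

figures = read_status_figures()
print(f"{figures['pages']} pages, of which "
      f"{figures.get('pages_h1_no', 0)} without an h1-tag")
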
#   def add_scraped_data(ts: str) -> bool:
View Source
def add_scraped_data(ts: str) -> bool:
    """
    **Add a fresh scrape to the scrapes history tables.**

    Arguments:

        ts: timestamp of the scrape [yymmdd-hhmm]

    Returns:

        success of the operation

    While scraping all pages using the `scrape_page` function, new pages that 
    had no earlier existence have been registered in the *mst_paths* table.
    Apart from the scrape results that are stored as files, the essential 
    remaining results of the scrape reside in the following tables of the
    scrapes database:

    - *scr_redirs*
    - *scr_links*
    - *scr_ed_links*
    - *scr_unlinked*

    Before adding the scraped results to the history tables, the data in
    these tables has to be normalised and completed using the following
    functions in succession:

    - `post_process_scrape` - to normalise the *scr_redirs*, *scr_links*
      and *scr_unlinked* tables and qualify unlinked pages
    - `extract_pages_info` - to extract information from the page sources to
      the *scr_pages_info* table
    - `derive_pages_info` - to complete the content of the *scr_pages_info*
      table
    - `status_figures` - to refresh the *scr_status_figures* table

    After using this function all data of the scrape (as stored in the scrape
    tables) has been merged into the various history tables and new links
    have been added to the *mst_links* table. The scrape tables remain
    unaltered afterwards, but can be safely deleted/overwritten.

    This function can only be used for a scrape that has a later timestamp
    than any other scrape in history. Otherwise it will log an error and
    return unsuccessful.
    """

    # Check if this scrape is later than the youngest in the history tables.
    qry = '''
        SELECT name
        FROM sqlite_schema
        WHERE type = 'table' AND name LIKE 'his_%'
        '''
    last_ok = True
    for table in [r[0] for r in mst_conn.execute(qry)]:
        max_ts = mst_conn.execute(
            f'SELECT max(timestamp) FROM {table}').fetchone()[0]
        if max_ts and max_ts > ts:
            last_ok = False
            logger.error(
                f'scrape {ts} predates latest history in the {table} table')
    if not last_ok:
        logger.error(f'scrape {ts} not added to history')
        return False

    logger.info('started adding scrape results to master and history tables')

    # Add scrape redirs results to the his_redirs table
    cte = '''
        WITH
            last_redirs AS (
                SELECT DISTINCT
                    req_id, req_url,
                    last_value(redir_id) OVER win AS redir_id,
                    last_value(redir_url) OVER win AS redir_url,
                    last_value(redir_type) OVER win AS redir_type
                FROM his_redirs
                WINDOW win AS (
                    PARTITION BY req_id, req_url
                    ORDER BY timestamp
                    RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
                )
            )
        '''
    # Save redirs that are new or have been changed
    mst_conn.execute(cte + f'''
        INSERT INTO his_redirs
            (timestamp, req_id, req_url, redir_id, redir_url, redir_type)
        SELECT '{ts}', s.*
        FROM scr_redirs AS s
        LEFT JOIN last_redirs AS m
            ON s.req_id IS m.req_id
            AND s.req_url IS m.req_url
        WHERE s.redir_id IS NOT m.redir_id
            OR s.redir_url IS NOT m.redir_url
            OR s.redir_type IS NOT m.redir_type
        ''')
    # Save redirs that disappeared with redir_type 'absent'
    mst_conn.execute(cte + f'''
        INSERT INTO his_redirs
            (timestamp, req_id, req_url, redir_type)
        SELECT
            '{ts}', m.req_id, m.req_url, 'absent'
        FROM last_redirs AS m
        LEFT JOIN scr_redirs AS s
            ON m.req_id IS s.req_id
            AND m.req_url IS s.req_url
        WHERE m.redir_type != 'absent'
            AND s.req_id ISNULL
            AND s.req_url ISNULL
        ''')
    logger.info('his_redirs table updated')

    # Add new links to the mst_links table...
    mst_conn.executescript('''
        INSERT INTO mst_links (url, anchor, page_id)
        SELECT s.url, s.anchor, s.page_id
        FROM scr_links AS s
        LEFT JOIN mst_links AS m
            ON s.url IS m.url
            AND s.anchor IS m.anchor
            AND s.page_id IS m.page_id
        WHERE m.link_id ISNULL
        ''')
    # ... and save the link_id to the scr_links table for later use.
    mst_conn.execute('''
        UPDATE scr_links AS s
        SET link_id = (
            SELECT link_id
            FROM mst_links AS m
            WHERE m.url IS s.url
                AND m.page_id IS s.page_id
                AND m.anchor IS s.anchor
            )
        ''')
    logger.info('mst_links table updated')

    # Add status for new links or links of which the status changed
    mst_conn.execute(f'''
        WITH
            last_links_status AS (
                SELECT DISTINCT
                    link_id,
                    last_value(link_type) OVER win AS link_type,
                    last_value(status) OVER win AS status
                FROM his_links_status
                WINDOW win AS (
                    PARTITION BY link_id
                    ORDER BY timestamp
                    RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
                )
            )
        INSERT INTO his_links_status (timestamp, link_id, link_type, status)
        SELECT '{ts}', link_id, s.link_type, s.status
        FROM scr_links AS s
        LEFT JOIN last_links_status AS l USING (link_id)
        WHERE (l.link_id ISNULL 
               OR s.link_type IS NOT l.link_type 
               OR s.status IS NOT l.status)
            AND NOT
                (l.status NOTNULL AND l.status LIKE 'error%' 
                 AND s.status NOTNULL AND s.status LIKE 'error%')
        ''')
    logger.info('his_links_status table updated')

    # Add records for new editorial links and links that have a different
    # number of appearances in a page.
    cte = '''
        WITH
            scr_ed_links_normalised AS (
                SELECT e.page_id, text, link_id, count(*) AS present
                FROM scr_ed_links AS e
                LEFT JOIN scr_links USING (link_nr)
                GROUP BY e.page_id, text, link_id
            ),
            last_ed_links AS (
                SELECT DISTINCT
                    page_id, text, link_id,
                    last_value(present) OVER win AS present
                FROM his_ed_links
                WINDOW win AS (
                    PARTITION BY page_id, text, link_id
                    ORDER BY timestamp
                    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
                )
            )
        '''
    mst_conn.execute(cte + f'''
        INSERT INTO his_ed_links
            (timestamp, page_id, text, link_id, present)
        SELECT '{ts}', page_id, text, link_id, s.present
        FROM scr_ed_links_normalised AS s
        LEFT JOIN last_ed_links AS l USING (page_id, text, link_id)
        WHERE l.present ISNULL
            OR s.present != l.present
        ''')
    # Add records for editorial links that disappeared
    mst_conn.execute(cte + f'''
        INSERT INTO his_ed_links
            (timestamp, page_id, text, link_id, present)
        SELECT '{ts}', page_id, text, link_id, 0
        FROM last_ed_links AS l
        LEFT JOIN scr_ed_links_normalised AS s USING (page_id, text, link_id)
        WHERE l.present > 0 AND s.present ISNULL
        ''')
    logger.info('his_ed_links table updated')

    # Update his_unlinked table with the scrape results
    cte = f'''
        WITH
            last_unlinked AS (
                SELECT DISTINCT
                    page_id, url,
                    last_value(unl_type) OVER win AS unl_type
                FROM his_unlinked
                WINDOW win AS (
                    PARTITION BY page_id, url
                    ORDER BY timestamp
                    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
                )
            )
        '''
    # Save unlinked pages that are new or whose type changed
    mst_conn.execute(cte + f'''
        INSERT INTO his_unlinked (timestamp, page_id, url, unl_type)
        SELECT '{ts}', s.page_id, s.url, s.unl_type
        FROM scr_unlinked AS s
        LEFT JOIN last_unlinked AS l
            ON s.page_id IS l.page_id
            AND s.url IS l.url
        WHERE s.unl_type IS NOT l.unl_type  -- covers new and changed
        ''')
    # Mark unlinked pages that disappeared with unl_type 'absent'
    mst_conn.execute(cte + f'''
        INSERT INTO his_unlinked (timestamp, page_id, url, unl_type)
        SELECT '{ts}', l.page_id, l.url, 'absent'
        FROM last_unlinked AS l
        LEFT JOIN scr_unlinked AS s
            ON l.page_id IS s.page_id
            AND l.url IS s.url
        WHERE l.unl_type != 'absent' AND s.unl_type ISNULL
        ''')
    logger.info('his_unlinked table updated')

    # Update his_pages_life table with the scrape results
    cte = '''
        WITH
            last_pages_life AS (
                SELECT DISTINCT
                    page_id,
                    last_value(alive) OVER (
                        PARTITION BY page_id
                        ORDER BY timestamp
                        RANGE BETWEEN UNBOUNDED PRECEDING
                            AND UNBOUNDED FOLLOWING
                    ) AS alive
                FROM his_pages_life
            )
        '''
    # Mark pages alive that are new or became alive again
    mst_conn.execute(cte + f'''
        INSERT INTO his_pages_life (timestamp, page_id, alive)
        SELECT '{ts}', page_id, TRUE
        FROM scr_page_ids
        LEFT JOIN last_pages_life AS l USING (page_id)
        WHERE l.page_id ISNULL OR NOT alive 
        ''')
    # Register pages that are no longer alive
    mst_conn.execute(cte + f'''
        INSERT INTO his_pages_life (timestamp, page_id, alive)
        SELECT '{ts}', page_id, FALSE
        FROM last_pages_life
        LEFT JOIN scr_page_ids AS s USING (page_id)
        WHERE alive AND s.page_id ISNULL
        ''')
    logger.info('his_pages_life table updated')

    # Update his_short_urls table with the scrape results
    cte = f'''
        WITH
            last_short_urls AS (
                SELECT DISTINCT
                    url,
                    last_value(active) OVER (
                        PARTITION BY url
                        ORDER BY timestamp
                        RANGE BETWEEN UNBOUNDED PRECEDING
                            AND UNBOUNDED FOLLOWING
                    ) AS active
                FROM his_short_urls
            ),
            scr_short_urls AS (
                SELECT req_url AS url
                FROM scr_redirs
                WHERE req_url NOT LIKE '%{ROOT_URL.partition('.')[2]}%'
            )
        '''
    # Mark short-url's active that are new or became active again
    mst_conn.execute(cte + f'''
        INSERT INTO his_short_urls (timestamp, url, active)
        SELECT '{ts}', url, TRUE
        FROM scr_short_urls
        LEFT JOIN last_short_urls USING (url)
        WHERE active ISNULL OR NOT active
        ''')
    num_active = mst_conn.execute(f'''
        SELECT count(*)
        FROM his_short_urls
        WHERE timestamp = '{ts}' AND active
        ''').fetchone()[0]
    # Register short-url's that are no longer active
    mst_conn.execute(cte + f'''
        INSERT INTO his_short_urls (timestamp, url, active)
        SELECT '{ts}', url, FALSE
        FROM last_short_urls
        LEFT JOIN scr_short_urls AS s USING (url)
        WHERE active AND s.url ISNULL
        ''')
    num_inactive = mst_conn.execute(f'''
        SELECT count(*)
        FROM his_short_urls
        WHERE timestamp = '{ts}' AND NOT active
        ''').fetchone()[0]
    logger.info(f'his_short_urls table updated: '
                f'registered {num_active} active and {num_inactive} inactive')

    # Update his_pages_info table for pages with changed aspects
    sql = 'SELECT name FROM pragma_table_info("his_pages_info")'
    page_aspects = [row[0] for row in mst_conn.execute(sql).fetchall()
                    if row[0] != 'timestamp']
    fields = ', '.join(page_aspects)
    mst_conn.execute(f'''
        WITH
            last_pages_info_timestamps AS (
                SELECT DISTINCT
                page_id,
                last_value(timestamp) OVER (
                    PARTITION BY page_id
                    ORDER BY timestamp
                    RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
                ) AS timestamp
                FROM his_pages_info
            ),
            last_pages_info AS (
                SELECT {fields}
                FROM last_pages_info_timestamps
                LEFT JOIN his_pages_info USING (timestamp, page_id)
            ),
            changed_pages_info AS (
                SELECT *
                FROM scr_pages_info
                    EXCEPT
                SELECT *
                FROM last_pages_info
            )
        INSERT INTO his_pages_info
        SELECT '{ts}', *
        FROM changed_pages_info
        ''')
    logger.info('his_pages_info table updated')

    # Add scrape status figures to the his_status_figures table
    mst_conn.execute(f'''
        INSERT INTO his_status_figures (timestamp, name, value)
        SELECT '{ts}', name, value
        FROM scr_status_figures
        ''')
    logger.info('his_status_figures table updated with scrape results')
    logger.info('concluded adding scrape results to master and history tables')
    return True

Add a fresh scrape to the scrapes history tables.

Arguments:

ts: timestamp of the scrape [yymmdd-hhmm]

Returns:

success of the operation

While scraping all pages using the scrape_page function, new pages that had no earlier existence have been registered in the mst_paths table. Apart from the scrape results that are stored as files, the essential remaining results of the scrape reside in the following tables of the scrapes database:

  • scr_redirs
  • scr_links
  • scr_ed_links
  • scr_unlinked

Before adding the scraped results to the history tables, the data in these tables has to be normalised and completed using the following functions in succession:

  • post_process_scrape - to normalise the scr_redirs, scr_links and scr_unlinked tables and qualify unlinked pages
  • extract_pages_info - to extract information from the page sources to the scr_pages_info table
  • derive_pages_info - to complete the content of the scr_pages_info table
  • status_figures - to refresh the scr_status_figures table

After using this function all data of the scrape (as stored in the scrape tables) has been merged into the various history tables and new links have been added to the mst_links table. The scrape tables remain unaltered afterwards, but can be safely deleted/overwritten.

This function can only be used for a scrape that has a later timestamp than any other scrape in history. Otherwise it will log an error and return unsuccessful.

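For orientation, a sketch of how this function fits in the processing chain described above; the timestamp is a made-up example and the call order simply mirrors the list of functions in the docstring (the same order that reprocess_scrapes uses).

# Sketch of the post-scrape processing chain; the timestamp is an example.
from bd_www.scrape import (post_process_scrape, extract_pages_info,
                           derive_pages_info, status_figures, add_scraped_data)

ts = '240301-0300'            # example timestamp [yymmdd-hhmm]
post_process_scrape()         # normalise scr_redirs, scr_links, scr_unlinked
extract_pages_info(ts)        # extract page info into scr_pages_info
derive_pages_info()           # add the derived business/category fields
status_figures()              # refresh the scr_status_figures table
if not add_scraped_data(ts):  # merge the scr_* tables into the history tables
    print(f'scrape {ts} was not added to history; see the scrape log')
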
#   def check_storage_integrity(check_page_ids: bool = False) -> None:
View Source
def check_storage_integrity(check_page_ids: bool = False) -> None:
    """
    **Check integrity of the scrapes storage.**

    Arguments:

        check_page_ids: check the page_id/path relation

    The following checks will be executed:

    - available zip files with page sources against valid scrapes
    - validity of timestamps of site reports
    - validity of timestamps of 'sitemap.xml' files
    - validity of timestamps of 'robots.txt' files
    - page_id/path relation of all zip sources (depending on actual value
      of `check_page_ids` parameter; this will take quite some time)

    All detected exceptions will be logged; no data will be altered.
    """

    # Setup logging
    global logger
    logger = logging.getLogger('check')
    logger.setLevel(logging.INFO)
    fh = logging.FileHandler(str(mst_dir / 'check.log'))
    fh.setLevel(logging.INFO)
    ch = logging.StreamHandler()
    ch.setLevel(logging.INFO)
    formatter = logging.Formatter(
        fmt='[%(asctime)s] %(levelname)-8s - %(message)s',
        datefmt='%Y-%m-%d %H:%M:%S')
    fh.setFormatter(formatter)
    ch.setFormatter(formatter)
    logger.addHandler(fh)
    logger.addHandler(ch)
    
    # Check page sources against the valid scrapes
    valid_tss = set(valid_scrapes())
    src_dir = mst_dir / scrape_conf.src_dir_name
    src_tss = set()
    for file in src_dir.iterdir():
        if file.suffix == '.zip':
            ts = file.name[:11]
            src_tss.add(ts)
    if src_tss != valid_tss:
        if scrs_no_zip := valid_tss - src_tss:
            logger.warning(
                f'{len(scrs_no_zip)} scrapes without source zips: '
                f'{", ".join(scrs_no_zip)}')
        if zips_no_scr := src_tss - valid_tss:
            logger.warning(
                f'{len(zips_no_scr)} source zips without scrape: '
                f'{", ".join(zips_no_scr)}')
    logger.info('checked page source zip files')

    # Check site-reports against valid scrapes
    for report in report_conf.reports:
        report_conf.spec_report(report)
        rep_dir = mst_dir / report_conf.report_dir_name
        rep_tss = set()
        for item in rep_dir.iterdir():
            if item.is_dir():
                for file in item.iterdir():
                    ts = file.name[:11]
                    if ts not in valid_tss:
                        logger.warning(f'{report} report with invalid '
                                       f'timestamp: {file.name}')
                    else:
                        rep_tss.add(ts)
        if scrs_no_rep := valid_tss - rep_tss:
            logger.warning(
                f'scrapes without {report} reports: {", ".join(scrs_no_rep)}')
    logger.info('checked site reports')

    # Check sitemaps against valid scrapes
    map_dir = mst_dir / scrape_conf.sitemap_dir_name
    for file in map_dir.iterdir():
        ts = file.name[:11]
        if ts not in valid_tss:
            logger.warning(f'sitemap with invalid timestamp: {file.name}')
    logger.info('checked sitemap files')

    # Check robots files against valid scrapes
    robot_dir = mst_dir / scrape_conf.robots_dir_name
    for file in robot_dir.iterdir():
        ts = file.name[:11]
        if ts not in valid_tss:
            logger.warning(f'robots file with invalid timestamp: {file.name}')
    logger.info('checked robots files')

    # Check page_id/path relations of all source zip files
    if check_page_ids:
        logger.info('started checking page_id/path relations of all scrapes; '
                    'this will take some time')
        for ts in sorted(src_tss):
            src_zip = PageSourceZip(ts)
            for page_id, doc in src_zip.iter_pages():
                soup = BeautifulSoup(doc, features='lxml')
                mst_path = mst_conn.execute(
                    'SELECT path FROM mst_paths WHERE page_id = ?',
                    [page_id]).fetchone()[0]
                meta_tag = soup.find(
                    'meta', attrs={'name': 'DCTERMS.identifier'})
                if meta_tag:
                    full_src_path = meta_tag['content']
                    if full_src_path.startswith('/wps/wcm/connect'):
                        src_path = full_src_path[16:]
                        if src_path != mst_path:
                            logger.critical(
                                f'page {page_id} is registered as {mst_path}, '
                                f'but in source zip {src_zip.path.name} its '
                                f'path is {src_path}')
                    else:
                        # No consistency check possible
                        logger.warning(f'strange DCTERMS.identifier in '
                                       f'scrape {ts} / page {page_id}')
                else:
                    # No consistency check possible
                    logger.debug(
                        f'page without wcm url: scrape {ts} / page {page_id}')
            logger.info(f'page_id/path relation checked of scrape {ts}')

    return

Check integrity of the scrapes storage.

Arguments:

check_page_ids: check the page_id/path relation

The following checks will be executed:

  • available zip files with page sources against valid scrapes
  • validity of timestamps of site reports
  • validity of timestamps of 'sitemap.xml' files
  • validity of timestamps of 'robots.txt' files
  • page_id/path relation of all zip sources (depending on actual value of check_page_ids parameter; this will take quite some time)

All detected exceptions will be logged; no data will be altered.

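A usage sketch: the basic checks run quickly, while the optional page_id/path verification reads every stored page source and, as noted above, takes quite some time.

# Quick checks only; findings are logged to check.log and to the console.
from bd_www.scrape import check_storage_integrity

check_storage_integrity()

# Full check, including the page_id/path relation of every zipped source.
check_storage_integrity(check_page_ids=True)
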
#   def valid_scrapes( from_ts: str = '000101-0000', data_dir: pathlib.Path = WindowsPath('//PVPCOCC322/Users/diepj09/Documents/scrapes_prod') ) -> list[str]:
View Source
def valid_scrapes(
        from_ts: str = '000101-0000', data_dir: Path = mst_dir) -> list[str]:
    """
    **Return a sorted list of valid scrapes.**

    Arguments:

        from_ts: skip scrapes with an earlier timestamp
        data_dir: master directory to contain the (sub)directory with scrape
            logs

    Returns:

        sorted list of timestamps of valid scrapes

    Scrapes are assumed to be valid if the text 'scrape finished' appears in
    the content of the related scrape log. When this text appears in the
    scrape log the essential scrape was successful, even when further
    processing of the scraped data is not executed for some reason.

    Presence of log files with duplicate timestamps as well as log files that
    do not contain the discriminating text will be logged.
    """
    log_dir = data_dir / scrape_conf.log_dir_name
    valid_tss = set()
    for file in log_dir.iterdir():
        if 'README' in file.name:
            continue
        ts = file.name[:11]
        if ts < from_ts:
            continue
        valid = False
        with open(file) as scr_log:
            for line in scr_log:
                if 'scrape finished' in line:
                    valid = True
                    break
            if valid:
                if ts in valid_tss:
                    logger.warning(f'duplicate log for scrape {ts}')
                else:
                    valid_tss.add(ts)
            else:
                logger.warning(
                    f'scrape {ts} seems to be invalid or unfinished')
    return sorted(valid_tss)

Return a sorted list of valid scrapes.

Arguments:

from_ts: skip scrapes with an earlier timestamp
data_dir: master directory to contain the (sub)directory with scrape
    logs

Returns:

sorted list of timestamps of valid scrapes

Scrapes are assumed to be valid if the text 'scrape finished' appears in the content of the related scrape log. When this text appears in the scrape log the essential scrape was successful, even when further processing of the scraped data is not executed for some reason.

Presence of log files with duplicate timestamps as well as log files that do not contain the discriminating text will be logged.

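A small usage sketch with an example from_ts value:

# List the valid scrapes since an (example) timestamp.
from bd_www.scrape import valid_scrapes

recent = valid_scrapes(from_ts='240101-0000')
if recent:
    print(f'{len(recent)} valid scrapes, latest: {recent[-1]}')
else:
    print('no valid scrapes since 240101-0000')
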
#   def reprocess_scrapes( from_ts: str = '000101-0000', reprocess_links: bool = False, copy_db: bool = True ) -> None:
View Source
def reprocess_scrapes(
        from_ts: str = '000101-0000',
        reprocess_links: bool = False,
        copy_db: bool = True) -> None:
    """
    **Reprocess scraped data.**

    Arguments:

        from_ts: first timestamp of data to be reprocessed
        reprocess_links: reprocess content of his_ed_links table
        copy_db: work on a copy of the scrapes database

    ***Note:*** *The structure of the scrapes storage is detailed in the
    `create_scrapes_storage` function.*

    This function will reprocess the data in the history tables of all
    scrapes or starting from the scrape with timestamp `from_ts`. The
    reprocessing addresses all data that can be reproduced from the data that
    was gathered during the actual scrape. This 'basic' scrape data, which
    cannot itself be reproduced, consists of:

    - page sources - available in source zip files
    - redirections - stored in the his_redirs table
    - unlinked pages - stored in the his_unlinked table (this information
      can be reproduced from link analysis of the page sources, but since
      only editorial links are registered, it is treated as basic data)
    - status of links - stored in the his_links_status table

    Using this function assumes that the content of the master tables is
    complete and correct. If there is any doubt about this assumption, some
    consistency checks can be executed on these tables by using the
    `check_storage_integrity` function.
    
    While reprocessing, the following tables are rebuilt with reprocessed
    data:

    - *his_ed_links* - using page sources (if `reprocess_links` is `True`)
    - *his_pages_info* - using page sources
    - *his_pages_life* - using page sources
    - *his_short_urls* - using his_redirs
    - *his_status_figures* - using his_pages_info

    Although the other history tables are rebuilt as well, it will be with
    the same (basic) data and as such produce tables with the same content.

    Since reprocessing can take quite some time, the reprocessing will by
    default be executed on a copy of the scrapes database in order to avoid a
    clash with a next scrape being added. Any scrape that is added after the
    execution of this function started, will be reprocessed as well.

    After reprocessing finishes, all original history tables will be saved
    with their original names prepended with 'svd_' and can be deleted from
    the database when the reprocessing was verified and successful.
    """

    global logger
    logger = logging.getLogger('reprocess')
    logger.setLevel(logging.INFO)
    ch = logging.StreamHandler()
    ch.setLevel(logging.INFO)
    formatter = logging.Formatter(
        fmt='[%(asctime)s] %(levelname)-8s - %(message)s',
        datefmt='%Y-%m-%d %H:%M:%S')
    ch.setFormatter(formatter)
    logger.addHandler(ch)

    if copy_db:
        scrs_db = mst_dir / scrape_conf.scrs_db_name
        v = 0
        while True:
            db_name = scrs_db.stem + '_repr' + (f'_{v}' if v else '')
            repr_db = scrs_db.with_stem(db_name)
            if repr_db.exists():
                v += 1
                continue
            else:
                break
        shutil.copy(scrs_db, repr_db)
        mst_conn.switch_to(repr_db)

    # Save copies of the history tables and remove data that will be renewed
    qry = '''
        SELECT name
        FROM sqlite_schema
        WHERE type = 'table' AND name LIKE 'his_%'
        '''
    for name in [r[0] for r in mst_conn.execute(qry)]:
        sql = mst_conn.execute(
            f'SELECT sql FROM sqlite_schema WHERE name = ?',
            [name]).fetchone()[0]
        svd_name = 'svd_' + name
        mst_conn.execute(sql.replace(name, svd_name))
        mst_conn.execute(f'INSERT INTO {svd_name} SELECT * FROM {name}')
        mst_conn.execute(f'DELETE FROM {name} WHERE timestamp >= "{from_ts}"')
    mst_conn.execute('VACUUM')

    # Cycle over all scrapes that need to be reprocessed. Then per scrape:
    # - reinstate all scrape tables as if that scrape just finished
    # - process these tables the same way as if the scrape was new
    scrapes = valid_scrapes(from_ts)
    while scrapes:
        ts = scrapes.pop(0)
        if not scrapes:
            # Add scrapes that were added in the meantime.
            scrapes = valid_scrapes(ts)[1:]

        logger.info(f'started reprocessing scrape {ts}')
        empty_scr_tables()
        src_zip = PageSourceZip(ts)

        # Save all page_id's to the scr_page_ids table
        mst_conn.execute('INSERT INTO scr_page_ids VALUES '
                         + ','.join(f'({pid})' for pid in src_zip.page_ids()))

        # Since redirections are determined at the moment of scraping,
        # the scr_redirs table can only be reinstated from the svd_his_redirs
        # table.
        mst_conn.execute(f'''
            WITH
                last_redirs_during_scrape AS (
                    SELECT DISTINCT
                        req_id, req_url,
                        last_value(redir_id) OVER win AS redir_id,
                        last_value(redir_url) OVER win AS redir_url,
                        last_value(redir_type) OVER win AS redir_type
                    FROM svd_his_redirs
                    WHERE timestamp <= '{ts}'
                    WINDOW win AS (
                        PARTITION BY req_id, req_url
                        ORDER BY timestamp
                        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
                    )
                )
            INSERT INTO scr_redirs
            SELECT *
            FROM last_redirs_during_scrape
            WHERE redir_type != 'absent'
            ''')

        # Re-analysing unlinked pages can only be done when all links are
        # registered. Since only editorial links are registered, the
        # scr_unlinked table will be reinstated from the svd_his_unlinked
        # table.
        mst_conn.execute(f'''
            WITH
                last_unlinked_during_scrape AS (
                    SELECT DISTINCT
                        page_id, url,
                        last_value(unl_type) OVER win AS unl_type
                    FROM svd_his_unlinked
                    WHERE timestamp <= '{ts}'
                    WINDOW win AS (
                        PARTITION BY page_id, url
                        ORDER BY timestamp
                        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
                    )
                )
            INSERT INTO scr_unlinked
            SELECT *
            FROM last_unlinked_during_scrape
            WHERE unl_type != 'absent'
            ''')

        # Handle scr_links and scr_ed_links tables
        if reprocess_links:
            # Cycle over all pages
            for page_id, doc in src_zip.iter_pages():
                soup = BeautifulSoup(doc, features='lxml')
                mst_path = mst_conn.execute(
                    'SELECT path FROM mst_paths WHERE page_id = ?',
                    [page_id]).fetchone()[0]
                mst_url = ROOT_URL + mst_path

                # Re-process all editorial links
                ed_tree = split_tree(soup)[0]
                for a_tag in ed_tree.find_all('a', href=True):
                    link = a_tag['href'].strip()
                    # Some links are discarded
                    if link.startswith('#'):
                        # Plain #-links
                        continue
                    if link == '/':
                        # Home page links
                        continue
                    if ('readspeaker' in link
                            or 'adobe' in link
                            or 'java' in link):
                        # Non content links
                        continue
                    link = urljoin(mst_url, link)
                    # Separate anchor from in-scope url
                    if link.startswith(ROOT_URL) and '#' in link:
                        link, _, anchor = link.partition('#')
                    else:
                        anchor = None
                    if link == mst_url:
                        # Discard in-page link (if it was not a plain #-link)
                        continue
                    if not link:
                        print('Link is None or empty')
                        continue

                    # Check if the link is new for this scrape
                    qry = f'''
                        SELECT link_nr
                        FROM scr_links
                        WHERE url = '{link}'
                            AND anchor IS {f"'{anchor}'" if anchor else 'NULL'}
                        '''
                    qry_result = mst_conn.execute(qry).fetchone()
                    if qry_result:
                        # This link was already handled during this scrape
                        link_nr = qry_result[0]
                    else:
                        # Store the link for this scrape
                        link_nr = mst_conn.execute('''
                            INSERT INTO scr_links (url, anchor)
                            VALUES (?, ?)
                            ''', [link, anchor]).lastrowid
                    # Save link in relation to the page (can be a duplicate if
                    # the link is used more than once with the same text in the
                    # same page).
                    link_text = a_tag.text.strip()
                    mst_conn.execute('''
                        INSERT INTO scr_ed_links (page_id, text, link_nr)
                        VALUES (?, ?, ?)
                        ''', [page_id, link_text, link_nr])

            # Add link_type and status from saved history (since these can
            # not be reproduced otherwise).
            mst_conn.execute(f'''
                WITH
                    svd_links_status AS (
                        SELECT DISTINCT
                            link_id,
                            last_value(link_type) OVER win AS link_type,
                            last_value(status) OVER win AS status
                        FROM svd_his_links_status
                        WHERE timestamp <= '{ts}'
                        WINDOW win AS (
                            PARTITION BY link_id
                            ORDER BY timestamp
                            RANGE BETWEEN UNBOUNDED PRECEDING
                                AND UNBOUNDED FOLLOWING
                        )
                    )
                UPDATE scr_links AS s
                SET link_type = (
                        SELECT link_type
                        FROM svd_links_status AS t
                        WHERE t.link_id = s.link_id),
                    status = (
                        SELECT status
                        FROM svd_links_status AS t
                        WHERE t.link_id = s.link_id)
                ''')

            post_process_scrape()

        else:
            # Link related data does not need to be renewed: reinstate the
            # scr_links and scr_ed_links tables from the saved history.
            cte = f'''
                WITH
                    last_ed_links_during_scrape AS (
                        SELECT DISTINCT
                            page_id, text, link_id,
                            last_value(present) OVER win AS present
                        FROM svd_his_ed_links
                        WHERE timestamp <= '{ts}'
                        WINDOW win AS (
                            PARTITION BY page_id, text, link_id
                            ORDER BY timestamp
                            ROWS BETWEEN UNBOUNDED PRECEDING
                                AND UNBOUNDED FOLLOWING
                        )
                    ),
                '''
            mst_conn.execute(cte + f'''
                    last_links_status_during_scrape AS (
                        SELECT DISTINCT
                            link_id,
                            last_value(link_type) OVER win AS link_type,
                            last_value(status) OVER win AS status
                        FROM svd_his_links_status
                        WHERE timestamp <= '{ts}'
                        WINDOW win AS (
                            PARTITION BY link_id
                            ORDER BY timestamp
                            RANGE BETWEEN UNBOUNDED PRECEDING
                                AND UNBOUNDED FOLLOWING
                        )
                    )
                INSERT INTO scr_links
                    (link_nr, url, page_id, anchor, link_type, status)
                SELECT
                    link_id, url, page_id, anchor, link_type, status
                FROM mst_links
                LEFT JOIN last_links_status_during_scrape USING (link_id)
                WHERE link_id IN (SELECT DISTINCT link_id
                                  FROM last_ed_links_during_scrape
                                  WHERE present > 0)
                ''')
            mst_conn.execute(cte + f'''
                    -- Degroup same links per page
                    last_ed_links_ungrouped AS (
                        SELECT page_id, text, link_id, present - 1 AS present
                        FROM last_ed_links_during_scrape
                        WHERE present > 0
                            UNION ALL
                        SELECT page_id, text, link_id, present - 1 AS present
                        FROM last_ed_links_ungrouped
                        WHERE present > 0
                    )
                INSERT INTO scr_ed_links (page_id, text, link_nr)
                SELECT page_id, text, link_id
                FROM last_ed_links_ungrouped
                ''')

        extract_pages_info(ts)
        derive_pages_info()
        status_figures()
        add_scraped_data(ts)
        logger.info(f'concluded reprocessing scrape {ts}')

    if copy_db:
        mst_conn.switch_back()

    return

Reprocess scraped data.

Arguments:

from_ts: first timestamp of data to be reprocessed
reprocess_links: reprocess content of his_ed_links table
copy_db: work on a copy of the scrapes database

Note: The structure of the scrapes storage is detailed in the create_scrapes_storage function.

This function will reprocess the data in the history tables of all scrapes or starting from the scrape with timestamp from_ts. The reprocessing addresses all data that can be reproduced from the data that was gathered during the actual scrape. This 'basic' scrape data, which cannot itself be reproduced, consists of:

  • page sources - available in source zip files
  • redirections - stored in the his_redirs table
  • unlinked pages - stored in the his_unlinked table (this information can be reproduced from link analysis of the page sources, but since only editorial links are registered, it is treated as basic data)
  • status of links - stored in the his_links_status table

Using this function assumes that the content of the master tables is complete and correct. If there is any doubt about this assumption, some consistency checks can be executed on these tables by using the check_storage_integrity function.

While reprocessing, the following tables are rebuilt with reprocessed data:

  • his_ed_links - using page sources (if reprocess_links is True)
  • his_pages_info - using page sources
  • his_pages_life - using page sources
  • his_short_urls - using his_redirs
  • his_status_figures - using his_pages_info

Although the other history tables are rebuilt as well, it will be with the same (basic) data and as such produce tables with the same content.

Since reprocessing can take quite some time, the reprocessing will by default be executed on a copy of the scrapes database in order to avoid a clash with a next scrape being added. Any scrape that is added after the execution of this function started, will be reprocessed as well.

After reprocessing finishes, all original history tables will be saved with their original names prepended with 'svd_' and can be deleted from the database when the reprocessing was verified and successful.
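
A maintenance sketch based on the description above; the timestamp is an example value. By default the reprocessing runs on a copy of the scrapes database, and the svd_ tables in that copy can be dropped manually once the result has been verified.

# Reprocess all history from an (example) timestamp onwards, including the
# editorial links, on a copy of the scrapes database.
from bd_www.scrape import reprocess_scrapes

reprocess_scrapes(from_ts='240101-0000', reprocess_links=True)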