bd_www

Package for scraping and reporting on www.belastingdienst.nl.

The basic contents of this package are the following three modules:

  • bd_www.scrape.py for scraping the site
  • bd_www.matomo.py for getting usage statistics
  • bd_www.report.py for reporting about the site

The bd_www.constants.py module, containing the package constants, completes the set.

When importing (from) this package, configuration parameters are read from the configuration file named by bd_www.constants.CONFIG_NAME. Please refer to the documentation of the Config class to learn more about this file.

Each of the three main modules can be used for its own purpose, but they will often be run consecutively and unattended. For this purpose there is the simple command-line module scrape_and_report.py.

Release 410.123.201 (2-2-2023)
Contents

module                 version  remarks
bd_www.constants.py    1.2
bd_www.scrape.py       4.1.0    changed
bd_www.matomo.py       1.2.3    changed
bd_www.report.py       2.0.1    changed
scrape_and_report.py   6.0      changed
revisit_scrapes.py     2.2      changed
bd_www.ini             2.3      changed
report_conf.json       2.4
keyfig_details.xlsx    2.2
data_legend.xlsx       1.2

Changes

General and bd_www
  • Improved multiline value handling to allow multi-word (space-separated) lines in bd_www.ini
  • Replaced user-specific code by introducing a local '_own_home' variable holding the runtime-determined absolute path of the user's home directory
  • Changed configuration parameter 'mst_dir' to 'rel_mst_dir', a path relative to the user's home directory
  • Variable 'mst_dir' is now constituted from '_own_home' and the configured 'rel_mst_dir'
  • Changed log settings from universal to independent per module and, in some cases, per function
  • Fixed various bugs so that the entire bd_www package can be used after creating a new empty data store with bd_www.scrape.create_new_data_store
bd_www.scrape.py
bd_www.matomo.py
bd_www.report.py
scrape_and_report.py
revisit_scrapes.py
bd_www.ini
  • Multiline values with multi-word (space-separated) lines are now allowed
  • Added 'partner_homedirs' to the [MAIN] section with the full-path home directories of all partner VDIs
  • Changed configuration parameter 'mst_dir' to 'rel_mst_dir', a path relative to the partner home directories
  • Added 'prod_log_name' to the [MAIN] section as the name of the text file in which all production activity is recorded
  • Added 'sync_ignore' to the [MAIN] section with item names that will be ignored during data synchronisation
  • Removed 'publication_dir' from the [REPORT] section since sync_reports was removed
  • Added [UNUSED_COLOUR_PALETTES] section with colours that can be used for actual report settings
"""
***Package for scraping and reporting on www.belastingdienst.nl.***

The basic contents of this package are the following three modules:

- `bd_www.scrape.py` for scraping the site
- `bd_www.matomo.py` for getting usage statistics
- `bd_www.report.py` for reporting about the site

The `bd_www.constants.py` module, containing the package constants, completes
the set.

When importing (from) this package, configuration parameters are read from
the configuration file named by `bd_www.constants.CONFIG_NAME`. Please refer
to the documentation of the `Config` class to learn more about this file.

Each of the three main modules can be used for its own purpose, but they
will often be run consecutively and unattended. For this purpose there is
the simple command-line module `scrape_and_report.py`.

.. include:: release.md
"""
__docformat__ = "restructuredtext"

import os
import sqlite3
import zipfile
from collections.abc import Iterable
from pathlib import Path

from bd_www.constants import CONFIG_NAME

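# UNC path of the user's own home directory ('//<computer>/Users/<user>'),
# determined at runtime from the environment.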
_own_home = Path(
    '//' + os.getenv('computername') + '/Users/' + os.getenv('username'))


class Config:
    """
    **Container class for configuration parameters.**

    Instantiating from this class makes configuration parameters from the
    [MAIN] and specified `section` available via instance attributes.

    ***Instance methods:***

    - `spec_report`: read the specification of a report variation

    ***Instance attributes:***

    All fields and values that are read from the configuration file will be
    added as attributes to the instance.

    ***Configuration file:***

    The file from which the configuration parameters and their values are
    read (set by the constant `bd_www.constants.CONFIG_NAME`) supports the
    following sections and parameters:
    
    [MAIN]

    - `partner_homedirs` - full-path home directories of all partner VDIs
      (`//<VDI name>/users/<userid>`)
    - `rel_mst_dir` - master directory for the complete scrapes storage
      structure [path relative to user home]
    - `scrs_db_name` - name of the scrapes database to store all scraped data,
      except page sources
    - `mtrx_db_name` - name of the metrics database to store the usage data for
      each scrape
    - `prod_log_name` - name of the log file in the master directory to which
      all production activity will be recorded
    - `sync_ignore` - file and directory names that will be ignored while
      synchronising the master directories between partner VDIs

    [SCRAPE]

    - `src_dir_name` - directory within the master directory to store a
      zip-file for each scrape with all page sources
    - `robots_dir_name` - directory within the master directory to save copies
      of 'robots.txt' files after changes from previous versions are detected
    - `log_dir_name` - directory within the master directory to store all
      scrape logs
    - `sitemap_dir_name` - directory within the master directory to save copies
      of 'sitemap.xml' files after changes from previous versions are detected
    - `use_unlinked_urls` - use the latest set of unlinked pages to find
      unscraped pages [yes/no]
    - `use_sitemap` - use the URLs from the sitemap to find unscraped pages
      [yes/no]
    - `log_name` - base name for each scrape log (the timestamp of the
      scrape will be prepended)
    - `max_urls` - maximum number of URLs that will be requested
      [all/`<number>`]
    - `trusted_domains` - sites to be trusted when checking links, given as
      one domain per line

    [MATOMO]

    - `server` - server URL via which the metrics are requested using the
      Matomo API
    - `token` - authentication token for the Matomo API
    - `www_id` - Matomo site id of www.belastingdienst.nl
    - `log_name` - name of the log file in the master directory where all
      Matomo related logging will be recorded

    [REPORT]

    - `reports` - names of each variation of a site report, one per line; for
      each one a section should exist named
      [REPORT_`<capitalised variation name>`]
    - `rep_conf_json` - name of the json-file with the specifications for the
      site reports
    - `kf_details_name` - name of the xlsx-file used for organizing the key
      figures in the site report
    - `data_legend_name` - name of the xlsx-file with the descriptions of
      columns used in the various data sheets of the report
    - `page_groups_name` - name of the csv-file containing the export of the
      group details of Siteimprove
    - `log_name` - name of the log file in the master directory where all
      reporting related logging will be recorded
    - `publ_dir` - directory where duplicates of generated reports will be
      saved [full path]
    - `colour_brdr` - vertical borders of all data cells [hex RRGGBB]
    - `colour_btn_brdr` - button borders [hex RRGGBB]
    - `colour_btn_text` - button text [hex RRGGBB]

    [REPORT_`<VARIATION>`]

    - `incl_fb` - include textual feedback in the report [yes/no]
    - `report_name` - configurable part of the report name, as in
      '220124-0200 - weekly `<report_name>`.xlsx'
    - `report_dir_name` - directory within the master directory to store all
      reports of this variation
    - `colour_hdr_bg` - header background and release notes border [hex RRGGBB]
    - `colour_shade` - background of shaded cells and release notes [hex RRGGBB]
    - `colour_btn_fill` - button background [hex RRGGBB]

    [UNUSED_COLOUR_PALETTES]

    This section is used to save some colour palettes for report configuration.
    As such it does not act as configuration.
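
    ***Example:***

    A minimal sketch of such a configuration file (section and parameter
    names as documented above; all values are illustrative, not actual
    defaults):

        [MAIN]
        partner_homedirs =
            //VDI001/users/user1
            //VDI002/users/user2
        rel_mst_dir = Documents/scrapes
        scrs_db_name = scrapes.db
        mtrx_db_name = metrics.db
        prod_log_name = production.log
        sync_ignore =
            temp
            backup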
    """

    def __init__(self, conf_file: str = CONFIG_NAME,
                 section: str = None):
        """
        **Instantiate a configuration.**

        Arguments:

            conf_file: name of the configuration file
                (default `bd_www.constants.CONFIG_NAME`)
            section: section of the conf_file

        The configuration is read from the [MAIN] and specified `section` of
        the `conf_file` and added as attribute values of the instance. The
        attribute names will be equal to the parameter names of the
        configuration file.

        If the specified configuration file is not found in the current working
        directory, it will be read from the module directory ('bd_www').

        In case a configuration value is a valid integer or boolean,
        the value will be cast as such. Multiline values will be converted to
        a list of strings.
        """
        import configparser
        config = configparser.ConfigParser(inline_comment_prefixes=[';'])
        self._config = config
        if not Path(conf_file).exists():
            conf_file = 'bd_www/' + conf_file
        config.read(conf_file)
        sections = ['MAIN']
        if section:
            sections.append(section)
        for s in sections:
            self._read_section(s)

    def _read_section(self, section: str) -> None:
        """
        **Read a section of the configuration file.**

        Arguments:

            section: section of the configuration file

        Read a section of the configuration file and convert to typed
        attributes of the instance.
        """
        for name, value in self._config[section].items():
            if name.startswith('colour_'):
                value = '#' + value
            elif '\n' in value:
                # Split multiline value into list
                value = [v for v in value.split('\n') if v]
            elif value.isdigit():
                value = int(value)
            elif value == 'yes':
                value = True
            elif value == 'no':
                value = False
            setattr(self, name, value)
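
    # Illustration of the value typing above (hypothetical ini content on
    # the left, resulting attribute values on the right):
    #
    #     max_urls = all           ->  self.max_urls == 'all'
    #     use_sitemap = yes        ->  self.use_sitemap is True
    #     colour_brdr = FF0000     ->  self.colour_brdr == '#FF0000'
    #     trusted_domains =        ->  self.trusted_domains ==
    #         github.com                   ['github.com', 'python.org']
    #         python.org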

    def spec_report(self, report: str) -> None:
        """
        **Read the specification of a report variation.**

        Arguments:

            report: name of the variation

        Read the section of the configuration file for the specified report
        variation and add the parameters as typed attributes to the instance.
        """
        self._read_section('REPORT_' + report.upper())
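
# Usage sketch for the Config class (the 'weekly' variation name is
# hypothetical and must exist in the configuration file):
#
#     conf = Config(section='REPORT')
#     conf.spec_report('weekly')    # adds the [REPORT_WEEKLY] parameters
#     print(conf.report_name, conf.incl_fb)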


class ScrapeConnection(sqlite3.Connection):
    """
    **Connection class for scraping and reporting.**

    Subclass to connect to the scrapes and metrics databases.

    ***Instance methods:***

    - `switch_to`: switch to another scrapes database
    - `switch_back`: switch back to the configured scrapes database

    ***Instance attributes:***

    - `scrs_db`: path to the configured scrapes database
    - `mtrx_db`: path to the configured metrics database
    """

    def __init__(self):
        """
        **Instantiate a master connection.**

        Open a connection to the configured scrapes database with the
        configured metrics database connected as 'mtrx'. These configurations
        come from the `[MAIN]` section of the configuration file as
        documented with the `Config` class.
        """
        if not mst_dir.exists():
            mst_dir.mkdir()
        self.scrs_db = mst_dir / main_conf.scrs_db_name
        if not self.scrs_db.exists():
            self.scrs_db.touch()
        self.mtrx_db = mst_dir / main_conf.mtrx_db_name
        if not self.mtrx_db.exists():
            self.mtrx_db.touch()
        super().__init__(self.scrs_db, isolation_level=None)
        self.execute(f'ATTACH "{self.mtrx_db}" AS mtrx')

    def switch_to(self, alt_db: Path) -> None:
        """
        **Switch to another scrapes database.**

        Arguments:

            alt_db: path to the alternative scrapes database

        Disconnect the configured scrapes database and connect to the
        specified alternative. The configured metrics database will be
        reconnected.
        """
        self.close()
        super().__init__(alt_db, isolation_level=None)
        self.execute(f'ATTACH "{self.mtrx_db}" AS mtrx')

    def switch_back(self) -> None:
        """
        **Switch back to the configured scrapes database.**

        Disconnect from the alternative scrapes database and connect to the
        configured one again. The configured metrics database will be
        reconnected.
        """
        self.close()
        super().__init__(self.scrs_db, isolation_level=None)
        self.execute(f'ATTACH "{self.mtrx_db}" AS mtrx')
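
# Usage sketch for ScrapeConnection (the alternative database path is
# hypothetical):
#
#     conn = ScrapeConnection()
#     conn.switch_to(Path('scrapes_backup.db'))
#     ...  # query the alternative database, with 'mtrx' still attached
#     conn.switch_back()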


class Scrape:
    """
    **Context manager for a specific scrape.**
    
    Enables scrape specific access to the scrapes database.

    ***Instance attributes:***

    - `ts`: timestamp of the selected scrape
    """

    def __init__(self, ts: str):
        """
        **Context for a specific scrape.**

        Arguments:

            ts: timestamp of a scrape [yymmdd-hhmm]

        Within the context all timestamped views (*tsd_...*) of the scrapes
        database are specific for the scrape that is identified by the
        timestamp. The views together represent a dataset that is comparable
        to the scrape tables (*scr_...*) after finishing a scrape.
        """
        self.ts = ts

    def __enter__(self):
        # Register the timestamp via a parameterised query (avoiding string
        # interpolation in SQL), so the tsd_... views become specific for
        # this scrape.
        mst_conn.execute('DELETE FROM tsd')
        mst_conn.execute('INSERT INTO tsd VALUES (?)', (self.ts,))

    def __exit__(self, exc_type, exc_val, exc_tb):
        mst_conn.execute('DELETE FROM tsd')
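
# Usage sketch for the Scrape context manager (the timestamp and the
# 'tsd_pages' view name are hypothetical; actual views follow the 'tsd_...'
# naming pattern):
#
#     with Scrape('230202-0200'):
#         rows = mst_conn.execute('SELECT * FROM tsd_pages').fetchall()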


class PageSourceZip:
    """
    **Encapsulation class to store and retrieve scraped page sources.**

    ***Instance methods:***

    - `add_page`: add page source
    - `get_page`: retrieve page source
    - `iter_pages`: iterable to retrieve all page sources
    - `page_ids`: return all page_id's

    ***Instance attributes:***

    - `path`: path of the zip-file

    """

    def __init__(self, ts: str):
        """
        **Instantiate reference to a page sources zip-file.**

        Arguments:

            ts: timestamp of a scrape [yymmdd-hhmm]

        The zip-file is specific for the scrape with the specified timestamp.

        In case the zip-file does not exist, an empty one is created with the
        name '`<ts>`.zip'.
        """
        src_dir = _own_home / scrape_conf.rel_mst_dir / scrape_conf.src_dir_name
        zip_path = src_dir / (ts + '.zip')
        self.path = zip_path
        if not zip_path.exists():
            self.zip = zipfile.ZipFile(zip_path, mode='w')
            self.zip.close()

    def add_page(self, page_id: int, page_src: str) -> None:
        """
        **Add page source.**

        Arguments:

            page_id: unique identification of the page
            page_src: html source of the page

        The page source will be added with the name '`<page_id>`.html'.

        An exception is raised when a page with `page_id` is already
        available in the zip-file.
        """
        page_name = f'{page_id}.html'
        with zipfile.ZipFile(self.path, mode='a',
                             compression=zipfile.ZIP_DEFLATED,
                             compresslevel=9) as zf:
            if page_name not in zf.namelist():
                zf.writestr(page_name, page_src)
            else:
                raise ValueError(f'{page_name} already stored in {self.path}')

    def get_page(self, page_id: int) -> str:
        """
        **Retrieve page source.**

        Arguments:

            page_id: unique identification of the page

        An exception is raised when the `page_id` is not available in the
        zip-file.
        """
        page_name = f'{page_id}.html'
        with zipfile.ZipFile(self.path) as zf:
            if page_name in zf.namelist():
                return zf.read(page_name).decode()
            else:
                raise ValueError(f'{page_name} not available in {self.path}')

    def iter_pages(self) -> Iterable[tuple[int, str]]:
        """
        **Retrieve all page sources.**

        Returns:

            successive (page_id, page source) tuples for all pages
        """
        with zipfile.ZipFile(self.path) as zf:
            for page_name in zf.namelist():
                yield int(page_name[:-5]), zf.read(page_name).decode()

    def page_ids(self) -> list[int]:
        """
        **Return all page_id's.**

        Returns:

            list with page_id's of all page sources
        """
        with zipfile.ZipFile(self.path) as zf:
            return [int(n.split('.')[0]) for n in zf.namelist()]
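
# Usage sketch for PageSourceZip (timestamp, page_id and page source are
# hypothetical):
#
#     src_zip = PageSourceZip('230202-0200')
#     src_zip.add_page(1, '<html>...</html>')
#     assert src_zip.get_page(1).startswith('<html>')
#     for page_id, page_src in src_zip.iter_pages():
#         ...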


main_conf = Config()
"Configuration instance for general use, read from the [MAIN] section " \
    "of the configuration file."

scrape_conf = Config(section='SCRAPE')
"Configuration instance for scraping, read from the [MAIN] and " \
    "[SCRAPE] section of the configuration file."

matomo_conf = Config(section='MATOMO')
"Configuration instance for requesting metrics, read from the [MAIN] and " \
    "[MATOMO] section of the configuration file."

report_conf = Config(section='REPORT')
"Configuration instance for reporting, read from the [MAIN] and " \
    "[REPORT] section of the configuration file."

mst_dir = _own_home / main_conf.rel_mst_dir
"Master directory with the complete scrapes storage structure."

mst_conn = ScrapeConnection()
"Connection to the scrapes and metrics databases, via which all database " \
    "operations are executed. Remains open while using any of the modules " \
    "and is used when specific scrape data is needed."