bd_www

Package for scraping and reporting on www.belastingdienst.nl.

The basic contents of this package are the following three modules:

  • bd_www.scrape.py for scraping the site
  • bd_www.matomo.py for getting usage statistics
  • bd_www.report.py for reporting about the site

The bd_www.constants.py module, containing the package constants, completes the set.

When importing (from) this package, configuration parameters are read from the configuration file named by bd_www.constants.CONFIG_NAME. Please refer to the documentation of the Config class to learn more about this file.

Each of the three main modules can be used for its own purpose, but they will often be run consecutively and unattended. For this purpose there is the simple command-line module scrape_and_report.py.

Release 410.123.201 (2-2-2023)
Contents

module                 version  remarks
bd_www.constants.py    1.2
bd_www.scrape.py       4.1.0    changed
bd_www.matomo.py       1.2.3    changed
bd_www.report.py       2.0.1    changed
scrape_and_report.py   6.0      changed
revisit_scrapes.py     2.2      changed
bd_www.ini             2.3      changed
report_conf.json       2.4
keyfig_details.xlsx    2.2
data_legend.xlsx       1.2

Changes

General and bd_www
  • Improved multiline value handling to allow multi-word (space-separated) lines in bd_www.ini
  • Replaced user-specific code by introducing a local '_own_home' variable holding the runtime-determined absolute path of the user's home directory
  • Changed configuration parameter 'mst_dir' to 'rel_mst_dir', a path relative to the user's home directory
  • Variable 'mst_dir' is now constituted from '_own_home' and the configured 'rel_mst_dir'
  • Changed log settings from universal to independent per module and, in some cases, per function
  • Fixed various bugs so that the entire bd_www package can be used after creating a new empty data store with bd_www.scrape.create_new_data_store
bd_www.scrape.py
bd_www.matomo.py
bd_www.report.py
scrape_and_report.py
revisit_scrapes.py
bd_www.ini
  • Multiline values with multi-word (space-separated) lines are now allowed
  • Added 'partner_homedirs' to the [MAIN] section with the full-path home directories of all partner VDIs
  • Changed configuration parameter 'mst_dir' to 'rel_mst_dir', a path relative to the partner home directories
  • Added 'prod_log_name' to the [MAIN] section as the name of the text file in which all production activity is recorded
  • Added 'sync_ignore' to the [MAIN] section with item names that will be ignored during data synchronisation
  • Removed 'publication_dir' from the [REPORT] section since sync_reports was removed
  • Added [UNUSED_COLOUR_PALETTES] section with colours that can be used for actual report settings
"""
***Package for scraping and reporting on www.belastingdienst.nl.***

The basic contents of this package are the following three modules:

- `bd_www.scrape.py` for scraping the site
- `bd_www.matomo.py` for getting usage statistics
- `bd_www.report.py` for reporting about the site

The `bd_www.constants.py` module, containing the package constants, completes
the set.

When importing (from) this package, configuration parameters are read from
the configuration file named by `bd_www.constants.CONFIG_NAME`. Please refer
to the documentation of the `Config` class to learn more about this file.

Each of the three main modules can be used for its own purpose, but they
will often be run consecutively and unattended. For this purpose there is
the simple command-line module `scrape_and_report.py`.

.. include:: release.md
"""
__docformat__ = "restructuredtext"

import os
import sqlite3
import zipfile
from collections.abc import Iterable
from pathlib import Path

from bd_www.constants import CONFIG_NAME

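# UNC path of the user's own home directory ('//<computer>/Users/<user>'),
# determined at runtime from the environment.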
_own_home = Path(
    '//' + os.getenv('computername') + '/Users/' + os.getenv('username'))


class Config:
    """
    **Container class for configuration parameters.**

    Instantiating from this class makes configuration parameters from the
    [MAIN] and specified `section` available via instance attributes.

    ***Instance methods:***

    - `spec_report`: read the specification of a report variation

    ***Instance attributes:***

    All fields and values that are read from the configuration file will be
    added as attributes to the instance.

    ***Configuration file:***

    The file from which the configuration parameters and their values are
    read (set by the constant `bd_www.constants.CONFIG_NAME`) supports the
    following sections and parameters:
    
    [MAIN]

    - `partner_homedirs` - full-path home directories of all partner VDIs
      (`//<VDI name>/users/<userid>`)
    - `rel_mst_dir` - master directory for the complete scrapes storage
      structure [path relative to user home]
    - `scrs_db_name` - name of the scrapes database to store all scraped data,
      except page sources
    - `mtrx_db_name` - name of the metrics database to store the usage data for
      each scrape
    - `prod_log_name` - name of the log file in the master directory to which
      all production activity will be recorded
    - `sync_ignore` - file and directory names that will be ignored while
      synchronising the master directories between partner VDIs

    [SCRAPE]

    - `src_dir_name` - directory within the master directory to store a
      zip-file for each scrape with all page sources
    - `robots_dir_name` - directory within the master directory to save copies
      of 'robots.txt' files after changes from previous versions are detected
    - `log_dir_name` - directory within the master directory to store all
      scrape logs
    - `sitemap_dir_name` - directory within the master directory to save copies
      of 'sitemap.xml' files after changes from previous versions are detected
    - `use_unlinked_urls` - use the latest set of unlinked pages to find
      unscraped pages [yes/no]
    - `use_sitemap` - use the URLs from the sitemap to find unscraped pages
      [yes/no]
    - `log_name` - base name for each scrape log (the timestamp of the
      scrape will be prepended)
    - `max_urls` - maximum number of URLs that will be requested
      [all/`<number>`]
    - `trusted_domains` - sites to be trusted when checking links, given as
      one domain per line

    [MATOMO]

    - `server` - server URL via which the metrics are requested using the
      Matomo API
    - `token` - authentication token for the Matomo API
    - `www_id` - Matomo site id of www.belastingdienst.nl
    - `log_name` - name of the log file in the master directory where all
      Matomo related logging will be recorded

    [REPORT]

    - `reports` - names of each variation of a site report, one per line; for
      each one a section should exist named
      [REPORT_`<capitalised variation name>`]
    - `rep_conf_json` - name of the json-file with the specifications for the
      site reports
    - `kf_details_name` - name of the xlsx-file used for organizing the key
      figures in the site report
    - `data_legend_name` - name of the xlsx-file with the descriptions of
      columns used in the various data sheets of the report
    - `page_groups_name` - name of the csv-file containing the export of the
      group details of Siteimprove
    - `log_name` - name of the log file in the master directory where all
      reporting related logging will be recorded
    - `publ_dir` - directory where duplicates of generated reports will be
      saved [full path]
    - `colour_brdr` - vertical borders of all data cells [hex RRGGBB]
    - `colour_btn_brdr` - button borders [hex RRGGBB]
    - `colour_btn_text` - button text [hex RRGGBB]

    [REPORT_`<VARIATION>`]

    - `incl_fb` - include textual feedback in the report [yes/no]
    - `report_name` - configurable part of the report name, as in
      '220124-0200 - weekly `<report_name>`.xlsx'
    - `report_dir_name` - directory within the master directory to store all
      reports of this variation
    - `colour_hdr_bg` - header background and release notes border [hex RRGGBB]
    - `colour_shade` - background of shaded cells and release notes [hex RRGGBB]
    - `colour_btn_fill` - button background [hex RRGGBB]

    [UNUSED_COLOUR_PALETTES]

    This section is used to save some colour palettes for report configuration.
    As such it does not act as configuration.
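
    ***Example:***

    A minimal sketch of such a configuration file (section and parameter
    names as documented above; all values are illustrative, not actual
    defaults):

        [MAIN]
        partner_homedirs =
            //VDI001/users/user1
            //VDI002/users/user2
        rel_mst_dir = Documents/scrapes
        scrs_db_name = scrapes.db
        mtrx_db_name = metrics.db
        prod_log_name = production.log
        sync_ignore =
            temp
            backup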
    """

    def __init__(self, conf_file: str = CONFIG_NAME,
                 section: str = None):
        """
        **Instantiate a configuration.**

        Arguments:

            conf_file: name of the configuration file
                (default `bd_www.constants.CONFIG_NAME`)
            section: section of the conf_file

        The configuration is read from the [MAIN] and specified `section` of
        the `conf_file` and added as attribute values of the instance. The
        attribute names will be equal to the parameter names of the
        configuration file.

        If the specified configuration file is not found in the current working
        directory, it will be read from the module directory ('bd_www').

        In case a configuration value is a valid integer or boolean,
        the value will be cast as such. Multiline values will be converted to
        a list of strings.
        """
        import configparser
        config = configparser.ConfigParser(inline_comment_prefixes=[';'])
        self._config = config
        if not Path(conf_file).exists():
            conf_file = 'bd_www/' + conf_file
        config.read(conf_file)
        sections = ['MAIN']
        if section:
            sections.append(section)
        for s in sections:
            self._read_section(s)

    def _read_section(self, section: str) -> None:
        """
        **Read a section of the configuration file.**

        Arguments:

            section: section of the configuration file

        Read a section of the configuration file and convert to typed
        attributes of the instance.
        """
        for name, value in self._config[section].items():
            if name.startswith('colour_'):
                value = '#' + value
            elif '\n' in value:
                # Split multiline value into list
                value = [v for v in value.split('\n') if v]
            elif value.isdigit():
                value = int(value)
            elif value == 'yes':
                value = True
            elif value == 'no':
                value = False
            setattr(self, name, value)
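
    # Illustration of the value typing above (hypothetical ini content on
    # the left, resulting attribute values on the right):
    #
    #     max_urls = all           ->  self.max_urls == 'all'
    #     use_sitemap = yes        ->  self.use_sitemap is True
    #     colour_brdr = FF0000     ->  self.colour_brdr == '#FF0000'
    #     trusted_domains =        ->  self.trusted_domains ==
    #         github.com                   ['github.com', 'python.org']
    #         python.org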

    def spec_report(self, report: str) -> None:
        """
        **Read the specification of a report variation.**

        Arguments:

            report: name of the variation

        Read the section of the configuration file for the specified report
        variation and add the parameters as typed attributes to the instance.
        """
        self._read_section('REPORT_' + report.upper())
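
# Usage sketch for the Config class (the 'weekly' variation name is
# hypothetical and must exist in the configuration file):
#
#     conf = Config(section='REPORT')
#     conf.spec_report('weekly')    # adds the [REPORT_WEEKLY] parameters
#     print(conf.report_name, conf.incl_fb)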


class ScrapeConnection(sqlite3.Connection):
    """
    **Connection class for scraping and reporting.**

    Subclass to connect to the scrapes and metrics databases.

    ***Instance methods:***

    - `switch_to`: switch to another scrapes database
    - `switch_back`: switch back to the configured scrapes database

    ***Instance attributes:***

    - `scrs_db`: path to the configured scrapes database
    - `mtrx_db`: path to the configured metrics database
    """

    def __init__(self):
        """
        **Instantiate a master connection.**

        Open a connection to the configured scrapes database with the
        configured metrics database connected as 'mtrx'. These configurations
        come from the `[MAIN]` section of the configuration file as
        documented with the `Config` class.
        """
        if not mst_dir.exists():
            mst_dir.mkdir()
        self.scrs_db = mst_dir / main_conf.scrs_db_name
        if not self.scrs_db.exists():
            self.scrs_db.touch()
        self.mtrx_db = mst_dir / main_conf.mtrx_db_name
        if not self.mtrx_db.exists():
            self.mtrx_db.touch()
        super().__init__(self.scrs_db, isolation_level=None)
        self.execute(f'ATTACH "{self.mtrx_db}" AS mtrx')

    def switch_to(self, alt_db: Path) -> None:
        """
        **Switch to another scrapes database.**

        Arguments:

            alt_db: path to the alternative scrapes database

        Disconnect the configured scrapes database and connect to the
        specified alternative. The configured metrics database will be
        reconnected.
        """
        self.close()
        super().__init__(alt_db, isolation_level=None)
        self.execute(f'ATTACH "{self.mtrx_db}" AS mtrx')

    def switch_back(self) -> None:
        """
        **Switch back to the configured scrapes database.**

        Disconnect from the alternative scrapes database and connect to the
        configured one again. The configured metrics database will be
        reconnected.
        """
        self.close()
        super().__init__(self.scrs_db, isolation_level=None)
        self.execute(f'ATTACH "{self.mtrx_db}" AS mtrx')
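
# Usage sketch for ScrapeConnection (the alternative database path is
# hypothetical):
#
#     conn = ScrapeConnection()
#     conn.switch_to(Path('scrapes_backup.db'))
#     ...  # query the alternative database, with 'mtrx' still attached
#     conn.switch_back()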


class Scrape:
    """
    **Context manager for a specific scrape.**
    
    Enables scrape specific access to the scrapes database.

    ***Instance attributes:***

    - `ts`: timestamp of the selected scrape
    """

    def __init__(self, ts: str):
        """
        **Context for a specific scrape.**

        Arguments:

            ts: timestamp of a scrape [yymmdd-hhmm]

        Within the context all timestamped views (*tsd_...*) of the scrapes
        database are specific for the scrape that is identified by the
        timestamp. The views together represent a dataset that is comparable
        to the scrape tables (*scr_...*) after finishing a scrape.
        """
        self.ts = ts

    def __enter__(self):
        # Register the timestamp via a parameterised query (avoiding string
        # interpolation in SQL), so the tsd_... views become specific for
        # this scrape.
        mst_conn.execute('DELETE FROM tsd')
        mst_conn.execute('INSERT INTO tsd VALUES (?)', (self.ts,))

    def __exit__(self, exc_type, exc_val, exc_tb):
        mst_conn.execute('DELETE FROM tsd')
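
# Usage sketch for the Scrape context manager (the timestamp and the
# 'tsd_pages' view name are hypothetical; actual views follow the 'tsd_...'
# naming pattern):
#
#     with Scrape('230202-0200'):
#         rows = mst_conn.execute('SELECT * FROM tsd_pages').fetchall()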


class PageSourceZip:
    """
    **Encapsulation class to store and retrieve scraped page sources.**

    ***Instance methods:***

    - `add_page`: add page source
    - `get_page`: retrieve page source
    - `iter_pages`: iterable to retrieve all page sources
    - `page_ids`: return all page_id's

    ***Instance attributes:***

    - `path`: path of the zip-file

    """

    def __init__(self, ts: str):
        """
        **Instantiate reference to a page sources zip-file.**

        Arguments:

            ts: timestamp of a scrape [yymmdd-hhmm]

        The zip-file is specific for the scrape with the specified timestamp.

        In case the zip-file does not exist, an empty one is created with the
        name '`<ts>`.zip'.
        """
        src_dir = _own_home / scrape_conf.rel_mst_dir / scrape_conf.src_dir_name
        zip_path = src_dir / (ts + '.zip')
        self.path = zip_path
        if not zip_path.exists():
            self.zip = zipfile.ZipFile(zip_path, mode='w')
            self.zip.close()

    def add_page(self, page_id: int, page_src: str) -> None:
        """
        **Add page source.**

        Arguments:

            page_id: unique identification of the page
            page_src: html source of the page

        The page source will be added with the name '`<page_id>`.html'.

        An exception is raised when a page with `page_id` is already
        available in the zip-file.
        """
        page_name = f'{page_id}.html'
        with zipfile.ZipFile(self.path, mode='a',
                             compression=zipfile.ZIP_DEFLATED,
                             compresslevel=9) as zf:
            if page_name not in zf.namelist():
                zf.writestr(page_name, page_src)
            else:
                raise ValueError(f'{page_name} already stored in {self.path}')

    def get_page(self, page_id: int) -> str:
        """
        **Retrieve page source.**

        Arguments:

            page_id: unique identification of the page

        An exception is raised when the `page_id` is not available in the
        zip-file.
        """
        page_name = f'{page_id}.html'
        with zipfile.ZipFile(self.path) as zf:
            if page_name in zf.namelist():
                return zf.read(page_name).decode()
            else:
                raise ValueError(f'{page_name} not available in {self.path}')

    def iter_pages(self) -> Iterable[tuple[int, str]]:
        """
        **Retrieve all page sources.**

        Returns:

            successive (page_id, page source) tuples for all pages
        """
        with zipfile.ZipFile(self.path) as zf:
            for page_name in zf.namelist():
                yield int(page_name[:-5]), zf.read(page_name).decode()

    def page_ids(self) -> list[int]:
        """
        **Return all page_id's.**

        Returns:

            list with page_id's of all page sources
        """
        with zipfile.ZipFile(self.path) as zf:
            return [int(n.split('.')[0]) for n in zf.namelist()]
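
# Usage sketch for PageSourceZip (timestamp, page_id and page source are
# hypothetical):
#
#     src_zip = PageSourceZip('230202-0200')
#     src_zip.add_page(1, '<html>...</html>')
#     assert src_zip.get_page(1).startswith('<html>')
#     for page_id, page_src in src_zip.iter_pages():
#         ...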


main_conf = Config()
"Configuration instance for general use, read from the [MAIN] section " \
    "of the configuration file."

scrape_conf = Config(section='SCRAPE')
"Configuration instance for scraping, read from the [MAIN] and " \
    "[SCRAPE] section of the configuration file."

matomo_conf = Config(section='MATOMO')
"Configuration instance for requesting metrics, read from the [MAIN] and " \
    "[MATOMO] section of the configuration file."

report_conf = Config(section='REPORT')
"Configuration instance for reporting, read from the [MAIN] and " \
    "[REPORT] section of the configuration file."

mst_dir = _own_home / main_conf.rel_mst_dir
"Master directory with the complete scrapes storage structure."

mst_conn = ScrapeConnection()
"Connection to the scrapes and metrics databases, via which all database " \
    "operations are executed. Remains open while using any of the modules " \
    "and is used when specific scrape data is needed."