bd_www
Package for scraping of and reporting about www.belastingdienst.nl.
The basic contents of this package are the next three modules:
bd_www.scrape.pyfor scraping the sitebd_www.matomo.pyfor getting usage statisticsbd_www.report.pyfor reporting about the site
The bd_www.constants.py module, containing the package constants, completes
the set.
While importing (from) this package, configuration parameters are read from
the configuration file with bd_www.constants.CONFIG_NAME as name. Please
refer to the documentation of the Config class to learn more about this file.
All three main modules can be used for their separate respective purposes,
but often these modules will be run consecutively and unattended. For this
purpose there is a simple command line module scrape_and_report.py.
Release 410.123.201 (2-2-2023)
Contents
| module | version | remarks |
|---|---|---|
| bd_www.constants.py | 1.2 | |
| bd_www.scrape.py | 4.1.0 | changed |
| bd_www.matomo.py | 1.2.3 | changed |
| bd_www.report.py | 2.0.1 | changed |
| scrape_and_report.py | 6.0 | changed |
| revisit_scrapes.py | 2.2 | changed |
| bd_www.ini | 2.3 | changed |
| report_conf.json | 2.4 | |
| keyfig_details.xlsx | 2.2 | |
| data_legend.xlsx | 1.2 |
Changes
General and bd_www
- Improved multiline values file to allow multi word (space separated) lines in bd_www.ini
- Changed user specific code by introducing local '_own_home' variable with runtime determined absolute path of the users home directory
- Configuration parameter 'mst_dir' changed to 'rel_mst_dir' as path relative to the users home directory
- Variable 'mst_dir' is constituted from '_own_home' and the configured 'rel_mst_dir'
- Log settings changed from universal to independent per module and in some cases per function
- Fixed various bugs to be able to use all the
bd_wwwpackage after creating a new empty data store usingbd_www.scrape.create_new_data_store
bd_www.scrape.py
- Fixed the situation that
bd_www.matomo.matomo_availablereturned True when undergoing maintenance - Added parameter 'data_dir' to
bd_www.scrape.valid_scrapesto enable using it on other master directories - Renamed
bd_www.scrape.create_scrapes_storagetobd_www.scrape.create_new_data_store - Included the metrics database in
bd_www.scrape.create_new_data_store, which was still missing
bd_www.matomo.py
- Changed waiting time for metrics from 24 to 20 hours to ensure that the run concludes within 24 hours
- Fixed sql bug in creating the temporary 'dl_data' table in
bd_www.matomo.period_downloads - Custom reports 136 removed because no longer available; was used in
bd_www.matomo.period_feedback
bd_www.report.py
- Fixed bug in
bd_www.report.ReportWorkbook.close_and_publish - Removed 'sync_reports' function; with the new 'partner production' all partner VDI's receive all data
scrape_and_report.py
- Introduced partner production (see the documentation of the
scrape_and_report.pymodule) - Reports of the last days are no longer synced to the 'datasluis' since all partner VDI's will have all of them
- Added
bd_www.scrape_and_report.chosen_to_produceto decide if the calling VDI is the one to produce - Added
bd_www.scrape_and_report.sync_master_directoryto use for identical and up-to-date data stores in all partner VDI's - Added
bd_www.scrape_and_report.sync_one_wayto synchronise a directory tree from source to target
revisit_scrapes.py
- Fixed bug by adding the omitted
bd_www.matomo.period_downloadswhile renewing the analytics data
bd_www.ini
- Multiline values with multi word (space separated) lines are now allowed
- Added 'partner_homedirs' to the [MAIN] section with the full path home directories of all partner VDI's
- Configuration parameter 'mst_dir' changed to 'rel_mst_dir' as path relative to the partner home directories
- Added 'prod_log_name' to the [MAIN] section as name of the text file to record all production activity
- Added 'sync_ignore' to the [MAIN] section with item names that will be ignored during data synchronisation
- Removed 'publication_dir' from [REPORT] section since removing
sync_reports - Added [UNUSED_COLOUR_PALETTES] section with colours that can be used for actual report settings
View Source
""" ***Package for scraping of and reporting about www.belastingdienst.nl.*** The basic contents of this package are the next three modules: - `bd_www.scrape.py` for scraping the site - `bd_www.matomo.py` for getting usage statistics - `bd_www.report.py` for reporting about the site The `bd_www.constants.py` module, containing the package constants, completes the set. While importing (from) this package, configuration parameters are read from the configuration file with `bd_www.constants.CONFIG_NAME` as name. Please refer to the documentation of the `Config` class to learn more about this file. All three main modules can be used for their separate respective purposes, but often these modules will be run consecutively and unattended. For this purpose there is a simple command line module `scrape_and_report.py`. .. include:: release.md """ __docformat__ = "restructuredtext" import os import sqlite3 import zipfile from collections.abc import Iterable from pathlib import Path from bd_www.constants import CONFIG_NAME _own_home = Path( '//' + os.getenv('computername') + '/Users/' + os.getenv('username')) class Config: """ **Container class for configuration parameters.** Instantiating from this class makes configuration parameters from the [MAIN] and specified `section` available via instance attributes. ***Instance methods:*** - `spec_report`: read the specification of a report variation ***Instance attributes:*** All fields and values that are read from the configuration file will be added as attributes to the instance. ***Configuration file:*** The file from which the configuration parameters and their values are read (set by the constant `bd_www.constants.CONFIG_NAME`) supports the next sections and parameters: [MAIN] - `partner_homedirs` - full path home directories of all partner VDI's (`//<VDI name>/users/<userid>`) - `rel_mst_dir` - master directory for the complete scrapes storage structure [path relative to user home] - `scr_db_name` - name of the scrapes database to store all scraped data, except page sources - `mtrx_db_name` - name of the metrics database to store the usage data for each scrape - `prod_log_name` - name of the log file in the master directory to which all production activity will be recorded - `sync_ignore` - file and directory names that will be ignored while synchronising the master directories between partner VDI's [SCRAPE] - `src_dir_name` - directory within the master directory to store a zip-file for each scrape with all page sources - `robots_dir_name` - directory within the master directory to save copies of 'robots.txt' files after changes from previous versions are detected - `log_dir_name` - directory within the master directory to store all scrape logs - `sitemap_dir_name` - directory within the master directory to save copies of 'sitemap.xml' files after changes from previous versions are detected - `use_unlinked_urls` - use the latest set of unlinked pages to find unscraped pages [yes/no] - `use_sitemap` - use the url's from the sitemap to find unscraped pages [yes/no] - `log_name` - base name for each scrape log (will be prepended with the timestamp of the scrape) - `max_urls` - maximum number of url's that will be requested [all/`<number>`] - `trusted_domains` - sites to be trusted when checking links, given as one domain per line [MATOMO] - `server` - server url via which the metrics are requested using the Matomo API - `token` - authentication token for the Matomo API - `www_id` - Matomo site id of www.belastingdienst.nl - `log_name` - name of the log file in the master directory where all Matomo related logging will be recorded [REPORT] - `reports` - names of each variation of a site report, one per line; for each one a section should exist named [REPORT_`<capitalised variation name>`] - `rep_conf_json` - name of the json-file with the specifications for the site reports - `kf_details_name` - name of the xlsx-file used for organizing the key figures in the site report - `data_legend_name` - name of the xlsx-file with the descriptions of columns used in the various data sheets of the report - `page_groups_name` - name of the csv-file containing the export of the group details of Siteimprove - `log_name` - name of the log file in the master directory where all reporting related logging will be recorded - `publ_dir` - directory where duplicates of generated reports will be saved [full path] - `colour_brdr` - vertical borders of all data cells [hex RRGGBB] - `colour_btn_brdr` - button borders [hex RRGGBB] - `colour_btn_text` - button text [hex RRGGBB] [REPORT_`<VARIATION>`] - `incl_fb` - include textual feedback in the report [yes/no] - `report_name` - configurable part of the report name, as in '220124-0200 - weekly `<report_name>`.xlsx' - `report_dir_name` - directory within the master directory to store all reports of this variation - `colour_hdr_bg` - header background and release notes border [hex RRGGBB] - `colour_shade` - background of shaded cells and release notes [hex RRGGBB] - `colour_btn_fill` - button background [hex RRGGBB] [UNUSED_COLOUR_PALETTES] This section is used to save some colour palettes for report configuration. As such it does not act as configuration. """ def __init__(self, conf_file: str = CONFIG_NAME, section: str = None): """ **Instantiate a configuration.** Arguments: conf_file: name of the configuration file (default `bd_www.contants.CONFIG_NAME`) section: section of the conf_file The configuration is read from the [MAIN] and specified `section` of the `conf_file` and added as attribute values of the instance. The attribute names will be equal to the parameter names of the configuration file. If the specified configuration file is not found in the current working directory, it will be read from the module directory ('bd_www'). In case a configuration value is a valid integer or boolean, the value will be cast as such. Multiline values will be converted to a list of strings. """ import configparser config = configparser.ConfigParser(inline_comment_prefixes=[';']) self._config = config if not Path(conf_file).exists(): conf_file = 'bd_www/' + conf_file config.read(conf_file) sections = ['MAIN'] if section: sections.append(section) for s in sections: self._read_section(s) def _read_section(self, section: str) -> None: """ **Read a section of the configuration file.** Arguments: section: section of the configuration file Read a section of the configuration file and convert to typed attributes of the instance. """ for name, value in self._config[section].items(): if name.startswith('colour_'): value = '#' + value elif '\n' in value: # Split multiline value into list value = [v for v in value.split('\n') if v] elif value.isdigit(): value = int(value) elif value == 'yes': value = True elif value == 'no': value = False setattr(self, name, value) def spec_report(self, report: str) -> None: """ **Read the specification of a report variation.** Arguments: report: name of the variation Read the section of the configuration file for the specified report variation and add the parameters as typed attributes to the instance. """ self._read_section('REPORT_' + report.upper()) class ScrapeConnection(sqlite3.Connection): """ **Connection class for scraping and reporting.** Subclass to connect to the scrapes and metrics databases. ***Instance methods:*** - `switch_to`: switch to another scrapes database - `switch_back`: switch back to the configured scrapes database ***Instance attributes:*** - `scrs_db`: path to the configured scrapes database - `mtrx_db`: path to the configured metrics database """ def __init__(self): """ **Instantiate a master connection.** Open a connection to the configured scrapes database with the configured metrics database connected as 'mtrx'. These configurations come from the `[Main]` section of the configuration file as documented with the `Config` class. """ if not mst_dir.exists(): mst_dir.mkdir() self.scrs_db = mst_dir / main_conf.scrs_db_name if not self.scrs_db.exists(): self.scrs_db.touch() self.mtrx_db = mst_dir / main_conf.mtrx_db_name if not self.mtrx_db.exists(): self.mtrx_db.touch() super().__init__(self.scrs_db, isolation_level=None) self.execute(f'ATTACH "{self.mtrx_db}" AS mtrx') def switch_to(self, alt_db: Path) -> None: """ **Switch to another scrapes database.** Arguments: alt_db: path to the alternative scrapes database Disconnect the configured scrapes database and connect to the specified alternative. The configured metrics database will be reconnected. """ self.close() super().__init__(alt_db, isolation_level=None) self.execute(f'ATTACH "{self.mtrx_db}" AS mtrx') def switch_back(self) -> None: """ **Switch back to the configured scrapes database.** Disconnect from the alternative scrapes database and connect to the configured one again. The configured metrics database will be reconnected. """ self.close() super().__init__(self.scrs_db, isolation_level=None) self.execute(f'ATTACH "{self.mtrx_db}" AS mtrx') class Scrape: """ **Context manager for a specific scrape.** Enables scrape specific access to the scrapes database. ***Instance attributes:*** - `ts`: timestamp of the selected scrape """ def __init__(self, ts: str): """ **Context for a specific scrape.** Arguments: ts: timestamp of a scrape [yymmdd-hhmm] Within the context all timestamped views (*tsd_...*) of the scrapes database are specific for the scrape that is identified by the timestamp. The views together represent a dataset that is comparable to the scrape tables (*scr_...*) after finishing a scrape. """ self.ts = ts def __enter__(self): mst_conn.executescript(f''' DELETE FROM tsd; INSERT INTO tsd VALUES ('{self.ts}') ''') def __exit__(self, exc_type, exc_val, exc_tb): mst_conn.executescript(''' DELETE FROM tsd ''') class PageSourceZip: """ **Encapsulation class to store and retrieve scraped page sources.** ***Instance methods:*** - `add_page`: add page source - `get_page`: retrieve page source - `iter_pages`: iterable to retrieve all page sources - `page_ids`: return all page_id's ***Instance attributes:*** - `path`: path of the zip-file """ def __init__(self, ts: str): """ **Instantiate reference to a page sources zip-file.** Arguments: ts: timestamp of a scrape [yymmdd-hhmm] The zip-file is specific for the scrape with the specified timestamp. In case the zip-file does not exist, an empty one is created with the name '`<ts>`.zip'. """ src_dir = _own_home / scrape_conf.rel_mst_dir / scrape_conf.src_dir_name zip_path = src_dir / (ts + '.zip') self.path = zip_path if not zip_path.exists(): self.zip = zipfile.ZipFile(zip_path, mode='w') self.zip.close() def add_page(self, page_id: int, page_src: str) -> None: """ **Add page source.** Arguments: page_id: unique identification of the page page_src: html source of the page The page source will be added with the name '`<page_id>`.html'. An exception is raised when a page with `page_id` is already available in the zip-file. """ page_name = f'{page_id}.html' with zipfile.ZipFile(self.path, mode='a', compression=zipfile.ZIP_DEFLATED, compresslevel=9) as zf: if page_name not in zf.namelist(): zf.writestr(page_name, page_src) else: raise ValueError(f'{page_name} already stored in {self.path}') def get_page(self, page_id: int) -> str: """ **Retrieve page source.** Arguments: page_id: unique identification of the page An exception is raised when the `page_id` is not available in the zip-file. """ page_name = f'{page_id}.html' with zipfile.ZipFile(self.path) as zf: if page_name in zf.namelist(): return zf.read(page_name).decode() else: raise ValueError(f'{page_name} not available in {self.path}') def iter_pages(self) -> Iterable[tuple[int, str]]: """ **Retrieve all page sources.** Returns: successive (page_id, page source) tuples for all pages """ with zipfile.ZipFile(self.path) as zf: for page_name in zf.namelist(): yield int(page_name[:-5]), zf.read(page_name).decode() def page_ids(self) -> list[int]: """ **Return all page_id's.** Returns: list with page_id's of all page sources """ with zipfile.ZipFile(self.path) as zf: return [int(n.split('.')[0]) for n in zf.namelist()] main_conf = Config() "Configuration instance for general use, read from the [MAIN] section " \ "of the configuration file." scrape_conf = Config(section='SCRAPE') "Configuration instance for scraping, read from the [MAIN] and " \ "[SCRAPE] section of the configuration file." matomo_conf = Config(section='MATOMO') "Configuration instance for requesting metrics, read from the [MAIN] and " \ "[MATOMO] section of the configuration file." report_conf = Config(section='REPORT') "Configuration instance for reporting, read from the [MAIN] and " \ "[REPORT] section of the configuration file." mst_dir = _own_home / main_conf.rel_mst_dir "Master directory with the complete scrapes storage structure." mst_conn = ScrapeConnection() "Connection to the scrapes and metrics databases, via which all database " \ "operations are executed. Remains open while using any of the modules " \ "and is used when specific scrape data is needed."
View Source
class Config: """ **Container class for configuration parameters.** Instantiating from this class makes configuration parameters from the [MAIN] and specified `section` available via instance attributes. ***Instance methods:*** - `spec_report`: read the specification of a report variation ***Instance attributes:*** All fields and values that are read from the configuration file will be added as attributes to the instance. ***Configuration file:*** The file from which the configuration parameters and their values are read (set by the constant `bd_www.constants.CONFIG_NAME`) supports the next sections and parameters: [MAIN] - `partner_homedirs` - full path home directories of all partner VDI's (`//<VDI name>/users/<userid>`) - `rel_mst_dir` - master directory for the complete scrapes storage structure [path relative to user home] - `scr_db_name` - name of the scrapes database to store all scraped data, except page sources - `mtrx_db_name` - name of the metrics database to store the usage data for each scrape - `prod_log_name` - name of the log file in the master directory to which all production activity will be recorded - `sync_ignore` - file and directory names that will be ignored while synchronising the master directories between partner VDI's [SCRAPE] - `src_dir_name` - directory within the master directory to store a zip-file for each scrape with all page sources - `robots_dir_name` - directory within the master directory to save copies of 'robots.txt' files after changes from previous versions are detected - `log_dir_name` - directory within the master directory to store all scrape logs - `sitemap_dir_name` - directory within the master directory to save copies of 'sitemap.xml' files after changes from previous versions are detected - `use_unlinked_urls` - use the latest set of unlinked pages to find unscraped pages [yes/no] - `use_sitemap` - use the url's from the sitemap to find unscraped pages [yes/no] - `log_name` - base name for each scrape log (will be prepended with the timestamp of the scrape) - `max_urls` - maximum number of url's that will be requested [all/`<number>`] - `trusted_domains` - sites to be trusted when checking links, given as one domain per line [MATOMO] - `server` - server url via which the metrics are requested using the Matomo API - `token` - authentication token for the Matomo API - `www_id` - Matomo site id of www.belastingdienst.nl - `log_name` - name of the log file in the master directory where all Matomo related logging will be recorded [REPORT] - `reports` - names of each variation of a site report, one per line; for each one a section should exist named [REPORT_`<capitalised variation name>`] - `rep_conf_json` - name of the json-file with the specifications for the site reports - `kf_details_name` - name of the xlsx-file used for organizing the key figures in the site report - `data_legend_name` - name of the xlsx-file with the descriptions of columns used in the various data sheets of the report - `page_groups_name` - name of the csv-file containing the export of the group details of Siteimprove - `log_name` - name of the log file in the master directory where all reporting related logging will be recorded - `publ_dir` - directory where duplicates of generated reports will be saved [full path] - `colour_brdr` - vertical borders of all data cells [hex RRGGBB] - `colour_btn_brdr` - button borders [hex RRGGBB] - `colour_btn_text` - button text [hex RRGGBB] [REPORT_`<VARIATION>`] - `incl_fb` - include textual feedback in the report [yes/no] - `report_name` - configurable part of the report name, as in '220124-0200 - weekly `<report_name>`.xlsx' - `report_dir_name` - directory within the master directory to store all reports of this variation - `colour_hdr_bg` - header background and release notes border [hex RRGGBB] - `colour_shade` - background of shaded cells and release notes [hex RRGGBB] - `colour_btn_fill` - button background [hex RRGGBB] [UNUSED_COLOUR_PALETTES] This section is used to save some colour palettes for report configuration. As such it does not act as configuration. """ def __init__(self, conf_file: str = CONFIG_NAME, section: str = None): """ **Instantiate a configuration.** Arguments: conf_file: name of the configuration file (default `bd_www.contants.CONFIG_NAME`) section: section of the conf_file The configuration is read from the [MAIN] and specified `section` of the `conf_file` and added as attribute values of the instance. The attribute names will be equal to the parameter names of the configuration file. If the specified configuration file is not found in the current working directory, it will be read from the module directory ('bd_www'). In case a configuration value is a valid integer or boolean, the value will be cast as such. Multiline values will be converted to a list of strings. """ import configparser config = configparser.ConfigParser(inline_comment_prefixes=[';']) self._config = config if not Path(conf_file).exists(): conf_file = 'bd_www/' + conf_file config.read(conf_file) sections = ['MAIN'] if section: sections.append(section) for s in sections: self._read_section(s) def _read_section(self, section: str) -> None: """ **Read a section of the configuration file.** Arguments: section: section of the configuration file Read a section of the configuration file and convert to typed attributes of the instance. """ for name, value in self._config[section].items(): if name.startswith('colour_'): value = '#' + value elif '\n' in value: # Split multiline value into list value = [v for v in value.split('\n') if v] elif value.isdigit(): value = int(value) elif value == 'yes': value = True elif value == 'no': value = False setattr(self, name, value) def spec_report(self, report: str) -> None: """ **Read the specification of a report variation.** Arguments: report: name of the variation Read the section of the configuration file for the specified report variation and add the parameters as typed attributes to the instance. """ self._read_section('REPORT_' + report.upper())
Container class for configuration parameters.
Instantiating from this class makes configuration parameters from the
[MAIN] and specified section available via instance attributes.
Instance methods:
spec_report: read the specification of a report variation
Instance attributes:
All fields and values that are read from the configuration file will be added as attributes to the instance.
Configuration file:
The file from which the configuration parameters and their values are
read (set by the constant bd_www.constants.CONFIG_NAME) supports the
next sections and parameters:
[MAIN]
partner_homedirs- full path home directories of all partner VDI's (//<VDI name>/users/<userid>)rel_mst_dir- master directory for the complete scrapes storage structure [path relative to user home]scr_db_name- name of the scrapes database to store all scraped data, except page sourcesmtrx_db_name- name of the metrics database to store the usage data for each scrapeprod_log_name- name of the log file in the master directory to which all production activity will be recordedsync_ignore- file and directory names that will be ignored while synchronising the master directories between partner VDI's
[SCRAPE]
src_dir_name- directory within the master directory to store a zip-file for each scrape with all page sourcesrobots_dir_name- directory within the master directory to save copies of 'robots.txt' files after changes from previous versions are detectedlog_dir_name- directory within the master directory to store all scrape logssitemap_dir_name- directory within the master directory to save copies of 'sitemap.xml' files after changes from previous versions are detecteduse_unlinked_urls- use the latest set of unlinked pages to find unscraped pages [yes/no]use_sitemap- use the url's from the sitemap to find unscraped pages [yes/no]log_name- base name for each scrape log (will be prepended with the timestamp of the scrape)max_urls- maximum number of url's that will be requested [all/<number>]trusted_domains- sites to be trusted when checking links, given as one domain per line
[MATOMO]
server- server url via which the metrics are requested using the Matomo APItoken- authentication token for the Matomo APIwww_id- Matomo site id of www.belastingdienst.nllog_name- name of the log file in the master directory where all Matomo related logging will be recorded
[REPORT]
reports- names of each variation of a site report, one per line; for each one a section should exist named [REPORT_<capitalised variation name>]rep_conf_json- name of the json-file with the specifications for the site reportskf_details_name- name of the xlsx-file used for organizing the key figures in the site reportdata_legend_name- name of the xlsx-file with the descriptions of columns used in the various data sheets of the reportpage_groups_name- name of the csv-file containing the export of the group details of Siteimprovelog_name- name of the log file in the master directory where all reporting related logging will be recordedpubl_dir- directory where duplicates of generated reports will be saved [full path]colour_brdr- vertical borders of all data cells [hex RRGGBB]colour_btn_brdr- button borders [hex RRGGBB]colour_btn_text- button text [hex RRGGBB]
[REPORT_<VARIATION>]
incl_fb- include textual feedback in the report [yes/no]report_name- configurable part of the report name, as in '220124-0200 - weekly<report_name>.xlsx'report_dir_name- directory within the master directory to store all reports of this variationcolour_hdr_bg- header background and release notes border [hex RRGGBB]colour_shade- background of shaded cells and release notes [hex RRGGBB]colour_btn_fill- button background [hex RRGGBB]
[UNUSED_COLOUR_PALETTES]
This section is used to save some colour palettes for report configuration. As such it does not act as configuration.
View Source
def __init__(self, conf_file: str = CONFIG_NAME, section: str = None): """ **Instantiate a configuration.** Arguments: conf_file: name of the configuration file (default `bd_www.contants.CONFIG_NAME`) section: section of the conf_file The configuration is read from the [MAIN] and specified `section` of the `conf_file` and added as attribute values of the instance. The attribute names will be equal to the parameter names of the configuration file. If the specified configuration file is not found in the current working directory, it will be read from the module directory ('bd_www'). In case a configuration value is a valid integer or boolean, the value will be cast as such. Multiline values will be converted to a list of strings. """ import configparser config = configparser.ConfigParser(inline_comment_prefixes=[';']) self._config = config if not Path(conf_file).exists(): conf_file = 'bd_www/' + conf_file config.read(conf_file) sections = ['MAIN'] if section: sections.append(section) for s in sections: self._read_section(s)
Instantiate a configuration.
Arguments:
conf_file: name of the configuration file
(default `bd_www.contants.CONFIG_NAME`)
section: section of the conf_file
The configuration is read from the [MAIN] and specified section of
the conf_file and added as attribute values of the instance. The
attribute names will be equal to the parameter names of the
configuration file.
If the specified configuration file is not found in the current working directory, it will be read from the module directory ('bd_www').
In case a configuration value is a valid integer or boolean, the value will be cast as such. Multiline values will be converted to a list of strings.
View Source
def spec_report(self, report: str) -> None: """ **Read the specification of a report variation.** Arguments: report: name of the variation Read the section of the configuration file for the specified report variation and add the parameters as typed attributes to the instance. """ self._read_section('REPORT_' + report.upper())
Read the specification of a report variation.
Arguments:
report: name of the variation
Read the section of the configuration file for the specified report variation and add the parameters as typed attributes to the instance.
View Source
class ScrapeConnection(sqlite3.Connection): """ **Connection class for scraping and reporting.** Subclass to connect to the scrapes and metrics databases. ***Instance methods:*** - `switch_to`: switch to another scrapes database - `switch_back`: switch back to the configured scrapes database ***Instance attributes:*** - `scrs_db`: path to the configured scrapes database - `mtrx_db`: path to the configured metrics database """ def __init__(self): """ **Instantiate a master connection.** Open a connection to the configured scrapes database with the configured metrics database connected as 'mtrx'. These configurations come from the `[Main]` section of the configuration file as documented with the `Config` class. """ if not mst_dir.exists(): mst_dir.mkdir() self.scrs_db = mst_dir / main_conf.scrs_db_name if not self.scrs_db.exists(): self.scrs_db.touch() self.mtrx_db = mst_dir / main_conf.mtrx_db_name if not self.mtrx_db.exists(): self.mtrx_db.touch() super().__init__(self.scrs_db, isolation_level=None) self.execute(f'ATTACH "{self.mtrx_db}" AS mtrx') def switch_to(self, alt_db: Path) -> None: """ **Switch to another scrapes database.** Arguments: alt_db: path to the alternative scrapes database Disconnect the configured scrapes database and connect to the specified alternative. The configured metrics database will be reconnected. """ self.close() super().__init__(alt_db, isolation_level=None) self.execute(f'ATTACH "{self.mtrx_db}" AS mtrx') def switch_back(self) -> None: """ **Switch back to the configured scrapes database.** Disconnect from the alternative scrapes database and connect to the configured one again. The configured metrics database will be reconnected. """ self.close() super().__init__(self.scrs_db, isolation_level=None) self.execute(f'ATTACH "{self.mtrx_db}" AS mtrx')
Connection class for scraping and reporting.
Subclass to connect to the scrapes and metrics databases.
Instance methods:
switch_to: switch to another scrapes databaseswitch_back: switch back to the configured scrapes database
Instance attributes:
scrs_db: path to the configured scrapes databasemtrx_db: path to the configured metrics database
View Source
def __init__(self): """ **Instantiate a master connection.** Open a connection to the configured scrapes database with the configured metrics database connected as 'mtrx'. These configurations come from the `[Main]` section of the configuration file as documented with the `Config` class. """ if not mst_dir.exists(): mst_dir.mkdir() self.scrs_db = mst_dir / main_conf.scrs_db_name if not self.scrs_db.exists(): self.scrs_db.touch() self.mtrx_db = mst_dir / main_conf.mtrx_db_name if not self.mtrx_db.exists(): self.mtrx_db.touch() super().__init__(self.scrs_db, isolation_level=None) self.execute(f'ATTACH "{self.mtrx_db}" AS mtrx')
Instantiate a master connection.
Open a connection to the configured scrapes database with the
configured metrics database connected as 'mtrx'. These configurations
come from the [Main] section of the configuration file as
documented with the Config class.
View Source
def switch_to(self, alt_db: Path) -> None: """ **Switch to another scrapes database.** Arguments: alt_db: path to the alternative scrapes database Disconnect the configured scrapes database and connect to the specified alternative. The configured metrics database will be reconnected. """ self.close() super().__init__(alt_db, isolation_level=None) self.execute(f'ATTACH "{self.mtrx_db}" AS mtrx')
Switch to another scrapes database.
Arguments:
alt_db: path to the alternative scrapes database
Disconnect the configured scrapes database and connect to the specified alternative. The configured metrics database will be reconnected.
View Source
def switch_back(self) -> None: """ **Switch back to the configured scrapes database.** Disconnect from the alternative scrapes database and connect to the configured one again. The configured metrics database will be reconnected. """ self.close() super().__init__(self.scrs_db, isolation_level=None) self.execute(f'ATTACH "{self.mtrx_db}" AS mtrx')
Switch back to the configured scrapes database.
Disconnect from the alternative scrapes database and connect to the configured one again. The configured metrics database will be reconnected.
Inherited Members
- sqlite3.Connection
- backup
- close
- commit
- create_aggregate
- create_collation
- create_function
- cursor
- enable_load_extension
- executemany
- executescript
- execute
- interrupt
- iterdump
- load_extension
- rollback
- set_progress_handler
- set_trace_callback
- Warning
- Error
- InterfaceError
- DatabaseError
- DataError
- OperationalError
- IntegrityError
- InternalError
- ProgrammingError
- NotSupportedError
- row_factory
- text_factory
- isolation_level
- total_changes
- in_transaction
View Source
class Scrape: """ **Context manager for a specific scrape.** Enables scrape specific access to the scrapes database. ***Instance attributes:*** - `ts`: timestamp of the selected scrape """ def __init__(self, ts: str): """ **Context for a specific scrape.** Arguments: ts: timestamp of a scrape [yymmdd-hhmm] Within the context all timestamped views (*tsd_...*) of the scrapes database are specific for the scrape that is identified by the timestamp. The views together represent a dataset that is comparable to the scrape tables (*scr_...*) after finishing a scrape. """ self.ts = ts def __enter__(self): mst_conn.executescript(f''' DELETE FROM tsd; INSERT INTO tsd VALUES ('{self.ts}') ''') def __exit__(self, exc_type, exc_val, exc_tb): mst_conn.executescript(''' DELETE FROM tsd ''')
Context manager for a specific scrape.
Enables scrape specific access to the scrapes database.
Instance attributes:
ts: timestamp of the selected scrape
View Source
def __init__(self, ts: str): """ **Context for a specific scrape.** Arguments: ts: timestamp of a scrape [yymmdd-hhmm] Within the context all timestamped views (*tsd_...*) of the scrapes database are specific for the scrape that is identified by the timestamp. The views together represent a dataset that is comparable to the scrape tables (*scr_...*) after finishing a scrape. """ self.ts = ts
Context for a specific scrape.
Arguments:
ts: timestamp of a scrape [yymmdd-hhmm]
Within the context all timestamped views (tsd_...) of the scrapes database are specific for the scrape that is identified by the timestamp. The views together represent a dataset that is comparable to the scrape tables (scr_...) after finishing a scrape.
View Source
class PageSourceZip: """ **Encapsulation class to store and retrieve scraped page sources.** ***Instance methods:*** - `add_page`: add page source - `get_page`: retrieve page source - `iter_pages`: iterable to retrieve all page sources - `page_ids`: return all page_id's ***Instance attributes:*** - `path`: path of the zip-file """ def __init__(self, ts: str): """ **Instantiate reference to a page sources zip-file.** Arguments: ts: timestamp of a scrape [yymmdd-hhmm] The zip-file is specific for the scrape with the specified timestamp. In case the zip-file does not exist, an empty one is created with the name '`<ts>`.zip'. """ src_dir = _own_home / scrape_conf.rel_mst_dir / scrape_conf.src_dir_name zip_path = src_dir / (ts + '.zip') self.path = zip_path if not zip_path.exists(): self.zip = zipfile.ZipFile(zip_path, mode='w') self.zip.close() def add_page(self, page_id: int, page_src: str) -> None: """ **Add page source.** Arguments: page_id: unique identification of the page page_src: html source of the page The page source will be added with the name '`<page_id>`.html'. An exception is raised when a page with `page_id` is already available in the zip-file. """ page_name = f'{page_id}.html' with zipfile.ZipFile(self.path, mode='a', compression=zipfile.ZIP_DEFLATED, compresslevel=9) as zf: if page_name not in zf.namelist(): zf.writestr(page_name, page_src) else: raise ValueError(f'{page_name} already stored in {self.path}') def get_page(self, page_id: int) -> str: """ **Retrieve page source.** Arguments: page_id: unique identification of the page An exception is raised when the `page_id` is not available in the zip-file. """ page_name = f'{page_id}.html' with zipfile.ZipFile(self.path) as zf: if page_name in zf.namelist(): return zf.read(page_name).decode() else: raise ValueError(f'{page_name} not available in {self.path}') def iter_pages(self) -> Iterable[tuple[int, str]]: """ **Retrieve all page sources.** Returns: successive (page_id, page source) tuples for all pages """ with zipfile.ZipFile(self.path) as zf: for page_name in zf.namelist(): yield int(page_name[:-5]), zf.read(page_name).decode() def page_ids(self) -> list[int]: """ **Return all page_id's.** Returns: list with page_id's of all page sources """ with zipfile.ZipFile(self.path) as zf: return [int(n.split('.')[0]) for n in zf.namelist()]
Encapsulation class to store and retrieve scraped page sources.
Instance methods:
add_page: add page sourceget_page: retrieve page sourceiter_pages: iterable to retrieve all page sourcespage_ids: return all page_id's
Instance attributes:
path: path of the zip-file
View Source
def __init__(self, ts: str): """ **Instantiate reference to a page sources zip-file.** Arguments: ts: timestamp of a scrape [yymmdd-hhmm] The zip-file is specific for the scrape with the specified timestamp. In case the zip-file does not exist, an empty one is created with the name '`<ts>`.zip'. """ src_dir = _own_home / scrape_conf.rel_mst_dir / scrape_conf.src_dir_name zip_path = src_dir / (ts + '.zip') self.path = zip_path if not zip_path.exists(): self.zip = zipfile.ZipFile(zip_path, mode='w') self.zip.close()
Instantiate reference to a page sources zip-file.
Arguments:
ts: timestamp of a scrape [yymmdd-hhmm]
The zip-file is specific for the scrape with the specified timestamp.
In case the zip-file does not exist, an empty one is created with the
name '<ts>.zip'.
View Source
def add_page(self, page_id: int, page_src: str) -> None: """ **Add page source.** Arguments: page_id: unique identification of the page page_src: html source of the page The page source will be added with the name '`<page_id>`.html'. An exception is raised when a page with `page_id` is already available in the zip-file. """ page_name = f'{page_id}.html' with zipfile.ZipFile(self.path, mode='a', compression=zipfile.ZIP_DEFLATED, compresslevel=9) as zf: if page_name not in zf.namelist(): zf.writestr(page_name, page_src) else: raise ValueError(f'{page_name} already stored in {self.path}')
Add page source.
Arguments:
page_id: unique identification of the page
page_src: html source of the page
The page source will be added with the name '<page_id>.html'.
An exception is raised when a page with page_id is already
available in the zip-file.
View Source
def get_page(self, page_id: int) -> str: """ **Retrieve page source.** Arguments: page_id: unique identification of the page An exception is raised when the `page_id` is not available in the zip-file. """ page_name = f'{page_id}.html' with zipfile.ZipFile(self.path) as zf: if page_name in zf.namelist(): return zf.read(page_name).decode() else: raise ValueError(f'{page_name} not available in {self.path}')
Retrieve page source.
Arguments:
page_id: unique identification of the page
An exception is raised when the page_id is not available in the
zip-file.
View Source
def iter_pages(self) -> Iterable[tuple[int, str]]: """ **Retrieve all page sources.** Returns: successive (page_id, page source) tuples for all pages """ with zipfile.ZipFile(self.path) as zf: for page_name in zf.namelist(): yield int(page_name[:-5]), zf.read(page_name).decode()
Retrieve all page sources.
Returns:
successive (page_id, page source) tuples for all pages
View Source
def page_ids(self) -> list[int]: """ **Return all page_id's.** Returns: list with page_id's of all page sources """ with zipfile.ZipFile(self.path) as zf: return [int(n.split('.')[0]) for n in zf.namelist()]
Return all page_id's.
Returns:
list with page_id's of all page sources
Configuration instance for general use, read from the [MAIN] section of the configuration file.
Configuration instance for scraping, read from the [MAIN] and [SCRAPE] section of the configuration file.
Configuration instance for requesting metrics, read from the [MAIN] and [MATOMO] section of the configuration file.
Configuration instance for reporting, read from the [MAIN] and [REPORT] section of the configuration file.
Master directory with the complete scrapes storage structure.
Connection to the scrapes and metrics databases, via which all database operations are executed. Remains open while using any of the modules and is used when specific scrape data is needed.