Sieve pipeline unit type

Note

💊 Check sieve prescription pipeline unit for a higher-level abstraction.

The next pipeline unit type triggered after pseudonym type pipeline units is called “sieve”. The main purpose of this pipeline unit is to filter out (hence “sieve”) packages that should not occur in the resulting stack. It’s called on each and every package that is resolved based on direct or transitive dependencies of the application stack supplied.

The pipeline unit of type sieve accepts a generator of resolved package-versions (see PackageVersion abstraction in thoth-python library) and decides which of these package-versions can be included in the resulting stack. The generator of package-versions supplied is sorted based on Python’s version specification starting from the latest release down to the oldest one (respecting version string, not release date). After the pipeline sieves run, the accepted package-versions are trimmed based on limit_latest_versions (if supplied to the adviser); this option reduces the state space considered. If sieves accept more than limit_latest_versions package-versions, only limit_latest_versions of them are kept. Note the issues that can arise when providing the “limit latest versions” parameter; usually this parameter is not needed.
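The ordering and trimming described above can be illustrated with a minimal, standard-library-only sketch. It is not the adviser's implementation: plain version strings stand in for PackageVersion objects, and the helpers version_key, sort_latest_first, drop_release and apply_limit are hypothetical names introduced here. The version parsing is deliberately simplified to plain numeric releases.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Optional, Tuple


def version_key(version: str) -> Tuple[int, ...]:
    """Turn a plain dotted release such as "1.15.0" into a sortable tuple.

    Simplification: pre-releases, epochs and similar need a full version
    parser; this sketch handles plain numeric releases only.
    """
    return tuple(int(part) for part in version.split("."))


def sort_latest_first(versions: Iterable[str]) -> List[str]:
    """Order by version string from the latest release down to the oldest."""
    return sorted(versions, key=version_key, reverse=True)


def drop_release(versions: Iterable[str], rejected: str) -> Iterator[str]:
    """Hypothetical sieve that rejects one release and keeps the order."""
    return (version for version in versions if version != rejected)


def apply_limit(accepted: Iterable[str], limit_latest_versions: Optional[int]) -> Iterator[str]:
    """Keep at most limit_latest_versions items; None means no limit."""
    return islice(accepted, limit_latest_versions)


# "1.10.0" sorts above "1.9.0" because ordering follows the version, not the text.
resolved = sort_latest_first(["1.9.0", "1.16.0", "1.10.0", "1.15.0"])
accepted = drop_release(resolved, rejected="1.16.0")
kept = list(apply_limit(accepted, limit_latest_versions=2))
```

Because the limit is applied after the sieves, a release rejected by a sieve does not consume one of the limit_latest_versions slots in this sketch.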

It’s guaranteed that the generator will contain package-versions in a specific (locked) version with information about the Python package index the given dependency came from (the triple “package name”, “locked package version” and “index url” uniquely identifies any Python package, see compatibility section for additional info on Python package index specific resolution). It’s also guaranteed that the generator will contain packages of the same type (same package name).

Note

Each sieve can be run multiple times during the resolution. It can be run multiple times even on packages of the same type, depending on the dependency graph resolution. An example is the package six, which is a dependency of many packages in the Python ecosystem; each of these packages can have different version range requirements on six.

Main usage

  • Filter out packages, package-versions respectively, which should not occur in the resulting software stack

    • Returning an empty list discards all the resolved versions

    • Raising exception NotAcceptable has the same effect as returning an empty list (compatibility with step pipeline unit)

  • Prematurely end resolution based on the sieve reached

    • Raising exception EagerStopPipeline stops the whole resolver run and causes the resolver to return the products computed so far

  • Remove a library from a stack even though it is stated as a dependency (directly or transitively) by raising SkipPackage
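The outcomes listed above can be sketched with a small driver. This is an illustrative assumption, not the resolver's code: the three exception classes are local stand-ins named after the adviser exceptions so that the sketch runs on its own, sieves are modelled as plain functions over version strings, and run_sieves is a hypothetical helper.

```python
from typing import Callable, Iterable, List, Optional


class NotAcceptable(Exception):
    """Local stand-in: discard all versions passed to the sieve."""


class EagerStopPipeline(Exception):
    """Local stand-in: stop the whole resolver run."""


class SkipPackage(Exception):
    """Local stand-in: remove the package from the stack."""


SieveFunc = Callable[[Iterable[str]], Iterable[str]]


def run_sieves(sieves: List[SieveFunc], versions: Iterable[str]) -> Optional[List[str]]:
    """Hypothetical driver: accepted versions, or None when the package is skipped.

    EagerStopPipeline is deliberately not caught; it propagates to the
    caller that owns the whole resolver run.
    """
    accepted: List[str] = list(versions)
    try:
        for sieve in sieves:
            # Materialize so exceptions raised inside generators surface here.
            accepted = list(sieve(accepted))
    except NotAcceptable:
        return []  # same effect as a sieve returning an empty list
    except SkipPackage:
        return None  # the package is dropped even though it is a dependency
    return accepted


def drop_bad_release(versions: Iterable[str]) -> Iterable[str]:
    return (version for version in versions if version != "1.0.1")


def reject_all(versions: Iterable[str]) -> Iterable[str]:
    raise NotAcceptable("no version is acceptable")


def not_needed(versions: Iterable[str]) -> Iterable[str]:
    raise SkipPackage("package is not needed in this environment")


def stop_now(versions: Iterable[str]) -> Iterable[str]:
    raise EagerStopPipeline("stop and return products computed so far")
```

In this sketch an empty list and None are kept distinct on purpose: the first means that no version was accepted, the second that the package should be left out of the stack.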

Note

Even if pipeline sieves discard all the versions for a certain package, the resolution can still be successful. An example is discarding the dependency tensorboard from a TensorFlow stack: tensorboard is present as a dependency only in some releases of the tensorflow package.

Real world examples

  • Filter out packages like enum34 from the resolved software stack that will not install into the given software environment (enum34 is a backport of Enum to older Python releases, so it will not be installed for Python 3.4+ if environment markers are present and applied)

  • Filter out packages that have installation issues in the requested software environment - an example can be legacy Python 2 packages that fail installation in Python 3 environments due to syntax errors in setup.py

  • Filter out packages that have runtime issues (a package installs but fails during application start - e.g. bad release)

  • Filter out Python packages that use a Python package index that is not allowed (restricted environments)

  • Filter out packages that require native packages or ABI provided by a native package that are not present in the software environment used (see Thoth’s analyses of container images that are aggregated into Thoth’s knowledge base and available for Thoth’s adviser)

  • Filter out packages that are nightly builds or pre-releases in case of STABLE recommendation type or disabled pre-releases configuration option in Pipfile

  • A library maintainer added the enum34 package as a library dependency but did not restrict the requirement to older Python versions with an environment marker such as:

    enum34>=1.0; python_version < '3.4'
    

    The resolver can skip this package based on a pipeline sieve specific to the library, which raises the SkipPackage exception if enum34 would be used with a newer Python version.
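The enum34 case above can be sketched as follows. The sketch is self-contained and rests on assumptions: SkipPackage is a local stand-in named after the adviser exception, versions are plain strings, and the target Python version is passed as an explicit argument (how a real sieve obtains it from its context is not shown here).

```python
from typing import Iterable, Iterator, Tuple


class SkipPackage(Exception):
    """Local stand-in for the adviser exception of the same name."""


def enum34_sieve(versions: Iterable[str], python_version: Tuple[int, int]) -> Iterator[str]:
    """Hypothetical sieve: skip enum34 when the target Python already ships enum."""
    if python_version >= (3, 4):
        # Mirrors the effect of the marker: python_version < '3.4'
        raise SkipPackage(
            f"enum34 is a backport, not needed on Python {python_version[0]}.{python_version[1]}"
        )

    # Older interpreters still need the backport, so keep every version.
    yield from versions
```

Since this is a generator function, the exception is raised when the result is first iterated, not when enum34_sieve is called.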

Triggering unit for a specific package

To help with scaling the recommendation engine as the number of registered pipeline units grows, it is a good practice to state which package the given unit corresponds to. To run the pipeline unit only for a specific package, state the package_name configuration option in the pipeline unit configuration. An example can be a pipeline unit specific for TensorFlow packages, which should state package_name: "tensorflow" in the pipeline configuration.

If the pipeline unit is generic for any package, the package_name configuration has to default to None.
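The effect of package_name can be pictured with a small selection sketch. It only illustrates the idea and is not how the adviser dispatches units: UnitSketch, GenericSieve, TensorFlowSieve and units_for are hypothetical names, and only the CONFIGURATION_DEFAULT attribute and the package_name option come from the text above.

```python
from typing import Any, Dict, List


class UnitSketch:
    """Hypothetical stand-in for a pipeline unit; only configuration matters here."""

    CONFIGURATION_DEFAULT: Dict[str, Any] = {"package_name": None}

    def __init__(self, **configuration: Any) -> None:
        # Explicit configuration overrides the class-level defaults.
        self.configuration = {**self.CONFIGURATION_DEFAULT, **configuration}


class GenericSieve(UnitSketch):
    """Generic unit: package_name defaults to None, so it runs for any package."""


class TensorFlowSieve(UnitSketch):
    """Unit registered only for the tensorflow package."""

    CONFIGURATION_DEFAULT: Dict[str, Any] = {"package_name": "tensorflow"}


def units_for(package_name: str, units: List[UnitSketch]) -> List[UnitSketch]:
    """Select units that are generic or registered for the given package."""
    return [
        unit
        for unit in units
        if unit.configuration["package_name"] in (None, package_name)
    ]


units = [GenericSieve(), TensorFlowSieve()]
for_tensorflow = [type(unit).__name__ for unit in units_for("tensorflow", units)]
for_six = [type(unit).__name__ for unit in units_for("six", units)]
```

With a selection like this, package-specific units are never invoked for unrelated packages, which is the scaling benefit described above.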

An example implementation

import logging
from typing import Any
from typing import Dict
from typing import Generator

from thoth.adviser import Sieve
from thoth.python import PackageVersion

_LOGGER = logging.getLogger(__name__)


class ExampleSieve(Sieve):
    """An example sieve implementation to demonstrate sieve purpose."""

    CONFIGURATION_DEFAULT: Dict[str, Any] = {"package_name": None}  # The pipeline unit is not specific to any package.

    def run(self, package_versions: Generator[PackageVersion, None, None]) -> Generator[PackageVersion, None, None]:
        """Cut off pre-releases unless the project explicitly allows them."""
        if self.context.project.prereleases_allowed:
            _LOGGER.info(
                "Project accepts pre-releases, skipping cutting pre-releases step"
            )
            # Nothing to filter: pass every package-version through exactly once.
            yield from package_versions
            return

        for package_version in package_versions:
            if package_version.semantic_version.is_prerelease:
                _LOGGER.debug(
                    "Removing package %s - pre-releases are disabled",
                    package_version.to_tuple(),
                )
                continue

            yield package_version

The implementation can also provide other methods, such as Unit.pre_run, Unit.post_run or Unit.post_run_report and pipeline unit configuration adjustment. See unit documentation for more info.