Step pipeline unit type

Another type of unit used in Thoth’s adviser is called “step”. A step can be seen as one move performed by the resolver towards a fully pinned down software stack: a package in a specific version is added to the resolver’s internal state (see Introduction to Thoth’s adviser principles for theoretical background). Each step adds one package to the resolver’s state. Once there are no more packages to add, the resulting so-called final state represents a fully pinned down software stack (as can be seen in Pipfile.lock).
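
The idea of a state growing by one package per step can be illustrated with a minimal sketch. ToyState, its unresolved_dependencies attribute and take_step are hypothetical names made up for this illustration; they only mimic the concept and are not the adviser's State implementation, which also expands the dependencies of each added package.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# (name, version, index_url), mirroring the tuples used in the example implementation.
PackageTuple = Tuple[str, str, str]


@dataclass
class ToyState:
    """Hypothetical stand-in for the resolver's internal state."""

    resolved_dependencies: Dict[str, PackageTuple] = field(default_factory=dict)
    unresolved_dependencies: List[PackageTuple] = field(default_factory=list)

    def is_final(self) -> bool:
        """A state is final once nothing is left to resolve."""
        return not self.unresolved_dependencies


def take_step(state: ToyState) -> None:
    """Move exactly one package-version from unresolved to resolved."""
    package = state.unresolved_dependencies.pop(0)
    state.resolved_dependencies[package[0]] = package


state = ToyState(
    unresolved_dependencies=[
        ("tensorflow", "1.9.0", "https://pypi.org/simple"),
        ("numpy", "1.16.0", "https://pypi.org/simple"),
    ]
)

steps_taken = 0
while not state.is_final():
    take_step(state)
    steps_taken += 1
```

After the loop, the toy state holds one pinned version per package, which is the shape of information a lock file records.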

Warning

The resolver logic manipulates states. A step pipeline unit implementation must NOT adjust state attributes except for the stack information. Adjusting the beam is also not allowed. If a step implementation adjusts the state or the beam, the behaviour is undefined.

The pipeline step is triggered after the boot, pseudonym and sieve pipeline unit types and is used to score a package and to decide whether it can be added to the resolver’s internal state. In contrast to sieves, a step has full knowledge of the package-versions present in the not yet fully resolved software stack (the resolver’s internal state), so steps can judge whether the given package should be added to the state based on packages already present (see the Real world examples section below for examples).

Note

Step pipeline units can be called even though the package that is about to be added to a state is already present in the state. This can happen if multiple packages introduce the same dependency. An example is a pipeline step run when adding tensorflow to a state based on a requirement of keras while tensorflow is already present in the state because it was introduced as a dependency of Seldon (another example is the package six, which can be introduced by many Python packages in the software stack).

Note that this behaviour is turned off by default. If the pipeline step requires such calls, the step implementation should set step_instance.configuration["multi_package_resolution"] to True in derived classes implementing the step logic. This is usually accomplished using the default configuration (if the unit should not behave differently based on the should_include logic). The default option can be set using Step.CONFIGURATION_DEFAULT["multi_package_resolution"] = True.
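
The effect of the multi_package_resolution option can be sketched as follows. ToyStep and steps_to_run are hypothetical stand-ins written for this illustration; only the configuration key and the CONFIGURATION_DEFAULT attribute name come from the text above, and the actual selection logic inside the resolver may differ.

```python
from typing import Any, Dict, List, Tuple


class ToyStep:
    """Hypothetical stand-in for the Step base class."""

    CONFIGURATION_DEFAULT: Dict[str, Any] = {
        "package_name": None,
        "multi_package_resolution": False,  # off by default
    }

    def __init__(self) -> None:
        # Each instance works with its own copy of the class defaults.
        self.configuration = dict(self.CONFIGURATION_DEFAULT)


class AlwaysCalledStep(ToyStep):
    """Opts in to being called for packages already present in the state."""

    CONFIGURATION_DEFAULT: Dict[str, Any] = {
        "package_name": None,
        "multi_package_resolution": True,
    }


def steps_to_run(
    steps: List[ToyStep],
    package_name: str,
    resolved: Dict[str, Tuple[str, str, str]],
) -> List[ToyStep]:
    """Select steps; units that did not opt in are skipped for known packages."""
    already_present = package_name in resolved
    return [
        step
        for step in steps
        if not already_present or step.configuration["multi_package_resolution"]
    ]


steps = [ToyStep(), AlwaysCalledStep()]
resolved = {"tensorflow": ("tensorflow", "2.0.0", "https://pypi.org/simple")}

# tensorflow is already in the state (for example introduced by another package).
for_tensorflow = steps_to_run(steps, "tensorflow", resolved)
# numpy is new to the state, so every step is called.
for_numpy = steps_to_run(steps, "numpy", resolved)
```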

Main usage

  • Decide whether the given package should be added to the resolver state

    • Raising the NotAcceptable exception prevents the package-version from being added to the state in the resolver

  • Score the presence of a package in the software stack positively or negatively

    • Each pipeline step can return a tuple consisting of a float and a list of dictionaries

      • the float represents the score adjustment of the state

      • the list of dictionaries carries “justification” on why the given package-version was scored the way it was scored - this justification is shown to the user

  • Prematurely end resolution based on the step reached

    • Raising the EagerStopPipeline exception stops the whole resolver run and causes the resolver to return the products computed so far

  • Remove a library from a stack even though it is stated as a dependency (directly or transitively) by raising SkipPackage during the resolution process
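
The outcomes listed above can be summarized in a small sketch of how a caller might react to a step. The exception classes below are local stand-ins that only share their names with the adviser's exceptions, run_steps is a hypothetical helper, and the "message" key in the justification is an assumed shape; the real resolver handles these cases internally and in more detail.

```python
from typing import Callable, Dict, List, Optional, Tuple

StepResult = Optional[Tuple[Optional[float], Optional[List[Dict[str, str]]]]]


class NotAcceptable(Exception):
    """Local stand-in: the package-version must not be added."""


class SkipPackage(Exception):
    """Local stand-in: leave the package out of the stack."""


class EagerStopPipeline(Exception):
    """Local stand-in: stop the whole resolver run."""


def run_steps(
    steps: List[Callable[[], StepResult]],
) -> Tuple[str, float, List[Dict[str, str]]]:
    """Run all steps for one candidate package and summarize the outcome."""
    score = 0.0
    justification: List[Dict[str, str]] = []
    for step in steps:
        try:
            result = step()
        except NotAcceptable:
            return "rejected", 0.0, []
        except SkipPackage:
            return "skipped", 0.0, []
        except EagerStopPipeline:
            return "stopped", score, justification
        if result is None:
            continue  # the step has no opinion on this package
        score_delta, step_justification = result
        score += score_delta or 0.0
        justification.extend(step_justification or [])
    return "accepted", score, justification


def penalize() -> StepResult:
    return -0.5, [{"message": "package has a known performance issue"}]


def reject() -> StepResult:
    raise NotAcceptable("incompatible with a package already in the state")


outcome_accepted = run_steps([lambda: None, penalize])
outcome_rejected = run_steps([penalize, reject])
```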

Real world examples

  • Some releases of tensorflow do not work with some numpy versions - the resolver could otherwise add a numpy version to a software stack whose tensorflow is incompatible with it (the version range specification allows the combination, but tensorflow maintainers did not test the given numpy release with the issued tensorflow release)

    • A step implementing this observation can simply raise the NotAcceptable exception, which prevents such issues in the resolved software stack as these two will never be resolved together

  • Packages that have security vulnerabilities (CVE) can be penalized during the resolution so that they do not occur in the resolved software stack, unless there is no better candidate based on scoring in other pipeline steps

  • Prevent adding scipy to a stack with TensorFlow>2.1,<=2.3 unless scipy is introduced explicitly in the stack. It is not needed there (the dependency was introduced accidentally).
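
The CVE example can be sketched as a scoring function that penalizes instead of rejecting, so a vulnerable release is still chosen when nothing better exists. Everything in this sketch is an assumption made for illustration: the vulnerability table holds placeholder data rather than real advisories, the penalty value is arbitrary, and cve_step and best_candidate are hypothetical helpers rather than adviser APIs.

```python
from typing import Dict, List, Optional, Tuple

PackageTuple = Tuple[str, str, str]

# Placeholder vulnerability data, not real advisories.
KNOWN_VULNERABLE: Dict[PackageTuple, List[str]] = {
    ("example-lib", "1.0.0", "https://pypi.org/simple"): ["CVE-PLACEHOLDER-1"],
}
CVE_PENALIZATION = -0.2  # hypothetical penalty applied per CVE


def cve_step(
    package_tuple: PackageTuple,
) -> Optional[Tuple[float, List[Dict[str, str]]]]:
    """Return a score adjustment and justification for vulnerable releases."""
    cves = KNOWN_VULNERABLE.get(package_tuple)
    if not cves:
        return None
    justification = [
        {"message": f"{package_tuple[0]} {package_tuple[1]} is affected by {cve}"}
        for cve in cves
    ]
    return CVE_PENALIZATION * len(cves), justification


def best_candidate(candidates: List[PackageTuple]) -> PackageTuple:
    """Pick the candidate with the highest score; unscored ones count as 0.0."""

    def score(candidate: PackageTuple) -> float:
        result = cve_step(candidate)
        return result[0] if result else 0.0

    return max(candidates, key=score)


vulnerable = ("example-lib", "1.0.0", "https://pypi.org/simple")
patched = ("example-lib", "1.0.1", "https://pypi.org/simple")

preferred = best_candidate([vulnerable, patched])
only_option = best_candidate([vulnerable])
```

Here the patched release wins when both are available, while the vulnerable release is still returned when it is the only candidate.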

Triggering unit for a specific package

To help with scaling the recommendation engine as the number of registered pipeline units grows, it is good practice to state which package the given unit corresponds to. To run the pipeline unit only for a specific package, state the package_name configuration option in the pipeline unit configuration. For example, a pipeline unit specific to TensorFlow packages should state package_name: "tensorflow" in the pipeline configuration.

If the pipeline unit is generic for any package, the package_name configuration has to default to None.
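
One way to picture why package_name helps with scaling is an index from package name to units, so that only the relevant units are looked up for each package. This is a sketch under assumptions: index_units, units_for and the unit names are hypothetical, and the order in which the real resolver runs specific and generic units is not taken from this document.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def index_units(units: Dict[str, Optional[str]]) -> Dict[Optional[str], List[str]]:
    """Group unit names by the package_name they declare (None means generic)."""
    index: Dict[Optional[str], List[str]] = defaultdict(list)
    for unit_name, package_name in units.items():
        index[package_name].append(unit_name)
    return dict(index)


def units_for(index: Dict[Optional[str], List[str]], package_name: str) -> List[str]:
    """Units specific to the package are selected together with generic ones."""
    return index.get(package_name, []) + index.get(None, [])


# Hypothetical registered units mapped to their package_name configuration.
registered = {
    "TensorFlowStep": "tensorflow",
    "NumpyStep": "numpy",
    "GenericStep": None,
}
index = index_units(registered)

tensorflow_units = units_for(index, "tensorflow")
flask_units = units_for(index, "flask")
```

With such an index, the lookup cost per package depends on the number of units relevant to that package rather than on the total number of registered units.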

An example implementation

from typing import Any
from typing import Dict
from typing import List
from typing import Optional
from typing import Tuple

from thoth.adviser.exceptions import NotAcceptable
from thoth.adviser import State
from thoth.adviser import Step
from thoth.python import PackageVersion


class StepExample(Step):
    """Filter out numpy causing issues in upstream TensorFlow==1.9.0."""

    # This pipeline unit is specific for "numpy".
    CONFIGURATION_DEFAULT: Dict[str, Any] = {"package_name": "numpy", "multi_package_resolution": False}

    def run(self, state: State, package_version: PackageVersion) -> Optional[Tuple[Optional[float], Optional[List[Dict[str, str]]]]]:
        """The main entry-point for step implementation demonstration."""
        if state.resolved_dependencies.get("tensorflow") != ("tensorflow", "1.9.0", "https://pypi.org/simple"):
            # Accept any other state change.
            return None

        package_version_tuple = package_version.to_tuple()
        if package_version_tuple == ("numpy", "1.17.0", "https://pypi.org/simple"):
            raise NotAcceptable(
                f"Package {package_version_tuple!r} has known issues with upstream tensorflow in version 1.9.0 due to API incompatibility"
            )

        # No score adjustment and no justification for any other numpy release.
        return None

The implementation can also provide other methods, such as Unit.pre_run, Unit.post_run or Unit.post_run_report, as well as pipeline unit configuration adjustment. See the unit documentation for more info.