Sieve pipeline unit type¶
Note
💊 Check sieve prescription pipeline unit for a higher-level abstraction.
The next pipeline unit type triggered after pseudonym type pipeline
units is called “sieve”. The main purpose of this pipeline unit is to
filter out (hence “sieve”) packages that should not occur in the resulting
stack. It’s called on each and every package that is resolved based on direct
or transitive dependencies of the application stack supplied.
The pipeline unit of type sieve accepts a generator of resolved
package-versions (see the PackageVersion abstraction in the thoth-python
library) and decides which of these package versions can be included in the
resulting stack. The generator of package-versions supplied is sorted based on
Python’s version specification starting from the latest release down to the
oldest one (respecting version string, not release date). The list will be
shrunk based on limit_latest_versions (if supplied to the adviser) after the
pipeline sieves run; this option reduces the state space considered. If sieves
accept more than limit_latest_versions package versions, the result is
truncated to limit_latest_versions entries. Note the issues that can arise
from providing the “limit latest versions” parameter; usually this parameter
is not needed.
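The truncation described above can be illustrated with a minimal sketch. It does not use the adviser itself: the tuple layout and the limit_latest helper are hypothetical stand-ins, and the only assumption carried over from the text is that versions arrive sorted from the latest release to the oldest one.

```python
from itertools import islice
from typing import Iterable, Iterator, Optional, Tuple

# Hypothetical stand-in for a resolved package-version: (name, version, index_url).
PackageTuple = Tuple[str, str, str]


def limit_latest(
    accepted: Iterable[PackageTuple], limit_latest_versions: Optional[int]
) -> Iterator[PackageTuple]:
    """Keep at most ``limit_latest_versions`` items from an already sorted stream."""
    if limit_latest_versions is None:
        # No limit supplied: everything the sieves accepted is kept.
        return iter(accepted)
    # The stream is sorted latest-first, so taking a prefix keeps the newest releases.
    return islice(accepted, limit_latest_versions)


# Versions accepted by sieves, sorted from the latest release down to the oldest one.
accepted_by_sieves = [
    ("six", "1.16.0", "https://pypi.org/simple"),
    ("six", "1.15.0", "https://pypi.org/simple"),
    ("six", "1.14.0", "https://pypi.org/simple"),
]
kept = list(limit_latest(accepted_by_sieves, 2))
```

With a limit of 2, only the two newest versions remain in kept; passing None leaves the accepted versions untouched.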
It’s guaranteed that the list will contain package-versions in a specific (locked) version, together with information about the Python package index the given dependency came from (the triple “package name”, “locked package version” and “index url” uniquely identifies any Python package, see compatibility section for additional info on Python package index specific resolution). It’s also guaranteed that the generator will contain packages of the same type (same package name).
Note
Each sieve can be run multiple times during the resolution. It can be run
multiple times even on packages of the same type, depending on dependency
graph resolution. An example is the package six, which is a dependency of many
packages in the Python ecosystem, and each of them can have different version
range requirements on six.
Main usage¶
Filter out packages, or rather package-versions, which should not occur in the resulting software stack
Returning an empty list discards all the resolved versions
Raising the NotAcceptable exception has the same effect as returning an empty list (compatibility with step pipeline unit)
Prematurely end resolution based on the sieve reached
Raising the EagerStopPipeline exception stops the whole resolver run and causes the resolver to return products computed so far
Remove a library from a stack even though it is stated as a dependency (directly or transitively) by raising SkipPackage
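The outcomes listed above can be sketched without the adviser installed. The exception classes below are local stand-ins that only borrow the names NotAcceptable, SkipPackage and EagerStopPipeline, and apply_sieve is a hypothetical driver. It is an illustration of how a caller might react to each outcome, not the resolver's actual implementation.

```python
from typing import Callable, Iterable, Iterator, List


# Local stand-ins for the adviser exceptions named above, so the sketch is self-contained.
class NotAcceptable(Exception):
    """All versions of the package are rejected."""


class SkipPackage(Exception):
    """The package is dropped from the stack entirely."""


class EagerStopPipeline(Exception):
    """The whole resolver run should stop."""


def apply_sieve(
    sieve: Callable[[Iterable[str]], Iterator[str]], versions: List[str]
) -> str:
    """Hypothetical driver: run one sieve and describe the outcome."""
    try:
        accepted = list(sieve(versions))
    except NotAcceptable:
        accepted = []  # Same effect as the sieve yielding nothing.
    except SkipPackage:
        return "package skipped"
    except EagerStopPipeline:
        return "resolution stopped"
    return f"accepted {accepted}" if accepted else "all versions discarded"


def drop_release_candidates(versions: Iterable[str]) -> Iterator[str]:
    """Toy sieve: filter out versions that look like release candidates."""
    for version in versions:
        if "rc" not in version:
            yield version


def reject_all(versions: Iterable[str]) -> Iterator[str]:
    """Toy sieve: reject every version by raising NotAcceptable."""
    raise NotAcceptable("no version is acceptable")


def skip(versions: Iterable[str]) -> Iterator[str]:
    """Toy sieve: remove the package from the stack."""
    raise SkipPackage("package is not needed")


def stop(versions: Iterable[str]) -> Iterator[str]:
    """Toy sieve: end the resolution early."""
    raise EagerStopPipeline("enough products computed")


outcomes = [
    apply_sieve(sieve, ["2.0.0rc1", "1.9.0"])
    for sieve in (drop_release_candidates, reject_all, skip, stop)
]
```

The four entries in outcomes correspond, in order, to filtering, discarding every version, skipping the package and stopping the run.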
Note
Even if pipeline sieves discard all the versions for a certain package, the
resolution can still be successful. An example is discarding the dependency
tensorboard from a TensorFlow stack. tensorboard is present as a dependency
only in some releases of the tensorflow package.
Real world examples¶
Filter out packages like enum34 from the resolved software stack that will not install into the given software environment (enum34 is a backport of Enum to older Python releases, so it will not be installed on Python 3.4+ if environment markers are present and applied)
Filter out packages that have installation issues in the requested software environment - an example is legacy Python 2 packages that fail to install in Python 3 environments due to syntax errors in setup.py
Filter out packages that have runtime issues (a package installs but fails during application start - e.g. bad release)
Filter out Python packages that use a Python package index that is not allowed (restricted environments)
Filter out packages that require native packages or ABI provided by a native package that are not present in the software environment used (see Thoth’s analyses of container images that are aggregated into Thoth’s knowledge base and available for Thoth’s adviser)
Filter out packages that are nightly builds or pre-releases in case of the STABLE recommendation type or a disabled pre-releases configuration option in Pipfile
A library maintainer added the enum34 package as a library dependency but did not restrict the requirement to older Python versions with an environment marker such as:
enum34>=1.0; python_version < '3.4'
The resolver can skip this package based on a pipeline sieve specific to the library, which would raise the SkipPackage exception if enum34 were to be used with a newer Python version.
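A minimal sketch of such a sieve follows. It is written as a plain generator function rather than a Sieve subclass so that it runs on its own. The SkipPackage class is a local stand-in for the adviser exception, and passing python_version as an argument is an assumption made for illustration. A real sieve would obtain the target Python version from the resolver context.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical stand-in for a resolved package-version: (name, version, index_url).
PackageTuple = Tuple[str, str, str]


class SkipPackage(Exception):
    """Local stand-in for the adviser's SkipPackage exception."""


def enum34_backport_sieve(
    package_versions: Iterable[PackageTuple],
    python_version: Tuple[int, int],
) -> Iterator[PackageTuple]:
    """Skip enum34 when the target interpreter already ships the enum module."""
    if python_version >= (3, 4):
        # Mirrors the missing marker: enum34>=1.0; python_version < '3.4'
        raise SkipPackage(
            f"enum34 is a backport, not needed on Python "
            f"{python_version[0]}.{python_version[1]}"
        )
    # Older interpreters still need the backport, so every version passes through.
    yield from package_versions


candidates = [("enum34", "1.1.10", "https://pypi.org/simple")]
kept_on_old_python = list(enum34_backport_sieve(candidates, (3, 3)))
```

Because the function is a generator, the exception is raised when the caller starts iterating, not when the function is called.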
Triggering unit for a specific package¶
To help with scaling the recommendation engine when it comes to the number of
pipeline units possibly registered, it is a good practice to state which
package the given unit corresponds to. To run the pipeline unit for a specific
package, this fact should be reflected in the pipeline unit configuration by
stating the package_name configuration option. An example is a pipeline unit
specific to TensorFlow packages, which should state package_name: "tensorflow"
in the pipeline configuration.
If the pipeline unit is generic for any package, the package_name
configuration has to default to None.
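The effect of package_name can be sketched as a simple selection step. UnitSketch and units_for are hypothetical names introduced for illustration. They only model the rule stated above (None means generic, otherwise the unit applies to the named package) and do not reflect how the adviser dispatches units internally.

```python
from typing import Any, Dict, List, Optional


class UnitSketch:
    """Hypothetical stand-in for a pipeline unit carrying a configuration dict."""

    def __init__(self, name: str, package_name: Optional[str] = None) -> None:
        self.name = name
        # package_name defaults to None, meaning the unit is generic.
        self.configuration: Dict[str, Any] = {"package_name": package_name}


def units_for(package: str, units: List[UnitSketch]) -> List[str]:
    """Select units that are generic or registered for the given package."""
    return [
        unit.name
        for unit in units
        if unit.configuration["package_name"] in (None, package)
    ]


units = [
    UnitSketch("GenericSieve"),
    UnitSketch("TensorFlowSieve", package_name="tensorflow"),
]
for_tensorflow = units_for("tensorflow", units)
for_six = units_for("six", units)
```

Only the generic unit is selected for six, while both units are selected for tensorflow, which is why package-specific units keep the work per package small as the number of registered units grows.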
Justifications in the recommended software stacks¶
An example implementation¶
import logging
from typing import Any
from typing import Dict
from typing import Generator

from thoth.python import PackageVersion
from thoth.adviser import Sieve

_LOGGER = logging.getLogger(__name__)


class ExampleSieve(Sieve):
    """An example sieve implementation to demonstrate sieve purpose."""

    # The pipeline unit is not specific to any package.
    CONFIGURATION_DEFAULT: Dict[str, Any] = {"package_name": None}

    def run(self, package_versions: Generator[PackageVersion, None, None]) -> Generator[PackageVersion, None, None]:
        for package_version in package_versions:
            if self.context.project.prereleases_allowed:
                _LOGGER.info(
                    "Project accepts pre-releases, skipping cutting pre-releases step"
                )
                yield package_version
                # Move on so the same package version is not yielded twice below.
                continue

            if package_version.semantic_version.is_prerelease:
                _LOGGER.debug(
                    "Removing package %s - pre-releases are disabled",
                    package_version.to_tuple(),
                )
                continue

            yield package_version
The implementation can also provide other methods, such as Unit.pre_run,
Unit.post_run or Unit.post_run_report, and pipeline unit configuration
adjustment. See unit documentation for more info.