A pipeline unit

Note

This documentation section discusses the Python unit interface used in the resolver. In most cases you can use a higher-level abstraction in the form of prescriptions.

All units are derived from Unit, which provides a common base for implemented units of any type. The base class also provides access to the input pipeline vectors and other properties through the context abstraction. See the pipeline section as a prerequisite for this pipeline unit documentation.

Note that units are instantiated once during pipeline creation - the instances are kept for the whole stack generation pipeline run.

Registering a pipeline unit to the pipeline

If the pipeline configuration is not explicitly supplied, Thoth’s adviser creates the pipeline configuration dynamically (this is always the case in a deployment). The creation is done in a loop in which each pipeline unit class of each type (boot, pseudonym, sieve, step, stride and wrap) is asked for inclusion in the pipeline configuration - each pipeline unit implementation is responsible for providing the logic that states when the given pipeline unit should be registered into the pipeline.

Pipeline builder building the pipeline configuration.

This logic is implemented as part of the Unit.should_include class method:

from typing import Any
from typing import Dict
from typing import Generator

from thoth.adviser import PipelineBuilderContext

@classmethod
def should_include(cls, builder_context: "PipelineBuilderContext") -> Generator[Dict[str, Any], None, None]:
    """Check if the given pipeline unit should be included into the pipeline."""
    if builder_context.is_adviser_pipeline() and not builder_context.is_included(cls):
        # Register the unit once, with the stated configuration applied on top of the default.
        yield {"configuration1": 0.33}
        return None

    # Do not register the unit at all.
    yield from ()
    return None

The Unit.should_include class method returns a generator that yields a configuration for each instance of the pipeline unit that should be registered into the pipeline configuration. The configuration is a dictionary stating the configuration that should be applied to the pipeline unit instance (an empty dictionary if no configuration changes are applied on top of the default configuration but the pipeline unit should still be included in the pipeline configuration). The default configuration is provided as a dictionary available in the Unit.CONFIGURATION_DEFAULT class attribute. See the unit configuration section.

For prescription pipeline units, the should_include directive maps to the should_include class method discussed above.

The pipeline configuration creation is done in multiple rounds, so PipelineBuilderContext, besides other properties and routines, also provides the PipelineBuilderContext.is_included method that checks if the given unit type is already present in the pipeline configuration. A pipeline unit can thus become part of the pipeline configuration multiple times based on requirements, as the sketch below demonstrates. See PipelineBuilderContext for more information.
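As a minimal sketch, a unit can yield several configurations from should_include to be registered multiple times, once per configuration (the package names below are purely illustrative):

from typing import Any
from typing import Dict
from typing import Generator

from thoth.adviser import PipelineBuilderContext

@classmethod
def should_include(cls, builder_context: "PipelineBuilderContext") -> Generator[Dict[str, Any], None, None]:
    """Register this unit twice, once for each package of interest."""
    if builder_context.is_adviser_pipeline() and not builder_context.is_included(cls):
        yield {"package_name": "tensorflow"}
        yield {"package_name": "numpy"}
        return None

    yield from ()
    return None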

Unit configuration

Each unit can have an instance-specific configuration. The default configuration can be supplied using the Unit.CONFIGURATION_DEFAULT class attribute in the derived pipeline unit type. Optionally, a configuration schema can be defined by providing Unit.CONFIGURATION_SCHEMA in the derived pipeline unit type - this schema is used to verify unit configuration correctness on unit instantiation.
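A minimal sketch of both attributes on a derived unit follows; the unit name and the configuration keys are hypothetical, and the schema is written with voluptuous, assuming that is the schema library used for Unit.CONFIGURATION_SCHEMA:

from typing import Any
from typing import Dict
from typing import Generator

from voluptuous import Any as SchemaAny
from voluptuous import Required
from voluptuous import Schema

from thoth.adviser.sieve import Sieve
from thoth.python import PackageVersion


class ExampleSieve(Sieve):
    """A hypothetical sieve demonstrating unit configuration."""

    CONFIGURATION_DEFAULT: Dict[str, Any] = {"package_name": None, "threshold": 0.5}
    CONFIGURATION_SCHEMA: Schema = Schema(
        {
            Required("package_name"): SchemaAny(str, None),
            Required("threshold"): float,
        }
    )

    def run(
        self, package_versions: Generator[PackageVersion, None, None]
    ) -> Generator[PackageVersion, None, None]:
        # This illustrative unit filters nothing.
        yield from package_versions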

Note that units provide a “package_name” configuration in the unit configuration to state which package they operate on (this option can be mandatory for some of the units, such as pseudonyms). This configuration is used internally by the resolver to optimize calls to pipeline units. A None value lets pipeline units work on any package. See the unit-specific documentation for more info.

The pipeline unit configuration is then accessible via the Unit.configuration property on a unit instance, which returns a configuration dictionary - the default configuration updated with the one yielded by the Unit.should_include class method during pipeline unit registration.
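For the hypothetical ExampleSieve above, if should_include yields {"threshold": 0.9}, the registered unit instance then sees the merged result:

unit.configuration  # {'package_name': None, 'threshold': 0.9}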

Debugging a unit run in cluster

Adviser constructs the resolution pipeline dynamically on each request and runs units during the resolution. If you wish to check whether a unit was registered to the resolution pipeline and run, you can run the adviser in debug mode by providing the --debug flag to the thamos advise command. The adviser will then run in a much more verbose mode and will report the pipeline configuration and all the actions taken during the resolution.

Note that running adviser in debug mode adds additional overhead to the recommendation engine and slows it down. Results computed for two identical requests, where one is run in debug mode, might (and most often will) differ, as the resolver will not be able to explore the same state space given the time constraints in the recommendation engine. Nevertheless, the debug mode gives additional hints on pipeline configuration construction and the actions taken that might be helpful in many cases.

If you wish to avoid the overhead just described, it might be a good idea to experiment with requirements (and possibly constraints as well) to narrow down to the issue you want to debug. An example is a failure when adviser was not able to find a resolution satisfying the requirements. In such a case, it might be good to generate a lock file with the expected pinned set of packages using other tools (e.g. Pipenv, pip-tools) and submit the lock file to the recommender system. The logs produced during the resolution and the stack-level justifications might give hints why the given resolution was rejected.

Additional pipeline unit methods

All pipeline unit types can implement the following methods that are triggered in the described events:

  • Unit.pre_run - called before running any pipeline unit with context already assigned

  • Unit.post_run - called after the resolution is finished

  • Unit.post_run_report - a post-run method called after the resolution has finished; this method is called only when resolving with a report

Note that the “post-run” methods are called in reverse order to pre_run: the very first pipeline unit whose pre_run method was called is notified last after the pipeline finishes, in its respective post-run method implementation.
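A minimal sketch of a unit overriding these hooks (the unit name and log messages are illustrative); calling super() keeps the base class bookkeeping intact:

import logging

from thoth.adviser.boot import Boot

_LOGGER = logging.getLogger(__name__)


class ExampleBoot(Boot):
    """A hypothetical boot demonstrating the pre-run and post-run hooks."""

    def pre_run(self) -> None:
        """Prepare for the resolver run; self.context is already assigned here."""
        _LOGGER.debug("Preparing %r", self.__class__.__name__)
        super().pre_run()

    def run(self) -> None:
        """Boots run once at the beginning of the resolution."""

    def post_run(self) -> None:
        """Clean up once the resolution has finished."""
        _LOGGER.debug("Finished %r", self.__class__.__name__)
        super().post_run()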

Pipeline unit module implementation placement

To enable scaling adviser to cover specific nuances and to keep the adviser implementation clean, follow the already established structure for pipeline units.

If a pipeline unit is specific to a package, place it in a module named after the package. An example is the tf_21_urllib3 module implementing the thoth.adviser.steps.tensorflow.tf_21_urllib3.TensorFlow21Urllib3Step step. As this unit is a type of “step”, it is placed in thoth.adviser.steps; subsequently, thoth.adviser.steps.tensorflow states this step is specific to the TensorFlow package.

All pipeline units specific to the Python interpreter should go to the python module under the respective pipeline unit type module (e.g. thoth.adviser.wraps.python for Python interpreter specific wraps).

Any other units that are generic enough should be placed inside the top-level module for the pipeline unit type (e.g. inside thoth.adviser.sieves for sieve units not specific to any Python interpreter or any Python package).

Units used for debugging are an exception; they should go to the _debug module of the respective pipeline unit type module.

Afterword for pipeline units

All units can raise thoth.adviser.exceptions.EagerStopPipeline to immediately terminate resolving, causing the resolver to report back all the products computed so far.

Pipeline units of type sieve and step can also raise NotAcceptable, signalizing that the given package or resolver step is not acceptable (for steps, this corresponds to the “not-acceptable” action taken in the Markov decision process). See the sieves and steps sections for more info.

Pipeline units of type sieve and step can also raise SkipPackage to exclude the given package from the application stack completely (see the sketch below). See the sieves and steps sections for more info.

Raising any other exception in pipeline units causes undefined behavior.
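A minimal sketch of a sieve raising SkipPackage (the package name and the reasoning are illustrative):

from typing import Generator

from thoth.adviser.exceptions import SkipPackage
from thoth.adviser.sieve import Sieve
from thoth.python import PackageVersion


class SkipEnum34Sieve(Sieve):
    """A hypothetical sieve excluding a package from the application stack completely."""

    CONFIGURATION_DEFAULT = {"package_name": "enum34"}

    def run(
        self, package_versions: Generator[PackageVersion, None, None]
    ) -> Generator[PackageVersion, None, None]:
        # enum34 backports the standard library enum module, not needed on modern Python.
        raise SkipPackage("enum34 is a backport of the standard library enum module")
        yield from package_versions  # unreachable, keeps this method a generator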

All pipeline units should be atomic pieces: they should do one thing and do it well. They were designed to be small pieces forming a complex resolution system.

Unit placement in a pipeline

The pipeline configuration (which pipeline units are included and in what configuration) is determined dynamically on each adviser start. This enables constructing the pipeline based on the input vector (e.g. packages used, Python indexes configured, library usage, recommendation type and so on). Pipeline units are asked for registration in rounds until the pipeline configuration stops changing, meaning no more pipeline units request registration. This loop respects the __all__ listing of the respective thoth.adviser.boots, thoth.adviser.pseudonyms, thoth.adviser.sieves, thoth.adviser.strides, thoth.adviser.steps and thoth.adviser.wraps modules.

It’s good to note how pipeline units should be listed in __all__:

  1. If a pipeline unit Foo depends on another pipeline unit, say Bar, the pipeline unit Foo should be stated before Bar in the __all__ listing.

  2. It’s a good practice to place pipeline units that remove/filter packages from an application stack before pipeline units that perform other tasks (e.g. scoring, adding package information, …). As packages are filtered out early, the code of other units runs fewer times, making the pipeline run more efficient.

  3. If a pipeline unit Foo is less expensive than another pipeline unit, say Bar, the pipeline unit Foo should be stated before Bar in the __all__ listing.

An example of a pipeline unit that is considered expensive is a pipeline unit that performs a knowledge graph query.
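A hypothetical __all__ listing following these rules (all unit and module names below are made up):

# thoth/adviser/sieves/__init__.py - illustrative ordering only
from ._filter_index import FilterIndexSieve
from ._knowledge_graph import KnowledgeGraphSieve

__all__ = [
    "FilterIndexSieve",     # cheap, filters packages early
    "KnowledgeGraphSieve",  # expensive, performs a knowledge graph query
]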

Which pipeline unit type should be chosen?

Sometimes it might be tricky to select the right pipeline unit. Multiple unit types were designed to provide a framework in which resolver units can be written easily. These unit types have different overheads and are designed for specific use cases. It’s crucial to select the right pipeline unit type for each use case to keep the pipeline performing well.

The most expensive pipeline units are steps. They are run each time a package is about to be added to the resolver’s internal state. As the most expensive units, they also provide the most information to a pipeline unit developer - which package in which specific version is about to be added to a partially resolved state and what the resolver state looks like. These units are the only ones that can affect the final state score. Make sure these units declare the package they correspond to if they are specific to a package (the package_name configuration) - this enables an optimization that performs the unit call only when the given unit should be called.
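As a sketch of a step adjusting the score (the score value and the justification entry are illustrative; see the steps section for the exact run interface):

from typing import Dict
from typing import List
from typing import Optional
from typing import Tuple

from thoth.adviser.state import State
from thoth.adviser.step import Step
from thoth.python import PackageVersion


class ScoreExampleStep(Step):
    """A hypothetical step specific to one package."""

    # Thanks to package_name, the resolver calls this unit only for tensorflow.
    CONFIGURATION_DEFAULT = {"package_name": "tensorflow"}

    def run(
        self, state: State, package_version: PackageVersion
    ) -> Optional[Tuple[float, List[Dict[str, str]]]]:
        # Slightly prefer this package and justify the score adjustment.
        return 0.1, [{"type": "INFO", "message": "Preferred by an illustrative heuristic"}]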

The second most expensive pipeline units are sieves. They are called each time packages in specific versions are considered for further resolution. As the name suggests, these units filter out packages that should not occur in the final software stack. Unlike steps, they do not provide access to the resolver’s internal state (states are created out of the packages that were not filtered out by sieves).

The third most expensive units are pseudonyms. They can provide “pseudonyms” - alternative packages published under a different name, alternative versions that can be used, or both.
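A sketch of a pseudonym yielding an alternative package (the package and index URL are illustrative; see the pseudonyms section for the exact run interface):

from typing import Generator
from typing import Tuple

from thoth.adviser.pseudonym import Pseudonym
from thoth.python import PackageVersion


class TensorFlowPseudonym(Pseudonym):
    """A hypothetical pseudonym offering an alternative TensorFlow build."""

    CONFIGURATION_DEFAULT = {"package_name": "tensorflow"}

    def run(self, package_version: PackageVersion) -> Generator[Tuple[str, str, str], None, None]:
        # Offer intel-tensorflow in the same version as an alternative.
        yield "intel-tensorflow", package_version.locked_version, "https://pypi.org/simple"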

The fourth most expensive pipeline units are strides. They are called on each fully resolved state that eventually forms the recommended software stack (hence becomes a final state).

The cheapest pipeline units are boots and wraps. Boot pipeline units were designed to prepare the resolver, the input vector coming to the resolver, or the pipeline units. Wrap pipeline units make final changes to final states that do not affect the state score, the packages resolved in the final state, or the resolver input vector.

Refer to sections specific to pipeline unit types for examples and more information.

Unroll pipeline units

To keep the resolver performing well, try to unroll all the operations that do not need to be included in the actual pipeline unit run method and move them to the pre-run or post-run methods. That way, pipeline units can configure and prepare for a resolver run in advance, keeping the initialization out of the actual pipeline run. Note that the run method of a pipeline unit can be called thousands of times in a single resolver run, so optimizing these pieces matters a lot.
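A minimal sketch of this unrolling on a hypothetical sieve (the configuration key and the filtering rule are illustrative):

from typing import Generator

from thoth.adviser.sieve import Sieve
from thoth.python import PackageVersion


class DenyListSieve(Sieve):
    """A hypothetical sieve with its initialization unrolled into pre_run."""

    CONFIGURATION_DEFAULT = {"package_name": None, "denylist": []}

    def pre_run(self) -> None:
        # Computed once per resolver run instead of on every run() call.
        self._denylist = frozenset(self.configuration["denylist"])
        super().pre_run()

    def run(
        self, package_versions: Generator[PackageVersion, None, None]
    ) -> Generator[PackageVersion, None, None]:
        for package_version in package_versions:
            if package_version.name not in self._denylist:
                yield package_version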