.. _debusine-concepts:

=================
Debusine concepts
=================

Debusine has been designed to run a network of generic “workers” that can
perform various “tasks” producing “artifacts”. Interesting artifacts that
we want to keep in the long term are stored in “collections”.

While tasks can be scheduled as individual “work requests”, the power of
Debusine lies in its ability to combine multiple (different) tasks in
“workflows”, where each workflow has its own logic to orchestrate multiple
work requests across the available workers.

A Debusine instance can be multi-tenant, divided into “scopes” of users
and groups. These contain “workspaces” that have their own sets of
artifacts and collections. Workspaces can inherit from one another in
order to share collections and artifacts when required.

.. _explanation-artifacts:

Artifacts
=========

Artifacts are at the heart of Debusine. They combine:

* a set of files
* key-value data (stored as a JSON-encoded dictionary)
* a category

The category is just a string identifier used to recognize artifacts sharing
the same structure. You can create and use categories as you see fit but we
have defined a basic :ref:`ontology <artifact-reference>` suited for the
case of a Debian-based distribution.

Artifacts are both inputs (submitted by users) and outputs (generated by
tasks). They are created and stored in a workspace and can have an
expiration delay controlling their lifetime. Artifacts are (mostly)
immutable: they should not be modified after creation.

Files in artifacts are content-addressed (stored by hash) in the
database, so a single file can be referenced in multiple places without
unnecessary data duplication.

Files in artifacts have names that may include directories.
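
The structure described above can be pictured with a small sketch. It is an
illustration only, not Debusine's actual data model: ``ArtifactSketch``,
``ToyFileStore``, and the example categories and file names are hypothetical,
and SHA-256 is assumed as the hash function.

.. code-block:: python

   import hashlib
   from dataclasses import dataclass, field
   from typing import Any


   class ToyFileStore:
       """Hypothetical in-memory store keeping file contents keyed by digest."""

       def __init__(self) -> None:
           self.blobs: dict[str, bytes] = {}

       def add(self, contents: bytes) -> str:
           digest = hashlib.sha256(contents).hexdigest()
           # Identical contents map to the same key, so only one copy is kept.
           self.blobs.setdefault(digest, contents)
           return digest


   @dataclass(frozen=True)
   class ArtifactSketch:
       """Hypothetical stand-in for an artifact: category, data and files."""

       category: str
       data: dict[str, Any] = field(default_factory=dict)
       # Maps a file name (which may include directories) to a content digest.
       files: dict[str, str] = field(default_factory=dict)


   store = ToyFileStore()
   upload = ArtifactSketch(
       category="example:upload",
       data={"note": "submitted by a user"},
       files={"src/hello.txt": store.add(b"hello\n")},
   )
   result = ArtifactSketch(
       category="example:result",
       files={"output/copy-of-hello.txt": store.add(b"hello\n")},
   )

Both artifacts reference the same contents under different names, yet the toy
store holds a single copy. The ``frozen=True`` flag only loosely mirrors the
"mostly immutable" nature of artifacts.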

Artifacts can have relations with other artifacts, see :ref:`artifact
relationships <artifact-relationships>`.

.. _explanation-assets:

Assets
======

Assets are holders of data, like artifacts. They don't store any files,
but like artifacts, they have:

* a category
* key-value data (stored as a JSON-encoded dictionary)

Assets are used to manage signing keys.  They have strong permissions
associated with them, which may link them to workspaces.

.. _explanation-collections:

Collections
===========

A collection is an abstraction used to manage and store a coherent set of
"collection items". Each collection has a ``category`` field describing
its intended use case, the allowed collection items, the associated
metadata, etc. See the :ref:`reference <collection-reference>`.

A collection is meant to represent things like this:

* A suite in the Debian archive (e.g. "Debian bookworm"): the
  ``debian:suite`` collection is a collection of ``debian:source-package``
  and ``debian:binary-package`` artifacts.
* A Debian archive (a.k.a. repository) that contains multiple suites:
  the ``debian:archive`` collection is a collection of ``debian:suite``
  collections
* Build chroots for all Debian suites: the
  ``debian:environments`` collection stores ``debian:system-tarballs``
  artifacts for multiple Debian suites
* The results of a lintian analysis or autopkgtests runs across all the
  packages in a target suite
* Extracted ``.desktop`` files for each package name in a suite

To cover those various cases, each collection item consists of some
arbitrary metadata and can optionally link to an artifact or to a
collection. Hence we define three kinds of collection items:

* artifact-based items: they link an artifact with some metadata
* collection-based items: they link a collection with some metadata
* bare-data items: they only store some metadata

Each collection item has its own "category" that defines the nature of the
item. For artifact-based items and collection-based items, it duplicates
the category of the linked artifact or collection. For bare-data items, it
indirectly defines the structure to expect in the metadata.

A collection item also has a unique name within the collection so that
the collection can be seen as a big Python dictionary mapping names
to artifacts, collections and arbitrary data.
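
The dictionary analogy can be made concrete with a short sketch. The
``ItemSketch`` class, the item names, the numeric identifiers and the
``example:note`` category are all hypothetical, and a real collection
category would normally restrict which kinds of items it accepts. The toy
collection below mixes all three kinds purely for illustration.

.. code-block:: python

   from dataclasses import dataclass, field
   from typing import Any, Optional


   @dataclass
   class ItemSketch:
       """Hypothetical collection item: metadata plus an optional link."""

       category: str
       data: dict[str, Any] = field(default_factory=dict)
       artifact_id: Optional[int] = None
       collection_id: Optional[int] = None

       @property
       def kind(self) -> str:
           """Classify the item by what it links to, if anything."""
           if self.artifact_id is not None:
               return "artifact"
           if self.collection_id is not None:
               return "collection"
           return "bare"


   # Item names are unique within a collection, hence the dictionary.
   toy_collection: dict[str, ItemSketch] = {
       "hello_1.0": ItemSketch("debian:source-package", artifact_id=101),
       "bookworm": ItemSketch("debian:suite", collection_id=7),
       "note-on-hello": ItemSketch("example:note", data={"reviewed": True}),
   }

   kinds = {name: item.kind for name, item in toy_collection.items()}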

Collections can be uniquely identified within a workspace by category and
name, and can provide useful starting points for further lookups within
collections.

To learn more about collections, see their
:ref:`data model <collection-data-models>`.

.. _explanation-tasks:

Tasks
=====

Tasks are time-consuming operations that are typically offloaded to
dedicated workers.

Debusine contains a library of tasks to perform various operations that
are useful when you contribute to Debian or one of its derivatives ("build
a package", "run lintian", "upload a package", etc.).

The behaviour of each task can be controlled/customized with some input
parameters. The combination of a task and actual input parameters
constitutes a :ref:`work request <explanation-work-requests>` that can be
scheduled to run.

There are :ref:`six types of tasks <reference-task-types>` but the most
interesting ones are the ``Worker``, ``Server`` and ``Signing`` tasks.

Worker tasks
~~~~~~~~~~~~

Worker tasks run on external workers, often within some controlled
:ref:`execution environments <reference-execution-environment>`. They can
only interact with Debusine through the public API. Hence they will
typically only consume and produce artifacts, and create relationships
between them.

Worker tasks can require specific features from the workers on which they
will run. This is used to ensure that the assigned worker will have all
the required resources for the task to succeed.

Signing tasks
~~~~~~~~~~~~~

Signing tasks are very much like worker tasks, except that they have
access to a local database containing sensitive cryptographic material
(i.e. private keys) that needs to be stored in a secure manner and whose
access should be tightly controlled.

Server tasks
~~~~~~~~~~~~

Server tasks perform operations that require direct database access
and that may take some time to run. They run on Celery workers, and must
not execute any user-controlled code.

.. _explanation-work-requests:

Work requests
=============

Work requests are the way Debusine schedules tasks to workers and monitors
their progress and success. A work request ties a task (that is, some code
to execute on a worker) to its parameters (the values used to customize the
behaviour of the task).

.. note::

   There are different :ref:`types of tasks <explanation-tasks>`, but they
   all share the same work request structure for the purpose of being
   scheduled. This includes workflows, thus much of what is said about
   work requests also applies to the concept of workflows, even if we present
   workflows separately from tasks due to their special role in Debusine.

Worker tasks and workflows are the two types of tasks that can be
scheduled individually by Debusine users. All the other types of tasks are
restricted and can only be started indirectly through one of the workflows
that are available in the workspace.

A work request is tied to a workspace. This defines what the task has
access to and where its output will be stored.  The :ref:`artifacts
<explanation-artifacts>` generated as output by the task are linked to the
work request and can be easily reused.

To learn more about work requests, you can read:

* :ref:`work-request-scheduling` for more explanations about how work
  requests are scheduled.
* :ref:`work-requests` for more information about the data model and all
  the special cases.

.. _explanation-workflows:

Workflows
=========

Workflows are advanced server-side logic that can schedule and combine
server tasks and worker tasks: outputs of some work requests can become
the input of other work requests, and the flow of execution can be
influenced by the results of already executed work requests.

Workflows are powerful operations in particular due to their ability
to run server tasks. Until finer-grained access control is implemented,
users can only start the subset of workflows that have been made available
by the workspace administrator (by creating *workflow templates*). This
process:

* grants a unique name to the workflow so that it can be easily identified
  and started by users
* defines all the input parameters that cannot be overridden when a user
  starts the workflow

Those workflow templates can then be turned into actual running workflows
by users or external events, through the web
interface or through the API.

The input parameters that are not set in the workflow template are
called run-time parameters and they have to be provided by the user
that starts the workflow. Those parameters are stored in a ``WorkRequest``
model with ``task_type`` set to ``workflow``, which is used as the root of a
``WorkRequest`` hierarchy covering the whole duration of the process
controlled by the workflow.
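
The split between template parameters and run-time parameters can be
sketched as follows. This is a simplified illustration, not Debusine's
implementation: the ``start_workflow`` function, the ``task_data`` key and
the parameter names are assumptions made for the example, and the sketch does
not check that every required run-time parameter was supplied.

.. code-block:: python

   from typing import Any


   def start_workflow(
       template_params: dict[str, Any], runtime_params: dict[str, Any]
   ) -> dict[str, Any]:
       """Build a dictionary standing in for the root work request."""
       # Parameters fixed by the template cannot be overridden by the user.
       locked = template_params.keys() & runtime_params.keys()
       if locked:
           raise ValueError(
               f"cannot override template parameters: {sorted(locked)}"
           )
       return {
           "task_type": "workflow",
           "task_data": {**template_params, **runtime_params},
       }


   # Hypothetical template: the administrator fixes the target distribution.
   template = {"target_distribution": "debian:bookworm"}
   # The user supplies the remaining (run-time) parameters.
   root = start_workflow(template, {"source_package": "hello"})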

Once completed, the remaining lifetime of the workflow instances is
controlled by their expiration date and the expiration of some associated
artifacts.

To begin with, available workflows will be limited to those that
are fully implemented in Debusine. In the future, we expect to add
a more flexible approach where administrators can submit fully
customized logic combining various building blocks.

Here are some examples of possible workflows:

 * Package build: it would take a source package and a target distribution
   as input parameters, and the workflow would automate the following
   steps:
   { sbuild on all architectures supported in the target distribution }
   → add source and binary packages to target distribution.

   See :ref:`sbuild workflow <workflow-sbuild>`.

 * Package review: it would take a source package and associated binary
   packages and a target distribution, and the workflow would control
   the following steps:
   { generating debdiff between source packages, lintian, autopkgtest,
   autopkgtests of reverse-dependencies } → manual validation by reviewer
   → add source and binary packages to target distribution.

 * Both build and review could be combined in a larger workflow.

   In that case, the reverse-dependencies whose autopkgtests should be run
   cannot be identified until the sbuild task has completed, so the
   workflow would be expanded/reconfigured once that step has completed.

 * Update a collection of lintian analyses of the latest packages in a
   given distribution based on the changes of the collection
   representing that distribution.

   Here again the set of lintian analyses to run depends on a :ref:`first
   step of comparison between the two collections <collection-derived>`.

See :ref:`workflow-orchestration` for more on how they work, and
:ref:`Workflows <workflow-reference>` for a list of available workflows.

.. _explanation-file-stores:

File stores
===========

Files in artifacts are stored in file stores.  These are content-addressed:
a file with a given SHA-256 digest is only stored once in any given store,
and may be retrieved by that digest.  When a new artifact is created, its
files are uploaded to stores as needed.  Some of the files may already be
present.  In that case, if the file is already part of the artifact's
workspace, then it does not need to be reuploaded; but otherwise, it must be
reuploaded to avoid users obtaining unauthorized access to existing file
contents.
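
The upload rule above can be expressed as a small decision function. This
is a sketch under simplifying assumptions: the ``needs_upload`` function is
hypothetical, and the sets of digests stand in for whatever bookkeeping the
server actually performs.

.. code-block:: python

   def needs_upload(
       digest: str, store_digests: set[str], workspace_digests: set[str]
   ) -> bool:
       """Return True if the file contents must be uploaded."""
       if digest not in store_digests:
           # The contents are not stored yet, so they must be uploaded.
           return True
       # The contents exist in the store. Skipping the upload is only safe
       # if the file already belongs to the artifact's workspace; otherwise
       # knowing a digest would be enough to gain access to the contents.
       return digest not in workspace_digests


   in_store = {"aaa", "bbb"}
   in_workspace = {"aaa"}

   decisions = {
       digest: needs_upload(digest, in_store, in_workspace)
       for digest in ("aaa", "bbb", "ccc")
   }

Here ``aaa`` needs no upload, while ``bbb`` (stored, but in another
workspace) and ``ccc`` (not stored at all) must both be uploaded.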

Local storage is useful as the initial destination for uploads to Debusine,
but it has to be backed up manually and might not scale to sufficiently
large volumes of data.  Remote storage such as S3 is also available.  It is
possible to serve a file from any store, with policies for which one to
prefer for downloads and uploads.

Administrators can set policies for which file stores to use at the
:ref:`scope <explanation-scopes>` level, as well as policies for populating
and draining stores of files.  Most bulk movement is handled by a periodic
job.

To learn more about file stores, see their :ref:`reference
<file-store-reference>`.

.. _explanation-scopes:

Scopes
======

Scopes are the foundational concept used to implement multi-tenancy in
Debusine. They are an administrative grouping of users, groups and
workspaces. They appear as the initial segment in the URL path of most web
views.

Groups and workspaces can only exist in a single scope. Users are global
and might be part of multiple scopes.

Since artifacts have to be stored somewhere, scopes also define the set of
:ref:`file stores <explanation-file-stores>` where files can be stored.

.. _explanation-workspaces:

Workspaces
==========

A workspace is an administrative concept hosting artifacts and
collections. Users can get different levels of access to those artifacts
and collections by being granted different roles on the workspace.

Workspaces have the following important properties:

* public: a boolean which indicates whether the artifacts are publicly
  accessible or if they are restricted to the users belonging to the
  workspace
* default_expiration_delay: the minimal duration that a new
  artifact is kept in the workspace before being expired. See
  :ref:`expiration-of-data`.
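
As a rough illustration of how an expiration delay bounds an artifact's
lifetime, consider the sketch below. It assumes that an artifact without its
own delay falls back to the workspace default; the function names are
hypothetical and the actual rules are described in
:ref:`expiration-of-data`.

.. code-block:: python

   from datetime import datetime, timedelta, timezone
   from typing import Optional


   def effective_delay(
       artifact_delay: Optional[timedelta], workspace_default: timedelta
   ) -> timedelta:
       """Use the artifact's own delay if set, else the workspace default."""
       return workspace_default if artifact_delay is None else artifact_delay


   def may_expire(created_at: datetime, delay: timedelta, now: datetime) -> bool:
       """Return True once the artifact has been kept for at least the delay."""
       return now >= created_at + delay


   created = datetime(2024, 1, 1, tzinfo=timezone.utc)
   delay = effective_delay(None, workspace_default=timedelta(days=30))
   too_early = may_expire(created, delay, created + timedelta(days=29))
   old_enough = may_expire(created, delay, created + timedelta(days=30))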

.. _explanation-workers:

Workers
=======

Workers are services that run :ref:`tasks <explanation-tasks>` on behalf of
a Debusine server.  There are three types of worker.

External workers
~~~~~~~~~~~~~~~~

Most workers are external workers, running an instance of
``debusine-worker``.  This is a daemon that runs untrusted tasks using some
form of containerization or virtualization.  It has no direct access to the
Debusine database; instead, it interacts with the server using the HTTP API
and WebSockets.

External workers process one task at a time, and only process ``Worker``
tasks.

To support spikes in work requests, Debusine is able to use
:ref:`dynamic-worker-pools` to host external workers in clouds.
These are provisioned as required, and terminated when idle.

Celery workers
~~~~~~~~~~~~~~

A Debusine instance normally has an associated Celery worker, which is used
to run tasks that require direct access to the Debusine database.  These
tasks are necessarily trusted, so they must not involve running
user-controlled code.

Celery workers have a concurrency level, normally set to the number of
logical CPUs in the system (:py:func:`os.cpu_count`).

.. todo::

   Document (and possibly fix) what happens when workers are restarted while
   running a task.

Signing workers
~~~~~~~~~~~~~~~

Signing workers work in a similar way to external workers, but they have
access to private key material, either directly or via a hardware security
module.  They only process ``Signing`` tasks.
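
The division of labour between the three worker types described in this
section can be summarised as a lookup table. The sketch below is illustrative
only: the ``worker_kind_for`` function and the worker labels are
hypothetical, and only the three task types discussed here are covered.

.. code-block:: python

   # Task type -> kind of worker that processes it, as described above.
   WORKER_FOR_TASK_TYPE = {
       "Worker": "external",
       "Server": "celery",
       "Signing": "signing",
   }


   def worker_kind_for(task_type: str) -> str:
       """Return the kind of worker expected to run the given task type."""
       try:
           return WORKER_FOR_TASK_TYPE[task_type]
       except KeyError:
           raise ValueError(
               f"task type not covered by this sketch: {task_type}"
           ) from None


   assignments = [worker_kind_for(t) for t in ("Worker", "Server", "Signing")]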
