Version: sdf-beta11

Deployment

Introduction

When you use the run command to execute a dataflow, it runs within the same process as the CLI. This is useful for development and testing because it's easy to start without needing to manage additional resources. It allows for quick testing and validation of the dataflow, and you can easily load and integrate development packages.

For production deployment, the deploy command is used to deploy the dataflow on a worker. All operations available in run also apply to deploy, with the following differences:

The dataflow is executed on the worker, not within the CLI process. The CLI communicates with the worker on the user's behalf.
The dataflow continues running even if the CLI is shut down. It will only terminate if the worker is stopped or shut down, or if the dataflow is explicitly stopped or deleted.
Dataflows in the worker only have access to published packages, unlike run mode, which allows access to local packages. If you need to use a package, you must publish it first.
Multiple dataflows can be deployed on the worker, with each dataflow isolated from the others. They do not share any state or memory but can communicate via Fluvio topics.

To use deployment mode, it's essential to understand what a worker is, and how to manage a dataflow inside a worker.

Workers

A worker is the deployment target for a dataflow and must be created and provisioned before deploying a dataflow. The worker can run anywhere as long as it can connect to the same Fluvio cluster.

There is no limit to the number of dataflows you can run on each worker, apart from CPU, memory, and disk constraints. For optimal performance, it is recommended to run a single worker per machine.

There are two types of workers: host and remote. A host worker is a simple worker designed for local deployment without requiring any additional infrastructure. It is not designed for robust production deployments. For typical production deployments, you will use remote workers. Remote workers are designed to run in the cloud, data center, or on edge devices. If you are using InfinyOn Cloud, the remote cloud worker is automatically provisioned and registered in your profile.

A worker "profile" is maintained for each Fluvio cluster. The worker profile maintains a list of uuids and human-readable names for each worker deployed on the cluster, as well as the currently selected worker. When you switch the Fluvio profile, the corresponding worker profile is used automatically. Together, the worker profile and Fluvio profile allow the SDF CLI to issue commands to the selected worker. Once a worker is selected, it will be used for all dataflow operations until you choose a different worker.

Host Workers

To create host worker, you can use the following command.

$> sdf worker create <name>

This will creates and register a new worker on your machine. It will run in the background until you shutdown the worker or machine is rebooted. The name can be anything.

Once you have created a worker, You can view the list of workers on your Fluvio cluster.

$> sdf worker create main
Worker `main` created for cluster: `local`
$> sdf worker list
    NAME  TYPE  CLUSTER  WORKER ID                             VERSION
 *  main  Host  local    7fd7eda3-2738-41ef-8edc-9f04e500b919  <your SDF version>

The * indicates the current selected worker.

SDF only supports running a single host worker for each machine since a single worker can support many dataflows. If you try to create another worker, you will get an error message.

$ sdf worker create main2
$ Starting worker: main2
There is already a host worker with pid 20686

Shutting down a worker will terminate all running dataflow and worker processes.

$> sdf worker shutdown main
Shutting down pid: 20688
Host worker: main has been shutdown

Even though the host worker is shutdown and removed from the profile, the dataflow files and state are still persisted. You can restart the worker and the dataflow will resume.

Host workers store the dataflow state in the local file system at ~/.sdf/local/worker/dataflows. If you have deleted your local fluvio cluster, the worker needs to be manually shutdown and created again. This limitation will be removed in a future release

Remote Workers

There are many ways to deploy a remote worker, you may use Kubernetes, Docker, Systemd, Terraform, Ansible, or any other tool that can manage the server process and ensure it can restart when server is rebooted.

If you are planning to use InfinyOn Cloud to run SDF, your first worker will be automatically provisioned and registered in your locally maintained SDF profile. Additional Infinyon Cloud workers can be provisioned by contacting support. If are planning to use Infinyon Cloud's hosted workers, you can safely skip to Managing Dataflows.

Overview

Below is the typical strategy for using a remote worker.

Provision the remote worker on a server.
Register the worker on your local machine with a human readable name. This will add it to your SDF profile so it can be controlled from the CLI.
Run your dataflows on the remote worker.
Unregister the worker when it is no longer needed. This doesn't shut down the worker but removes it from your SDF profile.

Provisioning

First, comfirm your Fluvio profile is pointing at your remote cluster.

$> fluvio profile switch <my-cluster>

Then provision the worker on the remote cluster.

$> sdf worker create <name>

Once the worker is created, you should be able to view it with sdf worker list.

$> sdf worker list
    NAME       TYPE    FLUVIO      WORKER ID                             VERSION
 *  my-worker  Remote  my-cluster  665b200d-21ab-46ac-8a8d-ad0e15d9b968  sdf-beta7

To view workers that have been provisoned by someone else, you can use sdf worker list --all.

Registration

Registering a worker will add it to the SDF profile maintained on your local machine, so you can control the worker from the SDF CLI.

To register the remote worker, you can use the register command.

$> sdf worker register my-worker 665b200d-21ab-46ac-8a8d-ad0e15d9b968
Worker `my-worker` is registered for cluster: `my-cluster`

Now that the worker is registered you can run your dataflows on the remote cluster. Deploying your first dataflow and more will be covered in the next section

When you have multiple workers, you can switch between them using the switch command.

$> sdf worker switch <name>

Clean up

When you are done with a worker, you can unregister it.

$> sdf worker unregister <name>

When a worker is unregistered, it will no longer be visible in you SDF profile but it will continue to run. To remove a remote worker you can use the shutdown command.

$> sdf worker shutdown <name>

Managing Dataflows

Deploying Dataflows to Workers

Once a worker is selected, you can deploy a dataflow defined in a dataflow.yaml file using the deploy command:

$> sdf deploy

The deploy command is similar to the run command. It deploys a dataflow and starts the REPL prompt. In deploy mode, the CLI sends requests to the worker. If no worker is selected, an error message will be displayed.

Error: No workers. run `sdf worker create` to create one.

When you are running a dataflow on a worker, it will indicate the name of the worker in the prompt:

$> sdf deploy
[jolly-pond] >> show state

Listing and Selecting Dataflows

To list all dataflows running in the worker, you can use the show dataflow command which shows the fully qualified name of each dataflow and its status.

$> sdf deploy
[jolly-pond]>> show dataflow 
    Dataflow                           Status           Last Updated 
    myorg/wordcount-simple@0.1.0       running          2 days ago
 *  myorg/user-job-map@0.1.0           running          10 minutes ago
[jolly-pond]>> 

Other commands like show state require an active dataflow. If there is no active dataflow, it will show an error message.

[jolly-pond]>> show state 
No dataflow selected.  Run `select dataflow`
[jolly-pond]>> 

To select a dataflow, you can use the dataflow select command with the fully qualified dataflow name.

[jolly-pond]>> select dataflow myorg/wordcount-simple@0.1.0
dataflow switched to: myorg/wordcount-simple@0.1.0

Stopping and Restarting Dataflows

In certain cases, you may want to stop a dataflow without deleting it. This can be done with the stop command.

[jolly-pond]>> stop dataflow myorg/wordcount-simple@0.1.0
Stopped dataflow: `my-org/wordcount-simple@0.1.0`

You can then restart the dataflow with the restart command.

[jolly-pond]>> restart dataflow myorg/wordcount-simple@0.1.0
Restarted dataflow: `my-org/wordcount-current@0.1.0`

Note that the stop command is not persistent. If a worker is restarted, its dataflows will be restarted as well.

Deleting Dataflows

To delete a dataflow, you can use the dataflow delete command. After you delete a dataflow, it will no longer appear in the dataflow list.

[jolly-pond]>> delete dataflow myorg/wordcount-simple@0.1.0 
    Dataflow                           Status           Last Updated  
 *  myorg/user-job-map@0.1.0           running          10 minutes ago

Introduction​

Workers​

Host Workers​

Remote Workers​

Overview​

Provisioning​

Registration​

Clean up​

Managing Dataflows​

Deploying Dataflows to Workers​

Listing and Selecting Dataflows​

Stopping and Restarting Dataflows​

Deleting Dataflows​