⚙️ How to launch a Ray Serve API on a remote cluster (AWS)

This tutorial will provide a step-by-step guide on how to deploy a Ray Serve application on a cloud cluster. By the end of the tutorial, you will have a minimal Ray Serve API application running on a remote cluster on AWS.

Ray.io is quite new. Even though it is great software, I found the documentation a bit confusing, and it took me longer than expected to deploy Ray Serve on a cloud cluster. This guide will hopefully teach you something and spare you some precious time.

I am writing this guide as I would have loved to find it: a step-by-step tutorial to follow along. This tutorial assumes familiarity with basic concepts such as SSH and the command line.

This guide was written for Ray 2.2.0, the stable version at the time of writing (January 2023).



Ray is a distributed compute framework written in Python that lets you parallelize and scale out workloads across a cluster of machines.

Ray abstracts away the complexities of distributed systems, allowing the programmer to focus on the logic of the applications. Under the hood, it uses a distributed actor model to execute tasks and share data between nodes. Actors are lightweight concurrent units of computation that can be created and scheduled on any node in the cluster. They can run in parallel, communicate with each other asynchronously, and share data using a distributed in-memory store called the object store.
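The actor idea itself can be sketched without Ray at all. Below is a toy, framework-free illustration (the `CounterActor` class and all names are made up for this example; Ray's real actors additionally run on remote nodes and share data through the object store):

```python
import asyncio

# A toy sketch of the actor model (not Ray's implementation): an actor owns
# private state and processes messages from its mailbox one at a time, so no
# locks are needed around the state.
class CounterActor:
    def __init__(self) -> None:
        self._mailbox: asyncio.Queue = asyncio.Queue()
        self._count = 0  # private state, only touched by the actor itself

    async def send(self, message: str) -> None:
        await self._mailbox.put(message)

    async def run(self, n_messages: int) -> int:
        # Drain the mailbox sequentially, like an actor's event loop.
        for _ in range(n_messages):
            if await self._mailbox.get() == "increment":
                self._count += 1
        return self._count

async def main() -> int:
    actor = CounterActor()
    # Senders run concurrently; the actor still handles messages one by one.
    await asyncio.gather(*(actor.send("increment") for _ in range(5)))
    return await actor.run(5)

result = asyncio.run(main())
print(result)  # 5
```

In Ray the same pattern is expressed by decorating a class with @ray.remote; the framework then takes care of scheduling the actor on a node and routing method calls to it.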

Ray Serve

Ray is composed of many libraries (Ray Core, Ray AIR, Ray Datasets, …) including Ray Serve, a scalable model-serving library for building online inference APIs. Ray Serve is a toolkit that lets you serve various types of models and business logic, regardless of the framework they were built with. This includes deep learning models built with frameworks such as PyTorch, TensorFlow, and Keras, as well as Scikit-Learn models and arbitrary Python code.

Launch Ray Serve on a local cluster

One of the most compelling aspects of Ray is that it enables seamless scaling of workloads from a laptop to a large cluster. We will start by deploying a Ray Serve application on a local cluster using the Ray CLI, and later deploy it on the cloud.

⚠️ We will make use of two different CLIs: the Ray CLI to manage the clusters and the Serve CLI to manage the deployment of the Ray Serve application.


pip install "ray[serve]"

Pro-tip: If you are using an Apple computer with an Apple Silicon chip, install grpcio with conda before running pip install (conda install grpcio).

Local cluster

Start the local cluster.

ray start --head

By default, ray start starts a Ray cluster with the head node listening on port 6379. We can access the Ray Dashboard at http://127.0.0.1:8265.

⚠️ Do not confuse the Ray Dashboard port (default 8265) used to access the Ray Dashboard with the Ray dashboard agent’s port (default 52365) used by the Ray Serve CLI to access the cluster.

Server API

We write a simple API server in api.py. It is composed of a single endpoint that returns `hello world`. You can refer to the Ray Serve documentation to learn how to serve more complex models and how to compose deployments.

from starlette.requests import Request
from ray import serve

@serve.deployment(num_replicas=1, ray_actor_options={"num_cpus": 0.2, "num_gpus": 0})
class HelloWorld:
    async def __call__(self, http_request: Request) -> str:
        return "hello world"

helloworld = HelloWorld.bind()

Deploy the API

To deploy our Serve application we use the serve deploy command from the Serve CLI.

First, we define a Serve config file that we call server_config.yaml. Here we specify the Python import path of the API server as well as the host and port of our inference application.

import_path: api:helloworld

runtime_env: {}

host: 0.0.0.0
port: 8000
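The import_path value api:helloworld means “the helloworld object inside the api module”. A rough sketch of how such a module:attribute string resolves (`load_from_import_path` is a hypothetical helper for illustration, not Ray's actual loader):

```python
import importlib

def load_from_import_path(path: str):
    # Split "module:attribute" (e.g. "api:helloworld") and import it.
    module_name, _, attribute = path.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, attribute)

# Demonstrate with a stdlib module instead of api.py: "math:pi" -> math.pi.
pi = load_from_import_path("math:pi")
print(pi)
```

This is why api.py must be importable from the directory where the Serve application is deployed.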

We start the server by calling serve deploy against the previously started cluster. Note that the address uses the Ray dashboard agent’s port.

serve deploy server_config.yaml -a http://127.0.0.1:52365

Note: we can use the serve status and serve config commands to check on the deployed application.

  • serve status to see the deployment status of the API

  • serve config to see the current config file (in our case, this will be the same content as server_config.yaml).

Test

It is time to test our freshly created API. Create a small client, test.py, that sends a GET request to the application:

import requests

response = requests.get("http://localhost:8000/")
print(response.text)

Run it:

python test.py

The last command should display “hello world” on screen. Congratulations, you just created a Ray Serve application on a local cluster 🎉

Stop the local ray cluster

Time to stop the local cluster and move on to the next part.

ray stop

Launch Ray Serve on a remote cluster (AWS)

We will now follow similar steps to launch the cluster on the cloud.

Setting up AWS

In order to take advantage of Ray's cluster management feature, we must first give it the appropriate permissions. First, we will create an AWS user with the necessary access.

  • Open AWS and open the “IAM” settings.

  • Click on “Add users”

  • Set a username

  • Select “Access key - Programmatic access” and click next

  • Click the tab “Attach existing policies directly” and select the following policies:

    1. AmazonEC2FullAccess

    2. IAMFullAccess

    3. AmazonS3FullAccess

  • Keep clicking “next” and create the new user

  • Save the aws_access_key_id and aws_secret_access_key keys under `~/.aws/credentials`. Do not share these keys with anyone!

[default]
aws_access_key_id = ...
aws_secret_access_key = ...

Cluster config

To specify the settings for the cluster, we will create a config file called cluster.yaml. Refer to the Cluster YAML Configuration Options page for more information about the available settings.

cluster_name: t2small
max_workers: 0 # default to 2

provider:
  type: aws
  region: us-west-2
  availability_zone: us-west-2a
  cache_stopped_nodes: False

docker:
  image: "rayproject/ray:latest-cpu"
  container_name: "ray_container"
  run_options:
    - --ulimit nofile=65536:65536

auth:
  ssh_user: ubuntu

available_node_types:
  ray.head.default:
    node_config:
      InstanceType: t2.small
    resources: {}
  ray.worker.default:
    node_config:
      InstanceType: t2.small
      InstanceMarketOptions:
        MarketType: spot
    resources: {}
    min_workers: 0
    max_workers: 0

head_node_type: ray.head.default

These settings configure the head node to run on a t2.small instance, which is one of the least expensive options. While this configuration is suitable for testing, it may not be sufficient for handling larger workloads with high demand.
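If the t2.small setup later proves too small, the same file can be scaled up; an illustrative sketch (the instance type and worker counts below are arbitrary examples, not recommendations):

```yaml
# Illustrative overrides for cluster.yaml -- values are examples only.
max_workers: 4                 # allow the autoscaler up to 4 workers
available_node_types:
  ray.worker.default:
    node_config:
      InstanceType: m5.large   # a larger instance than t2.small
    min_workers: 1             # keep one worker always running
    max_workers: 4
```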

Install boto3 (AWS SDK for Python)

Ray requires boto3, the AWS SDK for Python, to manage the cluster. Install it:

pip install boto3

Launch the cluster on AWS

We use the ray up command to launch the Ray cluster. This will take about 10 minutes. Go grab a coffee ☕.

ray up cluster.yaml

Start the dashboard

Once the cluster is operational, we can start the Ray dashboard again, this time connecting to the remote cluster.

ray dashboard cluster.yaml

We can view the dashboard at http://localhost:8265.


⚠️ During my testing of Ray, I found the process of forwarding the Ray dashboard agent's port to be somewhat confusing.

To deploy the Serve application, we must first use the ray attach command (or plain SSH port forwarding with ssh -L) to forward the agent's port to our laptop.

ray attach cluster.yaml -p 52365

Send api.py to the cluster:

ray rsync_up cluster.yaml api.py /home/ray/api.py -v

We can use the same command as before to deploy our Ray Serve application on the cloud cluster; thanks to the port forwarding, the address now points to the cloud cluster we just started.

serve deploy server_config.yaml -a http://127.0.0.1:52365

Time to test.

Open port 8000:

ray attach cluster.yaml -p 8000

Test it:

python test.py
>>> "hello world"

Congratulations, you have successfully deployed a Ray Serve application on an AWS cloud cluster!

I hope this guide was helpful. If you have any questions or suggestions for improvement, don't hesitate to reach out.

Have fun working with Ray!


  • Ray’s documentation

  • Boto3’s documentation

© Copyright 2023, Jonathan Besomi