Computational costs are one of the main challenges in the adoption of machine learning models. Some of the recent breakthrough models in areas such as deep reinforcement learning(DRL) have computational requirements that result prohibited to most organizations which have caused DRL to remain constrained to experiments in big AI research labs. For DRL to achieve mainstream adoption, it has to be accompanied by efficient distributed computation methods that effectively address complex computation requirements. Recently, Uber open sourced Fiber, a scalable distributed computing framework for DRL and population-based methods.
Distributed computing methods are required across many areas of the machine learning lifecycle from training to simulations. In supervised learning methods, we already seem progress with distributed training frameworks like Horovod. However, DRL scenarios introduced their own set of challenges when comes to a distributed computing infrastructure.
Distributed Computing Challenges for Reinforcement Learning
Intuitively, we tend to think that a framework for distributed training of supervised learning models should work for DRL methods. However, reality is a bit different. Given that DRL methods are often trained using a large variety simulations, we need a distributed computing framework that adapts to that unique environments. For starters, most simulation models run on CPUs and are not optimized for GPU environments. From that perspective, a distributed training method should be able to concurrently use a large amount of resources based on the specific requirements. Additionally, DRL methods typically require different resources throughout its training lifecycle. It’s very common that a DRL method would gradually scale training depending on the characteristics of its environment. Those factors make the scaling of DRL training a very unique challenge not well suited for distributed training frameworks designed for supervised models.
In addition to the unique characteristics of DRL, distributed training/computing posses very unique hurtles that should be factored in:
- There is a huge gap between making code work locally on laptops or desktops and running code on a production cluster. You can make MPI work locally but it’s a completely different process to run it on a computer cluster.
- No dynamic scaling is available. If you launch a job that requires a large amount of resources, then most likely you’ll need to wait until everything is allocated before you can run your job. This waiting to scale up makes it less efficient.
- Error handling is missing. While running, some jobs may fail. And you may be put into a very nasty situation where you have to recover part of the result or discard the whole run.
- High learning cost. Each system has different APIs and conventions for programming. To launch jobs with a new system, a user has to learn a set of completely new conventions before jobs can be launched.
These are some of the challenges that Uber setup to address with their new open source framework.
Fiber is a Python-based distributed computing library for modern computer clusters. The framework provides users the ability to write applications for a large computer cluster with a standard and familiar library interface. From a design perspective, Fiber encapsulates some key capabilities that facilitate the distributed training of DRL models:
- Easy to use. Fiber allows you to write programs that run on a computer cluster without the need to dive into the details of the computer cluster.
- Easy to learn. Fiber provides the same API as Python’s standard multiprocessing library that people are familiar with.
- Fast performance. Fiber’s communication backbone is built on top of Nanomsg, which is a high-performance asynchronous messaging library to allow fast and reliable communication.
- No need for deployment. You run Fiber application the same way as running a normal application on a computer cluster and Fiber handles the rest for you.
- Reliable computation. Fiber has built-in error handling when you are running a pool of workers.
To achieve the aforementioned goals, Fiber provides an architecture that is divided into three different layers: API, backend and cluster. The API layer provides basic building blocks for Fiber like processes, queues, pools and managers. They have the same semantics as in multiprocessing, but are extended to work in distributed environments. The backend layer handles tasks like creating or terminating jobs on different cluster managers. Finally, the cluster layer consists of different cluster managers. Although they are not a part of Fiber itself, they help Fiber to manage resources and keep track of different jobs, thereby reducing the number of items that Fiber needs to track.
The distributed computing primitives in Fiber borrow traditional artifacts from concurrent and parallel programming theory. Specifically, Fiber multiprocessing models provides an architecture that includes components such as pipes, queues, pools, and managers.
Queues and pipes in Fiber behave the same as in multiprocessing. The difference is that queues and pipes are now shared by multiple processes running on different machines. For instance, the following figure shows a Fiber queue shared across three different Fiber processes. One Fiber process is located on the same machine as the queue and the other two processes are located on another machine. One process is writing to the queue and the other two are reading from the queue.
Pools allow the user to manage a pool of worker processes. Fiber extends pools with job-backed processes so that it can manage thousands of (remote) workers per pool. The job-backed process runs containerized applications in a computer cluster either locally or across a large number of remote machines.
Finally, Fiber managers enable Fiber to support capabilities such as shared storage, which is critical to distributed systems. Usually, this function is handled by external storage like Cassandra, Redis, etc. on a computer cluster. Fiber instead provides built-in in-memory storage for applications to use. The interface is the same as multiprocessing’s Manager type.
To see Fiber in action, let’s take an example that operates with a prototypical DRL model. The following simplified code, illustrates the steps to get Fiber working. Essentially, developers require initiating the manager (RemoveEnvManager), create a series of environments and distribute the model across them collecting the final results. This code is obviously dependent on the underlying infrastructure setup for your Fiber environment.
# fiber.BaseManager is a manager that runs remotely class RemoteEnvManager(fiber.managers.AsyncManager): pass class Env(gym.env): # gym env pass RemoteEnvManager.register('Env', Env) def build_model(): # create a new policy model return model def update_model(model, observations): # update model with observed data return new_model def train(): model = build_model() manager = RemoteEnvManager() num_envs = 10 envs = [manager.Env() for i in range(num_envs)] handles = [envs[i].reset() for i in num_envs] obs = [handle.get() for handle in handles] for i in range(1000): actions = model(obs) handles = [env.step() for action in actions] obs = [handle.get() for handle in handles] model = update_model(model, obs)
Uber benchmarked Fiber against state-of-the-art distributed computing methods such as Spark, or IPyParallel across different criterias. The results showed that Fiber was able to outperform the alternative methods in most tests. For instance, the following figure shows that Fiber was able to complete tasks 24 times faster than IPyParallel and 38 times faster than Spark.
Fiber provides a very reliable architecture for the distributed training of DRL models. Fiber accomplishes many goals including efficiently leveraging a large amount of heterogeneous computing hardware, dynamically scaling algorithms to improve resource usage efficiency, and reducing the engineering burden required to make complex algorithms work on computer clusters. The initial version of Fiber has been open sourced in GitHub and the research paper is also available.
Original. Reposted with permission.