Deep learning workloads are highly parallelizable. However, distributing deep neural network training across GPUs can be inefficient due to data-movement overheads, GPU stalls, and limited GPU memory. This paper describes GeePS, a parameter server system specialized for scaling deep learning applications across a GPU cluster.
Prior work scaled distributed DNN training using CPU-based parameter servers. Like those systems, GeePS stores and synchronizes the DNN parameters, such as connection weights and convolution filters. On top of that, GeePS adds GPU-specific optimizations, including prebuilt indexes, GPU caching, data staging, and explicit memory management.
There are two ways to distribute DNN training: data parallel and model parallel. Data parallelism improves training throughput by partitioning the input data across GPUs; it works because gradients are additive, so per-shard gradients can simply be summed. Model parallelism partitions the parameters themselves across GPUs and is used when a single GPU's memory cannot hold the entire neural network. Data parallelism is therefore limited by the need to fit the whole model in each worker's memory. GeePS supports data parallelism and sidesteps this memory limitation by spilling to otherwise unused CPU memory (an approach similar to vDNN).
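The additive property that makes data parallelism work can be checked in a few lines. This is an illustrative sketch (not GeePS code): a linear least-squares gradient computed on two disjoint data shards sums to the gradient over the full dataset.

```python
import numpy as np

def worker_gradient(weights, data_shard, labels_shard):
    # Least-squares gradient on one shard: X^T (Xw - y)
    return data_shard.T @ (data_shard @ weights - labels_shard)

rng = np.random.default_rng(0)
X, y, w = rng.normal(size=(8, 3)), rng.normal(size=8), rng.normal(size=3)

full_grad = X.T @ (X @ w - y)                        # gradient on all data
shard_grads = [worker_gradient(w, X[i::2], y[i::2])  # two data-parallel workers
               for i in range(2)]
assert np.allclose(full_grad, sum(shard_grads))      # additivity holds
```

Because the shard gradients sum exactly to the full-batch gradient, each worker can compute on its own partition and only the (small) gradients need to be exchanged.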
In GeePS, the parameters being learned are kept in a distributed key-value store, with data sharded among the worker machines. The ML application uses explicit read and update operations to fetch parameter values or apply deltas to them. GeePS supports two synchronization models: bulk synchronous parallel (BSP) and stale synchronous parallel (SSP). BSP requires all updates from the previous clock to be applied before the next one starts. SSP relaxes this by letting workers read values that may be stale by up to a bounded number of clocks (the slack), which reduces stalls but can hurt model accuracy.
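The read-side staleness check can be sketched as follows; the class and method names are ours, not the GeePS API. BSP is simply SSP with a slack of zero, and a read at clock t may proceed once the table reflects all updates through clock t - 1 - slack.

```python
class ParamTable:
    """Minimal sketch of per-table staleness bookkeeping (names assumed)."""
    def __init__(self, slack=0):
        self.data_age = 0     # clocks of updates applied to this table
        self.slack = slack    # 0 => BSP; >0 => SSP

    def ready(self, requester_clock):
        # A reader at clock t needs all updates up to clock t - 1 - slack.
        return self.data_age >= requester_clock - 1 - self.slack

bsp, ssp = ParamTable(slack=0), ParamTable(slack=2)
bsp.data_age = ssp.data_age = 3
assert bsp.ready(4) and not bsp.ready(5)   # BSP: must be fully caught up
assert ssp.ready(6) and not ssp.ready(7)   # SSP: up to 2 clocks stale is fine
```

A worker that fails the check blocks until the table's data age advances, which is exactly where GPU stalls come from under tight synchronization.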
As mentioned earlier, GeePS shards the parameters across the machines. Parameters are stored as fixed-size arrays called rows, and each row is addressed by a key. Rows are bundled into tables, and each table tracks metadata such as its data age, which records how many clocks of updates have been applied to it and thus whether it needs a refresh.
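A minimal sketch of this layout, with assumed names and a simple modulo sharding function standing in for whatever key-to-machine mapping GeePS actually uses:

```python
import numpy as np

ROW_LEN, NUM_MACHINES = 4, 3   # illustrative sizes, not from the paper

class Table:
    def __init__(self):
        self.rows = {}        # key -> fixed-size row (numpy array)
        self.data_age = 0     # clocks of updates applied so far

    def update(self, key, delta):
        row = self.rows.setdefault(key, np.zeros(ROW_LEN))
        row += delta          # updates are deltas, so they commute

def shard_of(key):
    return key % NUM_MACHINES # assumed: simple modulo sharding

t = Table()
t.update(10, np.ones(ROW_LEN))
t.update(10, np.ones(ROW_LEN))
assert shard_of(10) == 1 and np.allclose(t.rows[10], 2.0)
```

Fixed-size rows make it cheap to batch transfers and to build the access indexes ahead of time, since every row's location and extent is known.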
In addition, each instance maintains a parameter cache, local data, and an access buffer pool. The parameter cache (write-back) holds all parameters of the network in GPU memory, overflowing to CPU memory when needed; it reduces communication overhead within an iteration. Local data (also stored in GPU and CPU memory) holds the additional intermediate state needed during training; it is local to the machine and not shared across the cluster. The access buffer pool serves the application's read and update calls: GeePS first allocates a buffer in GPU memory and copies the data into it before letting the application read or write that buffer, which ensures data consistency.
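The buffer-pool protocol can be sketched like this (all names are ours): a read hands the application a staged copy from the pool rather than a pointer into the parameter cache, so background transfers can never race with application access.

```python
import numpy as np

class BufferPool:
    """Illustrative staging-buffer pool; not the GeePS implementation."""
    def __init__(self, nbufs, rowlen):
        self.free = [np.empty(rowlen) for _ in range(nbufs)]

    def read(self, cache, key):
        buf = self.free.pop()          # acquire a staging buffer
        np.copyto(buf, cache[key])     # copy the cached value into it
        return buf

    def release(self, buf):
        self.free.append(buf)          # recycle once the app is done

cache = {7: np.arange(4.0)}
pool = BufferPool(nbufs=2, rowlen=4)
buf = pool.read(cache, 7)
buf[0] = 99.0                          # app mutates its private copy
assert cache[7][0] == 0.0              # the parameter cache is untouched
pool.release(buf)
```

The copy costs a little GPU bandwidth but buys isolation: the cache can be refreshed by the synchronization threads while the application still holds its buffer.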
GeePS prebuilds its indexes from the access pattern collected during a dry run before the actual training starts; since NN computations are repetitive, this information does not change. During a training iteration, the application reads and updates parameters on individual machines, and at the end GeePS triggers a TableClock, which signals all machines to synchronize their data according to the BSP or SSP policy. Data transfer between machines is handled by three thread types: keeper, pusher, and puller. The pusher sends a machine's parameter updates over the network to the keepers that own the corresponding parameter shards; each keeper applies the updates to its shard, and the pullers on other machines fetch the fresh values into their GPU parameter caches.
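The TableClock barrier can be sketched as a per-table counter (class and method names are ours): a table's data age is the number of clocks that every worker has finished, i.e., the clock of the slowest worker.

```python
class TableClock:
    """Illustrative clock barrier; not the GeePS implementation."""
    def __init__(self, num_workers):
        self.num_workers = num_workers
        self.ticks = {}              # worker id -> clocks committed

    def clock(self, worker):
        # Worker signals it has applied all its updates for this clock.
        self.ticks[worker] = self.ticks.get(worker, 0) + 1

    def data_age(self):
        # The table is only as fresh as its slowest worker.
        return min(self.ticks.get(w, 0) for w in range(self.num_workers))

tc = TableClock(num_workers=2)
tc.clock(0); tc.clock(0); tc.clock(1)
assert tc.data_age() == 1    # worker 1 hasn't committed clock 2 yet
```

Combined with the staleness check above, this counter is what decides whether a BSP reader must stall or an SSP reader may proceed.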
Data allocation and reclamation are likewise handled by two threads, the allocator and the reclaimer, which synchronize with each other using mutex locks on parameter partitions. GeePS deliberately avoids fine-grained row-level locking: usually only one CPU thread interacts with GeePS, so lock contention is not a significant concern.
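The coarse-grained locking choice amounts to one mutex per parameter partition rather than per row; a minimal sketch, with the partition count and helper name assumed:

```python
import threading

NUM_PARTITIONS = 4   # assumed; one lock guards each partition, not each row
partition_locks = [threading.Lock() for _ in range(NUM_PARTITIONS)]

def with_partition(key, fn):
    # Allocator and reclaimer would serialize on the same partition lock.
    lock = partition_locks[key % NUM_PARTITIONS]
    with lock:
        return fn()

result = with_partition(7, lambda: "allocated")
assert result == "allocated"
```

With effectively one application thread per machine, the cheaper coarse locks win: they are rarely contended and avoid the bookkeeping cost of a lock per row.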
GeePS is evaluated on image classification workloads. It achieves a 13x throughput improvement when scaling to 16 GPUs, with less than 8% of time lost to GPU stalls, far lower than with a CPU-based parameter server. Using only 4 GPU machines, GeePS matches the throughput that ProjectAdam, a state-of-the-art CPU-based distributed deep learning system, achieves with 108 machines.