Train your models faster on AWS: The importance of choosing the right storage medium

Tanmay Rane
Apr 16, 2022

This blog will give you an idea of how training speed 🚀 varies with the storage medium used to hold the training/validation data.

Photo by PAUL SMITH on Unsplash

I believe the faster you complete experiments, the faster you can draw conclusions and get to the desired results. In delivery projects, most of the time typically goes into designing the experiments, but once the design is complete, the time to run each experiment should be as short as possible.

INTRODUCTION

On AWS, there are three main ways to feed data to a SageMaker training job. I will try to draw a conclusion from a simple experiment conducted on AWS. The image below gives an overview.

SageMaker training using different storage media

EXPERIMENT SETUP

We started with 125,350 images from a dataset with around 500 classes. We had already created RecordIO files, since we use the SageMaker built-in image classification algorithm to train the models. The training file was about 16 GB. Here are descriptions of all 3 methods:

Pipe Mode: Your dataset is streamed directly from S3 to your training instances instead of being downloaded first. This means that your training jobs start sooner and need less disk space.

File Mode: Your dataset is first downloaded from S3 to your training instances, and training cannot start until the download finishes. This means that your training jobs start later and need more disk space. But it is not all bad: once the data is downloaded it is local to the training job, and the training speed is higher.

Amazon FSx: Amazon FSx makes it easy and cost-effective to launch, run, and scale feature-rich, high-performance file systems in the cloud. It supports a wide range of workloads with its reliability, security, scalability, and broad set of capabilities. You can directly store your data on FSx and mount it at the start of the training job to access the contents.
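
To make the difference between the three options a bit more concrete, here is a rough sketch of how the training channel could be described in each case. The dictionaries mirror the `InputDataConfig` channel shape of the SageMaker `CreateTrainingJob` API. The bucket, file system ID, and paths are hypothetical placeholders, and `FSxLustre` is an assumption on my part since the text above only says "Amazon FSx". This is not the exact configuration used in the experiment.

```python
# Hypothetical placeholders: none of these values come from the experiment.
TRAIN_S3_URI = "s3://example-bucket/train/"
RECORDIO = "application/x-recordio"


def s3_channel(input_mode, s3_uri=TRAIN_S3_URI):
    """Build a 'train' channel that reads RecordIO data from S3.

    input_mode is "Pipe" (stream from S3) or "File" (download before training).
    """
    if input_mode not in ("Pipe", "File"):
        raise ValueError(f"unsupported input mode: {input_mode}")
    return {
        "ChannelName": "train",
        "InputMode": input_mode,
        "ContentType": RECORDIO,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }


def fsx_channel(file_system_id, directory_path):
    """Build a 'train' channel that mounts a file system read-only.

    Assumes FSx for Lustre; the file system ID and path are supplied by the caller.
    """
    return {
        "ChannelName": "train",
        "ContentType": RECORDIO,
        "DataSource": {
            "FileSystemDataSource": {
                "FileSystemId": file_system_id,
                "FileSystemType": "FSxLustre",
                "FileSystemAccessMode": "ro",
                "DirectoryPath": directory_path,
            }
        },
    }


channels = {
    "pipe": s3_channel("Pipe"),
    "file": s3_channel("File"),
    "fsx": fsx_channel("fs-0123456789abcdef0", "/fsx/train"),  # made-up ID and path
}
```

Notice that Pipe Mode and File Mode differ only in the `InputMode` field, while the FSx channel swaps the S3 data source for a file system data source.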

For each of the methods mentioned above, training ran for 20 epochs, and I noted the following metrics, which can help decide which one to choose:

  1. One Epoch Time ⌛
  2. Samples/Sec (Number of images processed by algorithm per second) 💽
  3. Launch Time (Time taken for each training job to launch and start training) 💤

For training, we used ResNet-50 as the standard model across all the experiments.
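
The three metrics are tied together by simple arithmetic, which the sketch below spells out. It assumes all 125,350 images are processed once per epoch (the split between training and validation is not stated here), and the throughput and launch time passed in at the bottom are made-up inputs, not measurements from the experiment.

```python
NUM_IMAGES = 125_350  # dataset size from the setup; assumes every image is used for training


def epoch_time_seconds(samples_per_sec, num_images=NUM_IMAGES):
    """Time for one pass over the data at a given throughput."""
    return num_images / samples_per_sec


def total_time_seconds(launch_s, samples_per_sec, epochs=20, num_images=NUM_IMAGES):
    """Launch time plus the time for all epochs."""
    return launch_s + epochs * epoch_time_seconds(samples_per_sec, num_images)


# Made-up inputs for illustration only, NOT values from the experiment.
example = total_time_seconds(launch_s=300, samples_per_sec=500)
print(f"Total: {example / 3600:.2f} h")
```

Doubling Samples/Sec halves the epoch time, while launch time is paid only once per job.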

OBSERVATIONS 📝

Comparison Table

Let’s analyze the table above:

Launch Time: No surprises here. File Mode has to download the data first, so it takes longer to launch. Thanks to streaming, Pipe Mode keeps up with directly mounted FSx! 🚀

Samples/Sec: This is where streaming is less effective than the other modes. File Mode almost catches up with FSx, which is quite nice! As the storage is local, the algorithm gets faster I/O operations 💽, which shows in the numbers above.

Epoch Time: Since epoch time is inversely proportional to Samples/Sec, it tells the same story: FSx and File Mode are almost the same, while Pipe Mode lags by a considerable margin.

20 Epoch Total Time: This is what the blog is about. Pipe Mode takes an hour longer than the other two modes! FSx and File Mode finish around the 2-hour mark, but Pipe Mode takes about 3 hours 😲.
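
Since Pipe Mode launches sooner but runs each epoch slower, there is a break-even point after which File Mode wins overall. A minimal sketch of that calculation follows. The function name and the numbers in the usage line are hypothetical and only illustrate the trade-off, they are not the measured values.

```python
import math


def break_even_epochs(file_launch_s, pipe_launch_s, file_epoch_s, pipe_epoch_s):
    """Smallest whole number of epochs after which File Mode's total time
    is no longer than Pipe Mode's.

    Returns None if File Mode is not faster per epoch, since it then
    cannot make up for a slower launch.
    """
    saving_per_epoch = pipe_epoch_s - file_epoch_s
    if saving_per_epoch <= 0:
        return None
    launch_gap = file_launch_s - pipe_launch_s
    return max(0, math.ceil(launch_gap / saving_per_epoch))


# Hypothetical timings in seconds, for illustration only.
print(break_even_epochs(file_launch_s=600, pipe_launch_s=120,
                        file_epoch_s=300, pipe_epoch_s=420))
```

For short jobs with only a few epochs the faster launch can dominate. For long jobs like the 20-epoch run above, the per-epoch difference is what decides the total.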

CONCLUSION 🤔

Well, truth be told, it wouldn’t be fair to pick a clear winner, so we’ll judge on several criteria.

Speed 🚀: FSx would be the obvious choice, as it outperforms every other mode. Another option is EFS, which I conveniently did not test 🤦‍♂️!! But if the narrow wins against EBS (the local volume File Mode downloads to) indicate anything, it is that the two would be pretty close. I would not recommend File Mode here because, as file sizes grow, the launch time becomes a pain point.

Manageability 🧑‍💼: Managing data on FSx/EFS is much easier and faster, as they are mounted on the go. But here I’m talking about prepared data, i.e. binary data exchange formats (RecordIO, TFRecord, etc.). For raw file management I would prefer S3.

Cost 💸: Very important. I have seen some tight-budget projects where using something as gourmet as FSx/EFS is not an option. In those scenarios I would definitely choose File Mode or Pipe Mode, as File Mode adds no extra cost and S3 is the ultimate cost saver 🤑.

Time 🕛: Last but most important for me: whenever my aim is to complete experiments FASTER, FSx/EFS is my go-to option.

Hope this blog helps you choose the storage medium for your training adventures. Hit the claps 👏 button if you appreciate the content. Until next time.

Tanmay Rane

Machine Learning Engineer hacking solutions to problems everyday!