AWS SageMaker Model Training Service
An AWS SageMaker Model Training Service is a model training service within AWS SageMaker (a fully managed end-to-end machine learning service).
- Context:
- It can be used to create Trained SageMaker Models.
- Example(s):
- Counter-Example(s):
- See: Jupyter Notebook, Anaconda Enterprise.
References
2018a
- https://aws.amazon.com/blogs/aws/sagemaker/
- QUOTE: Authoring: Zero-setup hosted Jupyter notebook IDEs for data exploration, cleaning, and preprocessing. You can run these on general instance types or GPU powered instances.
- Model Training: A distributed model building, training, and validation service. You can use built-in common supervised and unsupervised learning algorithms and frameworks or create your own training with Docker containers. The training can scale to tens of instances to support faster model building. Training data is read from S3 and model artifacts are put into S3. The model artifacts are the data dependent model parameters, not the code that allows you to make inferences from your model. This separation of concerns makes it easy to deploy Amazon SageMaker trained models to other platforms like IoT devices.
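- The workflow described above (training data read from S3, model artifacts written back to S3, compute described by a resource configuration) maps directly onto SageMaker's low-level CreateTrainingJob API. The following is a minimal sketch of such a call via boto3; the job name, role ARN, algorithm image, and bucket paths are placeholder assumptions, not values from the post:
import boto3

sm = boto3.client('sagemaker')

sm.create_training_job(
    TrainingJobName='example-linear-learner-job',              # hypothetical job name
    RoleArn='arn:aws:iam::123456789012:role/SageMakerRole',    # hypothetical execution role
    AlgorithmSpecification={
        # Built-in algorithm image URI (region/account specific); placeholder here.
        'TrainingImage': '<linear-learner-image-uri>',
        'TrainingInputMode': 'File'
    },
    InputDataConfig=[{
        'ChannelName': 'train',
        'DataSource': {'S3DataSource': {
            'S3DataType': 'S3Prefix',
            'S3Uri': 's3://example-bucket/train/',             # training data is read from S3
            'S3DistributionType': 'FullyReplicated'
        }}
    }],
    OutputDataConfig={'S3OutputPath': 's3://example-bucket/output/'},   # model artifacts are written here
    ResourceConfig={'InstanceCount': 1, 'InstanceType': 'ml.c4.2xlarge', 'VolumeSizeInGB': 10},
    StoppingCondition={'MaxRuntimeInSeconds': 3600}
)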
2018b
- https://userX-180207.notebook.us-east-2.sagemaker.aws/notebooks/sample-notebooks/advanced_functionality/data_distribution_types/data_distribution_types.ipynb#Train
- QUOTE: Now that we have our data in S3, we can begin training. We'll use Amazon SageMaker's linear regression algorithm, and will actually fit two models in order to properly compare data distribution types:
- In the first job, we'll use FullyReplicated for our train channel. This will pass every file in our input S3 location to every machine (in this case we're using 5 machines).
- While in the second job, we'll use ShardedByS3Key for the train channel (note that we'll keep FullyReplicated for the validation channel). So, for the training data, we'll pass each S3 object to a separate machine. Since we have 5 files (one for each year), we'll train on 5 machines, meaning each machine will get a year's worth of records.
- First let's set up a list of training parameters which are common across the two jobs.
common_training_params = {
    ...
    "ResourceConfig": {
        "InstanceCount": 5,
        "InstanceType": "ml.c4.2xlarge",
        "VolumeSizeInGB": 10
    },
    ...
}
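- The difference between the two jobs in the quote above comes down to the S3DistributionType of the train channel in each job's InputDataConfig. A sketch of the two channel definitions (the S3 prefix is a placeholder, not the notebook's actual bucket) might look like:
# Job 1: every one of the 5 instances receives all of the training files.
fully_replicated_train_channel = {
    'ChannelName': 'train',
    'DataSource': {'S3DataSource': {
        'S3DataType': 'S3Prefix',
        'S3Uri': 's3://example-bucket/train/',      # placeholder prefix
        'S3DistributionType': 'FullyReplicated'
    }}
}

# Job 2: the 5 S3 objects are sharded across the 5 instances, one file each;
# the validation channel would still use FullyReplicated.
sharded_train_channel = {
    'ChannelName': 'train',
    'DataSource': {'S3DataSource': {
        'S3DataType': 'S3Prefix',
        'S3Uri': 's3://example-bucket/train/',      # placeholder prefix
        'S3DistributionType': 'ShardedByS3Key'
    }}
}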
2018c
- https://aws.amazon.com/blogs/aws/sagemaker/
- QUOTE: I’m going to leave out the actual model training code here for brevity, but in general for any kind of Amazon SageMaker common framework training you can implement a simple training interface that looks something like this:
def train(channel_input_dirs, hyperparameters, output_data_dir, model_dir,
          num_gpus, hosts, current_host):
    pass

def save(model):
    pass
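- The post deliberately leaves the body of these functions out. Purely as an illustrative sketch (the network, data handling, and file name below are assumptions, not code from the post), a filled-in interface for an MXNet Gluon model might look like:
import mxnet as mx
from mxnet import gluon

def train(channel_input_dirs, hyperparameters, output_data_dir, model_dir,
          num_gpus, hosts, current_host):
    # channel_input_dirs maps channel names (e.g. 'training') to the local
    # directories where SageMaker has staged the corresponding S3 data.
    ctx = mx.gpu() if num_gpus > 0 else mx.cpu()
    net = gluon.nn.Dense(10)                  # placeholder network
    net.initialize(mx.init.Xavier(), ctx=ctx)
    # ... build a DataLoader from channel_input_dirs and run the training loop here ...
    return net

def save(net):
    # Persist only the data-dependent parameters (the model artifacts);
    # SageMaker uploads the saved files to S3 once training completes.
    net.save_params('model.params')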
- I want to create a distributed training job on 4 ml.p2.xlarge instances in my Amazon SageMaker infrastructure. I’ve already downloaded all of the data I need locally.
import sagemaker
from sagemaker.mxnet import MXNet

m = MXNet("cifar10.py",
          role=role,
          train_instance_count=4,
          train_instance_type="ml.p2.xlarge",
          hyperparameters={'batch_size': 128, 'epochs': 50,
                           'learning_rate': 0.1, 'momentum': 0.9})
- Now that we’ve constructed our model training job we can feed it data by calling: m.fit("s3://randall-likes-sagemaker/data/gluon-cifar10").
- If I navigate to the jobs console I can see that my job is running!
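- The same status check can also be done programmatically. A small sketch using boto3 (the training job name is a placeholder for whatever name the job above was given):
import boto3

sm = boto3.client('sagemaker')

# Poll the job status instead of (or in addition to) checking the console.
resp = sm.describe_training_job(TrainingJobName='example-training-job-name')   # placeholder name
print(resp['TrainingJobStatus'])   # e.g. 'InProgress', 'Completed', 'Failed'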