Steps to Train a Machine Learning Model with Amazon SageMaker: A First Look

Posted on October 22, 2019

Blog, Insights, Technology

By 谢佳怡


SageMaker is a machine learning service managed by Amazon. It’s basically a service that combines EC2, ECR and S3 all together, allowing you to train complex machine learning models quickly and easily, and then deploy the model into a production-ready hosted environment. It provides many best-in-class built-in algorithms, such as Factorization Machines, XGBoost etc. It also allows you to train models using various machine learning frameworks, such as Apache MXNet, TensorFlow, and Scikit-learn.

A straightforward way to interact with SageMaker is to use a notebook instance, a process that Amazon describes in detail. We use SageMaker in a slightly different way: we only want it for the model training part, so that we can train a complex model on a large dataset without worrying about the messy infrastructural details. The rest of the process (e.g., data preparation, making predictions) runs locally. So, in our use case, we want to:

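1. Interact with SageMaker jobs from a local machine, without using a SageMaker notebook instance. Why? Well, there are a few advantages:

Starting a notebook instance takes a few minutes, which is slow.
Unless you stop the instance manually, you are billed for it the whole time it is running, whether or not you are actively using it. If you submit training jobs from your local machine instead, you are billed only for the model training itself.
With the code on your local machine, you can debug in your IDE and use GitHub for version control.

2. Access the trained model locally, so that we can:

Look into the details of the model, instead of using it as a black box
Make predictions locally, and use the model in our own way

The rest of this post walks through how we achieve this in 5 steps:

Set up your local machine, so that you can interact with SageMaker jobs locally
Prepare your data
Submit a training job
Download the trained model
Make predictions locally

At the end, we also briefly show how to use SageMaker's hyperparameter tuner, which can help you tune your machine learning model.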

Set up your local machine

To interact with SageMaker jobs programmatically from your local machine, you need to install the SageMaker Python SDK and the AWS SDK for Python (boto3). You can install both by running pip install sagemaker boto3

The easiest way to test whether your local environment is ready is to run through a sample notebook, for example, An Introduction to Factorization Machines with MNIST. Run this sample notebook, and check whether you need to install additional packages or whether any AWS credential information is missing.
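Before opening the sample notebook, a quick import check can tell you whether the two packages are installed at all. The sketch below is only a convenience we are adding here; the helper name missing_packages is hypothetical and is not part of either SDK. It does not verify AWS credentials, which the sample notebook will still exercise.

```python
import importlib.util


def missing_packages(names):
    """Return the package names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(["sagemaker", "boto3"])
    if missing:
        print("Install these first: pip install " + " ".join(missing))
    else:
        print("sagemaker and boto3 are importable")
```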

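Once you are sure your local machine is set up correctly to interact with SageMaker, you can bring your own data, train a factorization machine classification model with SageMaker, download the model, and make predictions. First, let's look at how to prepare your data for training.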

Prepare your data

Before you can train a model, the data needs to be uploaded to S3. The format of the input data depends on the algorithm you choose; for SageMaker's Factorization Machines algorithm, protobuf is typically used.

To begin, you need to preprocess your data (cleaning, one-hot encoding, etc.) and split both the features (X) and the labels (y) into train and test sets. Sometimes, you may also want to set a validation set aside.
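As a rough illustration of what this preprocessing step involves, here is a minimal, dependency-free sketch of one-hot encoding and a shuffled train/test split. The function names and the 80/20 split are our own assumptions for the example; in practice you would more likely use the equivalents in pandas or scikit-learn.

```python
import random


def one_hot(values):
    """One-hot encode a list of categorical values.

    Returns (rows, categories), where rows[i][j] is 1.0 if values[i] == categories[j].
    """
    categories = sorted(set(values))
    index = {category: j for j, category in enumerate(categories)}
    rows = [[1.0 if index[v] == j else 0.0 for j in range(len(categories))] for v in values]
    return rows, categories


def train_test_split(features, labels, test_fraction=0.2, seed=0):
    """Shuffle row indices with a fixed seed, then split features and labels together."""
    order = list(range(len(features)))
    random.Random(seed).shuffle(order)
    n_test = int(round(len(order) * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]
    x_train = [features[i] for i in train_idx]
    x_test = [features[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


if __name__ == "__main__":
    X, cats = one_hot(["red", "blue", "red", "green", "blue"])
    y = [1, 0, 1, 0, 0]
    x_train, x_test, y_train, y_test = train_test_split(X, y)
    print(cats, len(x_train), len(x_test))
```

Splitting by shuffled indices keeps each feature row paired with its label, which is the property that matters before the data is written to protobuf.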

Once you have the features (X) and labels (y), use the following Python code to transform them into protobuf and upload the result to your S3 bucket. Run this for both the train and the test sets.

[code language="python"]
import io
import boto3
import sagemaker.amazon.common as smac
# after lots of data cleaning, preprocessing, feature engineering, split into train, test etc.
feature = your_features
label = your_labels
# define the S3 path to store data; the data will be uploaded to s3://{bucket}/{prefix}/{key} where key is the file name
bucket = your_S3_bucket_name
prefix = your_prefix_name
key = 'train.protobuf'  # or 'test.protobuf'
# transform the feature and label into protobuf
buf = io.BytesIO()
# if the feature is a numpy array, use smac.write_numpy_to_dense_tensor(buf, feature, label);
# if the feature is a sparse matrix, use smac.write_spmatrix_to_sparse_tensor(buf, feature, label)
smac.write_numpy_to_dense_tensor(buf, feature, label)
buf.seek(0)
# upload the protobuf to S3; S3 keys always use '/', so build the key explicitly instead of using os.path.join
boto3.resource('s3').Bucket(bucket).Object(f'{prefix}/{key}').upload_fileobj(buf)
path_to_train_data = f's3://{bucket}/{prefix}/{key}'
print(f'uploaded training data location: {path_to_train_data}')
[/code]

At this point, you have uploaded your train and test data to S3. You can go to the AWS console, select S3, and check the protobuf files you just uploaded.
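If you would rather check the upload from code than from the console, one option is to ask S3 for the object's metadata. The sketch below assumes a boto3 S3 client is passed in; the helper name uploaded_object_size is hypothetical. It relies on the client's head_object call, which raises an error if the object does not exist.

```python
def uploaded_object_size(s3_client, bucket, key):
    """Return the size in bytes of s3://bucket/key, as reported by S3."""
    response = s3_client.head_object(Bucket=bucket, Key=key)
    return response["ContentLength"]


# Usage with real AWS credentials (not executed here; bucket, prefix and key
# are the placeholder variables from the upload step):
#   import boto3
#   size = uploaded_object_size(boto3.client("s3"), bucket, f"{prefix}/{key}")
#   print(f"train.protobuf is {size} bytes")
```

Passing the client in as an argument also makes the helper easy to exercise with a stand-in object when no AWS access is available.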

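Submit a training job

Once your data is ready, you can define your estimator and submit a training job. The code below defines a factorization machine estimator and fits it to the data.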

[code language="python"]
import boto3
import sagemaker
from sagemaker.amazon.amazon_estimator import get_image_uri
# note: these names follow the SageMaker Python SDK as of this post; later major versions renamed some of them
bucket = your_S3_bucket_name
output_prefix = your_prefix_name_for_output_model
role = your_full_IAM_role_arn_string  # full IAM role ARN; see the "Set up your local machine" section
path_to_train_data = your_path_to_train_data  # from the "Prepare your data" step above
path_to_test_data = your_path_to_test_data
job_name = None  # you can name your job. Otherwise, sagemaker will auto-assign a job name
output_path = 's3://{}/{}/factorization_machine_output'.format(bucket, output_prefix)
container = get_image_uri(boto3.Session(region_name='us-east-1').region_name, 'factorization-machines')
estimator = sagemaker.estimator.Estimator(container, role, train_instance_count=1, train_instance_type='ml.c4.xlarge', output_path=output_path, sagemaker_session=sagemaker.Session())
# feature is the training feature matrix from the "Prepare your data" step
estimator.set_hyperparameters(feature_dim=feature.shape[1], predictor_type='binary_classifier', num_factors=100)
# run training job
estimator.fit({'train': path_to_train_data, 'test': path_to_test_data}, wait=False, job_name=job_name)
training_job_name = estimator.latest_training_job.job_name
[/code]

Model hyperparameters can be changed by calling the set_hyperparameters method. If you are not sure what the optimal values are, you can try the hyperparameter tuner described later in this post.

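The estimator's fit method has a wait argument, which is set to True by default. This means that no code below that line will run until the fit (i.e., model training) finishes. We found this very inconvenient, especially when you want to submit several training jobs at the same time. We therefore set wait=False. You can then check the job status either in the AWS console (select SageMaker -> Training -> Training jobs) or by running the following code.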

[code language="python"]
import boto3

def get_training_job_status(training_job_name: str):
    job_info = boto3.client('sagemaker').describe_training_job(TrainingJobName=training_job_name)
    job_status = job_info['TrainingJobStatus']
    if job_status == 'Failed':
        message = job_info['FailureReason']
        print(f'Training failed with the following error: {message}')
    return job_status, job_info

job_status, job_info = get_training_job_status(training_job_name)
if job_status != 'Completed':
    print(f'Reminder: Training job {training_job_name} has not been completed. Cannot get model, or evaluate it.')
else:
    s3_model_artifact_path = job_info['ModelArtifacts']['S3ModelArtifacts']
    print('path to the output model artifacts:', s3_model_artifact_path)
[/code]
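When several jobs are submitted with wait=False, it can be handy to poll them together until they all finish. The following is a minimal sketch of such a loop, not something the SDK provides: wait_for_jobs and its parameters are hypothetical names, and the status lookup is passed in as a function so the loop itself has no AWS dependency. We treat Completed, Failed and Stopped as the states in which a training job is finished.

```python
import time

# Training job statuses after which the job will not change any more.
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_jobs(job_names, get_status, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll until every job reaches a terminal state or max_polls is exhausted.

    get_status is any callable that maps a job name to its status string.
    Returns a dict {job_name: last_seen_status}.
    """
    statuses = {name: None for name in job_names}
    for _ in range(max_polls):
        for name in job_names:
            # Only re-query jobs that have not finished yet.
            if statuses[name] not in TERMINAL_STATES:
                statuses[name] = get_status(name)
        if all(status in TERMINAL_STATES for status in statuses.values()):
            break
        sleep(poll_seconds)
    return statuses


# Usage with the status function from this post (not executed here):
#   statuses = wait_for_jobs([training_job_name],
#                            lambda name: get_training_job_status(name)[0])
```

The max_polls cap keeps the loop from running forever if a job gets stuck; with the defaults it gives up after roughly an hour.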

Download the trained model

After SageMaker trains the model, a model artifact is stored in S3. You can download it and access the model coefficients locally.

The way to access the model differs from algorithm to algorithm. Here we only show how to access the model coefficients for SageMaker's factorization machine model. Please note that this may NOT apply to other algorithms.

First, download the model artifact produced by the training job:

[code language="python"]
local_name = 'model_fm.tar.gz'
bucket = s3_model_artifact_path.split('s3://')[1].split('/')[0]
key = s3_model_artifact_path.split(bucket + '/')[1]
s3 = boto3.resource('s3')
s3.Bucket(bucket).download_file(key, local_name)
[/code]
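Splitting the artifact path on strings works for typical paths, but it can misfire when the bucket name appears again inside the key. As a more defensive alternative, here is a small sketch that parses the URI with the standard library; split_s3_uri is a hypothetical helper name.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


if __name__ == "__main__":
    # The example path is made up for illustration.
    print(split_s3_uri("s3://my-bucket/fm/output/job-1/output/model.tar.gz"))
```

The returned bucket and key can be passed to s3.Bucket(bucket).download_file(key, local_name) exactly as in the code above.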

Next, extract the contents:

[code language="python"]
# shell commands, run from a Jupyter/IPython cell (the leading "!" is the shell escape)
!tar -zxvf model_fm.tar.gz
!unzip -o model_algo-1
!mv params model_fm-0000.params
!mv symbol.json model_fm-symbol.json
[/code]

Finally, load the model into an mx.module object, so that you can read the information stored inside the model.

[code language="python"]
import mxnet as mx
mx_model = mx.module.Module.load("./model_fm", 0, False, label_names=['out_label'])
[/code]

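For a factorization machine model, mx_model._arg_params has three keys: w0_weight (the bias), w1_weight (the weights of the linear terms), and v (the weights of the factorized, reduced-dimension space). You can look at their values to learn more about your model.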
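To make the role of these three parameter groups concrete, here is a sketch of the standard factorization machine scoring formula in plain Python: bias, plus linear terms, plus pairwise interactions weighted by the dot products of the factor vectors. It assumes v holds one row of factors per feature, and that a binary classifier's probability is the sigmoid of the raw score; both are our assumptions for illustration, so compare against the model's own predictions before relying on them.

```python
import math


def fm_score(x, w0, w1, v):
    """Raw factorization machine score for one feature vector.

    x  : list of n feature values
    w0 : bias (float)
    w1 : list of n linear weights
    v  : list of n rows, each holding k factor weights for one feature
    """
    linear = sum(w * xi for w, xi in zip(w1, x))
    pairwise = 0.0
    for f in range(len(v[0])):
        # Identity: sum_{i<j} v_if v_jf x_i x_j = 0.5 * ((sum_i v_if x_i)^2 - sum_i (v_if x_i)^2)
        s = sum(v[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((v[i][f] * x[i]) ** 2 for i in range(len(x)))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise


def fm_probability(x, w0, w1, v):
    """Sigmoid of the raw score (assumed link for a binary classifier)."""
    return 1.0 / (1.0 + math.exp(-fm_score(x, w0, w1, v)))


if __name__ == "__main__":
    # Toy numbers, not taken from a trained model.
    print(fm_score([1.0, 2.0, 0.0], 0.5, [0.1, 0.2, 0.3], [[1.0, 0.0], [0.5, 1.0], [2.0, 2.0]]))
```

With a loaded model, the inputs would come from converting the entries of mx_model._arg_params to NumPy arrays with asnumpy() and reshaping them to match the layout assumed above.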

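Make predictions locally

After you have loaded the model locally, you can apply it to your test data and make predictions without paying AWS.

If you have a small amount of data, you can make predictions by running the make_prediction_dense function below.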

[code language="python"]
import mxnet as mx
import numpy as np

def make_prediction_dense(model: mx.module.Module, x_array: np.ndarray, batch_size: int = 100):
    data_iter = mx.io.NDArrayIter(data=x_array, batch_size=batch_size)
    model.bind(data_shapes=data_iter.provide_data)
    prediction = model.predict(data_iter).asnumpy().flatten()
    return model, prediction
[/code]

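If you have a large amount of data, make_prediction_dense will take a long time to finish. In that case, we recommend converting your input data x_array into a scipy sparse matrix before running the prediction.

You have now used SageMaker to obtain a factorization machine model and can make predictions with it. The model you trained is based on one specific set of hyperparameter values. You may ask: how do I know the best values for the hyperparameters? SageMaker's hyperparameter tuner can help you find out.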

Hyperparameter tuner

In many cases, you do not know the optimal values for the model hyperparameters, so you would like to tune the model. SageMaker's hyperparameter tuner uses Bayesian optimization to search for them.

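Unless you use CategoricalParameter to define a hyperparameter range, the hyperparameter tuner does not explore every possible value in the defined range; instead, it concentrates its training jobs on the most promising regions. In each iteration, the values to test are chosen based on everything the tuner has learned about the problem so far. The process is stochastic, which makes it useful for tuning complex models where exploring every possible combination is infeasible. On the other hand, because it is stochastic, the tuner may fail to converge on the best answer even when the specified ranges are right.

We recommend spending some time exploring the hyperparameter ranges and gradually narrowing them, so that the tuner is more likely to converge around the best answer, and to do so faster. There is also a small trade-off between max_parallel_jobs and the quality of the final model: a larger max_parallel_jobs reduces the overall tuning time, but a smaller max_parallel_jobs may yield a slightly better result.

If you'd like to dig further, you can use this sample notebook to plot how the objective metric and hyperparameter values change over time. It can help you see whether the hyperparameter tuner has converged. With this information, you can adjust your hyperparameter ranges and max_jobs accordingly.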
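To tie the pieces together, here is a hedged sketch of how a tuner might be set up for the factorization machine estimator. The hyperparameter names (factors_lr, mini_batch_size), their ranges, and the objective metric name are assumptions for illustration; check them against the algorithm's tuning documentation before using them. The small sequential_rounds helper is our own and only illustrates the trade-off: fewer parallel jobs mean more sequential rounds in which the search can use earlier results.

```python
import math


def sequential_rounds(max_jobs, max_parallel_jobs):
    """Approximate number of back-to-back batches the tuner runs."""
    return math.ceil(max_jobs / max_parallel_jobs)


def build_tuner(estimator, max_jobs=20, max_parallel_jobs=2):
    """Create a HyperparameterTuner for an already configured estimator."""
    # Imported lazily so sequential_rounds works without the SDK installed.
    from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

    # Assumed names and ranges, for illustration only.
    ranges = {
        "factors_lr": ContinuousParameter(1e-5, 1e-2),
        "mini_batch_size": IntegerParameter(100, 1000),
    }
    return HyperparameterTuner(
        estimator=estimator,
        objective_metric_name="test:binary_classification_accuracy",  # assumed metric name
        hyperparameter_ranges=ranges,
        objective_type="Maximize",
        max_jobs=max_jobs,
        max_parallel_jobs=max_parallel_jobs,
    )


# Usage (not executed here; estimator and the data paths come from the steps above):
#   tuner = build_tuner(estimator)
#   tuner.fit({'train': path_to_train_data, 'test': path_to_test_data})
```

For example, 20 jobs with 2 in parallel give about 10 rounds of feedback, while 20 jobs with 20 in parallel give a single round, which behaves much like a random search.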

Conclusion

Overall, SageMaker is a very powerful machine learning service. It allows you to train a complex model on a large dataset and deploy the model without worrying about the messy infrastructural details. SageMaker provides many best-in-class built-in algorithms, and also allows you to bring your own model. Besides, you can use machine learning frameworks such as Scikit-learn and TensorFlow with SageMaker. There are many sample notebooks, so you can learn by doing.

That being said, we think there is still room to improve:

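1) Difficult to troubleshoot. Because SageMaker is relatively new, you can hardly find solutions to your questions on places like Stack Overflow. From my experience, these are the best resources for troubleshooting:

The Python SDK repo: check the source code for information that is not described in the SageMaker documentation.
The SageMaker forum: you may find answers to your questions there, and you can also post your own; people from AWS usually respond within a day or two.
The AWS Support Center: you can create a ticket there, and the support team will answer your question.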

2) Incomplete documentation. For example, when we were using SageMaker, the documentation did not cover how to extract the model coefficients or how to set up the hyperparameter values for tuning. We found the answers by looking into the sample notebooks, the AWS blog, and the SageMaker forum.

3) Not flexible enough. For example, when using SageMaker's factorization machines with hyperparameter tuning, there are very few objective metrics to choose from. It is also still unclear how to run cross-validation with SageMaker's built-in algorithms.

SageMaker has many functionalities, and this post is based on initial experimentation only. We plan to continue exploring other areas of SageMaker, such as how to bring our own model and how to use Scikit-learn and Spark in SageMaker.

Want to know more?

For any questions, feedback, or inquiries, please contact us about a Copilot strategy for your digital media campaigns: xaxcopilotproduct@xaxis.com