Steps to Train a Machine Learning Model with Amazon SageMaker - First Look

Published October 22, 2019

blog, insights, technology

By Jiayi Xie

SageMaker is a machine learning service managed by Amazon. It’s basically a service that combines EC2, ECR and S3 all together, allowing you to train complex machine learning models quickly and easily, and then deploy the model into a production-ready hosted environment. It provides many best-in-class built-in algorithms, such as Factorization Machines, XGBoost etc. It also allows you to train models using various machine learning frameworks, such as Apache MXNet, TensorFlow, and Scikit-learn.

A straightforward way to interact with SageMaker is through a notebook instance, a process Amazon describes in detail in its documentation. We use SageMaker slightly differently: we only want it for the model training step, so that we can train a complex model on a large dataset without worrying about the messy infrastructure details. The rest of the process (e.g., data preparation and making predictions) runs locally. So, in our use case, we want to:

1. Interact with SageMaker jobs from a local machine, without using a SageMaker notebook instance. Why? There are a few advantages:

It takes a few minutes to start a notebook instance, which is slow.
Unless you manually stop the instance, you are charged for as long as it is running, whether or not you are actively using it. If you submit the training job from your local machine instead, you are only charged for the model training itself.
With the code stored locally, you can debug in your IDE and use GitHub for version control.

2. Access the trained model locally, so that we can:

Look at the model details, instead of using the model as a black box.
Make predictions locally, and use the model in our own way.

The rest of this post covers how we did it in 5 steps:

Set up your local machine, so that you can interact with SageMaker jobs locally
Prepare your data
Submit the training job
Download the trained model
Make predictions locally

At the end, we also briefly show how to use SageMaker's hyperparameter tuner, which helps you tune your machine learning model.

Set up your local machine

To interact with SageMaker jobs programmatically and locally, you need to install the SageMaker Python SDK and the AWS SDK for Python (boto3). You can install both by running pip install sagemaker boto3

The easiest way to test whether your local environment is ready is to run through a sample notebook, for example, An Introduction to Factorization Machines with MNIST. Run this sample notebook and check whether you need to install additional packages, or whether any AWS credential information is missing.
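Before opening the sample notebook, a quick import check can catch missing packages. The sketch below is only an illustration: missing_packages is a hypothetical helper name, and it checks that the packages can be found, not that your AWS credentials are configured.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(["sagemaker", "boto3"])
    if missing:
        print("Missing packages, install with: pip install " + " ".join(missing))
    else:
        print("sagemaker and boto3 are importable")
```

Credential problems usually only show up on the first real AWS call, which is why running the sample notebook is still worthwhile.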

Once you are sure your local machine is correctly set up to interact with SageMaker, you can bring your own data, train a Factorization Machines classification model using SageMaker, download the model, and make predictions. To begin, let's look at how to prepare your data for training.

Prepare your data

Before you can train a model, the data needs to be uploaded to S3. The format of the input data depends on the algorithm you choose; for SageMaker's Factorization Machines algorithm, protobuf (the recordIO-protobuf format) is typically used.

To begin, preprocess your data (cleaning, one-hot encoding, etc.) and split both the features (X) and the labels (y) into train and test sets. Sometimes you may also want to set aside a validation set.
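As a rough illustration of those two preprocessing steps, here is a minimal pure-Python sketch. The helper names one_hot_encode and train_test_split are chosen for this example only; in a real project you would more likely use pandas or scikit-learn for both steps.

```python
import random


def one_hot_encode(values):
    """Map each categorical value to a 0/1 vector, one column per distinct category."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0.0] * len(categories)
        row[index[value]] = 1.0
        rows.append(row)
    return rows, categories


def train_test_split(features, labels, test_fraction=0.2, seed=0):
    """Shuffle row indices reproducibly and hold out the first `test_fraction` of them."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [features[i] for i in train_idx]
    x_test = [features[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


if __name__ == "__main__":
    colors = ["red", "blue", "red", "green", "blue"]
    clicked = [1, 0, 1, 0, 0]
    encoded, categories = one_hot_encode(colors)
    x_train, x_test, y_train, y_test = train_test_split(encoded, clicked, test_fraction=0.4)
    print(categories, len(x_train), len(x_test))
```

Features and labels are shuffled with the same indices, so each row stays paired with its label.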

Once you have the features (X) and labels (y), use the following Python code to transform them into protobuf and upload them to an S3 bucket. Run it for both the train and test sets.

[code language="python"]
import io

import boto3
import sagemaker.amazon.common as smac

# after lots of data cleaning, preprocessing, feature engineering, split into train, test etc.
feature = your_features
label = your_labels
# define the S3 path to store data; the data will be uploaded to s3://{bucket}/{prefix}/{key}, where key is the file name
bucket = your_S3_bucket_name
prefix = your_prefix_name
key = 'train.protobuf'  # or 'test.protobuf'
# transform the feature and label into protobuf, written to an in-memory buffer
buf = io.BytesIO()
# if the feature is a numpy array, use smac.write_numpy_to_dense_tensor(buf, feature, label);
# if the feature is a sparse matrix, use smac.write_spmatrix_to_sparse_tensor(buf, feature, label)
smac.write_numpy_to_dense_tensor(buf, feature, label)
buf.seek(0)  # rewind the buffer so the upload starts from the first byte
# upload the protobuf to S3 (S3 keys always use '/', so build the key without os.path.join)
boto3.resource('s3').Bucket(bucket).Object(f'{prefix}/{key}').upload_fileobj(buf)
path_to_train_data = f's3://{bucket}/{prefix}/{key}'
print(f'uploaded training data location: {path_to_train_data}')
[/code]

At this point, you have uploaded your train and test data to S3. You can go to the AWS console, select S3, and check the protobuf files you just uploaded.

Submit the training job

Once your data is ready, you can define your estimator and submit a training job. The code below defines a Factorization Machines estimator and fits the data to it:

[code language="python"]
import boto3
import sagemaker
from sagemaker.amazon.amazon_estimator import get_image_uri

# note: get_image_uri and the train_instance_* arguments belong to version 1 of the SageMaker Python SDK
output_prefix = your_prefix_name_for_output_model
role = your_full_IAM_role_arn_string  # the "Set up your local machine" section describes how to get this string
path_to_train_data = your_path_to_train_data  # from the "Prepare your data" step above
path_to_test_data = your_path_to_test_data  # the same step, run for the test set
job_name = None  # you can name your job; otherwise, SageMaker will auto-assign a job name
# bucket is the S3 bucket name defined in the "Prepare your data" step
output_path = 's3://{}/{}/factorization_machine_output'.format(bucket, output_prefix)
container = get_image_uri(boto3.Session(region_name='us-east-1').region_name, 'factorization-machines')
estimator = sagemaker.estimator.Estimator(container, role, train_instance_count=1, train_instance_type='ml.c4.xlarge', output_path=output_path, sagemaker_session=sagemaker.Session())
estimator.set_hyperparameters(feature_dim=feature.shape[1], predictor_type='binary_classifier', num_factors=100)
# run training job
estimator.fit({'train': path_to_train_data, 'test': path_to_test_data}, wait=False, job_name=job_name)
training_job_name = estimator.latest_training_job.job_name
[/code]

Model hyperparameters can be changed by calling the set_hyperparameters method. If you are not sure what the optimal values are, you can try the hyperparameter tuner described later in this post.

The estimator's fit method has a wait parameter, which is set to True by default. With that default, no code below the fit call runs until fitting (i.e., model training) has finished. We therefore set wait=False, and you can check the job status either in the AWS console (select SageMaker -> Training -> Training jobs) or by running the following code:

[code language="python"]
import boto3


def get_training_job_status(training_job_name: str):
    # ask SageMaker for the current description of the training job
    job_info = boto3.client('sagemaker').describe_training_job(TrainingJobName=training_job_name)
    job_status = job_info['TrainingJobStatus']
    if job_status == 'Failed':
        message = job_info['FailureReason']
        print(f'Training failed with the following error: {message}')
    return job_status, job_info


job_status, job_info = get_training_job_status(training_job_name)
if job_status != 'Completed':
    print(f'Reminder: Training job {training_job_name} has not completed. Cannot get the model or evaluate it.')
else:
    s3_model_artifact_path = job_info['ModelArtifacts']['S3ModelArtifacts']
    print('path to the output model artifacts:', s3_model_artifact_path)
[/code]
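Because the job was submitted with wait=False, you may want to poll until it reaches a terminal state rather than re-running the check by hand. The following is a minimal sketch under stated assumptions: wait_for_training_job is a hypothetical helper, and get_status is any zero-argument callable that returns the TrainingJobStatus string.

```python
import time

# statuses after which a SageMaker training job no longer changes
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(get_status, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Call get_status() until it returns a terminal state or max_polls is reached.

    Returns the last status seen, which may be non-terminal if polling gave up.
    """
    status = None
    for attempt in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if attempt < max_polls - 1:
            sleep(poll_seconds)  # pause before asking again
    return status
```

With the function defined earlier, a call could look like wait_for_training_job(lambda: get_training_job_status(training_job_name)[0]). The sleep argument is injectable only so that the loop can be exercised without real waiting.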

Download the trained model

After SageMaker trains the model, a model artifact is stored in S3. You can download it and access the model coefficients locally.

The way to access the model differs from algorithm to algorithm. Here we only show how to access the model coefficients for SageMaker's Factorization Machines model; please note that this may NOT apply to other algorithms.

First, download the model artifact output by the training job:

[code language="python"]
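local_name = 'model_fm.tar.gz'
# split 's3://bucket/path/to/model.tar.gz' into the bucket name and the object key;
# splitting only once keeps the key intact even if it contains the bucket name again
bucket, key = s3_model_artifact_path[len('s3://'):].split('/', 1)
s3 = boto3.resource('s3')
s3.Bucket(bucket).download_file(key, local_name)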
[/code]

Next, extract the model files:

[code language="python"]
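# shell commands run from a Jupyter/IPython cell (note the leading "!")
# unpack the artifact, which contains the file model_algo-1
!tar -zxvf model_fm.tar.gz
# model_algo-1 is itself a zip archive holding params and symbol.json
!unzip -o model_algo-1
# rename both files to the <prefix>-<epoch>.params / <prefix>-symbol.json names that MXNet loads
!mv params model_fm-0000.params
!mv symbol.json model_fm-symbol.json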
[/code]
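The commands above rely on a notebook cell and on tar, unzip and mv being available. If you run from a plain Python script instead, the same steps can be sketched with the standard library. This assumes the archive layout used above (a model_algo-1 zip containing params and symbol.json); extract_fm_artifact is a hypothetical helper name.

```python
import os
import shutil
import tarfile
import zipfile


def extract_fm_artifact(archive_path, out_dir=".", prefix="model_fm"):
    """Unpack the artifact and rename its files to the names MXNet expects.

    Only use this on archives you trust, since extractall writes whatever the archive contains.
    """
    # step 1: the .tar.gz contains a single file named model_algo-1
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(out_dir)
    # step 2: model_algo-1 is a zip archive holding params and symbol.json
    with zipfile.ZipFile(os.path.join(out_dir, "model_algo-1")) as inner:
        inner.extractall(out_dir)
    # step 3: rename to <prefix>-0000.params and <prefix>-symbol.json
    params_path = os.path.join(out_dir, f"{prefix}-0000.params")
    symbol_path = os.path.join(out_dir, f"{prefix}-symbol.json")
    shutil.move(os.path.join(out_dir, "params"), params_path)
    shutil.move(os.path.join(out_dir, "symbol.json"), symbol_path)
    return params_path, symbol_path
```

Calling extract_fm_artifact('model_fm.tar.gz') in the download directory should leave the same two files as the shell version.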

Finally, load the model into an mx.module so that you can read the information stored in it.

[code language="python"]
import mxnet as mx
mx_model = mx.module.Module.load("./model_fm", 0, False, label_names=['out_label'])
[/code]

For a Factorization Machines model, mx_model._arg_params has three keys: w0_weight (the bias), w1_weight (the weights of the linear terms), and v (the weights of the reduced-dimension factorization space). You can look at their values to better understand your model.
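To see how those three sets of coefficients combine, the sketch below evaluates the standard factorization machine equation (bias, plus linear terms, plus pairwise interactions weighted by dot products of factor vectors) for one dense feature vector. It is an illustration rather than SageMaker's own code: fm_score and fm_probability are hypothetical names, the inputs are plain Python lists, and applying a sigmoid for the binary classifier is an assumption. The arrays loaded from mx_model._arg_params would first need converting, and their shapes checking, before being passed in.

```python
import math


def fm_score(x, w0, w1, v):
    """Raw factorization machine score for one feature vector.

    x:  list of n feature values
    w0: float bias
    w1: list of n linear weights
    v:  list of n rows, each with k factor weights
    """
    linear = sum(w_i * x_i for w_i, x_i in zip(w1, x))
    pairwise = 0.0
    num_factors = len(v[0]) if v else 0
    for f in range(num_factors):
        # identity: sum_{i<j} v_if v_jf x_i x_j = 0.5 * ((sum_i v_if x_i)^2 - sum_i (v_if x_i)^2)
        sum_vx = sum(v[i][f] * x[i] for i in range(len(x)))
        sum_v2x2 = sum((v[i][f] * x[i]) ** 2 for i in range(len(x)))
        pairwise += 0.5 * (sum_vx ** 2 - sum_v2x2)
    return w0 + linear + pairwise


def fm_probability(x, w0, w1, v):
    """Squash the raw score with a sigmoid (assumed here for the binary classifier)."""
    return 1.0 / (1.0 + math.exp(-fm_score(x, w0, w1, v)))
```

Comparing such a hand-computed value with the output of the loaded model on a few rows is one way to check that you have read the coefficients correctly.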

Make predictions locally

After loading the model locally, you can apply it to your test data and make predictions, without paying AWS.

If you have a small amount of data, you can make predictions by running the make_prediction_dense function below.

[code language="python"]
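import mxnet as mx
import numpy as np


def make_prediction_dense(model: mx.module.Module, x_array: np.ndarray, batch_size: int = 100):
    # wrap the dense feature array in an iterator that feeds the model batch by batch
    data_iter = mx.io.NDArrayIter(data=x_array, batch_size=batch_size)
    # bind the model to the input shape before predicting
    model.bind(data_shapes=data_iter.provide_data)
    prediction = model.predict(data_iter).asnumpy().flatten()
    return model, prediction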
[/code]

If you have a large amount of data, make_prediction_dense would take a long time to finish. In that case, we suggest converting your input data x_array to a scipy sparse matrix before running the prediction.

You have now obtained a Factorization Machines model using SageMaker, and you can make predictions with it! The model you trained is based on one specific set of hyperparameter values. You may wonder: how do I know the optimal values for the hyperparameters? SageMaker's hyperparameter tuner will help you find the answer.

Hyperparameter tuner

In many cases, you do not know the optimal values for the model hyperparameters, so you would like to tune the model. SageMaker's hyperparameter tuner uses Bayesian optimization to search for the optimal model hyperparameters, as described in the AWS documentation.

Unless you use CategoricalParameter to define the hyperparameter range, the hyperparameter tuner cannot explore every possible value within the defined range; instead, it focuses its training effort on the most promising regions. In each iteration, the value to try is chosen based on everything the tuner knows about the problem so far. Because the process is stochastic, it is very useful for tuning complex models, where exploring all possible combinations is impossible. On the other hand, being stochastic, the tuning may not converge on the best answer, even if the ranges you specified are correct.

We suggest taking some time to explore the hyperparameter ranges, and gradually narrowing the ranges to explore, so that the tuner is more likely to converge around the best answer more quickly. There is also a small trade-off between max_parallel_jobs and the quality of the final model: a larger max_parallel_jobs reduces the overall tuning time, but a smaller max_parallel_jobs will probably produce a slightly better result, since each new job can then draw on the results of more completed jobs.
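One simple way to narrow a range between tuning rounds is to look at where the best trials so far landed. The helper below is a hypothetical illustration and not part of SageMaker: it assumes you have copied each completed trial's hyperparameter value and objective value (higher is better) into a list of pairs, and it returns a tighter range around the top few.

```python
def narrow_range(trials, keep_top=3, padding=0.1):
    """Suggest a tighter (low, high) range around the best trials.

    trials:   list of (hyperparameter_value, objective_value) pairs, higher objective is better
    keep_top: how many of the best trials define the new range
    padding:  fraction of the new span added on each side, to avoid cutting off the optimum
    """
    if not trials:
        raise ValueError("need at least one completed trial")
    best = sorted(trials, key=lambda trial: trial[1], reverse=True)[:keep_top]
    values = [value for value, _ in best]
    low, high = min(values), max(values)
    margin = padding * (high - low)
    return low - margin, high + margin


if __name__ == "__main__":
    # made-up example values: (num_factors tried, validation accuracy observed)
    first_round = [(10, 0.70), (50, 0.82), (100, 0.85), (150, 0.84), (300, 0.75)]
    print(narrow_range(first_round))
```

The returned bounds are only a starting point. You would still round them for integer hyperparameters and clamp them to the values the algorithm accepts before starting the next tuning job.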

If you'd like to dig further, you can use this sample notebook to visualize how the objective metric and the hyperparameter values change over time. It helps you understand whether or not the hyperparameter tuner has converged. With this information, you can adjust the hyperparameter ranges and max_jobs accordingly.

Conclusion

Overall, SageMaker is a very powerful machine learning service. It allows you to train a complex model on a large dataset and deploy the model without worrying about the messy infrastructure details. SageMaker provides many best-in-class built-in algorithms, and also allows you to bring your own model. In addition, you can use machine learning frameworks such as Scikit-learn and TensorFlow with SageMaker. There are many sample notebooks, so you can learn by doing.

That being said, we think there is still room to improve:

1) Difficult to troubleshoot. Because SageMaker is relatively new, you can hardly find solutions to your questions on sites like Stack Overflow. In my experience, these are the best resources for troubleshooting:

The Python SDK repo: look at the source code to find information that is not described in the SageMaker documentation.
The SageMaker forum: you can find answers to your questions there and post your own, and AWS staff usually reply within a day or two.
The AWS support center: you can create a ticket there, and the support team will answer your question.

2) Incomplete documentation. For example, when we were using SageMaker, the documentation did not cover how to extract the model coefficients, or how to set up the hyperparameter values for tuning. We found the answers by looking into the sample notebooks, the AWS blog, and the SageMaker forum.

3) Not flexible enough. For example, when using SageMaker's Factorization Machines with hyperparameter tuning, there are very few objective metrics to choose from. It is also still unclear how to run cross-validation with SageMaker's built-in algorithms.

SageMaker has many functionalities, and this post is based on initial experimentation only. We plan to continue exploring other areas of SageMaker, such as how to bring our own model, and how to use Scikit-learn and Spark in SageMaker.

Want to know more?

Please get in touch with any questions or comments, or to inquire about a Copilot strategy for your digital media campaign: xaxcopilotproduct@xaxis.com