Selecting the best bidding model with automatic model selector (AMS) https://www.xaxis.com/selecting-the-best-bidding-model-with-automatic-model-selector-ams/ Wed, 03 Mar 2021 19:24:02 +0000 https://www.xaxis.com/?p=96471

Making decisions is hard. When you choose a restaurant, do you fall back on favorites or try something new? With the regular spots you know what you’re getting, so you’re lowering your chance of regret. But you could be missing out on something better. On the other hand, exploring a new restaurant increases your risk of suffering through sad and soggy scallion pancakes. We call the former behavior “exploitation” and the latter “exploration” -- essentially, exploiting the information you have versus exploring for new data. Exploitation and exploration have to be balanced for you to have a decent shot at sustainable, successful decision-making.

In online advertising, we may face the problem of choosing between different bidding models or strategies for an ad campaign. How should one balance exploitation and exploration to achieve the best performance? 

We recently published a paper on how to address this problem with our Automatic Model Selector (AMS). It’s a system for scalable online selection of bidding strategies based on live performance metrics. Yes, a human can set up multiple bidding strategies -- but AMS can choose the right one at the right time to maximize performance, automatically balancing exploration and exploitation.

The system employs Multi-Armed Bandits (MAB) to continuously run and evaluate multiple models against live traffic, allocating the most traffic to the best-performing model while decreasing traffic to those with poorer performance. It explores by giving non-zero traffic to all models so that each can be evaluated, and exploits by directing most traffic to the model that performs best. The extent of exploitation increases over time as the system gains more confidence in which model performs best. The figure below gives an overview of the components of the AMS system:

Figure 1. Components for the AMS system. ML Model Trainer provides trained ML models as the arms available to the MAB algorithm, run by MAB Model Selector. The selected model powers a bidding algorithm for a live RTB campaign running on a DSP. The campaign’s performance KPIs are tracked by the Performance Monitor, based on which the selection probabilities for the arms are updated. Models are swapped every 15 mins and performance KPIs are updated once a day.
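
To make the explore-exploit mechanism concrete, here is a minimal Thompson-sampling sketch of the arm-selection idea. This is not the production AMS code -- the model names, the binary KPI reward, and the update cadence are simplified assumptions.

[code language="python"]
import numpy as np

# Minimal Thompson-sampling sketch (not the production AMS code): each "arm" is a
# candidate bidding model, and a Beta posterior tracks its success rate on a
# binary KPI (e.g., "did this traffic slice beat the campaign's KPI target?").
class ModelSelector:
    def __init__(self, model_names):
        self.model_names = list(model_names)
        self.successes = {m: 1.0 for m in self.model_names}  # Beta(1, 1) prior
        self.failures = {m: 1.0 for m in self.model_names}

    def choose_model(self):
        # Explore-exploit: sample a plausible success rate for every model and
        # route the next traffic slice to the model with the highest draw.
        draws = {m: np.random.beta(self.successes[m], self.failures[m])
                 for m in self.model_names}
        return max(draws, key=draws.get)

    def update(self, model_name, kpi_success):
        # Called when the performance monitor reports back (e.g., once a day).
        if kpi_success:
            self.successes[model_name] += 1
        else:
            self.failures[model_name] += 1

# Usage sketch: pick a model for the next traffic window, then feed KPIs back.
selector = ModelSelector(['logistic_regression', 'factorization_machine', 'heuristic'])
active_model = selector.choose_model()
selector.update(active_model, kpi_success=True)
[/code]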


Compared to a traditional model evaluation process that depends heavily on humans, AMS has a few advantages:

  1. AMS regularly and automatically evaluates model performance on live data using your media metrics (e.g., CTR, CPC). This avoids relying on old or stale data, as well as inconsistencies between media metrics and machine learning metrics.
  2. AMS is flexible. It performs model selection for each campaign individually instead of choosing one model and applying it across all campaigns. Simpler models may perform better for small campaigns, while complex models may be needed for larger campaigns. AMS can find the model that works best for each campaign depending on the specific advertiser or market situation.
  3. AMS is scalable. It saves time by being self-sufficient at running controlled experiments and allows people to focus on high-level strategy. AMS evaluates model candidates on a case-by-case basis and systematically applies its findings to shift traffic between models.

This system was demonstrated to be effective in initial online experiments. While AMS is not yet reflected in our live product, this research is fueling our thought process for innovations to come. If you’re interested in learning more about the details of AMS, or the online experiment results, you can find the information in our published paper: Online and Scalable Model Selection with Multi-Armed Bandits.

Steps To Train A Machine Learning Model With Amazon Sagemaker — First Look https://www.xaxis.com/steps-to-train-a-machine-learning-model-with-amazon-sagemaker-first-look/ Tue, 22 Oct 2019 14:00:36 +0000 https://www.xaxis.com/?p=95094

SageMaker is a machine learning service managed by Amazon. It essentially combines EC2, ECR, and S3, allowing you to train complex machine learning models quickly and easily and then deploy them into a production-ready hosted environment. It provides many best-in-class built-in algorithms, such as Factorization Machines and XGBoost. It also allows you to train models using various machine learning frameworks, such as Apache MXNet, TensorFlow, and Scikit-learn.

A straightforward way to interact with SageMaker is through a notebook instance. This process is described in detail by Amazon (link). We use SageMaker in a slightly different way: we only want to use SageMaker for the model training part, so that we can train a complex model on a large dataset without worrying about the messy infrastructural details, while the rest of the process (e.g., data preparation, making predictions) runs locally. So, in our use case, we want to:

1. Interact with SageMaker jobs from a local machine, without using a SageMaker notebook instance. Why? Well, there are a few advantages:

  • it takes a few minutes to start a notebook instance, which is slow
  • unless you manually stop the instance, you are charged for the running instance whether or not you are actively using it. If you submit the training job from your local machine instead, you are only charged for the model training itself
  • if the code sits locally, you can use your IDE to debug and GitHub for version control

2. Access the trained model locally, so that we can

  • look at the details of the model, instead of using the model as a black box
  • make predictions locally, and use the model in our own way

The rest of this post will cover how we did that in 5 steps:

  1. Set up your local machine, so that you can interact with SageMaker jobs locally.
  2. Prepare your data
  3. Submit the training job
  4. Download the trained model
  5. Make predictions locally

At the end, we’ll also briefly show you how to use SageMaker’s hyperparameter tuner which helps you tune the machine learning model.

Set up your local machine

To interact with SageMaker jobs programmatically from your local machine, you need to install the sagemaker Python SDK and the AWS SDK for Python (boto3). You can install both by running pip install sagemaker boto3.

The easiest way to test whether your local environment is ready is to run through a sample notebook, for example, An Introduction to Factorization Machines with MNIST. Run the sample notebook and check whether you need to install additional packages or whether any AWS credential information is missing.
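
Another quick sanity check (assuming a default AWS profile is configured locally) is to confirm from a Python session that your credentials and region resolve correctly:

[code language="python"]
import boto3
import sagemaker

# confirm which AWS account and region your local credentials resolve to
print(boto3.client('sts').get_caller_identity()['Account'])
print(boto3.Session().region_name)

# creating a SageMaker session and asking for its default bucket also verifies
# that the SDK can reach S3 with those credentials
print(sagemaker.Session().default_bucket())
[/code]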

Once you are sure that your local machine is properly set up to interact with SageMaker, you can bring your own data, train a Factorization Machine classification model using SageMaker, download the model, and make predictions. To start, let’s look at how to prepare your data for training.

Prepare your data

Before you can train a model, the data needs to be uploaded to S3. The format of the input data depends on the algorithm you choose; for SageMaker’s Factorization Machine algorithm, protobuf is typically used.

To begin, you need to preprocess your data (cleaning, one-hot encoding, etc.) and split both the features (X) and labels (y) into train and test sets. Sometimes you may also want to set aside a validation set.
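
As a rough sketch of that step (your own cleaning and feature engineering will differ; the DataFrame df and its click column are hypothetical):

[code language="python"]
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical input: a DataFrame `df` with categorical features and a binary `click` label
y = df['click'].values.astype('float32')
X = pd.get_dummies(df.drop(columns=['click'])).values.astype('float32')  # one-hot encode

# hold out a test set; optionally split again to set aside a validation set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
[/code]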

After you have obtained the features (X) and labels (y), use the following Python code to transform them into protobuf and upload them to an S3 bucket. Run this for both the train and test sets.

[code language="python"]
import io
import os
import boto3
import sagemaker.amazon.common as smac

# after lots of data cleaning, preprocessing, feature engineering, split into train, test etc.
feature = your_features
label = your_labels

# define the S3 path to store data; the data will be uploaded to s3://{bucket}/{prefix}/{key} where key is the file name
bucket = your_S3_bucket_name
prefix = your_prefix_name
key = 'train.protobuf'  # or 'test.protobuf'

# transform the feature and label into protobuf:
# if the feature is a numpy array, use smac.write_numpy_to_dense_tensor(buf, feature, label);
# if the feature is a sparse matrix, use smac.write_spmatrix_to_sparse_tensor(buf, feature, label)
buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, feature, label)
buf.seek(0)

# upload the protobuf to S3
boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, key)).upload_fileobj(buf)
path_to_train_data = f's3://{bucket}/{prefix}/{key}'
print(f'uploaded training data location: {path_to_train_data}')
[/code]

At this point, you have uploaded your train and test data to S3. You can go to the AWS console, select S3, and check the protobuf files you just uploaded.
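
If you prefer to verify the upload from code rather than the console, a quick listing works too (reusing the bucket and prefix variables from the snippet above):

[code language="python"]
import boto3

# list the objects under the prefix to confirm train.protobuf / test.protobuf landed in S3
s3_client = boto3.client('s3')
response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
for obj in response.get('Contents', []):
    print(obj['Key'], obj['Size'], 'bytes')
[/code]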

Submit the training job

Once you have the data ready, you can then define your estimator and submit a training job. The code below defines a factorization machine estimator, and fits data to it:

[code language="python"]
import sagemaker
from sagemaker.amazon.amazon_estimator import get_image_uri

output_prefix = your_prefix_name_for_output_model
role = your_full_IAM_role_arn_string  # the "Set up your local machine" section describes how to get this string
path_to_train_data = your_path_to_train_data  # from the "Prepare your data" step above
path_to_test_data = your_path_to_test_data
job_name = None  # you can name your job; otherwise, SageMaker will auto-assign a job name
output_path = 's3://{}/{}/factorization_machine_output'.format(bucket, output_prefix)

# get the container image for the built-in factorization-machines algorithm in your region
container = get_image_uri('us-east-1', 'factorization-machines')
estimator = sagemaker.estimator.Estimator(container, role, train_instance_count=1, train_instance_type='ml.c4.xlarge', output_path=output_path, sagemaker_session=sagemaker.Session())
estimator.set_hyperparameters(feature_dim=feature.shape[1], predictor_type='binary_classifier', num_factors=100)

# run the training job
estimator.fit({'train': path_to_train_data, 'test': path_to_test_data}, wait=False, job_name=job_name)
training_job_name = estimator.latest_training_job.job_name
[/code]

Hyperparameters can be changed by calling the set_hyperparameters method. If you are not sure what the optimal values are, you can try the Hyperparameter Tuner described later in this post.

The estimator’s fit method has a wait parameter, which is set to True by default. That means any code below that line will not run until the fitting process (i.e., model training) is finished. We find this inconvenient, especially if you want to submit multiple training jobs at the same time. Therefore we set wait=False, and you can check the job status either by looking at the AWS console (select SageMaker -> Training -> Training jobs) or by running the following code:

[code language="python"]
import boto3

def get_training_job_status(training_job_name: str):
    # query SageMaker for the current status of the training job
    job_info = boto3.client('sagemaker').describe_training_job(TrainingJobName=training_job_name)
    job_status = job_info['TrainingJobStatus']
    if job_status == 'Failed':
        message = job_info['FailureReason']
        print(f'Training failed with the following error: {message}')
    return job_status, job_info

job_status, job_info = get_training_job_status(training_job_name)
if job_status != 'Completed':
    print(f'Reminder: Training job {training_job_name} has not been completed. Cannot get the model or evaluate it.')
else:
    s3_model_artifact_path = job_info['ModelArtifacts']['S3ModelArtifacts']
    print('path to the output model artifacts:', s3_model_artifact_path)
[/code]

Download the trained model

After SageMaker trains the model, a model artifact is stored in S3. You can download it and access the model coefficients locally.

The way to access the model differs from algorithm to algorithm; here we only show how to access the model coefficients for SageMaker’s factorization machine model. Please note that this may NOT apply to other algorithms.

First, download the model artifact output from the training job

[code language="python"]
local_name = 'model_fm.tar.gz'
bucket = s3_model_artifact_path.split('s3://')[1].split('/')[0]
key = s3_model_artifact_path.split(bucket + '/')[1]
s3 = boto3.resource('s3')
s3.Bucket(bucket).download_file(key, local_name)
[/code]

Next, extract the information out

[code language="python"]
# shell commands (run them in a terminal, or in IPython with the leading "!"):
# the artifact is a gzipped tarball containing a zipped MXNet model;
# unpack it and rename the files so MXNet can load them under the prefix "model_fm"
!tar -zxvf model_fm.tar.gz
!unzip -o model_algo-1
!mv params model_fm-0000.params
!mv symbol.json model_fm-symbol.json
[/code]

Finally, load the model into an mx.module.Module object so that you can read the information stored inside it.

[code language="python"]
import mxnet as mx
mx_model = mx.module.Module.load("./model_fm", 0, False, label_names=['out_label'])
[/code]

For a Factorization Machine model, mx_model._arg_params has three keys: w0_weight (the bias), w1_weight (the weights for the linear terms), and v (the weights for the reduced-dimension factorization space). You can look at their values to understand more about your model.
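
For example, you can pull the coefficients out as NumPy arrays and inspect their shapes (a small sketch based on our own inspection of the FM artifact; key names and shapes may differ for other algorithms or versions):

[code language="python"]
import numpy as np

# pull the three FM parameter blocks out as NumPy arrays and inspect their shapes
w0 = mx_model._arg_params['w0_weight'].asnumpy()  # bias term
w1 = mx_model._arg_params['w1_weight'].asnumpy()  # linear weights, one per feature
v = mx_model._arg_params['v'].asnumpy()           # factor matrix (feature_dim x num_factors)
print(w0.shape, w1.shape, v.shape)

# e.g., the largest linear weights point to the most influential individual features
top_linear = np.abs(w1.flatten()).argsort()[::-1][:10]
print('indices of the 10 strongest linear features:', top_linear)
[/code]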

Make predictions locally

After you have loaded the model locally, you can apply it to your test data and make predictions without paying AWS for inference.

If you have a small amount of data, you can make predictions by running the make_prediction_dense function below.

[code language="python"]
import mxnet as mx
import numpy as np

def make_prediction_dense(model: mx.module.Module, x_array: np.ndarray, batch_size: int = 100):
    # wrap the dense feature array in an iterator, bind the module to its shape, and predict
    data_iter = mx.io.NDArrayIter(data=x_array, batch_size=batch_size)
    model.bind(data_shapes=data_iter.provide_data)
    prediction = model.predict(data_iter).asnumpy().flatten()
    return model, prediction
[/code]

If you have a large amount of data, make_prediction_dense will take a long time to finish. In this case, we’d suggest transforming your input data x_array into a SciPy sparse matrix before running the prediction.
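
One way to do this (a hedged sketch of the idea only; the original function is not shown here) is to keep the feature matrix sparse in memory and densify only one batch at a time, so the working set stays small:

[code language="python"]
import numpy as np
import scipy.sparse as sp
import mxnet as mx

def make_prediction_sparse(model: mx.module.Module, x_csr: sp.csr_matrix, batch_size: int = 1000):
    # keep the feature matrix sparse in memory and densify only one batch at a time,
    # so memory stays bounded even for a very wide, mostly-zero feature matrix
    n_rows, n_cols = x_csr.shape
    # bind the module once to a fixed (batch_size, n_cols) input shape
    model.bind(data_shapes=[('data', (batch_size, n_cols))], for_training=False, force_rebind=True)
    preds = []
    for start in range(0, n_rows, batch_size):
        batch = x_csr[start:start + batch_size].toarray().astype('float32')
        if batch.shape[0] < batch_size:  # pad the final partial batch with zero rows
            pad = np.zeros((batch_size - batch.shape[0], n_cols), dtype='float32')
            batch = np.vstack([batch, pad])
        data_iter = mx.io.NDArrayIter(data=batch, batch_size=batch_size)
        preds.append(model.predict(data_iter).asnumpy().flatten())
    return np.concatenate(preds)[:n_rows]  # drop predictions for the padded rows
[/code]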

Now you have trained a factorization machine model using SageMaker and are able to make predictions with it! The model you trained is based on a specific set of hyperparameter values. You may ask: how do I know the optimal values for the hyperparameters? SageMaker’s Hyperparameter Tuner will help you find the answer.

Hyperparameter tuner

In many cases, you do not know the optimal values for the model hyperparameters and would like to tune the model. SageMaker’s hyperparameter tuner uses Bayesian optimization to find the optimal hyperparameters, as described here
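
To give a feel for the API, here is a sketch of a tuning job for the factorization machine estimator defined earlier. The ranges and the objective metric name below are illustrative assumptions, not recommendations; check the algorithm’s documentation for the hyperparameters and metrics available in your SDK version.

[code language="python"]
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter

# illustrative ranges only -- consult the factorization machines documentation for
# which hyperparameters are tunable, and pick ranges that suit your own data
hyperparameter_ranges = {
    'factors_lr': ContinuousParameter(1e-5, 1e-1),
    'factors_wd': ContinuousParameter(1e-8, 1e-2),
}

tuner = HyperparameterTuner(
    estimator=estimator,  # the estimator defined in the training step above
    objective_metric_name='test:binary_classification_accuracy',  # assumed metric name; check the built-in metrics for your algorithm
    objective_type='Maximize',
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=20,            # total number of training jobs the tuner may launch
    max_parallel_jobs=2,    # how many jobs run at once (see the trade-off discussed below)
)
tuner.fit({'train': path_to_train_data, 'test': path_to_test_data})
[/code]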

Unless you use CategoricalParameter to define a hyperparameter range, the Hyperparameter Tuner cannot explore all the possible values within the defined range; instead, it focuses its training efforts on the most promising regions. At each iteration, the values to test are based on everything the tuner knows about the problem so far. This process is stochastic, which makes it very helpful for tuning complex models where it is impossible to explore all the possible combinations. On the other hand, because it’s stochastic, the tuner may fail to converge on the best answer even if the ranges specified are correct.

We suggest you take some time to explore the hyperparameter ranges and gradually shrink them, so that the hyperparameter tuner is more likely to converge around the best answer faster. In addition, there is a small trade-off between max_parallel_jobs and the quality of the final model: a larger max_parallel_jobs decreases the overall tuning time, but a smaller max_parallel_jobs will probably generate a slightly better result.

If you’d like to dig further, you can use this sample notebook to visualize how the objective metric and hyperparameter values change over time. This helps you understand whether the hyperparameter tuner converged. With this information, you can adjust your hyperparameter ranges and max_jobs accordingly.

Conclusion

Overall, SageMaker is a very powerful machine learning service. It allows you to train a complex model on a large dataset and deploy the model without worrying about the messy infrastructural details. SageMaker provides lots of best-in-class built-in algorithms and allows you to bring your own model. You can also use machine learning frameworks such as Scikit-learn and TensorFlow with SageMaker. There are many sample notebooks, so you can learn by doing.

That being said, we think there is still room to improve:

1) Difficult to troubleshoot. Because SageMaker is relatively new, you can hardly find solutions to your questions in places like Stack Overflow. In our experience, these are the best resources for troubleshooting:

  • the Python SDK repo: look at the source code to find information that is not described in the SageMaker documentation
  • the SageMaker forum: you might find answers to your questions there; you can also post your own questions, and AWS staff will typically respond within a day or two
  • the AWS Support Center: you can create a ticket there, and the support team will answer your question

2) Incomplete documentation. For example, when we were using SageMaker, the documentation did not cover how to extract the model coefficients or how to set up the hyperparameter values for tuning. We found the answers by looking into the sample notebooks, the AWS blog, and the SageMaker forum.

3) Not flexible enough. For example, when using SageMaker’s factorization machines with hyperparameter tuning, there are very few objective metrics to choose from. It is also unclear how to run cross-validation with SageMaker’s built-in algorithms.

SageMaker has many functionalities, and this post is based on initial experimentation only. We plan to continue exploring other areas of SageMaker, such as how to bring our own models and how to use Scikit-learn and Spark with SageMaker.

Want to know more?

Please get in touch with any questions, feedback, or inquiries to get a Copilot strategy for your digital media campaign: xaxcopilotproduct@xaxis.com

The Impact Of Data Size On CTR Model Performance https://www.xaxis.com/the-impact-of-data-size-on-ctr-model-performance/ Tue, 22 Oct 2019 13:00:30 +0000 https://www.xaxis.com/?p=95074

In machine learning (ML), model performance is affected by both the data and the learning algorithm you choose. In general, you would expect that the more (good) data you collect, the more information you can extract and the better your model can perform. However, there is a saturation point where model performance stops improving even with additional data. Saturation happens when more data cannot help a model overcome the assumptions of the learning algorithm. For instance, when a linear model is used to classify data that contains nonlinear relationships, perfect prediction will never be reached, even with a lot of training data.

Let’s take a closer look at an example of how training data size affects ML model performance.

In online advertising, it’s critical to estimate the probability that a user will respond to an ad (either a click or a conversion). This probability indicates the user’s interest in a product or brand, and an accurate prediction improves both the user experience and the effectiveness of the advertising campaign. When predicting click-through rate (CTR), linear algorithms like Logistic Regression (LR) are often used, and here we’ll look at how data size affects LR performance. We also look at the results for Factorization Machines (FM) – an efficient algorithm for adding cross features to LR. Here, we concentrate on the effect of data size on model performance and leave the performance comparison between LR and FM to a separate post.

Our data comes from 5 advertisers; the table below shows the number of entries and number of features for each:

This data set is preprocessed and randomly sampled into data groups of different sizes. LR and FM are then applied separately to each data group to generate classification models, and the models are evaluated on left-out test data sets using AUC scores.
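
As a rough schematic of this procedure (not our production pipeline; X and y stand in for a preprocessed feature matrix and binary click labels):

[code language="python"]
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# schematic of the experiment: train on increasing fractions of the training data
# and measure AUC on a fixed held-out test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

auc_by_fraction = {}
rng = np.random.RandomState(0)
for fraction in [0.1, 0.2, 0.4, 0.8, 1.0]:
    n = int(fraction * X_train.shape[0])
    idx = rng.choice(X_train.shape[0], size=n, replace=False)
    model = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
    auc_by_fraction[fraction] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(auc_by_fraction)
[/code]
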
The figures below show the relationship between model performance and data size. The horizontal axis represents the sampling fraction, e.g., 0.1 means 10% of the data is used for training. The vertical axis indicates model performance (i.e., AUC). The AUC values fall in different ranges for different advertisers, which makes comparison difficult. For a clearer comparison, we rescale the AUC for each advertiser with the following rule:

AUC_rescaled = (AUC / AUC_avg - 1) * 100%

where AUC_avg is the AUC averaged across all sampling fractions for that advertiser.
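
In code, the rescaling is simply (a small illustration of the formula above):

[code language="python"]
import numpy as np

# rescale each advertiser's AUC values relative to their mean across sampling fractions,
# so curves from different advertisers can be compared on the same plot
def rescale_auc(auc_values):
    auc_values = np.asarray(auc_values, dtype=float)
    return (auc_values / auc_values.mean() - 1.0) * 100.0  # in percent

print(rescale_auc([0.70, 0.71, 0.72]))  # roughly [-1.41, 0.0, 1.41] percent
[/code]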

In the plot, different advertisers are color-coded, and the thick gray curve represents the average rescaled AUC across all 5 advertisers. The figures on the left and right show the results for LR and FM, respectively. It’s clear that model performance keeps increasing as more data is used for training, and with the amount of data used in this test, neither LR nor FM reaches saturation. One thing worth mentioning: when we increase the size of the training data, the feature space also expands, so model complexity increases, making it harder to saturate.

There is also an interesting trend: the AUC increases by about 1% whenever the data size roughly doubles. This relationship is clearer in a lin-log plot, with the x-axis on a log scale: the gray line becomes (roughly, at least for LR) straight, meaning that when we increase x (data size) exponentially, y (performance) increases linearly.

In summary, in our experiment the saturation point was never reached with increasing amounts of data. When the data size doubles, the AUC increases by almost 1%. For industrial applications, a 1% increase in AUC is often considered a significant improvement. This makes the case that when there is enough data, increasing the training size is a straightforward way to improve model performance.
