Some tutorials and usecases¶
Tutorial 0 : how to launch a basic experiment with keras or sklearn¶
Step 1 : launching alp¶
Follow the instructions in the setup section. We assume at this point that you have a Jupyter notebook running on the controller.
Step 2 : defining your model¶
You can follow either Step 2.1 : Keras or Step 2.2 : Scikit learn, depending on whether you want to use Keras or scikit-learn. In both cases we will do the right imports, get some classification data, put them in the ALP format and instantiate a model. The important thing at the end of step 2 is to have the data, data_val and model objects ready.
Step 2.1 : Keras¶
The following code gets some data and declares a simple artificial neural network with Keras:
# we import numpy and fix the seed
import numpy as np
np.random.seed(1337) # for reproducibility
# we import alp and Keras tools that we will use
import alp
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.utils import np_utils
import keras.backend as K
from keras.optimizers import Adam
from alp.appcom.ensembles import HParamsSearch
# if you use tensorflow you must use this configuration
# so that it doesn't use all of the GPU's memory (default config)
import tensorflow as tf
config = tf.ConfigProto(allow_soft_placement=True)
config.gpu_options.allow_growth = True
session = tf.Session(config=config)
K.set_session(session)
batch_size = 128
nb_classes = 10
nb_epoch = 12
# input image dimensions
img_rows, img_cols = 28, 28
# number of features to use
nb_filters = 32
# the data, shuffled and split between train and test sets
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255
print('X_train shape:', X_train.shape)
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')
if K.image_dim_ordering() == 'th':
    X_train = X_train.reshape(X_train.shape[0], 1, img_rows, img_cols)
    X_test = X_test.reshape(X_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    X_train = X_train.reshape(X_train.shape[0], img_rows, img_cols, 1)
    X_test = X_test.reshape(X_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)
# convert class vectors to binary class matrices
Y_train = np_utils.to_categorical(y_train, nb_classes)
Y_test = np_utils.to_categorical(y_test, nb_classes)
# put the data in the form ALP expects
data, data_val = dict(), dict()
data["X"] = X_train
data["y"] = Y_train
data_val["X"] = X_test
data_val["y"] = Y_test
# finally define and compile the model
model = Sequential()
model.add(Flatten(input_shape=input_shape))
model.add(Dense(nb_filters))
model.add(Activation('relu'))
model.add(Dropout(0.25))
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy',
optimizer='adadelta',
metrics=['accuracy'])
Note that we compile the model so that we also have information about the optimizer.
Step 2.2 : Scikit learn¶
The following code gets some data and declares a simple logistic regression with scikit-learn:
# some imports
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
# train_test_split lives in sklearn.model_selection in recent scikit-learn versions
from sklearn.model_selection import train_test_split
# get some data
iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)
# put the data in the form ALP expects
data, data_val = dict(), dict()
data["X"] = X_train
data["y"] = y_train
data_val["X"] = X_test
data_val["y"] = y_test
# define the model
model = LogisticRegression()
Note that by default the multi-class parameter of LogisticRegression is set to OvR (one-vs-rest), that is to say one binary classifier per class. On the iris dataset, this means 3 classifiers. Unlike in Keras, the model is not compiled. So far, the measure of performance (validation metric) can only be the mean absolute error, but we will soon have several metrics working.
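To make the one-vs-rest remark concrete, here is a minimal sketch in plain Python of how a multi-class target can be turned into one binary target per class. The helper name one_vs_rest_targets is ours, for illustration only; it is not part of scikit-learn or ALP, and scikit-learn's internal implementation may differ.

```python
def one_vs_rest_targets(y):
    """Return one binary target vector per class found in y."""
    classes = sorted(set(y))
    # for each class, 1 marks "this class" and 0 marks "any other class"
    return {c: [1 if label == c else 0 for label in y] for c in classes}


# iris has three classes (0, 1, 2), so three binary problems are built
y = [0, 1, 2, 1, 0, 2]
targets = one_vs_rest_targets(y)
print(len(targets))   # 3
print(targets[1])     # [0, 1, 0, 1, 0, 0]
```

Each binary vector would then be used to fit its own classifier, which is why the iris dataset yields 3 of them.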
Step 3 : fitting the model with ALP¶
Step 3.1 : defining the Experiment¶
In ALP, the base object is the Experiment. An Experiment trains, predicts, saves and logs a model. So the first step is to import and define the Experiment object.
from alp.appcom.core import Experiment
expe = Experiment(model)
Step 3.2 : fit the model¶
You have access to two types of methods to fit the model.
- The fit and fit_gen methods allow you to fit the model in the same process.
For the scikit-learn backend, you can launch the computation with the following command without extra arguments:
expe.fit([data], [data_val])
Note that data and data_val are put in lists.
With Keras you might want to specify the number of epochs and the batch_size, as you would to fit a Keras model object directly. These arguments will flow through to the final call. Note that they are not necessary for the fit; see the default arguments in the Keras model doc.
expe.fit([data], [data_val], nb_epoch=2, batch_size=batch_size)
In both cases, the model is trained and automatically saved in the databases.
- The fit_async method sends the model to the broker container that will manage the training using the workers you defined in the setup phase. The commands are then straightforward.
For the scikit-learn backend:
expe.fit_async([data], [data_val])
For the Keras backend you still need to provide extra arguments to override the defaults.
expe.fit_async([data], [data_val], nb_epoch=2, batch_size=batch_size)
In both cases, the model is also trained and automatically saved in the databases.
Step 4 : Identifying and reusing the fitted model¶
Once the experiment has been fitted, you can access the id of the model in the database and load it to make predictions or access the parameters in the current process.
print(expe.mod_id)
print(expe.data_id)
expe.load_model(expe.mod_id, expe.data_id)
It’s then possible to make predictions using the loaded model.
expe.predict(data['X'])
You could of course provide new data to the model. You can also load the model in another experiment.
Tutorial 1 : Simple Hyperparameter Tuning with ALP - sklearn models¶
In this tutorial, we will get some data, build an Experiment with a simple model and tune the parameters of the model to get the best performance on validation data (by launching several experiments). We will then reuse this best model on unseen test data and check that it’s better than the untuned model. The whole thing will use the asynchronous fit to highlight the capacity of ALP.
1 - Get some data¶
Let us start with the usual Iris dataset. Note that we will split the test set in 2 samples of size 25: the “validation” set to select the best model, and the “new” set to assess that the selected model was the best.
from sklearn import datasets
from sklearn.model_selection import train_test_split
# get some data
iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, test_size=50, random_state=0)
X_test_val, X_test_new, y_test_val, y_test_new = train_test_split(
X_test, y_test, test_size=25, random_state=1)
# put it in ALP expected format
data, data_val, data_new = dict(), dict(), dict()
data["X"], data["y"] = X_train, y_train
data_val["X"], data_val["y"] = X_test_val, y_test_val
data_new["X"], data_new["y"] = X_test_new, y_test_new
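As a sanity check on the sizes, the sketch below reproduces the same two-stage split on plain indices using only the standard library. It does not call train_test_split, and split_indices is a hypothetical helper introduced for illustration, so the selected indices will not match the ones scikit-learn draws.

```python
import random


def split_indices(n, test_size, seed):
    """Shuffle range(n) and split off test_size indices."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return idx[test_size:], idx[:test_size]


# iris has 150 samples: keep 100 for training, hold out 50
train_idx, test_idx = split_indices(150, 50, seed=0)
# split the 50 held-out positions into 25 "validation" and 25 "new"
val_pos, new_pos = split_indices(len(test_idx), 25, seed=1)
val_idx = [test_idx[i] for i in val_pos]
new_idx = [test_idx[i] for i in new_pos]
print(len(train_idx), len(val_idx), len(new_idx))  # 100 25 25
```

The three index sets are disjoint, so the "new" set is never seen while selecting the best model.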
2 - Define an easy model and an ALP Experiment in a loop¶
We will define a simple LogisticRegression to demonstrate how to use ensembles of experiments in ALP.
Let us first define a helper function.
import random
import sklearn.linear_model
from alp.appcom.core import Experiment
from functools import reduce  # reduce is not a builtin in Python 3
from operator import mul
def grid_search(grid_dict, tries, model_type='LogisticRegression'):
    ''' This function randomly builds Experiments with different hyperparameters and returns them.
    Args:
        grid_dict(dict) : hyperparameter grid from which to draw samples
        tries(int) : number of models to be generated and tested
        model_type(string) : type of model to be tested (must be in sklearn.linear_model)
    Returns:
        expes(dict): a dictionary mapping experiment names to Experiments.
    '''
    expes = dict()
    # 1 - infos
    size_grid = reduce(mul, [len(v) for v in grid_dict.values()])
    print("grid size: {}".format(size_grid))
    print("tries: {}".format(tries))
    # 2 - models loop
    for i in range(tries):
        select_params = {}
        # the key identifies the experiment: try number, then each drawn value
        key = [str(i)]
        for k, v in grid_dict.items():
            value = random.choice(v)
            select_params[k] = value
            key += ['{}:{}'.format(k, value)]
        model = getattr(sklearn.linear_model, model_type)(**select_params)
        expe = Experiment(model)
        expes['_'.join(key)] = expe
    return expes
In detail, this function does the following:
1. displays some information about the size of the grid.
2. models loop: as many times as tries, it randomly selects a point in the hyperparameter grid and creates an Experiment object with the model parametrized with this point.
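The sampling logic of the two steps above can be tried without ALP or scikit-learn. The following is a minimal sketch under that assumption: sample_grid is a hypothetical helper that only draws the parameter dictionaries, and it uses its own seeded random.Random instance, so the draws differ from those of grid_search. The grid values are the ones used in this tutorial.

```python
import random
from functools import reduce
from operator import mul


def sample_grid(grid_dict, tries, seed=None):
    """Draw `tries` random points from a hyperparameter grid."""
    rng = random.Random(seed)
    # sorted keys keep the draw order stable for a given seed
    return [{k: rng.choice(grid_dict[k]) for k in sorted(grid_dict)}
            for _ in range(tries)]


grid = {
    'tol': [i * 10 ** -j for i in (1, 2, 5) for j in (1, 2, 3, 4, 5, 6)],
    'C': [i * 10 ** -j for i in (1, 2, 5) for j in (-2, -1, 1, 2, 3, 4, 5, 6)],
}
# 18 values of tol times 24 values of C
grid_size = reduce(mul, [len(v) for v in grid.values()])
points = sample_grid(grid, tries=100, seed=12345)
print(grid_size, len(points))  # 432 100
```

Since points are drawn with replacement, 100 tries explore at most 100 of the 432 grid points, and possibly fewer.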
3 - Run the grid search¶
We use the HParamsSearch class to wrap several Experiments.
For now, because the grid is defined outside of the class, you have to pass a dictionary mapping experiment names to Experiments.
import numpy as np
from alp.appcom.ensembles import HParamsSearch
# setting the seed for reproducibility: feel free to change it
random.seed(12345)
# defining the grid that will be explored
grid_tol = [i*10**-j for i in (1,2,5) for j in (1, 2, 3, 4, 5, 6)]
grid_C = [i*10**-j for i in (1,2,5) for j in (-2, -1, 1, 2, 3, 4, 5, 6)]
grid = {'tol':grid_tol, 'C':grid_C}
tries = 100
expes = grid_search(grid, tries)
# we define the ensemble with our experiments and a metric
ensemble = HParamsSearch(experiments=expes, metric='score', op=np.max)
results = ensemble.fit([data], [data_val])
label, predictions = ensemble.predict(data['X'])
print('Best model: {}'.format(label))
Note
You can also use the fit_async() method.
grid size : 432
tries : 100
Best model: 52_C:100_tol:1e-06
- A word on the interpretation of the params:
- the parameter C is the regularisation parameter of the Logistic Regression. A small value of C means a higher L2 constraint on w (the L2 constraint is not applied on $c$, the intercept parameter). A larger C can lead to overfitting, while a smaller value can lead to too much regularization. As such, it is the ideal candidate for automatic tuning.
- the tol parameter is the tolerance for stopping criteria. Our experiments did not show a strong impact of this parameter unless it was set to high values.
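For reference, the L2-penalized logistic regression objective, as given in the scikit-learn user guide for labels $y_i \in \{-1, 1\}$, can be written as follows. Here $w$ is the weight vector, $c$ the intercept and $x_i$ the samples, following the notation of the list above.

```latex
\min_{w, c} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \log\left(\exp\left(-y_i \left(x_i^T w + c\right)\right) + 1\right)
```

Since C multiplies the data-fit term, a small C gives the penalty on $w$ relatively more weight, while the intercept $c$ does not appear in the penalty.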
4 - Validation that the best model is better than the untuned one¶
ALP makes it easy to predict on unseen data with the loaded best model. The accuracy of the best model is decent (one mistake out of 25 points).
import sklearn.metrics
label, predictions = ensemble.predict(data_new['X'])
# accuracy of the best model on the unseen data
print(sklearn.metrics.accuracy_score(data_new['y'], predictions))
0.96
We can now create an untuned model (C=1 by default) and check that its accuracy on unseen data is lower than that of the tuned one.
model = sklearn.linear_model.LogisticRegression()
expe = Experiment(model)
expe.fit([data], [data_val])
pred_worst_new = expe.predict(X_test_new)
print(sklearn.metrics.accuracy_score(pred_worst_new, data_new["y"]))
0.88
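On 25 points each mistake costs 0.04 of accuracy, so 0.96 corresponds to one mistake and 0.88 to three. The sketch below recomputes such a figure by hand on toy labels; the accuracy helper is ours and simply mirrors the definition of the metric.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions equal to the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / float(len(y_true))


# toy labels: 25 points, only the first prediction is wrong
y_true = [0] * 25
y_pred = [1] + [0] * 24
print(accuracy(y_true, y_pred))  # 0.96
```

With only 25 points the estimate is coarse, so the gap between the two models should be read as an indication rather than a precise measurement.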
Tutorial 2 : Feed simple data to your ALP Experiment¶
In this tutorial, we will build an Experiment with a simple model and fit it on a varying number of datasets. The aim of this tutorial is to explain the expected behaviour of ALP.
1 - Get some data¶
Let us start with the usual Iris dataset.
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# get some data
iris = datasets.load_iris()
X_train, X_val, y_train, y_val = train_test_split(
iris.data, iris.target, test_size=100, random_state=0)
The data is then put in the form ALP expects: a dictionary with a field ‘X’ for the input and a field ‘y’ for the output. Note that the same is done for the validation data.
data, data_val = dict(), dict()
data["X"], data["y"] = X_train, y_train
data_val["X"], data_val["y"] = X_val, y_val
Let us shuffle the data some more. After these lines, 2 more datasets are created.
more_data, some_more_data = dict(), dict()
more_data["X"], some_more_data["X"], more_data["y"], some_more_data["y"] = train_test_split(
iris.data, iris.target, test_size=75, random_state=1)
2 - Expected behaviour with sklearn¶
2.1 - Defining the experiment and model¶
We then define a first simple sklearn logistic regression.
from alp.appcom.core import Experiment
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
Expe = Experiment(lr)
2.2 - Fitting with one data set and one validation¶
Fitting one data set with one validation set is done this way:
Expe.fit([data],[data_val])
({'data_id': '1c59c0c562a5abdb84ad4f4a2c1868bf',
'metrics': {'iter': nan,
'score': [0.97999999999999998],
'val_score': [0.93999999999999995]},
'model_id': '5cabd17bbac6934fb487fa7f69bbda6e',
'params_dump': u'/parameters_h5/5cabd17bbac6934fb487fa7f69bbda6e1c59c0c562a5abdb84ad4f4a2c1868bf.h5'},
None)
Now let’s take a look at the results:
- there is a data_id field: that is where the data is stored in the appropriate collection.
- there is a model_id field: this is where the model architecture is stored.
- the params_dump field is the path of a file where the attributes of the fitted model are stored.
- the metrics field is itself a dictionary with several attributes:
- the iter field is here for compatibility with the keras backend.
- the score field is model specific, you will have to look into sklearn’s documentation to see what kind of metric is used. For the logistic regression, it is the accuracy. This field is then the accuracy of the fitted model on the training data.
- the val_score is the score on the validation data (it is still the accuracy in this case).
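A result dictionary of this shape can be inspected with ordinary dictionary access. The sketch below works on a hand-written copy of the ids and metrics shown above (with rounded scores), so it runs without ALP; summarize is a hypothetical helper, not an ALP function.

```python
import math


def summarize(full_res):
    """Return (model_id, last score, last val_score or None)."""
    metrics = full_res['metrics']
    val = metrics['val_score'][-1]
    # nan stands for "no validation score available"
    val = None if math.isnan(val) else val
    return full_res['model_id'], metrics['score'][-1], val


full_res = {
    'data_id': '1c59c0c562a5abdb84ad4f4a2c1868bf',
    'metrics': {'iter': float('nan'),
                'score': [0.98],
                'val_score': [0.94]},
    'model_id': '5cabd17bbac6934fb487fa7f69bbda6e',
}
print(summarize(full_res))
```

Taking the last element of score and val_score is a choice made here for illustration, since these fields are lists.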
You can access the full result of the experiment in the full_res attribute of the object.
Expe.full_res
{'data_id': '1c59c0c562a5abdb84ad4f4a2c1868bf',
'metrics': {'iter': nan,
'score': [0.97999999999999998],
'val_score': [0.93999999999999995]},
'model_id': '5cabd17bbac6934fb487fa7f69bbda6e',
'params_dump': u'/parameters_h5/5cabd17bbac6934fb487fa7f69bbda6e1c59c0c562a5abdb84ad4f4a2c1868bf.h5'}
Predicting the “more_data” on the model fitted on “data” is done this way.
pred_on_more_data = Expe.predict(more_data["X"])
At this point, pred_on_more_data is a vector of predictions. Its accuracy is obtained as follows:
accuracy_score(pred_on_more_data,more_data["y"])
0.95999999999999996
Now you can check that the full_res field of the Expe object was not modified during the predict call.
Expe.full_res
{'data_id': '1c59c0c562a5abdb84ad4f4a2c1868bf',
'metrics': {'iter': nan,
'score': [0.97999999999999998],
'val_score': [0.93999999999999995]},
'model_id': '5cabd17bbac6934fb487fa7f69bbda6e',
'params_dump': u'/parameters_h5/5cabd17bbac6934fb487fa7f69bbda6e1c59c0c562a5abdb84ad4f4a2c1868bf.h5'}
2.3 - Fitting with one data set and no validation:¶
If you want to fit an experiment and don’t have a validation set, you need to specify a None in the validation field. Note that all the fields have changed. Since the data has changed, the data_id is different. The model created is a new one, so are the parameters. Finally, the metrics are different.
Expe.fit([some_more_data],[None])
({'data_id': '3554c1421fd9056e69c3cdf1b0ec8c3f',
'metrics': {'iter': nan, 'score': [0.95999999999999996], 'val_score': [nan]},
'model_id': 'ceb5d5632334515c4ebbd72a256bd421',
'params_dump': u'/parameters_h5/ceb5d5632334515c4ebbd72a256bd4213554c1421fd9056e69c3cdf1b0ec8c3f.h5'},
None)
As a result, the model stored in the Experiment at this point is not the same as in 2.2. You can check this by predicting on the more_data set and verifying that the score is different.
pred_on_more_data = Expe.predict(more_data["X"])
accuracy_score(pred_on_more_data,more_data["y"])
0.94666666666666666
2.4 - Fitting several datasets¶
This is an important point, since the behaviour with sklearn differs from the one with Keras: if you feed several datasets to an Experiment with an sklearn model, ALP proceeds as follows:
- the first model is fitted, then the score and validation score are computed (on the first validation data, if provided).
- the second model is fitted, then the score and validation score are computed (on the second validation data, if provided).
- and so on
As a result, the data_id, model_id and params_dump parameters in the full_res field of the Experiment after the following line are those of the second model. The metrics fields (score and val_score) have a length of 2, one entry for each model.
Note that you can specify a None as validation set if you don’t want to validate a certain model.
Expe.fit([data,more_data],[None,some_more_data])
({'data_id': '2767007837282c3da5a86cfe41b57cce',
'metrics': {'iter': nan,
'score': [0.97999999999999998, 0.94666666666666666],
'val_score': [nan, 0.92000000000000004]},
'model_id': 'c6f885968087dc779ce47f3f1af86a9b',
'params_dump': u'/parameters_h5/c6f885968087dc779ce47f3f1af86a9b2767007837282c3da5a86cfe41b57cce.h5'},
None)
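The behaviour described in this section can be mimicked with a toy model. The sketch below is not ALP's implementation: MajorityClassifier and fit_each are hypothetical names, and the loop only illustrates the bookkeeping (one fresh model per dataset, one metric entry per model, nan when no validation set is given, last model kept).

```python
class MajorityClassifier(object):
    """Toy model: always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label_ = max(set(y), key=list(y).count)
        return self

    def score(self, X, y):
        return sum(1 for t in y if t == self.label_) / float(len(y))


def fit_each(datasets, validations):
    """Fit one fresh model per dataset and collect the metrics in order."""
    metrics = {'score': [], 'val_score': []}
    model = None
    for data, data_val in zip(datasets, validations):
        model = MajorityClassifier().fit(data['X'], data['y'])
        metrics['score'].append(model.score(data['X'], data['y']))
        if data_val is None:
            # no validation set for this model
            metrics['val_score'].append(float('nan'))
        else:
            metrics['val_score'].append(
                model.score(data_val['X'], data_val['y']))
    # only the last fitted model is returned
    return model, metrics


d1 = {'X': [[0], [1], [2]], 'y': [0, 0, 1]}
d2 = {'X': [[3], [4]], 'y': [1, 1]}
model, metrics = fit_each([d1, d2], [None, d1])
print(metrics)
```

As in the ALP output above, both metric lists have two entries and the first validation score is nan.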
Tutorial 3 : Feed more data with Fuel or generators¶
Because we aim at supporting online learning on streamed data, we think that supporting generators is a good start. We support Fuel, a library that helps you pre-process and yield chunks of data while being serializable.
1 - Create some data¶
You can easily use Fuel iterators in an Experiment. We will first create some fake data.
import fuel
import numpy as np
from keras.utils import np_utils
from keras.utils.test_utils import get_test_data
input_dim = 2
nb_hidden = 4
nb_class = 2
batch_size = 5
train_samples = 512
test_samples = 128
(X_tr, y_tr), (X_te, y_te) = get_test_data(nb_train=train_samples,
                                           nb_test=test_samples,
                                           input_shape=(input_dim,),
                                           classification=True,
                                           nb_class=nb_class)
y_tr = np_utils.to_categorical(y_tr)
y_te = np_utils.to_categorical(y_te)
data, data_val = dict(), dict()
# stack the train and test samples into a single array
X = np.concatenate([X_tr, X_te])
y = np.concatenate([y_tr, y_te])
inputs = [X, X]
outputs = [y]
2 - Transform the data¶
We then import a helper function that will convert our list of inputs to an HDF5 dataset. This dataset has a simple structure and we can divide it into multiple sets.
from alp.appcom.utils import to_fuel_h5
# we save the mean and the scale (inverse of the standard deviation)
# for each channel
scale = 1.0 / inputs[0].std(axis=0)
shift = - scale * inputs[0].mean(axis=0)
# for 3 sets, we need 3 slices
slices = [0, 256, 512]
# and 3 names
names = ['train', 'test', 'valid']
file_name = 'test_data_'
# to_fuel_h5 returns the file path and the names of the inputs and outputs
file_path_f, i_names, o_names = to_fuel_h5(inputs, outputs, slices, names, file_name, '/data_generator')
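The scale and shift values computed above encode a standardization: assuming the transformer applies x * scale + shift (our understanding of Fuel's ScaleAndShift, stated here as an assumption), the result is (x - mean) / std. The sketch below checks this identity on a toy channel with the standard library only; scale_and_shift is a hypothetical helper.

```python
import statistics


def scale_and_shift(values):
    """Return (scale, shift) such that x * scale + shift is standardized."""
    # population standard deviation, which is also the default of numpy's std()
    scale = 1.0 / statistics.pstdev(values)
    shift = -scale * statistics.mean(values)
    return scale, shift


channel = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scale, shift = scale_and_shift(channel)
standardized = [x * scale + shift for x in channel]
print(scale, shift)  # 0.5 -2.5
```

After the transformation the toy channel has mean 0 and standard deviation 1, which is what the model will see for the standardized source.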
3 - Build your generator¶
The next step is to construct our Fuel generator using our dataset, a scheme and to transform the data so it’s prepared for our model.
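from fuel.datasets.hdf5 import H5PYDataset
from fuel.schemes import SequentialScheme
from fuel.streams import DataStream
from fuel.transformers import ScaleAndShift
train_set = H5PYDataset(file_path_f,
                        which_sets=('train','test', 'valid'))
scheme = SequentialScheme(examples=128, batch_size=32)
data_stream_train = DataStream(dataset=train_set, iteration_scheme=scheme)
stand_stream_train = ScaleAndShift(data_stream=data_stream_train,
                                   scale=scale, shift=shift,
                                   which_sources=('input_X',))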
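To see what a sequential scheme with examples=128 and batch_size=32 amounts to, here is a minimal sketch of the index batches it would produce. This is plain Python written for illustration, not Fuel's code, and sequential_batches is a hypothetical name.

```python
def sequential_batches(examples, batch_size):
    """Yield lists of consecutive indices, batch_size at a time."""
    for start in range(0, examples, batch_size):
        yield list(range(start, min(start + batch_size, examples)))


batches = list(sequential_batches(examples=128, batch_size=32))
print(len(batches), batches[0][:3], batches[-1][-1])  # 4 [0, 1, 2] 127
```

One pass over the stream therefore yields 4 batches of 32 examples, read in order.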
4 - Build and wrap your model¶
We finally build our model and wrap it in an experiment.
from keras.layers import Input, Dense
from keras.models import Model
from alp.appcom.core import Experiment
inputs = Input(shape=(input_dim,), name='X')
x = Dense(nb_hidden, activation='relu')(inputs)
x = Dense(nb_hidden, activation='relu')(x)
predictions = Dense(nb_class, activation='softmax')(x)
model = Model(input=inputs, output=predictions)
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
expe = Experiment(model)
5 - Train your model¶
We can finally use the alp.appcom.core.Experiment.fit_gen() method with our model and dataset.
expe.fit_gen([gen], [val], nb_epoch=2,
             model=model,
             metrics=metrics,
             custom_objects=cust_objects,
             samples_per_epoch=128,
             nb_val_samples=128)
You can also use alp.appcom.core.Experiment.fit_gen_async() with the same function parameters if you have a worker running.
expe.fit_gen_async([gen], [val], nb_epoch=2,
                   model=model,
                   metrics=metrics,
                   custom_objects=cust_objects,
                   samples_per_epoch=128,
                   nb_val_samples=128)
Tutorial 4 : how to use custom layers for Keras with ALP¶
Because serialization of complex Python objects is still a challenge, we will present a way of sending a custom layer to a Keras model with ALP.
1 - Get a dataset¶
We will work with the CIFAR10 dataset available via Keras.
from keras.datasets import cifar10
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Convolution2D, MaxPooling2D
from keras.optimizers import SGD
from keras.utils import np_utils
from fuel.datasets.hdf5 import H5PYDataset
from fuel.schemes import SequentialScheme
from fuel.streams import DataStream
from fuel.transformers import ScaleAndShift
from alp.appcom.core import Experiment
from alp.appcom.utils import to_fuel_h5
import numpy as np
nb_classes = 10
nb_epoch = 25
# input image dimensions
img_rows, img_cols = 32, 32
# the CIFAR10 images are RGB
img_channels = 3
# the data, shuffled and split between train and test sets
(X_train, y_train), (X_test, y_test) = cifar10.load_data()
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train = X_train/255
X_test = X_test/255
batch_size = 128
print('X_train shape:', X_train.shape)
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')
# convert class vectors to binary class matrices
Y_train = np_utils.to_categorical(y_train, nb_classes)
Y_test = np_utils.to_categorical(y_test, nb_classes)
2 - Build the generators¶
We build two generators, one for training and one for validation.
def dump_data():
    inputs = [np.concatenate([X_train, X_test])]
    outputs = [np.concatenate([Y_train, Y_test])]
    file_name = 'test_data_dropout'
    scale = 1.0 / inputs[0].std(axis=0)
    shift = - scale * inputs[0].mean(axis=0)
    file_path, i_names, o_names = to_fuel_h5(inputs, outputs, [0, 50000],
                                             ['train', 'test'],
                                             file_name,
                                             '/data_generator')
    return file_path, scale, shift, i_names, o_names
file_path, scale, shift, i_names, o_names = dump_data()
def make_gen(set_to_gen, nb_examples):
    file_path_f = file_path
    names_select = i_names
    train_set = H5PYDataset(file_path_f,
                            which_sets=set_to_gen)
    scheme = SequentialScheme(examples=nb_examples, batch_size=64)
    data_stream_train = DataStream(dataset=train_set, iteration_scheme=scheme)
    stand_stream_train = ScaleAndShift(data_stream=data_stream_train,
                                       scale=scale, shift=shift,
                                       which_sources=(names_select[-1],))
    return stand_stream_train, train_set, data_stream_train
train, data_tr, data_stream_tr = make_gen(('train',), 50000)
test, data_te, data_stream_te = make_gen(('test',), 10000)
3 - Build your custom layer¶
Imagine you want to reimplement a dropout layer. We could wrap it in a function that returns the object:
def return_custom():
    import keras.backend as K
    import numpy as np
    from keras.engine import Layer
    class Dropout_cust(Layer):
        '''Applies Dropout to the input.
        '''
        def __init__(self, p, **kwargs):
            self.p = p
            if 0. < self.p < 1.:
                self.uses_learning_phase = True
            self.supports_masking = True
            super(Dropout_cust, self).__init__(**kwargs)
        def call(self, x, mask=None):
            if 0. < self.p < 1.:
                x = K.in_train_phase(K.dropout(x, level=self.p), x)
            return x
        def get_config(self):
            config = {'p': self.p}
            base_config = super(Dropout_cust, self).get_config()
            return dict(list(base_config.items()) + list(config.items()))
    return Dropout_cust
4 - Build your model¶
We then define our model and call our function to instantiate this custom layer.
model = Sequential()
model.add(Convolution2D(64, 3, 3, border_mode='same',
input_shape=(img_channels, img_rows, img_cols)))
model.add(Activation('relu'))
model.add(Convolution2D(64, 3, 3))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Convolution2D(128, 3, 3, border_mode='same'))
model.add(Activation('relu'))
model.add(Convolution2D(128, 3, 3))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(1024))
model.add(Activation('relu'))
model.add(return_custom()(0.5))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))
sgd = SGD(lr=0.02, decay=1e-7, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy',
optimizer=sgd,
metrics=['accuracy'])
5 - Fit your model¶
We then map the name of the custom object to the function that returns it in a dictionary.
After wrapping the model in an alp.appcom.core.Experiment(), we call the alp.appcom.core.Experiment.fit_gen_async() method and pass the custom_objects.
# the key is the name of the class returned by return_custom
custom_objects = {'Dropout_cust': return_custom}
expe = Experiment(model)
results = expe.fit_gen_async([train], [test], nb_epoch=nb_epoch,
                             model=model,
                             metrics=['accuracy'],
                             samples_per_epoch=50000,
                             nb_val_samples=10000,
                             verbose=2,
                             custom_objects=custom_objects)
Note
Why do we wrap this class and all the dependencies?
We use dill to serialize objects but, unfortunately, handling classes with inheritance is not doable. It’s also easier to pass the information about all the dependencies of the object this way. All the dependencies and your custom objects will be instantiated during the evaluation of the function so that they are available in __main__. This way the information can be sent to the workers without problems.
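The pattern itself can be illustrated without Keras, dill or ALP. In the sketch below, return_scaler and LogScaler are hypothetical names: the factory carries its own imports and class definitions, and the receiving side only has to call it to obtain the class. This shows the idea only and is not what ALP or dill actually execute.

```python
def return_scaler():
    # imports live inside the factory so they travel with it
    import math

    class Base(object):
        def apply(self, x):
            return x

    class LogScaler(Base):
        """Hypothetical custom object that inherits from a base class."""

        def apply(self, x):
            return math.log(x)

    return LogScaler


# name of the custom object -> function that builds it
custom_objects = {'LogScaler': return_scaler}

# what a worker could do on its side: call each factory to get the class
resolved = {name: factory() for name, factory in custom_objects.items()}
print(resolved['LogScaler']().apply(1.0))  # 0.0
```

Because the class and its dependencies are created when the function is evaluated, nothing has to exist on the worker before the call.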