Some tutorials and use cases

Tutorial 0 : how to launch a basic experiment with keras or sklearn

Step 1 : launching alp

Follow the instructions in the setup section. We assume at this point that you have a Jupyter notebook running on the controller.

Step 2 : defining your model

Follow either Step 2.1 (Keras) or Step 2.2 (scikit-learn), depending on which library you want to use. In both cases we do the necessary imports, get some classification data, put it in the format ALP expects and instantiate a model. What matters at the end of Step 2 is that the data, data_val and model objects are defined and ready to use.

Step 2.1 : Keras

The following code gets some data and declares a simple artificial neural network with Keras:

# we import numpy and fix the seed
import numpy as np
np.random.seed(1337)  # for reproducibility

# we import alp and Keras tools that we will use
import alp
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.utils import np_utils
import keras.backend as K
from keras.optimizers import Adam
from alp.appcom.ensembles import HParamsSearch

# if you use tensorflow you must use this configuration
# so that it doesn't use all of the GPU's memory (default config)
import tensorflow as tf

config = tf.ConfigProto(allow_soft_placement=True)
config.gpu_options.allow_growth = True
session = tf.Session(config=config)
K.set_session(session)

batch_size = 128
nb_classes = 10
nb_epoch = 12

# input image dimensions
img_rows, img_cols = 28, 28
# number of features to use
nb_filters = 32

# the data, shuffled and split between train and test sets
(X_train, y_train), (X_test, y_test) = mnist.load_data()

X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255
print('X_train shape:', X_train.shape)
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')

if K.image_dim_ordering() == 'th':
    X_train = X_train.reshape(X_train.shape[0], 1, img_rows, img_cols)
    X_test = X_test.reshape(X_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    X_train = X_train.reshape(X_train.shape[0], img_rows, img_cols, 1)
    X_test = X_test.reshape(X_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)

# convert class vectors to binary class matrices
Y_train = np_utils.to_categorical(y_train, nb_classes)
Y_test = np_utils.to_categorical(y_test, nb_classes)

# put the data in the form ALP expects
data, data_val = dict(), dict()
data["X"] = X_train
data["y"] = Y_train
data_val["X"] = X_test
data_val["y"] = Y_test

# finally define and compile the model

model = Sequential()

model.add(Flatten(input_shape=input_shape))
model.add(Dense(nb_filters))
model.add(Activation('relu'))
model.add(Dropout(0.25))

model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adadelta',
              metrics=['accuracy'])

Note that we compile the model so that we also have information about the optimizer.
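
As a side note on the label preparation above, the following plain-Python sketch shows what converting class vectors to binary class matrices (the np_utils.to_categorical call) amounts to. The helper name one_hot is hypothetical and is not part of Keras or ALP; it only illustrates the transformation on a toy input.

```python
def one_hot(labels, nb_classes):
    """Return one row per label, with a 1.0 at the index of the label."""
    rows = []
    for label in labels:
        row = [0.0] * nb_classes
        row[label] = 1.0
        rows.append(row)
    return rows


# three toy labels drawn from 3 classes
encoded = one_hot([0, 2, 1], nb_classes=3)
print(encoded)
```

Each row sums to 1, which is the shape of target the categorical_crossentropy loss expects.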

Step 2.2 : Scikit learn

The following code gets some data and declares a simple logistic regression with scikit-learn:

# some imports
# (train_test_split lives in sklearn.model_selection since scikit-learn 0.18;
# the older sklearn.cross_validation module was removed in 0.20)
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

# get some data
iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
                iris.data, iris.target, test_size=0.2, random_state=0)

# put the data in the form ALP expects
data, data_val = dict(), dict()
data["X"] = X_train
data["y"] = y_train
data_val["X"] = X_test
data_val["y"] = y_test

# define the model
model = LogisticRegression()

Note that, by default, the multi-class parameter of LogisticRegression is set to OvR (one-vs-rest), that is to say one classifier per class. On the iris dataset, this means 3 classifiers. Unlike in Keras, the model is not compiled. So far, the measure of performance (validation metric) can only be the mean absolute error, but several other metrics should be supported soon.
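
To make the one-vs-rest idea concrete, here is a small sketch in plain Python (not scikit-learn's implementation) of how a three-class target is turned into one binary target per class. Each binary problem would then be handed to its own classifier. The function name one_vs_rest_targets and the toy labels are illustrative assumptions.

```python
def one_vs_rest_targets(y):
    """Map each class to a binary target vector: 1 for that class, 0 otherwise."""
    classes = sorted(set(y))
    return {c: [1 if label == c else 0 for label in y] for c in classes}


y = [0, 1, 2, 1, 0]  # toy labels with 3 classes, as in iris
targets = one_vs_rest_targets(y)
print(len(targets))  # one binary problem per class
print(targets[1])    # binary target for "class 1 versus the rest"
```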

Step 3 : fitting the model with ALP

Step 3.1 : defining the Experiment

In ALP, the base object is the Experiment. An Experiment trains, predicts, saves and logs a model. So the first step is to import and define the Experiment object.

from alp.appcom.core import Experiment

expe = Experiment(model)

Step 3.2 : fit the model

You have access to two types of methods to fit the model.

  • The fit and fit_gen methods allow you to fit the model in the same process.

    For the scikit-learn backend, you can launch the computation with the following command without extra arguments:

    expe.fit([data], [data_val])
    

    Note that the data and the data_val are put in lists.

    With Keras you might want to specify the number of epochs and the batch size, as you would when fitting a Keras model object directly. These arguments flow through to the final call. Note that they are optional for the fit; see the default arguments in the Keras model documentation.

    expe.fit([data], [data_val], nb_epoch=2, batch_size=batch_size)
    

    In both cases, the model is trained and automatically saved in the databases.

  • The fit_async method sends the model to the broker container that will manage the training using the workers you defined in the setup phase. The commands are then straightforward:

    For the scikit-learn backend:

    expe.fit_async([data], [data_val])
    

    For the Keras backend you still need to provide extra arguments to override the defaults.

    expe.fit_async([data], [data_val], nb_epoch=2, batch_size=batch_size)
    

    In both cases, the model is also trained and automatically saved in the databases.
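
The way extra keyword arguments flow through to the backend can be pictured with the sketch below. ToyModel and ToyExperiment are hypothetical stand-ins written for this illustration, not ALP or Keras classes; the sketch only shows the forwarding pattern and performs no training.

```python
class ToyModel(object):
    """Stand-in for a backend model; it only reports the arguments it receives."""

    def fit(self, X, y, nb_epoch=10, batch_size=32):
        return {'nb_epoch': nb_epoch, 'batch_size': batch_size, 'n': len(X)}


class ToyExperiment(object):
    """Stand-in wrapper that forwards extra keyword arguments to the model."""

    def __init__(self, model):
        self.model = model

    def fit(self, data, data_val, **kwargs):
        # data is a list of dictionaries with "X" and "y" fields
        return [self.model.fit(d['X'], d['y'], **kwargs) for d in data]


toy = ToyExperiment(ToyModel())
toy_data = {'X': [[0.0], [1.0]], 'y': [0, 1]}
default_call = toy.fit([toy_data], [None])              # backend defaults apply
override_call = toy.fit([toy_data], [None], nb_epoch=2)  # nb_epoch is overridden
print(default_call)
print(override_call)
```

Arguments that are not passed keep the backend's defaults, which is why the extra arguments are optional.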

Step 4 : Identifying and reusing the fitted model

Once the experiment has been fitted, you can access the id of the model in the database and load it to make predictions or access its parameters in the current process.

print(expe.mod_id)
print(expe.data_id)

expe.load_model(expe.mod_id, expe.data_id)

It’s then possible to make predictions using the loaded model.

expe.predict(data['X'])

You could of course provide new data to the model. You can also load the model in another experiment.

Tutorial 1 : Simple Hyperparameter Tuning with ALP - sklearn models

In this tutorial, we will get some data, build an Experiment with a simple model and tune the parameters of the model to get the best performance on validation data (by launching several experiments). We will then reuse this best model on unseen test data and check that it is better than the untuned model. The whole tutorial uses the asynchronous fit to highlight the capabilities of ALP.

1 - Get some data

Let us start with the usual Iris dataset. Note that we will split the test set in 2 samples of size 25: the “validation” set to select the best model, and the “new” set to assess that the selected model was the best.

from sklearn import datasets
from sklearn.model_selection import train_test_split

# get some data
iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=50, random_state=0)
X_test_val, X_test_new, y_test_val, y_test_new = train_test_split(
    X_test, y_test, test_size=25, random_state=1)

# put it in ALP expected format
data, data_val, data_new = dict(), dict(), dict()
data["X"], data["y"] = X_train,  y_train
data_val["X"], data_val["y"] = X_test_val, y_test_val
data_new["X"], data_new["y"] = X_test_new, y_test_new

2 - Define an easy model and an ALP Experiment in a loop

We will define a simple LogisticRegression to demonstrate how to use ensembles of experiments in ALP.

Let us first define a helper function.

import random
from functools import reduce  # reduce is not a builtin in Python 3
from operator import mul

import sklearn.linear_model
from alp.appcom.core import Experiment

def grid_search(grid_dict, tries, model_type='LogisticRegression'):
    ''' This function randomly builds Experiments with different hyperparameters and returns them.

    Args:
        grid_dict(dict) : hyperparameter grid from which to draw samples
        tries(int) : number of models to be generated and tested
        model_type(string) : type of model to be tested (must be in sklearn.linear_model)

    Returns:
        expes(dict): a dictionary mapping a label (index and drawn hyperparameters) to an Experiment.

    '''

    expes = dict()

    # 1 - infos
    size_grid = reduce(mul, [len(v) for v in grid_dict.values()])
    print("grid size: {}".format(size_grid))
    print("tries: {}".format(tries))


    # 2 - models loop
    for i in range(tries):
        select_params =  {}
        key = [str(i)]
        for k, v in grid_dict.items():
            value = random.choice(v)
            select_params[k] = value
            key += ['{}:{}'.format(k, value)]
        model = getattr(sklearn.linear_model, model_type)(**select_params)
        expe = Experiment(model)
        expes['_'.join(key)] = expe
    return expes

In detail, this function does two things: 1. it displays some information about the size of the grid; 2. it loops over the models: as many times as tries, it randomly selects a point in the hyperparameter grid and creates an Experiment object with the model parametrized with this point.
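
The two steps can be tried in isolation, without ALP, with the short sketch below. The grid values are made up for the illustration, and a seeded random.Random instance is used so that the draw is repeatable.

```python
import random
from functools import reduce
from operator import mul

# illustrative grid: 4 values for C, 2 values for penalty
grid = {'C': [0.01, 0.1, 1.0, 10.0], 'penalty': ['l1', 'l2']}

# size of the grid: product of the number of values per hyperparameter
size_grid = reduce(mul, [len(v) for v in grid.values()])

# one point of the grid: one randomly chosen value per hyperparameter
rng = random.Random(0)
point = {k: rng.choice(v) for k, v in grid.items()}

print("grid size: {}".format(size_grid))
print(point)
```

Because points are drawn independently, the same point can be drawn more than once when tries is close to the grid size.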

4 - Validation that the best model is better than the untuned one

ALP makes it easy to predict on unseen data with the loaded best model. The accuracy of the best model is decent (one mistake out of 25 points).

label, predictions = ensemble.predict(data_new['X'])
print('Best model: {}'.format(label))
0.96

We can now create an untuned model (C=1 by default) and check that its accuracy on unseen data is lower than that of the tuned one.

import sklearn.metrics  # needed for accuracy_score below

model = sklearn.linear_model.LogisticRegression()
expe = Experiment(model)
expe.fit([data], [data_val])
pred_worst_new = expe.predict(X_test_new)
# accuracy_score expects (y_true, y_pred)
print(sklearn.metrics.accuracy_score(data_new["y"], pred_worst_new))
0.88
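
For reference, the two figures above translate directly into a number of mistakes on the 25 unseen points: 0.96 is one mistake and 0.88 is three. The sketch below checks this arithmetic with a hand-written accuracy function on made-up labels; it does not use the iris predictions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / float(len(y_true))


y_true = [0] * 25
one_mistake = [0] * 24 + [1]         # 24 correct predictions out of 25
three_mistakes = [0] * 22 + [1] * 3  # 22 correct predictions out of 25

acc_one = accuracy(y_true, one_mistake)
acc_three = accuracy(y_true, three_mistakes)
print(acc_one, acc_three)
```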

Tutorial 2 : Feed simple data to your ALP Experiment

In this tutorial, we will build an Experiment with a simple model and fit it on a varying number of datasets. The aim of this tutorial is to explain the expected behaviour of ALP.

1 - Get some data

Let us start with the usual Iris dataset.

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# get some data
iris = datasets.load_iris()
X_train, X_val, y_train, y_val = train_test_split(
                    iris.data, iris.target, test_size=100, random_state=0)

The data is then put in the form ALP expects: a dictionary with a field ‘X’ for the input and a field ‘y’ for the output. Note that the same is done for the validation data.

data, data_val = dict(), dict()
data["X"], data["y"] = X_train,  y_train
data_val["X"], data_val["y"] = X_val, y_val

Let us shuffle the data some more. After these lines, 2 more datasets are created.

more_data, some_more_data = dict(), dict()
more_data["X"], some_more_data["X"], more_data["y"], some_more_data["y"] = train_test_split(
                    iris.data, iris.target, test_size=75, random_state=1)

2 - Expected behaviour with sklearn

2.1 - Defining the experiment and model

We then define a first simple sklearn logistic regression.

from alp.appcom.core import Experiment
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
Expe = Experiment(lr)

2.2 - Fitting with one data set and one validation

Fitting one data set with one validation set is done this way:

Expe.fit([data],[data_val])
({'data_id': '1c59c0c562a5abdb84ad4f4a2c1868bf',
  'metrics': {'iter': nan,
   'score': [0.97999999999999998],
   'val_score': [0.93999999999999995]},
  'model_id': '5cabd17bbac6934fb487fa7f69bbda6e',
  'params_dump': u'/parameters_h5/5cabd17bbac6934fb487fa7f69bbda6e1c59c0c562a5abdb84ad4f4a2c1868bf.h5'},
 None)

Now let’s take a look at the results:

  • there is a data_id field: that is where the data is stored in the appropriate collection.

  • there is a model_id field: this is where the model architecture is stored.

  • the params_dump field is the path of the file where the attributes of the fitted model are stored.

  • the metrics field is itself a dictionary with several attributes:
    • the iter field is here for compatibility with the keras backend.
    • the score field is model specific, you will have to look into sklearn’s documentation to see what kind of metric is used. For the logistic regression, it is the accuracy. This field is then the accuracy of the fitted model on the training data.
    • the val_score is the score on the validation data (it is still the accuracy in this case).
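
As a small illustration of how these fields can be read, the sketch below works on a dictionary with the same layout as the first element of the tuple returned by fit. The ids and scores are copied from the output above (scores rounded to two decimals) and the params_dump entry is left out for brevity.

```python
res = {
    'data_id': '1c59c0c562a5abdb84ad4f4a2c1868bf',
    'metrics': {'iter': float('nan'),
                'score': [0.98],
                'val_score': [0.94]},
    'model_id': '5cabd17bbac6934fb487fa7f69bbda6e',
}

# the metrics are lists, with one entry per fitted dataset
train_acc = res['metrics']['score'][-1]
val_acc = res['metrics']['val_score'][-1]

print('model {} fitted on data {}'.format(res['model_id'], res['data_id']))
print('train accuracy: {:.2f}, validation accuracy: {:.2f}'.format(train_acc, val_acc))
```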

You can access the full result of the experiment in the full_res attribute of the object.

Expe.full_res
{'data_id': '1c59c0c562a5abdb84ad4f4a2c1868bf',
 'metrics': {'iter': nan,
  'score': [0.97999999999999998],
  'val_score': [0.93999999999999995]},
 'model_id': '5cabd17bbac6934fb487fa7f69bbda6e',
 'params_dump': u'/parameters_h5/5cabd17bbac6934fb487fa7f69bbda6e1c59c0c562a5abdb84ad4f4a2c1868bf.h5'}

Predicting the “more_data” on the model fitted on “data” is done this way.

pred_on_more_data = Expe.predict(more_data["X"])

At this point, pred_on_more_data is a vector of predictions. Its accuracy is obtained as follows:

accuracy_score(pred_on_more_data,more_data["y"])
0.95999999999999996

Now you can check that the full_res field of the Expe object was not modified during the predict call.

Expe.full_res
{'data_id': '1c59c0c562a5abdb84ad4f4a2c1868bf',
 'metrics': {'iter': nan,
  'score': [0.97999999999999998],
  'val_score': [0.93999999999999995]},
 'model_id': '5cabd17bbac6934fb487fa7f69bbda6e',
 'params_dump': u'/parameters_h5/5cabd17bbac6934fb487fa7f69bbda6e1c59c0c562a5abdb84ad4f4a2c1868bf.h5'}

2.3 - Fitting with one data set and no validation

If you want to fit an experiment and don’t have a validation set, you need to specify a None in the validation field. Note that all the fields have changed. Since the data has changed, the data_id is different. The model created is a new one, so are the parameters. Finally, the metrics are different.

Expe.fit([some_more_data],[None])
({'data_id': '3554c1421fd9056e69c3cdf1b0ec8c3f',
  'metrics': {'iter': nan, 'score': [0.95999999999999996], 'val_score': [nan]},
  'model_id': 'ceb5d5632334515c4ebbd72a256bd421',
  'params_dump': u'/parameters_h5/ceb5d5632334515c4ebbd72a256bd4213554c1421fd9056e69c3cdf1b0ec8c3f.h5'},
 None)

As a result, the model actually stored in the Experiment at that time of the code execution is not the same as in 2.2. You can check that by predicting on the more_data set and check that the score is not the same.

pred_on_more_data = Expe.predict(more_data["X"])
accuracy_score(pred_on_more_data,more_data["y"])
0.94666666666666666
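
The ids shown above are 32-character hexadecimal strings that change when the data changes. How ALP computes them is not described in this tutorial; the sketch below only illustrates the general idea of an identifier derived from content, using an MD5 digest of a JSON dump as an assumed stand-in. The function name content_id is hypothetical.

```python
import hashlib
import json


def content_id(obj):
    """Hex digest of a JSON-serialisable object (illustrative, not ALP's scheme)."""
    payload = json.dumps(obj, sort_keys=True).encode('utf-8')
    return hashlib.md5(payload).hexdigest()


id_a = content_id({'X': [[5.1, 3.5]], 'y': [0]})
id_b = content_id({'X': [[6.2, 2.9]], 'y': [1]})
id_a_again = content_id({'y': [0], 'X': [[5.1, 3.5]]})

print(id_a != id_b)        # different data gives a different id
print(id_a == id_a_again)  # the same data gives the same id
```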

2.4 - Fitting several datasets

This is an important point, since the behaviour with sklearn differs from the behaviour with Keras: if you feed several datasets to an Experiment with an sklearn model, ALP proceeds as follows:

  • the first model is fitted, then the score and validation score are computed (on the first validation data, if provided).
  • the second model is fitted, then the score and validation score are computed (on the second validation data, if provided).
  • and so on

As a result, the data_id, model_id and params_dump entries in the full_res field of the Experiment after the following line are those of the second model. The metrics fields (score and val_score) have a length of 2, one entry for each model.

Note that you can specify a None as validation set if you don’t want to validate a certain model.

Expe.fit([data,more_data],[None,some_more_data])
({'data_id': '2767007837282c3da5a86cfe41b57cce',
  'metrics': {'iter': nan,
   'score': [0.97999999999999998, 0.94666666666666666],
   'val_score': [nan, 0.92000000000000004]},
  'model_id': 'c6f885968087dc779ce47f3f1af86a9b',
  'params_dump': u'/parameters_h5/c6f885968087dc779ce47f3f1af86a9b2767007837282c3da5a86cfe41b57cce.h5'},
 None)
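
Since the metrics are lists with one entry per dataset, they can be paired back with the datasets that produced them. The sketch below does this for the values printed above; the names list is an assumption made for the display and is not returned by ALP.

```python
import math

metrics = {'score': [0.97999999999999998, 0.94666666666666666],
           'val_score': [float('nan'), 0.92000000000000004]}
names = ['data', 'more_data']  # order in which the datasets were passed to fit

summary = []
for name, score, val in zip(names, metrics['score'], metrics['val_score']):
    # a nan validation score means that None was given as validation set
    val_txt = 'not validated' if math.isnan(val) else '{:.2f}'.format(val)
    summary.append((name, round(score, 2), val_txt))
    print('{}: score={:.2f}, val_score={}'.format(name, score, val_txt))
```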

Tutorial 3 : Feed more data with Fuel or generators

Because we aim at supporting online learning on streamed data, we thought that supporting generators was a good start. ALP supports Fuel, a library that helps you pre-process and yield chunks of data while being serializable.

1 - Create some data

You can easily use Fuel iterators in an Experiment. We will first create some fake data.

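import fuel
import numpy as np
from keras.utils import np_utils
from keras.utils.test_utils import get_test_data  # test helper shipped with Keras 1.x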
input_dim = 2
nb_hidden = 4
nb_class = 2
batch_size = 5
train_samples = 512
test_samples = 128
(X_tr, y_tr), (X_te, y_te) = get_test_data(nb_train=train_samples,
                                          nb_test=test_samples,
                                          input_shape=(input_dim,),
                                          classification=True,
                                          nb_class=nb_class)

y_tr = np_utils.to_categorical(y_tr)
y_te = np_utils.to_categorical(y_te)

data, data_val = dict(), dict()

X = np.concatenate([X_tr, X_te])
y = np.concatenate([y_tr, y_te])

inputs = [X, X]
outputs = [y]

2 - Transform the data

We then import a helper function that will convert our list of inputs to an HDF5 dataset. This dataset has a simple structure and we can divide it into multiple sets.

# we save the mean and the scale (inverse of the standard deviation)
# for each channel
scale = 1.0 / inputs[0].std(axis=0)
shift = - scale * inputs[0].mean(axis=0)

# for 3 sets, we need 3 slices
slices = [0, 256, 512]

# and 3 names
names = ['train', 'test', 'valid']

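from alp.appcom.utils import to_fuel_h5

file_name = 'test_data_'
# to_fuel_h5 returns the file path together with the input and output names
file_path_f, i_names, o_names = to_fuel_h5(inputs, outputs, slices, names, file_name, '/data_generator')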
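
The scale and shift computed above are chosen so that x * scale + shift equals (x - mean) / std, which is the operation Fuel's ScaleAndShift transformer applies in the next step. The sketch below checks this on a single made-up channel with the standard library; statistics.pstdev is the population standard deviation, like numpy's default std.

```python
import statistics

channel = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # toy values for one channel

mean = statistics.mean(channel)
std = statistics.pstdev(channel)

scale = 1.0 / std
shift = -scale * mean

# after the transformation the channel has zero mean and unit standard deviation
standardized = [x * scale + shift for x in channel]
print(standardized)
```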

3 - Build your generator

The next step is to construct our Fuel generator using our dataset, a scheme and to transform the data so it’s prepared for our model.

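from fuel.datasets.hdf5 import H5PYDataset
from fuel.schemes import SequentialScheme
from fuel.streams import DataStream
from fuel.transformers import ScaleAndShift

train_set = H5PYDataset(file_path_f,
                        which_sets=('train','test', 'valid'))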

scheme = SequentialScheme(examples=128, batch_size=32)

data_stream_train = DataStream(dataset=train_set, iteration_scheme=scheme)

stand_stream_train = ScaleAndShift(data_stream=data_stream_train,
                                   scale=scale, shift=shift,
                                   which_sources=('input_X',))
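
To get a feel for what a sequential scheme with examples=128 and batch_size=32 iterates over, here is a plain-Python sketch that yields consecutive batches of indices. It is an illustration of the idea, not Fuel's implementation, and the function name sequential_batches is hypothetical.

```python
def sequential_batches(examples, batch_size):
    """Yield lists of consecutive indices, batch_size at a time."""
    for start in range(0, examples, batch_size):
        yield list(range(start, min(start + batch_size, examples)))


batches = list(sequential_batches(examples=128, batch_size=32))
print(len(batches))                    # number of batches per epoch
print(batches[1][0], batches[1][-1])   # first and last index of the second batch
```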

4 - Build and wrap your model

We finally build our model and wrap it in an experiment.

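from keras.layers import Input, Dense
from keras.models import Model
from alp.appcom.core import Experiment

inputs = Input(shape=(input_dim,), name='X')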

x = Dense(nb_hidden, activation='relu')(inputs)
x = Dense(nb_hidden, activation='relu')(x)
predictions = Dense(nb_class, activation='softmax')(x)

model = Model(input=inputs, output=predictions)

model.compile(loss='categorical_crossentropy',
                optimizer='rmsprop',
                metrics=['accuracy'])

expe = Experiment(model)

5 - Train your model

We can finally use the alp.appcom.core.Experiment.fit_gen() method with our model and dataset.

expe.fit_gen([gen], [val], nb_epoch=2,
              model=model,
              metrics=metrics,
              custom_objects=cust_objects,
              samples_per_epoch=128,
              nb_val_samples=128)

You can also use alp.appcom.core.Experiment.fit_gen_async() with the same function parameters if you have a worker running.

expe.fit_gen_async([gen], [val], nb_epoch=2,
                   model=model,
                   metrics=metrics,
                   custom_objects=cust_objects,
                   samples_per_epoch=128,
                   nb_val_samples=128)

Tutorial 4 : how to use custom layers for Keras with ALP

Because serialization of complex Python objects is still a challenge, we present a way of sending a custom layer to a Keras model with ALP.

1 - Get a dataset

We will work with the CIFAR10 dataset available via Keras.

from keras.datasets import cifar10
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Convolution2D, MaxPooling2D
from keras.optimizers import SGD
from keras.utils import np_utils

from fuel.datasets.hdf5 import H5PYDataset
from fuel.schemes import SequentialScheme
from fuel.streams import DataStream
from fuel.transformers import ScaleAndShift

from alp.appcom.core import Experiment

from alp.appcom.utils import to_fuel_h5

import numpy as np

nb_classes = 10
nb_epoch = 25

# input image dimensions
img_rows, img_cols = 32, 32
# the CIFAR10 images are RGB
img_channels = 3

# the data, shuffled and split between train and test sets
(X_train, y_train), (X_test, y_test) = cifar10.load_data()

X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train = X_train/255
X_test = X_test/255

batch_size = 128
print('X_train shape:', X_train.shape)
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')

# convert class vectors to binary class matrices
Y_train = np_utils.to_categorical(y_train, nb_classes)
Y_test = np_utils.to_categorical(y_test, nb_classes)

2 - Build the generators

We build two generators, one for training and one for validation.

def dump_data():
    inputs = [np.concatenate([X_train, X_test])]
    outputs = [np.concatenate([Y_train, Y_test])]

    file_name = 'test_data_dropout'
    scale = 1.0 / inputs[0].std(axis=0)
    shift = - scale * inputs[0].mean(axis=0)

    file_path, i_names, o_names = to_fuel_h5(inputs, outputs, [0, 50000],
                                            ['train', 'test'],
                                            file_name,
                                            '/data_generator')
    return file_path, scale, shift, i_names, o_names

file_path, scale, shift, i_names, o_names = dump_data()


def make_gen(set_to_gen, nb_examples):
    file_path_f = file_path
    names_select = i_names
    train_set = H5PYDataset(file_path_f,
                            which_sets=set_to_gen)

    scheme = SequentialScheme(examples=nb_examples, batch_size=64)

    data_stream_train = DataStream(dataset=train_set, iteration_scheme=scheme)

    stand_stream_train = ScaleAndShift(data_stream=data_stream_train,
                                      scale=scale, shift=shift,
                                      which_sources=(names_select[-1],))
    return stand_stream_train, train_set, data_stream_train

train, data_tr, data_stream_tr = make_gen(('train',), 50000)
test, data_te, data_stream_te = make_gen(('test',), 10000)
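
The call to to_fuel_h5 above passes the slices [0, 50000] with the names ['train', 'test'], and the generators are then built with 50000 and 10000 examples. This is consistent with reading the slices as start indices, each set running up to the next start (or to the end of the data). That reading is an assumption made for this sketch, and split_ranges is a hypothetical helper, not an ALP function.

```python
def split_ranges(slices, names, nb_examples):
    """Pair each set name with a (start, stop) range, assuming start indices."""
    stops = list(slices[1:]) + [nb_examples]
    return {name: (start, stop)
            for name, start, stop in zip(names, slices, stops)}


# CIFAR10 has 50000 training images and 10000 test images
ranges = split_ranges([0, 50000], ['train', 'test'], nb_examples=60000)
print(ranges)
```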

3 - Build your custom layer

Imagine you want to reimplement a dropout layer. We could wrap it in a function that returns the object:

def return_custom():
    import keras.backend as K
    import numpy as np
    from keras.engine import Layer
    class Dropout_cust(Layer):
        '''Applies Dropout to the input.
        '''
        def __init__(self, p, **kwargs):
            self.p = p
            if 0. < self.p < 1.:
                self.uses_learning_phase = True
            self.supports_masking = True
            super(Dropout_cust, self).__init__(**kwargs)

        def call(self, x, mask=None):
            if 0. < self.p < 1.:
                x = K.in_train_phase(K.dropout(x, level=self.p), x)
            return x

        def get_config(self):
            config = {'p': self.p}
            base_config = super(Dropout_cust, self).get_config()
            return dict(list(base_config.items()) + list(config.items()))
    return Dropout_cust
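
For readers who want to see what the layer computes without a Keras backend, the following plain-Python sketch mimics the usual "inverted dropout" behaviour: during training each value is zeroed with probability p and the remaining values are rescaled by 1 / (1 - p); outside training the input is returned unchanged. It is an illustration under these assumptions, not the backend's code.

```python
import random


def dropout(values, p, training, rng=random):
    """Inverted dropout on a list of floats."""
    if not training or not (0. < p < 1.):
        return list(values)
    keep = 1.0 - p
    # keep a value with probability `keep` and rescale it, otherwise drop it
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)
x = [1.0] * 8
train_out = dropout(x, 0.5, training=True, rng=rng)
test_out = dropout(x, 0.5, training=False)
print(train_out)  # each entry is either 0.0 or 2.0
print(test_out)   # unchanged outside training
```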

4 - Build your model

We then define our model and call our function to instantiate this custom layer.

model = Sequential()

model.add(Convolution2D(64, 3, 3, border_mode='same',
                        input_shape=(img_channels, img_rows, img_cols)))
model.add(Activation('relu'))
model.add(Convolution2D(64, 3, 3))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Convolution2D(128, 3, 3, border_mode='same'))
model.add(Activation('relu'))
model.add(Convolution2D(128, 3, 3))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(1024))
model.add(Activation('relu'))
model.add(return_custom()(0.5))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))

sgd = SGD(lr=0.02, decay=1e-7, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy',
              optimizer=sgd,
              metrics=['accuracy'])

5 - Fit your model

We then map the name of the custom object to our function that returns the custom object in a dictionary. After wrapping the model in an alp.appcom.core.Experiment(), we call the alp.appcom.core.Experiment.fit_gen_async() method and send the custom_objects.

custom_objects = {'Dropout_p': return_custom}

expe = Experiment(model)

results = expe.fit_gen_async([train], [test], nb_epoch=nb_epoch,
                             model=model,
                             metrics=['accuracy'],
                             samples_per_epoch=50000,
                             nb_val_samples=10000,
                             verbose=2,
                             custom_objects=custom_objects)

Note

Why do we wrap this class and all the dependencies?

We use dill to serialize objects but, unfortunately, handling classes with inheritance is not doable. It is also easier to pass the information about all the dependencies of the object this way. All the dependencies and your custom objects are instantiated when the function is evaluated, so that they are available in __main__. This way the information can be sent to the workers without problems.
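
The pattern can be reproduced on a toy example that does not involve Keras or dill. In the sketch below, return_scaler plays the role of return_custom: it imports its dependencies and defines the class inside the function body, so calling the function is enough to rebuild the class. The names return_scaler, Scaler and factories are hypothetical, and the "receiving side" is simulated in the same process.

```python
def return_scaler():
    # dependencies are imported inside the function, so they travel with it
    import math

    class Scaler(object):
        def __init__(self, factor):
            self.factor = factor

        def __call__(self, x):
            return math.floor(x * self.factor)

    return Scaler


# name -> factory, in the spirit of the custom_objects dictionary
factories = {'Scaler': return_scaler}

# what a receiving process could do: call each factory to rebuild the classes
rebuilt = {name: factory() for name, factory in factories.items()}
scaler = rebuilt['Scaler'](2.5)
result = scaler(3)
print(result)
```

Only the function needs to be shipped; the class and its imports are recreated when the function is called.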