Convolutional Neural Networks on Tabular data

TL;DR; Convolutional Networks are magical: they work on tables too!

It is known: a typical data scientist frequently swings during her workday between two states: "Why isn't it working?" and "Why does it work?"

Today I'd like to tell you a short story that will switch you to the second state:

"How come it just works?!?"

We'll explore how one of the most successful architectures, the Convolutional Neural Network (CNN), can be used to solve problems with tabular data.

Mechanisms of Action (MoA) Prediction Competition

A few months ago, I participated in the Kaggle competition Mechanisms of Action (MoA) Prediction. The goal of the competition was to predict the biological mechanism of action of a given molecule, based on gene expression and cell viability measurements of the treated samples.

The final ranking was very tight: the top 10 teams on the leaderboard were separated by just 0.0001. All of them had one thing in common: they used a CNN as an integral part of their solution.

Here we'll review the CNN model used by the second-place team. When I first read their write-up, I was amazed by the simplicity and efficiency of the architecture, which they called 1D-CNN.

Although their submission was an ensemble of 1D-CNN and TabNet, the 1D-CNN alone would have placed 5th, making it the standout single model of this competition. Reaching the top five with a single model (not an ensemble) is an outstanding result!

Convolutional Neural Networks (CNN)

We all know that CNNs are an excellent architecture for computer vision problems. Many modern vision algorithms use CNNs as their building blocks. Their effectiveness in image and video processing stems from the fact that they take the spatial structure of the data into account and capture local spatial patterns in the input.

A convolution filter is an excellent feature extractor that exploits two properties of input images: local connectivity and spatial locality. Local connectivity means that each filter is connected to only a small region of the input image at each step. Spatial locality means that the pixels the kernel operates on are highly correlated, so processing them together usually yields a meaningful representation. For example, a single convolution filter can learn to extract edges, structures, shapes, gradients, etc.
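To make the idea of a filter that "extracts edges" concrete, here is a minimal sketch in plain Python. The helper name correlate1d and the hand-picked kernel are assumptions made for illustration; a real CNN learns its kernel weights during training.

```python
def correlate1d(signal, kernel):
    """Slide `kernel` over `signal` (valid positions only) and sum the products."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


# One row of pixel intensities: a dark region followed by a bright region.
row = [0, 0, 0, 0, 9, 9, 9, 9]

# Hand-picked difference filter (illustrative; not a learned kernel).
edge_kernel = [-1, 1]

response = correlate1d(row, edge_kernel)
print(response)  # [0, 0, 0, 9, 0, 0, 0]
```

The response is non-zero only where neighbouring pixels differ, which is exactly where the edge sits. The filter only ever looks at two adjacent values at a time, so it relies on neighbours being related to each other.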

What happens when we apply a CNN to tabular data?

We can use a single one-dimensional convolution layer, but such a layer expects local correlation between the features. In other words, it expects adjacent columns to be spatially correlated (so that the relative position of the columns carries meaning), which is false for nearly every table.

So, basically, it shouldn't work.

If you're still not with me: changing the table column order should not change the prediction output.

Convolution just isn't built for it.
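A small sketch of why column order matters to a convolution but not to a fully connected layer. All values, the column order, and the helper names below are made up for illustration. A dense unit has one weight per column, so a column shuffle can be absorbed by shuffling the weights; a convolution kernel shares the same few weights across neighbouring columns, so a shuffle changes what it sees.

```python
def correlate1d(signal, kernel):
    """Slide `kernel` over `signal` (valid positions only) and sum the products."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def dense_unit(features, weights):
    """A single fully connected unit: one weight per input column."""
    return sum(f * w for f, w in zip(features, weights))


# One table row with six columns, and an arbitrary new column order.
features = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
order = [3, 0, 5, 1, 4, 2]
shuffled = [features[i] for i in order]

# Dense unit: shuffle the weights the same way and the output is unchanged.
weights = [0.5, -1.0, 0.25, 2.0, 0.0, 1.0]
shuffled_weights = [weights[i] for i in order]
dense_original = dense_unit(features, weights)
dense_shuffled = dense_unit(shuffled, shuffled_weights)

# Convolution: the same 3-wide kernel now sees different neighbours.
kernel = [1.0, -2.0, 1.0]
conv_original = correlate1d(features, kernel)
conv_shuffled = correlate1d(shuffled, kernel)

print(dense_original, dense_shuffled)  # 47.5 47.5
print(conv_original)                   # [1.0, 2.0, 4.0, 8.0]
print(conv_shuffled)                   # [38.0, -61.0, 44.0, -26.0]
```

No reordering of the three kernel weights can undo the shuffle, because the kernel never sees columns that are no longer adjacent.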

That's where the trick starts...

We can't feed a table directly into a convolution layer, since tabular features aren't spatially ordered... but... what if we could learn a representation of these features in which they are?

And this is what the mysterious user "tmp" used in the competition.

In his/her words:

As shown above, feature dimension is increased through a FC layer firstly. The role of this layer includes providing enough pixels for the image by increasing the dimension, and making the generated image meaningful by features sorting.

First, s/he expands the data from the 937 original features to 4096 using a standard fully connected layer. Then the output of this layer is reshaped into 256 channels, each containing a 16x1 signal (a 16x1 "image"). In simple words: each signal corresponds to a different learned arrangement of 16 features, and we get 256 (16x16) such groups, each with its own ordering. 💥
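The expand-then-reshape step can be sketched with NumPy as follows. This is only an illustration of the shapes involved, following the 256-channel layout described above: the batch size and the random weights are assumptions standing in for real data and trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

batch_size = 4          # assumed, for illustration only
num_features = 937      # original features, as stated in the text
expanded = 4096         # output size of the fully connected layer
num_channels, signal_length = 256, 16   # 256 * 16 == 4096

# Random stand-ins for a batch of table rows and for the learned dense weights.
x = rng.normal(size=(batch_size, num_features))
W = rng.normal(scale=0.01, size=(num_features, expanded))
b = np.zeros(expanded)

# Fully connected layer: every output unit mixes all 937 input columns.
expanded_x = x @ W + b                                   # shape (4, 4096)

# Cut the 4096 outputs into 256 signals of 16 values each.
signals = expanded_x.reshape(batch_size, num_channels, signal_length)

print(expanded_x.shape)  # (4, 4096)
print(signals.shape)     # (4, 256, 16)
```

Which values end up next to each other inside a 16-long signal is decided by the dense weights, not by the original column order, and that is what lets training arrange them in a way the convolution can exploit.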

Indeed, "💥": s/he literally blew my mind.

A new network that can win anything - and I don't have that??

So I went and checked tmp's GitHub and reimplemented the code in Keras. I kept trimming the original solution until only a few lines of code were left, which still reached second place on the leaderboard.

What's up in there then?

There's a network with a fully connected first layer, followed by convolutions that look at the features the first layer learned. Everything is trained end to end. So the idea was to learn how to "rearrange" the features. In practice, though, the values the convolution receives are 16x1 signals that are not identical to the original features but are learned combinations of them.

After reshaping, these features pass through several one-dimensional convolutional layers with shortcut connections. The extracted features are then flattened and passed through another fully connected layer to produce the prediction (see sketch).

Code implementation

And here is the code implementation:

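# Assumes TensorFlow 2.x plus the tensorflow_addons package (for WeightNormalization).
from tensorflow.keras.activations import swish
from tensorflow.keras.layers import (
    Activation, AveragePooling1D, BatchNormalization, Conv1D, Dense,
    Dropout, Flatten, Input, MaxPool1D, Multiply, Reshape,
)
from tensorflow.keras.models import Model
from tensorflow_addons.layers import WeightNormalization


def create_cnn(num_columns, num_labels, hidden_units, dropout_rates, label_smoothing, learning_rate):
    # hidden_units, label_smoothing and learning_rate are not used in this trimmed-down version.
    inp = Input(shape=(num_columns,))
    x = BatchNormalization()(inp)
    x = Dropout(dropout_rates[0])(x)
    # Expand the features with a fully connected layer, then reshape for the convolutions.
    x = WeightNormalization(Dense(4096))(x)
    x = Reshape((256, 16))(x)
    x = BatchNormalization()(x)
    x = Dropout(0.1)(x)
    x = WeightNormalization(Conv1D(filters=16, kernel_size=5, activation=swish, use_bias=False, padding='same'))(x)
    x = AveragePooling1D(pool_size=2)(x)
    xs = x  # saved for the multiplicative shortcut below
    x = BatchNormalization()(x)
    x = Dropout(0.1)(x)
    x = WeightNormalization(Conv1D(filters=16, kernel_size=3, activation=swish, use_bias=True, padding='same'))(x)
    x = BatchNormalization()(x)
    x = Dropout(0.1)(x)
    x = WeightNormalization(Conv1D(filters=16, kernel_size=3, activation=swish, use_bias=True, padding='same'))(x)
    # Combine the convolution branch with the saved tensor.
    x = Multiply()([x, xs])
    x = MaxPool1D(pool_size=4, strides=2)(x)
    # Flatten and predict one sigmoid probability per label.
    x = Flatten()(x)
    x = BatchNormalization()(x)
    x = Activation(swish)(x)
    x = Dense(num_labels)(x)
    out = Activation('sigmoid')(x)
    model = Model(inputs=inp, outputs=out)
    model.summary()
    return model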

A few important highlights

The original solution had an additional additive skip connection, which I removed. As I said, I was looking for minimal code, and this addition was not essential.
I added the WeightNorm. Hey, free performance optimization...
The network is sensitive to dimension sizes for some reason...
I stared at the pooling layers for a while, trying to understand why this works. I have a few ideas, but nothing concrete...
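On the subject of dimension sizes, it can help to trace by hand how the tensor shape changes through the Keras model above. The sketch below is a plain-Python calculation, not part of the model; the helper names are made up, and it assumes the Keras defaults: Conv1D reads its input as (steps, channels), 'same' padding keeps the number of steps, and pooling layers use 'valid' padding.

```python
def pool_output_length(length, pool_size, strides=None):
    """Output length of a pooling layer with 'valid' padding."""
    strides = strides or pool_size  # Keras defaults strides to pool_size
    return (length - pool_size) // strides + 1


def trace_shapes(dense_units=4096, channels=16, filters=16):
    """Return (stage, steps, channels) tuples for the model's main stages."""
    steps = dense_units // channels              # Reshape((256, 16))
    trace = [("reshape", steps, channels)]
    trace.append(("conv k=5, same", steps, filters))
    steps = pool_output_length(steps, pool_size=2)
    trace.append(("average pool 2", steps, filters))
    # Two 'same' convolutions keep the shape, so Multiply() sees equal shapes.
    trace.append(("conv k=3 x2 + multiply", steps, filters))
    steps = pool_output_length(steps, pool_size=4, strides=2)
    trace.append(("max pool 4, stride 2", steps, filters))
    trace.append(("flatten", steps * filters, 1))
    return trace


for stage, steps, chans in trace_shapes():
    print(f"{stage:<24} {steps:>5} x {chans}")
```

With the sizes used in the code, the sequence shrinks from 256 steps to 128 and then to 63, so the final Dense layer sees 63 * 16 = 1008 values. The Multiply layer only works because both of its inputs have exactly the same shape, which is one place where changing a dimension carelessly would break the network.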

Does it also work outside of this competition?

Let's get our hands dirty and see what potential this network has. For this experiment, we'll use another Kaggle competition dataset: Santander customer transaction prediction.

It is a binary classification problem in which we need to predict whether a customer will make a future transaction, based on an anonymized set of user behavior features.

We'll compare 3 models:

MLP: a standard network with 3 fully connected layers
1DCNN: our new network here
LightGBM: was one of the best performing models in that competition (and in the world, in general).

The result:
The convolutional network performed better than the MLP and scored close to LightGBM. With a bit of parameter tuning, the results would probably have been even better.

Conclusions:
I have no idea anymore why Deep Learning works.

The generalization capabilities of convolutional networks make them the undisputed winner in almost every computer vision task. It turns out we can use CNNs for tables too. The brilliant part here is adding a fully connected layer before the convolution. It is a simple but effective idea, which requires some tricks.

For references and additional reading: