
Combinatorial algorithm

The most computationally expensive step of the algorithm is certainly the couples evaluation. From a computational point of view this step requires O(N^2) operations for the full set of combinations. Since we also want to perform an internal Leave-One-Out cross validation to estimate the couple performances, a further O(S) factor has to be added to the algorithmic complexity (roughly N(N-1)/2 × S classifier evaluations in total, for N variables and S samples). Let us focus on some preliminary considerations before discussing the proposed implementation:

With these ideas in mind, we have written a C++ code able to optimize this step in a multi-threading environment, aiming to test its scalability on multi-core machines.

Starting from the first point discussed above, we chose to implement the full code inside a single main function, with the help of only one custom SoA (Structure of Arrays) object and one external function (the sorting algorithm discussed in the next section). This allows us to enclose the code in a single parallel section, reducing the time spent in thread spawning. We chose to import the data from file in sequential mode, since in our simulations the I/O was not significantly affected by parallel optimizations.
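The snippet below sketches this skeleton under simple assumptions (the SoA struct Data, the read_data function and the file layout are illustrative placeholders, not the actual thesis code): the data are read sequentially before the threads are spawned, and a single parallel region encloses all the subsequent work-sharing loops.

#include <omp.h>
#include <vector>
#include <string>
#include <fstream>
#include <sstream>

// Structure of Arrays (SoA): each quantity lives in its own contiguous buffer.
// All identifiers are illustrative and do not correspond to the thesis code.
struct Data
{
  int N = 0, S = 0;            // number of features (rows) and samples (columns)
  std::vector<int>   labels;   // class label of each sample (file header)
  std::vector<float> values;   // raw data in feature-major layout, size N * S
  std::vector<float> sum;      // per-feature sum of the values
  std::vector<float> sum_sq;   // per-feature sum of the squared values
};

// Sequential I/O: reading the tab-separated file gains nothing from threads,
// so it is kept outside the parallel region.
Data read_data (const char * filename)
{
  Data d;
  std::ifstream is (filename);
  std::string line;

  std::getline (is, line);                      // header row: sample class labels
  std::istringstream header (line);
  for (float lbl; header >> lbl; ) d.labels.push_back (static_cast<int>(lbl));
  d.S = static_cast<int>(d.labels.size ());

  while (std::getline (is, line))               // one row per feature
  {
    std::istringstream row (line);
    for (float x; row >> x; ) d.values.push_back (x);
    ++ d.N;
  }

  d.sum.resize (d.N, 0.f);
  d.sum_sq.resize (d.N, 0.f);
  return d;
}

int main (int argc, char ** argv)
{
  Data data = read_data (argv[1]);

  // a single parallel region wraps the whole computation, so the pool of
  // threads is spawned only once; the work-sharing (omp for) loops of the
  // following snippets live inside this block
  #pragma omp parallel shared (data)
  {
    // 1) per-feature statistics on the full dataset
    // 2) couples evaluation with the internal Leave-One-Out
  }

  return 0;
}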

Following the instructions suggested in Appendix A - Numerical Implementation, we compute the statistical quantities on the full set of data before starting the couples evaluation. Taking a look at the two equivalent formulations of the variance

\[
\sigma^2 = \frac{1}{S}\sum_{k=1}^{S} \left( x_k - \bar{x} \right)^2 , \qquad \bar{x} = \frac{1}{S}\sum_{k=1}^{S} x_k
\]

\[
\sigma^2 = \frac{1}{S}\sum_{k=1}^{S} x_k^2 - \left( \frac{1}{S}\sum_{k=1}^{S} x_k \right)^2
\]

we can see that the first equation involves the computation of the mean as a simple sum of elements, followed by a large number of subtractions that are numerically unstable in presence of data outliers (all the more so because they are squared). The better choice, in this case, is given by the second formula, which allows us to compute both quantities (the sum of the values and the sum of their squares) inside a single parallel loop1. At each cross validation step we reuse the two pre-computed sums, removing only the data point excluded by the Leave-One-Out. Another precaution to take into account is to add a small epsilon to the variance before its use in the denominator of the classifier function, to prevent numerical underflow.
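The following sketch illustrates this strategy (the functions accumulate_sums and loo_stats and their signatures are illustrative assumptions, not the thesis code): the two sums are accumulated in a single pass over the samples, each leave-one-out fold recovers mean and variance of a feature through a single subtraction per sum, and a small epsilon is added before the variance is used as denominator.

#include <omp.h>
#include <vector>

// single-pass accumulation of the two sums needed by mean and variance, one
// pair of sums per feature (float-only, see note 1); illustrative identifiers
void accumulate_sums (const std::vector<float> & values,   // feature-major, N x S
                      const int N, const int S,
                      std::vector<float> & sum,
                      std::vector<float> & sum_sq)
{
  // in the real code this would be a bare 'omp for' inside the single
  // parallel region; 'parallel for' keeps the snippet self-contained
  #pragma omp parallel for
  for (int i = 0; i < N; ++i)
  {
    float s = 0.f, s2 = 0.f;
    for (int k = 0; k < S; ++k)
    {
      const float x = values[i * S + k];
      s  += x;       // running sum of x
      s2 += x * x;   // running sum of x^2, accumulated in the same loop
    }
    sum[i]    = s;
    sum_sq[i] = s2;
  }
}

// mean and variance of feature i when sample 'out' is left out: the
// pre-computed sums are corrected with a single subtraction each, and a
// small epsilon protects the variance used as denominator in the classifier
inline void loo_stats (const std::vector<float> & values,
                       const int S, const int i, const int out,
                       const std::vector<float> & sum,
                       const std::vector<float> & sum_sq,
                       float & mean, float & var,
                       const float epsilon = 1e-7f)
{
  const float x = values[i * S + out];
  mean = (sum[i] - x) / (S - 1);
  var  = (sum_sq[i] - x * x) / (S - 1) - mean * mean + epsilon;
}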

The set of variable pairs can be obtained with two nested for loops in C++, and a naive optimization is simply obtained by restricting the iterations to the triangular indexes of the full matrix (by definition the score of the couple (i, j) is equal to the score of (j, i)). This precaution easily allows the parallelization of the external loop and drastically reduces the number of iterations, but it also creates a dependency between the two iteration variables. The newer releases of the OpenMP library [OpenMP] 2 (from OpenMP 4.5) introduce a keyword that allows the collapsing of nested for loops into a single one (whose number of iterations is given by the product of the single dimensions), with the only requirement that the iteration variables be completely independent. So, the best strategy in this case is to perform the full set of N^2 iterations with a single collapse clause on the external loop 3.
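A minimal sketch of this loop structure is given below (the function evaluate_couples and the couple_score callable are illustrative placeholders, the latter standing for the couple evaluation with its internal Leave-One-Out):

#include <omp.h>
#include <vector>
#include <functional>
#include <cstddef>

// collapsed double loop over the full N x N grid (illustrative sketch): the
// two iteration variables are completely independent, as required by the
// collapse clause, and the redundant half of the grid is skipped by an if
std::vector<float> evaluate_couples (const int N,
                                     const std::function<float (int, int)> & couple_score)
{
  std::vector<float> score (static_cast<std::size_t>(N) * N, 0.f);

  #pragma omp parallel for collapse(2)
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
    {
      if (j <= i) continue;   // keep only the couples with j > i,
                              // since score(i, j) == score(j, i)

      score[static_cast<std::size_t>(i) * N + j] = couple_score (i, j);
    }

  return score;
}

The iterations skipped by the if condition cost a single comparison, which is negligible with respect to the Leave-One-Out evaluation performed for each valid couple; in the real code this loop would again be a work-sharing omp for collapse(2) inside the single parallel region described above.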

import pandas as pd
import itertools
import multiprocessing
from functools import partial

from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import LeaveOneOut, cross_val_score

def couple_evaluation (couple, data, labels):
  f1, f2 = couple

  # each row of 'data' is a feature: select the two features of the couple
  samples = data.iloc[[f1, f2]]
  score = cross_val_score(GaussianNB(), samples.T, labels,
                          cv=LeaveOneOut(), n_jobs=1).mean() # nested parallel loops are not allowed

  return (f1, f2, score)

def read_data (filename):
  # the header of the tab-separated file stores the class label of each sample
  data = pd.read_csv(filename, sep='\t', header=0)
  labels = data.columns.astype('float').astype('int')
  data.columns = labels

  return (data, labels)

if __name__ == '__main__':

  filename = 'data.txt'

  data, labels = read_data(filename)

  Nfeature, Nsample = data.shape

  couples = itertools.combinations(range(0, Nfeature), 2)
  couples_eval = partial(couple_evaluation, data=data, labels=labels)

  nth = multiprocessing.cpu_count()

  with multiprocessing.Pool(nth) as pool:
    f1, f2, score = zip(*pool.map(couples_eval, couples))

In this section we also provide an “equivalent” Python implementation based on common machine learning libraries and parallel facilities (see the code above). In the next sections we will compare the computational performance of this naive implementation with that of the C++ one discussed above.



  1. To facilitate the SIMD optimization the code is written using only float (single precision) and integer variables. This precaution takes into account the register alignment inside the loops and facilitates the compile-time optimizer.

  2. The OpenMP library is the most common non-standard library for C++ multi-threading applications. 

  3. Obviously, the iterations in which the inner loop variable is lower than the outer one are skipped by an if condition.