Normalization

One of the common tasks in data processing is normalization. Running an algorithm on the normalized data may, depending on the particular algorithm, substantially increase its performance.

There are several ways to normalize the data, in particular by computing its standard score ("z-value"), defined as

z = \frac{x - \mu}{\sigma},

where \mu is the mean and \sigma is the standard deviation of the population.

A dataset can be thought of as a set of observations, each having the same number of properties (features), where each feature is drawn from a particular probability distribution. To normalize the data is then to find the mean and the standard deviation of each feature (usually a column, if the data is given in matrix form) and, for each cell of each row, to subtract the mean of its feature and divide the result by the standard deviation of that feature. Each feature of the resulting data will thus have zero mean and a standard deviation equal to one.
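As a small worked example of the formula, the following Python sketch (not emergent css; the function name z_scores and the sample values are made up for illustration) computes the standard scores of a single feature using the population standard deviation:

```python
import math
import statistics

def z_scores(values):
    """Return the standard scores of a list of numbers (population sigma)."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values, mu)
    if sigma == 0:
        raise ValueError("cannot normalize a constant feature")
    return [(x - mu) / sigma for x in values]

# Illustrative feature: mean 5, population standard deviation 2.
feature = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = z_scores(feature)
print(z)
```

After the transformation the values have mean 0 and standard deviation 1, which is the property the rest of this page relies on.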

Overview

The algorithm may be summarized as follows:

  1. Duplicate the initial datatable.
  2. For each column (feature), calculate its mean and standard deviation.
  3. Loop over rows and modify the data according to the formula using \mu and \sigma from the previous step.
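The three steps above can be sketched outside emergent as follows. This is a minimal Python illustration, assuming the table is a plain list of rows with no constant columns; normalize_table and the sample data are hypothetical names and values, not part of emergent:

```python
import copy
import statistics

def normalize_table(table):
    """Return a column-wise normalized copy of a list-of-rows table."""
    result = copy.deepcopy(table)        # step 1: duplicate the table
    columns = list(zip(*table))          # view the data column by column
    # Step 2: mean and (population) standard deviation of each column.
    means = [statistics.fmean(c) for c in columns]
    sds = [statistics.pstdev(c, m) for c, m in zip(columns, means)]
    # Step 3: loop over rows and apply z = (x - mu) / sigma cell by cell.
    # A constant column (sigma == 0) would raise ZeroDivisionError here.
    for row in result:
        for j, value in enumerate(row):
            row[j] = (value - means[j]) / sds[j]
    return result

source = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
normalized = normalize_table(source)
print(normalized)
```

Working on a deep copy mirrors the first step of the summary: the source table is left untouched, so it can be reused or compared against the normalized result.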

Program Elements

Although normalization is conceptually very simple and its implementation in css is straightforward, the details are not immediately obvious, especially to novice users.

Copying the Data

Instead of modifying the source datatable directly, we operate on a copy of it, leaving the original data intact. To create a copy of a datatable, the

taDataProc::CopyData()

method is used. Needless to say, both the source and destination tables should already exist and be referenced in the program's vars section.

Calculating Mean and SD

Mean and standard deviation can be calculated using the following methods:

  • mean:
    float taMath_float::vec_mean(const float_Matrix* vec)
    
  • standard deviation:
    float taMath_float::vec_std_dev(const float_Matrix* vec, float mean = 0, bool use_mean = false, bool use_est = false)
    

We want to call them on each column of the dataset. This can be done using a simple for loop:

for (col = 0; col < shape(data["Input"].ar)[0]; col++) {}

Here we use shape() to get the geometry of the matrix, assuming that the datatable is referenced by the data pointer and the required data is in the Input column. Note the use of .ar to get the actual matrix. Note also that the number of columns is the first element of the matrix returned by shape(); this dimension order differs from the one used in, say, Python or Matlab. For details, see css list comprehension, DataTable css, and Matrix css.

Now that we have col to denote the current column number, it is possible to call each of the vec_mean and vec_std_dev methods with the appropriate matrix obtained with

data["Input"].ar[col,:,:]

The index is written as ar[col,:,:] (rather than ar[col,:]) because the data is 2-dimensional.

The mean and standard deviation values returned by the above two methods should be stored in temporary 1 \times 8 matrices (one value per column; the example dataset has 8 columns). First we create a new variable with min type of float_Matrix and then set it to point to a new float_Matrix. Additionally, since each of the matrices will be referenced in the loop with explicit indices, their geometries should be set in advance with the

bool taMatrix::SetGeom()

call.

Changing Data

Again, a simple for loop is sufficient to iterate over all rows of the data and set the values according to the formula:

for (row = 0; row < data.rows; row++) {}

Here data.rows gives the total number of rows.

We cannot set the rows directly using syntax like data["Input"].ar[:,row,:] -= data_mean, because this produces a geometry mismatch error. What we can do, however, is get the values of each row and then set the modified values back using two special methods:

taMatrix* DataTable::GetValAsMatrix(const Variant& col, int row)

to get the data and

bool DataTable::SetValAsMatrix(const taMatrix* val, const Variant& col, int row)

to set it to the new value. See Matrix from DataTable for details.

Quirks

The standard deviation obtained with the call

taMath_float::vec_std_dev(data["Input"].ar[col,:,:]);

is internally the square root of

float taMath_float::vec_var();

which in fact returns a value that is smaller than the true variance by a factor equal to the number of columns (possibly due to a bug in the vector size calculation). The true standard deviation is therefore the value returned by vec_std_dev() multiplied by the square root of the number of columns:

data_sd * sqrt(data_col_n)
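The arithmetic behind this correction can be checked with a short Python sketch. The function shrunken_variance is a hypothetical stand-in that only mimics the behaviour described above (a variance too small by a factor k); it is not emergent's actual implementation, and the sample values are made up:

```python
import math
import statistics

def shrunken_variance(values, k):
    """Hypothetical stand-in: a population variance that is k times too small."""
    return statistics.pvariance(values) / k

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
k = 8  # number of columns in the example dataset

reported_sd = math.sqrt(shrunken_variance(values, k))  # too small by sqrt(k)
corrected_sd = reported_sd * math.sqrt(k)              # the correction used above
true_sd = statistics.pstdev(values)

print(reported_sd, corrected_sd, true_sd)
```

Because the standard deviation is the square root of the variance, a variance that is k times too small gives a standard deviation that is sqrt(k) times too small, which is why the correction multiplies by sqrt(k) rather than by k.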

Final CSS code

// Normalize
/* globals added to hardvars:
Program::RunState run_state; // our program's run state
int ret_val;
// vars: global (non-parameter) variables
DataTable* src_data; 
DataTable* data; 
*/
void __Init() {
  // init_from vars
  // run our init code
}
void __Prog() {
  // init_from vars
  // prog_code
  taDataProc::CopyData(data, src_data);
  // local variables
  int data_col_n; data_col_n = 0;
  float_Matrix* data_mean; data_mean = NULL;
  float_Matrix* data_sd; data_sd = NULL;
  float_Matrix* data_temp; data_temp = NULL;
  double mean; mean = 0;
  double sd; sd = 0;
  int col; col = 0;
  int row; row = 0;
  data_col_n = shape(data["Input"].ar)[0];
  data_mean = new float_Matrix;
  data_sd = new float_Matrix;
  data_temp = new float_Matrix;
  data_mean->SetGeom(2, data_col_n, 1);
  data_sd->SetGeom(2, data_col_n, 1);
  for(col = 0; col < data_col_n; col++) {
    mean = taMath_float::vec_mean(data["Input"].ar[col,:,:]);
    data_mean[col, 0] = mean;
    sd = taMath_float::vec_std_dev(data["Input"].ar[col,:,:]);
    data_sd[col, 0] = sd;
  }
  data_sd = data_sd * sqrt(data_col_n);
  for(row = 0; row < data.rows; row++) {
    data_temp = data.GetValAsMatrixColName("Input", row);
    data.SetValAsMatrix(data_temp - data_mean, "Input", row);
  }
  for(row = 0; row < data.rows; row++) {
    data_temp = data.GetValAsMatrixColName("Input", row);
    data.SetValAsMatrix(data_temp / data_sd, "Input", row);
  }
  /*******************************************************************
  // Check in console with
  // > print taMath_float::vec_mean(.NormalizedData["Input"].ar[0,:,:])
  // > (Real) = 2.43696e-08
  // > print taMath_float::vec_std_dev(.NormalizedData["Input"].ar[0,:,:]) * sqrt(8)
  // > (Real) = 0.999999
  *******************************************************************/
  StopCheck(); // process pending events, including Stop and Step events
}

Checking correctness

We can check that the new data values are indeed correct in two ways: by computing their mean and standard deviation (which should be 0 and 1, respectively), and by computing the mean and the standard deviation of the original data in another program, such as R, and comparing the values with the ones computed by emergent.

To get the statistics in R, one would do something like:

> pima = read.table("pima-indians-diabetes.data", sep=',')
> colMeans(pima)
         V1          V2          V3          V4          V5          V6
  3.8450521 120.8945312  69.1054688  20.5364583  79.7994792  31.9925781
         V7          V8          V9
  0.4718763  33.2408854   0.3489583
> sd(pima)
         V1          V2          V3          V4          V5          V6
  3.3695781  31.9726182  19.3558072  15.9522176 115.2440024   7.8841603
         V7          V8          V9
  0.3313286  11.7602315   0.4769514

In the emergent console, the corresponding values for the first column are obtained with:

emergent> print taMath_float::vec_mean(.StdInputData["Input"].ar[0,:,:])
emergent> (Real)  = 3.84505
emergent> print taMath_float::vec_std_dev(.StdInputData["Input"].ar[0,:,:]) * sqrt(8)
emergent> (Real)  = 3.36739

The two sets of values agree closely, so at least for the first column the computations are correct.
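The standard deviations from R and emergent differ slightly in the third decimal place (3.36958 versus 3.36739). One possible explanation, offered here as an assumption rather than a verified fact about emergent, is that R's sd() uses the n - 1 (sample) denominator while the emergent value corresponds to the n (population) denominator. The Python sketch below converts R's value accordingly, assuming the dataset has 768 rows (a figure not stated on this page):

```python
import math

n = 768                # assumed number of rows in the Pima dataset
r_sd = 3.3695781       # sd of column V1 as printed by R (n - 1 denominator)
emergent_sd = 3.36739  # corrected value printed by emergent

# Convert a sample standard deviation into a population standard deviation.
population_sd = r_sd * math.sqrt((n - 1) / n)

print(population_sd, emergent_sd)
```

Under these assumptions the converted value comes out at about 3.3674, which matches the emergent figure to the precision printed.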

One final check is to see whether the mean of the normalized data is zero and its standard deviation is one:

emergent> print taMath_float::vec_mean(.NormalizedData["Input"].ar[0,:,:])
emergent> (Real)  = 2.43696e-08
emergent> print taMath_float::vec_std_dev(.NormalizedData["Input"].ar[0,:,:]) * sqrt(8)
emergent> (Real)  = 0.999999

Demonstration with pima dataset

As has been noted, running certain algorithms on normalized data can increase their overall performance. For a quick demonstration, see the tutorial on Linear Classification with SLP.