Normalization
One of the common tasks in data processing is normalization. Running an algorithm on normalized data may, depending on the particular algorithm, substantially improve its performance.
There are several ways to normalize data, in particular by computing its standard score ("z-value"), defined as

z = (x − μ) / σ,

where μ is the mean and σ is the standard deviation of the population.
A dataset can be thought of as a set of observations, each having a common number of properties (features), where each feature is drawn from a particular probability distribution. To normalize the data is then to find the mean and standard deviation of each feature (usually a column, if the data is given in matrix form) and, for each cell of each row, to subtract the mean and divide by the standard deviation of the appropriate feature. Each feature of the resulting data will thus have zero mean and a standard deviation equal to one.
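For illustration (in Python with NumPy rather than css; the numbers here are ours), standardizing a single feature looks like this:

```python
import numpy as np

# one feature (column) of a dataset
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = x.mean()           # population mean: 5.0
sigma = x.std()         # population standard deviation: 2.0
z = (x - mu) / sigma    # standard scores

# the standardized feature has zero mean and unit standard deviation
print(z.mean(), z.std())
```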
Overview
The algorithm may be summarized as follows:
 Duplicate the initial datatable.
 For each column (feature), calculate its mean and standard deviation.
 Loop over rows and modify the data according to the formula, using the mean and standard deviation from the previous step.
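The steps above can be sketched outside css as well; here is a minimal Python/NumPy equivalent, with a plain 2-D array standing in for the datatable:

```python
import numpy as np

src = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 30.0],
                [4.0, 40.0]])

data = src.copy()              # 1. duplicate the initial table
mu = data.mean(axis=0)         # 2. per-column means
sd = data.std(axis=0)          # 2. per-column standard deviations
for row in range(data.shape[0]):
    data[row] = (data[row] - mu) / sd   # 3. apply the formula row by row

# each column of data now has zero mean and unit standard deviation
```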
Program Elements
Although normalization is conceptually very simple, its implementation in css, while straightforward, is not immediately obvious (especially to novice users).
Copying the Data
Instead of modifying the source datatable directly, we operate on its copy, leaving the original data intact. To create a copy of a datatable,

taDataProc::CopyData()
method call is used. Needless to say, both the source and destination tables should already exist and be referenced in the program's vars section.
Calculating Mean and SD
Mean and standard deviation can be calculated using the following methods:
 mean:
float taMath_float::vec_mean(const float_Matrix* vec)
 standard deviation:
float taMath_float::vec_std_dev(const float_Matrix* vec, float mean = 0, bool use_mean = false, bool use_est = false)
We want to call them on each column of the dataset. This can be done using a simple for loop:

for (col = 0; col < shape(data["Input"].ar)[0]; col++) {}
Here we use shape() to get the geometry of the matrix, assuming that the datatable is referenced with the data pointer and the required data is in the Input column. Note the use of .ar to get the actual matrix; additionally, the number of columns is the first element of the matrix returned by shape(): this dimension order is different from the one used in, say, Python or Matlab. For details, see css list comprehension, DataTable css, Matrix css.
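For comparison, NumPy (like Matlab) puts the number of rows first in a 2-D shape, so the css convention is the reverse; a quick Python check:

```python
import numpy as np

a = np.zeros((3, 5))   # 3 rows, 5 columns
print(a.shape)         # (3, 5): rows come first in NumPy
n_cols = a.shape[1]    # whereas shape(...)[0] plays this role in css
```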
Now that we have col to denote the current column number, it is possible to call each of the vec_mean and vec_std_dev methods with the appropriate matrix obtained with

data["Input"].ar[col,:,:]
The use of ar[col,:,:] (instead of ar[col,:]) is due to the fact that the data is two-dimensional.
Mean and standard deviation values returned by the above two methods should be stored in temporary matrices. First we create a new variable of type float_Matrix and then set it to point to a new float_Matrix. Additionally, since each of the matrices will be referenced in the loop with explicit indices, their geometries should be set in advance with

bool taMatrix::SetGeom()
call.
Changing Data
Again, a simple for loop is sufficient to iterate over all rows of the data and set the values according to the formula:

for (row = 0; row < data.rows; row++) {}
Here the data.rows member is used to get the total number of rows.
We cannot set the rows directly using syntax like data["Input"].ar[:,row,:] = data_mean (this fails with a geometry mismatch error). What we can do, however, is get the values of each row and then set the modified values back, using two special methods:

taMatrix* DataTable::GetValAsMatrix(const Variant& col, int row)
to get the data and

bool DataTable::SetValAsMatrix(const taMatrix* val, const Variant& col, int row)
to set it to the new value. See Matrix from DataTable for details.
Quirks
The standard deviation obtained with the call

taMath_float::vec_std_dev(data["Input"].ar[col,:,:]);
is internally the square root of

float taMath_float::vec_var();
which in fact returns a value that is the number of columns times smaller than the real one (possibly due to a bug in the vector size calculation). So the true standard deviation is the vec_std_dev() value times the square root of the number of columns:

data_sd * sqrt(data_col_n)
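This correction can be checked numerically. The following Python sketch simulates the described behavior (our reconstruction, not emergent's actual code): the variance is divided by an extra factor of the number of columns, and multiplying the resulting standard deviation by sqrt(n_cols) recovers the true value:

```python
import math

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n_cols = 4                          # hypothetical column count

mu = sum(x) / len(x)
ss = sum((v - mu) ** 2 for v in x)  # sum of squared deviations

true_sd = math.sqrt(ss / len(x))                 # correct value: 2.0
buggy_sd = math.sqrt(ss / (len(x) * n_cols))     # variance n_cols times too small

print(buggy_sd * math.sqrt(n_cols))  # recovers the true standard deviation
```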
Final CSS code
// Normalize
/* globals added to hardvars:
Program::RunState run_state; // our program's run state
int ret_val;
// vars: global (nonparameter) variables
DataTable* src_data;
DataTable* data;
*/
void __Init() {
// init_from vars
// run our init code
}
void __Prog() {
// init_from vars
// prog_code
taDataProc::CopyData(data, src_data);
// local variables
int data_col_n; data_col_n = 0;
float_Matrix* data_mean; data_mean = NULL;
float_Matrix* data_sd; data_sd = NULL;
float_Matrix* data_temp; data_temp = NULL;
double mean; mean = 0;
double sd; sd = 0;
int col; col = 0;
int row; row = 0;
data_col_n = shape(data["Input"].ar)[0];
data_mean = new float_Matrix;
data_sd = new float_Matrix;
data_temp = new float_Matrix;
data_mean->SetGeom(2, data_col_n, 1);
data_sd->SetGeom(2, data_col_n, 1);
for(col = 0; col < data_col_n; col++) {
mean = taMath_float::vec_mean(data["Input"].ar[col,:,:]);
data_mean[col, 0] = mean;
sd = taMath_float::vec_std_dev(data["Input"].ar[col,:,:]);
data_sd[col, 0] = sd;
}
data_sd = data_sd * sqrt(data_col_n);
for(row = 0; row < data.rows; row++) {
data_temp = data.GetValAsMatrixColName("Input", row);
data.SetValAsMatrix(data_temp - data_mean, "Input", row);
}
for(row = 0; row < data.rows; row++) {
data_temp = data.GetValAsMatrixColName("Input", row);
data.SetValAsMatrix(data_temp / data_sd, "Input", row);
}
/*******************************************************************
// Check in console with
// > print taMath_float::vec_mean(.NormalizedData["Input"].ar[0,:,:])
// > (Real) = 2.43696e-08
// > print taMath_float::vec_std_dev(.NormalizedData["Input"].ar[0,:,:]) * sqrt(8)
// > (Real) = 0.999999
*******************************************************************/
StopCheck(); // process pending events, including Stop and Step events
}
Checking correctness
We can check that the new data values are indeed correct by computing their mean and standard deviation (which should be 0 and 1, respectively) and by computing the mean and the standard deviation in another program, such as R, and comparing those values with the ones computed by emergent.
To get the statistics in R, one would do something like:
> pima = read.table("pima-indians-diabetes.data", sep=',')
> colMeans(pima)
        V1          V2          V3          V4          V5          V6
 3.8450521 120.8945312  69.1054688  20.5364583  79.7994792  31.9925781
        V7          V8          V9
 0.4718763  33.2408854   0.3489583
> sd(pima)
       V1         V2         V3         V4          V5         V6
3.3695781 31.9726182 19.3558072 15.9522176 115.2440024  7.8841603
       V7         V8         V9
0.3313286 11.7602315  0.4769514
while in the emergent console issuing
emergent> print taMath_float::vec_mean(.StdInputData["Input"].ar[0,:,:])
(Real) = 3.84505
emergent> print taMath_float::vec_std_dev(.StdInputData["Input"].ar[0,:,:]) * sqrt(8)
(Real) = 3.36739
thus convincing ourselves that, at least for the first column, the computations are correct.
One final check is to see whether mean is zero and sd is one:
emergent> print taMath_float::vec_mean(.NormalizedData["Input"].ar[0,:,:])
(Real) = 2.43696e-08
emergent> print taMath_float::vec_std_dev(.NormalizedData["Input"].ar[0,:,:]) * sqrt(8)
(Real) = 0.999999
Demonstration with pima dataset
As has been noted, running certain algorithms on normalized data can increase their overall performance. For a quick demonstration, see the tutorial on Linear Classification with SLP.