DataTable

From emergent
Jump to: navigation, search
Reference info for type DataTable: Wiki | Emergent Help Browser

A Datatable is a tabular data structure that holds data in rows and columns. Features include:

  • Each table has a fixed number of columns; each column holds items of a specific data type.
  • The table can have 0 or more rows of data.
  • New rows data can be appended at the end, or inserted anywhere in the table.
  • Any row or rows can be deleted (temporarily or permanently - see #Flattening the table).
  • #Sorting based on one or more columns
  • #Filtering based on cell contents
  • Datatables can be saved in files, and imported and exported.
  • As of 6.3.2, there is a (hidden) list of row numbers that are used in all access to the data, which enables much more efficient filtering, sorting, etc -- only this row number list needs to be sorted, for example, not all the raw data itself. All access of the data at all levels (even down to the low-level Matrix data that is used for storing each column worth of data) uses these row numbers to access the data, so you cannot tell that they are there -- everything works as before (except faster!), except you can now quickly undo row-level operations like filtering and sorting, which enables you to more rapidly filter and un-filter in the course of exploring and viewing your data.

The Datatable plays a key role in emergent. If you learn to think in terms of datatables, and operations on them, you will find that many things are much easier to do.

See DataTableTips for a few key tips, and DataProc for data processing functions of all sorts -- this page has more systematic documentation of the DataTable itself.

Columns

Each column is of a fixed type (see below); the total number of rows is the same for all columns. A row-column location is called a cell. Each cell can be either a single value (a #Scalar Column), or multiple values (a #Matrix Column). Internally, all data is stored as a Matrix object, with the outer-most dimension of the matrix being the row number (e.g., a scalar column is a 1D matrix with the dimension index specifying the row number).

Scalar Column

The most common type of column is a scalar column; in this column, each cell holds a single value. A datatable of all scalar columns would be similar to a spreadsheet or a database table.

Matrix Column

A matrix column has cells that are a matrix of some basic type. Each cell in the column has exactly the same dimensions.

Matrix columns are especially useful for representing input or output patterns of network layers; in this case, the cell typically has a 2d geometry that is the same as that of the layer. (If you are using layer groups, then you can use either 2d or 4d cells.) Internally, if a matrix column has 2d cells, then the overall matrix is 3d, with the 3rd dimension being the rows.

In Programs, you can access the entire matrix associated with a column through the member .ar or the method .AR() (stands for "array" because historically these were stored as arrays instead of matrices).

Tabular data types

All the values in a column are of a single type. Below are listed the available data types for columns.

Int

An int is a 32-bit integral value that can store numbers from −2,147,483,648 to +2,147,483,647.

Int is a good choice when dealing with counts, or indexes.

Float

A float is a (32-bit) floating point number that can represent about 7 significant digits of precision, with a magnitude between about 10^-38 to 10^38.

Float is a good choice when dealing with any kind of calculated or statistical value, or real-valued parameters and values.

Double

A double is a (64-bit) floating point number that can represent about 15 significant digits of precision, with a magnitude between about 10^-308 to 10^308.

Double may be a good choice when dealing with numbers that require extended accuracy, or whose values could be extremely large or small. Note that computational operations on doubles are significantly slower than on floats.

String

A String can hold a textual value (in ANSI encoding) of any practical length (up to available memory.)

Byte

A Byte holds a small unsigned 8-bit number from 0 to 255. (This type is not often used, and is primarily provided to efficiently store large image arrays.)

Variant (any type)

A Variant is a flexible type that can store a value of any other type. Each Variant keeps track of the type of the value, and the value itself. Although Variants are flexible, they take more memory than the specific types. In general, you should use a specific type unless you have a reason to require a Variant. In addition to being able to store any of the types above, Variants can also store 64-bit signed and unsigned integers, so if you need that data type, then use a Variant column. (But note that Double can often be used to represent very large integral values due to its large dynamic range.)

Creating and Modifying Datatables

To create a datatable

  1. select the data group in which you want to create the table
  2. right click, and select New
  3. click Ok
  4. select the new table, and create columns

To add a column

  1. select the datatable
  2. right click, and select Columns/NewCol (or NewColMatrix for matrix col)
  3. choose the data type for the column
  4. enter a name (see Object Naming for rules)

To delete a column

  1. select the column in the tree view or the column panel
  2. right click, and select Delete

To change the data type of a column

  1. select the column in the tree view or the column panel
  2. right click, and select Columns/Change Col Type
  3. select the new data type

Emergent will preserve data where it can, such as when changing from float to double.

To change a column from scalar to matrix, or change the matrix geometry

  1. select the column in the tree view or the column panel
  2. right click, and select Columns/Change Col Cell Geom
  3. enter the new geometry

Emergent will preserve data where it can; but if you add or remove whole dimensions, the resulting position of the data is undefined. Note that you cannot change both the type and matrix geometry in one step.

To move a column

You can change the order of columns. The easiest way is by drag and drop

  1. select the column you wish to move
  2. hold down the mouse button and drag the column to the new position you would like (if you want it at the end, drag it onto the table itself)
  3. release the mouse; click Move Here (or if moving to end, when on table, click Move Into)

Sorting & Filtering the View

Sometimes you don't want to see all rows or you want to reorder them.

Sorting

  • from the column context menu you can sort rows in either ascending or descending order
  • from the DataProc menu above the DataTable you can access a dialog to sort based on up to 6 columns
  • you can "undo" to get back to the past ordering or to return to the original table sort select "Show All Rows" from the DataProc menu

Filtering

  • from the column context menu you can filter based on the cell values of the selected column
  • from the DataProc menu you can access a dialog to filter based on cell values of up to 3 columns

ShowAllRows

  • to reveal rows hidden by a filtering operation (or any other row-level operation) select "Show All Rows" from the DataProc menu.
  • this will revert to the full raw data stored in the table -- it is not an incremental undo of just the last operation -- to achieve that, it is better to do a standard undo.
  • to enable quick exploratory filtering and then showing all rows, just do a Flatten operation first -- that sets the full raw data to be identical to what you see in the view, and thus this is what the Show All Rows will return you to.

Flattening

When rows are filtered from view or specifically deleted they are really just hidden. If you want to permanently remove the hidden rows you need to "Flatten" the table. The Flatten menu item is in the DataProc menu. If you want to keep the original table as well as the new view you can make a copy before you call Flatten. Flatten also makes the current sort order the new permanent row order.

Adding and changing data

Emergent provides many ways to add and modify data in Datatables. You can:

  • edit the data manually, using the table editor
  • import or paste data from an external source, such as a spreadsheet
  • use a DataProc operation
  • write a Program to generate the data

Edit data using the data editor

The data editor in the Edit Panel lets you:

  • append, insert, or delete a row or rows
  • edit the data value in cells, or in matrix cells
  • copy or paste the data to/from the clipboard

To add (or insert) rows

  1. click on an existing row or rows (the left-hand buttons beside each row of data) - the number of rows you select will be the number of rows added (inserted)
  2. right click, and select Append Rows (or Insert Rows)

The rows will be appended (or inserted at the selection point).

You can also add rows from the table context menu or edit panel menu: Rows/Add (or Insert) Rows.

To edit cells

  1. click in the cell whose value you wish to change - for matrix cells, you must first click the matrix cell, then you can use the matrix cell editor at the bottom of the panel to edit the matrix cell value you want
  2. type the new value
  3. click or tab to a new cell, to save your changes to the table

To delete rows

  1. select the row or rows you wish to delete
  2. right click, and select Delete Rows

The rows will be deleted.

You can also delete rows from the table context menu or edit panel menu: Rows/Delete Rows.

Copy and paste data

You can copy and paste data between emergent datatables, emergent matrices, and many external programs that support tabular data, such as spreadsheets (Excel, OpenOffice Calc, etc.).

  1. select the block of data you wish to copy in the source datatable, program, or matrix
  2. in the destination datatable, click on the upper left cell of the area in which you want to paste
  3. right click, and select Paste

If you copy more cells than can fit in the area you select, the remainder will be ignored.

When you copy data from a range of cells that includes matrix columns, emergent treats each row as having a number of subrows equal to the largest matrix cell; when copying from a datatable, "missing" values in smaller cells are blank; when copying to a datatable, data corresponding to these "missing" locations is ignored. To avoid confusion, you may wish to operate on such data one column at a time.

Automatically calculate cell values

Cell values can be calculated based on the value in another column. For example, you might want to create a column that is the mean of two other columns. To do this select the column that will have the calculated values and on its property page check the box "calc" and then enter the expression in the "calc expr" field. To get the average of values in columns named "one" and "two" you would enter

(one + two)/2

column names are not quoted and don't add an "=".

There is a checkbox on the DataTable properties labeled "Auto Calc" which controls calculation for the entire table. If you want to turn off auto calculation (perhaps when you are generating a large table running a model) you can turn it off and then either turn it on later or click the "Calc All Rows" button at the bottom of the panel.

For more on this feature see DataTut_DataCalcLoopDoc

Specialized Data Processing Operations

There are several collections of data processing operations that can be performed on data tables -- see DataProc for details.

You can also write a small Program to generate or modify data. You can access the operations available interactively via the data_proc collection in your program. In addition, the GSL (Gnu Scientific Library) is available, as well as many built-in statistical and analytical operations.

Graphical Views

See GridTableView and GraphTableView for the "grid view" and "graph view" of the DataTable contents.

Local Read and Write Index Iterator Values

DataTables contain their own read and write index values, which can be set by ReadItem and WriteItem methods, and accessed by ReadIndex and WriteIndex. This allows a data table to have its own "current row" value that persists across different Programs, for example. This is used in the application of InputData to a Network, for example.

Accessing Matrix Data within DataTables

Saving, Loading, and Autoloading DataTables

Emergent DataTables can be saved and loaded in a special native emergent format that preserves all column type information and supports emergent matrix columns. The tables can also be automatically compressed/uncompressed to save space for large tables. When working with data exclusively in emergent, this is the preferred format.

When you Load a previously saved data into a DataTable, emergent automatically makes sure the proper columns exist and in the case of matrix columns, have the correct dimensions. **This can result in a change to your existing table schema.** Note that columns that exist in your DataTable that do not exist in the load file are not deleted -- they will remain; if you are reloading the data, the values in these columns will be the default empty value (0, blank, etc.); if you are appending data, the existing rows will retain their values and the new rows will contain empty values.

Only columns with the SAVE DATA flag checked (default is checked) will be saved during a Save (or Export) operation.

To Save Data

  1. from the Data menu choose Save Data
  2. select or enter the filename to which you want to save -- check the Compress box if you want the file automatically compressed/decompressed (an .gz extension will automatically be appended)

The table will be saved to the file.

To Save/Append Data

You can save a DataTable to the end of an existing save file, instead of replacing the contents of that file. The file must already have been created by the Save Data or Save Headers for the current DataTable being appended.

  1. from the Data menu choose Save Append Data
  2. select or enter the filename to which you want to append -- check the Compress box if you want the file automatically compressed/decompressed (an .gz extension will automatically be appended)

The table data will be added to the end of the file.

To Load Data

  1. from the Data menu choose Load Data
  2. select the emergent .dat or .dat.gz (compressed) file you want to load
  3. if you want the data to replace the current data in the table, check "remove_rows", otherwise the data will be appended to the end

The data will be loaded into the table.

Once you have saved a DataTable to an emergent .dat file, you can set the DataTable to automatically load that data on project open, and no longer save the data in the project file itself. This is useful for larger data files that do not change often, and/or when you want to share the same data across several projects. You can use absolute or relative filename paths.

To Autoload Data

First, save the data to an emergent data file.

  1. select the Properties sub panel in the DataTable properties window
  2. uncheck the SAVE ROWS flag, the "auto load" option box becomes available
  3. in "auto load" choose AUTO LOAD
  4. fill in the relative or absolute file path in the "auto load filename" box -- relative paths are generally preferred (i.e. in same folder: ex. "MyData.dat", or a super folder: ex. "../MyData.dat" or subfolder: ex. "data/MyData.dat" from project) so that users on different computers and operating systems can share the project
  5. save the project; it is a good idea to close the project, then reopen it to confirm the filename is correct and the data is being loaded -- you should see the data in your table after the project is reopened

You can also set the SAVE FILE flag if you want the saved table to be automatically updated -- when this flag is set, every time you save your project the DataTable contents are automatically resaved to the autoload file. RESTRICTION: auto saving only works with .dat emergent files, and on save, the default formats (TAB, quoting) are used, even if the file had been created with different options.

The autoload file is not restricted to emergent .dat files -- you can also specify a generic delimited text file (see Importing/Exporting below); it must have a header line -- emergent format is generally preferred however since it has detailed schema information for column types and matrix column geometry not present in generic files. But if the table schema will not change, and there are reasons for sharing a generic delimited text file, this solution can be appropriate.

Saving/Loading in a Program

  • for Save, use the DataTable::SaveData() routine
  • for Load, use the DataTable::LoadDataEx() routine -- it looks at the file and automatically does a Load for emergent files and Import for non-emergent files; it also determines the appropriate delimiter automatically
  • if you know the file is an emergent file, you may also use the DataTable::LoadData() routine, but you must set the "delim" and "quote_str" parameters correctly for the file (generally TAB, "true") -- using this routine will continue to work in existing code

If you want the user to be interactively prompted for the appropriate file name, then leave the filename parameter blank (""). This will (obviously) only work when emergent is running in gui mode.

Importing and Exporting Data

You can import and export data using standard delimited files, such as tab or comma delimited files, or import fixed width files. (At present, there is no GUI facility to import fixed width files -- you must use a small Program; there is presently no facility to export fixed width files.)

The import/export formats do not contain the detailed column type or matrix geometry that the native emergent format contains; however, these files are more convenient for working with other systems that may not support emergent's extended character set and column name decorations.

Scalar columns are stored as one would expect: the exported column name is the DataTable column name. Matrix data is saved one column per cell, in ascending row-major order, with the columns named like this: <column name>_<cell index>.

You can import or export with or without a header line. When importing without a header line, the structure of the data file must exactly match the structure of the emergent DataTable, up to the last column present in the file. You can skip columns by unchecking the SAVE DATA flag for the column. Matrix data must be one value per delimited column, in ascending row-major order.

For example, if you have data for training a standard supervised Network, typically the data table has a Name string column (name of pattern), then an Input float matrix column (input pattern) and then an Output float matrix column. Thus, each row of your data should start with a string value, followed by N input values (where N = size of input column matrix), followed by N output values (where N is now the size of the output column matrix). For example, if your input has 5 units and your output has 1, and you are using CSV (comma separated values), with strings quoted, then your data should look something like this:

"pattern1_name", 1,0,0,0,0,.1
"pattern2_name", 0,1,0,0,0,.2
"pattern3_name", 0,0,1,0,0,.3
...

(where the input has one-of-N units active, and the output is a graded number between 0-1, in this case encoding the idea that each input unit has a "value" increasing in increments of .1 from left to right -- just a nonesense problem for illustration purposes).

To Export data to a delimited text file

  1. from the Data menu choose Export
  2. choose the desired delimiter -- COMMA is a widely supported standard, and is the default
  3. check "quote_str" if you want string (and column names) to be quoted -- quoting is a widely supported standard, and is the default
  4. check "headers" if you want a header line
  5. choose a file name -- .csv suffix is common for Comma Separated Values and can be easily loaded in programs such as Excel, Open Office Spreadsheet, and database programs

To Import from a delimited text file

  1. if it doesn't already exist, create the data table object you want to import data into
  2. RECOMMENDED: it is suggested to create the correct number and type of columns before importing -- this lets you choose the correct data type -- if a column of the correct name does not exist in the DataTable, then a Variant column will be created
  3. REQUIRED: if your import file contains matrix data, it must use the <column name>_<cell index> naming convention AND you must create a suitable matrix column BEFORE importing -- emergent cannot create matrix columns when importing data, you will end up with N scalar columns with the literal name in the header (ex. MyMatData_0, MyMatData_1, etc.) -- the emergent column can be multi-dimensional, the <cell index> will refer to the ascending row-major order cell number in the matrix
  4. check "headers" if your file has a header line -- NOTE: Emergent cannot automatically determine if a header exists or not!
  5. you may leave the delimiter and quote parameters on AUTO -- emergent will look at the file and usually choose the correct values -- if for some reason this doesn't work, you can choose them manually -- if you choose manual values emergent will warn you if it thinks the file is incompatible with your choices

Here is the export from a table with one scalar column ("Name") and one matrix column ("Input") which had 2x3 cells. Notice that the dimensions of the matrix column are not specified. It could be 2x3 or 3x2 or 6x1, etc. So before you import the data create all matrix columns with the proper dimensions. If you don't create the columns each matrix value will be in a separate column.

Name,"Input_0","Input_1","Input_2","Input_3","Input_4","Input_5"
"checks",1,0,0,0,1,0
"mitvert",0,0,1,0,0,0

CSV import tip

If you want to import a csv file that has no headers you must first add headers. The following command will add them to your csv file (be sure to fix the name). They will be named _0, _1, _2 etc... for each column:

csv=SX106.csv;sed -ie "1i$(seq -s_, 0 $(($(head -n1 $csv | tr -dc , | wc -c)+1)))_" $csv

Importing from files with fixed-width columns

If your import file has a fixed-width format, particularly if there is no single column separator, you can use a FixedWidthSpec object in a Program to import your data. Note that your DataTable must have all the necessary columns created, the import will not create columns. (Your table can have additional columns not in the import -- those cells will get set to the default empty value for the type.)

  • create a new Program
  • add a FixedWidthSpec object to the obj collection -- if you want to skip header line(s) in your import file, set n_skip_lines to be the number of lines to skip
  • add a FixedWidthColSpec object to the col_specs collection for each column you want to import -- you only need a ColSpec object for data you will import, not data you will ignore
  • set the column name (in your DataTable), the starting column in the file (1-based), and the width of the column -- you can specify -1 for the last column if it has variable width
  • use the DataTable::LoadDataFixed() call

Exporting/Importing in a Program

  • for Export, use the DataTable::ExportData() routine
  • for Import, use the DataTable::ImportData() routine
  • you can also use the DataTable::LoadDataEx() routine if you aren't sure if the file will be emergent-native or a simple delimited file

If you want the user to be interactively prompted for the appropriate file name, then leave the filename parameter blank (""). This will (obviously) only work when emergent is running in gui mode.

Emergent Data Format

Emergent uses an efficient proprietary text-based data format to save and load data -- the format preserves column type information, and matrix geometry of matrix cells. By default, TAB is used as the column delimiter and strings are quoted. Each line in the file starts with a marker: _H: for the header line, if present, and _D: for a data line. Data is stored one scalar value or one matrix cell value per column. Header columns are formatted as follows: <type marker><Col Name>[mat index]<mat dims> -- the [mat index] component is only used for matrix columns; the <mat dims> only appears for the first column instance of a matrix column series The type markers are as follows:

 $ - String
 % - float
 # - double
 | - int
 @ - byte
 & - Variant

Compressed files are compressed using the zlib compression format, and are compatible with the gzip compression program.

If you want to import multidimensional data into emergent, the fastest way to determine the format is to create a DataTable to your specifications, save it to a file, and then inspect the file. Here is an example of 2x2 DataTable of type Matrix, where each of the four matrices are also 2x2 and contain Ints.

_H:     &test[2:0,0]<2:2,2>     &test[2:1,0]    &test[2:0,1]    &test[2:1,1]    &test2[2:0,0]<2:2,2>    &test2[2:1,0]   &test2[2:0,1]   &test2[2:1,1]
_D:     1       2       3       4       9       10      11      12
_D:     5       6       7       8       13      14      15      16

H: stands for header, and _D: stands for data. &test is a reference to the name of the datatable, test.

Datatable1.png
Datatable2.png