Semantic import and Datasets
The semantic import and dataset functionality is implemented with the help of the Tablesaw dataframe and visualization library. We’ll use it to look at data about tornadoes.
Exploring Tornadoes
To give a better sense of how SemanticImport and Dataset work, we’ll use a tornado data set from NOAA. Here’s what we’ll cover:
- Read a CSV file
- Viewing dataset metadata
- Previewing data
- Sorting
- Running descriptive stats (mean, min, max, etc.)
- Filtering rows
- Totals and sub-totals
- Saving your data
All the data is in the Symja data folder.
Read a CSV file
Here we read a CSV file of tornado data. The SemanticImport function infers the column types by sampling the data and returns a Dataset, which we assign to a variable named ds.
Note that the file path is resolved relative to the current working directory. You may have to change it for your setup.
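To make "infers the column types by sampling" concrete, here is a minimal, language-neutral sketch in Python (standard library only). It is not the SemanticImport or Tablesaw implementation; the sample rows, the column names and the helpers infer_type and read_table are all hypothetical.

```python
import csv
import io

# Made-up sample data; the column names are hypothetical.
CSV_TEXT = """Date,State,Injuries,Width
1950-01-03,MO,3,10.5
1950-01-13,AR,1,17.0
1950-02-11,TX,0,33.3
"""


def infer_type(values):
    """Return the narrowest of int, float, str that parses every sampled value."""
    for caster in (int, float):
        try:
            for v in values:
                caster(v)
            return caster
        except ValueError:
            continue
    return str


def read_table(text, sample_size=100):
    """Parse CSV text and convert cells using types inferred from a sample."""
    rows = list(csv.DictReader(io.StringIO(text)))
    sample = rows[:sample_size]
    types = {name: infer_type([r[name] for r in sample]) for name in rows[0]}
    # Convert every cell using the type inferred from the sampled rows.
    converted = [{k: types[k](v) for k, v in r.items()} for r in rows]
    return converted, types


rows, types = read_table(CSV_TEXT)
print({name: t.__name__ for name, t in types.items()})
```

Because only a sample is inspected, a value further down the file that does not fit the inferred type would make the conversion fail in this sketch; that is the usual trade-off of sampling-based inference.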
If you would like to create smaller datasets, you can use the SemanticImportString function, which creates a Dataset from a String representation:
Viewing table metadata
Often, the best way to start is to print the column names for reference:
The Dimensions function displays the row and column counts:
Structure shows the index, name and type of each column. Like many other Dataset functions, Structure returns a Dataset.
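The idea that a structure query returns a table of its own can be sketched in plain Python. This is an illustration only, not the Symja API: the table contents, the output column labels and the helpers dimensions and structure are hypothetical, and the index is 0-based purely by choice in this sketch.

```python
# Made-up table in column-oriented form; names and values are hypothetical.
columns = {
    "State": ["MO", "AR", "TX"],
    "Injuries": [3, 1, 0],
    "Width": [10.5, 17.0, 33.3],
}


def dimensions(cols):
    """Return (row count, column count)."""
    n_rows = len(next(iter(cols.values()))) if cols else 0
    return n_rows, len(cols)


def structure(cols):
    """Describe each column; the result is itself a column-oriented table."""
    return {
        "Index": list(range(len(cols))),  # 0-based in this sketch
        "Column Name": list(cols),
        "Column Type": [type(v[0]).__name__ for v in cols.values()],
    }


print(dimensions(columns))  # (3, 3)
print(structure(columns))
```

Since structure returns the same kind of object as its input, the result can be passed to dimensions or any other table helper.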
You can then produce a string representation for display: calling ToString on a Dataset returns the table as a string.
You can also perform other table operations on it. For example, the code below removes all columns whose type isn’t DOUBLE:
Of course, that also returned a Dataset. We’ll cover selecting rows in more detail later.
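As a rough illustration of dropping every column that is not of a floating-point type, the following Python sketch filters a column-oriented table by value type. The data and names are made up, and Python's float merely stands in for the DOUBLE column type mentioned above.

```python
# Made-up table in column-oriented form; names and values are hypothetical.
columns = {
    "State": ["MO", "AR", "TX"],
    "Injuries": [3, 1, 0],
    "Width": [10.5, 17.0, 33.3],
    "Length": [9.5, 3.6, 0.1],
}

# Keep only the columns whose values are all floats (stand-in for DOUBLE).
doubles_only = {
    name: values
    for name, values in columns.items()
    if all(isinstance(v, float) for v in values)
}

print(list(doubles_only))  # ['Width', 'Length']
```

The result has the same shape as the input (a mapping from column name to values), which mirrors the point that the operation returns a table again.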
Previewing data
The Span operator ;; returns a new table containing the first 3 rows.
This creates a new table containing all rows but only the Distillery column.
This creates a new table containing all rows but only columns 3 to 5.
This creates a new table containing rows 1 to 10 but only the columns Distillery, Latitude and Longitude.
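The four selections described above can be mimicked in Python to show what each one returns. This is a sketch under assumptions, not Symja syntax: it assumes spans are 1-based and inclusive (as "rows 1 to 10" suggests), the helpers span and take are hypothetical, and the rows are made up (only the names Distillery, Latitude and Longitude come from the text; Region and Score are invented to fill columns 2 and 5).

```python
# Made-up rows; Region and Score are invented filler columns.
rows = [
    {
        "Distillery": f"D{i}",
        "Region": "R",
        "Latitude": 56.0 + i,
        "Longitude": -4.0 - i,
        "Score": i,
    }
    for i in range(1, 13)
]


def span(seq, start, stop):
    """Map a 1-based, inclusive range (start;;stop) to a Python slice."""
    return seq[start - 1:stop]


def take(table, row_span=None, cols=None):
    """Select rows by 1-based inclusive span and columns by name."""
    picked = table if row_span is None else span(table, *row_span)
    if cols is None:
        return [dict(r) for r in picked]
    return [{c: r[c] for c in cols} for r in picked]


names = list(rows[0])
first_three = take(rows, (1, 3))                      # first 3 rows
one_column = take(rows, cols=["Distillery"])          # all rows, one column
cols_3_to_5 = take(rows, cols=span(names, 3, 5))      # all rows, columns 3 to 5
subset = take(rows, (1, 10), ["Distillery", "Latitude", "Longitude"])

print(len(first_three), list(cols_3_to_5[0]), len(subset))
```

Note the off-by-one difference: a 1-based inclusive span 1 to 3 corresponds to the Python slice [0:3].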
The Normal function converts a table into a list of Symja associations <|"column-name1"->value1, ... |>.
The InputForm function shows that the column names are converted to keys of type String in the associations.
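A list of associations is close to a list of dictionaries with string keys. The Python sketch below shows that conversion for a made-up column-oriented table; to_records is a hypothetical helper, not a Symja function.

```python
# Made-up table in column-oriented form; names and values are hypothetical.
columns = {"State": ["MO", "AR"], "Injuries": [3, 1]}


def to_records(cols):
    """Turn a column-oriented table into one dict per row, keyed by column name."""
    names = list(cols)
    return [dict(zip(names, values)) for values in zip(*cols.values())]


records = to_records(columns)
print(records)  # [{'State': 'MO', 'Injuries': 3}, {'State': 'AR', 'Injuries': 1}]
```

Every key in the resulting dictionaries is a string, which corresponds to the String keys mentioned above.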
Sorting
Now let’s sort the table in descending order by the id column. The negative sign before the name indicates a descending sort.
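The "negative sign means descending" convention can be sketched in a few lines of Python. The helper sort_on and the rows are hypothetical; the sketch only illustrates how a sort specification with a leading minus sign could be interpreted, not how Symja parses it.

```python
# Made-up rows; only the column name "id" comes from the text above.
rows = [
    {"id": 2, "State": "AR"},
    {"id": 7, "State": "TX"},
    {"id": 5, "State": "MO"},
]


def sort_on(table, spec):
    """Sort by a column name; a leading '-' requests descending order."""
    descending = spec.startswith("-")
    key = spec[1:] if descending else spec
    return sorted(table, key=lambda r: r[key], reverse=descending)


print([r["id"] for r in sort_on(rows, "-id")])  # [7, 5, 2]
print([r["id"] for r in sort_on(rows, "id")])   # [2, 5, 7]
```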
Descriptive statistics
Descriptive statistics are calculated using the Summary function:
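For orientation, these are the kinds of values a descriptive summary of a numeric column typically contains, computed here with Python's statistics module on made-up numbers. Which measures Summary actually reports, and under which labels, is not stated above, so the selection and labels below are assumptions.

```python
import statistics

# Made-up values for a hypothetical numeric column.
widths = [10.5, 17.0, 33.5, 12.0]

summary = {
    "Count": len(widths),
    "Min": min(widths),
    "Max": max(widths),
    "Mean": statistics.mean(widths),
    "Std. Dev": statistics.stdev(widths),  # sample standard deviation
}

for measure, value in summary.items():
    print(f"{measure}: {value}")
```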
Filtering
You can write your own Select function to filter rows.
The next example returns a Dataset containing only the columns named in the parameter list, rather than all the columns in the original.
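Row filtering combined with an optional column list can be illustrated as follows. The Python helper select is hypothetical and only mirrors the behaviour described above (keep the rows that satisfy a condition, optionally keep only the named columns); the rows and column names are made up.

```python
# Made-up rows; the column names are hypothetical.
rows = [
    {"State": "MO", "Injuries": 3, "Width": 10.5},
    {"State": "AR", "Injuries": 0, "Width": 17.0},
    {"State": "TX", "Injuries": 5, "Width": 33.3},
]


def select(table, predicate, cols=None):
    """Keep rows where predicate(row) is true; optionally keep only cols."""
    kept = [r for r in table if predicate(r)]
    if cols is None:
        return kept
    return [{c: r[c] for c in cols} for r in kept]


with_injuries = select(rows, lambda r: r["Injuries"] > 0)
narrow = select(rows, lambda r: r["Injuries"] > 0, ["State", "Injuries"])

print(with_injuries)
print(narrow)  # [{'State': 'MO', 'Injuries': 3}, {'State': 'TX', 'Injuries': 5}]
```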
Totals and sub-totals
Column metrics can be calculated using functions such as Total, Mean and Max.
You can apply those methods to a table, calculating results on one column, grouped by the values in another.
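Totals and sub-totals amount to one aggregate over a whole column and one aggregate per group. The Python sketch below shows both on made-up rows; summarize is a hypothetical helper and does not reflect the Symja call syntax.

```python
from collections import defaultdict

# Made-up rows; the column names are hypothetical.
rows = [
    {"State": "TX", "Injuries": 2},
    {"State": "MO", "Injuries": 3},
    {"State": "TX", "Injuries": 5},
]


def summarize(table, value_col, by, func):
    """Apply func to the values of value_col, grouped by the values of by."""
    groups = defaultdict(list)
    for r in table:
        groups[r[by]].append(r[value_col])
    return {key: func(values) for key, values in groups.items()}


grand_total = sum(r["Injuries"] for r in rows)            # total over the column
sub_totals = summarize(rows, "Injuries", "State", sum)    # total per state
maxima = summarize(rows, "Injuries", "State", max)        # max per state

print(grand_total)  # 10
print(sub_totals)   # {'TX': 7, 'MO': 3}
print(maxima)       # {'TX': 5, 'MO': 3}
```

Passing the aggregate as a function keeps the grouping logic independent of the metric being computed.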
Saving your data
To save a table, you can write it as a CSV file:
If you would like to create a String representation, you can use the function ExportString.
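The difference between writing a CSV file and producing a CSV string is only the destination. As a language-neutral sketch in Python (not the Symja export functions), the hypothetical helper to_csv_string below serialises made-up rows into a string.

```python
import csv
import io

# Made-up rows; the column names are hypothetical.
rows = [
    {"State": "MO", "Injuries": 3},
    {"State": "AR", "Injuries": 1},
]


def to_csv_string(table):
    """Serialise a list of row dicts to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(table[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(table)
    return buffer.getvalue()


text = to_csv_string(rows)
print(text)
```

Writing to a file works the same way in this sketch: pass an opened file object to csv.DictWriter instead of the in-memory buffer.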