WOLFRAM

45 Datasets

45Datasets
Especially in larger organizations, computing often centers around dealing with large amounts of structured data. The Wolfram Language has a very powerful way to deal with structured data, using what it calls datasets.
A simple example of a dataset is formed from an association of associations.
Create a simple dataset that can be viewed as having 2 rows and 3 columns:
In[1]:=
Out[1]=
The Wolfram Language displays most datasets in tabular form. You can extract parts from datasets just like you would from associations.
Get the element from “row b” and “column z”:
In[2]:=
Out[2]=
You can first extract the whole “b row”, then get the “z” element of the result:
In[3]:=
Out[3]=
You can also just get the whole “b row” of the dataset. The result is a new dataset, which for ease of reading happens to be displayed in this case as a column.
Generate a new dataset from the “b row” of the original dataset:
In[4]:=
Out[4]=
Here is the dataset that corresponds to the “z column” for all “rows”.
Generate a dataset consisting of the “z column” for all rows:
In[5]:=
Out[5]=
Get totals for each row by applying Total to all columns for all the rows:
In[6]:=
Out[6]=
If we use f instead of Total, we can see what’s going on: the function is being applied to each of the “row” associations.
Apply the function f to each row:
In[7]:=
Out[7]=
Apply a function that adds the x and z elements of each association:
In[8]:=
Out[8]=
You can use any function; here’s PieChart:
In[9]:=
Out[9]=
You can give a function to apply to all rows too.
This extracts the value of each “z column”, then applies f to the association of results:
In[10]:=
Out[10]=
Apply f to the totals of all columns:
In[11]:=
Out[11]=
In[12]:=
Out[12]=
You can always “chain” queries, for example first finding the totals for all rows, then picking out the result for the “b row”.
Find totals for all rows, then pick out the total for the “b row”:
In[13]:=
Out[13]=
It’s equivalent to this:
In[14]:=
Out[14]=
Select numbers greater than 5 from a list:
In[15]:=
Out[15]=
Another way to get the same answer, using the operator form of Select:
In[16]:=
Out[16]=
The operator form of Select is a function which can be applied to actually perform the Select operation.
Make a dataset by selecting only rows whose “z column” is greater than 5:
In[17]:=
Out[17]=
For each row, select columns whose values are greater than 5, leaving a ragged structure:
In[18]:=
Out[18]=
Normal turns the dataset into an ordinary association of associations:
In[19]:=
Out[19]=
Many Wolfram Language functions have operator forms.
Sort according to the values of a function applied to each element:
In[20]:=
Out[20]=
SortBy has an operator form:
In[21]:=
Out[21]=
Sort rows according to the value of the difference of the x and y columns:
In[22]:=
Out[22]=
Sort the rows, and find the total of all columns:
In[23]:=
Out[23]=
Sometimes you want to apply a function to each element in the dataset.
Apply f to each element in the dataset:
In[24]:=
Out[24]=
Sort the rows before totaling the squares of their elements:
In[25]:=
Out[25]=
A dataset formed from a list of associations:
In[26]:=
Out[26]=
It’s OK for entries to be missing:
In[27]:=
Out[27]=
Now that we’ve seen some simple examples, it’s time to look at something slightly more realistic. Let’s import a dataset giving properties of planets and moons. The dataset has a hierarchical structure, with each planet having a mass and radius of its own, and then also having a collection of moons, each of which have their own properties. This general structure is extremely common in practice (think students and grades, customers and orders, etc.).
Get a hierarchical dataset of planets and moons from the cloud:
In[28]:=
Out[28]=
In[29]:=
Out[29]=
Make a bar chart of planet radii:
In[30]:=
Out[30]=
If we ask about the moons of Mars, we get a dataset, which we can then query further.
Get a dataset about the moons of Mars:
In[31]:=
Out[31]=
“Drill down” to make a table of radii of all the moons of Mars:
In[32]:=
Out[32]=
In[33]:=
Out[33]=
Find the total mass of all moons for each planet:
In[34]:=
Out[34]=
Get the same result, but only for planets with more than 10 moons:
In[35]:=
Out[35]=
Make a pie chart of the result:
In[36]:=
Out[36]=
For all moons, select ones whose mass is greater than 0.01 times the mass of the Earth:
In[37]:=
Out[37]=
Get the list of keys (i.e. moon names) in the resulting association for each planet:
In[38]:=
Out[38]=
Get the underlying association:
In[39]:=
Out[39]=
Join together, or “catenate”, the lists for all keys:
In[40]:=
Out[40]=
Here’s the whole computation in one line:
In[41]:=
Out[41]=
Make number line plots of the logarithms of masses for moons of each planet:
In[42]:=
Out[42]=
As a final example, let’s make a word cloud of names of moons, sized according to the masses of the moons. To do this, we need a single association that associates the name of each moon with its mass.
When given an association, WordCloud determines sizes from values in the association:
In[43]:=
Out[43]=
The function Association combines associations:
In[44]:=
Out[44]=
Generate the word cloud of moon masses:
In[45]:=
Out[45]=
We’ve seen before that we can write something like f[g[x]] as f@g@x or x//g//f. We can also write it f[g[#]]&[x]. But what about f[g[#]]&? Is there a short way to write this? The answer is that there is, in terms of the function composition operators @* and /*.
f@*g@*h represents a composition of functions to be applied right-to-left:
In[46]:=
Out[46]=
h/*g/*f represents a composition of functions to be applied left-to-right:
In[47]:=
Out[47]=
Here’s the previous code rewritten using composition @*:
In[48]:=
Out[48]=
And using right composition /*:
In[49]:=
Out[49]=
To get the main dataset that’s mentioned here, just use ResourceData.
Get the dataset just by giving its name to ResourceData:
In[50]:=
Out[50]=
In[51]:=
Out[51]=
Make a histogram of the altitudes:
In[52]:=
Out[52]=
Dataset[data] a dataset
Normal[dataset] convert a dataset to normal lists and associations
Catenate[{assoc1, ...}] catenate associations, combining their elements
f@*g composition of functions ( f[g[x]] when applied to x)
f/*g right composition (g[f[x]] when applied to x)
Note: These exercises use the dataset planets=CloudGet["https://wolfr.am/7FxLgPm5"].
45.1Make a word cloud of the planets, with weights determined by their number of moons. »
Sample expected output:
Out[]=
45.2Make a bar chart of the number of moons for each planet. »
Sample expected output:
Out[]=
45.3Make a dataset of the masses of the planets, sorted by their number of moons. »
Sample expected output:
Out[]=
45.4Make a dataset of planets and the mass of each one’s most massive moon. »
Sample expected output:
Out[]=
Sample expected output:
Out[]=
Sample expected output:
Out[]=
45.7For each planet, make a list of moons larger in mass than 0.0001 Earth masses. »
Sample expected output:
Out[]=
45.8Make a word cloud of countries in Central America, with the names of countries proportional to the lengths of the Wikipedia article about them. »
Sample expected output:
Out[]=
45.9Find the maximum observed altitude in the Fireballs & Bolides dataset. »
Expected output:
Out[]=
45.10Find a dataset of the 5 largest observed altitudes in the Fireballs & Bolides dataset. »
Expected output:
Out[]=
45.11Make a histogram of the differences in successive peak brightness times in the Fireballs & Bolides dataset. »
Expected output:
Out[]=
45.12Plot the nearest cities for the first 10 entries in the Fireballs & Bolides dataset, labeling each city. »
Expected output:
Out[]=
45.13Plot the nearest cities for the 10 entries with largest altitudes in the Fireballs & Bolides dataset, labeling each city. »
Expected output:
Out[]=
What kinds of data can datasets contain?
Any kinds. Not just numbers and text but also images, graphs and lots more. There’s no need for all elements of a particular row or column to be the same type.
Can I adjust the formatting of datasets?
Yes. There are many options. For example, Dataset[dataset, MaxItems 5] will display 5 items per “page”.
Can I turn spreadsheets into datasets?
Yes. SemanticImport is often a good way to do it.
What are databases and how do they relate to Dataset?
Databases are a traditional way to store structured data in a computer system. Databases are often set up to allow both reading and writing of data. Dataset is a way to represent data that might be stored in a database so that it’s easy to manipulate with the Wolfram Language.
How does data in Dataset compare to data in an SQL (relational) database?
SQL databases are strictly based on tables of data arranged in rows and columns of particular types, with additional data linked in through “foreign keys”. Dataset can have any mixture of types of data, with any number of levels of nesting, and any hierarchical structure, somewhat more analogous to a NoSQL database, but with additional operations made possible by the symbolic nature of the language.
Can I use datasets to set up entities and values for them?
  • Dataset supports a new kind of symbolic database structure which generalizes both relational and hierarchical databases.
  • Dataset has many additional mechanisms and capabilities that we haven’t discussed.
  • Everything that can be done with queries on datasets can also be done by using functions like Map and Apply on underlying lists and associationbut it’s typically much simpler with dataset queries.
  • You can connect the Wolfram Language directly to SQL databasesand do queries with SQL syntaxusing DatabaseLink.
Next Section