Apache Superset: An Open-Source Visualization Tool

The articles document the views of the DataOps apprentices views and opinions on Apache Superset and may not exhaustively cover all features of the tool. The intent behind this article is to provide an introductory overview of the tool, and compare it against common alternative visualization tools.

1. What is Superset ?

Fig 1.1 Apache Superset Logo Retrieved from Apache Superset Github.

Apache Superset can be classified as a business intelligence tool. It can serve for data visualisation. Apache is fast and lightweight allowing users for easy usage. It is loaded with options that can enable users of all skill sets to visualise charts and data. There are also options for SQL if the user is used to using SQL queries. Apache is also open source and supports a wide variety of databases. Overall, Apache Supertset is relatively user friendly but requires some time to learn the intuition of the usage.

More information can be found in Apache Superset’s Documentation.

 

2. Motivation for superset

Fig 2.1 Apache Superset Features Retrieved from Apache Superset’s Homepage .

Apache Superset provides dashboards which can provide a clear picture to the audiences for business use cases. The dashboards have configurable options that allow the users to adjust the flow in which the charts are presented. The charts themselves have customisable options that allow the user to choose the appropriate configurations. An example of a good business use case is the use of the filter. During business presentations, on the go questions may be raised such as the comparison of certain items of the data. The filter options allows the user to update all the charts in the dashboard directly in an efficient manner. Apache Superset is also easy to integrate with existing modern data infrastructure and supports a wide range of databases used in businesses.

Although this also can be done in a python jupyter notebook, Apache Superset provides the edge in presentation for business use cases in an efficient way that the standard EDA may lack.

3. Installation

Apache Superset has two ways of installation:

  1. Installing Superset Locally Using Docker Compose
  2. Installing Superset from Scratch

3.1 Installing Superset Using Docker Compose

Image Retrieved from the following link.

Linux users can install docker through the following link found in the docker’s installation documentation for Linux. The docker compose is already included in this documentation and need not require further installation as stated in Apache Superset’s documentation.

Image Retrieved from the following link.

Macbook Users can refer to the docker’s installation documentation for Mac. This installation is also used in Apache Superset’s Documentation.

Image Retrieved from the following link.

Windows Users can refer to the docker’s installation for Windows. It is important to choose WSL2 as the backend for the configuration during the installation.

After installing the Docker, head to the working folder and clone the Superset github repository with the following code.

git clone https://github.com/apache/superset.git

After cloning, navigate to the superset folder with the first line of code. Run the docker compose with the next two lines of code and Superset is ready.

cd superset

docker-compose -f docker-compose-non-dev.yml pull

docker-compose -f docker-compose-non-dev.yml up

Access the Superset through the local host.

https://localhost:8088/

There are other configurations such as Superset configurations, additional network settings, event logging and so on. These configurations can be found in the documentation, navigate to the configuration of interest in the tabs on the left.

3.2. Installing Superset from Scratch

This method involves installing Superset in a python virtual environment. For more information, please refer to this guide.

4. Data loading

Installing Superset via Docker Compose installs a local Postgres SQL server, whilst installing it from scratch installs a default SQLite data base at “~/. superset/superset.db”. To scale the underlying database, Superset also supports a plethora of database connections. The various connection methods will be discussed below.

4.1. Connect to a database (DB)

Connecting Superset to a database requires an existing (1) functional SQLAlchemy dialect and (2) Python driver. A Python database driver is to be installed for each database the user wants to connect to. Depending on the installation method, there are different ways to install the database driver.

4.1.1. DB connection for docker compose installations

When setting up Superset locally via docker-compose, the necessary drivers and packages need to be stated in a “superset/docker/requirements-local.txt” file to be installed automatically. Please refer to this guide for more information.

4.1.2. DB connection for installations from scratch

Using this reference, pip install the Python DB driver packages necessary to connect to the database. Thereafter, these steps can be followed:

  1. Click on the “+” icon on the top right corner of the Superset UI, then select “Connect database”

  1. Select the database type. If the database is not on the list but an interface exists, the user may file an issue on the Superset Github repository.
  1. Either fill in the connection details using the input form or choose the option to use SQL Alchemy URI connection string instead.

4.2 Upload a .csv file

1. Navigating to the Database Connection Settings

There is a feature that allows a csv file to be uploaded to the database. In order to activate this feature head to the settings sign at the top right hand corner and select database connection. This will navigate the user to their database tab as shown.

2. Navigating to the Database settings


Under the actions column, Click the edit button to navigate to the database configurations and select the “ADVANCED” tab.


3. Selecting the upload option

Check the “Allow files upload to database” as shown. The upload csv to database option will be available thereafter.

5. Data wrangling

An individual running Superset with its default local database may choose to do the bulk of data wrangling before data is loaded into the database. This can give the flexibility to process the data using the Python Pandas package for example.

However, Superset also provides an excellent SQL IDE for data wrangling. Some basic examples will be covered below.

5.1. Calculated columns

Example of casting a string column to float data type using a combination of SQL and the Superset UI.

  1. Select the “Datasets” tab on the top of the Superset UI.
  2. Put the cursor over the data table and select the edit icon.
  3. In the edit menu that pops up, go to the “Calculated Columns” tab, then click on “+ Add Item”.
  4. As an example, a string column “remaining_years” with values in the format of “xx years xx months”, where “xx” refers to zero-padded integers, extract the first 2 integers in the string representing the number of years, then cast it into a float data type calculated column via SQL as follows:

  1. Scroll down and set the data type to “Numeric”.

5.2. Temporal data type casting

Apart from casting data types by creating a calculated column, temporal data type casting specifically can be done through the Superset UI.

  1. Select the “Datasets” tab on the top of the Superset UI.
  2. Put the cursor over the data table and select the edit icon.
  3. In the edit menu that pops up, go to the “Columns” tab, then click on the temporal column of the user choice. In this example, it is the string column “month” that has values “YYYY-MM”.
  4. Check the “Is temporal” checkbox, and click the “Default datetime” radio button.
  5. Scroll down and fill in the Python datetime string pattern that corresponds to the format of the “month” column values to facilitate the parsing of the string column into datetime. Do remember to include separators if applicable.

  1. Do note that any charts using temporal columns should have “Time Grain” set to “Original Value”.

5.3. Creating views / virtual tables

Views can also be created using tables stored within the underlying database.

  1. Click on the “SQL” tab on the top of the Superset UI and select “SQL Lab”.
  2. Type in the SQL query to return the desired output view, then click on the dropdown arrow beside the save button.

  1. Click on “Save dataset” and input the desired view name. The view will now be accessible within Superset as a dataset.

6. Dashboard creation

The Dashboard is quite intuitive on its own. To access the dashboard head to the dashboard tab, click on the dashboard and finally click on edit dashboard. If creating a new dashboard click the + Dashboard button and the user will be directed to an empty Dashboard

The dashboard is best learnt through usage so most of the functions will not be covered.

There are a few special functionalities that perhaps one may find useful. The three dots highlighted will give access to options like edit CSS which gives the option to input CSS code. There is also the edit properties button where color schemes for the dashboard, owners and certifications can be defined.

Another more uncommon tool that could be used is the Text label under the layout elements tab. An image can be displayed in the text box by using the following simple code below.

<img src="Input Image Link Here">

Overall, the Dashboard is straightforward and just takes a bit of exploration for the user to gain the intuition to use it.

7. Chart creation

Superset supports a variety of charts for different visualization needs. These charts have been conveniently categorized according to their use cases:

Chart Category

Use Case

Correlation

Plot relationships between features

Distribution

Plot volume / frequency of features

Evolution

Time series plots

Flow

Visualize connections between features

KPI

Visualize key metrics

Map

Geospatial plots

Part of whole

Provides visual breakdown of a population

Ranking

Visualizations that impose an order or score to data

To access the Chart creation menu, click on the “Charts” tab on the top of the Superset UI, then click on the “+ Chart” button. A few examples will be discussed below.

7.1. Line charts

For this example, we will be creating a time-series line chart on Singapore’s public housing (HDB) resale prices across time, by the different apartment flat types.


The steps to create this chart are as follows:

  1. Select the temporal column of choice.
  2. Select the granularity of time series: Select “Original value” as other granularities were observed to throw errors.
  3. Select the feature metric to measure on the vertical axis.
  4. Select any categorical feature dimensions to use to slice the line series.
  5. Click on “Update Chart”.

7.2. Annotation layers

Annotation layers are layers that are put on top of a chart to provide supplementary information on the visualizations. For this example, annotations will be added for events that may have an effect on the resale prices from the example in 7.1.

  1. On the top right corner of the Superset UI, click on “Settings”, and select “Annotation Layers”.
  2. Provide the desired annotation layer name in the window that pops up and press “Add”. for this example.

  1. Within the annotation layer created, labels can be added associated with that label. For example, we can add a separate label for each lockdown that occurred in Singapore. Click on the “+ Annotation” button on the top right corner of the annotation layer page.
  1. Provide an annotation name. This will be part of the label that shows on the chart. The other part of the label is derived from the name that was specified on the chart creation menu which will be shown in a later step.
  2. Provide a date timeframe that corresponds to the lockdown event.

  1. Click on save and go back to the chart. Within the “Annotations and Layers” segment, click on “Add annotation layer” and fill in the inputs as follows:

  1. Annotation layer type: Either choose “Event” or “Interval”, which will reference the date timeframe that was configured in the previous step. “Formula” allows the user to input a function of the horizontal axis feature, whilst “Time Series” adds a layer based on another chart.
  2. Annotation source: Select “Superset annotation” to indicate an annotation layer to be used.
  3. Annotation layer: Select “Covid Lockdown” to reference the annotation layer that was created in the earlier step.
  1. Select “OK” and then “Update Chart”. The annotation should now appear on the chart.

7.3. Geospatial charts

Superset relies on API calls to a 3rd party map service, Mapbox, to generate maps for geospatial plots. Consequently, the user is required to set up a free account with Mapbox, generate an API token, and copy paste it into the “superset/docker/pythonpath_dev/superset_config.py” file as a new line as shown below:

Additionally, it is essential that the underlying dataset has geospatial features such as longitude and latitude in order to plot them on a map grid.

The following steps illustrate how to generate a “deck.gl screen grid” chart, where each square polygon is an aggregate of Singapore HDB resale price, with darker shades representing higher prices.

  1. Select the temporal column of choice.
  2. Select the geospatial columns with longitude and latitude data.
  3. Select the feature metric to represent as a color gradient.
  4. Click on “Update Chart”.

8. Superset vs. Tableau vs. Python

Superset is a useful visualization tool, however it should not be viewed in isolation from other solutions. A list of relevant features were used as a basis for comparison between Superset, Tableau, and Python in the table below:

        

Features

Python

Superset

Tableau

Cost

Free

Free

Limited free license; subscription-based

Chart interactivity

Limited interactivity using plotly

Supports a variety of interactive filters

Supports a variety of interactive filters and tooltips

Ease of use

Potentially involved – everything is in code

User-friendly UI available; SQL knowledge required

User-friendly UI available; SQL / Tableau functions required for advanced use cases

Variety of charts

Supports basic and niche charts

Does not compare to Python by supports most charts

Does not compare to Python by supports most charts

Security

Not applicable

Role-based or dashboard / chart level access management

Role-based access management to control site / project / dashboard access.

Compute speed

Dependent on package optimization

Caching utility available

Has Hyper DB which facilitates quick dashboard rendering

Hosting

Local / server

Local / server / cloud

Desktop / server / cloud

Support

Not applicable

Limited support for immediate production issues

Technical support available for issues with server and desktop software

Community

Large, active and helpful

Active and helpful, but not large

Large, active and helpful

9. Conclusion

With the variety of visualization tools available, a prudent adopter should carefully consider their idiosyncrasies to determine not the best, but the right tool for an organization and its users. Specifically, Superset fills a niche gap between the open-sourced but code-based Python, and the licensed but UI-based Tableau. Whilst esoteric with inflexibilities inherent to abstracted UI-based solutions, Superset is cost-free, easy to install, and is intuitive to use – there is little reason not to give it a try.

Authors