Clusterfun is a Python plotting library for exploring image and audio data. Play around with a live demo at https://clusterfun.app.
Clusterfun can be installed with pip:
pip install clusterfun
Clusterfun requires Python 3.8 or higher.
Plots accept data in the form of a pandas DataFrame; pandas will be installed automatically if it is not already present. No account, payment, or internet connection is required to use clusterfun. Clusterfun is open source and free to use.
import pandas as pd
import clusterfun as clt
df = pd.read_csv("https://raw.githubusercontent.com/gietema/clusterfun-data/main/wiki-art.csv")
clt.scatter(df, x="x", y="y", media="img_path", color="painter")
Data can be hosted locally or on AWS S3.
As you can see, a clusterfun plot takes a pandas dataframe as input, along with column names indicating which columns to use for the visualisation. In this way, it is similar to the seaborn or plotly libraries. But in clusterfun, you can:
- Click and drag to select data to visualise it in a grid
- Hover over data points to see them on the right side of the page
- Click on data points to view zoomed-in versions of the image related to the data point
This makes clusterfun ideal for quickly visualising image data, which can be useful in the context of building datasets, exploring edge cases and debugging model performance.
The default parameters for the plot types are as follows:
- df: pd.DataFrame (required). The dataframe used for the data to plot. Most other parameters are column names in this dataframe (e.g. media, color, etc.).
- media: str (required). The column name of the media to display in the plot. See data loading for more information about the type of media that can be displayed.
- show: bool = True. Whether to show the plot or not. If show is set to True, clusterfun starts a local server to display the plot in a web browser. More specifically, it starts a FastAPI server where the webpage is mounted as a static file. The application itself does not require an internet connection. All data is loaded locally and does not leave your machine/browser. If show is set to False, clusterfun only saves the data required to serve the plot later on and returns the path to where the data is stored. If you want to serve the plot yourself later on, you can run clusterfun {path-to-data}|{uuid} on the command line to start a local server for the plot you are interested in.
- color: Optional[str] = None. If given, points will be colored based on the values in the given column. Powerful for visualising clusters or classes of data.
- title: Optional[str] = None. The title to use for the plot.
- bounding_box: Optional[str] = None. You can visualise bounding boxes on top of your images with the bounding_box parameter. For this to work, you need to have a bounding box column in the dataframe used to plot the data. Each cell in that column needs to contain a dictionary or a list of dictionaries with bounding box values. The keys of the expected dictionary are:
  - xmin: float | int
  - ymin: float | int
  - xmax: float | int
  - ymax: float | int
  - label: Optional[str] = None
  - color: Optional[str] = None
If no color is provided, a default color scheme will be used. The color value can be a color name or hex value. The label will be displayed in the top left of the bounding box. Example:
single_bounding_box = { "xmin": 12, "ymin": 10, "xmax": 100, "ymax": 110, "color": "green", "label": "ground truth" }
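The bounding box format can be produced programmatically. The sketch below is a hypothetical helper, not part of clusterfun: `make_bounding_box` and the `xmin < xmax` / `ymin < ymax` sanity check are assumptions added for illustration. It only builds dictionaries with the keys listed above.

```python
from typing import Any, Dict, Optional, Union

Number = Union[int, float]


def make_bounding_box(
    xmin: Number,
    ymin: Number,
    xmax: Number,
    ymax: Number,
    label: Optional[str] = None,
    color: Optional[str] = None,
) -> Dict[str, Any]:
    """Build one bounding box dictionary (hypothetical helper, not a clusterfun API)."""
    # Assumption: a box with non-positive width or height is a data error.
    if xmin >= xmax or ymin >= ymax:
        raise ValueError("expected xmin < xmax and ymin < ymax")
    box: Dict[str, Any] = {"xmin": xmin, "ymin": ymin, "xmax": xmax, "ymax": ymax}
    # label and color are optional, so they are only included when given.
    if label is not None:
        box["label"] = label
    if color is not None:
        box["color"] = color
    return box


# One cell may hold a single dictionary or a list of dictionaries.
boxes_for_one_image = [
    make_bounding_box(12, 10, 100, 110, label="ground truth", color="green"),
    make_bounding_box(15, 14, 98, 105, label="prediction", color="#ff0000"),
]
```

A list like `boxes_for_one_image` would go into one cell of the bounding box column, and the column name is then passed as `bounding_box`.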
The following plot types are available:
- Bar chart
- Confusion matrix
- Grid
- Histogram
- Pie chart
- Scatterplot
- Violin plot
def bar_chart(
    df: pd.DataFrame,
    x: str,
    media: str,
    color: Optional[str] = None,
    ...
) -> Path:
- df: pd.DataFrame. The dataframe with the data to plot
- x: str. The column name of the data for the bar chart. One bar per unique value will be plotted.
- media: str. The column name of the media to display
- color: Optional[str] = None. If added, the color will be used to create a stacked bar chart.
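To make the effect of `color` on a bar chart concrete, here is a plain-Python sketch of the grouping involved. It assumes that bar height is the number of rows per unique `x` value, which the text above implies but does not state, and it uses made-up rows rather than the real dataset. It does not call clusterfun.

```python
from collections import Counter, defaultdict

# Made-up rows standing in for dataframe records; values are illustrative only.
rows = [
    {"painter": "Pablo Picasso", "style": "Cubism"},
    {"painter": "Pablo Picasso", "style": "Cubism"},
    {"painter": "Pablo Picasso", "style": "Surrealism"},
    {"painter": "Juan Gris", "style": "Cubism"},
]

# Without color: one bar per unique value of x (assumed height: row count).
bar_heights = Counter(row["painter"] for row in rows)

# With color: each bar is split into one segment per value of the color column.
stacked = defaultdict(Counter)
for row in rows:
    stacked[row["painter"]][row["style"]] += 1
```

The segments of each stacked bar add up to the height of the corresponding unstacked bar.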
import pandas as pd
import clusterfun as clt
df = pd.read_csv("https://raw.githubusercontent.com/gietema/clusterfun-data/main/wiki-art.csv")
clt.bar_chart(df, x="painter", media="img_path", color="style")
def confusion_matrix(
    df: pd.DataFrame,
    y_true: str,
    y_pred: str,
    media: str,
    ...
) -> Path:
- df: pd.DataFrame. The dataframe with the data to plot
- y_true: str. The column name of the ground truth label. Values can be integers or strings.
- y_pred: str. The column name of the predicted label. Values can be integers or strings.
- media: str. The column name of the media to display
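As a rough illustration of what the `y_true` and `y_pred` columns feed into, the sketch below counts (true, predicted) pairs in plain Python. The labels are made up, and this is the standard definition of a confusion matrix rather than clusterfun's internal code.

```python
from collections import Counter

# Made-up labels; in practice these are the y_true and y_pred columns.
y_true = ["cat", "cat", "dog", "dog", "dog"]
y_pred = ["cat", "dog", "dog", "dog", "cat"]

# Each (true, predicted) pair corresponds to one cell of the matrix.
cells = Counter(zip(y_true, y_pred))

# Rows are true labels, columns are predicted labels, both in sorted order.
labels = sorted(set(y_true) | set(y_pred))
matrix = [[cells[(t, p)] for p in labels] for t in labels]
```

Off-diagonal cells are the misclassifications, which are typically the cells worth inspecting image by image.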
import pandas as pd
import clusterfun as clt
df = pd.read_csv("https://raw.githubusercontent.com/gietema/clusterfun-data/main/cifar10.csv")
clt.confusion_matrix(df, y_true="label", y_pred="pred", media="img_path")
def grid(
    df: pd.DataFrame,
    media: str,
    ...
) -> Path:
- df: pd.DataFrame. The dataframe with the data to plot
- media: str. The column name of the media to display
import pandas as pd
import clusterfun as clt
df = pd.read_csv("https://raw.githubusercontent.com/gietema/clusterfun-data/main/wiki-art.csv")
clt.grid(df, media="img_path")
def histogram(
    df: pd.DataFrame,
    x: str,
    media: str,
    bins: int = 20,
    ...
) -> Path:
- df: pd.DataFrame. The dataframe with the data to plot
- x: str. The column name of the data for the histogram
- media: str. The column name of the media to display
- bins: int = 20. The number of bins to use for the histogram
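The `bins` parameter controls how finely the values of `x` are grouped. The sketch below shows one common scheme, equal-width bins between the minimum and maximum value. Whether clusterfun bins exactly this way is an assumption; `histogram_counts` is a hypothetical helper and the brightness values are made up.

```python
from typing import List


def histogram_counts(values: List[float], bins: int = 20) -> List[int]:
    """Count values per equal-width bin between min(values) and max(values)."""
    if not values:
        return [0] * bins
    low, high = min(values), max(values)
    width = (high - low) / bins
    counts = [0] * bins
    for value in values:
        if width == 0:
            index = 0  # all values identical: put them in the first bin
        else:
            # The maximum value belongs to the last bin, not a bin past the end.
            index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    return counts


brightness = [0.0, 0.1, 0.5, 0.99, 1.0]  # made-up values
counts = histogram_counts(brightness, bins=4)
```

With fewer bins each bar covers a wider range of values, so more media items fall under a single bar.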
import pandas as pd
import clusterfun as clt
df = pd.read_csv("https://raw.githubusercontent.com/gietema/clusterfun-data/main/wiki-art.csv")
clt.histogram(df, x="brightness", media="img_path")
def pie_chart(
    df: pd.DataFrame,
    color: str,
) -> Path:
- df: pd.DataFrame. The dataframe with the data to plot
- color: str. The column whose values define the slices of the pie chart
import pandas as pd
import clusterfun as clt
df = pd.read_csv("https://raw.githubusercontent.com/gietema/clusterfun-data/main/wiki-art.csv")
clt.pie_chart(df, color="painter", media="img_path")
def scatter(
    df: pd.DataFrame,
    x: str,
    y: str,
    ...
) -> Path:
- df: pd.DataFrame. The dataframe with the data to plot
- x: str. The column name of the data for the x-axis
- y: str. The column name of the data for the y-axis
import pandas as pd
import clusterfun as clt
df = pd.read_csv("https://raw.githubusercontent.com/gietema/clusterfun-data/main/wiki-art.csv")
clt.scatter(df, x="x", y="y", media="img_path")
def violin(
    df: pd.DataFrame,
    y: str,
    ...
) -> Path:
- df: pd.DataFrame. The dataframe with the data to plot
- y: str. The column name of the data for the y-axis
import pandas as pd
import clusterfun as clt
df = pd.read_csv("https://raw.githubusercontent.com/gietema/clusterfun-data/main/wiki-art.csv")
df = df[df.painter.isin(["Pablo Picasso", "Juan Gris", "Georges Braque", "Fernand Leger"])]
clt.violin(df, y="brightness", media="img_path")
Clusterfun supports AWS S3 and local data storage and loading. The dataframe column corresponding to the media value in the plot will be used to determine where to load the media from.
import pandas as pd
import clusterfun as clt
df = pd.read_csv("https://raw.githubusercontent.com/gietema/clusterfun-data/main/wiki-art.csv")
clt.grid(df, media="img_path")
AWS S3 media should start with s3://.
Make sure to set an AWS_REGION environment variable to the region where your data is stored.
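A minimal sketch of preparing for S3-hosted media from within Python, before creating a plot. The region `eu-west-1`, the bucket name, and the paths are placeholders, and `is_s3_uri` is a hypothetical helper for checking your own data; it is not how clusterfun itself decides where to load media from.

```python
import os

# Assumption: "eu-west-1" is a placeholder; use the region of your own bucket.
# setdefault keeps any AWS_REGION value that is already set in the environment.
os.environ.setdefault("AWS_REGION", "eu-west-1")


def is_s3_uri(media: str) -> bool:
    """Return True for media values that point at AWS S3 (they start with s3://)."""
    return media.startswith("s3://")


media_values = [
    "s3://my-bucket/images/0001.jpg",  # hypothetical bucket and key
    "/data/images/0002.jpg",  # hypothetical local path
]
s3_count = sum(is_s3_uri(value) for value in media_values)
```

Setting the variable in the shell before starting Python works just as well; the point is only that it must be set before the plot is created.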
Support for Google Cloud Storage is coming soon.





