Using Altair on data aggregated from large datasets

Question:

I am trying to plot a histogram of counts from a large (300,000 records) temporal dataset. For now I am just binning by month, which gives only 6 data points, but doing this with either json or altair_data_server storage makes the page crash. Is this impossible to handle well with pure Altair? I could of course preprocess in pandas, but that ruins the wonderful declarative nature of Altair.

If so, is this a missing feature of Altair or is it out of scope? I'm learning that Vega-Lite stores the entire underlying data and applies the transformations at run time, but it seems like Altair could (and maybe does) have a way to store only the data relevant to the chart.

alt.Chart(df).mark_bar().encode(
    x=alt.X('month(timestamp):T'),
    y='count()'
)
Asked By: mahnamahna


Answers:

Altair charts work by sending the entire dataset to your browser and processing it in the frontend; for this reason it does not work well for larger datasets, no matter how the dataset is served to the frontend.

In cases like yours, where you are aggregating the data before displaying it, it would in theory be possible to do that aggregation in the backend, and only send aggregated data to the frontend renderer. There are some projects that hope to make this more seamless, including scalable Vega and altair-transform, but neither approach is very mature yet.

In the meantime, I’d suggest doing your aggregations in Pandas, and sending the aggregated data to Altair to plot.
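As a rough sketch of that suggestion, the monthly counts can be computed in pandas so that only a handful of rows reach the browser. The helper names `monthly_counts` and `monthly_chart`, the output column names `month` and `n_records`, and the synthetic data are assumptions made for illustration; only the `timestamp` column comes from the question.

```python
import pandas as pd


def monthly_counts(df: pd.DataFrame, time_col: str = "timestamp") -> pd.DataFrame:
    """Return one row per calendar month with the number of records in it."""
    # Truncate every timestamp to the first instant of its month.
    months = pd.to_datetime(df[time_col]).dt.to_period("M").dt.to_timestamp()
    return (
        months.value_counts()
        .sort_index()
        .rename_axis("month")
        .reset_index(name="n_records")
    )


def monthly_chart(counts: pd.DataFrame):
    """Bar chart of the pre-aggregated counts (hypothetical helper)."""
    # Imported here so the aggregation step can be used without Altair.
    import altair as alt

    return alt.Chart(counts).mark_bar().encode(
        x=alt.X("yearmonth(month):T"),
        y="n_records:Q",
    )


if __name__ == "__main__":
    # Synthetic stand-in for the 300,000-record dataset in the question.
    df = pd.DataFrame(
        {"timestamp": pd.date_range("2023-01-01", periods=300_000, freq="min")}
    )
    counts = monthly_counts(df)
    print(counts)
    chart = monthly_chart(counts)  # embeds only the aggregated rows
```

Here `yearmonth` is used rather than `month` so that data spanning several years would not be folded into the same twelve bars; for a six-month dataset the two give the same picture.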

Edit 2023-01-25: VegaFusion addresses this problem by automatically pre-aggregating the data on the server and is mature enough for production use. Version 1.0 is available under the same license as Altair.

Answered By: jakevdp

Try the following, which disables Altair's row limit:

alt.data_transformers.enable('default', max_rows=None)

and then

alt.Chart(df).mark_bar().encode(
    x=alt.X('month(timestamp):T'),
    y='count()'
)

You will get the chart, but save all of your work first, because the browser may crash.

Answered By: ak3191

Using the following works for me:

alt.data_transformers.enable('data_server')

Answered By: Yu Shen