Find array values in C# DataFrame (equivalent to .isin in Python)?

Question:

I want to convert a working Python script to C#.

I have a C# DataFrame, using the Microsoft.Data.Analysis library.
Column names are [time], [site], [samples], [temperature].

I need to process two sequential tasks:

  1. Group rows with the same [time] AND [site] –> sum the values in [samples] and keep only one value for the [temperature] column, the last one.
    In Python (Pandas) I have done this:

    df_out= df_in.groupby(['time','site'], as_index=False).agg({'samples':'sum', 'temperature':'last'})

  2. Keep the rows whose [samples] value matches ANY of the values in a constant array of integers (checking against ALL of them). In Python I’ve done the following:

    df_out= df_out.loc[df_out['samples'].isin(int_array)]

In Python, which I am more confident with, the .groupby(...) and .isin(...) methods are straightforward and very well described in the Pandas docs. Can anyone help me convert this to C# in the most efficient way?

Thank you in advance

Asked By: Lorenzo Bassetti

Answers:

  1. Access row values via the indexer to group them by time and site.
  2. Assuming the second task follows the first one, you can perform both in a single Select() operation:
  • Sum the samples from the local grouping and save the result as SamplesSum. To sum them, you’ll need to cast each value to the appropriate type; I used int as an example.
  • Get the last temperature from the last grouping entry and save it as LastTemperature.
  • Lastly, create an intersection of the two collections (int_array and the local grouping of samples) and save it as MatchingValues. Here too, don’t forget the proper cast when selecting the sample value from a data frame row.

I’m a bit worried about selecting the last temperature without sorting first. The last one will simply be the last entry in the grouping, with no guarantee that it is the smallest or the highest value.

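using System.Linq;
using Microsoft.Data.Analysis;

// Values to match against; replace with your own constants.
var int_array = new int[] { 1, 2, 3 };
var dF_out = df_in.Rows
    // Group by [time] (column 0) and [site] (column 1).
    .GroupBy(row => new { Time = row[0], Site = row[1] })
    .Select(group => new
    {
        // The indexer returns object, so cast [samples] (column 2) before summing.
        SamplesSum = group.Sum(row => (int)row[2]),
        // Last row of the group in its current order; [temperature] is column 3.
        LastTemperature = group.Last()[3],
        // Values of int_array that occur among this group's samples.
        MatchingValues = int_array.Intersect(group.Select(row => (int)row[2])),
    });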

The resulting dF_out collection will have a structure like this (illustrative values, not produced from the int_array shown above):

[
   {
      "SamplesSum":25,
      "LastTemperature":28.0,
      "MatchingValues":[
         21,
         4
      ]
   },
   {
      "SamplesSum":3,
      "LastTemperature":27.0,
      "MatchingValues":[
         3
      ]
   }
]
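
For comparison with the pandas lines in the question, here is a minimal sketch of "aggregate, then filter" in plain LINQ. It does not use the Microsoft.Data.Analysis API: the rows list, its tuple field names and the sample values are made up for illustration, and the HashSet named wanted stands in for int_array. Unlike the snippet above, which intersects each group's individual samples with int_array, this keeps a group only when its summed samples value is in the set, which is how .isin behaves on the aggregated frame.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory rows standing in for the DataFrame contents.
var rows = new List<(string Time, string Site, int Samples, double Temperature)>
{
    ("t1", "A", 1, 20.0),
    ("t1", "A", 2, 21.0),
    ("t1", "B", 5, 19.0),
    ("t2", "A", 4, 22.0),
};

// Stand-in for int_array; a HashSet gives constant-time membership tests.
var wanted = new HashSet<int> { 3, 4 };

var result = rows
    // Step 1: group by (time, site), sum samples, keep the last temperature.
    .GroupBy(r => (r.Time, r.Site))
    .Select(g => (
        g.Key.Time,
        g.Key.Site,
        Samples: g.Sum(r => r.Samples),
        Temperature: g.Last().Temperature))
    // Step 2: keep only groups whose summed samples value is in the set.
    .Where(r => wanted.Contains(r.Samples))
    .ToList();

// Keeps the (t1, A) group (sum 3) and the (t2, A) group (sum 4).
foreach (var r in result)
    Console.WriteLine($"{r.Time} {r.Site} {r.Samples} {r.Temperature}");
```

Here "last" means the last row of each group in input order, since GroupBy preserves the order of elements within a group. If it should mean the latest by some ordering column, sorting inside the group before calling Last() would make that explicit.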
Answered By: Prolog

I went through a similar task, so I can report a possible solution for other readers:

using System.Linq;
using Microsoft.Data.Analysis;

// Assume that df_in is a DataFrame with columns [time], [site], [samples], and [temperature]

var df_out = df_in.AsEnumerable()
    .GroupBy(row => new { Time = row.Field<DateTime>("time"), Site = row.Field<string>("site") })
    .Select(g => new
    {
        Time = g.Key.Time,
        Site = g.Key.Site,
        Samples = g.Sum(row => row.Field<int>("samples")),
        Temperature = g.Last().Field<float>("temperature")
    })
    .ToDataFrame();

Then, for the second task:

using System.Linq;

// Assume that df_out is a DataFrame with a column [samples] and int_array is an array of integers

var filtered_df = df_out.AsEnumerable()
    .Where(row => int_array.Any(i => i == row.Field<int>("samples")))
    .ToDataFrame();
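
AsEnumerable() and Field<T>() are also the names of the System.Data extension methods for DataTable and DataRow, so they may not be available on your DataFrame type. In that case, the following sketch of the second task uses a boolean mask instead. It assumes the Microsoft.Data.Analysis NuGet package is referenced and that the column constructors and the Filter(PrimitiveDataFrameColumn<bool>) overload behave as in the library versions I know. The sample data and the names wanted and mask are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Data.Analysis;

// Made-up aggregated data standing in for df_out.
var site = new StringDataFrameColumn("site", new[] { "A", "B", "A" });
var samples = new PrimitiveDataFrameColumn<int>("samples", new[] { 3, 5, 4 });
var df_out = new DataFrame(site, samples);

// Stand-in for int_array.
var wanted = new HashSet<int> { 3, 4 };

// One boolean per row: true when the samples value is in the set.
// The column enumerates as int?, so null entries never match.
var mask = new PrimitiveDataFrameColumn<bool>(
    "mask",
    samples.Select(v => v.HasValue && wanted.Contains(v.Value)));

// Rows where the mask is true are kept (here, samples 3 and 4 are expected).
DataFrame filtered_df = df_out.Filter(mask);
Console.WriteLine(filtered_df.Rows.Count);
```

The mask approach keeps the result as a DataFrame without a round trip through an intermediate collection, and the HashSet lookup avoids scanning int_array once per row.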
Answered By: Lorenzo Bassetti