Convert SAS proc sql to Python(pandas)

Question:

I rewrite some code from SAS to Python using Pandas library.

I’ve got such code, and I have no idea what should I do with it?

Can you help me, beacase its too complicated for me to do it correct. I’ve changed the name of columns (for encrypt sensitive data)

This is SAS code:

proc sql;
   create table &work_lib..opk_do_inf_4 as 
   select distinct         
            *,
            min(kat_opk) as opk_do_inf,
            count(nr_ks) as ilsc_opk_do_kosztu_infr
from &work_lib..opk_do_inf_3
group by kod_ow, kod_sw, nr_ks, nr_ks_pr, nazwa_zabiegu_icd_9, nazwa_zabiegu
having kat_opk = opk_do_inf
;
quit;

This is my try in Pandas:

df = self.opk_do_inf_3() -> create DF using other function
df['opk_do_inf'] = df.groupby(by=['kod_ow', 'kod_sw', 'nr_ks', 'nr_ks_pr', 'nazwa_zabiegu_icd_9', 'nazwa_zabiegu'])['kat_opk'].min()
df['ilsc_opk_do_kosztu_infr'] = df.groupby(by=['kod_ow', 'kod_sw', 'nr_ks', 'nr_ks_pr', 'nazwa_zabiegu_icd_9', 'nazwa_zabiegu'])['nr_ks'].count()
df_groupby = df.groupby(by=['kod_ow', 'kod_sw', 'nr_ks', 'nr_ks_pr', 'nazwa_zabiegu_icd_9', 'nazwa_zabiegu']).filter(lambda x: x['kat_opk']==x['opk_do_inf'])
df = df_groupby.reset_index()
df = df.drop_duplcates()
return df
Asked By: Ith

||

Answers:

You can use the sqldf function from pandasql package to run the sql query on dataframe. example below

”’ from pandasql import sqldf
query = "select top 10 * from df "
newdf = sqldf(query, locals())
”’

Answered By: SKP

First, calling SELECT * in an aggregate GROUP BY query is not valid SQL. SAS may allow it but can yield unknown results. Usually SELECT columns should be limited to columns in GROUP BY clause.

With that said, aggregate SQL queries can generally be translated in Pandas with groupby.agg() operations with WHERE (filter before aggregation) or HAVING (filter after aggregation) conditions handled using either .loc or query.

SQL

SELECT col1, col2, col3, 
       MIN(col1) AS min_col1,
       AVG(col2) AS mean_col2, 
       MAX(col3) AS max_col3, 
       COUNT(*)  AS count_obs

FROM mydata
GROUP BY col1, col2, col3
HAVING col1 = min(col1)

Pandas

General

agg_data = (mydata.groupby(["col1", "col2", "col3"], as_index=False)
                  .agg(min_col1 = ("col1", "min"),
                       mean_col2 = ("col2", "mean"),
                       max_col3 = ("col3", "max"),
                       count_obs = ("col1", "count"))
                  .query("col1 == min_col1")
           )

Specific

opk_do_inf_4 = (mydata.groupby(["kat_opk", "kod_ow", "kod_sw", "nr_ks", "nr_ks_pr", 
                                "nazwa_zabiegu_icd_9", "nazwa_zabiegu"], 
                                as_index=False)
                      .agg(opk_do_inf = ("kat_opk", "min"),
                           ilsc_opk_do_kosztu_infr = ("nr_ks", "count"))
                      .query("kat_opk == opk_do_inf")
               )
Answered By: Parfait

Can you tell how you created the function used in first line of SAS code

Answered By: user20000534