Counting distinct values across columns using sqlite3 in python

Question:

I’m trying to count distinct values across columns using sqlite3 in python but cant seem to get the correct results. Am only able to get the count of distinct values in 1 column.

I’ve create a database and imported the csv file as a table

conn = sqlite3.connect('test.db')
curr = conn.cursor()

curr.execute('DROP TABLE IF EXISTS test')
curr.execute('CREATE TABLE social_table (from_id INTEGER, to_id INTEGER)')
conn.commit()

with open ('test.csv') as f:
    reader = csv.reader(f)
    next(reader, None)

    for row in reader:
        from_id = row[0]
        to_id = row[1]

        curr.execute('INSERT INTO test (from_id, to_id) VALUES (?, ?)',
                    (from_id, to_id))
    conn.commit()
from_id to_id
0 1
0 2
0 3
0 4
0 5
0 6
0 7
0 8
0 10
0 11

The unique users in this table across columns should be 11 but my code

curr.execute("Select Count(*) from (Select DISTINCT from_id, to_id from test)")

gives me 10 instead.

Anyone able to help me with this?

Asked By: Tia Kang Jun

||

Answers:

You should be counting distinct users from both the from_id and to_id columns. We can use a union for this purpose:

SELECT COUNT(DISTINCT id) AS cnt
FROM
(
    SELECT from_id AS id FROM social_table
    UNION ALL
    SELECT to_id FROM social_table
) t;
Answered By: Tim Biegeleisen

It seems that your query is not correct, instead try this:

curr.execute("""
    SELECT COUNT(DISTINCT user_id) 
    FROM (
        SELECT from_id as user_id FROM social_table 
        UNION 
        SELECT to_id as user_id FROM social_table
    )
""")

Just select all distinct values of from_id and then unioning them with all distinct values of to_id, then count the number of distinct values in this unioned set, which should give you the total number of unique users across both columns.

Answered By: Youcef LAIDANI

another idea is to concat the columns a string

from_id || '-' || to_id 

So you query will be

curr.execute("Select Count(*) from (Select DISTINCT from_id || '-' || to_id  from social_table)")
Answered By: Dickens A S

I guess that from_id and to_id should not be nullable, so you should correct the table definition to:

CREATE TABLE social_table (from_id INTEGER NOT NULL, to_id INTEGER NOT NULL);

In this case, instead of COUNT(DISTINCT ...) which is less efficient you should use a simple COUNT(*):

sql = """
    SELECT COUNT(*) AS count 
    FROM (
      SELECT from_id FROM social_table 
      UNION 
      SELECT to_id FROM social_table
    )
"""
curr.execute(sql)

because UNION removes duplicates from the resultset.

See the demo.

Answered By: forpas
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.