Counting distinct values across columns using sqlite3 in python
Question:
I’m trying to count distinct values across columns using sqlite3 in python but cant seem to get the correct results. Am only able to get the count of distinct values in 1 column.
I’ve create a database and imported the csv file as a table
conn = sqlite3.connect('test.db')
curr = conn.cursor()
curr.execute('DROP TABLE IF EXISTS test')
curr.execute('CREATE TABLE social_table (from_id INTEGER, to_id INTEGER)')
conn.commit()
with open ('test.csv') as f:
reader = csv.reader(f)
next(reader, None)
for row in reader:
from_id = row[0]
to_id = row[1]
curr.execute('INSERT INTO test (from_id, to_id) VALUES (?, ?)',
(from_id, to_id))
conn.commit()
from_id
to_id
0
1
0
2
0
3
0
4
0
5
0
6
0
7
0
8
0
10
0
11
The unique users in this table across columns should be 11 but my code
curr.execute("Select Count(*) from (Select DISTINCT from_id, to_id from test)")
gives me 10 instead.
Anyone able to help me with this?
Answers:
You should be counting distinct users from both the from_id
and to_id
columns. We can use a union for this purpose:
SELECT COUNT(DISTINCT id) AS cnt
FROM
(
SELECT from_id AS id FROM social_table
UNION ALL
SELECT to_id FROM social_table
) t;
It seems that your query is not correct, instead try this:
curr.execute("""
SELECT COUNT(DISTINCT user_id)
FROM (
SELECT from_id as user_id FROM social_table
UNION
SELECT to_id as user_id FROM social_table
)
""")
Just select all distinct values of from_id
and then unioning them with all distinct values of to_id
, then count the number of distinct values in this unioned set, which should give you the total number of unique users across both columns.
another idea is to concat the columns a string
from_id || '-' || to_id
So you query will be
curr.execute("Select Count(*) from (Select DISTINCT from_id || '-' || to_id from social_table)")
I guess that from_id
and to_id
should not be nullable, so you should correct the table definition to:
CREATE TABLE social_table (from_id INTEGER NOT NULL, to_id INTEGER NOT NULL);
In this case, instead of COUNT(DISTINCT ...)
which is less efficient you should use a simple COUNT(*)
:
sql = """
SELECT COUNT(*) AS count
FROM (
SELECT from_id FROM social_table
UNION
SELECT to_id FROM social_table
)
"""
curr.execute(sql)
because UNION
removes duplicates from the resultset.
See the demo.
I’m trying to count distinct values across columns using sqlite3 in python but cant seem to get the correct results. Am only able to get the count of distinct values in 1 column.
I’ve create a database and imported the csv file as a table
conn = sqlite3.connect('test.db')
curr = conn.cursor()
curr.execute('DROP TABLE IF EXISTS test')
curr.execute('CREATE TABLE social_table (from_id INTEGER, to_id INTEGER)')
conn.commit()
with open ('test.csv') as f:
reader = csv.reader(f)
next(reader, None)
for row in reader:
from_id = row[0]
to_id = row[1]
curr.execute('INSERT INTO test (from_id, to_id) VALUES (?, ?)',
(from_id, to_id))
conn.commit()
from_id | to_id |
---|---|
0 | 1 |
0 | 2 |
0 | 3 |
0 | 4 |
0 | 5 |
0 | 6 |
0 | 7 |
0 | 8 |
0 | 10 |
0 | 11 |
The unique users in this table across columns should be 11 but my code
curr.execute("Select Count(*) from (Select DISTINCT from_id, to_id from test)")
gives me 10 instead.
Anyone able to help me with this?
You should be counting distinct users from both the from_id
and to_id
columns. We can use a union for this purpose:
SELECT COUNT(DISTINCT id) AS cnt
FROM
(
SELECT from_id AS id FROM social_table
UNION ALL
SELECT to_id FROM social_table
) t;
It seems that your query is not correct, instead try this:
curr.execute("""
SELECT COUNT(DISTINCT user_id)
FROM (
SELECT from_id as user_id FROM social_table
UNION
SELECT to_id as user_id FROM social_table
)
""")
Just select all distinct values of from_id
and then unioning them with all distinct values of to_id
, then count the number of distinct values in this unioned set, which should give you the total number of unique users across both columns.
another idea is to concat the columns a string
from_id || '-' || to_id
So you query will be
curr.execute("Select Count(*) from (Select DISTINCT from_id || '-' || to_id from social_table)")
I guess that from_id
and to_id
should not be nullable, so you should correct the table definition to:
CREATE TABLE social_table (from_id INTEGER NOT NULL, to_id INTEGER NOT NULL);
In this case, instead of COUNT(DISTINCT ...)
which is less efficient you should use a simple COUNT(*)
:
sql = """
SELECT COUNT(*) AS count
FROM (
SELECT from_id FROM social_table
UNION
SELECT to_id FROM social_table
)
"""
curr.execute(sql)
because UNION
removes duplicates from the resultset.
See the demo.