Minimise search time for text in a large CSV file
Question:
I have a CSV file with about 700 rows and 8 columns; the last column, however, holds a very large block of text (enough for multiple long paragraphs in each cell).
I’d like to implement, in Python, a text-search function that returns all the lines whose 8th-column data contains the search text (meaning it would need to scan the whole column).
What would be the quickest way to approach this and minimise search time?
Answers:
You could dump your CSV file into an SQLite database and use SQLite’s full-text search capabilities to do the search for you.
This example code shows how it could be done. There are a few things to be aware of:
- It assumes that the CSV file has a header row. If this isn’t the case, you’ll need to provide column names (or just use generic names like "col1", "col2" etc.).
- It searches all columns in the CSV; if that’s undesirable, filter out the other columns (and header values) before creating the SQL statements.
- If you want to be able to match the results to rows in the CSV file, you’ll need to create a column that contains the line number.
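As a hedged sketch of that last point: one way is to prepend a line-number column when inserting the rows. The `lineno` column name and the sample rows here are made up for illustration and are not part of the answer's code.

```python
import sqlite3

# Hypothetical sample data standing in for csv rows (header already consumed).
rows = [["apple pie", "sweet"], ["onion soup", "savoury"]]

conn = sqlite3.connect(':memory:')
conn.execute('CREATE VIRTUAL TABLE t USING fts5(lineno, food, taste)')
# The header occupies csv line 1, so the first data row is line 2.
conn.executemany('INSERT INTO t VALUES (?, ?, ?)',
                 [(str(i), *row) for i, row in enumerate(rows, start=2)])
hits = conn.execute("SELECT lineno FROM t WHERE t MATCH 'soup'").fetchall()
conn.close()
```

With this, `hits` carries the original file line number of each matching row, so results can be traced back to the CSV.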
import csv
import sqlite3
import sys


def create_table(conn, cols, name='mytable'):
    stmt = f"""CREATE VIRTUAL TABLE "{name}" USING fts5({cols})"""
    with conn:
        conn.execute(stmt)


def populate_table(conn, reader, cols, ncols, name='mytable'):
    placeholders = ', '.join(['?'] * ncols)
    stmt = f"""INSERT INTO "{name}" ({cols})
               VALUES ({placeholders})
            """
    # Filter out any blank rows in the csv
    reader = filter(None, reader)
    with conn:
        conn.executemany(stmt, reader)


def search(conn, term, cols, name='mytable'):
    stmt = f"""SELECT {cols}
               FROM "{name}"
               WHERE "{name}" MATCH ?
            """
    cursor = conn.cursor()
    cursor.execute(stmt, (term,))
    return cursor.fetchall()


def main(path, term):
    result = 'NO RESULT SET'
    # Connect before the try block so conn is always bound in finally.
    conn = sqlite3.connect(':memory:')
    try:
        with open(path, 'r', newline='') as f:
            reader = csv.reader(f)
            # Assume headers are in the first row
            headers = next(reader)
            ncols = len(headers)
            cols = ', '.join([f'"{x.strip()}"' for x in headers])
            create_table(conn, cols)
            populate_table(conn, reader, cols, ncols)
            result = search(conn, term, cols)
    finally:
        conn.close()
    return result


if __name__ == '__main__':
    print(main(*sys.argv[1:]))
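For reference, the `term` passed to MATCH is interpreted as an FTS5 query, not just a bare word: phrases and prefix matches are supported. A minimal sketch (the `docs` table and its rows are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE VIRTUAL TABLE docs USING fts5(body)')
conn.executemany('INSERT INTO docs VALUES (?)', [
    ('the quick brown fox',),
    ('a slow brown snail',),
    ('quick thinking wins',),
])
# Bare token: matches any row containing that word.
q1 = conn.execute('SELECT body FROM docs WHERE docs MATCH ?', ('quick',)).fetchall()
# Quoted phrase: tokens must appear adjacent and in order.
q2 = conn.execute('SELECT body FROM docs WHERE docs MATCH ?', ('"brown fox"',)).fetchall()
# Prefix query: any token starting with "sna".
q3 = conn.execute('SELECT body FROM docs WHERE docs MATCH ?', ('sna*',)).fetchall()
conn.close()
```

So a search term like "brown fox" (with the quotes) finds the exact phrase, while brown fox without quotes matches rows containing both words anywhere.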