How to use S3 Select with tab-separated CSV files
Question:
I’m using this script to query data from a CSV file stored in an AWS S3 bucket. It works well with files that were originally saved in comma-separated format, but I have a lot of data saved with a tab delimiter (sep='\t'), which makes the code fail.
The original data is very large, which makes it impractical to rewrite. Is there a way to specify the delimiter/separator of the CSV file when querying it?
I adapted the script from this post: https://towardsdatascience.com/how-i-improved-performance-retrieving-big-data-with-s3-select-2bd2850bc428 … I’d like to thank the writer for the tutorial, which saved me a lot of time.
Here’s the code:
import boto3
import os
import pandas as pd

S3_KEY = r'source/df.csv'
S3_BUCKET = 'my_bucket'
TARGET_FILE = 'dataset.csv'
aws_access_key_id = 'my_key'
aws_secret_access_key = 'my_secret'

s3_client = boto3.client(service_name='s3',
                         region_name='us-east-1',
                         aws_access_key_id=aws_access_key_id,
                         aws_secret_access_key=aws_secret_access_key)

query = """SELECT column1
        FROM S3Object
        WHERE column1 = '4223740573'"""

result = s3_client.select_object_content(Bucket=S3_BUCKET,
                                         Key=S3_KEY,
                                         ExpressionType='SQL',
                                         Expression=query,
                                         InputSerialization={'CSV': {'FileHeaderInfo': 'Use'}},
                                         OutputSerialization={'CSV': {}})

# remove the file if exists, since we append filtered rows line by line
if os.path.exists(TARGET_FILE):
    os.remove(TARGET_FILE)

with open(TARGET_FILE, 'a+') as filtered_file:
    # write header as a first line, then append each row from S3 select
    filtered_file.write('Column1\n')
    for record in result['Payload']:
        if 'Records' in record:
            res = record['Records']['Payload'].decode('utf-8')
            filtered_file.write(res)

result = pd.read_csv(TARGET_FILE)
Answers:
The InputSerialization option also allows you to specify:
RecordDelimiter – A single character used to separate individual records in the input. Instead of the default value, you can specify an arbitrary delimiter.
So you could try:
result = s3_client.select_object_content(
    Bucket=S3_BUCKET,
    Key=S3_KEY,
    ExpressionType='SQL',
    Expression=query,
    InputSerialization={'CSV': {'FileHeaderInfo': 'Use', 'RecordDelimiter': '\t'}},
    OutputSerialization={'CSV': {}})
Actually, I had a TSV file, and I used this InputSerialization:
InputSerialization={'CSV': {'FileHeaderInfo': 'None', 'RecordDelimiter': '\n', 'FieldDelimiter': '\t'}}
It works for files that have newlines between records and tabs between fields, which is the usual TSV layout.
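To put the FieldDelimiter approach together, here is a minimal sketch rather than a definitive implementation. The helper names (tsv_input_serialization, collect_records, select_tsv) are made up for illustration and are not part of boto3. It assumes the object uses '\n' between records and '\t' between fields, passes FileHeaderInfo in the upper-case form the boto3 documentation lists, and joins the payload chunks before decoding, since a chunk boundary may not line up with a record or character boundary.

```python
import csv
import io


def tsv_input_serialization(has_header=True):
    """Build an InputSerialization dict for a tab-separated object (hypothetical helper)."""
    return {
        'CSV': {
            # 'USE' lets the query refer to columns by header name;
            # 'NONE' treats the first line as data, with columns addressed as _1, _2, ...
            'FileHeaderInfo': 'USE' if has_header else 'NONE',
            'RecordDelimiter': '\n',  # assumption: one record per line
            'FieldDelimiter': '\t',   # tabs between fields
        }
    }


def collect_records(event_stream):
    """Join the payload bytes of all 'Records' events, then decode once."""
    chunks = []
    for event in event_stream:
        # Other event types (Stats, Progress, End, ...) carry no row data.
        if 'Records' in event:
            chunks.append(event['Records']['Payload'])
    return b''.join(chunks).decode('utf-8')


def select_tsv(s3_client, bucket, key, query, has_header=True):
    """Run an S3 Select query on a TSV object and return the rows as lists of strings."""
    response = s3_client.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType='SQL',
        Expression=query,
        InputSerialization=tsv_input_serialization(has_header),
        OutputSerialization={'CSV': {}},  # output is comma-separated by default
    )
    text = collect_records(response['Payload'])
    return list(csv.reader(io.StringIO(text)))
```

With the client and query from the question, this would be called as rows = select_tsv(s3_client, S3_BUCKET, S3_KEY, query). The rows could then be passed to pd.DataFrame(rows, columns=['column1']) instead of going through a temporary file.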