Convert list items to defined data type RDD

Question:

I’m working in an Apache Spark Python (PySpark) workspace on Databricks (Cloudera). The idea is to read a CSV file and format each field.

So, the first step was to read the csv:

uber = sc.textFile("dbfs:/mnt/uber/201601/pec2/uber_curated.csv")

The next step was to convert each line to a list of values:

uber_parsed = uber.map(lambda lin: lin.split(","))
print(uber_parsed.first())

The result was:

[u'B02765', u'2015-05-08 19:05:00', u'B02764', u'262', u'Manhattan', u'Yorkville East']

But now I need to convert each item of the following list of values to the format String, Date, String, Integer, String, String:

[[u'B02765', u'2015-05-08 19:05:00', u'B02764', u'262', u'Manhattan', u'Yorkville East'],
[u'B02767', u'2015-05-08 19:05:00', u'B02789', u'400', u'New York', u'Yorkville East']]

Does anybody know how to do it?

Asked By: charlytag


Answers:

You can use the CSV reader. In Spark 2.x it is built in; in Spark 1.x you’ll need an external dependency (spark-csv).

from pyspark.sql.types import *

sqlContext.read.format("csv").schema(StructType([
    StructField("_1", StringType()),
    StructField("_2", TimestampType()),
    StructField("_3", StringType()),
    StructField("_4", IntegerType()),
    StructField("_5", StringType()),
    StructField("_6", StringType()),
])).load("dbfs:/mnt/uber/201601/pec2/uber_curated.csv").rdd

or

sqlContext.read.format("csv").schema(StructType([
    StructField("_1", StringType()),
    StructField("_2", DateType()),
    StructField("_3", StringType()),
    StructField("_4", IntegerType()),
    StructField("_5", StringType()),
    StructField("_6", StringType()),
])).option("dateFormat", "yyyy-MM-dd HH:mm:ss").load(
    "dbfs:/mnt/uber/201601/pec2/uber_curated.csv"
).rdd

You can replace (_1, _2 .._n) with descriptive field names.
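If you prefer to stay with the plain RDD from the question instead of the DataFrame reader, you can map a small conversion function over each parsed row. A minimal sketch of such a converter, shown here in pure Python without a Spark context (the function name `convert_row` and the field names in the comments are illustrative assumptions, not part of the original answer):

```python
from datetime import datetime

def convert_row(row):
    # row is one parsed CSV line, e.g.
    # ['B02765', '2015-05-08 19:05:00', 'B02764', '262', 'Manhattan', 'Yorkville East']
    # Returns (str, datetime, str, int, str, str).
    base, pickup, affiliated, loc_id, borough, zone = row
    return (
        base,
        datetime.strptime(pickup, "%Y-%m-%d %H:%M:%S"),  # Date field
        affiliated,
        int(loc_id),                                     # Integer field
        borough,
        zone,
    )

row = ['B02765', '2015-05-08 19:05:00', 'B02764', '262',
       'Manhattan', 'Yorkville East']
print(convert_row(row))
```

With the RDD from the question this would be applied as `uber_typed = uber_parsed.map(convert_row)`. Note that this raises `ValueError` on malformed rows, whereas the DataFrame reader above yields nulls for unparseable fields.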

Answered By: user7337271