datatype for handling big numbers in pyspark
Question:
I am using Spark with Python. After loading a CSV file, I needed to parse a column that contains numbers which are 22 digits long. For parsing that column I used LongType(), and I used the map() function to define the column.
Following are my commands in PySpark.
>>> test=sc.textFile("test.csv")
>>> header=test.first()
>>> schemaString = header.replace('"','')
>>> testfields = [StructField(field_name, StringType(), True) for field_name in schemaString.split(',')]
>>> testfields[5].dataType = LongType()
>>> testschema = StructType(testfields)
>>> testHeader = test.filter(lambda l: "test_date" in l)
>>> testNoHeader = test.subtract(testHeader)
>>> test_temp = testNoHeader.map(lambda k: k.split(",")).map(lambda p: (p[0], p[1], p[2], p[3], p[4], float(p[5].strip('"')), p[6], p[7]))
>>> test_temp.top(2)
Note: I have also tried 'long' and 'bigint' in place of 'float' in my variable test_temp, but the error in Spark was 'keyword not found'.
And the following is the output:
[('2012-03-14', '7', '1698.00', 'XYZ02abc008793060653', 'II93', 8.27370028700801e+21, 'W0W0000000000007', '879870080088815007'), ('2002-03-14', '1', '999.00', 'ABC02E000050086941', 'II93', 8.37670028702205e+21, 'A0B0080000012523', '870870080000012421')]
The values in my CSV file are as follows:
8.27370028700801e+21 is 8273700287008010012345
8.37670028702205e+21 is 8376700287022050054321
When I create a DataFrame out of it and then query it,
>>> test_df = sqlContext.createDataFrame(test_temp, testschema)
>>> test_df.registerTempTable("test")
>>> sqlContext.sql("SELECT test_column FROM test").show()
the test_column comes back as null for all the records.
So, how can I solve this problem of parsing big numbers in Spark? I'd really appreciate your help.
Answers:
Well, types matter. Since you convert your data to float you cannot use LongType in the DataFrame. It doesn't blow up only because PySpark is relatively forgiving when it comes to types.
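That mismatch is what produces the nulls seen in the question. A minimal sketch reproducing it (assuming the same sqlContext as in the question; newer Spark versions may raise a TypeError here instead of returning null):
from pyspark.sql.types import StructType, StructField, LongType
bad_schema = StructType([StructField("x", LongType(), True)])
# A Python float does not match LongType, so the value comes back as null
sqlContext.createDataFrame([(8.27370028700801e+21, )], bad_schema).show()
## +----+
## |   x|
## +----+
## |null|
## +----+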
Also, 8273700287008010012345 is too large to be represented as LongType, which can represent only values between -9223372036854775808 and 9223372036854775807.
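You can verify the overflow with plain Python, independent of Spark:
# LongType is a signed 64-bit integer, so its maximum is 2**63 - 1
max_long = 2**63 - 1                      # 9223372036854775807
print(8273700287008010012345 > max_long)  # True: the value cannot fit in LongType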
If you want to convert your data to a DataFrame you'll have to use DoubleType:
from pyspark.sql.types import *
rdd = sc.parallelize([(8.27370028700801e+21, )])
schema = StructType([StructField("x", DoubleType(), False)])
rdd.toDF(schema).show()
## +-------------------+
## | x|
## +-------------------+
## |8.27370028700801E21|
## +-------------------+
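Keep in mind that a double carries only about 15-17 significant decimal digits, so the low-order digits of a 22-digit value are lost. A quick round-trip check in plain Python illustrates this:
v = 8273700287008010012345
print(int(float(v)) == v)  # False: the double rounds away the low digits
print(float(v))            # 8.27370028700801e+21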
Typically it is a better idea to handle this with DataFrames directly:
from pyspark.sql.functions import col
str_df = sc.parallelize([("8273700287008010012345", )]).toDF(["x"])
str_df.select(col("x").cast("double")).show()
## +-------------------+
## | x|
## +-------------------+
## |8.27370028700801E21|
## +-------------------+
If you don't want to use Double you can cast to Decimal with a specified precision:
str_df.select(col("x").cast(DecimalType(38))).show(1, False)
## +----------------------+
## |x |
## +----------------------+
## |8273700287008010012345|
## +----------------------+
Use decimal(precision, scale), and make sure the scale is appropriate for your data.
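For example, keeping the full 22-digit integer with no fractional part corresponds to decimal(38, 0), which is what the cast above effectively used, since DecimalType defaults the scale to 0. A short sketch showing the resulting schema (reusing str_df and col from above):
from pyspark.sql.types import DecimalType
dec_df = str_df.select(col("x").cast(DecimalType(38, 0)).alias("x"))
dec_df.printSchema()
## root
##  |-- x: decimal(38,0) (nullable = true)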