Splitting a text file based on empty lines in Spark
Question:
I am working with a really big text file, almost 2 GB in size.
It looks something like this:
#*MOSFET table look-up models for circuit simulation
#t1984
#cIntegration, the VLSI Journal
#index1
#*The verification of the protection mechanisms of high-level language machines
#@Virgil D. Gligor
#t1984
#cInternational Journal of Parallel Programming
#index2
#*Another view of functional and multivalued dependencies in the relational database model
#@M. Gyssens, J. Paredaens
#t1984
#cInternational Journal of Parallel Programming
#index3
#*Entity-relationship diagrams which are in BCNF
#@Sushil Jajodia, Peter A. Ng, Frederick N. Springsteel
#t1984
#cInternational Journal of Parallel Programming
#index4
I want to read the file in Spark and split it on the empty lines, so that each block of data becomes one record in PySpark, like this:
#*Entity-relationship diagrams which are in BCNF #@Sushil Jajodia, Peter A. Ng, Frederick N. Springsteel #t1984 #cInternational Journal of Parallel Programming #index4
The code I currently wrote is
rdd = sc.textFile('acm.txt').flatMap(lambda x: x.split("\n\n"))
Answers:
From what I understand, you want to read this text file in Spark and have one record per paragraph. (Your current code does not work because `textFile` already splits the input line by line, so no single element ever contains a `"\n\n"`.) For that, you can change the record delimiter (which is `"\n"` by default) like this:
In Scala:
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "\n\n")
val rdd = sc.textFile("acm.txt")
In Python (you need to go through the Java Spark context to access the Hadoop configuration):
sc._jsc.hadoopConfiguration().set("textinputformat.record.delimiter", "\n\n")
rdd = sc.textFile("acm.txt")
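Once each paragraph arrives as a single record, it still has to be parsed into fields. A minimal sketch in plain Python, so it can be passed straight to `rdd.map` (`parse_record` is a hypothetical helper name; the field tags `#*`, `#@`, `#t`, `#c`, `#index` are read off the sample data above):

```python
def parse_record(block):
    """Turn one "\n\n"-delimited block of tagged lines into a dict."""
    fields = {}
    for line in block.splitlines():
        if line.startswith("#*"):          # title
            fields["title"] = line[2:].strip()
        elif line.startswith("#@"):        # comma-separated author list
            fields["authors"] = [a.strip() for a in line[2:].split(",")]
        elif line.startswith("#index"):    # record id (check before "#t"/"#c")
            fields["index"] = line[len("#index"):].strip()
        elif line.startswith("#t"):        # year
            fields["year"] = line[2:].strip()
        elif line.startswith("#c"):        # venue
            fields["venue"] = line[2:].strip()
    return fields

# In Spark, this would be applied as:
#   records = rdd.map(parse_record)
```

Keeping the parser a plain function also makes it easy to unit-test on one sample block before running it on the full 2 GB file.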