How can we skip lines automatically in excel and csv file until header row is found in python (pyspark)

Question:

Can we make the code to skip unnecessary lines until the header row is found in excel and csv files by using python (any module)

Note : if anyone can help me in doing that in pyspark would be appreciated

Example :
enter image description here

In the above sample excel we need to skip the first 3 rows automatically and start reading the file from 4th line starting with "G/L" which is the main header line of the document.

Asked By: Goutham Boine

||

Answers:

I would advise to use Pandas to read the CSV and XLSX files as it has the easiest interface and later on convert it to PySpark. I also added two alternatives that you can try out depending on your setup and preferences.

Direct Pandas

pandas_df = pd.read_excel('path/file.xlsx', sheet_name='Table 1', skiprows=3)

skiprows=3 defines, how many lines shall be skipped before Pandas starts to read the file.
Documentation

PySpark

If you want to use PySpark directly, you’ll need to install two libraries on your cluster:

  • com.crealytics:spark-excel_2.12:0.13.7 (from Maven)
  • xlrd (from PyPI)

You can use then spark.read as follows:

spark_df = spark.read.format("com.crealytics.spark.excel") 
    .option("header", "true") 
    .option("inferSchema", "true") 
    .option("dataAddress", "'Table 1'!A4") 
    .load(filePath)

Here can you tell spark to start in field A4 in the option dataAddress.

PySpark over Pandas

Using the simple interface but directly converting to PySpark is also possible since Pyspark 3.2.0. You can use the syntax as follows:

spark_df = spark.pandas.read_excel('path/file.xlsx', sheet_name='Table 1', skiprows=3)

Documentation

Answered By: elyptikus
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.