No FileSystem for scheme: s3 with pyspark

Question:

I’m trying to read a txt file from S3 with Spark, but I’m getting this error:

No FileSystem for scheme: s3

This is my code:

from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("first")
sc = SparkContext(conf=conf)
data = sc.textFile("s3://"+AWS_ACCESS_KEY+":" + AWS_SECRET_KEY + "@/aaa/aaa/aaa.txt")

header = data.first()

This is the full traceback:

An error occurred while calling o25.partitions.
: java.io.IOException: No FileSystem for scheme: s3
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:258)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:194)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
    at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:61)
    at org.apache.spark.api.java.AbstractJavaRDDLike.partitions(JavaRDDLike.scala:45)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)

How can I fix this?

Asked By: Filipe Ferminiano


Answers:

If you are using a local machine you can use boto3:

import boto3

s3 = boto3.resource('s3')
# get a handle on the bucket that holds your file
bucket = s3.Bucket('yourBucket')
# get a handle on the object you want (i.e. your file)
obj = bucket.Object(key='yourFile.extension')
# get the object
response = obj.get()
# read the contents of the file (bytes) and split it into a list of lines
lines = response['Body'].read().decode('utf-8').split('\n')

(Do not forget to set up your AWS credentials.)
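
If you prefer not to rely on a credentials file or environment variables, boto3 also accepts the keys directly. A minimal sketch, where the key values, bucket, and object names are placeholders for your own:

import boto3

# Pass the credentials explicitly instead of relying on ~/.aws/credentials
# or environment variables (placeholder values).
s3 = boto3.resource(
    's3',
    aws_access_key_id='YOUR_ACCESS_KEY',
    aws_secret_access_key='YOUR_SECRET_KEY',
)
obj = s3.Bucket('yourBucket').Object('yourFile.extension')
# read() returns bytes, so decode before splitting into lines
lines = obj.get()['Body'].read().decode('utf-8').splitlines()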

Another clean solution, if you are using an AWS virtual machine (EC2), is to grant S3 permissions to your EC2 instance and launch pyspark with this command:

pyspark --packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2

If you add other packages, make sure the format is: ‘groupId:artifactId:version’ and the packages are separated by commas.
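
Once the shell has started with those packages, the s3a:// scheme is available, so the read from the question works against it. A minimal sketch, assuming the SparkContext sc from the question and a placeholder bucket and path:

# The EC2 instance role supplies the credentials, so no keys in the URL.
data = sc.textFile("s3a://yourBucket/aaa/aaa/aaa.txt")
header = data.first()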

If you are using pyspark from a Jupyter notebook, this will work:

import os
import pyspark
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2 pyspark-shell'
from pyspark.sql import SQLContext
from pyspark import SparkContext
sc = SparkContext()
sqlContext = SQLContext(sc)
filePath = "s3a://yourBucket/yourFile.parquet"
df = sqlContext.read.parquet(filePath) # Parquet file read example
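
If the notebook is not running on an instance with an IAM role, the credentials can be supplied through the Hadoop configuration instead of being embedded in the URL. A sketch, reusing the AWS_ACCESS_KEY and AWS_SECRET_KEY variables from the question:

# Set the S3A credentials on the underlying Hadoop configuration.
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", AWS_ACCESS_KEY)
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", AWS_SECRET_KEY)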
Answered By: gilgorio

If you’re using a Jupyter notebook, you must add two JAR files to Spark’s classpath, for example under:

/home/ec2-user/anaconda3/envs/ENV-XXX/lib/python3.6/site-packages/pyspark/jars

The two files are (an alternative that avoids copying them is sketched after this list):

  • hadoop-aws-2.10.1-amzn-0.jar
  • aws-java-sdk-1.11.890.jar
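
An alternative is to point Spark at the JARs with the spark.jars config rather than copying them into the site-packages directory. A sketch, assuming the two JARs above were downloaded to /home/ec2-user/jars (a placeholder location):

from pyspark.sql import SparkSession

# Reference the downloaded JARs directly instead of copying them into
# pyspark's jars directory (paths are placeholders).
spark = (
    SparkSession.builder
    .config("spark.jars",
            "/home/ec2-user/jars/hadoop-aws-2.10.1-amzn-0.jar,"
            "/home/ec2-user/jars/aws-java-sdk-1.11.890.jar")
    .getOrCreate()
)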
Answered By: welkinwalker

The above answers are correct regarding the need to specify the Hadoop <-> AWS dependencies.

They do not cover the newer versions of Spark, however, so I will post what worked for me, especially since this changed as of Spark 3.2.x, when Spark upgraded to Hadoop 3.0.

Spark 3.0.3

  • --packages: org.apache.hadoop:hadoop-aws:2.10.2,org.apache.hadoop:hadoop-client:2.10.2
  • --exclude-packages: com.google.guava:guava

Spark 3.2+

The cloud integration module is now released together with Spark, so use the same version as your Spark (a builder sketch follows the list below).

  • --packages: org.apache.spark:spark-hadoop-cloud_2.12:3.2.0
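
For example, with the SparkSession builder, a minimal sketch (3.2.0 stands in for whatever Spark version you run):

from pyspark.sql import SparkSession

# Pull in the cloud integration module that matches the running Spark version.
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "org.apache.spark:spark-hadoop-cloud_2.12:3.2.0")
    .getOrCreate()
)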

According to Spark documentation

According to the Spark documentation, you should use the org.apache.spark:hadoop-cloud_2.12:<SPARK_VERSION> library. The problem is that this library does not exist in the central Maven repository.

How to set extra dependencies?

In spark-submit

Use the --packages and --exclude-packages parameters.

In SparkSession.builder

Use the spark.jars.packages and spark.jars.excludes Spark configs:

spark = (
    SparkSession
    .builder
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.10.2,org.apache.hadoop:hadoop-client:2.10.2")
    .config("spark.jars.excludes", "com.google.guava:guava")
    .getOrCreate()
)

s3 vs s3a

The above adds the S3AFileSystem to Spark’s classpath. With that in place, if you also set the following Spark config, not only s3a://... paths but also s3://... paths will work:

  • spark.hadoop.fs.s3.impl: org.apache.hadoop.fs.s3a.S3AFileSystem

Set it using SparkSession.builder.config() or via --conf when using spark-submit.
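
For example, via the builder, a sketch reusing the package coordinates from the example above:

from pyspark.sql import SparkSession

# Adding the s3 -> S3A mapping on top of the earlier configuration makes the
# question's original s3://... path resolve through S3AFileSystem.
spark = (
    SparkSession.builder
    .config("spark.jars.packages",
            "org.apache.hadoop:hadoop-aws:2.10.2,org.apache.hadoop:hadoop-client:2.10.2")
    .config("spark.jars.excludes", "com.google.guava:guava")
    .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .getOrCreate()
)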

Answered By: botchniaque