Cannot output the result of lines.first() from SparkContext in Python

Question:

I am writing my first test.py in Spark.

Code

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("My Test")
sc = SparkContext(conf=conf)

lines = sc.textFile("file:///home/hduser/spark-1.5.2-bin-hadoop2.6/README.md") # Create an RDD called lines

lines.count()
lines.first()

Output:

hduser@borischow-VirtualBox:~/spark-1.5.2-bin-hadoop2.6$ bin/spark-submit test.py
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hduser/spark-1.5.2-bin-hadoop2.6/lib/spark-assembly-1.5.2-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
15/12/28 17:42:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/12/28 17:42:46 WARN Utils: Your hostname, borischow-VirtualBox resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface eth0)
15/12/28 17:42:46 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/12/28 17:42:48 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
15/12/28 17:42:48 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
hduser@borischow-VirtualBox:~/spark-1.5.2-bin-hadoop2.6$ 

Questions:

  1. I cannot generate the expected output from lines.count() and lines.first(). Why?

  2. What are the reasons behind the warning messages?

15/12/28 17:42:46 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform… using builtin-java classes where
applicable

15/12/28 17:42:46 WARN Utils: Your hostname, borischow-VirtualBox resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on
interface eth0)

15/12/28 17:42:46 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address

15/12/28 17:42:48 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.

15/12/28 17:42:48 WARN MetricsSystem: Using default name DAGScheduler
for source because spark.app.id is not set.

Thanks a lot!

Asked By: B. Chow


Answers:

You don't see any output because the count and first methods don't write anything to stdout; they simply return values to the driver. (In the interactive pyspark shell the REPL echoes return values, which is why the same calls appear to print there but not under spark-submit.)

Just use print:

from __future__ import print_function
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("My Test")
sc = SparkContext(conf=conf)

lines = sc.textFile("file:///home/hduser/spark-1.5.2-bin-hadoop2.6/README.md")

print(lines.count())
print(lines.first())
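
The same distinction holds in plain Python, independent of Spark: a function call returns a value, and in a script that value is discarded unless you print it. A minimal sketch (the `count_lines` helper here is hypothetical, standing in for `RDD.count()`):

```python
import os
import tempfile


def count_lines(path):
    # Return the number of lines in a file; returning is not printing.
    with open(path) as f:
        return sum(1 for _ in f)


# Write a small sample file to count.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("first line\nsecond line\nthird line\n")
    path = f.name

n = count_lines(path)  # in a script, this call alone produces no output
print(n)               # only print() writes to stdout -> 3
os.remove(path)
```

Run interactively, `count_lines(path)` alone would echo `3`, because the REPL displays the return value of each expression; in a submitted script, only the `print` line reaches the console.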
Answered By: zero323