amazon-emr

Unable to read data from mongoDB using Pyspark or Python in AWS EMR

Question: I am trying to read data from a 3-node MongoDB cluster (replica set) using PySpark and native Python in AWS EMR. I am facing issues while executing the code within the AWS EMR cluster, as explained below, but the same codes are …

Total answers: 1

Perform preprocessing operations from pandas on Spark dataframe

Question: I have a rather large CSV, so I am using AWS EMR to read the data into a Spark dataframe to perform some operations. I have a pandas function that does some simple preprocessing: def clean_census_data(df): """ This function cleans the dataframe and drops columns that …

Total answers: 2

MRJob: I'm having a client error while using EMR

Question: I’m a newbie in mrjob and EMR and I’m still trying to figure out how things work. I’m getting this error when I run my script: python3 MovieSimilarities.py -r emr --items=ml-100k/u.item ml-100k/u.data > sims2t.txt No configs found; falling back on auto-configuration No configs specified …

Total answers: 2

Amazon EMR: Pyspark having strange dependency issues

Question: I have been having issues getting a pyspark job to run on an EMR cluster, so I logged into the master node and ran spark-submit directly there. I have a Python file that I submit to pyspark, and in this file I have: import subprocess from …

Total answers: 2