An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe

I am new to Spark and am facing an error while converting a .csv file to a DataFrame. I am using the pyspark_csv module for the conversion, but it gives an error. Here is the stack trace for the error, can any...

pyspark error: AttributeError: 'SparkSession' object has no attribute 'parallelize'

I am using PySpark in a Jupyter notebook. Here is how Spark is set up: import findspark findspark.init(spark_home='/home/edamame/spark/spark-2.0.0-bin-spark-2.0.0-bin-hadoop2.6-hive',...

installing pyspark on windows

I have a few questions which I would like to clarify before installation. Please bear with me as I am still new to data science and installation packages. I can do a pip install pyspark on my...

How to Distribute Multiprocessing Pool to Spark Workers

I am trying to use multiprocessing to read 100 csv files in parallel (and subsequently process them separately in parallel). Here is my code running in Jupyter hosted on my EMR master node in AWS....

Unable to SaveAsTextFile AttributeError: 'list' object has no attribute 'saveAsTextFile'

I have submitted a similar question relating to saveAsTextFile, but I'm not sure whether one question will provide the same answer, as I now have a new error message: I have compiled the following...

How to get the percentage of totals for each count after a groupBy in pyspark?

Given the following DataFrame: import findspark findspark.init() from pyspark.sql import SparkSession spark = SparkSession.builder.master("local").appName("test").getOrCreate() df =...

Setting up environment

I am using Google Colaboratory to learn about PySpark. For some reason, when running the environment setup, I am getting an error message. This seems to happen when moving from one notebook to...

Spark DataFrame limit function takes too much time to show

import pyspark from pyspark.sql import SparkSession from pyspark.conf import SparkConf import findspark from pyspark.sql.functions import countDistinct spark = SparkSession.builder...

Unable to install PySpark on Google Colab

I am trying to install PySpark on Google Colab using the code given below but getting the following error. tar: spark-2.3.2-bin-hadoop2.7.tgz: Cannot open: No such file or directory tar: Error is...

findspark.init() failing - Cannot get SPARK_HOME environment variables set correctly

I'm new to using Spark and I'm attempting to play with Spark on my local (Windows) machine using Jupyter Notebook. I've been following several tutorials for setting environment variables, as well as...

What is the best way to read Hive Table through Spark SQL?

I execute Spark SQL queries that read from Hive tables, and execution is lengthy (15 min). I am interested in optimizing the query execution, so I am asking whether the execution of those queries uses...

Requirement failed: Nothing has been added to this summarizer

I am trying to test that PySpark is running properly on my system, but when I try to call fit on my data I get an error, "Requirement failed: Nothing has been added to this summarizer" import...

pyspark.sql.utils.AnalysisException: Failed to find data source: kafka

I am trying to read a stream from Kafka using PySpark. I am using Spark version 3.0.0-preview2 and spark-streaming-kafka-0-10_2.12. Before this I just start ZooKeeper and Kafka and create a new...

How to execute a stored procedure in Azure Databricks PySpark?

I am able to execute a simple SQL statement using PySpark in Azure Databricks but I want to execute a stored procedure instead. Below is the PySpark code I tried. #initialize pyspark import...

Spark 3.x Integration with Kafka in Python

Kafka with spark-streaming throws an error: from pyspark.streaming.kafka import KafkaUtils ImportError: No module named kafka I have already set up a Kafka broker and a working Spark environment...

Not Able to Run Pyspark in Google Colab

Hi, I am trying to run PySpark on Google Colab using the following code: !apt-get install openjdk-8-jdk-headless -qq > /dev/null !wget -q...

Connecting to Redshift via PySpark, how do we get drivers to work?

I was trying to connect to Redshift using PySpark and kept getting the "Failed to find data source: com.databricks.spark.redshift" error. I was finally able to get rid of that error by manually...

Using pyspark in Google Colab

This is my first question here after using StackOverflow a lot, so correct me if I give inaccurate or incomplete info. Up until this week I had a Colab notebook set up to run with PySpark...

Docker Spark 3.0.0 pyspark py4j.protocol.Py4JError

I created a Docker image with Spark 3.0.0 that is to be used for executing PySpark from a Jupyter notebook. The issue I'm having, though, when running the Docker image locally and testing the...

Using xgboost in Pyspark gives ImportError: cannot import name 'JavaPredictionModel'

This is my first attempt to use xgboost in PySpark, so my experience with Java and PySpark is still in the learning phase. I saw an awesome article on Towards Data Science with the title PySpark ML and...

Elephas tutorial error - ValueError: Could not interpret optimizer identifier

I'm trying to run this elephas tutorial on Colab. I prepared the environment with !apt-get install openjdk-8-jdk-headless -qq > /dev/null !wget -q...

ModuleNotFoundError: No module named 'pyspark'

I recently installed PySpark on Linux and get an error when importing pyspark: ModuleNotFoundError: No module named 'pyspark' PySpark is in my 'pip list' I added the following lines to my...

Conda Install command failing

I am using conda 4.6.11 and below is the requirements file that I am using # This file may be used to create an environment...

pyspark pandas UDF EOFError on macOS

Running a pandas UDF on macOS (Big Sur) results in the error below, while the exact same code works perfectly fine on Google Colab. Moreover, Spark UDFs work fine. 20/12/09 14:02:22 ERROR...

How come PySpark can't find my SPARK_HOME

I am trying to run a Jupyter notebook from Archives Unleashed locally on my machine. When the notebook builds PySpark, it runs into this exception: Exception: Unable to find py4j, your SPARK_HOME...

conda install pyspark 2.3.1 InvalidVersionSpecError: Invalid version spec: =2.7

I am trying to install PySpark in my virtual environment on a Linux box. The following code was used and was previously working fine. All of a sudden we are facing an issue. conda install -q -y -c...

TypeError: 'JavaPackage' object is not callable on Google Colab

I am learning Apache Spark and I ran the code below on Google Colab. #installed based upon...

Py4JNetworkError: An error occurred while trying to connect to the Java server (127.0.0.1:61698)

While using the Alternating Least Squares method to perform matrix factorization, I encountered an error with the Java server in Spark. I don't know why this happens. Below is the error...

kafka integration with Pyspark structured streaming (Windows)

After installing Anaconda on my Windows 10 machine, I followed the following tutorial to set it up on my machine and run it with Jupyter:...

py4j.protocol.Py4JJavaError: An error occurred while calling o32.load

While executing the code below I am getting: py4j.protocol.Py4JJavaError: An error occurred while calling o32.load. import time import findspark ...