Remove blank space from data frame column values in Spark

I have a data frame (business_df) of schema: |-- business_id: string (nullable = true) |-- categories: array (nullable = true) | |-- element: string (containsNull = true) |-- city: string...

How to exclude multiple columns in Spark dataframe in Python

I found PySpark has a method called drop but it seems it can only drop one column at a time. Any ideas about how to drop multiple columns at the same time? df.drop(['col1','col2']) TypeError ...

PySpark: how to resample frequencies

Imagine a Spark Dataframe consisting of value observations from variables. Each observation has a specific timestamp and those timestamps are not the same between different variables. This is...

pyspark : NameError: name 'spark' is not defined

I am copying the pyspark.ml example from the official document website: http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.Transformer data = [(Vectors.dense([0.0, 0.0]),),...

PySpark sampleBy using multiple columns

I want to carry out a stratified sampling from a data frame on PySpark. There is a sampleBy(col, fractions, seed=None) function, but it seems to only use one column as a strata. Is there any way...

PySpark - Adding a Column from a list of values using a UDF

I have to add a column to a PySpark dataframe based on a list of values. a= spark.createDataFrame([("Dog", "Cat"), ("Cat", "Dog"), ("Mouse", "Cat")],["Animal", "Enemy"]) I have a list called...

Read first line of huge Json file with Spark using Pyspark

I'm pretty new to Spark, and to teach myself I have been using small JSON files, which work perfectly. I'm using PySpark with Spark 2.2.1. However, I don't understand how to read a single data line...

The system cannot find the path specified error while running pyspark

I just downloaded spark-2.3.0-bin-hadoop2.7.tgz. After downloading, I followed the steps mentioned here: pyspark installation for Windows 10. I used the command bin\pyspark to run Spark and got...

PySpark logging from the executor in a standalone cluster

This question has answers related to how to do this on a YARN cluster. But what if I am running a standalone spark cluster? How can I log from executors? Logging from the driver is easy using the...

Read SAS file to get meta information

Very new to data science technologies. Currently working on reading a SAS file (.sas7bdat). Able to read the file using: with SAS7BDAT('/dbfs/mnt/myMntScrum1/sasFile.sas7bdat') as f: for row in...

Total zero count across all columns in a pyspark dataframe

I need to find the percentage of zeros across all columns in a pyspark dataframe. How do I find the count of zeros across each column in the dataframe? P.S.: I have tried converting the dataframe into...

pyarrow error: toPandas attempted Arrow optimization

When I set pyarrow to true while creating the Spark session and then run toPandas(), it throws the error: "toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to...

Failing to decompress streaming data by using UDF on Azure Databricks - Python

I am trying to read Azure EventHub GZIP compressed messages using Azure DataBricks and python (PySpark), but using UDF isn't working with BinaryType data. Well, here is the part where I check what...

Does toPandas() speed up as a pyspark dataframe gets smaller?

I figured I would ask the question. I've found a clever way to reduce the size of a PySpark Dataframe and convert it to Pandas and I was just wondering, does the toPandas function get faster as...

Using tensorflow.keras model in pyspark UDF generates a pickle error

I would like to use a tensorflow.keras model in a PySpark pandas_udf. However, I get a pickle error when the model is being serialized before being sent to the workers. I am not sure I am using the...

Functions not recognised inside Pandas UDF function

I am using Pandas UDFs on PySpark. I have a main file __main__.py with: from pyspark.sql import SparkSession from run_udf import compute def main(): spark = SparkSession.builder.getOrCreate() ...

How to read JSON strings from CSV properly with Pyspark?

I'm working with The Movies Dataset from https://www.kaggle.com/rounakbanik/the-movies-dataset#movies_metadata.csv. The credits.csv file has three columns, cast, crew, and id. The cast and crew...

PySpark writing invalid date successfully but throwing exception when reading it

I'm using PySpark to process some data and writing the processed files to S3 in Parquet format. Code for the batch process is running on EC2 in a Docker container (Linux). This data contains some...

Pyspark EMR Notebook - Unable to save file to EMR environment

This seems really basic but I can't seem to figure it out. I am working in a Pyspark Notebook on EMR and have taken a pyspark dataframe and converted it to a pandas dataframe using...

Replace null with empty string when writing Spark dataframe

Is there a way to replace null values in a column with empty string when writing spark dataframe to file? Sample data: +----------------+------------------+ | UNIQUE_MEM_ID| ...

How to open spark web ui while running pyspark code in pycharm?

I am running a pyspark program locally in PyCharm on a Windows 10 machine. I want to open the Spark web UI to monitor the job and understand the metrics shown in the Spark web UI. While running the same code on...

How to avoid PySpark from_json to return an entire null row on csv reading when some json typed columns have some null attributes

I'm actually facing an issue I hope I can explain. I'm trying to parse a CSV file with PySpark. This CSV file has some JSON columns. Those JSON columns have the same schema, but are not filled the...

Spark SQL - Hive "Cannot overwrite table" workaround

I'm working on a Spark cluster using PySpark and Hive. I've seen a lot of questions here on SO regarding "Cannot overwrite table that is also being read from" Hive error. I understood this comes...

'OneHotEncoder' object has no attribute 'transform'

I am using Spark v3.0.0. My dataframe is: indexer.show() +------+--------+-----+ |row_id| city|index| +------+--------+-----+ | 0|New York| 0.0| | 1| Moscow| 3.0| | 2| Beijing| ...

How to calculate age from birth date in pyspark?

I am calculating age from birth date in pyspark : def run(first): out = spark.sql(""" SELECT p.birth_date, FROM table1 p LEFT JOIN table2 a USING(id) ...

Improve PySpark implementation for finding connected components in a graph

I am currently working on the implementation of this paper describing a MapReduce algorithm to find connected components: https://www.cse.unr.edu/~hkardes/pdfs/ccf.pdf As a beginner in Big Data...

PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable

I am getting this error after running the function below. PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast...

Get s3 folder partitions with AWS Glue create_dynamic_frame.from_options

I'm creating a dynamic frame with create_dynamic_frame.from_options that pulls data directly from s3. However, I have multiple partitions on my raw data that don't show up in the schema this way,...

Databricks Connect java.lang.ClassNotFoundException

I updated our databricks cluster to DBR 9.1 LTS on Azure Databricks, but a package I use regularly is giving me an error when I try to run it in VS Code with Databricks-connect, where it didn't...

How to run spark 3.2.0 on google dataproc?

Currently, Google Dataproc does not offer Spark 3.2.0 as an image. The latest available is 3.1.2. I want to use the pandas API on Spark functionality that Spark released with 3.2.0. I am doing...