What is Memory Reserved on YARN

I managed to launch a Spark application on YARN. However, memory usage is kind of weird, as you can see below: http://imgur.com/1k6VvSI What does memory reserved mean? How can I manage to...

Remove blank space from data frame column values in Spark

I have a data frame (business_df) of schema:
|-- business_id: string (nullable = true)
|-- categories: array (nullable = true)
|    |-- element: string (containsNull = true)
|-- city: string...

How to exclude multiple columns in Spark dataframe in Python

I found PySpark has a method called drop, but it seems it can only drop one column at a time. Any ideas about how to drop multiple columns at the same time? df.drop(['col1','col2']) TypeError ...

How to list all Cassandra tables

There are many tables in the Cassandra database that contain a column titled user_id. The user_id values refer to users stored in the users table. As some users are deleted, I would like to delete...

pyspark : NameError: name 'spark' is not defined

I am copying the pyspark.ml example from the official document website: http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.Transformer data = [(Vectors.dense([0.0, 0.0]),),...

R dplyr filter rows on numeric values for given column

Working on a Spark platform, using R and RStudio Server, I want to filter my tbl to rows where a given column (stored as string) is numeric. Hence, the column contains both numeric/integer...

sparkR: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':

I am struggling solving this problem when I try to use sparkR. sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "1g")) Error in handleErrors(returnStatus, conn) : ...

SparklyR removing a tbl from Spark Context

Similar to: https://stackoverflow.com/questions/41025094/sparklyr-removing-a-table-from-spark-context, but different because: The above question asks how to remove a "table" from spark, here...

Spark SQL: Loading a file from an Excel sheet (with extension .xlsx) cannot infer the schema of a date-type column properly

I have an xlsx file containing a date/time field (My Time) in the following format and sample records - 5/16/2017 12:19:00 AM 5/16/2017 12:56:00 AM 5/16/2017 1:17:00 PM 5/16/2017 5:26:00 PM 5/16/2017...

Optimization when shuffle write is large and Spark tasks become super slow

There's a Spark SQL query that joins 4 large tables (50 million rows for the first 3 tables and 200 million for the last) and does some group-by operation over 60 days of data. And this SQL will...

Left join operation runs forever

I have two DataFrames that I want to join with a left join. df1 = +----------+---------------+ |product_PK| rec_product_PK| +----------+---------------+ | 560| 630| | ...

How to update a few records in Spark

I have the following Scala program for Spark: val dfA = sqlContext.sql("select * from employees where id in ('Emp1', 'Emp2')" ) val dfB = sqlContext.sql("select * from employees where id...

Specifying "basePath" option in Spark Structured Streaming

Is it possible to set the basePath option when reading partitioned data in Spark Structured Streaming (in Java)? I want to load only the data in a specific partition, such as basepath/x=1/, but I...

The system cannot find the path specified error while running pyspark

I just downloaded spark-2.3.0-bin-hadoop2.7.tgz. After downloading, I followed the steps mentioned here: pyspark installation for Windows 10. I used the command bin\pyspark to run Spark and got...

PySpark logging from the executor in a standalone cluster

This question has answers related to how to do this on a YARN cluster. But what if I am running a standalone spark cluster? How can I log from executors? Logging from the driver is easy using the...

Read SAS file to get meta information

Very new to data science technologies. Currently working on reading a SAS file (.sas7bdat). Able to read the file using: with SAS7BDAT('/dbfs/mnt/myMntScrum1/sasFile.sas7bdat') as f: for row in...

Could not find S3 endpoint or NAT gateway for subnetId

I am unable to connect AWS Glue with RDS VPC S3 endpoint validation failed for SubnetId: subnet-7e8a2. VPC: vpc-4d2d25. Reason: Could not find S3 endpoint or NAT gateway for subnetId:...

How can I mock DynamoDB access via Spark in Scala?

I have a Spark job written in Scala that ultimately writes out to AWS DynamoDB. I want to write some unit tests around it, but the only problem is I don't have a clue how to go about mocking the...

Spark is not loading all multiline json objects in a single file even with multiline option set to true

My JSON file looks like below; it has two multiline JSON objects (in a single file) { "name":"John Doe", "id":"123456" } { "name":"Jane Doe", "id":"456789" } So when I load...

Using tensorflow.keras model in pyspark UDF generates a pickle error

I would like to use a tensorflow.keras model in a pyspark pandas_udf. However, I get a pickle error when the model is serialized before being sent to the workers. I am not sure I am using the...

How to read JSON strings from CSV properly with PySpark?

I'm working with The Movies Dataset from https://www.kaggle.com/rounakbanik/the-movies-dataset#movies_metadata.csv. The credits.csv file has three columns, cast, crew, and id. The cast and crew...

AWS DMS SQL Server to s3 parquet - change-data-type transformation rule and 'Parquet type not supported: INT32 (UINT_8)'

We use AWS DMS to dump SQL Server DBs into S3 as parquet files. Idea is to run some analytics with spark over parquets. When a full load is complete then it's not possible to read parquets since...

How to get year and week number aligned for a date

While trying to get year and week number of a range of dates spanning multiple years, I am getting into some issues with the start/end of the year. I understand the logic for weeknumber and the...
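
The year-boundary surprises come from ISO-8601 week numbering: week 1 is the week containing the year's first Thursday, so the first days of January can belong to the last week of the previous year. Plain Python's `datetime.date.isocalendar()` (shown here as a language-neutral illustration, not the Spark API from the question) makes this concrete:

```python
from datetime import date

# 2016-01-01 was a Friday; ISO rules place it in week 53 of 2015
print(tuple(date(2016, 1, 1).isocalendar()))  # (2015, 53, 5)
```

So to keep year and week number aligned, the year must be taken from the same week-based calendar as the week number, not from the plain calendar year.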

Replace null with empty string when writing Spark dataframe

Is there a way to replace null values in a column with empty string when writing spark dataframe to file? Sample data: +----------------+------------------+ | UNIQUE_MEM_ID| ...

How to open spark web ui while running pyspark code in pycharm?

I am running a pyspark program locally in PyCharm on a Windows 10 machine. I want to open the Spark web UI to monitor the job and understand the metrics shown on the Spark web UI. While running the same code on...

How to avoid PySpark from_json to return an entire null row on csv reading when some json typed columns have some null attributes

I'm actually facing an issue I hope I can explain. I'm trying to parse a CSV file with PySpark. This CSV file has some JSON columns. Those JSON columns have the same schema, but are not filled the...

Spark SQL - Hive "Cannot overwrite table" workaround

I'm working on a Spark cluster using PySpark and Hive. I've seen a lot of questions here on SO regarding "Cannot overwrite table that is also being read from" Hive error. I understood this comes...

SparkSession doesn't work if org.apache.hive:hive-service is put in dependencies

I'm implementing a simple program in Java that uses Spark SQL to read from a Parquet file, and build an ArrayList of FieldSchema objects (in hive metastore) where each object represents a column...

PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable

I am getting this error after running the function below. PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast...

How to run Spark 3.2.0 on Google Dataproc?

Currently, Google Dataproc does not have Spark 3.2.0 as an image. The latest available is 3.1.2. I want to use the pandas-on-Spark functionality that Spark released with 3.2.0. I am doing...