Not able to store integer data using ParquetStorer

I am facing a very weird issue. I am processing multi-column data using Pig, which uses HCatalogLoader to load the data in the Pig script. The columns contain integer data, string data and also...

Persisting Spark Streaming output

I'm collecting data from a messaging app; I'm currently using Flume, which sends approximately 50 million records per day. I wish to use Kafka, consume from Kafka using Spark Streaming, and persist it to...

Parquet file compression

What would be the most optimized compression logic for Parquet files when used in Spark? Also, what would be the approximate size of a 1 GB Parquet file after compression with each compression type?
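One way to compare is to write the same DataFrame once per codec and inspect the output sizes; the actual ratios depend heavily on the data, so there is no universal answer. A minimal PySpark sketch (the input path and output paths are placeholders):

```python
# Write the same data with several Parquet codecs to compare on-disk size.
# Codec availability depends on the Spark/Hadoop build (e.g. zstd needs
# a recent enough version); "none", "snappy", and "gzip" are widely supported.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-compression").getOrCreate()
df = spark.read.parquet("/data/input.parquet")  # hypothetical input

for codec in ["none", "snappy", "gzip", "zstd"]:
    (df.write
       .mode("overwrite")
       .option("compression", codec)
       .parquet(f"/data/output_{codec}.parquet"))
```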

Using pyarrow, how do you append to a Parquet file?

How do you append to / update a Parquet file with pyarrow? import pandas as pd import pyarrow as pa import pyarrow.parquet as pq table2 = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo',...
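Parquet files cannot be appended to in place, but pyarrow's ParquetWriter can write several tables into one file as separate row groups while it stays open. A minimal sketch along the lines of the question's snippet (the file name is illustrative):

```python
# "Appending" with pyarrow: keep a ParquetWriter open and write multiple
# tables (row groups) into the same file. All writes must share the schema.
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz']})
table = pa.Table.from_pandas(df)

with pq.ParquetWriter('example.parquet', table.schema) as writer:
    writer.write_table(table)   # first batch
    writer.write_table(table)   # "appended" batch, written as a new row group
```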

Required field 'uncompressed_page_size' was not found in serialized data! Parquet

I am getting the below error while trying to save a Parquet file from a local directory using PySpark. I tried Spark 1.6 and 2.2; both give the same error. It displays the schema properly but gives the error at the...

Left join operation runs forever

I have two DataFrames that I want to join using a left join. df1 = +----------+---------------+ |product_PK| rec_product_PK| +----------+---------------+ | 560| 630| | ...
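When one side of a left join is small, a broadcast hint avoids the full shuffle that often makes such joins appear to hang. A hedged PySpark sketch with made-up sample rows (only product_PK and rec_product_PK come from the question):

```python
# Broadcast the smaller side so each executor joins locally instead of
# shuffling both DataFrames across the cluster.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(560, 630), (710, 240)],
                            ["product_PK", "rec_product_PK"])
df2 = spark.createDataFrame([(630, "x")], ["product_PK", "attr"])  # illustrative

joined = df1.join(broadcast(df2), on="product_PK", how="left")
joined.show()
```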

How to update a few records in Spark

I have the following program in Scala for Spark: val dfA = sqlContext.sql("select * from employees where id in ('Emp1', 'Emp2')") val dfB = sqlContext.sql("select * from employees where id...
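Spark DataFrames are immutable, so "updating" a few records usually means deriving a new DataFrame, e.g. with when/otherwise. A PySpark sketch with an illustrative salary column (the question's actual schema is not shown):

```python
# Rebuild the DataFrame, changing values only for the targeted ids;
# every other row passes through unchanged.
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col

spark = SparkSession.builder.getOrCreate()
employees = spark.createDataFrame(
    [("Emp1", 100), ("Emp2", 200), ("Emp3", 300)], ["id", "salary"])

updated = employees.withColumn(
    "salary",
    when(col("id").isin("Emp1", "Emp2"), col("salary") * 1.1)
    .otherwise(col("salary")))
updated.show()
```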

Specifying "basePath" option in Spark Structured Streaming

Is it possible to set the basePath option when reading partitioned data in Spark Structured Streaming (in Java)? I want to load only the data in a specific partition, such as basepath/x=1/, but I...
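A sketch of what this might look like, assuming the streaming file source honors the batch reader's basePath option (worth verifying on your Spark version); the question is about Java, but PySpark is used here for brevity, and the schema and paths are made up:

```python
# Streaming file sources require an explicit schema; "basePath" is meant
# to keep x recognized as a partition column while reading one partition.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()
schema = StructType([StructField("value", StringType()),
                     StructField("x", IntegerType())])

stream = (spark.readStream
          .schema(schema)
          .option("basePath", "basepath/")
          .parquet("basepath/x=1/"))
```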

Comparison of loading from different file formats in BigQuery

We currently load most of our data into BigQuery either via CSV or directly via the streaming API. However, I was wondering if there were any benchmarks available (or maybe a Google engineer could...

Convert Parquet to CSV

How do you convert Parquet to CSV on a local file system (e.g. Python, some library, etc.) but WITHOUT Spark? (I'm trying to find as simple and minimalistic a solution as possible because I need to automate...
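A minimal Spark-free sketch using pandas with the pyarrow engine (file names are placeholders):

```python
# Read the Parquet file into memory and dump it as CSV; no Spark needed.
# pandas delegates to pyarrow (or fastparquet) for the Parquet decoding.
import pandas as pd

df = pd.read_parquet("input.parquet")
df.to_csv("output.csv", index=False)
```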

How to infer parquet schema by hive table schema without inserting any records?

Now, given a Hive table with its schema, namely: hive> show create table nba_player; OK CREATE TABLE `nba_player`( `id` bigint, `player_id` bigint, `player_name` string, `admission_time`...

How many Kafka consumers does a streaming query use for execution?

I was surprised to see that Spark consumes the data from Kafka with only one Kafka consumer, and this consumer runs within the driver container. I rather expected to see that Spark creates as...

Pandas cannot read parquet files created in PySpark

I am writing a parquet file from a Spark DataFrame the following way: df.write.parquet("path/myfile.parquet", mode = "overwrite", compression="gzip") This creates a folder with multiple files in...
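One thing worth trying is pointing pandas at the whole directory Spark wrote rather than an individual part file; the pyarrow engine treats the folder as a single dataset. A minimal sketch reusing the question's path:

```python
# Spark writes "path/myfile.parquet" as a directory of part-*.parquet files
# (plus _SUCCESS markers); pyarrow can read the directory as one dataset.
import pandas as pd

df = pd.read_parquet("path/myfile.parquet", engine="pyarrow")
print(df.head())
```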

Is it possible to read parquet files in chunks?

For example, pandas' read_csv has a chunksize argument which allows read_csv to return an iterator over the CSV file so we can read it in chunks. The Parquet format stores the data in chunks,...
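pyarrow exposes this via ParquetFile.iter_batches (available since pyarrow 3.0), which yields record batches of a requested size. A sketch with a placeholder file name:

```python
# Stream a Parquet file in bounded-memory chunks instead of loading it whole.
import pyarrow.parquet as pq

pf = pq.ParquetFile("data.parquet")
for batch in pf.iter_batches(batch_size=100_000):
    chunk = batch.to_pandas()   # process each chunk independently
    print(len(chunk))
```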

Efficiently reading only some columns from parquet file on blob storage using dask

How can I efficiently read only some of the columns of a Parquet file that is hosted in cloud blob storage (e.g. S3 / Azure Blob Storage)? The columnar structure is one of the Parquet file...
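dask.dataframe.read_parquet accepts a columns argument, and thanks to Parquet's columnar layout only those column chunks are fetched from the store. A sketch with a hypothetical bucket and column names (s3fs, or adlfs for Azure, must be installed):

```python
# Only the requested columns are downloaded; the rest of each file's
# byte ranges are never read.
import dask.dataframe as dd

ddf = dd.read_parquet(
    "s3://my-bucket/data/*.parquet",     # hypothetical location
    columns=["user_id", "amount"],       # only these columns are fetched
    storage_options={"anon": False},     # credentials config, as needed
)
result = ddf.compute()
```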

Filepath on Mac for local Parquet files in Spark program in Java

I have a small Spark program in Java that reads Parquet files from a local directory on a Mac. I have been trying to do this in multiple ways, but nothing seems to work. Dataset<Row>...

How to convert a CSV file to Parquet using C#

I am new to C#. I want to convert a CSV file to Parquet format; I searched some sites but did not find what I expected. Is there any way to do this in C#?

AWS DMS SQL Server to s3 parquet - change-data-type transformation rule and 'Parquet type not supported: INT32 (UINT_8)'

We use AWS DMS to dump SQL Server DBs into S3 as Parquet files. The idea is to run some analytics with Spark over the Parquet files. When a full load is complete, it's not possible to read the Parquet files since...

How to convert parquet file to CSV using .NET Core?

I have a Parquet file and I am trying to convert it to a CSV file. It seems as though most recommend using Spark; however, I need to use C# to accomplish this task. Specifically, I need to use .NET...

PySpark writing invalid date successfully but throwing exception when reading it

I'm using PySpark to process some data and write the processed files to S3 in Parquet format. Code for the batch process runs on EC2 in a Docker container (Linux). This data contains some...

Is it possible to read a CSV or Parquet file using the same code

Does anyone know if it is possible to read either a CSV or Parquet file into Spark using the same code? My use case here is that in production I will be using large Parquet files, but for unit...
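One common pattern is a small loader parameterized by format, so unit tests feed CSV while production feeds Parquet. A hedged sketch (function and path names are illustrative):

```python
# A single entry point whose format is a parameter; only the CSV branch
# needs extra options (header/schema inference).
from pyspark.sql import SparkSession

def load(spark, path, fmt="parquet"):
    reader = spark.read.format(fmt)
    if fmt == "csv":
        reader = reader.option("header", "true").option("inferSchema", "true")
    return reader.load(path)

spark = SparkSession.builder.getOrCreate()
df = load(spark, "data/input.parquet")           # production
# df = load(spark, "tests/fixture.csv", "csv")   # unit tests
```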

How to save spark dataset in encrypted format?

I am saving my Spark Dataset as a Parquet file on my local machine. I would like to know if there are any ways I could encrypt the data using some encryption algorithm. The code I am using to save...

Azure Synapse Serverless. HashBytes: The query references an object that is not supported in distributed processing mode

I am receiving the error "The query references an object that is not supported in distributed processing mode" when using the HASHBYTES() function to hash rows in Synapse Serverless SQL Pool. The...

PySpark DataFrame read is shifting column contents by an inconsistent number

Versions: Python==3.7, Spark==2.4.7, PySpark==2.4.5, Hive==2.3.7. Hoping someone can help me with this. I'm using PySpark to read several large files (around 80 GB each, 6 or so...

pandas read_parquet imports date field incorrectly

I have a Parquet file with a date field in it called BusinessDate. When I import it into a DataFrame, it automatically determines that the field BusinessDate is a date (datetime64[ns, UTC]). However,...

AWS Athena CTAS query failing, suggests emptying empty bucket

I am running a CREATE TABLE AS SELECT (CTAS) query (https://docs.aws.amazon.com/athena/latest/ug/ctas.html); the query is copied at the bottom. I am getting the following error...

SparkSession doesn't work if org.apache.hive:hive-service is put in dependencies

I'm implementing a simple program in Java that uses Spark SQL to read from a Parquet file and builds an ArrayList of FieldSchema objects (from the Hive metastore), where each object represents a column...

Loading data into Catboost Pool object

I'm training a CatBoost model and using a Pool object as follows: pool = Pool(data=x_train, label=y_train, cat_features=cat_cols) eval_set = Pool(data=x_validation, label=y_validation['Label'],...

PARSER - NoSuchFieldError while loading data from HDFS in Spark

I am trying to run the following code; it mostly looks like a dependency issue to me. Dataset<Row> ds = spark.read().parquet("hdfs://localhost:9000/test/arxiv.parquet"); I am getting the...

Azure Data Factory removing spaces from column names of csv file

I'm a bit new to Azure Data Factory, so apologies if I'm missing anything obvious. I've done several searches and I can't find anything that quite fits. The situation is that we have an...