Spark dataframe to Arrow

I have been using Apache Arrow with Spark for a while in Python and have been easily able to convert between dataframes and Arrow objects by using Pandas as an intermediary. Recently, however,...
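
A minimal sketch of that pandas round trip, assuming a running SparkSession and a DataFrame small enough to collect onto the driver (names here are made up):

    import pyarrow as pa
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

    sdf = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

    # Collect to pandas on the driver, then hand off to Arrow.
    pdf = sdf.toPandas()
    table = pa.Table.from_pandas(pdf)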

Read multiple parquet files in a folder and write to single csv file using python

I am new to Python and I have a scenario where there are multiple parquet files with file names in order, e.g. par_file1, par_file2, par_file3 and so on, up to 100 files in a folder. I need to read...
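
One common approach, sketched with hypothetical folder and output names: glob the files, concatenate with pandas, and write a single CSV:

    import glob
    import pandas as pd

    # Note: lexicographic sort puts par_file10 before par_file2;
    # zero-pad the numbers or sort numerically if order matters.
    files = sorted(glob.glob("data/par_file*.parquet"))

    df = pd.concat((pd.read_parquet(f) for f in files), ignore_index=True)
    df.to_csv("combined.csv", index=False)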

pyarrow error: toPandas attempted Arrow optimization

When I set pyarrow to true while using a Spark session, and then run toPandas(), it throws the error: "toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to...
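
A sketch of the usual mitigation, assuming an existing SparkSession named spark and a DataFrame df: use the Spark 3.x config names and enable the fallback so toPandas() degrades to the non-Arrow path instead of raising:

    # Spark 3.x names; Spark 2.x used spark.sql.execution.arrow.enabled.
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
    # Fall back to the non-Arrow path instead of raising on unsupported types.
    spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", "true")

    pdf = df.toPandas()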

Pandas cannot read parquet files created in PySpark

I am writing a parquet file from a Spark DataFrame the following way: df.write.parquet("path/myfile.parquet", mode="overwrite", compression="gzip"). This creates a folder with multiple files in...
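
One thing worth trying (a sketch, assuming the path from the question): pandas with the pyarrow engine can read the whole part-file directory that Spark produced, despite the .parquet suffix on the folder name:

    import pandas as pd

    # Reads all part files in the directory as one DataFrame.
    df = pd.read_parquet("path/myfile.parquet", engine="pyarrow")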

Using in-memory filesystem in `pyarrow` tests

I have some pyarrow Parquet dataset writing code. I want to have an integration test that ensures the file is written correctly. I'd like to do that by writing a small example data chunk to an...
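
One way to avoid touching the real filesystem in a test is to round-trip the Parquet bytes through an in-memory buffer; a minimal sketch:

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"x": [1, 2, 3]})

    # Write Parquet into an in-memory buffer instead of a real file.
    sink = pa.BufferOutputStream()
    pq.write_table(table, sink)
    buf = sink.getvalue()

    # Read it back and assert the round trip preserved the data.
    result = pq.read_table(pa.BufferReader(buf))
    assert result.equals(table)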

How to write on HDFS using pyarrow

I'm using Python with the pyarrow library and I'd like to write a pandas dataframe to HDFS. Here is the code I have: import pandas as pd import pyarrow as pa fs = pa.hdfs.connect(namenode, port,...
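
A minimal sketch using the legacy pa.hdfs API that the question already uses (namenode, port, user, and path are placeholders; newer pyarrow versions expose pyarrow.fs.HadoopFileSystem instead):

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame({"x": [1, 2, 3]})
    table = pa.Table.from_pandas(df)

    fs = pa.hdfs.connect("namenode", 8020, user="hdfsuser")
    with fs.open("/data/myfile.parquet", "wb") as f:
        pq.write_table(table, f)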

Pyarrow Dataset read specific columns and specific rows

Is there a way to use a pyarrow parquet dataset to read specific columns and, if possible, filter the data, instead of reading a whole file into a dataframe?
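
Yes; a sketch with pyarrow.dataset, using hypothetical paths and column names. Only the selected columns are read, and the filter can prune row groups before any rows are materialized:

    import pyarrow.dataset as ds

    dataset = ds.dataset("data/", format="parquet")

    table = dataset.to_table(
        columns=["id", "value"],
        filter=ds.field("value") > 100,
    )
    df = table.to_pandas()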

Is there Spark Arrow Streaming = Arrow Streaming + Spark Structured Streaming?

Currently we have Spark Structured Streaming. In the Arrow docs I found Arrow streaming, where we can create a stream in Python, produce the data, and use StreamReader to consume the stream in...

Efficiently reading only some columns from parquet file on blob storage using dask

How can I efficiently read only some of the columns of a parquet file that is hosted in cloud blob storage (e.g. S3 / Azure Blob Storage)? The columnar structure is one of the parquet file...
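
A sketch with dask.dataframe, using placeholder bucket and credential values; because Parquet is columnar, only the byte ranges for the selected columns are fetched from object storage:

    import dask.dataframe as dd

    df = dd.read_parquet(
        "s3://my-bucket/dataset/",
        columns=["user_id", "amount"],
        storage_options={"key": "...", "secret": "..."},
    )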

Problem running a Pandas UDF on a large dataset

I'm currently working on a project and I am having a hard time understanding how the Pandas UDF in PySpark works. I have a Spark cluster with one master node with 8 cores and 64GB, along with...
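
For reference, a minimal scalar Pandas UDF (Spark 3 style, assuming a DataFrame df with a numeric column named value). Spark ships column batches to Python via Arrow, the function sees them as pandas Series, and results stream back:

    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("double")
    def times_two(v: pd.Series) -> pd.Series:
        return v * 2.0

    result = df.withColumn("doubled", times_two("value"))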

"pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data" when sending data to BigQuery without schema

I'm working on a script where I'm sending a dataframe to BigQuery: load_job = bq_client.load_table_from_dataframe( df, '.'.join([PROJECT, DATASET, PROGRAMS_TABLE]) ) # Wait for the load job to...
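
One commonly suggested fix is to truncate the nanosecond precision yourself before calling load_table_from_dataframe, so the implicit cast has nothing to lose; a sketch, with created_at as a hypothetical timestamp column:

    # Drop sub-millisecond precision up front instead of letting the
    # implicit timestamp[ns] -> timestamp[ms] cast fail.
    df["created_at"] = df["created_at"].dt.floor("ms")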

PySpark pandas_udfs java.lang.IllegalArgumentException error

Does anyone have experience using pandas UDFs on a local pyspark session running on Windows? I've used them on Linux with good results, but I've been unsuccessful on my Windows...
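
One commonly cited cause is pyarrow >= 0.15 changing the Arrow IPC format, which breaks pandas UDFs on Spark 2.x regardless of OS; a sketch of the documented workaround (the alternative is pinning pyarrow < 0.15):

    import os
    from pyspark.sql import SparkSession

    # Re-enable the pre-0.15 Arrow stream format for the Python workers.
    os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"

    spark = (
        SparkSession.builder
        .master("local[*]")
        .config("spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")
        .getOrCreate()
    )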

Error installing streamlit: it says "ERROR: Could not build wheels for pyarrow which use PEP 517 and cannot be installed directly"

When I try pip install streamlit, it fails with the error message: ERROR: "Could not build wheels for pyarrow which use PEP 517 and cannot be installed directly". I tried running pip install...

Botocore error: HTTP Client raised an unhandled exception: sys.meta_path must be a list of import hooks

I am running this small snippet to upload a pandas dataframe to S3 using parquet, but I get the error: Exception botocore.exceptions.HTTPClientError: HTTPClientError(u'An HTTP Client raised an...

how to load modin dataframe from pyarrow or pandas

Since Modin does not support loading from multiple pyarrow files on S3, I am using pyarrow to load the data: import s3fs import modin.pandas as pd from pyarrow import parquet s3...

Snowflake pandas Connector Kills Kernel

I'm having trouble with the pandas connector for Snowflake. The last line of this code causes the immediate death of the Python kernel. Any suggestions on how to diagnose such a situation? import...

import pyarrow not working: error is "ValueError: The pyarrow library is not installed, please install pyarrow to use the to_arrow() function."

I have tried installing it in the terminal and in JupyterLab, and it says that it has been successfully installed, but when I run df = query_job.to_dataframe() I keep getting the error...

TWS API - store company snapshot and financial statements

My goal is to use a list of tickers together with the TWS API to extract parts of the company snapshot (reqFundamentalData() -> "ReportSnapshot") and the financial statements (reqFundamentalData()...

Shell script exporting Hadoop library classpath does not work

I am trying to develop a Python client which interacts with Hadoop file system 3.3 using the pyarrow package. My OS is CentOS 8 and my IDE is Eclipse PyDev. The sample code is simple: from pyarrow import...
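
One way to sidestep the shell-script export is to build the Hadoop classpath from inside Python before connecting, since an IDE session never sees exports made in a separate shell; a sketch with a placeholder namenode host:

    import os
    import subprocess
    from pyarrow import fs

    # Populate CLASSPATH in-process instead of via a shell script.
    classpath = subprocess.check_output(["hadoop", "classpath", "--glob"])
    os.environ["CLASSPATH"] = classpath.decode("utf-8")

    hdfs = fs.HadoopFileSystem("namenode", port=8020)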

Reading parquet file from ADLS gen2 using service principal

I am using azure-storage-file-datalake package to connect with ADLS gen2 from azure.identity import ClientSecretCredential # service principal credential tenant_id = 'xxxxxxx' client_id =...
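
An alternative sketch via the adlfs fsspec backend (pip install adlfs), which lets pandas read directly from ADLS gen2; all service-principal values and paths below are placeholders:

    import pandas as pd

    df = pd.read_parquet(
        "abfs://mycontainer/path/data.parquet",
        storage_options={
            "account_name": "mystorageaccount",
            "tenant_id": "xxxxxxx",
            "client_id": "xxxxxxx",
            "client_secret": "xxxxxxx",
        },
    )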

Python error using pyarrow - ArrowNotImplementedError: Support for codec 'snappy' not built

Using Python, Parquet, and Spark and running into ArrowNotImplementedError: Support for codec 'snappy' not built after upgrading to pyarrow=3.0.0. My previous version without this error was...

Parquet `write_table` introduces keys of data type to the data when writing to output file

I have an issue when writing data to parquet files. I tried different pyarrow versions (both 2.0 and 3.0) but the results look the same. Examples of what my data looks like: test_data = { ...

Cannot deploy dataflow template because of requirements file

I'm deploying a dataflow template in python from my local virtual environment, which threw a bunch of unintelligible issues that end like this: Discarding...

Pyarrow Error when Querying from Big Query

When running the code below, I am receiving a pyarrow error. I have installed pyarrow and I am still getting the same error. I am able to access the table and see the schemas, etc. but...

Loading data into a CatBoost Pool object

I'm training a CatBoost model and using a Pool object as follows: pool = Pool(data=x_train, label=y_train, cat_features=cat_cols) eval_set = Pool(data=x_validation, label=y_validation['Label'],...
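
For reference, a minimal Pool construction; one common pitfall is that CatBoost requires categorical columns to be int or string with no NaNs (the data and column names here are made up):

    import pandas as pd
    from catboost import Pool

    x_train = pd.DataFrame({"city": ["NY", "LA", "NY"], "amount": [1.0, 2.0, 3.0]})
    y_train = [0, 1, 0]
    cat_cols = ["city"]

    # Cast categorical columns explicitly before building the Pool.
    x_train[cat_cols] = x_train[cat_cols].astype(str)
    pool = Pool(data=x_train, label=y_train, cat_features=cat_cols)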

Write Pandas Dataframe parquet metadata with partition columns

I am able to write a parquet file with partition_cols, but not the respective metadata. It seems there's a schema mismatch between the table and the metadata due to the columns in my partition. I need some help...
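
A sketch of the usual fix: collect file metadata while writing, then drop the partition columns from the schema before writing _metadata, since they live in the directory names rather than in the files (paths and column names are hypothetical):

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"part": ["a", "a", "b"], "value": [1, 2, 3]})

    collector = []
    pq.write_to_dataset(
        table, "dataset_root",
        partition_cols=["part"],
        metadata_collector=collector,
    )

    # Remove the partition column from the schema to match the data files.
    schema = table.schema.remove(table.schema.get_field_index("part"))
    pq.write_metadata(schema, "dataset_root/_metadata", metadata_collector=collector)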

PySpark Environment Setup for Pandas UDF

-EDIT- This simple example just shows 3 records, but I need to do this for billions of records, so I need to use a Pandas UDF rather than just converting the Spark DF to a Pandas DF and using a...
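
A sketch of the grouped-map flavor with applyInPandas, assuming a DataFrame df with columns id and value; each group is handed to Python as one pandas DataFrame, so the full dataset never has to fit on the driver:

    import pandas as pd

    def center(pdf: pd.DataFrame) -> pd.DataFrame:
        # Runs once per group, on the executors.
        pdf["value"] = pdf["value"] - pdf["value"].mean()
        return pdf

    result = df.groupBy("id").applyInPandas(center, schema=df.schema)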

How to store custom Parquet Dataset metadata with pyarrow?

How do I store custom metadata to a ParquetDataset using pyarrow? For example, if I create a Parquet dataset using Dask import dask dask.datasets.timeseries().to_parquet('temp.parq') I can then...
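
For a single table, custom metadata can be merged into the schema metadata before writing; a minimal sketch (per-file metadata for a multi-file dataset works the same way, file by file). Keys and values are bytes:

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"x": [1, 2, 3]})

    # Merge a custom key into the existing schema metadata.
    meta = dict(table.schema.metadata or {})
    meta[b"my_app_version"] = b"1.2.3"
    table = table.replace_schema_metadata(meta)

    pq.write_table(table, "with_meta.parquet")
    print(pq.read_schema("with_meta.parquet").metadata[b"my_app_version"])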

Package streamlit app and run executable on windows

This is my first question on Stack Overflow. I hope my question is clear; otherwise, let me know and don't hesitate to ask me for more details. I'm trying to package a streamlit app for a personal...

Fastest way to write numpy array in arrow format

I'm looking for fast ways to store and retrieve numpy array using pyarrow. I'm pretty satisfied with retrieval. It takes less than 1 second to extract columns from my .arrow file that contains...
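
For comparison, a minimal write/read round trip through the Arrow IPC file format, with reads served zero-copy via memory mapping (array size and names are made up):

    import numpy as np
    import pyarrow as pa

    arr = np.random.rand(1_000_000)
    table = pa.table({"col0": arr})  # zero-copy for numeric numpy arrays

    # Write the Arrow IPC file.
    with pa.OSFile("data.arrow", "wb") as sink:
        with pa.ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)

    # Memory-map it back; columns are read without copying.
    with pa.memory_map("data.arrow", "r") as source:
        loaded = pa.ipc.open_file(source).read_all()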