HDF5 in Java: What are the differences between the available APIs?

I've just discovered the HDF5 format and I'm considering using it to store 3D data spread over a cluster of Java application servers. I have found out that there are several implementations...

How does HDFS write to a disk on the data node

I'm not an expert in how file systems work, but this question can help me clear up some vague concepts. How does HDFS write to the physical disk? I understand HDFS runs on ext3 file system disks...

How to get hadoop put to create directories if they don't exist

I have been using Cloudera's hadoop (0.20.2). With this version, if I put a file into the file system and the directory structure did not exist, it automatically created the parent...
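
A minimal sketch of the usual workaround, assuming a Hadoop version whose shell supports mkdir -p (paths are illustrative):

    import subprocess

    def put_with_parents(local_path, hdfs_dir):
        # Create the parent directories first; -p is a no-op if they exist.
        subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)
        # Then copy the local file into place.
        subprocess.run(["hdfs", "dfs", "-put", local_path, hdfs_dir], check=True)

    put_with_parents("data.csv", "/user/me/new/dir")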

Loading JSON file with serde in Cloudera

I am trying to work with a JSON file with this bag structure: { "user_id": "kim95", "type": "Book", "title": "Modern Database Systems: The Object Model, Interoperability, and Beyond.", ...
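
One common way to query such a file, sketched here through Spark's Hive support (the table name and path are hypothetical; the serde class ships in the hive-hcatalog-core jar, which must be on the classpath):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Each top-level JSON key maps to a column; one JSON object per line.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS books (
            user_id STRING,
            type    STRING,
            title   STRING
        )
        ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
        STORED AS TEXTFILE
    """)
    spark.sql("LOAD DATA INPATH '/user/cloudera/books.json' INTO TABLE books")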

How are HDFS files getting stored on underlying OS filesystem?

HDFS is a logical filesystem in Hadoop with a Block size of 64MB. A file on HDFS is saved on the underlying OS filesystem, say ext4 with 4KiB as the block size. To my knowledge, for a file on the...

How to do automated functional testing of AWS components?

In my project we have implemented a custom auto-scaling module. This module takes advantage of the AWS CloudWatch API and uses custom logic to scale the cluster up and down. All this code is in...
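
One way to functionally test that kind of logic without touching live infrastructure is to stub the CloudWatch client; a minimal sketch using botocore's Stubber (the decision function and threshold are hypothetical):

    from datetime import datetime, timedelta

    import boto3
    from botocore.stub import Stubber

    def should_scale_up(cloudwatch, threshold=70.0):
        # Hypothetical decision logic: scale up when average CPU is high.
        end = datetime(2020, 1, 1, 0, 5)
        resp = cloudwatch.get_metric_statistics(
            Namespace="AWS/EC2", MetricName="CPUUtilization",
            StartTime=end - timedelta(minutes=5), EndTime=end,
            Period=300, Statistics=["Average"])
        points = resp["Datapoints"]
        return bool(points) and points[0]["Average"] > threshold

    client = boto3.client("cloudwatch", region_name="us-east-1")
    stubber = Stubber(client)
    stubber.add_response("get_metric_statistics",
                         {"Label": "CPUUtilization",
                          "Datapoints": [{"Average": 85.0}]})
    with stubber:
        assert should_scale_up(client)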

The interaction between the Hadoop HDFS block size and the Linux file system block size

I understand the Hadoop block size is 64MB and the Linux FS block size is 4KB. My understanding from reading is that HDFS works on top of the Linux FS itself. How does the Hadoop file system actually work with Linux...
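
The key point is that an HDFS block is just an ordinary file on the DataNode's local file system, so it is in turn stored across many small OS-level blocks; a quick back-of-the-envelope check:

    hdfs_block = 64 * 1024 * 1024   # one 64 MB HDFS block
    os_block   = 4 * 1024           # one 4 KiB ext3/ext4 block
    print(hdfs_block // os_block)   # 16384 local blocks per full HDFS block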

Hadoop: Protocol message was too large. May be malicious. Use CodedInputStream.setSizeLimit() to increase the size limit

I am seeing this in the logs of the datanodes. It probably occurs because I am copying 5 million files into HDFS: java.lang.IllegalStateException:...
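
A commonly suggested remedy, assuming the oversized message is legitimate rather than malicious, is to raise Hadoop's IPC message limit (default 64 MB) in the NameNode's configuration, for example:

    <property>
      <name>ipc.maximum.data.length</name>
      <!-- default is 67108864 (64 MB); raise with care -->
      <value>134217728</value>
    </property>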

Get a list of file names from HDFS using python

Hadoop noob here. I've searched for some tutorials on getting started with hadoop and python without much success. I do not need to do any work with mappers and reducers yet, but it's more of an...
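
A minimal sketch that just shells out to the HDFS CLI, so no extra Python packages are needed (the path is illustrative; WebHDFS clients such as the hdfs package are the richer alternative):

    import subprocess

    def hdfs_listdir(path):
        # `hdfs dfs -ls` prints one entry per line; the path is the last field.
        out = subprocess.run(["hdfs", "dfs", "-ls", path],
                             capture_output=True, text=True, check=True).stdout
        return [line.rsplit(None, 1)[-1]
                for line in out.splitlines()
                if line.startswith(("-", "d"))]

    print(hdfs_listdir("/user/cloudera"))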

Hadoop: ...be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation

I'm getting the following error when attempting to write to HDFS as part of my multi-threaded application: could only be replicated to 0 nodes instead of minReplication (=1). There are 1...

How to read and write to the same file in Spark using Parquet?

I am trying to read from a Parquet file in Spark, union it with another RDD, and then write the result back to the same file I read from (basically overwriting it). This throws the following error: ...
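
Spark reads Parquet lazily, so the job tries to overwrite files it is still reading from. The usual workaround is to materialize the result somewhere else first; a sketch in PySpark (paths are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    events = spark.read.parquet("/data/events")
    combined = events.union(spark.read.parquet("/data/new_events"))

    # Write to a temporary location first; only after this succeeds is it
    # safe to replace the original directory (e.g. via a FileSystem rename).
    combined.write.mode("overwrite").parquet("/data/events_tmp")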

Create table from CSV with values containing commas enclosed in quotes

I'm trying to create a table in Impala from a CSV that I've uploaded into an HDFS directory. The CSV contains values with commas enclosed inside quotes. Example: 1.66.96.0/19,"NTT...
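
One widely used fix is to declare the table with the OpenCSVSerde, which understands quoted fields; a sketch issued through Spark's Hive support (table name, columns, and location are hypothetical, and on older setups the table may need to be created from Hive so Impala can read it):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # quoteChar tells the serde that commas inside quotes are not separators.
    spark.sql("""
        CREATE EXTERNAL TABLE ip_ranges (cidr STRING, owner STRING)
        ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
        WITH SERDEPROPERTIES ('separatorChar' = ',', 'quoteChar' = '"')
        STORED AS TEXTFILE
        LOCATION '/user/cloudera/ip_ranges'
    """)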

Failed INSERT into Impala table: WRITE access to at least one HDFS path

I'm trying to insert into an Impala table..... ERROR: AnalysisException: Unable to INSERT into target table (log_wf) because Impala does not have WRITE access to at least one HDFS path:...
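
The usual fix is to make the table's HDFS location writable by the impala user (path and group are illustrative), then let Impala pick up the change:

    import subprocess

    table_dir = "/user/hive/warehouse/log_wf"
    subprocess.run(["hdfs", "dfs", "-chown", "-R",
                    "impala:supergroup", table_dir], check=True)
    subprocess.run(["hdfs", "dfs", "-chmod", "-R", "775", table_dir], check=True)
    # Afterwards run REFRESH log_wf (or INVALIDATE METADATA) in Impala.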

ES v5.0.1 throws java.lang.SecurityException during snapshot

Elasticsearch version:v5.0.1 Plugins installed: [repository-hdfs] JVM version: java version "1.8.0_92" Java(TM) SE Runtime Environment (build 1.8.0_92-b14) Java HotSpot(TM) 64-Bit Server VM (build...

Azure Data Factory pipelines are failing when no files available in the source

Currently we do our data loads from an on-premise Hadoop server to SQL DW [via ADF staged copy and an on-premise DMG server]. We noticed that ADF pipelines fail when there are no files...

Cannot create directory in HDFS: NameNode is in safe mode

I upgraded to the latest version of Cloudera. Now I am trying to create a directory in HDFS with hadoop fs -mkdir data, and I am getting the following error: Cannot Create /user/cloudera/data Name Node is in...
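
If the file system is actually healthy and the NameNode is simply stuck, safe mode can be inspected and left explicitly (run as an HDFS superuser):

    import subprocess

    # Show the current safe-mode status and why it might be held.
    subprocess.run(["hdfs", "dfsadmin", "-safemode", "get"], check=True)
    # Leave safe mode so writes (mkdir, put, ...) are allowed again.
    subprocess.run(["hdfs", "dfsadmin", "-safemode", "leave"], check=True)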

Schema comparison of two dataframes in Scala

I am trying to write some test cases to validate the data between the source (.csv) file and the target (Hive table). One of the validations is the structure validation of the table. I have loaded the .csv...
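
In Spark a StructType compares by value, so the structural check can be a plain equality; sketched here in PySpark, though df.schema compares the same way in Scala (paths and table names are hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    source = (spark.read.option("header", True)
                   .option("inferSchema", True).csv("/data/source.csv"))
    target = spark.table("db.target_table")

    # Compare (name, type) pairs; sort them if column order should not matter.
    src = sorted((f.name.lower(), str(f.dataType)) for f in source.schema.fields)
    tgt = sorted((f.name.lower(), str(f.dataType)) for f in target.schema.fields)
    assert src == tgt, "schema mismatch between source file and target table"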

Sqoop export job failed

Can't export HDFS contents to oracle DB. Oracle: create table DB1.T1 ( id1 number, id2 number ); Hive: create table DB1.T1 ( id1 int, id2 int ); ...

HDFS blocks vs. HDD storage blocks

I have been working with Hadoop HDFS for quite some time and I am aware of how HDFS blocks work (64 MB, 128 MB), but I am still not clear about the blocks in other file systems, for example...

How to infer parquet schema by hive table schema without inserting any records?

Now, given a Hive table with its schema, namely: hive> show create table nba_player; OK CREATE TABLE `nba_player`( `id` bigint, `player_id` bigint, `player_name` string, `admission_time`...
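
One trick is to have Spark write a zero-row DataFrame with the table's schema; the resulting file still carries the full schema in its Parquet footer (the output path is illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # limit(0) keeps the schema but drops every row, so nothing is inserted
    # into the table and the written Parquet file contains only metadata.
    (spark.table("nba_player").limit(0)
          .write.mode("overwrite").parquet("/tmp/nba_player_schema_only"))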

How many Kafka consumers does a streaming query use for execution?

I was surprised to see that Spark consumes the data from Kafka with only one Kafka consumer, and this consumer runs within the driver container. I rather expected to see that Spark creates as...

Jupyter Notebook - AccessControlException: Permission denied: user=livy

I am running an EMR cluster with Spark/Livy, and would like to test Spark Structured Streaming. I am using the Jupyter Notebook managed service (which connects via Livy); however, when I try this code in...

Elasticsearch snapshot restore to another cluster

How do I restore an Elasticsearch snapshot to another cluster, without repository-s3, repository-hdfs, repository-azure, or repository-gcs?
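
Without the cloud plugins, the built-in fs repository type still works as long as both clusters can reach the same directory (a shared mount, or the snapshot folder copied across); a sketch of the REST calls with Python requests (host and path are illustrative, and path.repo must whitelist the location):

    import requests

    # On the target cluster, register a read-only fs repository that points
    # at the copied snapshot directory.
    requests.put("http://target-es:9200/_snapshot/restored_repo", json={
        "type": "fs",
        "settings": {"location": "/mnt/es_snapshots", "readonly": True},
    }).raise_for_status()

    # Restore a snapshot from that repository.
    requests.post("http://target-es:9200/_snapshot/restored_repo/"
                  "snapshot_1/_restore").raise_for_status()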

Missing optional dependency 'tables' in pandas to_hdf

The following code is giving me an error: import pandas as pd df = pd.DataFrame({'a' : [1,2,3]}) df.to_hdf('temp.h5', key='df', mode='w') The error is: Missing optional dependency...
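
to_hdf relies on PyTables, which pandas treats as an optional dependency; installing it is the whole fix, after which the snippet runs as written:

    # pip install tables   (PyTables supplies the HDF5 backend pandas needs)
    import pandas as pd

    df = pd.DataFrame({'a': [1, 2, 3]})
    df.to_hdf('temp.h5', key='df', mode='w')
    print(pd.read_hdf('temp.h5', key='df'))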

Spark throws an error when reading a Hive table

I am trying to do select * from db.abc in Hive. This Hive table was loaded using Spark, and the query does not work; it shows an error: Error: java.io.IOException: java.lang.IllegalArgumentException: bucketId out...

Error reading delta file from spark structured streaming

We use Spark Structured Streaming with Spark 2.2. At some point the stream crashes, and when it restarts it tries reading from the checkpoint and fails: java.lang.IllegalStateException: Error reading...

Spark: How to write files to s3/hdfs from each executor

I have a use case where I am running some modeling code on each executor and want to store the result in s3/hdfs immediately, rather than waiting for all the executors to finish their tasks.
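
One pattern is to perform the upload inside foreachPartition, so each executor persists its own results as soon as they are ready instead of shipping them back to the driver; a sketch with boto3 (bucket, key prefix, and the per-row model are hypothetical):

    import json
    import uuid

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    def run_model(row):
        # Stand-in for the real per-row modeling code.
        return {"id": row.id, "score": row.id * 0.5}

    def persist_partition(rows):
        # Create the client here: this function executes on the executors.
        import boto3
        s3 = boto3.client("s3")
        results = [run_model(r) for r in rows]
        s3.put_object(Bucket="my-results-bucket",
                      Key="results/%s.json" % uuid.uuid4(),
                      Body=json.dumps(results))

    spark.range(1000).foreachPartition(persist_partition)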

Hive : How to flatten an array?

I have this table CREATE TABLE `dum`( `val` map<string,array<string>>) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' STORED AS INPUTFORMAT ...
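
The standard tool is LATERAL VIEW explode; with a map of arrays it takes two explodes, one for the map entries and one for the inner array. A sketch run through Spark's Hive support against the dum table from the question:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # The first explode turns each map entry into (k, arr);
    # the second flattens each arr into individual items.
    spark.sql("""
        SELECT k, item
        FROM dum
        LATERAL VIEW explode(val) m AS k, arr
        LATERAL VIEW explode(arr) a AS item
    """).show()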

PARSER - NoSuchField error while loading data from HDFS in Spark

I am trying to run the following code. It looks mostly like a dependency issue to me. Dataset<Row> ds = spark.read().parquet("hdfs://localhost:9000/test/arxiv.parquet"); I am getting the...

Spark illegal character in path

I am trying to start up Spark on my machine. But when I try to launch using spark-shell I get an error that there is an illegal character in the path. Caused by: java.net.URISyntaxException:...