hdfs namenode -format error (no such file or directory)

Trying to get Hadoop 2.3.0 running locally on my Ubuntu machine and attempting to format the HDFS namenode, I am getting the following...

Hadoop client.RMProxy: Connecting to ResourceManager

Hadoop client.RMProxy: Connecting to ResourceManager I set up a single-node cluster on Linux: http://tecadmin.net/setup-hadoop-2-4-single-node-cluster-on-linux/ When I run a MapReduce application like...

Spark Kill Running Application

I have a running Spark application that occupies all the cores, so my other applications won't be allocated any resources. I did some quick research and people suggested using YARN kill or...
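
A minimal Python sketch of the commonly suggested YARN route, assuming the application ID has already been looked up with yarn application -list (the ID below is a placeholder):

    import subprocess

    # Kill a YARN-managed Spark application by shelling out to the YARN CLI.
    # The application ID is a placeholder; find the real one with
    # `yarn application -list`.
    app_id = "application_1234567890123_0001"
    subprocess.run(["yarn", "application", "-kill", app_id], check=True)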

Apache Zeppelin - Disconnected status

I have successfully installed and started Zeppelin on an EC2 cluster with Spark 1.3 and Hadoop 2.4.1 on YARN (as given in https://github.com/apache/incubator-zeppelin). However, I see Zeppelin started...

Spark : multiple spark-submit in parallel

I have a general question about Apache Spark: we have some Spark Streaming scripts that consume Kafka messages. Problem: they are failing randomly without a specific error... Some script does...

ImportError: No module named numpy on spark workers

Launching pyspark in client mode: bin/pyspark --master yarn-client --num-executors 60. Importing numpy in the shell goes fine, but it fails in the KMeans. Somehow the executors do not have numpy...
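
A minimal diagnostic sketch, assuming a pyspark shell where sc is already defined, to check which executor Python processes can actually import numpy:

    # Run a tiny job whose only purpose is to report whether numpy imports
    # inside the executor Python processes.
    def check_numpy(_):
        try:
            import numpy
            return ["numpy " + numpy.__version__]
        except ImportError:
            return ["numpy MISSING"]

    results = (sc.parallelize(range(sc.defaultParallelism), sc.defaultParallelism)
                 .mapPartitions(check_numpy)
                 .collect())
    print(set(results))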

How to run spark-shell with YARN in client mode?

I've installed spark-1.6.1-bin-hadoop2.6.tgz on a 15-node Hadoop cluster. All nodes run Java 1.8.0_72 and the latest version of Hadoop. The Hadoop cluster itself is functional, e.g. YARN can run...

Error : java.net.NoRouteToHostException no route to host

I run select * from customers in Hive and I get the result. Now when I run select count(*) from customers, the job status is failed. In JobHistory I found 4 failed maps, and in the map log file I have...

Spark yarn cluster vs client - how to choose which one to use?

The spark docs have the following paragraph that describes the difference between yarn client and yarn cluster: There are two deploy modes that can be used to launch Spark applications on YARN....

How to set Spark application exit status?

I'm writing a Spark application and running it with the spark-submit shell script (using yarn-cluster/yarn-client). As I see it now, the exit code of spark-submit is decided according to the related YARN...
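
A minimal pyspark sketch of one common approach: let the driver end with an uncaught exception or an explicit non-zero exit so YARN records the application as FAILED; note that in yarn-cluster mode spark-submit surfaces only that failed/succeeded status, not the exact number. The run_job function is just a stand-in:

    import sys
    from pyspark.sql import SparkSession

    def run_job(spark):
        # Placeholder for the real job logic.
        return spark.range(10).count() == 10

    spark = SparkSession.builder.appName("exit-status-demo").getOrCreate()
    try:
        ok = run_job(spark)
    except Exception:
        ok = False
    finally:
        spark.stop()

    # An uncaught exception or a non-zero driver exit makes YARN mark the
    # application FAILED; spark-submit's own exit code then reflects that
    # final status.
    sys.exit(0 if ok else 1)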

Cloudera Manager: Failed to start service YARN (MR2 Included), Failed to start Resource Manager

Error starting ResourceManager java.lang.NullPointerException at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:188) at...

Scala via Spark with yarn - curly brackets string missing

I wrote some Scala code and it looks like this: object myScalaApp { def main(args: Array[String]) : Unit = { val strJson = args.apply(0) println( "strJson : " + strJson) and...

Java heap space issue

I am trying to access a Hive Parquet table and load it into a Pandas data frame. I am using pyspark and my code is as below: import pyspark import pandas from pyspark import SparkConf from pyspark...
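
A minimal sketch of the usual mitigation, assuming a hypothetical Hive table name: give the driver more heap and shrink the result before toPandas(), since toPandas() materialises the whole table in driver memory:

    from pyspark.sql import SparkSession

    # spark.driver.memory must be set before the driver JVM starts; when the
    # job is launched with spark-submit in client mode, pass --driver-memory
    # on the command line instead.
    spark = (SparkSession.builder
             .appName("parquet-to-pandas")
             .config("spark.driver.memory", "8g")
             .enableHiveSupport()
             .getOrCreate())

    # db.parquet_table is a placeholder; limit keeps the driver-side copy small.
    pdf = spark.table("db.parquet_table").limit(100000).toPandas()
    print(pdf.shape)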

Conflicting modules. LoggerFactory is not a Logback LoggerContext but Logback is on the classpath

I am getting the error found in the title when I try to run my project. I have read other threads on this error, and found a solution that got rid of the error, but killed all my...

Using winutils and HBaseMiniCluster to unit test code

I am trying to spawn an HBaseMiniCluster by following this tutorial: http://blog.cloudera.com/blog/2013/09/how-to-test-hbase-applications-using-popular-tools/ The only difference is that I am doing...

Yarn - How does yarn.scheduler.capacity.root.queue-name.maximum-capacity work?

I have 4 queues under the root queue with the following configuration: | Queue Name | Capacity (in %) | Max Capacity...

Hadoop: Cannot set priority of resourcemanager process

I am very new to Hadoop and am trying to set up a pseudo-distributed mode execution with Hadoop 3.1.2. When I try to start the YARN service I get the following error; please see the code snippet below. $...

Airflow get operator attribute from context callback

How can I retrieve the yarn_application_id from the SparkSubmitHook? I tried using a custom operator and the task_instance property, but I guess I missed something... def...
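
A minimal sketch of one workaround, assuming the apache-spark provider's SparkSubmitOperator; it reads private fields (_hook, _yarn_application_id), so it may break between Airflow versions:

    from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

    class SparkSubmitWithAppId(SparkSubmitOperator):
        """Pushes the YARN application id to XCom once the submit finishes."""

        def execute(self, context):
            super().execute(context)
            # _hook and _yarn_application_id are private attributes of the
            # provider's hook, populated while the spark-submit log is parsed.
            app_id = getattr(self._hook, "_yarn_application_id", None)
            context["ti"].xcom_push(key="yarn_application_id", value=app_id)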

Exception in thread "main" org.apache.hadoop.ipc.RemoteException(java.io.IOException) for hadoop 3.1.3

I am trying to run a MapReduce job on Hadoop 3.1.3 but I am getting an error: hadoop jar WordCount.jar WordcountDemo.WordCount /mapwork/Mapwork /r_out Error 2020-04-04 19:59:11,379 INFO...

Flume Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

[email protected]:/usr/lib/apache-flume-1.6.0-bin/bin$ ./flume-ng agent --conf ./conf/ -f /usr/lib/apache-flume-1.6.0properties -Dflume.root.logger=DEBUG,console -n agent Info: Including Hadoop...

How to run Spark 3.0.0 on HDP (Hortonworks)?

Is there a way to run Spark 3.0 on HDP3 (Hortonworks)? I'm aware that there is always a standalone option, but I would like to configure YARN as the scheduler.

Hadoop 3.3 and oozie 5.2.0

I am using Hadoop 3.3 and Oozie 5.2.0. I am getting the error below: Exception in thread "main" java.lang.NullPointerException at...

HDP + Ambari + YARN node labels and HDFS

We have a Hadoop cluster (HDP 2.6.4 with Ambari and 5 datanode machines) and we are using a Spark Streaming application (Spark 2.1 running over Hortonworks 2.6.x). The current situation is that...

Start a Flink 1.11.2 - 1.14.0 session on Cloudera Yarn with `yarn.scheduler.minimum-allocation-mb=0`

I want to start a Flink version greater than 1.11.2 on YARN. The environment variables HADOOP_CLASSPATH and HADOOP_CONF_DIR have been set, but when I launch ./bin/yarn-session.sh I get this error...

spark execution - a single way to access file contents in both the driver and executors

According to this question - https://stackoverflow.com/questions/47187533/files-option-in-pyspark-not-working/ the sc.addFiles option should work for accessing files in both the driver and...
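
A minimal sketch of that pattern (pyspark's method is sc.addFile, singular), with a placeholder path: addFile ships the file to the cluster, and SparkFiles.get resolves the local copy both on the driver and inside executor code:

    from pyspark import SparkContext, SparkFiles

    sc = SparkContext(appName="addfile-demo")
    sc.addFile("hdfs:///tmp/lookup.txt")      # placeholder path

    def first_line(_):
        # SparkFiles.get also works inside executor tasks.
        with open(SparkFiles.get("lookup.txt")) as f:
            return [f.readline().strip()]

    # Driver side:
    with open(SparkFiles.get("lookup.txt")) as f:
        print("driver   :", f.readline().strip())
    # Executor side:
    print("executors:", sc.parallelize([0], 1).mapPartitions(first_line).collect())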

need help on submitting hudi delta streamer job via apache livy

I am a little confused about how to pass the arguments as REST API JSON. Consider the spark-submit command below: spark-submit --packages...
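
A minimal sketch of how such a submit can be phrased as a POST to Livy's /batches endpoint; the jar path, class name, config, and DeltaStreamer options below are placeholders to adapt, with program arguments carried as a JSON list in "args":

    import requests

    payload = {
        "file": "hdfs:///jars/hudi-utilities-bundle.jar",   # placeholder jar location
        "className": "org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer",
        "conf": {
            "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        },
        "args": [
            "--table-type", "COPY_ON_WRITE",
            "--target-base-path", "hdfs:///data/hudi/my_table",
            "--target-table", "my_table",
        ],
    }
    resp = requests.post("http://livy-host:8998/batches", json=payload)
    print(resp.status_code, resp.json())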

cannot import graphframes dependency in maven project

I have a maven project and i need import graphframe dependency to use spark grapx,this's my pom.xml <?xml version="1.0" encoding="UTF-8"?> <project xmlns="http://maven.apache.org/POM/4.0.0" ...

Spark AttributeError: Can't get attribute 'new_block' on <module 'pandas.core.internals.blocks'

I was using pyspark on AWS EMR (4 r5.xlarge as 4 workers, each with one executor and 4 cores), and I got AttributeError: Can't get attribute 'new_block' on <module 'pandas.core.internals.blocks'....
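
A minimal diagnostic sketch, assuming a pyspark shell where sc is defined, to confirm the usual cause: the driver and the executors unpickling DataFrames with different pandas releases:

    import pandas as pd

    def executor_pandas_version(_):
        import pandas
        return [pandas.__version__]

    driver_version = pd.__version__
    executor_versions = set(
        sc.parallelize(range(8), 8).mapPartitions(executor_pandas_version).collect()
    )
    print("driver pandas  :", driver_version)
    print("executor pandas:", executor_versions)
    # If these differ (e.g. >=1.3 on one side, <1.3 on the other), aligning the
    # pandas versions on all nodes is the usual fix.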

Dataproc Cluster Spark job submission fails in GPU clusters after restarting master VM

I followed the tutorial on https://cloud.google.com/dataproc/docs/concepts/compute/gpus and created a single-node n1-standard-16 Dataproc cluster (base image is: 1.5.35-debian10) and attached...

How to include external python modules with pyspark

I'm new to Python and trying to launch my pyspark project on Spark on AWS EMR. The project is stored on AWS S3 and has several Python files, like this: /folder1 - main.py /utils - utils1.py -...
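
A minimal sketch of one way to do it, assuming the utils package (with an __init__.py) has been zipped and uploaded first; the S3 path is a placeholder, and the same effect can be had with spark-submit --py-files:

    from pyspark import SparkContext

    sc = SparkContext(appName="py-files-demo")
    # Ships utils.zip to every executor and also adds it to the Python path
    # on the driver.
    sc.addPyFile("s3://my-bucket/project/utils.zip")   # placeholder location

    from utils import utils1    # importable once the zip is on the path
    print(utils1)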