Using Spark to access HDFS failed

I am using Cloudera 4.2.0 and Spark. I just want to try out some of the examples that ship with Spark. // HdfsTest.scala package spark.examples import spark._ object HdfsTest { def main(args:...
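
For context, the HdfsTest example in that era of Spark boils down to pointing a SparkContext at an hdfs:// URI and running an action over it. Below is a minimal sketch of the same idea using the modern org.apache.spark API; the namenode host, port, and file path are placeholders, not values from the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object HdfsReadSketch {
  def main(args: Array[String]): Unit = {
    // The master is expected to come from spark-submit (e.g. --master yarn).
    val conf = new SparkConf().setAppName("HdfsReadSketch")
    val sc = new SparkContext(conf)

    // Placeholder URI; it must match the fs.defaultFS of the CDH cluster.
    val file = sc.textFile("hdfs://namenode-host:8020/user/cloudera/some-file.txt")

    // Force an action so the read from HDFS actually happens.
    println(s"line count: ${file.count()}")

    sc.stop()
  }
}
```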

Cloudera Hadoop MapReduce job GC overhead limit exceeded error

I am running a canopy clustering job (using Mahout) on Cloudera CDH4. The content to be clustered has about 1M records (each record is less than 1K in size). The whole Hadoop environment (including...

How to get hadoop put to create directories if they don't exist

I have been using Cloudera's Hadoop (0.20.2). With this version, if I put a file into the file system but the directory structure did not exist, it automatically created the parent...
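
One way to get the old behaviour regardless of CLI version is to create the parent directories explicitly through the Hadoop FileSystem API before copying; mkdirs behaves like `mkdir -p`. Here is a minimal Scala sketch under that assumption, with made-up paths for illustration. On newer command-line releases the same thing can be done with `hadoop fs -mkdir -p <dir>` before the `put`.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object PutWithParents {
  def main(args: Array[String]): Unit = {
    // Picks up core-site.xml / hdfs-site.xml from the classpath.
    val fs = FileSystem.get(new Configuration())

    // Example paths only; substitute the real source file and target directory.
    val target = new Path("/user/cloudera/deep/nested/dir")
    val local  = new Path("file:///tmp/data.csv")

    // mkdirs creates every missing parent directory, like `mkdir -p`.
    fs.mkdirs(target)
    fs.copyFromLocalFile(local, target)
  }
}
```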

HBase Import command

We are currently migrating from CDH3u4 to CDH5. We made a new cluster and copied all the data. Everything went smoothly thanks to Cloudera Manager. But we have a problem with migrating data from HBase...

Loading JSON file with serde in Cloudera

I am trying to work with a JSON file with this bag structure: { "user_id": "kim95", "type": "Book", "title": "Modern Database Systems: The Object Model, Interoperability, and Beyond.", ...

RODBC ERROR: Could not SQLExecDirect in mysql

I have been trying to write an R script to query an Impala database. Here is the query to the database: select columnA, max(columnB) from databaseA.tableA where columnC in (select distinct(columnC)...

Spark on YARN jar upload problems

I am trying to run a simple Map/Reduce Java program using Spark on YARN (Cloudera Hadoop 5.2 on CentOS). I have tried this in two different ways. The first way is the...

Spark PySpark MLlib model - when prediction RDD is generated using map, it throws an exception on collect()

I am using Spark 1.2.0 (cannot upgrade as I don't have control over it). I am using MLlib to build a model: points = labels.zip(tfidf).map(lambda t: LabeledPoint(t[0], t[1])) train_data, test_data...

Spark: multiple spark-submit in parallel

I have a general question about Apache Spark: we have some Spark Streaming scripts that consume Kafka messages. Problem: they are failing randomly without a specific error... Some script does...

Accessing Hue on Cloudera Docker QuickStart

I have installed the Cloudera QuickStart using Docker based on the instructions given...

Setting HBase Region Server Heap Size Through Cloudera Manager

I am using Cloudera version 5.5.2. I have been trying to set the region server max heap size in Cloudera Manager, but irrespective of what value I set it to, I always see it as 1.7 GB on the HBase...

Spark java.io.EOFException: Premature EOF: no length prefix available

I am trying to read a Parquet file, perform some operations on it, and save the result as Parquet on HDFS. I am doing it using Spark. While doing so I am getting the following...

Impala string function to extract text after a given separator

Say I have strings of variable length such as: '633000000HIQWA4:005160000UT334' '00YYSKSG004:00YJDJJDA3443' '300SGDK112WA4:00KFJJD900' Which Impala string function should I use to extract the text after...
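
On reasonably recent Impala releases, split_part() or regexp_extract() are the usual candidates for this. A quick way to prototype the expression locally is the Spark SQL analogue substring_index, sketched below with the sample strings from the question; the equivalent Impala calls are noted in the comments.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.substring_index

object SplitAfterSeparator {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SplitAfterSeparator")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Sample values copied from the question.
    val df = Seq(
      "633000000HIQWA4:005160000UT334",
      "00YYSKSG004:00YJDJJDA3443",
      "300SGDK112WA4:00KFJJD900"
    ).toDF("raw")

    // substring_index(raw, ":", -1) keeps everything after the last ':'.
    // In Impala itself the analogous expressions would be
    // split_part(raw, ':', 2) or regexp_extract(raw, ':(.*)$', 1).
    df.select(substring_index($"raw", ":", -1).alias("after_sep")).show(false)

    spark.stop()
  }
}
```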

Java heap space issue

I am trying to access a Hive Parquet table and load it into a Pandas data frame. I am using PySpark and my code is as below: import pyspark import pandas from pyspark import SparkConf from pyspark...

Cannot create directory in hdfs NameNode is in safe mode

I upgraded to the latest version of Cloudera. Now I am trying to create a directory in HDFS: hadoop fs -mkdir data I am getting the following error: Cannot Create /user/cloudera/data Name Node is in...
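
After an upgrade the NameNode normally leaves safe mode on its own once enough block reports arrive; if it stays stuck, the standard remedy is `hdfs dfsadmin -safemode leave` run as the HDFS superuser. The same action is also exposed programmatically; the sketch below assumes a Hadoop 2.x client with the cluster's core-site.xml/hdfs-site.xml on the classpath, and only queries the current state.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.hdfs.DistributedFileSystem
import org.apache.hadoop.hdfs.protocol.HdfsConstants.SafeModeAction

object SafeModeCheck {
  def main(args: Array[String]): Unit = {
    // Requires fs.defaultFS to point at HDFS so the cast below succeeds.
    val fs = FileSystem.get(new Configuration()).asInstanceOf[DistributedFileSystem]

    // SAFEMODE_GET only reports the current state; SAFEMODE_LEAVE would force
    // the NameNode out of safe mode (the programmatic equivalent of
    // `hdfs dfsadmin -safemode leave`, which needs superuser rights).
    val inSafeMode = fs.setSafeMode(SafeModeAction.SAFEMODE_GET)
    println(s"NameNode in safe mode: $inSafeMode")
  }
}
```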

Cloudera Impala PARQUET_FALLBACK_SCHEMA_RESOLUTION

Is it possible to configure Cloudera Impala (5.12) to default to name instead of position for PARQUET_FALLBACK_SCHEMA_RESOLUTION? My Parquet files don't always have the same set of columns, so we...
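
PARQUET_FALLBACK_SCHEMA_RESOLUTION is a per-session Impala query option (`SET PARQUET_FALLBACK_SCHEMA_RESOLUTION=NAME`); making it the cluster-wide default is typically done through the impalad default query options in Cloudera Manager. A rough Scala/JDBC sketch of the per-session route is below; the driver class, JDBC URL, host, and table name are assumptions based on the Cloudera Impala JDBC 4.1 driver, and newer driver versions may use a different class name.

```scala
import java.sql.DriverManager

object ImpalaSchemaByName {
  def main(args: Array[String]): Unit = {
    // Assumed driver class and URL for the Cloudera Impala JDBC 4.1 driver;
    // adjust host, port, and authentication settings for the actual cluster.
    Class.forName("com.cloudera.impala.jdbc41.Driver")
    val conn = DriverManager.getConnection("jdbc:impala://impala-host:21050/default")
    val stmt = conn.createStatement()

    // Per-session switch from positional to by-name Parquet column resolution.
    stmt.execute("SET PARQUET_FALLBACK_SCHEMA_RESOLUTION=NAME")

    // Hypothetical table name, for illustration only.
    val rs = stmt.executeQuery("SELECT * FROM some_parquet_table LIMIT 10")
    while (rs.next()) println(rs.getString(1))

    conn.close()
  }
}
```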

Impala Query Error - AnalysisException: operands of type INT and STRING are not comparable

I am trying to execute a query in Impala and am getting the following error (AnalysisException: operands of type INT and STRING are not comparable: B.COMMENT_TYPE_CD = '100'). Can someone help me fix...
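
As the error message says, Impala will not compare an INT column with a STRING literal, so one side of the predicate has to be made type-consistent. A small sketch of the two usual rewrites, using the column name from the error message (only the predicate is shown, since the rest of the query is not in the excerpt):

```scala
object CommentTypePredicate {
  // Option 1: treat the code as a number (drop the quotes).
  val compareAsInt = "B.COMMENT_TYPE_CD = 100"
  // Option 2: keep the string literal and cast the column instead.
  val compareAsString = "CAST(B.COMMENT_TYPE_CD AS STRING) = '100'"

  def main(args: Array[String]): Unit = {
    println(compareAsInt)
    println(compareAsString)
  }
}
```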

How to access remote HDFS cluster from my PC

I'm trying to access a remote Cloudera HDFS cluster from my local PC (Windows 7). As cricket_007 suggested in my last question, I did the following things: (1) I created the following Spark session val...
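
For this kind of setup, the Spark session only needs the remote cluster's fs.defaultFS (and the local machine must be able to reach the DataNode ports as well as the NameNode, which is a common stumbling block). A minimal sketch under those assumptions; the host name, port, and file path are placeholders:

```scala
import org.apache.spark.sql.SparkSession

object RemoteHdfsAccess {
  def main(args: Array[String]): Unit = {
    // Placeholder host/port; 8020 is the usual CDH NameNode RPC port.
    val spark = SparkSession.builder()
      .appName("RemoteHdfsAccess")
      .master("local[*]")
      .config("spark.hadoop.fs.defaultFS", "hdfs://remote-namenode:8020")
      .getOrCreate()

    // Either a full hdfs:// URI or a path relative to the remote defaultFS works.
    val df = spark.read.text("hdfs://remote-namenode:8020/user/someuser/sample.txt")
    df.show(5, truncate = false)

    spark.stop()
  }
}
```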

Sqoop export job failed

Can't export HDFS contents to an Oracle DB. Oracle: create table DB1.T1 ( id1 number, id2 number ); Hive: create table DB1.T1 ( id1 int, id2 int ); ...

CDH cluster installation failing in "distributing" stage - failed due to stall on seeded torrent

Hi, we are trying to install a CDH cluster on a Red Hat 7 remote server using the cloudera-installer.bin file, in standalone mode (we have only one host). We are specifying the hostname/IP address of the...

Error parsing conf core-default.xml while running shadow jar of geotool with Spark

I have created a Spark application that processes lat/long and identifies the zone defined in custom shape files provided by the client. Given this requirement, I have created a shadow jar file using...

Initialize Cloudera Hive Docker Container With Data

I am running the Cloudera suite in a Docker container using the image described here: https://hub.docker.com/r/cloudera/quickstart/ I have the following configuration: Dockerfile FROM...

Hive with HBase (both Kerberos) java.net.SocketTimeoutException .. on table 'hbase:meta'

Receiving timeout errors when trying to query HBase from Hive using the HBaseStorageHandler. Caused by: java.net.SocketTimeoutException: callTimeout=60000, callDuration=68199: row...

UnsatisfiedLinkError in Apache Spark when writing Parquet to AWS S3 using Staging S3A Committer

I'm trying to write Parquet data to an AWS S3 directory with Apache Spark. I use my local machine on Windows 10 without having Spark and Hadoop installed; instead, I added them as SBT dependencies...

Where can I find the right version of sparkxgb.zip for the xgboost4j-1.1.2.jar package for PySpark?

I was using xgboost4j-0.90.jar in PySpark alongside its working version of sparkxgb.zip. Everything was working well until I decided to update to xgboost4j-1.1.2.jar. Since I'm using Scala 2.11 and...

Delegation token negative renewal time in Spark Structured Streaming

I have a Spark Structured Streaming (3.0.1) job running on a Cloudera cluster. The job is consuming data from a Kerberized Kafka and putting it into ADLS Gen2. ADLS access is established with...

Hibernate/Spring Boot JPA on Impala/Kudu with Cloudera JDBC driver

I have an API in Spring Boot using Hibernate. Initially, the database to query was Hive; it's now Kudu through Impala. I followed recommendations and set the dialect to...

Can I use the same flow.xml.gz for two different NiFi clusters?

We have a 13-node NiFi cluster with around 50k processors. The size of the flow.xml.gz is around 300 MB. Bringing up the 13-node NiFi cluster usually takes 8-10 hours. Recently we split the...

Installing Cloudera QuickStart VM on M1 macOS

Currently I am learning Hadoop. Previously I used a lab where I could access the Hadoop ecosystem. Recently I got an M1 Mac and I want to run the same through the Cloudera QuickStart VM. I do know that it...

After log4j changes, Hive -e returns an additional warning which has an impact on the scripts - WARN JNDI lookup class is not available because this JRE

In our project we use some technical scripts in Python that use subprocess to extract some data from Hive, run msck repair table, etc. (I know we should switch to Beeline :p). Unfortunately, after...