Putting detailed REST error message in HTTP Warning header, good/bad idea?

We are developing a standard REST service that uses HTTP status codes as its response codes when something goes wrong (e.g. invalid user input would return "400 Bad Request" to the client). However, we...

How to design a database with Revision History?

I am part of a team building a new Content Management System for our public site. I'm trying to find the easiest and best way to build in a revision control mechanism. The object model is pretty...

What is the difference between Apache Pig and Apache Hive?

What is the exact difference between Pig and Hive? I found that both have the same functional meaning because they are used for doing the same work. The only thing is the implementation, which is different for...

Loading JSON file with serde in Cloudera

I am trying to work with a JSON file with this bag structure : { "user_id": "kim95", "type": "Book", "title": "Modern Database Systems: The Object Model, Interoperability, and Beyond.", ...

Persisting Spark Streaming output

I'm collecting data from a messaging app. I'm currently using Flume, which sends approx. 50 million records per day. I wish to use Kafka, consume from Kafka using Spark Streaming, and persist it to...

How to get datatype of a column in spark SQL?

I want to find out the datatype of each column of a table. For example, let's say my table was created using this: create table X ( col1 string, col2 int, col3 int ) I want to run a command that...

Why does full outer join in Hive give weird results when one of the join fields is missing?

I'm comparing the behavior between SQL engines. Oracle has the behavior I would expect from a SQL engine for full outer joins: Oracle CREATE TABLE sql_test_a ( ID VARCHAR2(4000...

docker-compose for Detached mode

I have the following docker command to run a container: docker run -d --name test -v /etc/hadoop/conf:/etc/hadoop/conf -v...

Read from a hive table and write back to it using spark sql

I am reading a Hive table using Spark SQL and assigning it to a Scala val: val x = sqlContext.sql("select * from some_table") Then I am doing some processing with the dataframe x and finally...

Hive 2.1.1 MetaException(message:Version information not found in metastore. )

I'm running Hadoop 2.7.3, MySQL 5.7.17 and Hive 2.1.1 on Ubuntu 16.04. When I run ./hive, I keep getting the following warning and exception: SLF4J: Class path contains multiple SLF4J...

Optimization when shuffle write is large and Spark tasks become super slow

There's a Spark SQL query that joins 4 large tables (50 million for each of the first 3 tables and 200 million for the last table) and does some group by operations that consume 60 days of data. This SQL will...

Schema comparison of two dataframes in scala

I am trying to write some test cases to validate the data between a source (.csv) file and a target (Hive table). One of the validations is the structure validation of the table. I have loaded the .csv...

How to update a few records in Spark

I have the following program in Scala for Spark: val dfA = sqlContext.sql("select * from employees where id in ('Emp1', 'Emp2')" ) val dfB = sqlContext.sql("select * from employees where id...

Getting Hive configuration from the command line

I wish to retrieve hive configurations from the command line. Is there any utility like hadoop getconf -confKey <Key_name> for Hive? i.e.: I wish to retrieve the warehouse dir from hive.xml, get...

Initialize Cloudera Hive Docker Container With Data

I am running the Cloudera suite in a Docker Container using the image described here: https://hub.docker.com/r/cloudera/quickstart/ I have the following configuration: Dockerfile FROM...

How to infer parquet schema by hive table schema without inserting any records?

Now given a hive table with its schema, namely: hive> show create table nba_player; OK CREATE TABLE `nba_player`( `id` bigint, `player_id` bigint, `player_name` string, `admission_time`...

org.apache.hadoop.hive.ql.io.orc.OrcStruct cannot be cast to org.apache.hadoop.io.BinaryComparable

at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:563) at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:83) ... 17...

Efficiently reading only some columns from parquet file on blob storage using dask

How can I efficiently read only some of the columns of a parquet file that is hosted in a cloud blob storage (e.g. S3 / Azure Blob Storage)? The columnar structure is one of the parquet file...

Spark throws error when reading Hive table

I am trying to do select * from db.abc in Hive. This Hive table was loaded using Spark. It does not work and shows an error: Error: java.io.IOException: java.lang.IllegalArgumentException: bucketId out...

I want to connect by specifying the IP address with M2MQTT

I set up Mosquitto in Docker as a broker. Suppose my PC IP address is "10.0.0.11". I wrote code like this. MqttClient client = new MqttClient ("10.0.0.11", 1883, false, null, null,...

Superset with Apache Spark on Hive

I have Apache SuperSet installed via Docker on my local machine. I have a separate production 20 Node Spark cluster with Hive as the Meta-Store. I want my SuperSet to be able to connect to Hive...

Hive : How to flatten an array?

I have this table CREATE TABLE `dum`( `val` map<string,array<string>>) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' STORED AS INPUTFORMAT ...

Spark SQL - Hive "Cannot overwrite table" workaround

I'm working on a Spark cluster using PySpark and Hive. I've seen a lot of questions here on SO regarding "Cannot overwrite table that is also being read from" Hive error. I understood this comes...

schematool: command not found

I am trying to install Hive on my Ubuntu 19.10 machine. I am using this doc: https://phoenixnap.com/kb/install-hive-on-ubuntu. As mentioned in step 6, where I am trying to initiate Derby Database,...

SparkSession doesn't work if org.apache.hive:hive-service is put in dependencies

I'm implementing a simple program in Java that uses Spark SQL to read from a Parquet file, and build an ArrayList of FieldSchema objects (in hive metastore) where each object represents a column...

pyspark: Insert overwrite into a partitioned table but the whole table is overwritten

I am new to Hive and Spark, and I am trying to overwrite a partitioned table according to its partition column. This is the code: df.createOrReplaceGlobalTempView(tempTable) insertSql = "INSERT OVERWRITE...

HiveError: Cannot read, unknown typeId: 32. Did you forget to register an adapter?

I'm trying to create a Hive database and am facing some issues I can't solve. My model is @HiveType(typeId: 1) class Tasks { @HiveField(0) final String task; @HiveField(1) final bool...

Databricks - is not empty but it's not a Delta table

I run a query on Databricks: DROP TABLE IF EXISTS dublicates_hotels; CREATE TABLE IF NOT EXISTS dublicates_hotels ... I'm trying to understand why I receive the following error: Error in SQL...

How to run Spark SQL Thrift Server in local mode and connect to Delta using JDBC

I'd like to connect to Delta using JDBC and would like to run the Spark Thrift Server (STS) in local mode to kick the tyres. I start STS using the following...

After log4j changes Hive -e returns additional warning which has impact on the scripts - WARN JNDI lookup class is not available because this JRE

In our project we use some technical scripts in Python, using subprocess, to extract some data from Hive, run msck repair table, etc. (I know we should switch to beeline :p). Unfortunately, after...