MongoDB Stored Procedure Equivalent

I have a large CSV file containing a list of stores, in which one of the fields is ZipCode. I have a separate MongoDB database called ZipCodes, which stores the latitude and longitude for any given...
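
A hedged sketch of one common answer: MongoDB has no stored procedures, so the join is done either in application code or, if the CSV is first imported into a collection (e.g. with mongoimport), with a $lookup stage (MongoDB 3.2+). The collection and field names below are assumptions:

```js
// Assumed collections: "stores" (imported from the CSV) and "zipcodes"
// holding documents like { zip: "90210", lat: 34.1, lon: -118.4 }.
db.stores.aggregate([
  { $lookup: {
      from: "zipcodes",       // collection to join against
      localField: "ZipCode",  // field in stores
      foreignField: "zip",    // field in zipcodes
      as: "geo"               // matched documents land here as an array
  }},
  { $unwind: "$geo" }         // flatten the one-element array
])
```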

Mahout on Elastic MapReduce: Java Heap Space

I'm running Mahout 0.6 from the command line on an Amazon Elastic MapReduce cluster, trying to canopy-cluster ~1500 short documents, and the jobs keep failing with an "Error: Java heap space"...
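
The clustering runs as MapReduce tasks, so it is the task JVMs' heap that matters, not the client's. A hedged sketch using the Hadoop-1-era property from those EMR AMIs; the -Xmx value, bucket paths, and thresholds are illustrative placeholders:

```sh
# Raise the per-task JVM heap; Mahout jobs pass -D options through ToolRunner.
mahout canopy \
  -Dmapred.child.java.opts=-Xmx2048m \
  -i s3://my-bucket/vectors -o s3://my-bucket/canopy-out \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -t1 0.5 -t2 0.3
```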

Amazon Elastic MapReduce - mass insert from S3 to DynamoDB is incredibly slow

I need to perform an initial upload of roughly 130 million items (5+ GB total) into a single DynamoDB table. After I ran into problems uploading them through the API from my application, I decided...
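
The bottleneck is usually the table's provisioned write capacity rather than EMR itself. A hedged sketch of the EMR Hive route, where an external table backed by the DynamoDB storage handler is filled from S3 and dynamodb.throughput.write.percent controls how much of the provisioned capacity the job may consume; the table name, columns, and mapping are placeholders:

```sql
CREATE EXTERNAL TABLE ddb_items (id string, payload string)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name"     = "MyTable",
  "dynamodb.column.mapping" = "id:id,payload:payload"
);

SET dynamodb.throughput.write.percent = 1.0;  -- consume all provisioned WCUs
INSERT OVERWRITE TABLE ddb_items
SELECT id, payload FROM s3_staging_table;
```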

What is Hive: Return Code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask

I am getting: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask while trying to make a copy of a partitioned table using the commands in the Hive...
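
Return code 2 is Hive's generic "the underlying MapReduce job failed" signal; the real cause is in the task logs, not in the CLI output. A hedged sketch, where the application ID is a placeholder for the one Hive prints when it launches the stage:

```sh
yarn logs -applicationId application_1400000000000_0001 | less
```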

How to calculate Centered Moving Average of a set of data in Hadoop Map-Reduce?

I want to calculate the centered moving average of a set of data. Example input format: quarter | sales Q1'11 | 9 Q2'11 | 8 Q3'11 | 9 Q4'11 | 12 Q1'12 | 9 Q2'12 |...
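
One workable approach for a series this small: a mapper (not shown) emits every row under a single constant key as "quarterIndex,sales", so one reducer sees the whole series, sorts it, and slides a centered window over it. A hedged Java sketch; the odd window size is an assumption (classical quarterly smoothing would use a 2x4-MA instead):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CenteredMAReducer
    extends Reducer<Text, Text, Text, DoubleWritable> {

  private static final int WINDOW = 3;  // assumed odd window size

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context ctx)
      throws IOException, InterruptedException {
    // Collect (quarterIndex, sales) pairs and sort them by time.
    List<double[]> rows = new ArrayList<>();
    for (Text v : values) {
      String[] p = v.toString().split(",");
      rows.add(new double[] {Double.parseDouble(p[0]),
                             Double.parseDouble(p[1])});
    }
    rows.sort(Comparator.comparingDouble((double[] r) -> r[0]));

    // Centered average: the window extends half on each side, so the
    // first and last half-window quarters produce no output.
    int half = WINDOW / 2;
    for (int i = half; i < rows.size() - half; i++) {
      double sum = 0;
      for (int j = i - half; j <= i + half; j++) sum += rows.get(j)[1];
      ctx.write(new Text("q" + (int) rows.get(i)[0]),
                new DoubleWritable(sum / WINDOW));
    }
  }
}
```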

Spark on EC2 cannot utilize all cores available

I am running Spark on an EC2 cluster set up via the spark-ec2.sh script. The 5 slave instances I launched have 40 cores in total, but each instance cannot utilize all of its cores. From the slave log,...
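
In standalone mode an application only receives spark.cores.max cores (falling back to the cluster-wide spark.deploy.defaultCores), so a low cap can leave cores idle even though the workers advertise them. A hedged Scala sketch:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Ask the standalone master for all 40 cores explicitly; also check that
// SPARK_WORKER_CORES on the slaves advertises every core per instance.
val conf = new SparkConf()
  .setAppName("use-all-cores")
  .set("spark.cores.max", "40")
val sc = new SparkContext(conf)
```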

Spark Configuration: SPARK_MEM vs. SPARK_WORKER_MEMORY

In spark-env.sh, it's possible to configure the following environment variables: # - SPARK_WORKER_MEMORY, to set how much memory to use (e.g. 1000m, 2g) export SPARK_WORKER_MEMORY=22g [...] # -...
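
A hedged reading of the two knobs in that era of Spark: SPARK_WORKER_MEMORY is the total pool the worker daemon may hand out to executors on that node, while SPARK_MEM sizes the individual executor/driver JVMs (it was later deprecated in favour of spark.executor.memory). The values below are illustrative:

```sh
# spark-env.sh
export SPARK_WORKER_MEMORY=22g   # pool a worker can allocate across executors
export SPARK_MEM=8g              # heap of each executor/driver JVM
```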

Spark Standalone Mode: Workers not stopping properly

When stopping a whole cluster in Spark (0.7.0) with $SPARK_HOME/bin/stop-all.sh, not all workers are stopped correctly. More specifically, if I then want to restart the cluster...
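
The stop scripts locate daemons through PID files, which live under /tmp by default; if those files get cleaned up (or the directory differs between start and stop), stop-all.sh silently misses workers. A hedged sketch of the usual fix, with an example path:

```sh
# spark-env.sh on every node: keep PID files somewhere persistent so
# bin/stop-all.sh can still find the daemons it started.
export SPARK_PID_DIR=/var/run/spark
```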

How to see Hadoop's heap use?

I am doing a school assignment analyzing heap usage in Hadoop. It involves running two versions of a MapReduce program that computes the median length of forum comments: the first one is...
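
Two low-effort ways to observe this: the counter "Total committed heap usage (bytes)" that every job prints with its counters, and live sampling of a task JVM with jstat. A hedged sketch of the latter; the PID is whatever jps reports for the task on a worker node:

```sh
jps | grep -i child            # find the task JVM ("Child" / "YarnChild")
jstat -gcutil <task_pid> 1000  # heap/GC utilisation, sampled every second
```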

Cloudera Hadoop MapReduce job: GC overhead limit exceeded error

I am running a canopy clustering job (using Mahout) on Cloudera CDH4. The content to be clustered is about 1M records (each record is less than 1 KB in size). The whole Hadoop environment (including...
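
"GC overhead limit exceeded" means the JVM is spending nearly all its time collecting while recovering almost nothing, i.e. the task heap is too small for the canopy vectors being held. A hedged sketch using the MRv1-style property that CDH4 honours; the heap size is illustrative, and dropping the limit check is a band-aid rather than a fix:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CanopyDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Bigger task heap; -XX:-UseGCOverheadLimit only silences the error.
    conf.set("mapred.child.java.opts", "-Xmx2048m -XX:-UseGCOverheadLimit");
    Job job = Job.getInstance(conf, "canopy");
    // ... set mapper/reducer/paths as before and submit ...
  }
}
```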

Aggregate MongoDB results by ObjectId date

How can I aggregate my MongoDB results by ObjectId date? Example: Default cursor results: cursor = [ {'_id': ObjectId('5220b974a61ad0000746c0d0'), 'content': 'Foo'}, {'_id':...
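
The first four bytes of an ObjectId are a Unix timestamp, so the date can be extracted in the pipeline itself. A hedged sketch for MongoDB 4.0+ (using $toDate; the collection name is assumed); on older servers the timestamp has to be pulled out client-side via ObjectId.getTimestamp():

```js
db.posts.aggregate([
  { $group: {
      _id: { $dateToString: { format: "%Y-%m-%d",
                              date: { $toDate: "$_id" } } },  // _id -> date
      count: { $sum: 1 }
  }},
  { $sort: { _id: 1 } }
])
```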

Not able to Start/Stop Spark Worker from Remote Machine

I have two machines A and B. I am trying to run Spark Master on machine A and Spark Worker on machine B. I have set machine B's host name in conf/slaves in my Spark directory. When I am executing...
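
start-all.sh and stop-all.sh simply ssh into every host listed in conf/slaves, so the usual missing piece is passwordless SSH from A to B under the account that runs Spark. A hedged sketch ("user" is a placeholder):

```sh
# On machine A (the master):
ssh-keygen -t rsa            # accept defaults, empty passphrase
ssh-copy-id user@machineB    # authorise the key on the worker
ssh user@machineB true       # must succeed with no password prompt
$SPARK_HOME/bin/start-all.sh
```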

Data block size in HDFS: why 64 MB?

The default data block size of HDFS/Hadoop is 64 MB, while the block size on disk is generally 4 KB. What does a 64 MB block size mean? Does it mean that the smallest unit of reading from disk is...
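
A rough worked example of the trade-off: with a ~10 ms disk seek and ~100 MB/s transfer rate, reading a 64 MB block costs about 10 ms of seeking plus ~640 ms of transfer, so seek overhead stays under 2%, whereas 4 KB blocks would pay a seek for every few kilobytes. Note also that HDFS blocks are not padded: a 2 KB file stored in a 64 MB block still occupies only 2 KB on disk, though it does cost one block's worth of metadata in the NameNode's memory.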

Java8: HashMap<X, Y> to HashMap<X, Z> using Stream / Map-Reduce / Collector

I know how to "transform" a simple Java List from Y -> Z, i.e.: List<String> x; List<Integer> y = x.stream() .map(s -> Integer.parseInt(s)) .collect(Collectors.toList()); Now I'd...
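
The map analogue streams entrySet() and collects with Collectors.toMap, transforming the value on the way through. A self-contained sketch with String-to-Integer values, matching the list example:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class MapTransform {
  public static void main(String[] args) {
    Map<String, String> in = new HashMap<>();
    in.put("a", "1");
    in.put("b", "2");

    // Keep each key, convert each value (Y -> Z), rebuild the map.
    Map<String, Integer> out = in.entrySet().stream()
        .collect(Collectors.toMap(
            Map.Entry::getKey,
            e -> Integer.parseInt(e.getValue())));

    System.out.println(out); // {a=1, b=2}
  }
}
```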

Reduce a key-value pair into a key-list pair with Apache Spark

I am writing a Spark application and want to combine a set of Key-Value pairs (K, V1), (K, V2), ..., (K, Vn) into one Key-Multivalue pair (K, [V1, V2, ..., Vn]). I feel like I should be able to do...
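
groupByKey does exactly this; a hedged Scala sketch, with the caveat that if the end goal is a fold per key, reduceByKey or aggregateByKey are cheaper because they combine map-side before shuffling:

```scala
// Assumes an existing SparkContext `sc`.
val pairs = sc.parallelize(Seq(("k", 1), ("k", 2), ("j", 3)))

val grouped = pairs.groupByKey()        // RDD[(String, Iterable[Int])]
                   .mapValues(_.toList) // (k, List(1, 2)), (j, List(3))

// If a per-key aggregate is the real goal, prefer the map-side-combining form:
val sums = pairs.reduceByKey(_ + _)
```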

Why does YARN give a Java heap space memory error?

I want to experiment with memory settings in YARN, so I tried configuring some parameters in yarn-site.xml and mapred-site.xml. By the way, I use Hadoop 2.6.0. But I get an error when I run a mapreduce...
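
The usual pairing in Hadoop 2.x is a container size (mapreduce.*.memory.mb) slightly above the task's -Xmx (mapreduce.*.java.opts), with both under yarn.scheduler.maximum-allocation-mb. A hedged mapred-site.xml sketch; all sizes are illustrative:

```xml
<!-- mapred-site.xml: container sizes, with ~80% of each given to the JVM -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1638m</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx3276m</value>
</property>
```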

Apache Hadoop 2.6 Java Heap Space Error

I'm getting: 15/04/27 09:28:04 INFO mapred.LocalJobRunner: map task executor complete. 15/04/27 09:28:04 WARN mapred.LocalJobRunner: job_local1576000334_0001 java.lang.Exception:...
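
That trace shows LocalJobRunner, i.e. the whole job runs inside the single client JVM, where the per-task mapreduce.*.java.opts settings do not apply. A hedged sketch; the heap size, jar, and class names are placeholders:

```sh
# Grow the client JVM that local mode runs everything in, then rerun.
export HADOOP_CLIENT_OPTS="-Xmx4096m"
hadoop jar myjob.jar com.example.MyJob input output
```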

How to configure java memory heap space for hadoop mapreduce?

I've tried to run a MapReduce job on about 20 GB of data, and I got an error in the reduce shuffle phase saying I was out of Java heap space. Then I read in many sources that I have to...
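
Reduce-side shuffle failures usually call for a bigger reduce container and heap, sometimes plus a smaller shuffle buffer fraction so fetched map output spills to disk earlier. A hedged sketch with Hadoop 2 property names; all values are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ShuffleTuning {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("mapreduce.reduce.memory.mb", "4096");      // container size
    conf.set("mapreduce.reduce.java.opts", "-Xmx3276m"); // reducer JVM heap
    // Fraction of the heap the shuffle may fill before spilling to disk:
    conf.set("mapreduce.reduce.shuffle.input.buffer.percent", "0.5");
    Job job = Job.getInstance(conf, "20gb-job");
    // ... configure mapper/reducer/paths and submit as before ...
  }
}
```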

Hive Error : FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask

I have Twitter data on HDFS, collected using Flume. I have a 3-node cluster and a MySQL metastore for Hive. When I execute the query below: select user_name.screen_name, user_name.followers_count from...
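
The same generic return code 2 as above applies; with this particular Flume/Twitter setup, one frequently reported cause is the JSON SerDe jar being on the CLI's classpath but not shipped to the MapReduce tasks. A hedged sketch; the jar path and the table name are placeholders for whichever SerDe and table the data was loaded with:

```sql
ADD JAR /usr/local/hive/lib/hive-serdes-1.0-SNAPSHOT.jar;
select user_name.screen_name, user_name.followers_count from tweets limit 10;
```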

Why does the vcore count always equal the number of nodes in Spark on YARN?

I have a Hadoop cluster with 5 nodes, each of which has 12 cores and 32 GB of memory. I use YARN as the MapReduce framework, so I have the following settings with...
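
With the CapacityScheduler's DefaultResourceCalculator, containers are allocated by memory alone and each is reported as one vcore no matter what --executor-cores requested. A hedged sketch of the usual fix, switching the calculator so CPU is considered too:

```xml
<!-- capacity-scheduler.xml -->
<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
```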

Gradle archive contains more than 65535 entries

I am integrating Hadoop 2.5.0 (for running a MapReduce job) with the Spring Boot 1.2.7 release, and while building I get the error "archive contains more than 65535 entries". My Gradle jar...
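
The classic ZIP format caps an archive at 65535 entries; Gradle's Zip/Jar tasks can exceed that once ZIP64 extensions are enabled. A hedged build.gradle sketch; the Boot-repackaged archive may hit the same limit and need its own handling:

```groovy
// build.gradle
jar {
    zip64 = true   // allow more than 65535 entries in the jar
}
```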

Hive: difference between INSERT and LOAD DATA

I am new to Hadoop and Hive, and I am confused by Hive's INSERT INTO and LOAD DATA statements. When I execute INSERT INTO TABLE_NAME (field1, field2) VALUES (value1, value2);, HiveServer will...
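
A hedged illustration of the difference: LOAD DATA only moves (or copies) files into the table's directory, running no job and validating no rows, while INSERT executes a query as a MapReduce job and writes new files, which is why a single-row INSERT ... VALUES feels so heavy. The paths and table names below are placeholders:

```sql
-- Pure file move: no job runs; the file must already match the table format.
LOAD DATA INPATH '/user/me/stores.csv' INTO TABLE stores;

-- Runs a full job that evaluates the query and writes new data files:
INSERT INTO TABLE stores_clean
SELECT * FROM stores WHERE zip IS NOT NULL;
```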

Group corresponding keys and values

I have a use case where I have to write map-reduce code to group the values corresponding to the same key: Input: A,B A,C B,A B,D Output: A {B,C} B {A,D} I have written this...
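
A hedged sketch of the standard shape: the mapper splits each "A,B" line and emits (A, B); the framework then groups values by key, so the reducer only has to concatenate what it receives:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class GroupJob {
  // "A,B" -> (A, B)
  public static class PairMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable off, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] p = line.toString().split(",");
      ctx.write(new Text(p[0].trim()), new Text(p[1].trim()));
    }
  }

  // (A, [B, C]) -> "A {B,C}"
  public static class GroupReducer
      extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> vals, Context ctx)
        throws IOException, InterruptedException {
      StringBuilder sb = new StringBuilder("{");
      for (Text v : vals) {
        if (sb.length() > 1) sb.append(',');
        sb.append(v);
      }
      ctx.write(key, new Text(sb.append('}').toString()));
    }
  }
}
```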

Finding most commonly used word in a string field throughout a collection

Let's say I have a Mongo collection similar to the following: [ { "foo": "bar baz boo" }, { "foo": "bar baz" }, { "foo": "boo baz" } ] Is it possible to determine which words appear most...
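
On MongoDB 3.4+ this is possible as a pure aggregation: $split the field into words, $unwind, then count per word. A hedged sketch, with the collection name assumed and whitespace-only splitting as a simplification:

```js
db.coll.aggregate([
  { $project: { words: { $split: ["$foo", " "] } } },  // string -> word array
  { $unwind: "$words" },                               // one doc per word
  { $group: { _id: "$words", count: { $sum: 1 } } },   // count occurrences
  { $sort: { count: -1 } },                            // most frequent first
  { $limit: 10 }
])
```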

Sort field in Hive

I have a table with about 20-25 million records that I have to put into another table, filtered on a condition and sorted. Example: Create table X AS select * from Y where item <> 'ABC' Order By id; I...
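
ORDER BY in Hive forces all rows through one reducer, which is what makes a 20M-row CTAS crawl. If sorted-within-reducer output is acceptable, DISTRIBUTE BY plus SORT BY keeps the job parallel. A hedged sketch:

```sql
-- Each reducer receives a disjoint range of ids and sorts its own share;
-- output files are individually sorted rather than globally ordered.
CREATE TABLE X AS
SELECT * FROM Y
WHERE item <> 'ABC'
DISTRIBUTE BY id
SORT BY id;
```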

How to increase the heap size when using hadoop jar?

I am running a program with the hadoop jar command. However, to make that program run faster, I need to increase Hadoop's heap size. I tried the following, but it didn't have any effect (I have...
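
hadoop jar only launches the client JVM; task heaps are per-job properties and can be passed with -D when the main class goes through ToolRunner/GenericOptionsParser. A hedged sketch; the jar, class, and sizes are placeholders:

```sh
# Task JVM heaps for this one job (cluster mode):
hadoop jar myjob.jar com.example.MyJob \
  -D mapreduce.map.java.opts=-Xmx2048m \
  -D mapreduce.reduce.java.opts=-Xmx4096m \
  input output

# By contrast, this only grows the client-side JVM:
export HADOOP_CLIENT_OPTS="-Xmx2g"
```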

Run a Spark program locally with IntelliJ

I tried to run a simple test in IntelliJ IDEA. Here is my code: import org.apache.spark.sql.functions._ import org.apache.spark.{SparkConf} import org.apache.spark.sql.{DataFrame,...
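
The piece that is usually missing inside an IDE is a master URL: local[*] runs Spark inside IntelliJ's own JVM using all local cores, with no cluster or spark-submit needed (the Spark dependencies must also be on the compile classpath, not marked "provided"). A hedged Scala sketch:

```scala
import org.apache.spark.sql.SparkSession

object LocalTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("intellij-local")
      .master("local[*]")   // run in-process, using all local cores
      .getOrCreate()

    spark.range(10).show()  // quick smoke test
    spark.stop()
  }
}
```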

What is "cold start" in Hive and why doesn't Impala suffer from this?

I'm reading the literature on comparing Hive and Impala. Several sources state some version of the following "cold start" line: It is well known that MapReduce programs take some time before all...

How exactly does Java's reduce function with 3 parameters work?

I am currently learning about Java's Stream.reduce(), and I recently came across something while reading up on the material and going through videos. I understand that there are 3 ways to use...
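
The three-argument form reduce(identity, accumulator, combiner) lets a stream fold into a result type different from the element type: the accumulator folds one element into a partial result, and the combiner merges two partials (which is what makes parallel execution possible). A self-contained sketch:

```java
import java.util.Arrays;
import java.util.List;

public class ReduceDemo {
  public static void main(String[] args) {
    List<String> words = Arrays.asList("map", "reduce", "combine");

    // Result type (Integer) differs from element type (String).
    int totalLength = words.parallelStream().reduce(
        0,                             // identity: the empty partial result
        (sum, w) -> sum + w.length(),  // accumulator: fold one element in
        Integer::sum);                 // combiner: merge two partial sums

    System.out.println(totalLength); // 16
  }
}
```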

Does the KDB+ gateway have to hold all the data?

I am trying to implement a gateway design to access/abstract the API to my database, which is simply a single HDB and RDB on the same server. Reading through the documentation...