HBase: quickly count number of rows

Right now I implement row count over ResultScanner like this: for (Result rs = scanner.next(); rs != null; rs = scanner.next()) { number++; } When the data reaches millions of rows, the computation time is large. I...

How to speed up GLM estimation?

I am using RStudio 0.97.320 (R 2.15.3) on Amazon EC2. My data frame has 200k rows and 12 columns. I am trying to fit a logistic regression with approximately 1500 parameters. R is using 7% CPU and...

MySQL partitioning vs changing to MongoDB

We have 4 pretty large tables in a MySQL database. They are about 50, 35, 6 and 5 GB; the other tables aren't so large. These tables are full of analytics data which is appended by cron tasks every 10...

Can I run a Time Series Database (TSDB) over Apache Spark?

I'm starting to learn about big data and Apache Spark, and I have a question. In the future I'll need to collect data from IoT devices, and this data will come to me as time series data. I was reading about...

Machine Learning on PostgreSQL

I am interested in running machine learning algorithms directly inside of PostgreSQL as described here. The basic gist of the paper is that I write my algorithm as a function which gives the nth...

Efficiently write JSON to sqlite database

I'm trying to write big JSON (minimum 500MB) files to a database. I wrote a script which works and is kind of memory friendly, but it is very slow. Any suggestion on how to make it more...
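The script itself is cut off above, but the usual culprits for slow SQLite inserts are per-row execute() calls and per-row commits. A minimal sketch of the batched approach, assuming the JSON is an array of objects (the `items` table and its `id`/`name` fields are made up for illustration; a file too large for json.load would also need a streaming parser such as ijson):

```python
import json
import sqlite3

def load_json_to_sqlite(json_path, db_path, batch_size=10_000):
    """Insert records from a JSON array of objects into SQLite in batches.

    Speed comes from two things: executemany() instead of one execute()
    per row, and committing once per batch instead of once per row (every
    commit is a separate fsync'd transaction in SQLite).
    """
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS items (id INTEGER, name TEXT)")
    with open(json_path, encoding="utf-8") as f:
        records = json.load(f)
    batch = []
    for rec in records:
        batch.append((rec["id"], rec["name"]))
        if len(batch) >= batch_size:
            conn.executemany("INSERT INTO items VALUES (?, ?)", batch)
            conn.commit()
            batch.clear()
    if batch:  # flush the final partial batch
        conn.executemany("INSERT INTO items VALUES (?, ?)", batch)
        conn.commit()
    conn.close()
```

Tuning batch_size trades memory for fewer transactions; tens of thousands of rows per commit is typically where the curve flattens.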

Create a subfolder in an S3 bucket?

I already have a root bucket (bigdata). Now I want to create a new folder (year) inside the bigdata bucket in S3, then create a new folder (month) inside year. aws s3 mb s3://bigdata --> Bucket created aws...

How do I archive and retrieve a large HTML dataset?

I am a fresher and I am about to participate in a contest this weekend. The problem is about archiving and retrieving a large HTML dataset, and I have no idea about it. My friend suggested that I...

Python: Cluster analysis on monthly data with a lot of variables

I hope you guys can help me sort this out as I feel this is above me. It might be silly for some of you, but I am lost and I come to you for advice. I am new to statistics, data analysis and big...

Fastest way to export data of huge Python lists to a text file

I am searching for the most performant way to export the elements of up to ten Python lists [x1, x2, x3, ... xn], [y1, y2, y3, ... yn], [z1, z2, z3, ... zn], ... to a text file with a structure as...
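The target structure is cut off above, so this sketch assumes one line per index with the lists as delimiter-separated columns (x1 y1 z1, then x2 y2 z2, and so on). The main performance levers are building each line with str.join rather than repeated concatenation, and handing a generator to writelines() so everything goes through one buffered write path with no intermediate list:

```python
def export_columns(path, *lists, sep="\t"):
    """Write equally long lists as columns of a text file.

    zip(*lists) walks the lists in lockstep; each row becomes one
    joined line, and writelines() consumes the generator lazily, so
    no full copy of the output is ever held in memory.
    """
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(sep.join(map(str, row)) + "\n" for row in zip(*lists))
```

For purely numeric lists, numpy.savetxt over a stacked array is the other common contender worth benchmarking.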

How to increase the heap size when using hadoop jar?

I am running a program with the hadoop jar command. However, to make that program run faster, I need to increase Hadoop's heap size. I tried the following, but it didn't have any effect (I have...

Pandas: df.groupby() is too slow for a big data set. Any alternative methods?

I have a pandas.DataFrame with 3.8 million rows and one column, and I'm trying to group them by the index. The index is the customer ID. I want to group qty_liter by the index: df =...
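The snippet above is truncated, but when groupby is slow the first thing worth checking is the dtype of the key: an object-dtype index is dramatically slower to group on than int64. As for alternatives, the aggregation itself is a single pass over the data; a pure-Python sketch of the same grouped sum (the qty_liter/customer-ID names are taken from the question) makes that explicit:

```python
from collections import defaultdict

def sum_by_key(keys, values):
    """One-pass grouped sum: O(n), no sorting, one accumulator per
    distinct key. Equivalent in spirit to
    df.groupby(level=0)['qty_liter'].sum(), with `keys` playing the
    customer-ID index and `values` the qty_liter column."""
    totals = defaultdict(float)
    for k, v in zip(keys, values):
        totals[k] += v
    return dict(totals)
```

If pandas with a proper integer index is still too slow, out-of-core tools built on the same single-pass idea (e.g. Dask or Polars) are the usual next step.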

Sharing reactive data sets between user sessions in Shiny

I have a fairly large reactive data set that is derived from polling a file and then reading that file on a predefined interval. The data is updated frequently and requires constant reloading. ...

Hadoop Distcp aborting when copying data from one cluster to another

I am trying to copy the data of a partitioned Hive table from one cluster to another. I am using distcp to copy the data, but the underlying data belongs to a partitioned Hive table. I used the...

Read an extremely big xlsx file in Python

I need to read a 300 GB xlsx file with ~10^9 rows. I need to get the values from one column. The file consists of 8 columns. I want to do it as fast as possible. from openpyxl import...
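With openpyxl the realistic option is load_workbook(path, read_only=True) and iterating ws.iter_rows(values_only=True), which streams rows instead of building the whole sheet in memory. Since an .xlsx is just a zip archive of XML, the same idea can also be sketched with only the standard library. This sketch assumes the sheet lives at the default xl/worksheets/sheet1.xml path and that the target column holds plain numeric values (string cells go through a shared-strings table that this deliberately ignores):

```python
import zipfile
import xml.etree.ElementTree as ET

def iter_column(xlsx_path, col_letter, sheet="xl/worksheets/sheet1.xml"):
    """Stream the values of one column from an .xlsx without loading it.

    iterparse walks the sheet XML one element at a time; rows are
    cleared as soon as they end, so memory stays flat regardless of
    row count. Cell references like "B12" carry the column letter.
    """
    with zipfile.ZipFile(xlsx_path) as zf:
        with zf.open(sheet) as f:
            for _, elem in ET.iterparse(f):
                tag = elem.tag.rsplit("}", 1)[-1]  # strip XML namespace
                if tag == "c":  # a cell
                    letters = "".join(
                        ch for ch in elem.get("r", "") if ch.isalpha())
                    if letters == col_letter:
                        for child in elem:  # find the <v> value element
                            if child.tag.rsplit("}", 1)[-1] == "v":
                                yield child.text
                elif tag == "row":
                    elem.clear()  # drop the finished row from memory
```

This is a sketch of the format, not a replacement for a real reader; openpyxl's read_only mode handles shared strings, dates, and multiple sheets for you.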

Elasticsearch partial bulk update

I have 6k records to update in Elasticsearch, and I have to use PHP. I searched the documentation and found Bulk Indexing, but this does not keep the previous data. I have...

What is "cold start" in Hive and why doesn't Impala suffer from this?

I'm reading the literature on comparing Hive and Impala. Several sources state some version of the following "cold start" line: It is well known that MapReduce programs take some time before all...

How do I read only part of a column from a Parquet file using Parquet.net?

I am using Parquet.Net to read parquet files, but the only option I have to read from the parquet file is: //get the first group Parquet.ParquetRowGroupReader rowGroup =...

Improve PySpark implementation for finding connected components in a graph

I am currently working on an implementation of this paper describing a MapReduce algorithm to find connected components: https://www.cse.unr.edu/~hkardes/pdfs/ccf.pdf As a beginner in Big Data...
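Since the PySpark code is not shown, here is the core of the CCF idea in plain Python, useful as a reference implementation to test a distributed version against: every node repeatedly adopts the smallest ID seen in its neighborhood until nothing changes, which is what the paper's iterated map/reduce rounds compute:

```python
def connected_components(edges):
    """Min-label propagation to a fixpoint.

    Each component ends up labeled with its smallest node ID. One
    while-loop iteration here corresponds to one CCF map/reduce round;
    the number of rounds is bounded by the graph diameter.
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    label = {n: n for n in neighbors}  # start: every node labels itself
    changed = True
    while changed:
        changed = False
        for n, nbrs in neighbors.items():
            best = min([label[n]] + [label[m] for m in nbrs])
            if best < label[n]:
                label[n] = best
                changed = True
    return label
```

In the PySpark version, the per-node min becomes a reduceByKey, and counting how many labels changed per round (the paper's "new pair" counter) gives the termination test.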

Apache Atlas: HTTP ERROR 503 Service Unavailable

I have also seen the following two similar links, but they were different from mine that I will describe it in this post: Apache Atlas: Http 503 Service Unavailable Error when connecting from...

Standard scaling is taking too much time on a PySpark dataframe

I've tried standard scaler from spark.ml with the following function: def standard_scale_2(df, columns_to_scale): """ Args: df : spark dataframe columns_to_scale : list of columns...

Problem connecting Apache Superset running inside a Docker container to Kylin

I have Apache Superset running inside a Docker container, and I want to connect it to a running Apache Kylin (not inside Docker). I am receiving the following error whenever I test the connection...

PySpark Environment Setup for Pandas UDF

-EDIT- This simple example just shows 3 records but I need to do this for billions of records so I need to use a Pandas UDF rather than just converting the Spark DF to a Pandas DF and using a...

Neo4j: Unsupported administration command: CREATE DATABASE demo

I want to create a new database 'demo' in Neo4j, but I get this error: I searched but couldn't find a solution. Can you help me? Thanks, all!

Clustering using DBSCAN in BigQuery

I have a Bigquery table with only one column named 'point'. It contains location coordinates that I want to cluster using the ST_CLUSTERDBSCAN function in BigQuery. I use the following...

SparkException: Can't zip RDDs with unequal numbers of partitions: List(2, 1)

Possible steps to reproduce: Run spark.sql multiple times, get DataFrame list [d1, d2, d3, d4] Combine DataFrame list [d1, d2, d3, d4] to a DataFrame d5 by calling...

In the ompr package in R, how can I rephrase my objective/constraints/variables so as to avoid the "problem too large" error?

I am trying to learn to fit a linear integer programming optimization model in R using the ompr package that a colleague had previously fit using CPLEX/GAMS (specifically, the one described here:...

spark-shell throws java.lang.reflect.InvocationTargetException on running

When I execute run-example SparkPi, for example, it works perfectly, but when I run spark-shell, it throws these exceptions: WARNING: An illegal reflective access operation has occurred WARNING:...

Efficient way to get the average of past x events within d days per each row in SQL (big data)

I want to find the best and most efficient way to calculate the average score of the past 2 events within 7 days, and I need it for each row. I already have a query that works on 60M rows,...
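The query itself is not shown, but the requirement mixes a row-count limit ("past 2 events") with a time limit ("within 7 days"), which a single SQL window frame cannot always express directly. To pin the definition down, here is the per-row logic in plain Python; the column names and the strictly-before convention are assumptions, and the quadratic scan would become a sliding deque in a version meant for 60M rows:

```python
from datetime import timedelta

def past_avg(events, n=2, days=7):
    """For each (event_date, score) row, average the scores of up to n
    previous events falling within `days` days strictly before it.

    `events` must be sorted by date ascending. Rows with no qualifying
    prior event get None, matching a SQL NULL.
    """
    out = []
    for i, (d, _) in enumerate(events):
        window = [s for (dd, s) in events[:i]
                  if d - timedelta(days=days) <= dd < d]
        window = window[-n:]  # keep only the most recent n of those
        out.append(sum(window) / len(window) if window else None)
    return out
```

In SQL the equivalent is typically a ROWS BETWEEN n PRECEDING frame combined with a CASE/filter on the date difference, or a lateral/self-join when the engine's frames can't combine both limits.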

Spark SQL: find the number of extensions for a record

I have a dataset as below:

| col1 | extension_col1 |
| ---- | -------------- |
| 2345 | 2246           |
| 2246 | 2134           |
| 2134 | 2091           |
| 2091 | Null           |

...
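Assuming "number of extensions" means how many links can be followed from each record down its chain (so 2345 has 3 and 2091 has 0 in the sample data), the recurrence is count(r) = 1 + count(extension(r)), with 0 at a Null link. In Spark SQL this becomes an iterative self-join (or a graph tool); a plain-Python sketch of the recurrence, memoized so each link is walked only once:

```python
def extension_counts(mapping):
    """mapping: record -> its extension, or None at the end of a chain.

    Returns the chain length below every record. Memoization in
    `counts` makes the total work linear in the number of records;
    recursion depth is bounded by the longest chain.
    """
    counts = {}

    def depth(rec):
        nxt = mapping.get(rec)
        if nxt is None:
            return 0
        if rec not in counts:
            counts[rec] = 1 + depth(nxt)
        return counts[rec]

    for rec in mapping:
        counts.setdefault(rec, depth(rec))
    return counts
```

A distributed version replaces the recursion with repeated joins of the table to itself on col1 = extension_col1, doubling the resolved chain length each round.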