Processing 6 billion records in 4 seconds on a home PC with a crappy Pentium processor.

Lately I see a lot of hype around Hadoop. However, if you look closely at the performance of Hadoop clusters, it is not all that stellar. Actually, it is more like a terrible waste of computing cycles. Let's see if we can find a greener, more environmentally friendly solution to the problem of big data.

Our first ingredient is an NVIDIA Titan GPU: a relatively cheap, massively parallel GPU.

The second ingredient is an open-source, column-based database that lets us store and process each column separately on the GPU: https://github.com/antonmks/Alenka

The third and last ingredient is a smart way to sort our data in order to minimize processing.

GPU memory is a limited resource. Even high-end GPUs have at most 8 GB of memory, which means that our data needs to be split and processed in pieces that fit in GPU memory. In Alenka we can specify the size of these pieces when we create a database.
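
For illustration, here is a minimal C++ sketch of that idea (not Alenka's actual code – the names are made up and the tiny segment size is only for the example): the column is simply cut into fixed-size pieces, each described by an offset and a row count.

```cpp
// A minimal sketch of splitting a column into fixed-size pieces (segments)
// so that each piece fits in GPU memory. Illustrative only, not Alenka's code;
// a real segment would hold many millions of rows.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

struct Segment {
    std::size_t offset;  // index of the first row of this segment in the column
    std::size_t count;   // number of rows in this segment
};

std::vector<Segment> split_into_segments(std::size_t total_rows, std::size_t segment_size) {
    std::vector<Segment> segments;
    for (std::size_t offset = 0; offset < total_rows; offset += segment_size)
        segments.push_back({offset, std::min(segment_size, total_rows - offset)});
    return segments;
}

int main() {
    for (const Segment& s : split_into_segments(/*total_rows=*/10, /*segment_size=*/4))
        std::cout << "segment at offset " << s.offset << " with " << s.count << " rows\n";
}
```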

TPC-H queries are a good way to measure performance: let's implement a few of the queries and compare the results to those obtained on a Hadoop cluster.

The largest table in our test is lineitem – at scale factor 1000 it contains 6 billion records. Query 6 looks like this:

select
    sum(l_extendedprice * l_discount) as revenue
from lineitem
where
    l_discount >= 0.05 AND l_discount <= 0.07 AND l_quantity < 24
    AND l_shipdate >= 19940101 AND l_shipdate < 19950101;

To quickly select matching pieces (segments) from our database, we need to presort the file on the l_shipdate field before creating the database. Alenka stores the min and max values of each segment, so it can quickly discard the segments that do not match our filter conditions. The segments that do match are sent to the GPU to calculate the revenue. The entire operation takes about 4 seconds.
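
Here is a rough C++ sketch of that min/max trick (not Alenka's implementation – the struct names and date values are invented): each segment records the smallest and largest l_shipdate it contains, and a segment is sent to the GPU only if that range can overlap the Q6 date filter.

```cpp
// A minimal sketch of min/max segment pruning for the Q6 predicate
// l_shipdate >= 19940101 AND l_shipdate < 19950101. Illustrative only.
#include <cstdint>
#include <iostream>
#include <vector>

struct SegmentStats {
    int     id;
    int32_t min_shipdate;  // smallest l_shipdate stored in the segment
    int32_t max_shipdate;  // largest l_shipdate stored in the segment
};

// A segment may contain matching rows only if its value range overlaps [lo, hi).
bool may_match(const SegmentStats& s, int32_t lo, int32_t hi) {
    return s.max_shipdate >= lo && s.min_shipdate < hi;
}

int main() {
    // Because the file was presorted on l_shipdate, each segment covers a
    // narrow date range (the values here are made up).
    std::vector<SegmentStats> segments = {
        {0, 19920101, 19931229},  // entirely before 1994 -> skipped
        {1, 19931230, 19940315},  // overlaps the range   -> sent to the GPU
        {2, 19940316, 19941230},  // inside the range     -> sent to the GPU
        {3, 19950102, 19981201},  // entirely after 1994  -> skipped
    };
    for (const SegmentStats& s : segments)
        std::cout << "segment " << s.id << ": "
                  << (may_match(s, 19940101, 19950101) ? "process on GPU" : "skip") << "\n";
}
```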

Now let's look at Query 1:

select
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
from
    lineitem
where
    l_shipdate <=  19980902
group by
    l_returnflag,
    l_linestatus
order by
    l_returnflag,
    l_linestatus;

The filter condition lets us discard only a small fraction of the segments, so the majority of the data has to be sent to the GPU. After the data is filtered, it needs to be aggregated, which is not a problem for our 4.5-teraflop GPU. All operations take 72 seconds.
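
The aggregation itself is an ordinary group-by keyed on (l_returnflag, l_linestatus). The sketch below shows the shape of that computation in plain C++ on a handful of rows; on the GPU the same work is done with parallel primitives such as sort-by-key and reduce-by-key. This is only an illustration, not Alenka's code.

```cpp
// A CPU-side sketch of the Q1-style aggregation: group rows by
// (l_returnflag, l_linestatus) and accumulate per-group sums and counts.
#include <iostream>
#include <map>
#include <utility>
#include <vector>

struct Row {
    char   l_returnflag;
    char   l_linestatus;
    double l_quantity;
};

int main() {
    std::vector<Row> rows = {
        {'A', 'F', 17.0}, {'N', 'O', 36.0}, {'A', 'F', 8.0}, {'R', 'F', 28.0},
    };
    // key = (l_returnflag, l_linestatus), value = (sum_qty, count_order)
    std::map<std::pair<char, char>, std::pair<double, long>> groups;
    for (const Row& r : rows) {
        auto& g = groups[{r.l_returnflag, r.l_linestatus}];
        g.first  += r.l_quantity;   // sum(l_quantity)
        g.second += 1;              // count(*)
    }
    for (const auto& [key, agg] : groups)
        std::cout << key.first << " " << key.second
                  << " sum_qty=" << agg.first
                  << " count_order=" << agg.second << "\n";
}
```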

Query 3 is a bit different – it involves joining a few tables:

select
    l_orderkey,
    sum(l_extendedprice * (1 - l_discount)) as revenue,
    o_orderdate,
    o_shippriority
from
    customer,
    orders,
    lineitem
where
    c_mktsegment = "BUILDING"
    and c_custkey = o_custkey
    and l_orderkey = o_orderkey
    and o_orderdate < 19950315
    and l_shipdate > 19950315
group by
    l_orderkey,
    o_orderdate,
    o_shippriority
order by
    revenue desc,
    o_orderdate;

We can filter the segments very quickly if the lineitem table is presorted on l_shipdate and the orders table is presorted on the o_orderdate field. Remember that segments store min/max values!

Additionally, when creating the database we can sort every segment of the lineitem table on l_orderkey – this way we can quickly join the lineitem and orders tables using a sort/merge join algorithm (which requires the joined tables to be sorted on the join key). After that we perform a grouping operation and sort the results. The entire operation takes 23 seconds.
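
A sort/merge join simply walks both sorted inputs in lockstep and emits joined rows whenever the keys match. Below is a small C++ sketch of the idea for orders and lineitem joined on the order key (illustrative structs and values, not Alenka's internals).

```cpp
// A minimal sort/merge join between orders and lineitem on the order key,
// assuming both inputs are already sorted on that key - which is exactly
// what presorting the segments buys us. Illustrative only.
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

struct Order    { int64_t o_orderkey; int32_t o_orderdate; };
struct LineItem { int64_t l_orderkey; double  l_extendedprice; };

int main() {
    std::vector<Order>    orders    = {{1, 19950110}, {3, 19950201}, {4, 19950301}};
    std::vector<LineItem> lineitems = {{1, 100.0}, {3, 50.0}, {3, 75.0}, {5, 20.0}};

    std::size_t i = 0, j = 0;
    while (i < orders.size() && j < lineitems.size()) {
        if (orders[i].o_orderkey < lineitems[j].l_orderkey) {
            ++i;                       // this order has no matching line items
        } else if (lineitems[j].l_orderkey < orders[i].o_orderkey) {
            ++j;                       // this line item has no matching order
        } else {
            // Matching keys: emit a joined row for every line item of this order.
            int64_t key = orders[i].o_orderkey;
            for (; j < lineitems.size() && lineitems[j].l_orderkey == key; ++j)
                std::cout << "orderkey=" << key
                          << " o_orderdate=" << orders[i].o_orderdate
                          << " l_extendedprice=" << lineitems[j].l_extendedprice << "\n";
            ++i;
        }
    }
}
```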

Note that these results were obtained on a PC with the following specs:

Pentium G620 (2 cores), one 120 GB Vertex 3 SSD, one 2 TB WD hard drive, 16 GB of main memory, and an NVIDIA Titan (6 GB) card.

Now let's compare these results with publicly available Hadoop Hive TPC-H results:

Hive queries were run on a 16-node cluster. Each node has dual quad-core Intel Xeon L5630 CPUs, 32 GB of main memory, and ten 10K RPM 300 GB SAS hard drives.

| Time (s) | Q1  | Q3   | Q6  |
|----------|-----|------|-----|
| Hadoop   | 443 | 1125 | 166 |
| Titan*   | 72  | 23   | 4   |

* – time is counted as total processing time minus disk time.

As we can see, a single low-end PC with a GPU is perfectly capable of outperforming a Hadoop cluster.

P.S. Alenka is still a work in progress, but if you have a project with lots of data, let me know – I might be able to help you.

antonmks@gmail.com

 

https://github.com/antonmks/Alenka/blob/master/green.md