At Automattic we see over 131M unique visitors per month from the US alone. As part of the data team we are responsible for taking in the stream of Nginx logs and turning them into counts of views and unique visitors per day, week, and month on both a per blog and global basis.
To do all that we have a near-realtime pipeline that uses a myriad of technologies including PHP, Kafka, and various components from the Hadoop ecosystem. Unfortunately this system broke down last month and caused us to lose a portion of uniques data. After resolving the initial issue it became clear to us we will need to reprocess data from original log files in order to recover all of the data we’ve lost.
Problem is with of billions of hits a day, our data volume is comparable to that of the IceCube South Pole Neutrino Observatory, a detector made of one cubic kilometer of ice. Combine that with the fact that we’ve been leaking data over the course of a couple of days means we are left with a lot of logs to reprocess. Certainly not something we want to try and churn through on a single server. Luckily we already have a Hadoop cluster in place so we can tap into the powers of MapReduce to solve this problem.
Hive to the rescue
Apache Hive is a data warehouse infrastructure built on top of Hadoop. It allows processing of data with SQL like queries and it’s very pluggable so we can configure it to process our logs quite easily.
First, we will create a Hive table that’s configured to read raw compressed Nginx logs. To do this we will instruct Hive to create an external table at the location where we will copy our logs to on HDFS. We specify that the table is to be stored as
TEXTFILE which allows Hive to read the stored files in as either plain text or for those files with a
.gz extension, gziped plain text. We also want to apply a custom serialization/deserialization (serde) format to each log line so that each line is parsed into columns of data in our table. To do that we will use the included RegEx serde, it’s slow but given it simply applies a RegEx expression to each line in order to extract data it’s highly configurable.
Our create table statement looks something like this:
Once we have done this all we need to do is to copy our gziped logs to the specified HDFS location and query that Hive table. Hive even offers some very helpful functions like
parse_url() which we can use to extract parts of the URL or query parameters.
Of course there’s quite a bit of business logic associated with our log processing and not all of it can be replicated with the bundled Hive functions. Fortunately Hive’s very pluggable on the querying side as well. Hive has a concept of User Defined Functions (UDFs) where we can write a simple Java class that implements a common UDF interface and install it in our cluster at query time to be distributed out to all our mappers for processing.
Using these methods and the spare capacity of our Hadoop cluster we were able to reprocess our logs in a couple hours instead of having a single server take days or even weeks to churn through them. Just in time to make sure our bloggers get the fireworks they rightly deserve on their annual reports.
Hey, thanks for reading and if the above sounded at all interesting to you we’re hiring and would love to talk with you. 🙂