Log Analysis With Hive

At Automattic we see over 131M unique visitors per month from the US alone. As part of the data team, we are responsible for taking in the stream of Nginx logs and turning it into counts of views and unique visitors per day, week, and month, on both a per-blog and a global basis.

To do all that, we have a near-realtime pipeline built on a myriad of technologies, including PHP, Kafka, and various components from the Hadoop ecosystem. Unfortunately, this system broke down last month and we lost a portion of our uniques data. After resolving the initial issue, it became clear that we would need to reprocess the original log files to recover everything we had lost.


Building a Faster ETL Pipeline with Flume, Kafka, and Hive

At WordPress.com we process a lot of events, including some that are batched and sent asynchronously, sometimes days later. But when querying this data, we are likely to care more about when the events occurred than when they were sent to our servers. Knowing this, we store our event data in Hive partitioned by when the events occurred rather than when they were ingested.
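To make the partitioning idea concrete, here is a minimal Python sketch of routing events into date partitions by occurrence time. It is an illustration only, not the actual pipeline: the field names `occurred_at` and `received_at`, the `dt=YYYY-MM-DD` partition layout, and the sample events are all assumptions made for this example.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_key(event):
    """Build a Hive-style partition value from when the event occurred.

    The `occurred_at` field (Unix seconds, UTC) is a hypothetical name.
    """
    occurred = datetime.fromtimestamp(event["occurred_at"], tz=timezone.utc)
    return occurred.strftime("dt=%Y-%m-%d")


def bucket_by_event_time(events):
    """Group events by occurrence date, ignoring when they were received."""
    partitions = defaultdict(list)
    for event in events:
        partitions[partition_key(event)].append(event)
    return dict(partitions)


# Three made-up events that all arrived in the same batch on 2015-01-03,
# but occurred on three different days.
events = [
    {"name": "post_view", "occurred_at": 1420070400, "received_at": 1420243200},
    {"name": "post_like", "occurred_at": 1420156800, "received_at": 1420243200},
    {"name": "post_view", "occurred_at": 1420243200, "received_at": 1420243200},
]

partitions = bucket_by_event_time(events)
for key in sorted(partitions):
    print(key, [e["name"] for e in partitions[key]])
```

Partitioning on `received_at` instead would put all three events into a single partition, so a query about one particular day would have to scan later partitions to find late-arriving events.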
