Building a Faster ETL Pipeline with Flume, Kafka, and Hive

At WordPress.com we process a lot of events, including some that are batched and sent asynchronously, sometimes days later. But when querying this data, we are likely to care more about when the events occurred than when they were sent to our servers. Knowing this, we store our event data in Hive partitioned by when the events occurred rather than when they were ingested.
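
To make the distinction concrete, here is a minimal sketch of picking a partition from the time an event occurred instead of the time it arrived. The field names (`event_ts`, `received_ts`), the event name, and the `dt=YYYY-MM-DD` partition layout are assumptions for illustration, not our actual schema.

```python
from datetime import datetime, timezone


def partition_for(event: dict) -> str:
    """Return a Hive-style partition spec based on when the event occurred.

    `event_ts` is a hypothetical field holding the event's own Unix timestamp.
    """
    occurred = datetime.fromtimestamp(event["event_ts"], tz=timezone.utc)
    return f"dt={occurred:%Y-%m-%d}"


# A hypothetical event that occurred on 2015-06-01 but arrived three days later.
late_event = {
    "name": "post_view",
    "event_ts": 1433160000,     # 2015-06-01 12:00:00 UTC
    "received_ts": 1433419200,  # 2015-06-04 12:00:00 UTC
}

print(partition_for(late_event))
```

Partitioning on `received_ts` would have placed this event in the 2015-06-04 partition, so a query for activity on June 1 would have missed it.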
