At Automattic we run a diverse array of systems and as with many companies Kafka is the glue that ties them together; letting us to shuffle data back and forth. Our experience with Kafka have thus far been fantastic, it’s stable, provides excellent throughput, and the simple API makes it trivial to hook any of our systems up to it. In fact it’s been so popular that we’ve been steadily piping more and more data through it over the past year. Now we’re starting to run out of disk space necessitating an expansion of the cluster.
At WordPress.com we process a lot of events including some some events that are batched and sent asynchronously sometimes days later. But when querying this data we are likely to care more about when the events occurred rather then when it was sent to our servers. Knowing this we store our event data in Hive partitioned by when the events occurred rather then when they are ingested.