One of the most common questions we get from our WordPress VIP clients, many of whom are large media companies that publish constantly, is how to bias search results towards more recent content when scoring and sorting them. This type of problem is extremely hard to solve with a traditional RDBMS, but we provide most of our VIP clients with their own dedicated Elasticsearch index, and as it happens ES comes with some powerful scoring functions for just this purpose.
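As a rough illustration of the kind of scoring function meant here, the sketch below builds an Elasticsearch `function_score` query that multiplies the text relevance score by a Gaussian decay on a date field, so older posts score lower. The field names (`content`, `date`), the 30-day scale, and the helper name are assumptions made for the example, not the settings we actually use.

```python
import json


def recency_biased_query(text, date_field="date", scale="30d", decay=0.5):
    """Build a function_score query body that favors recent documents.

    Assumptions (hypothetical, for illustration only): the searchable text
    lives in a field called "content" and the publish date in `date_field`.
    """
    return {
        "query": {
            "function_score": {
                # The ordinary relevance query.
                "query": {"match": {"content": text}},
                "functions": [
                    {
                        # Gaussian decay: the factor is 1.0 at `origin` and
                        # drops to `decay` once a document is `scale` away.
                        "gauss": {
                            date_field: {
                                "origin": "now",
                                "scale": scale,
                                "decay": decay,
                            }
                        }
                    }
                ],
                # Multiply the relevance score by the decay factor.
                "boost_mode": "multiply",
            }
        }
    }


body = recency_biased_query("election results")
print(json.dumps(body, indent=2))
```

The resulting dictionary can be sent as the body of a search request. Tuning `scale` and `decay` controls how quickly older content is pushed down the results.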
At Automattic we run a diverse array of systems, and as with many companies Kafka is the glue that ties them together, letting us shuffle data back and forth. Our experience with Kafka has thus far been fantastic: it’s stable, provides excellent throughput, and the simple API makes it trivial to hook any of our systems up to it. In fact it’s been so popular that we’ve been steadily piping more and more data through it over the past year. Now we’re starting to run out of disk space, necessitating an expansion of the cluster.
At Automattic we see over 131M unique visitors per month from the US alone. As part of the data team, we are responsible for taking in the stream of Nginx logs and turning them into counts of views and unique visitors per day, week, and month on both a per-blog and global basis.
To do all that we have a near-realtime pipeline that uses a myriad of technologies including PHP, Kafka, and various components from the Hadoop ecosystem. Unfortunately, this system broke down last month and caused us to lose a portion of our uniques data. After resolving the initial issue, it became clear to us that we would need to reprocess data from the original log files in order to recover all of the data we’d lost.
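To make the target of that reprocessing concrete, here is a minimal sketch of the per-blog daily aggregation: views are a plain count, and uniques are the number of distinct visitors. It assumes the log lines have already been parsed into `(blog_id, day, visitor_id)` tuples. That record shape, the function name, and the sample rows are all made up for illustration, and this is not the production pipeline, which runs on Kafka and Hadoop rather than in-memory sets.

```python
from collections import defaultdict


def count_views_and_uniques(records):
    """Return {(blog_id, day): (views, uniques)} for parsed log records.

    `records` is an iterable of (blog_id, day, visitor_id) tuples; this
    shape is an assumption for the example, not the real log schema.
    """
    views = defaultdict(int)     # every hit counts as a view
    visitors = defaultdict(set)  # distinct visitor ids per (blog, day)
    for blog_id, day, visitor_id in records:
        key = (blog_id, day)
        views[key] += 1
        visitors[key].add(visitor_id)
    return {key: (views[key], len(visitors[key])) for key in views}


# Made-up sample rows: visitor "a" hits blog 1 twice on the same day.
sample = [
    (1, "day-1", "a"),
    (1, "day-1", "a"),
    (1, "day-1", "b"),
    (2, "day-1", "a"),
]
counts = count_views_and_uniques(sample)
print(counts)
```

Weekly, monthly, and global figures follow the same pattern with a different grouping key. Note that uniques cannot simply be summed across days, since the same visitor may appear on several of them, which is why losing part of the data means going back to the original logs.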