Log Analysis With Hive

At Automattic we see over 131M unique visitors per month from the US alone. As part of the data team we are responsible for taking in the stream of Nginx logs and turning them into counts of views and unique visitors per day, week, and month, on both a per-blog and global basis.

To do all that we have a near-realtime pipeline that uses a myriad of technologies including PHP, Kafka, and various components from the Hadoop ecosystem. Unfortunately this system broke down last month and caused us to lose a portion of our uniques data. After resolving the initial issue it became clear that we would need to reprocess data from the original log files in order to recover everything we'd lost.

Stacked Logs – Keith Ewing

The problem is, with billions of hits a day, our data volume is comparable to that of the IceCube South Pole Neutrino Observatory, a detector made of one cubic kilometer of ice. Combine that with the fact that we'd been leaking data over the course of a couple of days, and we are left with a lot of logs to reprocess. Certainly not something we want to try and churn through on a single server. Luckily we already have a Hadoop cluster in place, so we can tap into the powers of MapReduce to solve this problem.

Hive to the rescue

Apache Hive is a data warehouse infrastructure built on top of Hadoop. It allows processing of data with SQL-like queries, and it's very pluggable, so we can configure it to process our logs quite easily.

First, we will create a Hive table that's configured to read raw compressed Nginx logs. To do this we will instruct Hive to create an external table at the HDFS location where we will copy our logs. We specify that the table is to be stored as TEXTFILE, which allows Hive to read the stored files as either plain text or, for files with a .gz extension, gzipped plain text. We also want to apply a custom serialization/deserialization (SerDe) format to each log line so that each line is parsed into columns of data in our table. To do that we will use the included RegEx SerDe; it's slow, but since it simply applies a regular expression to each line to extract the data, it's highly configurable.

Our create table statement looks something like this:
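
The sketch below is illustrative: the columns and regex assume a stock Nginx combined log format, and the HDFS path is a placeholder rather than our production location.

-- Built-in RegexSerDe: one capture group per column, and all columns must be STRING
CREATE EXTERNAL TABLE IF NOT EXISTS raw_nginx_logs (
  remote_addr     STRING,
  time_local      STRING,
  request         STRING,
  status          STRING,
  bytes_sent      STRING,
  http_referer    STRING,
  http_user_agent STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex' = '([^ ]*) - [^ ]* \\[([^\\]]*)\\] "([^"]*)" ([^ ]*) ([^ ]*) "([^"]*)" "([^"]*)"'
)
STORED AS TEXTFILE
LOCATION 'hdfs:///path/to/raw_logs/';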

Once we have done this, all we need to do is copy our gzipped logs to the specified HDFS location and query that Hive table. Hive even offers some very helpful functions like parse_url(), which we can use to extract parts of the URL or individual query parameters.
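
For example, something along these lines, using the sketch table from above; the host used to rebuild a full URL is just a placeholder:

-- Count hits per utm_source; the request line looks like "GET /path?utm_source=foo HTTP/1.1",
-- so we rebuild a full URL for parse_url() to pick apart
SELECT
  parse_url(concat('http://example.wordpress.com', split(request, ' ')[1]), 'QUERY', 'utm_source') AS utm_source,
  count(*) AS hits
FROM raw_nginx_logs
GROUP BY parse_url(concat('http://example.wordpress.com', split(request, ' ')[1]), 'QUERY', 'utm_source');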

Of course there’s quite a bit of business logic associated with our log processing, and not all of it can be replicated with the bundled Hive functions. Fortunately Hive is very pluggable on the querying side as well. Hive has a concept of User Defined Functions (UDFs): we can write a simple Java class that implements a common UDF interface and install it in our cluster at query time, to be distributed out to all our mappers for processing.
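
Registering and using such a UDF at query time looks roughly like this; the jar path, class name, and is_bot function are hypothetical placeholders, not our actual code:

-- Ship the jar with the query and register the function for this session
ADD JAR /path/to/a8c-hive-udfs.jar;
CREATE TEMPORARY FUNCTION is_bot AS 'com.a8c.hive.udf.IsBot';

-- Hypothetical usage: drop bot traffic before counting hits
SELECT count(*) AS human_hits
FROM raw_nginx_logs
WHERE NOT is_bot(http_user_agent);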

Using these methods and the spare capacity of our Hadoop cluster we were able to reprocess our logs in a couple of hours instead of having a single server take days or even weeks to churn through them. Just in time to make sure our bloggers get the fireworks they rightly deserve on their annual reports.

Hey, thanks for reading and if the above sounded at all interesting to you we’re hiring and would love to talk with you. :)

Building a Faster ETL Pipeline with Flume, Kafka, and Hive

At WordPress.com we process a lot of events, including some events that are batched and sent asynchronously, sometimes days later. But when querying this data we are likely to care more about when the events occurred rather than when they were sent to our servers. Knowing this, we store our event data in Hive partitioned by when the events occurred rather than when they are ingested.

Event Ingestion, Take One

The initial design of our ETL pipeline looks something like this:

  • Raw logs are aggregated and processed by a custom parser, which functions both as an aggregator for high-level stats and as an emitter of raw logs into the various Kafka topics.
  • A Flume agent then uses the Kafka source to pull from the appropriate topic.
  • The Flume Morphline interceptor is then used to do a series of transformations, including annotating what type of event the log line represents.
  • Events are buffered via a memory channel and sent to the Kite Dataset sink.
  • Kite then handles interacting with Hive and persisting the events in HDFS.

With this ETL, data is available for querying almost immediately and is stored in (close to) its final state within Hive tables. As an added bonus, because we are using Kite Datasets and the accompanying Flume sink, Hive partitions are handled for us automatically. However, due to the asynchronous nature of our event collection we end up having to write to multiple partitions at the same time, which results in the formation of many small files. To get around this we simply run a compaction job with Oozie after some date cutoff for events.

This process worked very well until a couple of days ago, when some network issues and some bugs caused OOM errors, which resulted in our Flume agent sporadically losing all events buffered in memory. Fortunately, we persist our Kafka topic of important logs for about a month, so we can just replay it then merge and dedupe the data in Hive. Not only that, but we have a lot of servers in Cloudera Manager, so we can just add our backfilling Flume agent role to a bunch of them and we'll be done in no time. #winning

Alas, the dream was not to be. Because we must write to many partitions at once, sending events to the Kite Dataset sink with so many Flume agents caused our Hive Metastore to become unstable, not only limiting the rate of our ingestion but also causing numerous query failures for our growth explorers. Sure, we could have just slowed ingestion down, but being snake people that just won’t do. After all we have the technology; we can make our ETL…

Better… Stronger… Faster

With what we learned from our first ETL pipeline, we decided to rewrite it, this time with an eye on durability and ingestion performance as well. We kept our custom log parser / aggregator / Kafka emitter as it’s doing an admirable job. In addition, we decided to stick with Flume (and its Kafka source) for our ETL process because of its ease of use and customizability.

We did not like the fact that when a Flume agent crashed it would just drop the events in the memory channel on the floor, so to make our process more durable we opted to use Flume’s Kafka channel instead. This lets us multiplex and publish records to Kafka after they have been transformed by the Morphline interceptor, so that they are persisted even if the Flume agent dies. Doing this comes at the expense of possible duplicate events being emitted when anomalies happen; however, we figure it’s far easier to dedupe with Hive queries than it is to recreate missing data.

To make writes more performant we reconfigured Morphline to convert our events to Avro records with our predefined schemas and then serialize and compress those records, making them ready for writing. Once the raw Avro byte array has been generated we multiplex it into the proper Kafka channel. Finally, we use the HDFS sink to pull these events from Kafka in batches and write the raw Avro byte arrays to HDFS, partitioned by when the record was written. We do this directly, without touching Hive or its Metastore.

By partitioning on when the record was written we can ensure we only write to a single partition at a time. This results in fewer, larger files, which is not only more performant but also gives us the ability to know when partitions can be considered complete. With this knowledge we can now have Oozie jobs that merge, dedupe, and compact events from the intermediate table, partitioned by the time an event is recorded, into another table partitioned by when events occurred, for optimal query performance.
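
The Hive step those Oozie jobs run looks roughly like the sketch below. It assumes a final events_type1 table partitioned by event time, plus event_id, event_ts, and payload fields in the Avro records; all of these names are illustrative, not our actual schema.

-- Dedupe on event_id and re-partition by event time via dynamic partitioning
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE events_type1 PARTITION (event_ymdh)
SELECT
  event_id,
  event_ts,
  payload,
  cast(from_unixtime(event_ts, 'yyyyMMddHH') AS INT) AS event_ymdh
FROM (
  SELECT
    event_id,
    event_ts,
    payload,
    row_number() OVER (PARTITION BY event_id ORDER BY event_ts) AS rn
  FROM etl_type1
  -- Only touch intermediate partitions that are past the completeness cutoff
  WHERE record_ymdh <= ${hivevar:cutoff_ymdh}
) deduped
WHERE rn = 1;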

Show Me The Code

So what does this look like? Here’s a simplified flume.conf example:

#
# Setup some names
#
agent.sources  = sr-kafka
agent.channels = ch-kafka-type1 ch-kafka-type2
agent.sinks    = sk-hdfs-type1 sk-hdfs-type2

#
# Configure same Kafka source for all channels
#
agent.sources.sr-kafka.channels = ch-kafka-type1 ch-kafka-type2
agent.sources.sr-kafka.type = org.apache.flume.source.kafka.KafkaSource
agent.sources.sr-kafka.zookeeperConnect = HOST1:PORT,HOST2:PORT,HOST3:PORT/PATH
agent.sources.sr-kafka.groupId = flume_source_20150712
agent.sources.sr-kafka.topic = kafka-topic
# Grabs in batches of 500 or every second
agent.sources.sr-kafka.batchSize = 500
agent.sources.sr-kafka.batchDurationMillis = 1000
# Read from start of topic
agent.sources.sr-kafka.kafka.auto.offset.reset = smallest

#
# Configure interceptors
#
agent.sources.sr-kafka.interceptors = in-morphline-etl in-host-set

agent.sources.sr-kafka.interceptors.in-morphline-etl.type = org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder
agent.sources.sr-kafka.interceptors.in-morphline-etl.morphlineFile = /path/to/morphline.conf
agent.sources.sr-kafka.interceptors.in-morphline-etl.morphlineId = morphline_id

agent.sources.sr-kafka.interceptors.in-host-set.type = host
agent.sources.sr-kafka.interceptors.in-host-set.useIP = false
agent.sources.sr-kafka.interceptors.in-host-set.hostHeader = flume_host

#
# Multiplex our records into channels based on the value of `eventmarker` which comes from Morphline
#
agent.sources.sr-kafka.selector.type = multiplexing
agent.sources.sr-kafka.selector.header = eventmarker
agent.sources.sr-kafka.selector.default = ch-kafka-type1
agent.sources.sr-kafka.selector.mapping.type1 = ch-kafka-type1
agent.sources.sr-kafka.selector.mapping.type2 = ch-kafka-type2
agent.sources.sr-kafka.selector.mapping.type3 = ch-kafka-type1 ch-kafka-type2

#
# Configure the channels we multiplexed into
#
agent.channels.ch-kafka-type1.type = org.apache.flume.channel.kafka.KafkaChannel
agent.channels.ch-kafka-type1.brokerList = HOST1:PORT,HOST2:PORT,HOST3:PORT
agent.channels.ch-kafka-type1.zookeeperConnect = HOST1:PORT,HOST2:PORT,HOST3:PORT/PATH
agent.channels.ch-kafka-type1.groupId = flume_channel_20150712
agent.channels.ch-kafka-type1.topic = kafka-topic-flume-type1

agent.channels.ch-kafka-type2.type = org.apache.flume.channel.kafka.KafkaChannel
agent.channels.ch-kafka-type2.brokerList = HOST1:PORT,HOST2:PORT,HOST3:PORT
agent.channels.ch-kafka-type2.zookeeperConnect = HOST1:PORT,HOST2:PORT,HOST3:PORT/PATH
agent.channels.ch-kafka-type2.groupId = flume_channel_20150712
agent.channels.ch-kafka-type2.topic = kafka-topic-flume-type2

#
# Configure sinks; We pull from Kafka in batches and write large files into HDFS.
#
agent.sinks.sk-hdfs-type1.channel = ch-kafka-type1
agent.sinks.sk-hdfs-type1.type = hdfs
agent.sinks.sk-hdfs-type1.hdfs.path = hdfs://path/to/database/etl_type1/record_ymdh=%Y%m%d%H
# Prefix files with the Flume agent's hostname so we can run multiple agents without collision
agent.sinks.sk-hdfs-type1.hdfs.filePrefix = %{flume_host}
# Hive needs files to end in .avro
agent.sinks.sk-hdfs-type1.hdfs.fileSuffix = .avro
# Roll files in HDFS every 5 min or at 255MB; don't roll based on number of records
# We roll at 255MB because our block size is 128MB, we want 2 full blocks without going over
agent.sinks.sk-hdfs-type1.hdfs.rollInterval = 300
agent.sinks.sk-hdfs-type1.hdfs.rollSize = 267386880
agent.sinks.sk-hdfs-type1.hdfs.rollCount = 0
# Write to HDFS file in batches of 500 records
agent.sinks.sk-hdfs-type1.hdfs.batchSize = 500
# We already serialized and encoded the record into Avro in Morphline so just write the byte array
agent.sinks.sk-hdfs-type1.hdfs.fileType = DataStream
# Give us a higher timeout because we are writing in batch
agent.sinks.sk-hdfs-type1.hdfs.callTimeout = 60000
# Use current time in UTC for the value of `record_ymdh=%Y%m%d%H` above
agent.sinks.sk-hdfs-type1.hdfs.timeZone = UTC
agent.sinks.sk-hdfs-type1.hdfs.useLocalTimeStamp = true
# Our record is serialized via Avro
agent.sinks.sk-hdfs-type1.serializer = org.apache.flume.sink.hdfs.AvroEventSerializer$Builder

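#
# Second sink: identical settings to sk-hdfs-type1, but reading type2 events from their own channel
#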
agent.sinks.sk-hdfs-type2.channel = ch-kafka-type2
agent.sinks.sk-hdfs-type2.type = hdfs
agent.sinks.sk-hdfs-type2.hdfs.path = hdfs://path/to/database/etl_type2/record_ymdh=%Y%m%d%H
agent.sinks.sk-hdfs-type2.hdfs.filePrefix = %{flume_host}
agent.sinks.sk-hdfs-type2.hdfs.fileSuffix = .avro
agent.sinks.sk-hdfs-type2.hdfs.rollInterval = 300
agent.sinks.sk-hdfs-type2.hdfs.rollSize = 267386880
agent.sinks.sk-hdfs-type2.hdfs.rollCount = 0
agent.sinks.sk-hdfs-type2.hdfs.batchSize = 500
agent.sinks.sk-hdfs-type2.hdfs.fileType = DataStream
agent.sinks.sk-hdfs-type2.hdfs.callTimeout = 60000
agent.sinks.sk-hdfs-type2.hdfs.timeZone = UTC
agent.sinks.sk-hdfs-type2.hdfs.useLocalTimeStamp = true
agent.sinks.sk-hdfs-type2.serializer = org.apache.flume.sink.hdfs.AvroEventSerializer$Builder

The most important part of the above is that we set the HDFS sink to use the Avro serializer and instruct it to simply write the raw bytes, as we've already serialized and compressed the Avro record with Morphline. Speaking of which, here's our example morphline.conf:

morphlines : [
  {
    id : morphline_id

    # Import the Kite SDK and any custom libs you may have and need
    importCommands : [
      "org.kitesdk.**",
      "com.a8c.**"
    ]

    commands : [
      # Each command consumes the output record of the previous command
      # and pipes another record downstream.

      {
        # Parse input attachment and emit a record for each input line
        readLine {
          charset : UTF-8
        }
      }

      {
        # More commands for your ETL process
      }

      {
        # Say we set a field named `eventmarker` somewhere above to indicate the
        # type of record this is, and we have a different schema for each type
        if {
          conditions : [
            { equals { eventmarker : "type1" } }
          ]
          then : [

            {
              # Set the schema for the Flume HDFS sink
              setValues {
                flume.avro.schema.url : "file:/path/to/schema/type1.avsc"
              }
            }

            {
              # Converts this to an Avro record according to schema
              toAvro {
                schemaFile : /path/to/schema/type1.avsc
              }
            }

          ]
          else : [

            {
              setValues {
                flume.avro.schema.url : "file:/path/to/schema/type2.avsc"
              }
            }

            {
              toAvro {
                schemaFile : /path/to/schema/type2.avsc
              }
            }

          ]
        }
      }

      {
        # Serialize the Avro record into a byte array, compressed with snappy
        writeAvroToByteArray : {
          format : containerlessBinary
          codec : snappy
        }
      }

    ]
  }
]

With these configs Flume will write compressed Avro files directly to HDFS, but we still need to let Hive know where to look, so we create the table in Hive:

-- Table name
CREATE TABLE IF NOT EXISTS etl_type1

-- We need to specify how we are partitioning this table with the Flume HDFS sink
PARTITIONED BY ( record_ymdh INT )

-- Files were written in Avro!
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'

-- We are writing to this dir in HDFS from Flume
LOCATION 'hdfs://path/to/database/etl_type1'

-- We also store the Avro schema in a hidden dir on HDFS for convenience
TBLPROPERTIES (
  'avro.schema.url'='hdfs://path/to/database/etl_type1/.schema/type1.avsc'
);

For convenience I also stored the Avro schema for the table in a .schema directory on HDFS, but that schema can really live anywhere readable by Hive.

Of course, as we ingest data the Flume HDFS sink will start creating new directories, a.k.a. partitions, in HDFS, but Hive will know nothing about them. So you will need to let Hive repair its Metastore by scanning HDFS before you can query for new data:

MSCK REPAIR TABLE etl_type1;

SELECT *
  FROM etl_type1
  WHERE ...;

What’s Next?

We have not really pushed this new pipeline to see where its limits are; however, as I write this we are on track to ingest a month of data in less than 12 hours. In addition, scaling this pipeline by simply spinning up more Flume agents has thus far been linear. The one downside is that for the most up-to-date information we now need to look at two separate Hive tables with different partition strategies, making queries a bit more complicated.
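
For instance, getting the freshest picture means unioning the final table with the raw intermediate one. The sketch below uses illustrative names, and the raw side may still contain duplicates that the merge job has not yet cleaned up:

-- events_type1 is partitioned by event time, etl_type1 by record (ingest) time
SELECT u.event_id, u.event_ts, u.payload
FROM (
  SELECT event_id, event_ts, payload
  FROM events_type1
  WHERE event_ymdh >= 2015071200
  UNION ALL
  SELECT event_id, event_ts, payload
  FROM etl_type1
  WHERE record_ymdh >= 2015071200
) u;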

We have in effect made what I like to call the “Iota Architecture” — a system that’s 1/3 of the way to a true lambda architecture. We currently have a system that emits a stream of events that can be read in batch or by a stream processor but we only have a batch process in place to allow for performant queries on “archival” data. Perhaps someday we’ll get the other 2/3 in place for our growth explorers to easily get a unified view.

WordPress Performance with HHVM

With Heroku-WP I hoped to lower the bar for getting WordPress up and running on a more modern tech stack. But what are the performance implications of running WordPress on such a modern set of technologies? Surely it’s faster, but by how much, and are the performance gains worth the trouble?

PHP vs HHVM

To answer that I’ve conducted a quick and dirty stress test of a sample WordPress site running under PHP and HipHop VM (HHVM) and found that the HHVM version loaded almost twice as fast and was able to serve over twice as many requests.

            Response Time   OK Responses   Errors / Timeouts
Anon PHP    4.091           8,939          0.94%
Anon HHVM   2.122           18,308         0.00%
Change      48.1%           2.05X
Auth PHP    20.688          457            74.17%
Auth HHVM   14.359          1,242          43.45%
Change      30.6%           2.72X

In the numbers above, anonymous requests represent hits to various pages without a WordPress logged-in cookie, which are eligible for Batcache caching, whereas authorized requests are hits to the same pages with a login cookie, thus bypassing page caching.

Test Methodology

To conduct this test I created a brand new Heroku instance based on the Heroku-WP template with the memcached add-on installed. To populate the content I imported the WordPress Theme Unit Test data after performing the initial setup. I then ran each permutation of the load test 5 times, flushing caches in between, and averaged the results. When switching between PHP and HHVM I simply pushed a config change to composer.json and the boot script to toggle between the two interpreters.
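
For reference, that toggle boils down to something like the following, based on the Heroku PHP buildpack conventions of the time; the version constraint and document root are illustrative:

composer.json (require the HHVM runtime instead of stock PHP):

{
  "require": {
    "hhvm": "~3.6"
  }
}

Procfile (the PHP equivalent boot script is heroku-php-nginx):

web: vendor/bin/heroku-hhvm-nginx web/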

To conduct the actual load test I used loader.io (shameless referral link), a cloud-based load generator with an insane 10,000 concurrent connection limit on the free account. (Quite generous when compared to the limit of 250 concurrent users on the free blitz.io account, which I’ve used previously.) The free loader.io account will only accept 2 URLs to test; however, with WordPress we can access most front-end pages via GET parameters passed to /index.php. Using this method and a payload file of those URLs, I was able to have loader.io cycle through all posts, pages, categories, time-based index pages, and the home page on the sample site for a more comprehensive sampling.

Finally, I ran all the tests for 60 seconds ramping up from 0 to 1,000 concurrent connections. I purposely excluded loading of static assets to place as much pressure as possible on PHP / HHVM for this synthetic stress test.

Conclusions

As expected, under high load MySQL is the real bottleneck, slowing down requests and causing errors due to hanging or killed queries. Caching was able to mitigate poor DB performance and bring page load times down by an order of magnitude for both PHP and HHVM.

Beyond that, simply turning on HHVM cut load times in half and appears to have allowed the server to better cope with multiple simultaneous slow requests piling up during times of high load. So if you have already installed a caching plugin and optimized your WordPress site, running it on HHVM could be an easy way to squeeze some more performance out of your setup.

Elasticsearch StatsD Plugin

If you’re running a multi-node Elasticsearch cluster, check out Automattic’s fork of the Elasticsearch StatsD Plugin for pushing cluster and node metrics to StatsD.

Elasticsearch, StatsD, and Grafana

At Automattic we’ve been using a set of Munin scripts to collect and aggregate Elasticsearch metrics via its native node & stats REST APIs. This method works relatively well, giving us enough longitudinal information about the cluster and nodes to diagnose issues or test optimizations. That said, cluster monitoring with Munin at a 5-minute resolution leaves a lot to be desired.

First and foremost, our Elasticsearch cluster is spread across 3 data centers, each with its own Munin instance. This makes collecting and aggregating even simple metrics like cluster-wide load quite difficult. In addition, due to the polling nature of metrics collection with Munin we are limited to a somewhat coarse resolution of 5 minutes. While this is good enough for looking at time series data over the course of a day or week, it’s quite time-consuming to wait 5 or 10 minutes for graphs to update when deploying changes or testing performance optimizations.

We’ve already had some good experience instrumenting our PHP stack with StatsD and building dashboards with Grafana, so reusing that infrastructure for Elasticsearch metrics seemed like a good fit. Our fearless leader Barry suggested we try out the Elasticsearch StatsD Plugin from Swoop Inc.; however, upon closer inspection we found that it does not deal with clustered Elasticsearch environments well and had yet to be updated to work with ES 1.x. So over the past week we’ve forked it and rewritten much of it to suit our needs.

The Automattic Elasticsearch StatsD Plugin is designed to run on all nodes of a cluster and push metrics to StatsD on a configurable interval. Once installed and configured, each node will send system metrics (e.g. CPU / JVM / network / etc.) about itself. Data nodes can also be configured to send metrics about the portion of the index stored locally. Finally, the elected master of the cluster is responsible for sending aggregate cluster metrics about the index (documents / indexing operations / cache sizes / etc.). By default the plugin will send the total cluster aggregate as well as per-index metrics, but granularity can be configured to report down to the individual shard level if so desired.
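
Configuration goes in elasticsearch.yml and looks roughly like the sketch below; the keys and values are illustrative assumptions based on the upstream plugin, so check the README for the exact settings:

# Where to send metrics, how often, and under what prefix (illustrative values)
metrics.statsd.host: statsd.example.com
metrics.statsd.port: 8125
metrics.statsd.every: 10s
metrics.statsd.prefix: elasticsearch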

Check it out on GitHub: Elasticsearch StatsD Plugin.

WordPress on NGINX + HHVM with Heroku Buildpacks

WordPress on NGINX + HHVM

It’s been a year since I last made any major changes to my WordPress on Heroku build, and in tech years that’s a lifetime. Since then Heroku has released a new PHP buildpack with nginx and HHVM built in. Much progress has also been made in both HHVM and WordPress to make the two compatible with each other. So it seems like now is as good a time as any to update the stack this site is running on.

So without further ado I’d like to introduce:
Heroku WP — A template for HHVM powered WordPress served by nginx.

The Goal

There are numerous other templates out there for running WordPress on Heroku, and my main goals for this template are:

  1. It should be simple — use the default buildpack provided by Heroku so there’s no other 3rd party dependency to implicitly trust or to maintain.
  2. It should be fast — use the latest technologies available to squeeze every last ounce of performance out of each Heroku Dyno.
  3. It should be secure — security is not an add-on; admin pages should be secure by default and database connections need to be encrypted.
  4. It should scale — just because we can serve millions of page hits a day off a single Heroku Dyno does not mean we’ll stop there. The template should be made with cloud architecture in mind so that the number of Dynos can scale up and down without breaking.

The Stack

Standing on the shoulders of giants, I was able to use the latest Heroku buildpack and get WordPress running on:

  • NGINX — An event-driven web server engineered for the modern web as a replacement for Apache. This high-performance web server is preferred by more of the top 1,000 sites than any other, and it’s what’s used by the largest WordPress install out there, WordPress.com.
  • HHVM — HipHop Virtual Machine, a JIT (just-in-time) compiler developed by Facebook to run PHP scripts, which when tested with WordPress showed up to a 2x improvement.

I have yet to run any statistical analysis on performance; however, anecdotally it feels a lot faster navigating WP admin, and page generation times look much better. I’m looking forward to running more tests and performance tuning this build in the coming weeks.


Update:
While still not a head-to-head test, looking at the response times as reported by StatusCake for this site running on Heroku-WP and for a mirror of this site running on the old Heroku LAMP stack (with no load other than StatusCake pings) shows a dramatic improvement:

Stack       Max     Min
LAMP        3,514   1,166
Heroku-WP   1,351   68

Introducing Whatson, an Elasticsearch Consulting Detective

Elasticsearch Whatson

Over the past few months I’ve been working with the Elasticsearch cluster at Automattic. While we monitor longitudinal statistics on the cluster through Munin, when something is amiss there’s currently not a good place to take a look and drill down to see what the issue is. I use various Elasticsearch plugins; however, they all have some downsides.

ES Head is fantastic for drilling down into what is happening down to a shard level; however, its rendering is way too bulky. Once there are more than a dozen nodes or indices in a cluster it becomes a scrolling nightmare.

Another tool that I use often, SegmentSpy, gives lots of info about the underlying Lucene segments; however, the use of logarithmically scaled stacked bar charts tends to make it hard to estimate the deleted-documents ratio in each shard. In addition, it’s hard to drill down to just one shard group to figure out what’s going to happen when nodes restart and shard recovery kicks off on a per-segment basis.

I’ve taken all that I wished I could do with both of those plugins and created a new Elasticsearch plugin that I call Whatson. This plugin utilizes the power of D3.js to visualize the nodes, indices, and shards within a cluster, and it allows drilling down into segment data per index or shard, with a focus on visualizing large clusters and highlighting potential problems within them. I hope this plugin helps others find and diagnose issues, so give it a try.

GitHub: elasticsearch-whatson

Static Asset Caching Using Apache on Heroku

There have been many articles written about how to properly implement static asset caching over the years, and the best practices boil down to three things.

  1. Make sure the server is sending RFC compliant caching headers.
  2. Send long expires headers for static assets.
  3. Use version numbers in asset paths so that we can precisely control expiration.

Implementing these suggestions on Heroku or elsewhere is super simple and will not only help reduce load but also make subsequent page loads faster. I’ve implemented the following on my Heroku WordPress install, which runs on the default Apache/PHP Heroku buildpack, but you can apply these concepts to other tech stacks as well.

Proper Cache Headers

To properly cache a resource we should always send an explicit cache header that tells the client retrieving the content when the resource expires, along with a cache validator header. Cache validators are either ETag or Last-Modified headers. The server should send these headers even if the expires headers have already been explicitly set on the response, so that browsers may issue conditional GET requests with If-None-Match, If-Modified-Since, or If-Range headers. The browser issues these types of requests to validate the content it has already cached locally with the origin server. (This happens when the end user clicks the refresh button in their browser.)
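
As a concrete illustration, here is roughly what a first request and a later revalidation look like; the path, dates, and ETag values below are made up:

GET /css/style.css HTTP/1.1
Host: www.example.com

HTTP/1.1 200 OK
Cache-Control: max-age=31104000
Last-Modified: Tue, 07 Jul 2015 12:00:00 GMT
ETag: "2d-51a4d0e3b9f00"

GET /css/style.css HTTP/1.1
Host: www.example.com
If-None-Match: "2d-51a4d0e3b9f00"
If-Modified-Since: Tue, 07 Jul 2015 12:00:00 GMT

HTTP/1.1 304 Not Modified
ETag: "2d-51a4d0e3b9f00"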

For static files Apache will do most of the heavy lifting for us: it will generate both an ETag and a Last-Modified header based on the filesystem stats of the static file. Unfortunately, due to the way Heroku dynos operate, Apache will not set either of these headers properly, which prevents proper caching of the static asset.

By default Apache 2.2 generates ETags using the filesystem i-node, last modified time, and file size of the file. With Heroku, when we push any changes our code gets packaged with Apache and PHP into a compiled slug, which is then deployed to dynos as they are spun up and down dynamically on an ephemeral filesystem. This means that as Heroku allocates dynos to our app (which may happen at any time without action from us) the underlying filesystem i-node for any given static file will fluctuate. Even worse, if we have multiple dynos running our app, each dyno will report a different i-node value for any particular static file and thus will calculate a different ETag value. The net effect is that we end up confusing any downstream browsers / reverse proxies, potentially causing them not to cache our content and to respond incorrectly to any If-None-Match requests. To fix this we need to configure Apache to use only the last modified time and file size to calculate ETag values.

In addition to sending differing ETags for the same content, it’s also possible for us to have the exact opposite problem: sending the same ETag for different entities. The W3C designed ETags to identify unique instances of entities, which includes the encoding used for transport. This is important because what is actually transferred for the same CSS file served compressed versus uncompressed is vastly different, so our server should mark the entities as distinct so that intermediate reverse proxies know to treat the transferred content as different.

Because Apache generates the ETag based on filesystem stats, any transformations applied to the file are not taken into account. This means that when mod_deflate is enabled, both the deflate-encoded and plain instances of each asset will have the same ETag value as calculated by Apache. This is not compliant and could cause reverse proxies to send improperly encoded content via ranged requests. There is an open Apache ticket for this but no timeline for a fix. So instead of waiting for a patch it’s better to configure Apache to not calculate ETag values when mod_deflate is turned on and rely on Last-Modified for content validation.

Putting these two things together, I added the following to the httpd.conf for my Heroku app.

# Only use modified time and file size for ETags
FileETag MTime Size

# Don't use ETags if mod_deflate is on (Apache bug)
# https://issues.apache.org/bugzilla/show_bug.cgi?id=39727
<IfModule mod_deflate.c>
    FileETag None
</IfModule>

As a side note, when new code gets pushed to Heroku all files within the app will have the time of the push assigned as their last modified date. This keeps last modified times consistent across dynos, but it also means our static assets will be sent back in full to the browser instead of a 304 Not Modified response for If-Modified-Since requests after deploys. This is not ideal, but it’s not super terrible, and I don’t know of a simple way to solve this issue.

Dynamic Version Slugs & Long Expires

The easiest way to speed up repeat page loads is to cache page assets like JavaScript and CSS so that as a visitor clicks around the site their browser will only request the updated content from the server. And if those assets have a long enough expires value, the visitor will get the same quick load time when they come back in a day or week or month. The only problem is that if we cache assets for too long on the user’s browser and then decide to change something on the site, our visitor will get outdated assets.

The easiest way to fix this issue is to version our assets so that we can specify a long expires value for each unique version and then simply update the asset version number every time we change the content. Ideally we specify the version slug within the path of the URL instead of simply adding it as a query string, because some reverse proxies will not cache resources with a query string even when a valid caching header is present.

For this site I created an /assets/ subdirectory under the document root and placed the following .htaccess file within it.

<IfModule mod_rewrite.c>
    RewriteEngine On
    RewriteBase /assets

    # Don't rewrite for actual files
    RewriteCond %{REQUEST_FILENAME} -f
    RewriteRule . - [L]

    # Remove version slug & rewrite to asset
    RewriteCond %{DOCUMENT_ROOT}/assets/$1 !-d
    RewriteCond %{DOCUMENT_ROOT}/assets/$2 -f
    RewriteRule ^([^/]+)/(.*)$ /assets/$2 [L]

    # Set 360 day expiration on everything
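    # (360 days * 24 hours * 3600 seconds = 31,104,000 seconds, just under RFC 2616's one-year cap)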
    <IfModule mod_headers.c>
        Header set Cache-Control "max-age=31104000"
    </IfModule>
</IfModule>

This allows me to inject an arbitrary version slug after /assets/ and have the file served up with a long expires time. So if there is a file on my server at the path assets/js/my_script.js, I can refer to it as /assets/v123/js/my_script.js and then simply replace v123 with any other version number I want for caching purposes.

Please note that I have set the expires value of assets to 360 days, or slightly less than one year, because according to RFC 2616:

To mark a response as “never expires,” an origin server sends an Expires date
approximately one year from the time the response is sent. HTTP/1.1 servers
SHOULD NOT send Expires dates more than one year in the future.

So if we send something with a max-age of 10 years, it may get interpreted as an invalid expires value and invalidate our caching headers.