Web Fundamentals: HTTP Caching

Let’s walk through the mechanics of HTTP caching. HTTP caching reduces latency by delivering content from caches that are closer to the client, and it saves bandwidth because no network traffic is required to serve a (locally) cached resource.

There are two types of caches: private caches and public caches.

A public cache is a shared cache which usually sits between the server and the user agent (browser). Such public caches or HTTP proxies can typically be found at large corporations and ISPs. Public caches are not used for resources that require HTTP authentication. Furthermore, HTTPS-encrypted traffic cannot be cached by intermediaries, since they cannot read it.

A private cache is located at the client and cannot be used by other clients. It’s usually the browser’s cache. Authenticated and encrypted requests are also subject to private caching if not stated otherwise. If you don’t want sensitive information (e.g. credit card details) stored on the user’s client, caching should be disabled (see Cache-Control: no-store).

Controlling caching with HTTP headers

With the HTTP header Cache-Control you can specify the caching policies for requests and responses. A caching policy for a response could, for example, tell the user agent that the response must not be cached, or that caching is allowed in private caches only.

no-cache vs. no-store

The no-store directive advises the user agent and public caches not to store the response, so no copy of the response is kept locally.

The no-cache directive forces caches to revalidate with the origin server before releasing a cached copy.

Let’s look at common Cache-Control header directives to control response caching:

  • no-store – disables caching; no local copies are stored
  • no-cache – allows caching, but a validation request must be sent to the server before a cached copy is used
  • public – marks authenticated responses as cacheable. By default, authenticated responses are treated as private.
  • private – allows caching for a single user, usually in the user agent’s cache
  • must-revalidate – instructs caches to follow the defined freshness rules without exceptions. In some circumstances caches are allowed to serve stale content; this directive prevents that.
  • max-age=<seconds> – defines the time in seconds, relative to the request, after which the cached version expires
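
For example, a response that browsers may cache and reuse for one hour, but that shared caches must not store, could carry the following header:

Cache-Control: private, max-age=3600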

How not to control caching

  • HTML meta tags – they are only interpreted by some browsers and are ignored by intermediate caches
  • Pragma HTTP headers – Pragma is a legacy HTTP/1.0 mechanism and is not reliably honored for responses

Cache validation

ETags

An ETag is a fingerprint of the resource’s content. If the cached version of a response has expired, the user agent sends the cached fingerprint along with the request. The server compares the fingerprints and can skip sending the body by returning a 304 “Not Modified” response instead of the actual (unchanged and thus already cached) content.

Request with If-None-Match header

To make a request conditional, the client sends the If-None-Match HTTP header with the cached ETag value. The server responds with a 200 “OK” if and only if the ETag sent with the request does not match the ETag of the current version of the resource.

If the none-match condition fails, which means that the resource hasn’t changed, the HTTP server responds with a 304 “Not Modified” status.
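
A full conditional exchange could look like this (the ETag value is illustrative):

GET /wiki/Tim_Berners-Lee HTTP/1.1
Host: de.wikipedia.org
If-None-Match: "33a64df551425fcc55e4d42a148795d9f25f89d4"

HTTP/1.1 304 Not Modified
ETag: "33a64df551425fcc55e4d42a148795d9f25f89d4"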

An interesting fact about ETags is that they can be abused for user tracking. You’ll find more details in the ETag Wikipedia article.

Invalidation and update of cached resources

Your users deserve fast loading times and thus you’re extensively using caching with long expiration times. That’s great, but how do you make sure that your users get the latest and greatest updates of your web application?

To profit from caching and still make sure that new resources get loaded, you can change the filename whenever the file’s content changes. Usually, a hash of the file’s content is computed and appended to the file name. This ideally happens at build time.

This approach can only work if the HTML document is re-validated on each request. Otherwise, the new URLs are not visible to the client.
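
A minimal sketch of such a build-time step could look like this (the class and method names are illustrative, and the sketch assumes the file name has an extension):

import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

public class Fingerprint {

  // Derives a fingerprinted file name from a hash of the file's content,
  // e.g. app.js -> app.3f2a9c1b.js
  static String fingerprint(Path file) throws Exception {
    byte[] content = Files.readAllBytes(file);
    byte[] digest = MessageDigest.getInstance("SHA-1").digest(content);

    // use the first four bytes of the digest as a short hex fingerprint
    StringBuilder hash = new StringBuilder();
    for (int i = 0; i < 4; i++) {
      hash.append(String.format("%02x", digest[i]));
    }

    String name = file.getFileName().toString();
    int dot = name.lastIndexOf('.');
    return name.substring(0, dot) + "." + hash + name.substring(dot);
  }
}

A reference like app.js in the HTML document would then be replaced with the fingerprinted name, e.g. app.3f2a9c1b.js.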

HTTP/2 and caching

The major advantage of HTTP/2 is the reuse of an existing TCP connection to transfer multiple resources instead of opening one TCP connection per request.

Caching works as in HTTP/1.1 and is mainly controlled by the Cache-Control headers and ETags with conditional requests. When it comes to web performance optimization, HTTP/2 introduces two features that are not present in HTTP/1.1. Stream prioritization lets the user agent specify in which order it wants to receive resources. Server push sends extra resources to the user agent before it knows that they are needed.
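
For example, many servers can be configured to push resources that are announced in a preload Link header of the response. Whether this actually triggers a push depends on the server; the header below is only illustrative:

Link: </styles/main.css>; rel=preload; as=style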


Web Fundamentals: Overview

The web consists of three separate concepts:

  • URL – Uniform Resource Locator
    • unique identifier for a resource in the web
  • HTTP – HyperText Transfer Protocol
    • Protocol to retrieve a representation of a resource through a URL.
  • HTML – HyperText Markup Language
    • an HTML document can represent a resource and link to other resources through their URL

URL

A URL identifies and locates a resource anywhere in the web.

An identifier is unique if at most one entity corresponds to it. For example, a sales tax identification number uniquely identifies a person. It’s a unique identifier, but it’s not a locator since it won’t tell you where the person can be found.

A locator is unique if at most one location corresponds to it. For example, a postal address uniquely identifies a location. It points to one specific location, but it doesn’t allow you to identify a person.

So a well-defined URL combines identification and location. Let’s look at the following URL:

https://de.wikipedia.org/wiki/Tim_Berners-Lee

The URL identifies the Wikipedia article about Tim Berners-Lee, the inventor of HTML and the World Wide Web. It also allows us to locate the article: we can put the URL in our browser’s location bar to retrieve it.

HTTP

HTTP is a protocol to transfer representations from a server to a client. The protocol standardizes how clients send a request for a representation of a resource through its URL.

HTTP standardizes how servers reply with a response that can contain a representation.

The client request

After resolving the server’s IP address, the client can send an HTTP request. A request consists of three parts:

  • request line – indicates a method, request URI and HTTP version
  • header fields – key-value pairs such as Host, Accept and User-Agent
  • body – an optional body

This is what an HTTP request for the Tim Berners-Lee article looks like:

GET /wiki/Tim_Berners-Lee HTTP/1.1
Host: de.wikipedia.org
User-Agent: curl
Accept: text/html

The request line starts with the HTTP method, which is case-sensitive. The most widely used methods are:

  • GET – transfer a representation
  • HEAD – transfer only status and headers (no body)
  • POST – perform a resource-specific operation
  • PUT – replace all representations
  • DELETE – remove all representations
  • OPTIONS – describe the communication options for the target resource

Read-only methods like GET, HEAD, and OPTIONS do not cause any state change or side effect on the server side. This is an important contract and servers should never break it.

PUT and DELETE can change the state of a resource. Both methods are idempotent: repeating them does not alter the outcome, so the same request can be performed multiple times and the result remains the same.

After the method, you find the request URI (here, the URL path), followed by the protocol version. All three components are separated by a single space character.

Following the request line, the header fields are specified. The request header fields allow the client to pass additional information about the request and about the client itself to the server.

A client must include a Host header in all HTTP/1.1 request messages. Although the client resolves the server’s hostname to an IP address, the hostname is still sent to the server. This has the benefit that one server can host multiple websites. The Host header tells the server which one to pick.

The server response

When a server receives a request, it generates a response. An HTTP response is structured as follows:

  • status line – indicates HTTP version, status code and reason phrase
  • header fields – additional information about the response, including Content-Type, Content-Length, etc.
  • body – an optional body containing the actual content

Let’s see what an HTTP response for our previous request could look like:

HTTP/1.1 200 OK
Date: Sat, 18 Apr 2020 23:47:12 GMT
Content-Type: text/html; charset=UTF-8
Last-Modified: Sun, 24 Jan 2016 17:12:34 GMT

<!DOCTYPE html>
<html lang="en">
...

An HTTP response has a status code which indicates how the request was handled. The status code is a three-digit integer, and all codes are fully defined in the HTTP specification. Status codes are classified into five categories; the first digit defines the class of the response:

  • 1xx – Informational – request received, continuing process
  • 2xx – Success – request understood and accepted
  • 3xx – Redirection – further action required to complete the request
  • 4xx – Client Error – request contains bad syntax or cannot be fulfilled
  • 5xx – Server Error – server failed to fulfill the request

In our example, the server responds with a status code 200, which means that the request was handled successfully.

HTML

The HyperText Markup Language is a markup language (not a programming language) that captures the structure of a document. HTML comes along with other technologies like CSS, which describes the appearance, and JavaScript, which describes behaviour.

The most powerful feature of HTML is hypertext: links that connect web pages to one another. Links can point to resources on the same website or to external ones. This feature is what makes the Web so powerful.

HTML divides a document into elements that are indicated by opening and closing tags, which consist of the element name surrounded by “<” and “>”. An element can have additional attributes, which are located inside the opening tag. Here is an example of an image tag:

<img src="image.jpg" alt="an image" />

The image tag in this example uses a self-closing tag since it has no child nodes. An example of an element with child nodes is the paragraph tag:

<p class="summary">
  This is a paragraph<br/>
  with <em>emphasized</em> words.
</p>

The HTML specification defines a set of elements that have certain semantics and can be used in HTML. The specification also contains rules about the ways in which elements can be nested. An HTML document consists of a tree of elements and text. Here is an example of a basic HTML document:

<!DOCTYPE html>
<html lang="en">
  <head>
    <title>Document</title>
  </head>
  <body>
    <h1>Headline</h1>
  </body>
</html>


Event Sourcing

Systems using Event Sourcing store their persistent state as a sequence of events instead of updating a single model. A particular state can be recreated by replaying all events. It’s basically a persistent transaction log file for your application.

Why should I care?

You’re able to restore the system state to any point in time. More importantly, you have all information available to debug the reason for a certain state. Depending on your business domain, this might be a business or even a legal requirement.

On the source code level, I see a huge advantage in the maintainability and readability of the code. Every change to the system’s state goes through a command whose result is a set of events. This makes it easy for other developers to understand your system.

On the other hand, using event sourcing comes with a cost. Your event store will grow, and that is additional data that needs to be maintained. For performance reasons you have to maintain additional snapshots; if you avoid snapshots, your application might need extra time to recreate a particular state, time that you may not be able to afford. You might also need to maintain a separate read model that has to be kept in sync.

How can I build such a system?

In an event-sourced system all intents are expressed as commands. Command names usually start with a verb (e.g. ShipOrderCommand). A command should clearly express a single intent; generic commands like “Update” or “Insert” are not useful.

Each command has a CommandHandler. Its first responsibility is to check the preconditions, i.e. whether the command can be performed by the caller given the current state, and it may reject the execution. A CommandHandler never modifies any data directly; instead, it returns a sequence of events that are the result of the command. For example, the ShipOrderCommand may return an OrderShippedEvent.

The events returned by the CommandHandlers are then handled by an EventBus that stores them in the EventStore. The EventStore also knows about EventHandlers, which register for certain event types and get notified about new events. A usual job of an EventHandler is updating the domain model.

The EventStore holds all events and provides methods to get all events for a particular entity. It is the transaction log of your application, from which you can rebuild the state of the system.
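
A minimal sketch of these components could look like this (the class names besides those mentioned above are illustrative, not taken from a specific framework):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

interface Event {}

class OrderShippedEvent implements Event {
  final String orderId;
  OrderShippedEvent(String orderId) { this.orderId = orderId; }
}

class ShipOrderCommand {
  final String orderId;
  ShipOrderCommand(String orderId) { this.orderId = orderId; }
}

// Checks the preconditions against the current state and returns events;
// it never modifies any data directly.
class ShipOrderCommandHandler {
  List<Event> handle(ShipOrderCommand command, boolean alreadyShipped) {
    if (alreadyShipped) {
      throw new IllegalStateException("order already shipped");
    }
    return Collections.singletonList(new OrderShippedEvent(command.orderId));
  }
}

// The append-only transaction log of the application.
class EventStore {
  private final List<Event> events = new ArrayList<>();

  void append(List<Event> newEvents) { events.addAll(newEvents); }

  List<Event> all() { return Collections.unmodifiableList(events); }
}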

The following sequence diagram shows this interaction between those components:

[Sequence diagram of the event sourcing components]

This is just a brief overview of the most important components of an event-sourced system.


Exploring Streams in Scala

In this blog post, I’m going to explore Streams in Scala’s Collection API. So first of all, what is a Stream? A Stream is a lazily evaluated list. This means that elements in a Stream only get evaluated when they are needed. Therefore Streams can be infinite while strict collections cannot. A Stream can also be seen as an immutable Iterator.

Creating a Stream

To create a Stream at least two values must be given:

  • an initial value
  • a function to compute the next value

To demonstrate an infinite Stream, I’ll choose the Stream of natural numbers (0, 1, 2, …). The set of natural numbers starts with zero, which is the initial value of our Stream. The successor of the initial value can easily be computed by n+1. Then the successor of n+1 is n+1+1 and so forth. To build such a Stream we could write the following function:

def from(start: Int): Stream[Int] =
  Stream.cons(start, from(start + 1))

The Stream is created by calling the cons function. It takes two parameters. The first parameter is the initial value. The second parameter is the rest of the Stream; it is passed by name, so it will only be evaluated when needed. The cons call returns a new Stream consisting of the initial value and that “rest”.
Ok, I lied a bit. cons feels like a function, but in reality it is an object with an apply method. When we call Stream.cons(...) we call the apply method. Written out in full it would be Stream.cons.apply(...), but we don’t really want to write it like that.

The same function can also be written in a shorter notation using the #:: operator:

def from(start: Int): Stream[Int] =
  start #:: from(start + 1)

Common builder functions like from, range, continually and so on for the creation of Streams can be found in the Stream object.

Now consider the following example:

val nn = from(0)
println(nn.take(10).mkString(","))

A variable nn is created and an infinite Stream of natural numbers is assigned to it. At this point the from function is only called once; the recursive call from(start + 1) doesn’t happen yet. The next line takes the first 10 elements of the Stream and prints them to the console. Now guess how many times the from function will be called. Right, 9 times, because the initial value (zero) has already been initialised.

0|1|2|3|4|5|6|7|8|9|...

(Just call nn.toString in the REPL to verify.)

Filtering/Mapping Streams

A Stream can be used like every other collection because it is of type LinearSeq. Operations like filter and map are lazily evaluated: they simply return another Stream and perform the operation when required. I think we can say that all operations that return another Stream can be considered lazy operations (like filter and map). Of course, operations like foldLeft, foldRight, find and exists are not lazy.

For example, to keep only the even values:

val even = nn filter (_ % 2 == 0)
println(even.take(10).mkString(","))

The first line calls filter on the Stream of natural numbers. Calling the filter method returns a new Stream. The new Stream uses the original Stream as its input and only returns elements for which the filter condition holds true. In short: filter is lazily evaluated. The second line prints the first 10 even elements to the console. This should be: 0,2,4,6,8,10,12,14,16,18

Now the first 20 elements of the natural numbers Stream have been initialized:

0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|...

Streams can be used in pattern matching

A Stream can be used in pattern matching. To do so you can use the #:: extractor. The following example prints the string “matched” to the console if the first two elements of our nn Stream are 0 and 1. And yes, it does 🙂

nn match {
  case 0 #:: 1 #:: _ => println("matched")
}

Conclusion

Stream is the immutable equivalent of Iterator. While an Iterator doesn’t keep computed values, a Stream does. So as with any immutable data structure, you don’t have to care about state with Streams. This is useful when you pass a Stream around to other functions. If you do so with an Iterator, you have to manage its state, which requires much more caution and can lead to logic errors. The immutability of Streams also makes it quite easy to compose them. Keep in mind that a Stream has a higher memory footprint than an Iterator because it keeps the computed elements.


Helper for streaming MongoDB GridFS files in Lift web applications

Lift is a web application framework written in Scala that comes with native integration for MongoDB. The module is called “lift-mongodb” and integrates MongoDB as the persistence layer for Lift’s Record and Mapper frameworks.

GridFS is a specification for storing large files in MongoDB. Most drivers support it directly.

In this post, I’m going to develop a helper that makes GridFS files accessible via HTTP. Furthermore, the helper should support HTTP caching so the files can be cached by the clients.

Let’s get started.

Basic setup

I assume you have a plain Lift project. I’m using sbt for building the Lift application. If you (for whatever reason) prefer Maven, you can certainly do so.

First of all, we need to tell Lift that we want to use MongoDB. Therefore we’ll add the lift-mongodb module as a dependency. See http://www.assembla.com/wiki/show/liftweb/lift mongodb to find out more.

val lift_mongo = "net.liftweb" % "lift-mongodb" % "2.2"

First shot

To start simple, I wrote an object named GridFSHelper with a get function. The get function takes a file name as its only argument and returns a value of type Box[LiftResponse]. Like a real-world box, the Lift Box can be empty or full, and so Box has two subtypes called Empty and Full.

The behaviour of the get function is as follows: it uses the default Mongo connection to query for a file with the given filename. If no file was found, it returns Empty to signal that it has nothing to respond with. This results in a 404 (Not Found) HTTP message.

If the file was found, it returns a “Full Box” containing a StreamingResponse. A StreamingResponse takes six arguments: first, the InputStream that should be sent to the client. The second argument is a function which is called when the stream is done or aborted, which is perfect for cleaning up resources. The third argument is the length of the stream. The last three arguments are a list of HTTP header fields, a list of cookies and the HTTP status code.

object GridFSHelper {

  def get(filename: String): Box[LiftResponse] = {
    MongoDB.use(DefaultMongoIdentifier) ( db => {
      val fs = new GridFS(db)

      fs.findOne(filename) match {
        case file: GridFSDBFile =>
          val headers = ("Content-Type" -> "application/octet-stream") :: Nil
          val stream = file.getInputStream

          Full(StreamingResponse(
            stream,
            () => stream.close,
            file.getLength,
            headers, Nil, 200))

        case _ => Empty
      }
    })
  }
}

You can use the GridFSHelper by binding it to a URI.

Add the following code to the Boot.scala file.

LiftRules.dispatch.append {
  case req @ Req(List("files", filename), _, _) => {
    () => GridFSHelper.get(req, filename + "." + req.path.suffix)
  }
}

However, this implementation has some restrictions:

  • The content type is not set properly.
  • It doesn’t support HTTP caching (no 304 messages).

Know the content type

Currently we have the following line which sets a fixed content type.

val headers = ("Content-Type" <del>> "application/octet</del>stream") :: Nil

This means all responses have the same content type, regardless of whether it’s an image, an HTML page or a PDF. This is far from perfect.

To determine the file’s type we can look at the file extension. Fortunately, web containers do this already, so we don’t have to implement it ourselves.

To get the content type evaluated just replace the previous line with the following:

def get(filename: String): Box[LiftResponse] = {
  // some code ...
  val headers = ("Content-Type" -> contentType(filename)) :: Nil
  // more code ...
}

private def contentType(filename: String) =
  LiftRules.context.mimeType(filename) openOr "application/octet-stream"

Now the HTTP response should come with the right content type. If not, the content type for the given file extension is not known by the web container. In that case, you can add the content type to the web.xml:

For example:

<mime-mapping>
  <extension>svg</extension>
  <mime-type>image/svg+xml</mime-type>
</mime-mapping>

Handle HTTP caching

In the default configuration Lift sets a bunch of HTTP header fields to tell the client that nothing should be cached. This rule applies to our GridFS response as well. To allow clients to cache our response we have to reset some HTTP header fields:

val headers =
  ("Content-Type" -> contentType(filename)) ::
  ("Pragma" -> "") ::
  ("Cache-Control" -> "") :: Nil

Namely, we have to reset the Pragma and the Cache-Control fields.

Next, we have to set the Date, Last-Modified and Expires headers. The header list will now look like this:

val headers =
  ("Content-Type" -> contentType(filename)) ::
  ("Pragma" -> "") ::
  ("Cache-Control" -> "") ::
  ("Last-Modified" -> toInternetDate(lastModified)) ::
  ("Expires" -> toInternetDate(millis + 10.days)) ::
  ("Date" -> nowAsInternetDate) :: Nil

Great, our HTTP headers are now set properly. Next, we need to check the request to see if we can return a 304 (Not Modified) response. This tells the client that there is no need to download the whole file again; the client can use the cached file.

Fortunately, there is a testFor304 function in Req which we can use.

def get(req: Req, filename: String): Box[LiftResponse] = {
  // some code ...
  req.testFor304(lastModified, "Expires" -> toInternetDate(millis + 10.days)) openOr {
    // create and return the StreamingResponse
  }
  // more code ...
}
As you can see, I introduced a new parameter req to pass the current request to the function.

All the magic is done by the testFor304 function. If its return value is Empty, we have to build our response; otherwise, we can simply return the already prepared 304 response.

This simple helper allows us to stream files from GridFS to the client. It sets the proper content type and supports HTTP caching.

The complete code can be found on GitHub: http://gist.github.com/653101

Comments and improvements welcome!


Parsing chunks of XML documents with JAXB

JAXB is a great Java library for mapping XML documents to Java objects and vice versa. But how can JAXB be used to parse large XML documents?
The Unofficial JAXB Guide contains a small section which provides some useful information about this topic.

Assume we have an XML document similar to the following:

<Example id="10" date="1970-01-01" version="1.0">
   <Properties>...</Properties>
   <Summary>...</Summary>
   <Document id="1">...</Document>
   <Document id="2">...</Document>
   <Document id="3">...</Document>
</Example>

Now I want to unmarshal the Example element into the corresponding Example object. If I do so, the whole XML document gets unmarshalled. If the XML document contains hundreds of thousands of Document elements, it will consume a huge amount of memory. But at this point, I’m only interested in the Example element with its Properties and Summary elements; the Document elements can be parsed in chunks.

To reach that goal I use virtual infosets as described in the JAXB Guide. Therefore I created a simple class named PartialXmlEventReader which implements XMLEventReader and delegates all method calls to a parent reader. As a constructor argument it takes the QName of an element. When the reader finds the first start element of that type, it closes the enclosing element by returning an EndElement event. So the XML document above will look like this to the caller of the reader:

<Example id="10" date="1970-01-01" version="1.0">
  <Properties>...</Properties>
  <Summary>...</Summary>
</Example>

As the parent reader is still located at the first Document start element, we can use the same reader to parse the Document elements.
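
The core of such a reader could look roughly like this (a simplified sketch of the idea; the complete version is linked at the end of this post):

import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;
import javax.xml.stream.util.EventReaderDelegate;

public class PartialXmlEventReader extends EventReaderDelegate {

  private final QName chunkElement;
  private final XMLEventFactory factory = XMLEventFactory.newInstance();
  private QName rootElement;
  private boolean finished = false;

  public PartialXmlEventReader(XMLEventReader parent, QName chunkElement) {
    super(parent);
    this.chunkElement = chunkElement;
  }

  @Override
  public boolean hasNext() {
    return !finished && super.hasNext();
  }

  @Override
  public XMLEvent nextEvent() throws XMLStreamException {
    XMLEvent next = super.peek();
    if (next.isStartElement()) {
      QName name = next.asStartElement().getName();
      if (rootElement == null) {
        rootElement = name; // remember the enclosing element
      } else if (name.equals(chunkElement)) {
        // close the enclosing element without consuming the chunk's
        // start element from the parent reader
        finished = true;
        return factory.createEndElement(rootElement, null);
      }
    }
    return super.nextEvent();
  }
}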

The following code demonstrates the use of the PartialXmlEventReader:

@Test
public void testChunks() throws JAXBException, XMLStreamException {

  final QName qName = new QName("Document");

  InputStream in = getClass().getResourceAsStream("example.xml");
  if (in == null)
    throw new NullPointerException();

  // create an XML event reader for the input stream
  XMLInputFactory xif = XMLInputFactory.newInstance();
  XMLEventReader reader = xif.createXMLEventReader(in);

  // initialize JAXB
  JAXBContext jaxbCtx = JAXBContext.newInstance(Example.class, Document.class);
  Unmarshaller um = jaxbCtx.createUnmarshaller();

  // unmarshal the Example element without parsing the Document elements
  Example example = um.unmarshal(new PartialXmlEventReader(reader, qName),
      Example.class).getValue();

  assertNotNull(example);
  assertEquals("My Properties", example.getProperties());
  assertEquals("My Summary", example.getSummary());
  assertNull(example.getDocument());

  Long docId = 0L;
  XMLEvent e = null;

  // loop through the XML stream
  while ((e = reader.peek()) != null) {

    // check whether the event is a Document start element
    if (e.isStartElement() && ((StartElement) e).getName().equals(qName)) {

      // unmarshal the next Document element
      Document document = um.unmarshal(reader, Document.class).getValue();

      assertNotNull(document);
      assertEquals(++docId, document.getId());

    } else {
      reader.next();
    }
  }
  assertEquals(new Long(10), docId);
}

You can find the source code of the PartialXmlEventReader here.


Generating large PDF documents with Apache FOP

Some days ago I had trouble generating large PDF documents (> 2000 pages) with Apache FOP. The problem was the memory consumption while rendering the document. In my opinion, increasing the JVM memory beyond 2 GB was not an acceptable solution, so I had to find a way to optimize my templates.

Fortunately, the Apache FOP FAQ contains a section about memory usage which gives some very useful hints on optimizing templates.

In my case, I had defined one page-sequence element for the whole document, but logically the PDF document contained multiple documents with several pages each. So it was quite easy to define a separate page sequence for each logical document. With this little modification in my template, the memory usage shrank from about 2 GB to under 60 MB and the rendering process finished notably faster.

The initial template looked something like this:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:fo="http://www.w3.org/1999/XSL/Format">

<xsl:template match="/">
  <fo:root>
    <!-- fo:layout-master-set defining the "A4" page master omitted -->
    <fo:page-sequence master-reference="A4">
      <xsl:apply-templates select="Document" />
    </fo:page-sequence>
  </fo:root>
</xsl:template>

<xsl:template match="Document">
  <!-- Page content goes here -->
</xsl:template>

</xsl:stylesheet>

After optimizing the template it looked like this:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:fo="http://www.w3.org/1999/XSL/Format">

<xsl:template match="/">
  <fo:root>
    <!-- fo:layout-master-set defining the "A4" page master omitted -->
    <xsl:apply-templates select="Document" />
  </fo:root>
</xsl:template>

<xsl:template match="Document">
  <fo:page-sequence master-reference="A4">
    <!-- Page content goes here -->
  </fo:page-sequence>
</xsl:template>

</xsl:stylesheet>
