Tuesday, February 14, 2012

Choosing the right browser local storage

So your new shiny HTML5 web app needs to persistently store data in the browser. What do you store where?

Good ol' caching

Just because you're building a fancy-pants offline HTML5 app doesn't mean you should overlook the classics. Lots of content can be happily cached the old-fashioned way: by setting appropriate expiration headers. This works great for things like images, especially when their presence is not absolutely critical for the functioning of the app. This lets you conserve your more sophisticated persistent caches for things that are truly critical. Like...
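
As a rough sketch of what "appropriate expiration headers" might look like, here is a small helper that builds them for a static asset. The function name and its parameters are made up for illustration; the header names are the standard HTTP ones.

```javascript
// Hypothetical helper (not from any library): builds long-lived caching
// headers for a static asset such as an image.
// `maxAgeSeconds` is how long the browser may reuse its copy;
// `now` is passed in so the result is easy to test.
function staticCacheHeaders(maxAgeSeconds, now) {
  const expires = new Date(now.getTime() + maxAgeSeconds * 1000);
  return {
    // Modern caches honor max-age; Expires covers older ones.
    "Cache-Control": "public, max-age=" + maxAgeSeconds,
    "Expires": expires.toUTCString(),
  };
}
```

On a Node server you could pass the result straight to something like res.writeHead(200, staticCacheHeaders(86400, new Date())), assuming that is how your assets are served.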

Application Cache

The application cache is driven by the oft-misunderstood cache manifest file. This is where you should cache your main page, critical JavaScript, and stylesheets. Don't spend your application cache space frivolously, because once you hit quota, not all browsers will even give the user the opportunity to grant more. Chrome and Safari, for example, will simply refuse to cache your content. Storage quotas are poorly documented and rapidly changing.

As long as you keep it to code and markup, staying under quota shouldn't be difficult. You don't need to worry about versioning, as the browser takes care of all that. However, you should probably incorporate version tracking into your build & release process, and make your clients publish their version numbers to the server so you can see who's on what version.
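
One way to fold version tracking into a build step is to generate the manifest and stamp the version into a comment line. The browser only refetches cached files when the manifest's contents change, so the comment doubles as an update trigger. This is a sketch under assumptions: buildCacheManifest, the version string, and the file names are all hypothetical.

```javascript
// Hypothetical build-time helper: returns the text of a cache manifest.
// `version` is whatever your release process calls this build;
// `files` is the list of URLs to cache (main page, scripts, stylesheets).
function buildCacheManifest(version, files) {
  const lines = [
    "CACHE MANIFEST",
    "# version " + version, // comment line; changing it changes the manifest
    "",
    "CACHE:",
  ].concat(files, [
    "",
    "NETWORK:",
    "*", // everything not listed above goes to the network
  ]);
  return lines.join("\n") + "\n";
}
```

Calling buildCacheManifest("1.0.3", ["index.html", "app.js", "app.css"]) would produce a manifest listing those three files under CACHE:. The same version string can be baked into the client code and reported back to the server.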

HTML5 Storage

The localStorage object is a highly convenient place to stick your actual data models. It's one big persistent key-value store. But beware, there are pitfalls as your data grows! Specifically:

  1. localStorage is subject to a quota that cannot be increased in most browsers (typically 5*2^20 bytes). Again, quotas are an area where browsers are still weak and evolving.
  2. You may be using twice as much storage as you think because the encoding is UTF-16. So you really get 5*2^19 characters. And remember to include the cost of your keys, not just your values.
  3. If you decide to migrate from localStorage to one of the other options described below, prepare for a painful rewrite, because localStorage is synchronous, and the others are asynchronous. Too bad JavaScript doesn't have first-class continuations.
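
The pitfalls above can be sketched in code. The first function estimates how much quota a store is using, assuming two bytes per UTF-16 code unit and counting keys as well as values. The second wraps the store in an asynchronous interface from day one, so a later move to an asynchronous backend doesn't ripple through every caller. Both names are hypothetical, and both functions accept any object with the standard Storage methods (length, key, getItem, setItem), such as window.localStorage.

```javascript
// Rough estimate of bytes used, assuming UTF-16 storage (2 bytes per
// code unit). Real browsers may account for overhead differently.
function estimateStorageBytes(storage) {
  let codeUnits = 0;
  for (let i = 0; i < storage.length; i++) {
    const key = storage.key(i);
    const value = storage.getItem(key);
    // Keys count against the quota too, not just values.
    codeUnits += key.length + (value === null ? 0 : value.length);
  }
  return codeUnits * 2;
}

// Hypothetical asynchronous facade over a synchronous store. Callers
// only ever see promises, so the backend can be swapped later.
function makeAsyncStore(storage) {
  return {
    get: function (key) {
      return Promise.resolve().then(function () {
        return storage.getItem(key);
      });
    },
    set: function (key, value) {
      return Promise.resolve().then(function () {
        storage.setItem(key, String(value));
      });
    },
  };
}
```

In a browser, estimateStorageBytes(window.localStorage) gives a ballpark figure to compare against the 5*2^20 byte quota.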

WebSQL (SQLite)

This is an abandoned web standard, so you might be tempted to skip it. However, it is the most widely available option for storing structured data, and not likely to go away just yet. It has the benefit of an expandable quota in WebKit-based browsers. There are a few pitfalls to avoid:

  1. You pick an expected size when a database is first created. This size is checked against the available quota and possibly prompts the user to grant more quota.
  2. The aforementioned interaction only happens on creation, and you can't delete a database once it's created. So if you later update your code to ask for more quota, you will be silently ignored. The openDatabase call will succeed, but your quota hasn't really increased, and when you fill it up the user may or may not be offered the opportunity to expand it.
  3. You can't list existing databases to figure out where all your quota went (at least not from within JavaScript, though you can usually find the underlying SQLite files on disk).
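
Given those pitfalls, it seems safest to ask for a generous size the very first time the database is opened. A minimal sketch, assuming a hypothetical database name ("appdata"), a hypothetical key-value table (kv), and an arbitrary 50 MB estimate; openDatabaseFn stands in for the browser's window.openDatabase so the code can be exercised outside a browser.

```javascript
// Size requested on first creation. Asking for more later is silently
// ignored, so err on the high side. 50 MB is an arbitrary example.
const ESTIMATED_SIZE_BYTES = 50 * 1024 * 1024;

// `openDatabaseFn` is expected to behave like window.openDatabase:
// (name, version, displayName, estimatedSize) -> database handle.
function openAppDatabase(openDatabaseFn) {
  return openDatabaseFn("appdata", "1.0", "App data", ESTIMATED_SIZE_BYTES);
}

// Stores one key-value pair. The table and column names are made up.
// WebSQL's transaction() takes (callback, errorCallback, successCallback).
function saveRecord(db, key, value, onDone, onError) {
  db.transaction(function (tx) {
    tx.executeSql("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)");
    tx.executeSql("INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", [key, value]);
  }, onError, onDone);
}
```

In a WebKit-based browser this would be called as openAppDatabase(window.openDatabase.bind(window)), after which saveRecord can be used with the returned handle.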

IndexedDB

The successor to WebSQL, with the W3C's holy blessing. Only available in Chrome, Firefox, and IE10. No iOS support, so I haven't used it extensively.

Filesystem API

You get to read and write files and directories, all within a sandboxed filesystem. This is great for large chunks of data, especially binary chunks. The downsides:

  1. Only supported in Chrome so far.
  2. Naively reading and writing a few tens of thousands of small files (to mimic the localStorage key-value store, for example) is slow, so you'll need to implement your own more intelligent on-disk data structures. If this sounds like unnecessarily reinventing the database, that's because it is. Don't use the Filesystem API for this kind of data, just use one of the above databases.

Conclusion: this stuff is still half-baked, but worth using anyway

The benefits of making your application work offline can be significant, and a combination of the above techniques can cover any modern browser. The biggest pain point is quota management, since most browsers seem to lag behind on adding user interface elements to control all this stuff. In many cases it's not even possible to revoke a quota decision once made without hacking on the browser's files directly.

Friday, February 10, 2012

Many sites block access from Amazon EC2

I'm a generally happy customer of Amazon Web Services. So when I needed to set up a VPN server, I figured EC2 would be a fine place to stick it. Unfortunately, this has some unintended consequences for VPN users whose browsing traffic gets routed out through Amazon's IP space.

Many high-profile sites (including Yelp and the whole Stack Overflow family) block access from EC2. This can lead to pretty unfriendly errors.

I've even seen a site that just breaks subtly when some assets load and others are blocked.

I can only assume EC2 is home to enough badly behaved crawlers and content-stealing bots that they ruined it for the rest of us. I've seen others comment on the difficulty of sending email from EC2 due to reputation problems, but I haven't seen much comment on this HTTP blacklisting. For me it's just an inconvenience, but if I were trying to build a search engine it would make EC2 unusable.