The Guinea Pig in the Cocoa Mine

Underground Cocoa Experiments

Contentless SQLite FTS4 Tables for Large Immutable Documents

TL;DR? If you use SQlite FTS4 for indexing large immutable documents, you might want to consider contentless tables. With minor adjustments, the database could be 2 to 3-time smaller.

Gus Mueller recently detailed how Acorn 5’s Help menu implements live searching of all their online documentation. The actual help content is huge and hosted on Acorn’s website. To provide local search from within the app, a small SQLite database is embedded in Acorn. The database uses the brilliant FTS4 extension extension module which allows to create virtual tables for super-fast full-text search. Combine this with NSUserInterfaceItemSearching, and you get a clever yet simple setup for a powerful Help menu that seamlessly connects Acorn and its extensive online documentation (did I mention Acorn 5 is awesome?).

Contentless FTS4 Tables

I just happen to have been knee-deep in FTS4 last month while working on full-text search for the Findings app (of course leveraging Gus’ own FMBD). One section of the SQLite documentation in particular caught my attention: contentless FTS4 tables. As the name implies, such a table does not store a copy of the content. Yet, just like contentful FTS4 tables, you can perform efficient full-text search on that content. For instance, type this:

SELECT docid FROM data WHERE body MATCH 'bar'

… and you get the list of docids for all the documents containing the word “bar” somewhere in the body column.

Behind the scenes, SQLite creates various data structures for the index, and they are the same for contentful and contentless tables. In contrast, the ‘content’ part of a contentless table is simply empty:

Contentful vs contentless.

This difference is particularly relevant when indexing large documents that exist outside a database, such as a collection of web pages or PDFs. Normal contentful FTS4 tables store the entire content of those documents in the database, which means this data is needlessly duplicated. This is not the case with contentless tables. They work more like the SearchKit framework on OS X, that focuses on building and querying an index, and does not keep a copy of the documents (the client is responsible for that).

It should thus be obvious why contentless FTS4 tables could be more advantageous: they are potentially smaller than a contentful table. In theory. Is that really the case? And how much smaller, really? Let’s use the scientific method and run an experiment to answer those questions!

Contentless vs Contentful

As an example of content, I used the SQLite database that indexes the Acorn help content (it can be easily extracted from the Acorn app package). The database contains 187 entries, and occupies 1,028,096 bytes on disk (roughly 1 Mo). That is pretty small if you consider that the actual Acorn help site is really quite extensive and weighs in at 524 MB! The index of course only deals with the text, but that is still roughly 400,000 characters (quick observation: that’s about half of the database size).

Using this dataset, I performed the following experiment:

  1. create a database using a “contentful” FTS4 table; I expect the database to be pretty much the same as the original; this is what scientists would call a positive control; it is here to make sure that whatever comparison I make with database (2) is valid; if Gus used some special pragma or a different SQLite build and I used his database for comparison, I would end up comparing apples to oranges;
  2. create a database using a contentless FTS4 table (note: the database also has an “info” non-FTS4 table, for reasons explained in the section “Querying Contentless Tables”);
  3. same as (1) but adding the data 10 times (1870 entries);
  4. same as (2) but adding the data 10 times (1870 entries).

Databases (3) and (4) are meant to extrapolate what happens with more entries. Since the exact same data is used 10 times, the ‘content’ will be 10 times bigger while the ‘index’ part will probably not grow as much 1.

The exact script I have used is available for download. It is a Ruby script that expects the original SQLite database ‘HelpIndex.db’ in the same directory when you run it. It creates 4 databases (index1.db, index2.db, index3.db and index4.db). If you don’t want to run the script yourself, feel free to download index1.db and index2.db and compare their contents, e.g. using the great Base app.

Now, the results.

An image is worth five numbers.

Database       Size (bytes)   
--------       ------------   
Original          1,028,096   
1                 1,028,096   
2                   442,368   (43%)
3                 9,089,024   
4                 3,276,800   (36%)

First, database (1) is the exact same size as the original. Good job. Positive controls are very boring when all goes right. But then, of course, it only highlights the really interesting part: a very significant size reduction for contentless tables. Using a contentless table brings the database to less than half of its contentful size. For a larger database, the relative savings appear to be even larger (down to almost one third), in line with our expectations.

The conclusion? For indexing a collection of large documents, contentless FTS4 tables are worth considering. For indexing help pages for instance, and depending on the number and size of the indexed documents, database size could be a significant fraction of an application package. In the case of Acorn, using a contentless table would lead to a 2% size reduction of the app. If Gus decided to switch to a downloadable index instead of packaging it with the app, it would save bandwidth and would provide faster updates to the user. As the help site grows in size, the potential savings would potentially be even larger. Small is beautiful.

Now, to be fair, contentless tables are not all rosy. There are still a couple of drawbacks that I will cover in the next sections. Fortunately, in the use case we are considering here, only the first one is relevant and it is easy to work around.

Querying Contentless Tables

With contentless FTS4 tables, the only values you can get back when running a full-text query are the docids for the matches. It is an error to attempt to retrieve the values of any other column. Here is comparing what happens with contentful and contentless tables with the same column setup:

When querying a contentless table, you can only get docids.

The docids are not useful on their own. In the case of Acorn, we need to display the title of the entry in the Help menu and we need the url to open the page in a browser if the user chooses that entry. Since the original content is not stored in the contentless table, the information is simply not available.

There are different ways to work around this limitation. A simple solution I am describing here is to store the url and title in a separate, normal, non-FTS4 table, populating it in parallel to the FTS4 table, so that the rowids match. Here are the statements for creating the contentful and contentless tables; then populating the tables (latter statements left as an exercise; this is the setup I used in the above experiment where I compared the database sizes, so you can also just check the script):

Let’s add an ‘info’ table to complement the contentless table.

When querying the full-text index of the contentless table, we still only get docids, but we can now use those to get at the information we need: the title and the url. This adds a bit of complexity to the query, and makes it slower. I settled on using 2 subsequent queries, but I am no SQL ninja, and a faster approach probably exists (please let me know!). Given the type of data we consider here, the query will still be very fast (we’re not indexing the HHGTTG). Here is how it goes, again comparing contentful and contentless tables:

A simple 2-step query for a contentless table.

Update Sept 14, 2015: Evan Schoenberg (from the Disqus comment thread at the end of this post) suggests the following single SELECT statement for the query:

SELECT info.url, info.title FROM info, data WHERE (data.body MATCH 'bar') AND (info.rowID = data.docid)

Deleting Content in Contentless Tables

Unlike in contentful tables, it is not possible to delete or update entries in a contentless table. Once you index data, it cannot be altered, and will remain part of any query you do now and in the future. The reason for this requirement is the way an FTS4 index is stored internally. It is optimized to get the rows from a list of tokens (searched words), but not the other way around. To prune a row without knowing which tokens were indexed, SQLite would have to pretty much scan the entire index searching for the rowid. With a contentful table, SQLite can check the content for that row, tokenize it again, quickly get to the relevant tokens in the index and alter things as needed.

It is possible to work around this by keeping track of obsolete docids and adding a new entry for each modification. Then, when you get results from a full-text search, you ignore the entries corresponding to obsolete data. I experimented for a while with such a setup, but Findings documents are heavily edited as the user works on them, and the search index needs to be updated on a regular basis. The table then keeps growing, which defeats the initial purpose of having a smaller database. In the end, a contentful table is not that big anyway, even with hundreds of documents in the Findings library. For better or for worse, Findings documents are mutable, so contentless tables are not a good fit. This is very different from what I covered here…

Going back to our use case: we don’t care about deletion. For immutable documents, this limitation of contentless FTS4 tables is completely irrelevant.

Where does that leave us, then? My earlier experiment showed that the relative space savings for large documents can be significant. When indexing help pages for instance, it is thus an alternative really worth considering. Contentless SQLite FTS4 tables + large immutable documents = Yay!

[Update Sept 8: Comment thread on HN]

1. Using the exact same data 10 times seemed like the best I could do for the scope of this post. Of course, it is a very artificial dataset: in the real world, nobody will ever index the same documents 10 times (right?). Using real data, though, would have made it harder to compare (3)/(4) with (1)/(2) since we would have to use a different corpus of data, and that could introduce artifactual differences (one set of data could be inherently easier to index, independent of the size). Still, the problem with my approach here is that the index will be smaller than it would be with real data, because we are not introducing new tokens with the additional documents. But hey, counter-argument: anyway, with real data, new tokens become increasingly less frequent even as you index more documents because most words will have already appeared in previous documents. OK, ideally, I’d scour the web for all kind of representative datasets, and index all of these. Alas, I only had time to write this footnote as an excuse, so let’s just move on. ↩︎