Which type-safe database library should you use?

December 14, 2019

« Previous post Next post »

Beam versus Squeal: which one is better? Or maybe you've heard good things about Selda or Opaleye. Lots of options, not a lot of guidance.

To answer this question, I took 7 popular options for database libraries and implemented the same project and queries using each one.

The contestants:

Why use any of these libraries?

Presumably you, like me, are sold on the benefits of strong typing to ensure that you write better software. (If you're not, let's just pretend you are for now.) Your application presumably needs some sort of permanent storage. You could just use postgresql-simple for everything, but it's a little embarassing to be writing raw SQL queries as strings and hoping that they work in a language that wants to do anything but.

Thankfully, there are lots of options in the Haskell ecosystem for type-safe SQL queries. You can make sure that you're not forgetting to include columns in your output, or getting them in the wrong order. You can make sure that you're joining tables on an ID for the same entity. You can even reuse queries easily by composing them directly in Haskell, then producing a single query to send to your DB backend. All with the type-checker helping you, making sure you don't create invalid queries and runtime errors.

What's the example project?

We're creating the backend for a website for professional hitmen. Think Fiverr or Upwork, but for paid killers. Each hitman has a handler (handlers might handler multiple hitmen), and hitmen pursue "marks." Once the job is done, hitmen mark their targets as "erased." We'll model this like so in our database. Adding an erased_marks entity doesn't delete any pursuing_marks.

For the purposes of our exercise, we're using Postgres as our database backend. While some of the libraries we'll look at (like Beam) are fairly backend-agnostic and can be used for any database engine, others (like Opaleye and Squeal) only really target Postgres.

In order to serve up data from our backend, we'll need some queries. Specifically, these queries, ranging from trivial ones to ones that exercise how the libraries handle joins, subqueries, and aggregates.

Get all the hitmen
Get all the hitmen that are pursuing active marks (i.e. marks that haven't been erased yet)
Get all the marks that have been erased since a given date
Get all the marks that have been erased since a given date by a given hitman
Get the total bounty awarded to each hitman
Get the total bounty awarded to a specific hitman
Get each hitman's latest kill
Get a specific hitman's latest kill
Get all the active marks that have only a single pursuer
Get all the "marks of opportunity" (i.e. marks that a hitman erased without them marking the mark as being pursued first)

We also want to write updates and inserts using each library to see how they handle them. These should stretch the query capabilities of each library sufficiently to find rough spots.

If you'd like to skip straight to the code, here's the repository containing implementations for each library.

With all that in mind, let's jump into comparing.

Beam

Beam is an attempt to solve the type-safe SQL problem in a completely backend-agnostic way. The way it does this is by adding a type parameter to each query for the backend and having lots of typeclasses to specify functionality. Unfortunately, this gets really old, really fast, especially when you have to use typeclasses with names like HasSqlEqualityCheck backend Int64 and BeamSqlT071Backend backend, because god forbid you want to use BIGINTs. (You'll need both of those on the same query, by the way.)

You can get around this by ignoring the backend-agnosticism entirely and specifying the specific backend in each query, but even then the types of your queries will be a mess of inscrutable QExprs, QAggs, and an ever-present s type parameter for query scoping.

Once you get past all of that, building and composing queries in Beam is actually fairly nice; subqueries can be reused very easily by using Beam's monad instance for its queries, and joins (both inner and left) are one line away. It's easy to define queries that return ad-hoc tuples, without needing to define new datatypes. It's just that the Beam types are obnoxious. You might be able to get around that with type aliases and clever type families, but at point... why bother, when some of the other library choices do the same thing without the fuss?

Some other red flags are that there's no easy way to get distinct rows, so you'll need to explicitly aggregate and group on all the columns you want. Beam also seems to type SUMs incorrectly, at least on Postgres; it doesn't correctly change the column type. For instance, in Postgres, summing over a BIGINT column will give you back a NUMERIC result, but in Beam-land, if your datatype attribute is an Int64, summing it will still try to give you back an Int64, which causes a runtime error.

Overall I can't recommend Beam; it's too finicky.

latestHits :: Q Postgres HitmanDB s
  ( HitmanT (QExpr Postgres s)
  , MarkT (QExpr Postgres s)
  )
latestHits = do
  hitman <- allHitmen
  mark <- allMarks

  (minHID, minCreated, minMarkID) <- minID
  (latestHID, latestCreated) <- latest

  guard_ (minHID ==. latestHID)
  guard_ (just_ minCreated ==. latestCreated)
  guard_ (minHID ==. HitmanID (_hitmanID hitman))
  guard_ (minMarkID ==. just_ (_markID mark))

  pure (hitman, mark)

  where minID = aggregate_ (\em -> ( group_ (_erasedMarkHitman em)
                                   , group_ (_erasedMarkCreatedAt em)
                                   , min_ (getMarkID $ _erasedMarkMark em)
                                   ))
                           allErasedMarks

        latest = aggregate_ (\em -> ( group_ (_erasedMarkHitman em)
                                    , max_ (_erasedMarkCreatedAt em)
                                    ))
                            allErasedMarks

        getMarkID (MarkID id) = id

Opaleye

Opaleye is a SQL DSL specifically for Postgres. Out of the box, it was the solution that "just worked" without major tweaking or having to completely ignore core parts of the library.

Writing queries works. Composing queries works. The types are (relatively) simple and easy to work with. ~~Joins are a little bit painful, because left joins break type inference and require type annotations, but inner joins work fine.~~ As of Opaleye version 0.6.7006.1, left joins are now inferrable.

One major distinction between Opaleye and other DB libraries is the way you define table schemas. All the schema definitions are done at the term level, in a way that's mostly independent of your domain types themselves. Because of the way it's set up (using product-profunctors), you can easily abstract out common columns. For instance, I used it to abstract out the definitions of created_at and updated_at timestamps. Further, Opaleye makes a distinction between write-time and read-time data, so it's easy to, say, make it so certain columns can't get written on inserts/updates (like the aforementioned timestamps).

While previous versions of Opaleye had issues with correctly typing aggregate columns like sums, as of Opaleye version 0.6.7006.1, the library has functions to handle aggregates properly. Additionally, it's now possible to use the library entirely through a monadic interface rather than arrows, avoiding some cognitive overhead that was previously necessary. One cognitive hurdle that you will have to get over is that the library uses product-profunctors everywhere. Thankfully, you should be able to get by without a deep knowledge of it; just add p2s and p3s wherever the documentation tells you to.

Overall Opaleye just works, and it's my personal recommendation. It has a bit of an abstraction learning curve and takes a little getting used to, but after you do, the facilities it gives you to factor out common functionality and recombine parts of your queries makes it my personal favorite among the DB libraries we've looked at.

latestHits :: Select (HitmanF, MarkF)
latestHits = do
  (hID, created, mID) <- byDate
  (maxHID, maxCreated) <- maxDates
  h <- selectTable hitmenNoMeta
  m <- selectTable marksNoMeta

  viaLateral restrict $ hID .== maxHID .&& created .== maxCreated
  viaLateral restrict $ hID .== hitmanID h
  viaLateral restrict $ mID .== markID m

  pure (h, m)

  where byDate = aggregate (p3 (groupBy, groupBy, min)) $ do
          ( ErasedMark { erasedMarkHitmanID = hitmanID
                       , erasedMarkMarkID = markID
                       }
            , (createdAt, _)
            ) <- selectTable erasedMarkTable
          pure (hitmanID, createdAt, markID)
        maxDates = aggregate (p2 (groupBy, max)) $ do
          ( ErasedMark { erasedMarkHitmanID = hitmanID }
            , (createdAt, _)
            ) <- selectTable erasedMarkTable
          pure (hitmanID, createdAt)

Squeal

Squeal is a bit of an odd child; less of a convenient DSL, it's meant to be a deep embedding of SQL in Haskell itself. So it's much closer to writing actual SQL, and doesn't try to abstract things away, down to having your SQL keywords appearing in the right place in your query.

This rigidity makes Squeal extremely painful to use in practice, as it uses types to enforce, for instance, that you put your WHERE clause after all the tables you're joining from are brought into scope. Since Squeal uses a pure combinatorial approach, rather than the arrows/monads of other libraries, using it becomes a juggling exercise of nested parentheses and constantly hopping between different levels of nesting. Frankly, it feels like a mess.

Squeal also uses OverloadedLabels for selecting columns and tables, and goes even further than everything else on this list, in that it not only asks you to type your columns, but also keeps track of which names you used for each column. Which, admirable, but also extremely annoying when composing a subquery into another and you have to explicitly reselect the subquery results using the exact name.

This insistence on naming columns causes a lot of other problems as well. The way you return instances of your domain types is to explicitly name the columns of the query the same as the properties in your datatype, using SQL AS. There's no easy way to just define the correspondence once and forget about it, which means that even if you're just selecting all the entities out of a single table, you have to explicitly rename all the columns. Fun! Want to return an ad-hoc tuple of data from a quick one-off query? Sorry, can't do that; tuples don't have named fields, so how can you label your columns correctly? In fact, any time you want to return data in a new form, you'll need to define a completely new datatype for that and rederive Squeal's special typeclasses.

While working with Squeal, I felt like I never stopped running into stumbling blocks. Squeal's query type has type parameters for both the query's inputs and outputs, but there doesn't seem to be a way to then pass an input parameter to a subquery? So you end up just having to duplicate query code. Sometimes using subqueries just... caused a runtime error for no discernable reason, despite type checking. I hope you never misspell a column either, or Squeal will drown you in a sea of inscrutable type errors.

Overall, Squeal feels like it succeeds at its goal of embedding SQL in Haskell, but fails to actually be able to describe common SQL patterns without breaking down. Not recommended.

latestHits :: Query_ Schema () HitInfo
latestHits = select_
  (#minid ! #hitman_id `as` #hiHitmanID :* #minid ! #mark_id `as` #hiMarkID)
  ( from ((subquery ((select_
           ( #em ! #hitman_id
          :* #em ! #created_at
          :* (fromNull (literal @Int32 (-1)) (min_ (All (#em ! #mark_id)))) `as` #mark_id
           )
           ( from (table (#erased_marks `as` #em))
             & groupBy (#em ! #hitman_id :* #em ! #created_at ))) `as` #minid ))
    & innerJoin (subquery ((select_
                  ( #em ! #hitman_id
                 :* (max_ (All (#em ! #created_at))) `as` #created_at
                  )
                  ( from (table (#erased_marks `as` #em))
                    & groupBy (#em ! #hitman_id) )) `as` #latest))
        (#minid ! #created_at .== #latest ! #created_at)) )

Persistent + Esqueleto

Persistent is a thin persistence layer for doing simple CRUD operations. Esqueleto is a SQL DSL on top of Persistent that adds the ability to do joins and more complicated queries.

There isn't much to talk about with Persistent, since it's just providing the persistence layer, so let's talk about Esqueleto instead. Esqueleto is meant to be a very lightweight query language, while still providing juuuuust enough power to give you the things you'd expect from a well-written Haskell library, like some type safety and a bit of compositionality.

For me, though, I found that this focus on a simple query interface just made the library require about as much mental energy as writing plain SQL, if not more. It's something of a halfway approach, where there's some checking that you're writing queries that make sense, but the library still forces much of the responsibility of writing syntactically-correct and well-formed queries on you.

For instance: your queries will happily compile without warnings but emit syntactically-incorrect SQL at runtime if you forget an ON clause on your joins. Or happily crash at runtime when you try to select a mix of aggregate and non-aggregate columns. The query DSL itself is practically a 1:1 translation with raw SQL, including a lot of possible ways to misuse it. So I hope you're already familiar with SQL and its quarks, because Esqueleto makes very few attempts to hide SQL's warts from you.

On top of this, Esqueleto is by far the least featureful in terms of supporting typical RDBMS functionality. A conspicuous absence is the lack of ability to join on subqueries, and you'll basically have to write all your queries using only a single SELECT and clever join conditions on existing tables. I was able to implement all the hitman queries correctly, but it felt like I had to heavily contort them to get them to work.

Overall, even though I was able to implement the project in Esqueleto, I felt like I wasn't gaining much over just writing raw SQL queries. In many ways, it felt far too restrictive, owing to the library's somewhat barebones set of features. You could either have the cognitive simplicity and flexibility of writing raw SQL, or the strict type safety and heavy composability of something like Opaleye. Esqueleto feels like it tries to be a little of both and does neither well.

latestHits :: MonadIO m => SqlPersistT m [(Entity Hitman, Maybe (Entity Mark))]
latestHits = select $
  from $ \(hitman `LeftOuterJoin` emark1
                  `LeftOuterJoin` emark2
                  `LeftOuterJoin` mark) -> do
    on (emark1 ?. ErasedMarkHitmanId ==. emark2 ?. ErasedMarkHitmanId &&.
        emark1 ?. ErasedMarkCreatedAt <. emark2 ?. ErasedMarkCreatedAt &&.
        emark1 ?. ErasedMarkMarkId >. emark2 ?. ErasedMarkMarkId)
    on (emark1 ?. ErasedMarkHitmanId ==. just (hitman ^. HitmanId))
    on (emark1 ?. ErasedMarkMarkId ==. mark ?. MarkId)
    where_ (isNothing $ emark2 ?. ErasedMarkCreatedAt)
    where_ (isNothing $ emark2 ?. ErasedMarkMarkId)
    pure (hitman, mark)

Hasql: not like the others

While implementing this project, I realized that Hasql isn't actually meant to be a full-fledged query/DB library, despite the fact that I'd heard about it in context of Haskell ORM choices. Instead, Hasql is an alternative to postgresql-simple; it's a way to write and run raw SQL queries, optimized for speed. It's not trying to compete with Beam and Opaleye; instead, it's trying to be a backend for libraries like those to emit code to.

So as it turns out, Hasql wasn't in the running in the first place for a type-safe DB library. I've still provided a reference implementation for our hitmen backend, in case you're curious. I've omitted including the inserts and updates, because, well, I assume you know how to write SQL queries.

latestHits :: Statement () (Vector (Hitman, Maybe Mark))
latestHits = Statement sql encoder decoder True
  where sql = [r|
                WITH max_dates AS
                  (SELECT em.hitman_id, MAX(em.created_at) AS max_date
                     FROM erased_marks AS em
                    GROUP BY em.hitman_id),
                     min_marks_by_date AS
                  (SELECT em.hitman_id, em.created_at, MIN(em.mark_id) AS min_mark_id
                     FROM erased_marks AS em
                    GROUP BY em.hitman_id, em.created_at)
                SELECT h.*, m.*
                  FROM hitmen AS h
                  LEFT OUTER JOIN max_dates AS maxd
                    ON h.id = maxd.hitman_id
                  LEFT OUTER JOIN min_marks_by_date AS minmks
                    ON maxd.hitman_id = minmks.hitman_id
                   AND maxd.max_date = minmks.created_at
                  LEFT OUTER JOIN marks AS m
                    ON m.id = minmks.min_mark_id;
              |]
        encoder = Encoders.noParams
        decoder = Decoders.rowVector $ (,)
          <$> hitmanDecoder
          <*> maybeMarkDecoder

While it's not on the same level of expressive power as any of the other options on this list, Hasql does have a few extra tricks up its sleeve. The hasql-th provides some extra type safety around your queries using Template Haskell, and I highly recommend it if you choose to handwrite your queries using Hasql.

Groundhog: Disqualified

Unfortunately, I wasn't able to implement the full example project in Groundhog. Groundhog fell down very fast, as it doesn't support joins.

Next, please.

Selda

Selda is another DSL that feels very much like a stripped-down version of Beam or Opaleye. Like Beam, it also aims to provide a backend-agnostic query interface. Unlike Beam, it doesn't go to quite the same lengths to achieve that, and ends up being fairly pleasant to use.

One nice thing is that relative to the other options on this list, Selda doesn't require too much advanced type machinery to work. It mostly "just works."

However, there are still some red flags. One of the biggest is that Selda doesn't seem to provide DB conversions for fairly common types. For instance, it's missing typeclass instances for lazy Text, as well as Int32 and Int64; it doesn't even provide an instance for Integer! So have fun trying to store a integer larger than 32 bits. Insertion/updates also feel like an afterthought, as Selda won't even allow you to use SQL DEFAULT for columns other than autoincrementing primary keys.

Overall, I'd say that Selda is usable. If you're not willing to bite the bullet for Opaleye's learning curve, Selda is a passable alternative.

latestHits :: Query s (Row s Hitman :*: Row s (Maybe Mark))
latestHits = do
  hitman <- select hitmen
  (_ :*: eachDate :*: minID) <- leftJoin (\(hid :*: _) -> hid .== hitman ! #hitmanID)
                                   minByDate
  (_ :*: maxDate) <- leftJoin (\(hid :*: _) -> hid .== hitman ! #hitmanID)
                       maxDate
  mark <- leftJoin (\m -> just (m ! #markID) .== minID) (select marks)
  restrict (maxDate .== eachDate .|| isNull maxDate)
  pure (hitman :*: mark)

  where minByDate :: Query s (Col s (ID Hitman) :*: Col s UTCTime :*: Col s (Maybe (ID Mark)))
        minByDate = aggregate $ do
          hitman <- select hitmen
          erasedMark <- innerJoin (\em -> em ! #erasedMarkHitmanID .== hitman ! #hitmanID)
                          (select erasedMarks)
          hid <- groupBy (hitman ! #hitmanID)
          date <- groupBy (erasedMark ! #erasedMarkCreatedAt)
          pure (hid :*: date :*: min_ (erasedMark ! #erasedMarkMarkID))

        maxDate :: Query s (Col s (ID Hitman) :*: Col s (Maybe UTCTime))
        maxDate = aggregate $ do
          hitman <- select hitmen
          erasedMark <- innerJoin (\em -> em ! #erasedMarkHitmanID .== hitman ! #hitmanID)
                          (select erasedMarks)
          hid <- groupBy (hitman ! #hitmanID)
          pure (hid :*: max_ (erasedMark ! #erasedMarkCreatedAt))

The verdict: Use Opaleye if you're comfortable with profunctors, use Selda (or maybe Beam) otherwise.

None of the DB library options in the Haskell ecosystem are really at the level of a full ORM; you're not going to find something to equal ActiveRecord or similar. Still, if you need your queries to be rock-solid, you're in good hands.

Here's a link to the repository containing implementations for each library. Each implementation also has a small executable driver to demonstrate how to run queries.

If you've gotten this far, why not give a library of your choice a shot? Try getting something from that repository up and running. Write a new query to see how the library feels to use. For instance, you could try getting all marks that required a long chase. Or, take one of the implementations as a template and start on a project of your own. If you're thinking of trying Opaleye, Saurabh Nanda has written some excellent docs to get you started.

Found this useful? Still have questions? Talk to me!