Survey of Open Source Database

Selena Deckelmann

“Which Databases Solve My Problem?”

Talking about the right open-source tool for the job.

2005:

  • BerkleyDB - 1980s, one of the originals.
  • MySQL
  • PostgreSQL - started in the 80s, open-sourced in the 90s.
  • SQLite.

2010:

Now there are about 50 open source databases with active communities, code updates, one was announced during linuxconf.

So many choices - which to use.

MySQL vs PostgreSQL

It’s the perennial question. The answer is, well, what are you trying to solve?

  • Store and manipulate GIS data: PostgreSQL
  • My blog? Probably default to MySQL.
  • I have ONE BEELLION users to store and analyse data from: you may be in the custom world.

PostgreSQL: A replacement for Oracle is one problem they want to solve.

Things to think about.

Performance: the top of a lot of lists. Use your use case and test with real data. Benchmarks aren’t that great.

Interoperability: How do you get data in, out, shake it all about.

Sustainable: You don’t want your DB project to die and stick you with a dead product.

There is no one true solution. Selena thinks that mixed database solutions will be a part of the future.

The Survey

  • Wasn’t perfect.
  • 12 responses, 25 people.
  • Will probably use a pro to help survey in the future.

Means of comparison:

  • DB model: what do you do, how do you structure your data
  • Features
  • Development Style

Relational

OLTP vs embedded vs column (data warehouses) vs MPP (Massively parallel)

  • OLTP: CUBRID, MySQL, PostgreSQL.
  • Embedded: H2, HSQLDB, SQLite and many Java DBs.
  • Column-store: Monet, LucidDB, C-store/Vertica

Non-relational

Flatfile, key-value (map-reduce, fault tolerance, caching), multi-value, graph/tri-graph (relationship oriented), document-oriented (semi-structured data).

  • key-value: Berkley, Cassandra, Hbase, memcached, riak, redis, TokyoCabinet
  • Graph: neo4j, 4store, Parliament
  • Document: CouchDb, BerkleyDB-XML, MongoDB

There’s gotta be a collapse in the key-value space. There’s a bajillion DBs all doing the same thing.

Infrastructure Features

  • Partitioning/Sharding: Cassandra/Hbase/Voldemort/Riak/MySSQL
  • Replication: BerkleyDB/CouchDB/Cassandra/MySQL/PostgreSQL/Riak/Scalaris/Voldemort/HyperTable/Hbase/Memcached/MNesia

There’s many ways to do replication.

Memory vs Disk

  • In-memory: memcache, scalris, redis
  • Configurable: casandra, hbase, hypertable, MNesia
  • Disk: everyone else

Relational DBs don’t as a rule support even hybrid models. Think it will be a requirement for the future.

High Availability

Nodefailover: Cassandra, HBase, Riak

Currently something people want, but SQL DBs require lots of admin work to achieve currently. Needs to be easier.

Sustainability

  • Core + Modules: Drizzle, LucidDB, PostgreSQL
  • Monolithic: GT.M, Ingres, CUBRID
  • Infrastructure: Memcached, Redis, Scalaris

Hard to model; Core + Modules is about having a sold core and lots in contrib, with open interaction.

Monoliths want to own it all, extrude a feature set periodically, often resistent to outside contribution.

Infrastructure: Intended to be wedged in as middleware. Clear APIs, protocols are critical.

Redis is like memcached, but can distribute across systems, but is incomptabile with memcached’s protocols.

Community Model

  • Benevolent Dictator: Redis, XtraDB, MckoiDDB

  • Feature Driven: Derby, InfiniDB

  • Small Group: CouchDB, MonetDB

  • Mixed: LucidDB, Drizzle, H2, Pg

  • Dictator typically are smaller devleopment groups.

  • Feature driven are driven my external (commercial, user) requests; typically don’t have a strong vision, nothing to inspire developers.

  • Small groups tend to be long-running, sometimes ex-commercial, with a longstanding core of developers.

Selena feels it’s a progression, the models change as they mature.

You can see the survey results online. Would like to publish to wikipedia, but their publishing model is high-impedence.

More in touch, more help from experts in the survery field.

Interesting things

  • Some projects are implementing one another’s protocols; e.g. H2 like Pg, and wants to be compatible, but wants to be in Java, and is a drop-in replacement from a protocol perspective.
  • Sphinx chooses to implement the MySQL protocol.

Verification needs to happen in this space; memcapable is a tool to do this for memcached, but they don’t exist for other protocols.

Proper regression suites for complex cases like replication are quite critical; there’s a scarcity of tools. Pg are doing more replication and finding testing is a challenge.

There needs to be more contact between projects; Selena is concerned that the Java world seems cut off from the Db world, for example.

Databases talking to each other: e.g. Thrift with ThruDB is the future of talking to DBs efficicently and managing cross-Db conversations better.

Questions

  • SQL dialects: Not really an interesting question, convergance is happening.
  • What’s the die-off rate? Relational SQL DBs are mostly forks that are living forever. Ones that have died is harder-to-find information, but the rate seems low, maybe 3 or 4.
  • Are there signs of consolidation? It’s still a long way off. Releases of papers by Amazon have spawned a lot of activity for new DBs where companies have rolled-their-own and open sourced their own solutions. We’ll see if problem sets are similar enough to encourage merges.
  • What’s the biggest/most amusing/fun project with an open source DB you’ve done? Helped implement a cowscii module in PostgreSQL, or maybe Nigera. A state government wanted to go open source, and they wanted to invite people to help. Went as a guest of the state, armed guards, driving trips, a week with some local developers who’d mostly worked with .NET tools, discuss how to use Pg, interact with the community. Will be used to drive community welfare projects.
  • Were lots dual-licensed? Yep. Mostly to drive funding. In light of current events, maybe that’s not suc a good idea.
  • Talked to many of the MySQL forks; MySQL itself isn’t working as a community project any more, and that’s why we have all those forks. As an experiment in open-source development they seem ot have failed, but the forks live on.
Share