Selena Deckelmann
“Which Databases Solve My Problem?”
Talking about the right open-source tool for the job.
2005:
- BerkleyDB - 1980s, one of the originals.
- MySQL
- PostgreSQL - started in the 80s, open-sourced in the 90s.
- SQLite.
2010:
Now there are about 50 open source databases with active communities, code updates, one was announced during linuxconf.
So many choices - which to use.
MySQL vs PostgreSQL
It’s the perennial question. The answer is, well, what are you trying to solve?
- Store and manipulate GIS data: PostgreSQL
- My blog? Probably default to MySQL.
- I have ONE BEELLION users to store and analyse data from: you may be in the custom world.
PostgreSQL: A replacement for Oracle is one problem they want to solve.
Things to think about.
Performance: the top of a lot of lists. Use your use case and test with real data. Benchmarks aren’t that great.
Interoperability: How do you get data in, out, shake it all about.
Sustainable: You don’t want your DB project to die and stick you with a dead product.
There is no one true solution. Selena thinks that mixed database solutions will be a part of the future.
The Survey
- Wasn’t perfect.
- 12 responses, 25 people.
- Will probably use a pro to help survey in the future.
Means of comparison:
- DB model: what do you do, how do you structure your data
- Features
- Development Style
Relational
OLTP vs embedded vs column (data warehouses) vs MPP (Massively parallel)
- OLTP: CUBRID, MySQL, PostgreSQL.
- Embedded: H2, HSQLDB, SQLite and many Java DBs.
- Column-store: Monet, LucidDB, C-store/Vertica
Non-relational
Flatfile, key-value (map-reduce, fault tolerance, caching), multi-value, graph/tri-graph (relationship oriented), document-oriented (semi-structured data).
- key-value: Berkley, Cassandra, Hbase, memcached, riak, redis, TokyoCabinet
- Graph: neo4j, 4store, Parliament
- Document: CouchDb, BerkleyDB-XML, MongoDB
There’s gotta be a collapse in the key-value space. There’s a bajillion DBs all doing the same thing.
Infrastructure Features
- Partitioning/Sharding: Cassandra/Hbase/Voldemort/Riak/MySSQL
- Replication: BerkleyDB/CouchDB/Cassandra/MySQL/PostgreSQL/Riak/Scalaris/Voldemort/HyperTable/Hbase/Memcached/MNesia
There’s many ways to do replication.
Memory vs Disk
- In-memory: memcache, scalris, redis
- Configurable: casandra, hbase, hypertable, MNesia
- Disk: everyone else
Relational DBs don’t as a rule support even hybrid models. Think it will be a requirement for the future.
High Availability
Nodefailover: Cassandra, HBase, Riak
Currently something people want, but SQL DBs require lots of admin work to achieve currently. Needs to be easier.
Sustainability
- Core + Modules: Drizzle, LucidDB, PostgreSQL
- Monolithic: GT.M, Ingres, CUBRID
- Infrastructure: Memcached, Redis, Scalaris
Hard to model; Core + Modules is about having a sold core and lots in contrib, with open interaction.
Monoliths want to own it all, extrude a feature set periodically, often resistent to outside contribution.
Infrastructure: Intended to be wedged in as middleware. Clear APIs, protocols are critical.
Redis is like memcached, but can distribute across systems, but is incomptabile with memcached’s protocols.
Community Model
-
Benevolent Dictator: Redis, XtraDB, MckoiDDB
-
Feature Driven: Derby, InfiniDB
-
Small Group: CouchDB, MonetDB
-
Mixed: LucidDB, Drizzle, H2, Pg
-
Dictator typically are smaller devleopment groups.
-
Feature driven are driven my external (commercial, user) requests; typically don’t have a strong vision, nothing to inspire developers.
-
Small groups tend to be long-running, sometimes ex-commercial, with a longstanding core of developers.
Selena feels it’s a progression, the models change as they mature.
You can see the survey results online. Would like to publish to wikipedia, but their publishing model is high-impedence.
More in touch, more help from experts in the survery field.
Interesting things
- Some projects are implementing one another’s protocols; e.g. H2 like Pg, and wants to be compatible, but wants to be in Java, and is a drop-in replacement from a protocol perspective.
- Sphinx chooses to implement the MySQL protocol.
Verification needs to happen in this space; memcapable is a tool to do this for memcached, but they don’t exist for other protocols.
Proper regression suites for complex cases like replication are quite critical; there’s a scarcity of tools. Pg are doing more replication and finding testing is a challenge.
There needs to be more contact between projects; Selena is concerned that the Java world seems cut off from the Db world, for example.
Databases talking to each other: e.g. Thrift with ThruDB is the future of talking to DBs efficicently and managing cross-Db conversations better.
Questions
- SQL dialects: Not really an interesting question, convergance is happening.
- What’s the die-off rate? Relational SQL DBs are mostly forks that are living forever. Ones that have died is harder-to-find information, but the rate seems low, maybe 3 or 4.
- Are there signs of consolidation? It’s still a long way off. Releases of papers by Amazon have spawned a lot of activity for new DBs where companies have rolled-their-own and open sourced their own solutions. We’ll see if problem sets are similar enough to encourage merges.
- What’s the biggest/most amusing/fun project with an open source DB you’ve done? Helped implement a cowscii module in PostgreSQL, or maybe Nigera. A state government wanted to go open source, and they wanted to invite people to help. Went as a guest of the state, armed guards, driving trips, a week with some local developers who’d mostly worked with .NET tools, discuss how to use Pg, interact with the community. Will be used to drive community welfare projects.
- Were lots dual-licensed? Yep. Mostly to drive funding. In light of current events, maybe that’s not suc a good idea.
- Talked to many of the MySQL forks; MySQL itself isn’t working as a community project any more, and that’s why we have all those forks. As an experiment in open-source development they seem ot have failed, but the forks live on.