Hello everyone,
I am struggling with the design of a database for a scientific
application.
I've been running around in circles trying to figure this out for weeks
and
I really need help with it. I am working on a project with comparitively
little resources and I need to start making progress soon, so thanks in
advance if you manage to read even half of this (I know its long!).
My frustration stems from a combination of the problem being hard and me
being new to SQL and not having a lot of experience with very large
databases. It has very different usage characteristics compared to normal
databases. For example, I consider the following im****tant:
-"Doing it right the first time", because the database will be around for
a
long time
-Design that relies as much as possible on SQL standards
-I have to be able to afford it (i.e., it has to be free)...this and the
previous item led me to PostgreSQL
-Referential integrity (foreign keys)
-The ability to store at the very least several hundred million, but
possibly up to several hundred billion (!) rows in a table
That last requirement may seem pretty ridiculous, but consider that
-Despite the huge number of rows, I'll never SELECT more than a few
million
at time
-There are usually between 3-6 fixed format columns in most tables...It's
scientific data, mostly numbers. So it's not as big as it sounds, and it
probably won't seem that large in 5-10 years, which might be how long it
takes to get that big..for the first few years it will be 100 gigabytes at
the most
-I don't care about performance in the same way most people using
databases
do. I'll never have a huge number of users accessing it, and I won't be
concerned with metrics like X queries per second (there will probably be
at
the most a few queries per minute...and probably not even that many, with
the occasional update, insert, or delete). My definition of performance
is
the "the query takes 2 seconds" instead of 2 or 20 minutes. If it takes 2
milliseconds great, but I don't really care.
-Here's the interesting part: not only is most of the data naturally
partitionable, it's "so natural" that the full table is never actually
needed, just the partitions. That is, there is no query that will ever be
performed on a union of 2 or more partitions. Ever. The only reason to
even
use a database is referential integrity. I have to force it to be one big
table, because the foreign key reference needs to know about a single
logical name. But really, all it amounts to is a big, well-organized
filesystem with lots of meta-data to help me find things.
It seems like by that last point, I can break up each 2-3 million row
partition into a table and not even bother with partitioning. Except if I
do that, I'll have hundreds of tables with automatically generated names
and
hundreds of odd foreign keys everywhere. In addition to being "ugly",
this
seems like it would be hard to administer and maintain.
In PostgreSQL it makes no difference anyway, since all partitions are
explicitly visible in the schema either way, and you have to directly
manage
them either way. In MySQL or Oracle (AFAIK), the physical partitions are
hidden from you, managed automatically, and appear as one nice logical
partition. Unfortunately, MySQL partitions are limited to 1024 tables and
they totally remove the foreign key system (so I guess they removed the
thing that seems like it would actually make the implementation hard).
Oracle probably does not have these limitations, but costs arbitrarily
large
amounts of money.
Anyway, should I even be using an RDBMS for this? Surely other people
have
had to do things like before. Is this what people mean by "data
warehouse"
(the definitions I've heard seem so vauge I honestly don't know what it
means). How would you guys implement something like this?
Thanks for reading!
Ken


|