From: babycry (babycry), Board: Database
Subject: Re: question on large tables (>=800 million records, 10 G b
Posted at: BBS Mitbbs (Thu Jan 18 21:47:59 2007)
Thanks! I like this suggestion.
This is actually the approach we are currently using.
It is pretty ad hoc; however, it saves a lot of software-engineering time.
We try to avoid software engineering, since we are not credited for doing it.
The current query time is normally 2-5 minutes.
That is too slow for web apps,
but acceptable for data mining.
Since we do not update/insert,
the data integrity issue of having several copies of the same data
is not a problem.
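A minimal sketch of that ad-hoc approach (file names, field layout, and helper names are all hypothetical, not the poster's actual code): split the big table into one file per cabID, then load back only the partition a query needs.

```python
import csv
import os

def partition_by_key(src_path, out_dir, key_index=0):
    """Write each CSV row of src_path into a file named after its key column
    (e.g. one file per cabID)."""
    os.makedirs(out_dir, exist_ok=True)
    handles = {}  # key -> open file handle, so each partition is opened once
    with open(src_path, newline="") as src:
        for row in csv.reader(src):
            key = row[key_index]
            if key not in handles:
                handles[key] = open(
                    os.path.join(out_dir, f"{key}.csv"), "w", newline=""
                )
            csv.writer(handles[key]).writerow(row)
    for h in handles.values():
        h.close()

def load_partition(out_dir, key):
    """Read back only the rows for one key -- the 'temp table' load step."""
    with open(os.path.join(out_dir, f"{key}.csv"), newline="") as f:
        return list(csv.reader(f))
```

Since the data is read-only, the partition files never need to be kept in sync with the original; they are just a second copy organized for the access pattern.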
If a database storage engine supported a data type like record/row number
(numbered consecutively from 1 to the maximum number of rows in the table),
and if each record in a table had a fixed length,
then an index on this data type would cost zero storage,
and access to any record by row number would take constant time.
This idea is valid for read-only tables.
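The idea above can be sketched directly on a flat file (the record layout below is made up for illustration): with fixed-length records, row number N lives at byte offset (N-1) * record_size, so no index structure is stored at all.

```python
import struct

# Hypothetical fixed-length record layout:
# cab id (int32), timestamp (int32), fare (float64) = 16 bytes per record.
RECORD = struct.Struct("<iid")

def write_records(path, rows):
    """Append fixed-length records to a flat binary file."""
    with open(path, "wb") as f:
        for row in rows:
            f.write(RECORD.pack(*row))

def read_record(path, row_number):
    """Constant-time lookup: row numbers start at 1, so just seek to
    (row_number - 1) * RECORD.size -- zero bytes of index storage."""
    with open(path, "rb") as f:
        f.seek((row_number - 1) * RECORD.size)
        return RECORD.unpack(f.read(RECORD.size))
```

This only works for read-only data: an insert or delete in the middle would shift every later row number.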
A Google search suggests that storage engines like
NitroEDB and BrightHouse look promising,
since they claim to have unique indexing techniques
and to manage multi-billion-record tables.
They are scheduled to become available sometime in 2007.
We hope those techniques will be free of charge for academic use,
and helpful for future data sets.
【 Quoting the post by Assailant (反恐精英 勇救人质 拆弹专家): 】
: how is the data collected, or updated/inserted? do you get a data feed at
: certain times of the day, or is this going to be a static set of data you
: are working with?
: have you considered breaking the data into smaller groups of files, and
: loading only the needed data into a temp table when requested.
: each of those files would contain all the data associated with that cabID.
: and when a request comes in, read the file(s) and write them to a table
※ Source: BBS Mitbbs http://mitbbs.com [FROM: 18.85.]