发信人: wyr (遗忘小资), 信区: Database
标 题: Re: question on large tables (>=800 million records, 10 G b
发信站: BBS 未名空间站 (Sun Jan 21 15:50:21 2007), 转信
Teradata is so far the best platform for Terabyte data warehouse. The
parallelism implementation from bottom up in Teradata ensures a quite linear
This product is quite expensive in terms of license and service. I doubt
academic institutions will be able to easily afford its product and service,
however, academic license might be fairly cheap and you can always have
students willing to learn maintain the software/hardware.
What I wanted to clarify is how the parallelism is designed and implemented.
The customized hash algorithm and parition is the key here. Once you have a
good hash algorithm and a good partition schema, you can split your large
data file/index file into smaller manageable pieces. Splitting the physical
storage of data and index file to a clusterred environment could help a lot.
A single node PC will probably never achieve the performance you want no
matter how you partition your data. The 5G index file you mentioned in your
first post is simply a impossible size for a PC to handle. A lot of
processing time is wasted by swap-in-out memory contents. A search on this
5G file is not n(log n) anymore, mysql has to swap in and out memory, on the
contrary, with your own implmentation of hash / parititon, you can make it
manageable and cut all these waste which could easily takes away more than
65%-75% total execution time. Also, with data split into independent IO
channels, you can have a good estimation on the IO cost for each search. Let
's say if you split your file into 20 pices, each of them contains a B tree
with the height of 5 . With the size of each file does not exceed 512M,
search on each individual tree (resides in memory) would be very predictable
If you know the characteristics of your query, you can further fine tune yor
index file. Pre-joined index file could help also(if the individual key has
low cardinality and you are always query on combination of certain keys,
Oracle's bitmap index + index join step is a good example). To push the
performance for data inquiry query(based on keys)to sub-second on most
modern RDBMS is not a huge challenge anymore with the help of paritioning/
parallelism/sufficient IO subsystem. What is more challenging is aggregation
/ sorting computatoin(OLAP)on these large data sets especially when joins
【 在 babycry (babycry) 的大作中提到: 】
: Thanks for the help offered and the clarification. I appreciate it !
: Yes, splitting large data sets into smaller files based on keys greatly
: In this way, we have keys implicitly implemented (in terms of application
: specific semantics)
: without using any storage.
: It also helps parallelism on a SMP machine or a cluster.
: This is exactly what Assailant pointed out (see previous posts).
: Assailant also suggested splitting the data set in different ways by
: different keys.
※ 来源:·BBS 未名空间站 mitbbs.com·[FROM: 70.244.]