From: babycry (babycry), Board: Database
Subject: Re: question on large tables (>=800 million records, 10 G b
Posted at: BBS 未名空间站 (Sun Jan 21 00:59:46 2007)
Thanks for the help offered and the clarification. I appreciate it!
Yes, splitting large data sets into smaller files based on keys greatly
speeds up queries. In this way, the keys are implicitly implemented (in
terms of application logic) without using any extra storage.
It also helps parallelism on an SMP machine or a cluster.
This is exactly what Assailant pointed out (see previous posts).
Assailant also suggested splitting the data set in different ways along
different keys. Both suggestions are actually what we are currently doing
to make the data set manageable.
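As a rough illustration of the key-based splitting described above, here is a minimal sketch (the function name, the phone-number records, and the area-code key are all my own hypothetical examples, not the poster's actual setup). The key value becomes the file name, so it is implicit and never stored inside the records, and each file can be scanned by a separate worker:

```python
import os
import tempfile

def partition_by_key(records, key_fn, out_dir):
    """Write each record to a file named after its key value.

    Because the file name *is* the key, the key need not be stored
    inside the records themselves, and each resulting file can be
    processed independently (e.g., one worker per file on an SMP box).
    """
    os.makedirs(out_dir, exist_ok=True)
    handles = {}
    try:
        for rec in records:
            key = str(key_fn(rec))
            if key not in handles:
                handles[key] = open(os.path.join(out_dir, key + ".txt"), "w")
            handles[key].write(rec + "\n")
    finally:
        for f in handles.values():
            f.close()
    return sorted(handles)  # the distinct key values seen

# hypothetical usage: split call records by area code
out = tempfile.mkdtemp()
calls = ["212-5550100", "212-5550199", "617-5550123"]
parts = partition_by_key(calls, lambda r: r.split("-")[0], out)
```

A query restricted to one key value then only has to open one file, which is where the speedup comes from.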
The key to fast queries is how the data is organized (i.e., similar data
should be put together).
A gigabyte- or terabyte-scale data set does not necessarily mean
that queries on it are going to be slow.
However, there is a tension here. In order to organize the data well,
we need to know how the data is going to be queried.
On the other hand, in many data mining tasks, especially in academic data
mining research,
we do not know in advance how we are going to query the data.
As a result, we do not know in advance how to exactly organize the data.
I do not know much about Teradata. Here is my understanding:
If I were to have a relationship with Teradata, what I would need is its
service, rather than its products.
In other words, assume I have the phone call records of the whole of
Britain (on the order of 10 giga-records per month).
What I want is to be able to run my usual queries in, say, one second.
A specific piece of data warehouse software does not interest me,
and I am skeptical about the performance of a general-purpose data
warehouse on such a workload.
【 Quoting wyr (遗忘小资): 】
: customized hash algorithm to help you partition your data based on the
: features of your query. I do not know how mysql implements its algorithm.
: Here is my 2 cents based on my understanding of Teradata ..
: If you have a primary index (unique or not unique), you start from these
: columns to distribute your data evenly into several segmentations
: (I am assuming your query condition is primarily based on these
: columns), let's say 10.
: Then, based on your PK, you build a hash algorithm as a permutation:
: all your PK columns map to the 10 buckets.
: For each bucket, try to build your B-tree within the bucket.
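The quoted Teradata-style scheme (hash the PK columns into, say, 10 buckets, then maintain a B-tree within each bucket) can be sketched roughly as follows. This is only my illustration of the idea: `bucket_of`, `BucketedIndex`, the CRC32 hash, and the sorted list standing in for a real B-tree are all assumptions, not anything from Teradata itself.

```python
import bisect
import zlib

NUM_BUCKETS = 10  # "let's say 10", as in the quoted post

def bucket_of(pk):
    # a stable stand-in for the customized hash on the PK columns
    return zlib.crc32(str(pk).encode()) % NUM_BUCKETS

class BucketedIndex:
    """Hash rows into buckets by PK; keep each bucket ordered.

    A sorted list with binary search stands in for the per-bucket
    B-tree; the access pattern is the same: hash to one bucket,
    then an O(log n) search inside that (much smaller) bucket.
    """
    def __init__(self):
        self.buckets = [[] for _ in range(NUM_BUCKETS)]

    def insert(self, pk, row):
        bisect.insort(self.buckets[bucket_of(pk)], (pk, row))

    def lookup(self, pk):
        bucket = self.buckets[bucket_of(pk)]
        i = bisect.bisect_left(bucket, (pk,))  # first entry with this pk
        while i < len(bucket) and bucket[i][0] == pk:
            yield bucket[i][1]
            i += 1
```

The point of the two levels is that the hash spreads rows evenly across buckets (good for parallel loading and scanning), while the ordered structure inside each bucket keeps point lookups on the PK cheap.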
※ Source: BBS 未名空间站 http://mitbbs.com [FROM: 18.251.]