
From: babycry (babycry), Board: Database
Title: Re: question on large tables (>=800 million records, 10 G bytes)
Site: BBS 未名空间站 (Sun Jan 21 00:59:46 2007)


Thanks for the help offered and the clarification. I appreciate it!

Yes, splitting a large data set into smaller files based on keys helps
greatly.
In this way the keys are implicitly implemented (in terms of
application-specific semantics) without using any extra storage.
It also enables parallelism on an SMP machine or a cluster.

This is exactly what Assailant pointed out (see the previous posts).
Assailant also suggested splitting the data set in different ways, by
different keys.
Both suggestions are what we are currently doing to make the data
mining possible; a rough sketch of this kind of splitting follows.
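
To make that concrete, here is a minimal Python sketch of key-based
splitting. Everything in it is an assumption for illustration: the
records live in a CSV file, the partition key is a single column, and
the file and column names are made up.

import csv

def split_by_key(src_path, key_col=0, out_prefix="part"):
    """Write each record to a partition file named after its key value.

    Assumes a modest number of distinct key values, since one output
    file stays open per key.
    """
    writers = {}  # key value -> (file handle, csv writer)
    with open(src_path, newline="") as src:
        for row in csv.reader(src):
            key = row[key_col]
            if key not in writers:
                f = open(f"{out_prefix}_{key}.csv", "w", newline="")
                writers[key] = (f, csv.writer(f))
            writers[key][1].writerow(row)
    for f, _ in writers.values():
        f.close()

# Splitting "in different ways by different keys" is just repeated calls:
# split_by_key("records.csv", key_col=0, out_prefix="by_caller")
# split_by_key("records.csv", key_col=2, out_prefix="by_date")

Each partition file can then be scanned by a separate process, which is
where the SMP / cluster parallelism comes from.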

The key to fast queries is how the data is organized (i.e., similar
data should be put together).
A gigabyte- or terabyte-scale data set does not necessarily mean that
queries on it will be slow.
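
A toy illustration of why size alone is not the problem: once the
records are kept sorted (clustered) on the query key, a point lookup is
a binary search, so its cost grows with log(n), not with n. The numbers
below are hypothetical.

from bisect import bisect_left

def find(sorted_keys, key):
    """O(log n) point lookup over keys kept in sorted order."""
    i = bisect_left(sorted_keys, key)
    if i < len(sorted_keys) and sorted_keys[i] == key:
        return i       # position of the matching record
    return None        # key not present

# 800 million sorted keys need only about 30 comparisons: 2**30 > 8*10**8.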

However, there is a tension here. To organize the data well,
we need to know how the data is going to be queried.
On the other hand, in many data mining tasks, especially academic ones,
we do not know in advance how we are going to query the data.
As a result, we do not know in advance exactly how to organize it.


I do not know much about Teradata. Here is my understanding:
if I were to deal with Teradata, what I would need is its service
rather than its products.
In other words, suppose I had the phone call records of all of British
Telecom (on the order of 10 giga records per month).
What I would want is the assurance that I can run my usual queries in,
say, one second per query.
A specific data warehouse product does not interest me in itself,
and I am skeptical of the performance of a general-purpose data
warehouse package.



【 Quoting the post by wyr (遗忘小资): 】
: customized hash algorithm to help you partition your data based on the
: features of your query. I do not know how MySQL implements its algorithm.
: Here are my 2 cents, based on my understanding of Teradata.
: If you have a primary index (unique or not unique), you start by trying
: to distribute your data evenly into several segments using these
: columns (I am assuming your query conditions are primarily based on
: these columns), let's say 10.
: Then, based on your PK, you build a hash algorithm that maps every
: combination of your PK columns to one of the 10 buckets.
: For each bucket, try to build your B-tree within the bucket.
: ...................
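
For what it is worth, here is a rough Python sketch of the bucketing
scheme wyr describes above: hash the PK columns into 10 buckets and
keep a B-tree index within each bucket. SQLite stands in for the
per-bucket B-tree (its indexes are B-trees internally); the table and
column names are made up for the example.

import sqlite3
from zlib import crc32

NUM_BUCKETS = 10  # "let's say 10", as in the quoted post

def bucket_of(*pk_cols):
    # Hash the combination of PK column values to one of the buckets.
    return crc32("|".join(map(str, pk_cols)).encode()) % NUM_BUCKETS

# One SQLite file per bucket; the index created below is a B-tree.
conns = [sqlite3.connect(f"bucket_{i}.db") for i in range(NUM_BUCKETS)]
for c in conns:
    c.execute("CREATE TABLE IF NOT EXISTS t (a INTEGER, b INTEGER, payload TEXT)")
    c.execute("CREATE INDEX IF NOT EXISTS idx_ab ON t (a, b)")

def insert(a, b, payload):
    conns[bucket_of(a, b)].execute(
        "INSERT INTO t VALUES (?, ?, ?)", (a, b, payload))

def lookup(a, b):
    # The bucket is computed from the key, so only one file is touched,
    # and the B-tree keeps the search inside that file logarithmic.
    return conns[bucket_of(a, b)].execute(
        "SELECT payload FROM t WHERE a = ? AND b = ?", (a, b)).fetchall()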



--

※ Source: BBS 未名空间站 http://mitbbs.com [FROM: 18.251.]
