Managing the Google Web 1T 5-gram with Relational Database
Abstract
On September 19, 2006, Google released Web 1T 5-gram, an n-gram corpus generated from approximately 1 trillion words of source text. It is a valuable reference for English usage because no other corpus of comparable size is available. However, it has not been widely used in language education because of the difficulty of managing such a large data set. This paper describes a practical approach that uses a relational database to store, index, and search the corpus, and implements it on commodity hardware. Basic search queries are designed for performance testing, and the recorded sample results show acceptable data processing and search response times. The results show that the 5-gram corpus can be managed with a relational database on commodity hardware. Further search queries can be designed and implemented to make better use of the corpus in language education.
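The store, index, and search steps summarized above can be illustrated with a minimal sketch. It is not the implementation described in this paper: it assumes SQLite through Python's standard `sqlite3` module, an input format of five space-separated words followed by a tab and a count, and hypothetical names (`fivegram`, `idx_w1_w2`, `build_db`, `search_prefix`). The sample rows and counts are made up for illustration only.

```python
import sqlite3

# Made-up sample rows in an assumed "w1 w2 w3 w4 w5<TAB>count" format.
SAMPLE = [
    "look forward to seeing you\t120",
    "look forward to hearing from\t300",
    "look back at the past\t50",
]

def build_db(lines):
    """Store 5-gram lines in an in-memory SQLite table and index it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE fivegram ("
        "w1 TEXT, w2 TEXT, w3 TEXT, w4 TEXT, w5 TEXT, freq INTEGER)"
    )
    rows = []
    for line in lines:
        ngram, freq = line.rstrip("\n").split("\t")
        words = ngram.split(" ")
        if len(words) != 5:
            continue  # skip anything that is not a 5-gram
        rows.append((*words, int(freq)))
    conn.executemany("INSERT INTO fivegram VALUES (?, ?, ?, ?, ?, ?)", rows)
    # Build the index after the bulk load so inserts are not slowed by it.
    conn.execute("CREATE INDEX idx_w1_w2 ON fivegram (w1, w2)")
    conn.commit()
    return conn

def search_prefix(conn, w1, w2, limit=10):
    """Return the most frequent 5-grams that start with the words w1 w2."""
    cur = conn.execute(
        "SELECT w1, w2, w3, w4, w5, freq FROM fivegram "
        "WHERE w1 = ? AND w2 = ? ORDER BY freq DESC LIMIT ?",
        (w1, w2, limit),
    )
    return cur.fetchall()

conn = build_db(SAMPLE)
for row in search_prefix(conn, "look", "forward"):
    print(" ".join(row[:5]), row[5])
```

Splitting each 5-gram into one column per word lets an ordinary B-tree index serve queries on the leading words. For the full corpus, a server-based database and additional indexes on other word positions would likely be needed.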
Keywords
Journal of Education, Informatics and Cybernetics, 2009, ISSN: 1943-7978