Applying the Semantic Web to LSI {swforlsi}
1) Patterns in Unstructured Data
Latent Semantic Indexing
http://javelina.cet.middlebury.edu/lsa/out/cover_page.htm
INTRODUCTION - THE NEED FOR SMARTER SEARCH ENGINES
As of early 2002, there were just over two billion web pages listed in the Google search engine index, widely taken to be the most comprehensive. No one knows how many more web pages there are on the Internet, or the total number of documents available over the public network, but there is no question that the number is enormous and growing quickly. Every one of those web pages has come into existence within the past ten years. There are web sites covering every conceivable topic at every level of detail and expertise, and information ranging from numerical tables to personal diaries to public discussions. Never before have so many people had access to so much diverse information.
Even as the early publicity surrounding the Internet has died down, the network itself has continued to expand at a fantastic rate, to the point where the quantity of information available over public networks is starting to exceed our ability to search it. Search engines have been in existence for many decades, but until recently they have been specialized tools for use by experts, designed to search modest, static, well-indexed, well-defined data collections. Today's search engines have to cope with rapidly changing, heterogeneous data collections that are orders of magnitude larger than ever before. They also have to remain simple enough for average and novice users. While computer hardware has kept up with these demands - we can still search the web in the blink of an eye - our search algorithms have not. As any Web user knows, getting reliable, relevant results for an online search is often difficult.
For all their problems, online search engines have come a long way. Sites like Google are pioneering the use of sophisticated techniques to help distinguish content from drivel, and the arms race between search engines and the marketers who want to manipulate them has spurred innovation. But the challenge of finding relevant content online remains. Because of the sheer number of documents available, we can find interesting and relevant results for any search query at all. The problem is that those results are likely to be hidden in a mass of semi-relevant and irrelevant information, with no easy way to distinguish the good from the bad.