Precognition: Thinking about the query before it happens

Daniel Gruhl

doi:10.1109/WIRI.2005.29

WIRI 2005

Conference paper

01 Dec 2005

Abstract

Web Information Retrieval and Integration is one of the most challenging of the emerging information retrieval domains. Traditional approaches of returning all of exactly what the user asks for are not feasible; what does a user do with a billion pages in rank order? Identifying relevant documents requires that substantially more "thought" go into selecting each result, but the data scale is such that doing the "thinking" at query time, as a post-processing of results, is not really a viable option. The solution that WebFountain[5] has pursued is that of examining the pages and really "thinking" about them before the query occurs.
There are many kinds of analysis that are easier to do page by page than over a whole corpus. For example, trying to find all of the pages that contain a mention of a drug that contains aspirin is difficult (there are thousands of drugs and brand names that do). A query with this kind of term fan-out is untenable in a high-performance system. Instead, consider a program that scans a document and adds a tag Drug: Aspirin whenever it finds one of the variants. While doing so it could also add Drug: Cox I inhibitor and Drug-type: Analgesic. This allows high-level queries such as Drug: Cox II inhibitor NEAR Condition: Cardiac Disease to be processed as a simple two-term Boolean query.
But there is more to finding information than simple Boolean queries. In many cases it is not actually the pages that are of interest, but aggregate information on them. With the proper indexes and complex joining functionality it is possible to explore what trends and relationships occur between entities mentioned on webpages. For example, what I say in my blog about a popular music artist is not a good predictor of their popularity, but the trend in the number of all blog mentions is an excellent predictor of sales two weeks later. We have found the same to be true for certain classes of book sales as well.
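
The page-level tagging idea described above can be illustrated with a small sketch. This is not WebFountain code; the variant table, the function names (annotate, build_index, near), and the 10-token NEAR window are all assumptions made for illustration. The sketch tags each document once, ahead of time, so that a query over high-level tags reduces to a lookup of two posting lists instead of a fan-out over thousands of drug names.

```python
import re
from collections import defaultdict

# Hypothetical variant table: surface form -> tags added when it is found.
# A real system would hold thousands of drug and brand names.
VARIANT_TAGS = {
    "aspirin": ["Drug: Aspirin", "Drug: Cox I inhibitor", "Drug-type: Analgesic"],
    "acetylsalicylic acid": ["Drug: Aspirin", "Drug: Cox I inhibitor", "Drug-type: Analgesic"],
    "heart disease": ["Condition: Cardiac Disease"],
    "cardiac disease": ["Condition: Cardiac Disease"],
}

def annotate(text):
    """Return (tag, token_position) pairs for every variant found in text."""
    lowered = text.lower()
    annotations = []
    for variant, tags in VARIANT_TAGS.items():
        pattern = r"\b" + re.escape(variant) + r"\b"
        for match in re.finditer(pattern, lowered):
            # Position = number of whitespace-separated tokens before the match.
            position = len(lowered[:match.start()].split())
            for tag in tags:
                annotations.append((tag, position))
    return annotations

def build_index(docs):
    """Build tag -> {doc_id: [positions]} once, before any query arrives."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for tag, position in annotate(text):
            index[tag][doc_id].append(position)
    return index

def near(index, tag_a, tag_b, window=10):
    """Return sorted doc ids where both tags occur within `window` tokens."""
    postings_a = index.get(tag_a, {})
    postings_b = index.get(tag_b, {})
    hits = []
    for doc_id in postings_a.keys() & postings_b.keys():
        if any(abs(a - b) <= window
               for a in postings_a[doc_id] for b in postings_b[doc_id]):
            hits.append(doc_id)
    return sorted(hits)

docs = {
    "d1": "Patients with heart disease are often told to take acetylsalicylic acid daily.",
    "d2": "Aspirin is a widely used pain reliever.",
    "d3": ("Cardiac disease was discussed at length in the first chapter of the "
           "report, while a much later appendix on over the counter remedies "
           "covers aspirin."),
}
index = build_index(docs)
print(near(index, "Drug-type: Analgesic", "Condition: Cardiac Disease"))  # ['d1']
```

The cost of recognizing variants is paid once per page at annotation time; the query itself touches only two tags, which is the two-term Boolean query the abstract refers to.
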
Understanding the strong relationships between people, places, universities, companies, products, etc. is another example where the preponderance of existence in a particular set of websites is sufficient evidence to propose a linkage. These higher-level annotations and more complex querying capabilities not only enable better point querying, but also open the door for more interesting higher-level applications. Examples include exploring trends in discussions and information diffusion[6], identifying templates[4], detecting collusion[3], finding connections[1], finding the connections between the real world and the web[7], and discovering aliases[8], just to name a few. In short, the creation of this semi-structured metadata from unstructured source data allows the system to begin to perform "business intelligence" type queries over the unstructured corpus. Traditionally, the thought of generating and storing all this metadata has been prohibitive. However, the price of low-end storage has been dropping, recently falling below $0.31 per gigabyte. This opens the door for this kind of research even in small-scale systems. © 2005 IEEE.
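
The linkage idea above (entities that appear together on enough pages are proposed as related) can be sketched as a simple co-occurrence count over page-level annotations. This is a minimal illustration under stated assumptions, not the method used in the paper: the function name propose_links, the min_pages threshold, and the example entities are all hypothetical.

```python
from collections import Counter
from itertools import combinations

def propose_links(page_entities, min_pages=2):
    """Propose entity pairs that co-occur on at least `min_pages` pages.

    page_entities maps a page id to the entity tags found on that page.
    """
    pair_counts = Counter()
    for entities in page_entities.values():
        # Count each unordered pair at most once per page.
        for pair in combinations(sorted(set(entities)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_pages}

# Hypothetical annotations produced by an earlier tagging pass.
pages = {
    "p1": ["Person: Pat Example", "University: Example U"],
    "p2": ["Person: Pat Example", "University: Example U", "Company: ExampleCorp"],
    "p3": ["Company: ExampleCorp"],
}
links = propose_links(pages)
print(links)  # {('Person: Pat Example', 'University: Example U'): 2}
```

A raw count threshold is the simplest possible evidence rule; a real system would likely need to normalize for how common each entity is across the corpus.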
