Graziela Medeiros

June 30, 2009

Common Information Retrieval Myths

Filed under: Information Retrieval (Recuperação da Informação), by grazielamedeiros @ 13:58

This text on the ‘myths’ of information retrieval is interesting, given that the concept behind the term is often confused. The text highlights the importance of fields such as Library Science and Information Science (myth 2). Vannevar Bush, an important author in those fields, is also cited (myth 5).

The titles of the ‘myths’ refer to everything that information retrieval is NOT. Although the original is in English, the text is objective and easy to understand.

1. Information retrieval is the same as Information Extraction

“Information Extraction is not Information Retrieval: Information Extraction differs from traditional techniques in that it does not recover from a collection a subset of documents which are hopefully relevant to a query, based on key-word searching (perhaps augmented by a thesaurus).

Instead, the goal is to extract from the documents (which may be in a variety of languages) salient facts about prespecified types of events, entities or relationships. These facts are then usually entered automatically into a database, which may then be used to analyse the data for trends, to give a natural language summary, or simply to serve for on-line access.” (GATE)

More on that here.
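The distinction in the GATE quote can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-document collection, the `retrieve` and `extract` helper names, and the single "acquisition" pattern are all invented here, and real extraction systems such as GATE are far richer than one regular expression.

```python
import re

# Hypothetical toy collection, invented for this illustration.
DOCS = {
    "d1": "Acme Corp acquired Widget Ltd in 2007.",
    "d2": "The weather in Lisbon was mild in 2007.",
    "d3": "Globex acquired Initech in 2009.",
}


def retrieve(query, docs):
    """Information retrieval: return the ids of documents that contain
    every query keyword. The result is a subset of the collection."""
    terms = query.lower().split()
    hits = []
    for doc_id, text in docs.items():
        tokens = set(re.findall(r"\w+", text.lower()))
        if all(term in tokens for term in terms):
            hits.append(doc_id)
    return hits


# A prespecified event type ("acquisition") written as a pattern.
# NAME matches one or more capitalised words, e.g. "Acme Corp".
NAME = r"[A-Z]\w*(?: [A-Z]\w*)*"
ACQUISITION = re.compile(
    rf"(?P<buyer>{NAME}) acquired (?P<target>{NAME}) in (?P<year>\d{{4}})"
)


def extract(docs):
    """Information extraction: return structured facts (database-like rows)
    about acquisition events, not the documents themselves."""
    facts = []
    for doc_id, text in docs.items():
        for m in ACQUISITION.finditer(text):
            facts.append({
                "doc": doc_id,
                "buyer": m["buyer"],
                "target": m["target"],
                "year": int(m["year"]),
            })
    return facts


print(retrieve("acquired", DOCS))  # a list of document ids
print(extract(DOCS))               # a list of fact records
```

The point is the shape of the output: `retrieve` hands back documents for a person to read, while `extract` hands back rows that could go straight into a database.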

2. Information retrieval is a computer science discipline

No, not quite.
IR is interdisciplinary because of the many different problems that arise within it.
First off, our data is usually in text format, so we need linguistics and cognitive psychology.

Then the data is stored somehow and is either structured or unstructured, so we need information architecture, information science and library science to help with that.

The text and the query are analysed and rendered into a numeric format that a machine can understand, so statistics comes into play as well.
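As a rough sketch of what "rendered into a numeric format" can mean, the Python below turns documents and a query into TF-IDF weighted term vectors and ranks documents by cosine similarity. The article does not name a weighting scheme, so TF-IDF with a smoothed IDF is an assumption made here, and the sample documents and function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_idf(docs):
    """Inverse document frequency per term. The smoothed formula is one
    common variant, chosen here as an assumption."""
    n = len(docs)
    df = Counter()
    for text in docs:
        df.update(set(tokenize(text)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0
            for term, count in df.items()}


def vectorize(text, idf):
    """Sparse TF-IDF vector as a dict; terms unseen in the collection are dropped."""
    tf = Counter(tokenize(text))
    return {term: count * idf[term] for term, count in tf.items() if term in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Indices of documents with non-zero similarity, best match first."""
    idf = build_idf(docs)
    q = vectorize(query, idf)
    scored = [(cosine(q, vectorize(text, idf)), i) for i, text in enumerate(docs)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for score, i in scored if score > 0.0]


# Hypothetical sample collection.
DOCS = [
    "library catalog search",
    "probabilistic indexing and retrieval",
    "cooking pasta at home",
]

print(rank("indexing retrieval", DOCS))
```

Once text is a vector, "how relevant is this document to this query" becomes an arithmetic question, which is where the statistics and mathematics enter.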

We borrow ideas from physics too, and of course many mathematical concepts come into play.

Computer science as a whole is a mosaic of different disciplines.

3. Information retrieval is just for search engines

Search engines are a common example of an information retrieval system, but online library catalogs (OPACs), commercial databases like Web of Science (and many search engines), and even the entire WWW are all information retrieval systems.

4. Information retrieval’s biggest challenge is ranking documents

“Search is an unsolved problem. We have a good 90 to 95% of the solution, but there is a lot to go in the remaining 10%.” (Marissa Mayer)

She is quite right: we still have a deluge of work to do in this area. We have invented the wheel and we have hooked four of them onto a box. We don’t have a Ferrari Enzo yet.

Some of the biggest challenges still involve relevance feedback, information extraction, multimedia retrieval, effective retrieval, routing and filtering, interfaces and browsing, “Magic”, indexing and retrieval, distributed IR and integrated solutions.

The “Magic” issue (coined by Bruce Croft) concerns the vocabulary mismatch issues we have.
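To make vocabulary mismatch concrete, here is a minimal Python sketch: a document about an "automobile" is missed by an exact keyword match on "car", and found once the query is expanded with synonyms. The synonym table, the sample document and the `matches` function are hypothetical, and synonym expansion is just one simple mitigation picked for illustration, not something the article attributes to Croft.

```python
import re

# Hypothetical hand-made synonym table; a real system might use a thesaurus.
SYNONYMS = {
    "car": {"automobile", "vehicle"},
    "film": {"movie"},
}


def tokens(text):
    """Lowercased set of word tokens in the text."""
    return set(re.findall(r"\w+", text.lower()))


def matches(query, doc, expand=False):
    """True if every query term (or, when expand is set, one of its
    listed synonyms) occurs in the document."""
    doc_tokens = tokens(doc)
    for term in tokens(query):
        candidates = {term}
        if expand:
            candidates |= SYNONYMS.get(term, set())
        if not (candidates & doc_tokens):
            return False
    return True


DOC = "Used automobile for sale"

print(matches("car sale", DOC))               # exact terms only
print(matches("car sale", DOC, expand=True))  # with synonym expansion
```

The searcher and the author meant the same thing but used different words, so the exact match fails. A hand-made table like this one obviously does not scale, which hints at why the problem remains hard.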

There is a list of Grand challenges for IR which is published and presented every year. This is the latest document. (PDF)

5. Google pioneered information retrieval

Google did arguably make the most commercially successful information retrieval system, but they were not the first to launch into IR.

In fact no search engine was.

In 1945 Vannevar Bush’s As We May Think appeared in Atlantic Monthly, and in this article he described an information retrieval system. In the 1960s Gerard Salton created the SMART (System for the Mechanical Analysis and Retrieval of Text) Information Retrieval System at Cornell University. One of the first papers was Melvin Earl (Bill) Maron and J. L. Kuhns’ “On relevance, probabilistic indexing, and information retrieval” in Journal of the ACM in 1960. In 1963 the Weinberg report “Science, Government and Information” gave a full explanation of the issues concerning the “crisis of scientific information”: basically, we couldn’t manage the huge corpus that we had gathered throughout the centuries.

Karen Spärck Jones relentlessly researched computational linguistics and its application to IR at Cambridge from the 1960s onwards. J. W. Sammon pioneered the vector model in 1968, and in the 1970s NLM’s AIM-TWX and MEDLINE became the first ever online IR systems. Around the same time, Theodor Nelson started introducing hypertext.

Source: Written by Marie-Claire Jenkins and published on the Search Engine People site.

