Friday, October 1, 2010

What can and what cannot searching find?

Nowadays a search can return millions of pages. The query "Capital of France" returns some 100 million pages instead of one word. Imagine a student answering this question at school by guessing at millions of words, or by pointing to millions of books where the answer might be found. Search engines seem to be exhausting their capabilities, and we need something more sophisticated than a guess based on complex mathematical and statistical methods applied to the words of a flat database of millions of pages. Consider the following examples:

Example 1. Queries.

- The query "Opera" returns links to (a) the browser, (b) articles about opera as an art form, (c) specific opera theatres, and (d) the "Opera" hotel. In practice, though, we rarely want every meaning of a given word.
- The queries "European Union" and "EU" return different results, which even include "Delegation of the European Union to the USA".
- "set icon to application" returns less accurate results than "change icon of application", and "application change icon" returns every possible result with no attempt to refine the query.
- The query "England Germany" mostly returns references to the match at the FIFA World Cup 2010.

Example 2. http://www.bbc.co.uk/nature/species/Common_Goldeneye

The page uses several (arbitrary) classifications:
1. In the URL: Nature -> Species -> Common Goldeneye
2. On the page: Wildlife Finder -> Animals -> Goldeneye
3. Animals -> Vertebrates -> Birds -> Anseriformes -> Ducks, geese and swans -> Bucephala -> Goldeneye
As you can see, these classifications use shortcuts (e.g., Animals -> Goldeneye instead of every link in between, as in item 3) and are generally arbitrary: we could devise more, such as Nature -> Birds -> Goldeneye or Birds -> Goldeneye.
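
The notion of a shortcut can be made concrete with a minimal Python sketch. It assumes a classification is simply an ordered list of names; the function name is_shortcut is hypothetical and introduced only for illustration.

```python
def is_shortcut(short_path, full_path):
    """True if short_path keeps the order of full_path but may skip steps."""
    remaining = iter(full_path)
    # `step in remaining` consumes the iterator, so the order is respected.
    return all(step in remaining for step in short_path)


# Classification 3 from the list above.
FULL_PATH = [
    "Animals", "Vertebrates", "Birds", "Anseriformes",
    "Ducks, geese and swans", "Bucephala", "Goldeneye",
]

# Shortcuts of the full path, and the URL classification, which is not one.
animals_shortcut = is_shortcut(["Animals", "Goldeneye"], FULL_PATH)
birds_shortcut = is_shortcut(["Birds", "Goldeneye"], FULL_PATH)
url_path = is_shortcut(["Nature", "Species", "Common Goldeneye"], FULL_PATH)
```

Under this assumption "Animals -> Goldeneye" and "Birds -> Goldeneye" are shortcuts of the full path, while the URL classification is a different hierarchy altogether, which is one way to see how arbitrary the choices are.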

Abstraction is no less arbitrary:
1. The HTML title tag is "BBC - Wildlife Finder - Goldeneye (videos, sound files, facts, photos and news stories)"
2. The meta description is "Watch Goldeneye video clips from the BBC's natural history archive and learn about how and where they live."
3. The visible page title is "Goldeneye (Bucephala clangula)"

Example 3. http://www.bbc.co.uk/news/business-11256738

The main page presents this article as "China imports in surprise surge". On the article page itself the title reads "China imports in surprise August surge". The description in the HTML is: "China reported a surprise surge in imports during August, leading to a fall in its trade surplus to $20bn."

The title on the main page is abridged (evidently for design reasons). The article page represents the content better, but important details are still missing: (1) that the imports concern the Chinese economy in general, (2) that August 2010 is implied, and so on. Also, the pieces of content are not linked by explicit relations, so it is not evident to search engines that "fall in trade surplus" relates to "China". As a result, the query "China surge august" leads to this page, whereas "China fall in trade surplus" does not.

Example 4. http://edition.cnn.com/2010/BUSINESS/09/10/tequila.mexico.china/index.html?hpt=Sbin#fbid=qHeQ5geks3w&wom=false

The title is "Will China say 'salud!' to tequila?", whereas the description in the HTML is "When Armando Rojas and a group of tequila entrepreneurs founded a boutique distillery for premium tequila in 2008, their sights were set on high-rolling neighbors to the north to snatch up their exclusive $70-a-bottle brand."

This is a good example of a title that tries to attract the reader but does not reflect the content (and the brief summary is not visible on the page at all).

Example 5. http://www.euronews.net/2010/09/09/oecd-warns-of-slowing-global-recovery/

The title is "OECD warns of slowing global recovery", the title in the HTML is "OECD warns of slowing global recovery - OECD : business, economy | euronews", the description in the HTML is "OECD - OECD warns of slowing global recovery. euronews : International and European business news, available as video on demand", the HTML keywords are "OECD", and the tags are "Economy, Growth, OECD".

The HTML title includes the name of the site, and the description is littered even more: "euronews : International and European business news, available as video on demand". The keyword relates to the content only partially (the page is more likely to be sought by users interested in the economy or the current recession). The tags are arbitrary: (1) "Economy" is an example of classification, but why not "Global Economy", which is precisely what the article is about? (2) "Growth" is not adequate here: if you follow the link you find pages such as "Growth: UN confims global recession", "German businesses less confident", "Slowdown of global growth in wages", "US economy grows for third quarter in a row", "Shell predicts new growth after earnings rise". You understand intuitively why the tag is used here, but why this word in particular? How does it relate to the other pages? Is it about the current recession or about economic growth in general?
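
To see what a program actually has to work with, here is a minimal Python sketch that pulls the title, description, and keywords out of a page with the standard html.parser module. The markup in SAMPLE_HTML is hypothetical: it is assembled from the strings quoted above, and the real page may be structured differently. The names SummaryFields and summary_fields are likewise introduced only for this illustration.

```python
from html.parser import HTMLParser

# Hypothetical markup built from the strings quoted in the text.
SAMPLE_HTML = """<html><head>
<title>OECD warns of slowing global recovery - OECD : business, economy | euronews</title>
<meta name="description" content="OECD - OECD warns of slowing global recovery. euronews : International and European business news, available as video on demand">
<meta name="keywords" content="OECD">
</head><body><h1>OECD warns of slowing global recovery</h1></body></html>"""


class SummaryFields(HTMLParser):
    """Collects the <title> text and the description/keywords <meta> contents."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") in ("description", "keywords"):
            self.fields[attrs["name"]] = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so concatenate it.
        if self._in_title:
            self.fields["title"] = self.fields.get("title", "") + data


def summary_fields(html):
    """Return a dict with whichever of title/description/keywords were found."""
    parser = SummaryFields()
    parser.feed(html)
    return parser.fields


fields = summary_fields(SAMPLE_HTML)
```

All three extracted fields are plain strings. Nothing in them tells a program that the site name is noise or that "OECD" is only part of the topic.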

As you can see, we have many problems with search engines, which:
- do not try to establish what we mean ("Opera" as a browser or as an art form);
- do not understand all synonyms (such as "EU" and "European Union");
- look for words rather than meaning (the query is about the EU itself, not about everything related to the EU);
- do not understand complex relations, especially when a query includes words that do not usually occur in one phrase (as with the queries about application icons);
- give current events more relevance by default (because news and discussions mention them more often), although not everyone is interested in football (relations between England and Germany are evidently not limited to a single sporting event);
- let you find other information by adding more words or using related searches, but only at the cost of additional time and clicks (one UI rule of thumb says a goal should be reachable within 3 clicks).

In general, contemporary search is satisfactory when we look for:

- The definition of a term unknown to us (there are no complex relations, and the word usually coincides with the title of a page in a dictionary or an encyclopedia).
- Typical information (covered by many pages that use different synonyms).

Attempts to classify information have no fewer problems:
- most classifications are arbitrary;
- abstraction (a brief summary of a page) is arbitrary too, and its definition comes from multiple sources;
- titles often consist of metaphors, and all the other elements are misused and littered with keywords, so they may give no idea of the content;
- all elements carry a lot of implicit meaning, which search engines cannot extract;
- there is no general understanding of how to use titles, descriptions, keywords, and tags correctly.

Finally, we have problems with URLs:
- they are difficult to remember;
- the path inside a URL should classify a resource (like a directory path in a file system), but it has all the shortcomings of any classification, as described above.

As a result, search engines cannot search for information adequately. Full analysis of natural language could resolve all these problems, but we cannot even predict when it will be available. The Semantic Web tries to solve similar tasks; however, it is oriented towards ordered data, applications, and experts (who can deal with its standards). But what about unordered data (for example, users' Web pages) and ordinary users? So far there is no solution that organizes information appropriately.

As you can see, we have several levels of abstraction:
- computer identifiers (e.g., URLs, which carry the implications of domain and file systems), which uniquely identify resources containing information;
- search systems, which establish correspondences between natural-language words and computer identifiers (more efficient for large volumes of unknown information, as on the Internet), and classification and association systems, which establish correspondences between a hierarchy or path of words and computer identifiers or natural-language words (more efficient for small volumes of known information, as on a local computer);
- natural language, which establishes correspondences between words and real-world things and abstract concepts.
All three levels have their own shortcomings:
- the use of computer identifiers is quite restricted (it works best with computer techniques such as copy-paste);
- linking ambiguous natural language to unambiguous computer identifiers is ambiguous by default;
- natural language is ambiguous in itself, if only because words map to the real world through abstraction, which reduces information.
Given this state of affairs, whatever search technology is used, we will always be limited by the ambiguities of natural language (plus the shortcomings of each lower level).

Can we avoid or alleviate the ambiguities of natural language? For that, information should be identified more precisely. At present, identification is often either implied or misused. Directory and file names are often misused (like "rep_32.doc"), information inside files can often be identified only with the application that processes them, and information represented as text cannot be identified fully because of the restrictions of natural language. The methods intended to help identification are misused too (titles use metaphors; classifications and keywords are arbitrary). The bottom line: identification is underrated.

More precise identification implies that:
- Query input should include term identification (in our example, we should identify whether "opera" is the browser, the art form, and so on). Ideally, the original pages should contain such identifiers, because a search engine cannot separate such terms by page context in all cases, whereas the creator of the information knows for sure whether "opera" is a browser or an art form.
- Synonyms and even similar words should be mutually replaceable, but the user should be told what the similarity criterion is. That is, identifiers should be translatable into each other (this could involve translation of synonyms, of identifiers in different languages, of similar things and concepts, and of subjective notions like "favourite book").
- Search results should be grouped by equality or similarity (for example, sorted by the different types of opera). Grouping, classifying, and associating information should be implicitly present in an identifier (that is, instead of defining an explicit hierarchy, the identifier should only imply such information).
- The page topic (which should summarize the content of a page) may not coincide with the title (which can use metaphors to draw attention). The topic of a resource should have higher priority in searching than the set of words or objects in the resource. That is, a page about the EU should have higher priority than a page where the EU is merely mentioned. If a query and a page topic are equal, the search can be considered complete.
- To understand complex relations, search engines should understand meaning (the things, concepts, and relations behind words), not just words. Some frequently used complex relations should have identifiers too (for example, "world in 1800" or "final of FIFA World Cup 1966" identify one layer of history and one event respectively).
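
Two of the points above, identifying the query term and ranking by topic before mention, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the CANDIDATES table, the Page record, the example.org URLs, and the function names are all hypothetical and not part of any existing search engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    url: str
    topic: str                        # the identifier the page is about
    mentions: frozenset = frozenset() # identifiers the page merely mentions


# Hypothetical table from plain query words to candidate identifiers.
CANDIDATES = {
    "opera": ["Opera (browser)", "Opera (art)", "Opera (theatre)", "Opera (hotel)"],
    "eu": ["European Union"],
    "european union": ["European Union"],
}


def candidates(query):
    """Identifiers the user could mean; more than one means 'ask the user'."""
    return CANDIDATES.get(query.strip().lower(), [])


def rank(identifier, pages):
    """Pages whose topic is the identifier come first, then pages mentioning it."""
    on_topic = [p for p in pages if p.topic == identifier]
    mentioning = [
        p for p in pages
        if p.topic != identifier and identifier in p.mentions
    ]
    return on_topic + mentioning


# Hypothetical pages, already tagged with identifiers by their creators.
PAGES = [
    Page("https://example.org/eu-delegation-usa",
         "Delegation of the European Union to the USA",
         frozenset({"European Union"})),
    Page("https://example.org/eu", "European Union"),
    Page("https://example.org/opera-browser", "Opera (browser)"),
]

ranked = rank("European Union", PAGES)
```

Here "EU" and "European Union" resolve to the same identifier, "Opera" yields four candidates to choose from, and the page about the EU is ranked ahead of the page that only mentions it.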

To have this we need a world-wide approach to identification (unfortunately, it will not work if implemented in just one application), because:
- Identifiers should be shared (currently available applications let you define identifiers only within their own boundaries or within the boundaries of a user's system). Once defined, an identifier should be accepted everywhere: a document marked as "opera (browser)" will mean the same on any computer and in any language.
- Translation should be shared (because we cannot predict all possible translations) and delegated (because one system can maintain only a part of all possible translations of identifiers).
- The classifying and associating information implied by an identifier should be shared and delegated too.
- Identifiers should be supported by the UI (ideally with native support from the OS or browser). Only then can identifiers be easy to use.
- Support for contexts (as filtering) is necessary.
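
Shared and delegated translation could look roughly like the following Python sketch. It assumes each system keeps only its own partial table and forwards unknown lookups to other systems; the Registry class, its method names, and the sample tables are hypothetical.

```python
class Registry:
    """One system's partial translation table, plus systems it delegates to."""

    def __init__(self, name, translations, delegates=()):
        self.name = name
        # Maps (lower-cased word, language code) to a shared identifier.
        self.translations = dict(translations)
        self.delegates = list(delegates)

    def resolve(self, word, language, _seen=None):
        """Return the identifier for a word, asking delegates if needed."""
        seen = _seen if _seen is not None else set()
        if self.name in seen:      # guard against delegation cycles
            return None
        seen.add(self.name)
        key = (word.lower(), language)
        if key in self.translations:
            return self.translations[key]
        for delegate in self.delegates:
            found = delegate.resolve(word, language, seen)
            if found is not None:
                return found
        return None


# A registry that knows German words, and a local one that delegates to it.
german = Registry("german", {("oper", "de"): "Opera (art)"})
local = Registry("local", {("eu", "en"): "European Union"}, delegates=[german])

from_delegate = local.resolve("Oper", "de")
from_local = local.resolve("EU", "en")
unknown = local.resolve("tequila", "en")
```

The local registry answers "EU" itself and obtains "Opera (art)" for the German word "Oper" from the registry it delegates to, while a word nobody knows resolves to nothing.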

For example:
- "Opera (browser)" is the identifier for the Opera browser; "Opera (art)", "Opera (theatre)", and "Opera (hotel)" are identifiers for other entities. A specific opera theatre can have an identifier like "Opera (theatre)/London".
- "Opera (art)" could correspond to "Opera" as an English word and "Oper" as a German word.
- "Opera (browser)" can be a part of many classifications and associations: "Internet", "Surfing", "Computer", "Windows", "Linux", etc.
- If you create a document that relates to "Opera (browser)", it usually means that you have already used this identifier frequently. Therefore, instead of having millions of users work out what "opera" means in millions of requests, your responsibility as the creator is to find the correct identifier just once. This may take some time, but the advantage is that your information will be easier to find.
- To restrict information, we need filtering. For example, the context of "Opera (art)" should include only the information and entities (applications, objects, actions, etc.) that relate to this identifier.
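
The examples above can be sketched as data plus two small filters in Python. The sketch assumes that a "/" in an identifier means nesting (so "Opera (theatre)/London" falls under "Opera (theatre)"), which the text suggests but does not define. The document titles and function names are hypothetical.

```python
# Associations quoted in the example above; a real system would share
# and delegate this table rather than keep it in one place.
ASSOCIATIONS = {
    "Opera (browser)": {"Internet", "Surfing", "Computer", "Windows", "Linux"},
}

# Hypothetical documents, each tagged once by its creator with an identifier.
DOCUMENTS = [
    ("Release notes", "Opera (browser)"),
    ("Season programme", "Opera (theatre)/London"),
    ("History of the aria", "Opera (art)"),
]


def in_context(identifier, context):
    """True if identifier is the context itself or nested under it with '/'."""
    return identifier == context or identifier.startswith(context + "/")


def filter_by_context(documents, context):
    """Titles of the documents whose identifier belongs to the context."""
    return [title for title, identifier in documents
            if in_context(identifier, context)]


def associated_with(association):
    """Identifiers that list the given classification or association."""
    return sorted(name for name, groups in ASSOCIATIONS.items()
                  if association in groups)


theatre_docs = filter_by_context(DOCUMENTS, "Opera (theatre)")
art_docs = filter_by_context(DOCUMENTS, "Opera (art)")
```

Filtering by "Opera (theatre)" keeps only the London theatre's document, and filtering by "Opera (art)" keeps only the document about the art form. The browser never appears in either context.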

Resolving the ambiguities of natural language and abstraction is a sine qua non for searching. Unfortunately, today's search engines only try to enhance their search algorithms instead of addressing the root cause of failed searches.