Thursday, 30 December 2010

How to solve the problems of contemporary search?

ABSTRACT

The article considers the problems of contemporary search and information organization, examines their causes, and proposes a solution to them.


WHAT CAN SEARCH DO TODAY?

Today a search can find millions of pages of content, but the correct results arrive mixed with unnecessary ones. A "France capital" query can retrieve 190 million pages where a single word would do. Search engines already try to give this one word as the first result, but this does not work in all languages, and an "Aragon capital" query does not work at all. There are many simple questions that modern search engines cannot answer. So far users have reconciled themselves to this state of affairs, but there are many reasons to want something more than a guess based on complex mathematical and statistical methods for finding words in a flat database of billions of pages. The problem lies not only in searching itself but also in how information is organized, as the following examples show:

1. Search

- An "Opera" query gives links to (a) a browser, (b) opera as an art form, (c) opera theatres, (d) hotels
- "European Union" and "EU" give different results and include information not only about the EU but also about related topics such as the European Commission
- A "Germany Spain" query gives references to the FIFA World Cup 2010 match, although not everyone is interested in association football (and relations between Germany and Spain are not limited to this match even within football)
- A "German Spain 2010 football" query gives links not only to the match but to the FIFA World Cup 2010 in general (though this query is more specific)
- "How to go to Aragon" gives many unrelated references
- An "Aragon" query is ambiguous in itself, because we may want to (a) identify the word, because we do not know what Aragon means, or (b) get more details about it, whereas a search fetches a lot of unsorted information that tries to cover every possible goal of the request
- An "Aragon Navarra" query is ambiguous in itself, because we may want to know (a) how these words are generalized, eg, both are parts of contemporary Spain, or (b) how these words are linked, eg, what relations exist between Aragon and Navarra now or in the past, or how to get from Aragon to Navarra, etc
- "Aragon Spain" and "Aragon Saragossa" queries are ambiguous in themselves, because they combine concepts where one is a part of the other

2. Search help

Look at the advice search engines give to users:
- Keep it simple
- Think how the page you are looking for will be written
- Describe what you need with as few terms as possible
- Choose descriptive words
In essence, all of this is correct. But how far it is from real life! Nolens volens, human beings answer complex queries without predicting which result they will hear and without thinking about how many and which words should be used in the description. Moreover, advanced search advice recommends a special query language, which most users will hardly ever apply. The very existence of answer sites proves that search engines have problems. It also proves, though, that many users do not want to make their own lives more difficult and do not even intend to read search help (many questions on these sites can be answered with even the simplest queries).

3. Generalization, keywords

http://www.bbc.co.uk/news/business-11256738

The page is presented as "China imports in surprise surge" on the main page. Inside we see "China imports in surprise August surge". Inside the HTML, the description is: "China reported a surprise surge in imports during August, leading to a fall in its trade surplus to $20bn."

The title on the main page is abridged (evidently for design reasons). The internal title is better, but important details are missing: "August" means August 2010. Also, the content is not linked by relations, so it is not evident to search engines that "fall in trade surplus" relates to "China". As a result, "China surge august" leads to this page, whereas "China fall in trade surplus" does not.

http://edition.cnn.com/2010/BUSINESS/09/10/tequila.mexico.china/index.html?hpt=Sbin#fbid=qHeQ5geks3w&wom=false

The title is "Will China say 'salud!' to tequila?", whereas the description in the HTML is "When Armando Rojas and a group of tequila entrepreneurs founded a boutique distillery for premium tequila in 2008, their sights were set on high-rolling neighbors to the north to snatch up their exclusive $70-a-bottle brand." This is a good example of a title that tries to attract you but does not reflect the content (and a brief summary of the content is not visible at all).

http://www.euronews.net/2010/09/09/oecd-warns-of-slowing-global-recovery/

The title is "OECD warns of slowing global recovery", the title inside HTML is "OECD warns of slowing global recovery - OECD : business, economy | euronews", the description in HTML is "OECD - OECD warns of slowing global recovery. euronews : International and European business news, available as video on demand", HTML keywords are "OECD", and tags are "Economy, Growth, OECD".

The title includes the name of the site, but the description is even more cluttered: "euronews : International and European business news, available as video on demand". The tags are arbitrary: (1) "Economy" classifies this page, but whose economy is implied: the global one, the EU's? (2) "Growth" is not adequate here (if you follow the link of this tag you will find pages such as "Growth: UN confirms global recession", "Slowdown of global growth in wages", "Shell predicts new growth after earnings rise", which are ambiguous at the least). Why this word in particular? How does it relate to other pages? Is it about the current recession or about economic growth in general?

4. Classification

How information is placed on different portals can be seen in the example of one utility, which different sites file under different categories:
- System
- System information
- Software / Network utilities / Network monitoring
- Home > Windows > Network Tools > Network Information
- Home > Windows Software > Utilities & Operating Systems > System Utilities
- System > Power Tools

http://www.bbc.co.uk/nature/species/Common_Goldeneye

The page uses several (arbitrary) classifications:
1. Inside URL: Nature -> Species -> Common Goldeneye
2. At the page: Wildlife Finder -> Animals -> Goldeneye
3. Animals -> Vertebrates -> Birds -> Anseriformes -> Ducks, geese and swans -> Bucephala -> Goldeneye
As you can see, these classifications use shortcuts (eg, Animals -> Goldeneye instead of all the links between them as in item 3) and are generally arbitrary (we can devise more classifications: Nature -> Birds -> Goldeneye or Birds -> Goldeneye).
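
The relation between the full chain in item 3 and its shortcuts can be sketched in Python. This is only an illustration, not something the BBC site does: the function name `is_shortcut` is hypothetical, and the paths are copied from the example above.

```python
# Full classification chain from item 3 above.
FULL_PATH = [
    "Animals", "Vertebrates", "Birds", "Anseriformes",
    "Ducks, geese and swans", "Bucephala", "Goldeneye",
]

def is_shortcut(short_path, full_path):
    """Return True if short_path follows the order of full_path, possibly skipping steps."""
    remaining = iter(full_path)
    # `step in remaining` consumes the iterator up to the match,
    # so later steps can only be found further down the chain.
    return all(step in remaining for step in short_path)

print(is_shortcut(["Animals", "Goldeneye"], FULL_PATH))
print(is_shortcut(["Birds", "Goldeneye"], FULL_PATH))
print(is_shortcut(["Nature", "Birds", "Goldeneye"], FULL_PATH))
```

The first two paths are shortcuts of the full chain. The third is not, because "Nature" belongs to a different classification, which is exactly what makes such paths hard to reconcile automatically.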

5. Local information

We can save the page from the previous example as C:\Downloads\Animals\Goldeneye.html (though browsers propose a more interesting file name: "BBC - Wildlife Finder - Goldeneye facts, pictures & stunning videos"). Later you may find a video file, but if there is no space left on the C disk, you have to save it as D:\Video\Animals\Duck_geye.avi. How can you find both files later? Launch a local search? That makes sense once, but if you work with the files constantly, it would be good to reach them more easily. File libraries? They have their own restrictions: you have to create them manually, and as they grow you are forced to search again, only now within the libraries. The problem is that you have to repeat the same operation many times (whether it is a search or navigating a file path through directories or libraries).
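
One way to picture reaching both files without repeating a search is an index keyed by what the files are about rather than by where they are stored. The sketch below is an assumption made for illustration: the identifier `urn:bird:Goldeneye` and the helpers `register` and `locate` are hypothetical.

```python
from collections import defaultdict

# Maps a meaning identifier to every location that holds information about it.
index = defaultdict(list)

def register(identifier, path):
    """Remember that the file at `path` is about `identifier`."""
    if path not in index[identifier]:
        index[identifier].append(path)

def locate(identifier):
    """Return all known locations for `identifier`, in registration order."""
    return list(index.get(identifier, []))

register("urn:bird:Goldeneye", r"C:\Downloads\Animals\Goldeneye.html")
register("urn:bird:Goldeneye", r"D:\Video\Animals\Duck_geye.avi")

for path in locate("urn:bird:Goldeneye"):
    print(path)
```

Both files are reached through one identifier, regardless of disk, directory, or file name.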

6. XML, Semantic Web, microformats

The Semantic Web shifts the focus from the representation of information (which is the domain of hypertext) to its meaning (semantics). For example, person data can be represented as:


<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:contact="http://www.w3.org/2000/10/swap/pim/contact#">

<contact:Person rdf:about="http://www.w3.org/People/EM/contact#me">
<contact:fullName>Eric Miller</contact:fullName>
<contact:mailbox rdf:resource="mailto:em@w3.org"/>
<contact:personalTitle>Dr.</contact:personalTitle>
</contact:Person>

</rdf:RDF>


Of course, a human being will not use such a representation, which is evidently oriented toward machines. Microformats implement the same idea (working with meaning) using hypertext. For example, home phone data is represented as:


<span class="tel">
<span class="type">home</span>:
<span class="value">+1.415.555.1212</span>
</span>


However, there is another problem: to visualize phone data, a browser needs to process every microformat separately, which makes supporting them more difficult. Moreover, you need to have and know a specific format for each area of knowledge you use. This implies a new learning curve for each format. What for? Learning is good; learning something that could be avoided is not.
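
To show what "processing a microformat" means for a consumer, here is a minimal sketch that reads the tel snippet above with Python's standard `html.parser` module. The class name `TelParser` is hypothetical, and the code handles only this one shape of markup.

```python
from html.parser import HTMLParser

SNIPPET = """
<span class="tel">
<span class="type">home</span>:
<span class="value">+1.415.555.1212</span>
</span>
"""

class TelParser(HTMLParser):
    """Collects the text of the `type` and `value` spans of a tel microformat."""

    def __init__(self):
        super().__init__()
        self.current = None   # class of the span being read, if it is of interest
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("type", "value"):
            self.current = css_class

    def handle_data(self, data):
        if self.current and data.strip():
            self.fields[self.current] = data.strip()

    def handle_endtag(self, tag):
        # Any closing tag ends the field being read.
        self.current = None

parser = TelParser()
parser.feed(SNIPPET)
print(parser.fields)
```

Every other microformat would need its own class names and its own parsing logic, which is the support cost described above.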


PROBLEMS OF CONTEMPORARY SEARCH AND INFORMATION ORGANIZING

As you can see, we encounter many problems because search engines:
- look for words, not meaning (an "EU" query is about the EU specifically, not about everything related to the EU)
- do not understand all synonyms ("EU" and "European Union")
- do not clarify what you meant (is "opera" a browser or an art form?)
- do not understand query goals (do we want to identify a word or receive more details about it, etc)
- do not order results sufficiently (for example, by grouping similar information)
- give highly redundant results (millions of pages instead of one word)
- do not understand complex relations between words
- give more relevance to the latest events (only because they are mentioned more often in news and discussions), which may not fit what a user wants
- require additional time to make a query more precise (which often violates the GUI rule of thumb that any goal has to be attainable in 3 clicks)
- achieve a satisfactory result usually only because we search either for a term definition (there are no complex relations, and the words of the query simply coincide with a page title in a dictionary) or for typical information (which appears on many pages using many synonyms, which increases our chances of guessing them in our query)
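
The first two items can be reproduced with a toy word-matching search. This is a deliberately naive sketch, since real search engines are far more elaborate; the page titles and the function `word_match` are made up for illustration.

```python
# Three made-up page titles standing in for an index of billions of pages.
PAGES = [
    "European Union summit opens in Brussels",
    "EU agrees new budget",
    "European Commission publishes report",
]

def word_match(query, pages):
    """Return pages that contain every word of the query, ignoring case."""
    words = query.lower().split()
    return [p for p in pages if all(w in p.lower().split() for w in words)]

print(word_match("EU", PAGES))
print(word_match("European Union", PAGES))
print(word_match("European", PAGES))
```

"EU" and "European Union" each find a different single page although they mean the same thing, while "European" also pulls in the European Commission page, which is related but is not what was asked for.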

Search advice only proves that these problems are very real today:
- queries have to be simple, because search engines simply cannot handle complex ones
- you should guess the words of the answer, because search engines cannot handle similar terms
- the requirement to use as few words as possible sometimes makes sense in real life (brevity is the soul of wit) but is otherwise absurd: the more words are used, the more detailed the description is, eg, "car" describes an object worse than "red sports car"
- the advice to use descriptive words (ie, words that add something new to the content and do not overlap in meaning with other words) makes sense only if you understand what that means; and what prevents search engines from excluding non-descriptive words themselves?

But there are also problems of ordering and organizing information:
- identification of real-world objects is absent today (therefore the meaning of words is available only to human beings)
- classifications are arbitrary (is an article about "animals" or "vertebrates"?)
- some classifications (such as file directories) allow information to be ordered but do not prevent it from being classified incorrectly (eg, you can save music into a video directory), because they do not expose the criteria of classification (eg, if you do not know who or what Verdi is, you will not understand what the "Music/Opera/Verdi" path means)
- fixed classification criteria make it harder to find things again (you can place music into "Music/2010" but later forget which year the music belongs to) and harder to separate information (copying part of a directory tree is quite difficult, because it takes the involved branches as a whole when we need only parts of them, eg, when we need to copy three files from different branches while keeping their paths)
- existing classifications (including various taxonomies and ontologies) use many relations between their elements, which makes them difficult to use (and often they can be understood only by experts in semantics)
- generalizing and abstraction (usually represented by a title, ie, an abridged content) are arbitrary too: titles often include metaphors and other elements that only obstruct understanding of the content (eg, a title can include a site name, and you may not even suspect it is a site name rather than just a word); sometimes they are replaced by associations, keywords, and tags (eg, "OECD warns of slowing global recovery" is about the content, whereas "Growth" is just one ambiguous association out of millions)
- generalization is duplicated (and becomes partially contradictory) in the title of a reference, the title of a page, additional tags, keywords, etc
- a URL is hard to remember (because it can use abbreviations, and natural language separators are not used as expected)
- a link refers to one resource, whereas a natural language link (a word) is a generalizing link (the word "movie" can refer to any movie)
- not every object or event has a corresponding computer resource to refer to, and even if it has one, the resource can become outdated or be removed
- hypertext tightly integrates diverse information, which results in "content hell" (it is impossible to understand where the boundaries lie between different content, or between content and non-content)
- in many cases we work with plain text only because existing software can process either fully structured information or information with no structure at all
- in many cases information integration is absent, because we use text identifiers without linking them to meaning (ie, even if a file is named "Braveheart", that does not mean it is linked with the movie)
- information exchange is difficult because we need specific tools for it (email, instant messaging, sites)
- information usage is difficult because there are no simple ways to share classified information (ie, each user has to classify information anew) or to filter information (ie, each user has to filter information on each use)
- the problem of information access is not solved transparently, so it is either ignored altogether or burdened with so many obstacles that usage becomes difficult

The Semantic Web (its creators also call it the "Web of data") has its own shortcomings:
- it is oriented toward machine processing
- it considers only already ordered data and applications
- it is a universal data format for integrating large data sets and applications, but it has no universal representation for humans
- its standards can be handled only by experts
- identification of real-world objects is not fully covered (otherwise we would already be using it)
- there is no support for generalizing/detailing (one of the most important mechanisms of intelligence)
- semantics is represented with subject-predicate-object triplets, which ignores the richness of natural language
- the qualification problem (there are too many exceptions to any generalizing description) applies to it too, as does the problem of a large quantity of mutually linked facts
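
The triplet form mentioned in the list can be made concrete by reading the Eric Miller example from section 6 with Python's standard `xml.etree.ElementTree`. This is a minimal sketch that handles only the flat shape of that example (one described resource whose properties are literals or `rdf:resource` references), not RDF/XML in general; the function name `triples` is hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

DOCUMENT = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:contact="http://www.w3.org/2000/10/swap/pim/contact#">
<contact:Person rdf:about="http://www.w3.org/People/EM/contact#me">
<contact:fullName>Eric Miller</contact:fullName>
<contact:mailbox rdf:resource="mailto:em@w3.org"/>
<contact:personalTitle>Dr.</contact:personalTitle>
</contact:Person>
</rdf:RDF>"""

def triples(xml_text):
    """Yield (subject, predicate, object) for each property of each resource.

    The rdf:type triple implied by the contact:Person element is left out
    to keep the sketch short.
    """
    root = ET.fromstring(xml_text)
    for resource in root:
        subject = resource.get(f"{{{RDF_NS}}}about")
        for prop in resource:
            # ElementTree writes qualified names as "{namespace}local".
            predicate = prop.tag.replace("{", "").replace("}", "")
            obj = prop.get(f"{{{RDF_NS}}}resource") or prop.text
            yield subject, predicate, obj

for triple in triples(DOCUMENT):
    print(triple)
```

The output is three triples about one subject: a full name, a mailbox, and a personal title. Anything a sentence would express with tense, modality, or context has no place in this form.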


WHERE ARE THESE PROBLEMS FROM?

These shortcomings appear because a search treats any text as just a set of symbols (much as we treat a text in an unknown language). Naturally, without understanding meaning, you can only guess what users are searching for from word coincidences. Search advice and sophisticated mathematical methods can increase the probability of a successful coincidence, but no more than that. Only understanding can advance search to a really new level, which implies that search has to consider meaning, not text.

For example, what meaning does the movie "Braveheart" have?
- the movie as (1) a frame sequence, (2) a video tape, a disk, or a file, (3) actors, sets, scenes
- historic events, which occurred several centuries ago
- (1) a description of these events, (2) the interpretation of this description in the movie, (3) a description of how the movie was shot
- ideas, thoughts, feelings, emotions, which (1) contributed to the interpretation, (2) were experienced by the actors, (3) were experienced by spectators

This meaning corresponds to the following identifiers:
- the "Braveheart" title, which has to uniquely identify the movie (though this title is not unique, because there is a 1925 movie of the same name, so the unique title should be "Braveheart(1995)")
- the words "movie" or "historical movie", which underline the similarity of the movie to other movies, generalizing its traits and ignoring details
- words which describe the movie plot
- words which describe our feelings about and impressions of the movie
- shots (which are the movie itself), which can be considered a description too (pictures are a visual language)

In general, the concept of meaning includes the following items:
- objects and events of the real world (here, the historic events)
- to reduce the complexity of objects and events to a level appropriate for perception, the subject transforms them into symbols and identifiers (visual and other representations, letters, words, digits, etc)
- identifiers separated from the real world (because now they are inside the subject) start to live on their own and form a system of identifiers, an abstract world with its own objects and events (here, a description of the historic events and their interpretation in the movie, as well as the ideas, thoughts, and feelings of the movie itself and of the audience)
- identifiers are used as links to objects ("William Wallace"), to actions ("The Battle of Stirling Bridge"), to relations (ie, objects of the abstract world: "father of William Wallace" is a relation of kinship, which cannot be perceived directly, because the fact of kinship can only be registered or proven by DNA analysis), to time relations (ie, actions in the abstract world: the movie itself and the phrase "if William Wallace had not been executed" are not part of reality), and to mixed structures ("I have seen the Braveheart movie", where "I" and "have seen" are parts of reality, whereas the movie consists of relational structures)
- in the general case, the more identifiers are used, the more concrete (detailed) a description is, and the fewer, the more abstract (generalized) it is (eg, "Braveheart, cinema M, city N, 20:00, October 26, 2010" refers to a show of a specific movie, "movie in October 2010" refers to any movie shown in October 2010, "movie" refers to any movie, though "I saw a movie yesterday" refers to a specific show)
- to reduce the number of identifiers ("a man who fought for Scottish independence in the 13th century") we can use a unique identifier ("William Wallace"); unique identifiers can form their own systems (a system of personal names, of country names, databases, etc)
- because a system of identifiers is independent of the real world (we should keep it verifiable, but do not always do so), the meaning of an identifier can depend on circumstances, objects, and events (the meaning of "I was in the cinema" depends on where and when the phrase was said), on a subject and his, her, or its representations ("I was in the cinema" depends on why and for what purpose the phrase was said), and on the system of identifiers itself
- translation of identifiers from one system to another (from representation to language, from language to language, between identifiers of one language) may be based on identity (if there is a precise correspondence between identifiers, as between "William Wallace" and "Uilleam Uallas") or on similarity, that is, partial identity (as between "William Wallace" and "hero of Scotland", which share some criteria but not others)
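
The rule that more identifiers give a more concrete description can be sketched as set containment. The shows below are made-up data built around the cinema example, and the function `matches` is hypothetical.

```python
# Each show is described by a set of identifiers (made-up data).
SHOWS = [
    {"movie", "Braveheart", "cinema M", "city N", "2010-10-26 20:00"},
    {"movie", "Braveheart", "cinema K", "city N", "2010-10-27 18:00"},
    {"movie", "Gladiator", "cinema M", "city N", "2010-10-26 22:00"},
]

def matches(description, shows):
    """Return the shows whose identifiers include every identifier given."""
    return [show for show in shows if description <= show]

general = {"movie"}
narrower = {"movie", "Braveheart"}
concrete = {"movie", "Braveheart", "cinema M", "2010-10-26 20:00"}

for description in (general, narrower, concrete):
    print(sorted(description), "->", len(matches(description, SHOWS)))
```

With one identifier all three shows match, with two identifiers two shows match, and with four identifiers exactly one show remains.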

To cope with the existing problems, we should consider meaning more broadly:
1. To make meaning more precise, we have to take into account all the systems of identifiers in use. They include the text returned by a search but also the query itself. Today the query gets too little attention, even in search help. Search engines try to satisfy every direction of searching that is theoretically possible, leaving the choice to the user. This approach becomes a problem when complex queries are used, because in that case subqueries could be handled by a machine rather than a human being.
2. The interface has to be regarded as a part of semantics, because an interface is an abstraction of some content (just as a title abstracts a content). This assumes at least some synchronization between semantics and the interface itself (for example, if you are in a movie registration window, all queries initiated from this window should relate to movies). That is, the result of working inside some interface or with some semantics is a context (a base) for further work with information.
3. The problem of identification is underrated: instead of defining all the details of information (its associations, relations, etc), it is often enough to know the answer to "What is it?" Even an ordinary user can identify by the label which wine he or she bought, but only an expert knows which classes it belongs to, how it is linked with other objects, etc. Detailed information is derived from identification: if you know it is specifically the "Braveheart" of 1995, then who directed it, who played in it, and so on follow from that.
4. Local search is treated as an analogue of Internet search, whereas its task is "Where is it?" rather than "What is it?"; it is therefore inefficient to launch a search each time to find a lost file (whose content you may remember even better than its location on a disk).

Generally speaking, a human being always tries to deal with generalized and incompletely defined meaning. In fact, we always try to reduce the volume of information, because we do not have sufficient resources and time to describe a thing or event fully, and a reader does not always want to read an overly detailed description. Missing or ignored information can often be restored only by asking its source (eg, you cannot extract from "I was at the opera yesterday" who "I" is, when "yesterday" was, or which opera I attended). The bottom line: search fails because it cannot restore all the information.


HOW TO SOLVE PROBLEMS OF SEARCH AND INFORMATION ORGANIZING?

Can we resolve the contradiction between the intention to generalize information and the necessity of providing precise information? The opposition of detailing and generalizing derives from the nature of intelligence in general and of abstraction in particular: the more information we have, the more precisely we can describe something, but the more resources and time we need to do so. To describe something we can use one phrase or an entire book, because brevity (generalizing) is efficient in one situation and precision (detailing) in another. We have to balance the size of information against the speed of processing it (a typical dualism in programming, which is similar to Gödel's incompleteness theorem). We cannot prefer one side, so to resolve the contradiction we need to satisfy the needs of both sides and do detailing, generalizing, identifying, and connecting.

We can observe the same dualism in the development of computer technologies in general and the Web in particular. On the one hand, we deal with precise applications; on the other hand, we need to provide an interface for interacting with humans, who prefer flexibility and generalizing. This dualism manifests itself in two Webs: the regular one and the semantic one (Web of data). Originally, the regular Web was created specifically for humans, which is apparent because hypertext:
- links diverse content using quite a simple language (before that, more complex technologies and formats were used, whereas hypertext can be edited in a plain text editor)
- degrades gracefully (it works even when not fully loaded or not valid)
- allows ambiguity (words) and precision (concrete links) simultaneously.
Whereas the Semantic Web (and XML in particular) is oriented toward machines:
- it is a set of formats for the universal representation of data (which, in fact, is not far from the binary formats that existed decades earlier); this universality became more relevant with the appearance of the Internet and the need for compatibility between diverse data sources
- like any machine format, it must be precise and obey certain rules, which is not always acceptable to humans.

And again we may ask ourselves: can we combine the benefits of both approaches? On the one hand, we should simplify the use of semantics so that it is accessible even to inexperienced users. On the other hand, we should not lose the precision that is necessary for machines (and not only for them). On the third hand, we should be able to restore missing or ignored information. This implies that a new approach should:
- identify information as precisely as possible
- generalize/detail information and its context
- connect information into space-time-relation complexes.

Besides that, it implies that:
- contradictions in identifiers should be resolved as early as possible (ie, if we type "opera", we should decide which kind of opera we are looking for before the search is triggered)
- the use of identifiers should be unobtrusive (eg, a user should be able to use natural language for a generalized description, as he or she does in real life, but also be able to make the meaning more precise if needed)
- a semantic link, in the general case, should refer not only to an information resource (which is a particular case of an abstract concept, and which describes something) but to meaning (ie, to something that is described, or with the help of which we describe, and which can also include information resources)
- additional ordering information (such as classification, association, etc) should be implied, because it can easily be retrieved from an identifier
- identifiers may be translated, because humans use different systems of identifiers (languages) and synonyms
- identifiers may be delegated, because an identifier can change meaning depending on the subject that uses it, and because delegation could improve scaling and distribution (load delegation, subqueries, etc)
- generalizing information should have priority over detailing information (eg, an article titled "Aragon" has priority over information that merely mentions Aragon)
- it should be possible to integrate meaning with any information (files, text, etc) and to share it as needed
- the use of meaning should be simplified (integration with interface, filtering, contexts)

The compromise between precision and flexibility can be attained only at the junction of the technologies that represent them. With regard to meaning, these technologies are machine-oriented XML (the standards of the Semantic Web partially continue the idea of XML) and human-oriented HTML. Microformats already use this idea, with several "but"s: (a) they are formats oriented toward machines (humans are able to work with them, but only up to a certain level of complexity), (b) they are micro-formats, which means there is a problem of fragmentation and scattering, (c) they use the class, rel, rev, and title attributes, which already have another meaning and cannot provide all the necessary features. So how can we make a solution that is friendly to both humans and machines and has all the necessary abilities?


IDENTIFYING

Before manipulating something, we have to know what it is. That is why we need identifying and an identifying protocol. Why a protocol? Because natural language (which is the identifying protocol of real life) is ambiguous. And because the hypertext protocol sets up a correspondence between URLs and information resources, whereas we need a correspondence between identifiers themselves, or between an identifier and things or events of the real world, or concepts, etc. URN already exists and was created for exactly that purpose, but it is not mature enough, partially due to its own restrictions:
- it is regarded as the "name of the addressee", in contrast with URL, which is the "address", on the assumption that the two supplement each other
- it includes an NID (namespace identifier), which has to be registered (similarly to domains)
- it includes an NSS (namespace-specific string), which does not take into account how natural language identifiers are built
These restrictions restrain the use of URN because:
- in fact, URL is a particular case of URN: a URN can refer to anything, whereas a URL refers only to an information resource
- registration of NIDs can potentially become as troubled as the domain system (with all the collisions around domain names); registration in itself is of course necessary to avoid conflicts, but it could be customized, decentralized, and distributed
- an NSS is a hard-to-remember identifier (eg, urn:isbn:0451450523 for a book or urn:isan:0000-0000-9E59-0000-O-0000-0000-2 for a movie); instead, complex natural language identifiers should be used (and translated into more precise identifiers if necessary)

For example, our "opera" case can be represented with URNs as:
- urn:browser:Opera
- urn:art:opera
- urn:hotel:Opera (Madrid)
- urn:English:opera
- urn:German:Oper
Generally speaking, there is an infinite number of NIDs, including (a) personal ones like urn:JohnDoe:I, which links "I" with a specific person, (b) ones for an organization like urn:MyCompany:Building 1, Office 220 or urn:MyCompany:Project 1, Application, Settings window, (c) ones for an abstract world like a movie: urn:Braveheart:William Wallace, which indicates a character of this movie. They relate either to different systems of identifiers (languages, protocols, etc) or to subjects that have their own internal system of identifiers (a person, an organization, etc). Numbers are a special case, because they are an infinite sequence of identifiers that change according to specific rules. They also need a subject (which changes their meaning), which includes (1) the numeral system, (2) the separator format, (3) the unit of measurement.
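
How such identifiers could be taken apart and used to list the possible meanings of "opera" can be sketched as follows. The NIDs (`browser`, `art`, `hotel`, and so on) are the illustrative, unregistered namespaces from the list above, and the names `Urn`, `parse_urn`, and `candidates` are hypothetical.

```python
from typing import NamedTuple

class Urn(NamedTuple):
    nid: str   # namespace identifier: the "subject" that gives the name its meaning
    nss: str   # namespace-specific string: the name inside that namespace

def parse_urn(text):
    """Split 'urn:NID:NSS' into its parts; the NSS may itself contain colons."""
    scheme, nid, nss = text.split(":", 2)
    if scheme.lower() != "urn" or not nid or not nss:
        raise ValueError(f"not a URN: {text!r}")
    return Urn(nid, nss)

KNOWN = [parse_urn(u) for u in (
    "urn:browser:Opera",
    "urn:art:opera",
    "urn:hotel:Opera (Madrid)",
    "urn:English:opera",
    "urn:German:Oper",
)]

def candidates(word):
    """Return the namespaces in which `word` starts a known name, ignoring case."""
    return [u.nid for u in KNOWN if u.nss.lower().startswith(word.lower())]

print(candidates("opera"))
```

For "opera" the sketch lists the browser, art, hotel, and English namespaces, so the choice between them can be offered to the user before any search is run. The German "Oper" is not matched, which shows that translation between systems of identifiers needs separate handling.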

In essence, an NID is a subject within which an identifier receives a subjective association with some thing, event, or concept. However, identifiers can change meaning not only because of the subject but also due to the following factors:
- the previous context (eg, "Today I saw movie A and movie B. This movie is the best", where the word "movie" in the last sentence can have either of two values)
- time ("I have seen a movie", said yesterday and today, can refer to different movies)
- subject ("I have seen a movie", said by different people, can change the meaning of "movie")
- subjective time (expressed in modal verbs and different tenses: "I can see that movie" or "I would see that movie")
Moreover, these factors can overlap (a subject can provide meaning for identifiers inside some topic: urn:MyCompany:Domain:term) or be absent (if a term is unknown to us and we do not know where it belongs).


MEANING (ABSTRACTING, ASSOCIATING)

The main goal of identification is to make comparison precise. The simplest case is the comparison of two identifiers: if we have the query "Have you seen Braveheart?" and the text "I have seen Braveheart", where both "Braveheart" words refer to the same "urn:movie:Braveheart(1995)", we can compare them more precisely than is possible today. But queries and texts are not limited to identifiers. URNs provide a base for elementary identifiers, but they cannot encompass all possible combinations of them, let alone the more complex structures built from them, which balance brevity against accuracy. This is why we need hypertext.
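
The comparison of a query and a text by identifier rather than by word can be sketched like this. The representation of annotated text as (words, identifier) pairs is an assumption made for the illustration, not a format defined in this article, and "urn:movie:Braveheart(1925)" is a hypothetical identifier for the older movie.

```python
# Each token is (surface text, identifier or None). The annotation is assumed
# to exist already; producing it is outside the scope of this sketch.
QUERY = [("Have you seen", None), ("Braveheart", "urn:movie:Braveheart(1995)")]
TEXT_A = [("I have seen", None), ("Braveheart", "urn:movie:Braveheart(1995)")]
TEXT_B = [("I have seen", None), ("Braveheart", "urn:movie:Braveheart(1925)")]

def identifiers(tokens):
    """Return the set of identifiers attached to a token sequence."""
    return {urn for _, urn in tokens if urn is not None}

def same_subject(query, text):
    """True if the query and the text share at least one identifier."""
    return bool(identifiers(query) & identifiers(text))

print(same_subject(QUERY, TEXT_A))
print(same_subject(QUERY, TEXT_B))
```

Both texts contain the word "Braveheart", so a word match accepts both, but only the first one refers to the movie the query is about.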

A country can be represented in XML as:


<Country>
<Name>Bhutan</Name>
<Area>38.8 sq. km.</Area>
</Country>


Such a structure seems easy for a human (who knows English) to understand, but as the number of fields and the amount of information grow, deciphering it becomes more and more difficult. The other difficulty is field names. First, it would be better not to be limited to Latin letters; second, if a field name has two words, we have to combine them into one XML identifier (like "LastName") or replace the whitespace with another separator (like "Last_Name"); finally, field names sometimes turn into mere abbreviations (like "LName"). Could we represent the same structure in HTML?


<span s-id="urn:Country"> <!-- urn:* can define URN prefix by default for the given subject -->
<span s-id="Name" s-id="urn:country:Bhutan">Bhutan</span>
<span s-id="Area">38.8 sq. km.</span>
</span>


The same structure can be transformed into other structures, such as a table:


<table s-id="urn:Country">
<th s-id>Flag</th>
<th s-id>Name</th>
<th s-id="urn:Area:sq.km.">Area</th> <!-- unit of area can have own subspace of names -->
<tr s-id="urn:country:Bhutan" s-id="#bh"> <!-- #bh is the local identifier -->
<td><img src="bhutan.jpg" /></td>
<td>Bhutan</td>
<td>38.8</td>
</tr>
</table>


<p><span s-ref="#bh">Bhutan</span>, officially the Kingdom of Bhutan, is a small landlocked country in South Asia, located at the eastern end of the Himalayas and bordered to the south, east and west by the Republic of India and to the north by the People's Republic of China.</p>

where:
- s-ref="#bh" links the second mention of Bhutan to the local identifier
- nested tags naturally express generalization/detailing (Country -> Name -> Bhutan)
- the logic of correspondence between fields and values can differ between tags ("Name" can be a tag attribute or a separate tag)
- an empty attribute is necessary to avoid redundancy (in the example above, it means that the value inside the tag is used as s-id)

Or the same without using nested tags:
<s-of="#3" s-id="#1">Area</s-of> of <s-id="#2" s-id="country" /> <s-id="urn:country:Bhutan" s-id="#3" s-is="#2">Bhutan</s-id> is <s-of="#1">38.8 sq. km.</s-of>.

where:
s-of="#3" links "area" and "Bhutan"
s-is="#2" links "Bhutan" and "country"
s-of="#1" links "38.8 sq. km." and "area"

or
<s-id="#1" s-has="#4">Area</s-id> of <s-id="#2" s-id="country" /> of <s-id="urn:country:Bhutan" s-id="#3" s-is="#2" s-has="#1">Bhutan</s-id> is <s-id="#4">38,8 sq.km.</s-id>.

where:
s-has="#4" links "area" and "38,8 sq.km."
s-has="#1" links "Bhutan" and "area"

Why do we need these tags/attributes?
- s-id: associating identifiers of natural language and hypertext elements with other identifiers
- s-is: abstracting (in our case, "Bhutan" is abstracted as "country", but it could also be a summary of an article or a book)
- s-of: generalization as a part of the whole (eg, "area of Bhutan")
- s-has: detailing
- s-ref: associating
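To make the roles of these attributes more concrete, here is a minimal Python sketch of the flat (non-nested) example as a data structure. The `Node` class and the helper functions are hypothetical names introduced for illustration; the article does not prescribe any in-memory representation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One marked-up fragment: its text, its identifiers, and its links."""
    text: str
    ids: set
    links: dict = field(default_factory=dict)  # relation name -> target local id


# The sentence "Area of country Bhutan is 38.8 sq. km." from the flat example.
nodes = [
    Node("Area", {"#1"}, {"s-of": "#3"}),
    Node("", {"#2", "country"}),
    Node("Bhutan", {"urn:country:Bhutan", "#3"}, {"s-is": "#2"}),
    Node("38.8 sq. km.", set(), {"s-of": "#1"}),
]


def find(identifier):
    """Return the node that carries the given identifier."""
    for node in nodes:
        if identifier in node.ids:
            return node
    raise KeyError(identifier)


def describe(node):
    """List the links of a node as (relation, target label) pairs."""
    result = []
    for relation, target_id in node.links.items():
        target = find(target_id)
        # An empty tag has no text, so fall back to its non-local identifier.
        label = target.text or next(
            i for i in sorted(target.ids) if not i.startswith("#")
        )
        result.append((relation, label))
    return result


for node in nodes:
    print(node.text or "(empty)", describe(node))
```

In this sketch local identifiers such as "#1" only serve to connect fragments, while global identifiers such as "urn:country:Bhutan" carry the meaning.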

Should hypertext be completely marked up with semantic tags? In fact, no. Building on existing HTML allows us to add semantic tags only where necessary (first of all for the summary of a page, as the most important part of its content).

Are these tags enough? Here, a simple set of tags is proposed, which is enough for plain abstracting and associating. Compare natural language: similarly, it uses only whitespace and punctuation marks (a restricted set too) to express any kind of link between words.

But what about verbs, adjectives, and adverbs? A counter-question: do we need to know whether a word is a verb or a noun? After all, if a word is precisely identified, a computer can figure out by itself whether we are dealing with a verb or not (and often even without identification). Still, it is quite possible that a more precise word classification (and more semantic tags) should be introduced. Unfortunately, this topic is beyond the scope of this article. Briefly, the need for additional identification is clear from the following questions:
- Where is Bhutan? (the question relates specifically to space)
- When was Bhutan inhabited? (the question relates specifically to time)
- What color is the flag of Bhutan? (the question relates to characteristics of the flag)
- How to get to Bhutan? (the question relates to relations that fit getting from one place to another and that can be established between your home and Bhutan)
- Why was TV banned in Bhutan? (the question relates to the cause-and-effect sequence that led to the TV ban)
- Can I get to Bhutan? (the question relates to possibilities)
On the other hand, we can again notice that whether a word denotes space or is an adjective can be established from the identification of the word.

Do we need this system of semantic tags at all? After all, programming expresses the example above as: "Bhutan" is an object of the "Country" class, and "area" is a property (or a field), which can have some value (of some type). Semantic Web expresses this example as: "Bhutan" is a subject, "has area" is a predicate, and its value is an object. So why do we need yet another system of identifiers? The problem is that programming is not flexible enough to deal with transforming objects into properties and values and vice versa. For example, if you did not make "Bhutan" an object (having decided that a string value is enough), changing this later would take a lot of time and resources. Some will say everything should be an object; others will say it is a problem of design. But this is exactly the point: such considerations are good for an application that is optimized for some task. They are not flexible enough for meaning, where "Bhutan" can be linked with other meanings in any possible way. As for Semantic Web triples, they are not flexible enough either, and they have another shortcoming: they represent data as ternary associations, whereas there are a lot of binary ones, which are awkward to transform into ternary ones.


SEARCH

So, what changes for search? For example, what can the query "What is area of Bhutan?" find? Any question deals with an unknown, which can be represented with its own identifier: urn:_ (though explicit usage of urn:_ may be redundant, because the list of question words is restricted). Thus, this query can be represented as:
<s-id="urn:_" s-of="#1">What</s-id> is <s-id="#1" s-of="#2">area</s-id> of <s-id="urn:country:Bhutan" s-id="#2">Bhutan</s-id>?

whereas the phrase to be found is:
<s-of="#3" s-id="#1">Area</s-of> of <s-id="#2" s-id="country" /> <s-id="urn:country:Bhutan" s-id="#3" s-is="#2">Bhutan</s-id> is <s-of="#1">38.8 sq. km.</s-of>

As we can see, the query just has to find similar links (between "area" and "Bhutan"), ignore irrelevant ones (between "country" and "Bhutan"), and compare what should be found (the link between "area" and the unknown) with the text (the link between "area" and "38.8 sq. km.").
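The matching described above can be sketched in Python under some simplifying assumptions: links are stored as (source, relation, target) tuples, local identifiers such as "#1" are already replaced by what they stand for, and the word "area" is identified the same way in the query and in the text. The function name `answer` is hypothetical.

```python
UNKNOWN = "urn:_"

# Links are (source, relation, target) tuples. Local ids such as "#1" are
# assumed to be already replaced by the identifiers they stand for.
text_links = {
    ("area", "s-of", "urn:country:Bhutan"),
    ("urn:country:Bhutan", "s-is", "country"),  # irrelevant for this query
    ("38.8 sq. km.", "s-of", "area"),
}

query_links = [
    (UNKNOWN, "s-of", "area"),
    ("area", "s-of", "urn:country:Bhutan"),
]


def answer(query, text):
    """Return the values in the text that can stand in for the unknown."""
    known = [link for link in query if UNKNOWN not in (link[0], link[2])]
    # Every fully known link of the query must be present in the text.
    if not all(link in text for link in known):
        return []
    results = []
    for source, relation, target in query:
        if source == UNKNOWN:
            results += [s for (s, r, t) in text if r == relation and t == target]
        elif target == UNKNOWN:
            results += [t for (s, r, t) in text if r == relation and s == source]
    return sorted(results)


print(answer(query_links, text_links))
```

The link between "Bhutan" and "country" is present in the text but plays no role here, because the query never mentions it.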

Usage of urn:_ is not restricted to questions. For example, a service which provides country data can expose its content as a link with an unknown: <s-id="#1" s-of="#2">Area</s-id> of <s-id="#2" s-id="urn:country">country</s-id> <s-id="urn:_" s-of="#1" />. Then the query "What is area of Bhutan?" can use this service too, because the service has links between "area" and "country", and between the unknown and "area" (whereas we look for links between the unknown and "area", and between "area" and "Bhutan", but "Bhutan" is similar to "country").

But search is not only simple queries. More complex ones can include (a) similar or (b) implicit information. For example, "Is Bhutan a big country?" implies that a search has to find the value of Bhutan's area and compare it with the concept of a "big country", which is possible only by similarity, because there is no strict classification of countries by area. "Is it warm in Bhutan?" is even more vague. How can it relate to "The climate in Bhutan varies with altitude, from subtropical in the south to temperate in the highlands and polar-type climate, with year-round snow, in the north."? There is no direct and precise correspondence between "warm" and the words of the description. However, we can notice the similarity of "warm" and "subtropical", so we can answer while stating the similarity criterion explicitly: "It is warm in the south of Bhutan, which has a subtropical climate" (in real life, this answer can be extended with more facts).

If a query ("What is Bhutan?") gives more than one answer, then the basic operations have to be applied recursively to the answers:
- identifying ("Bhutan" as country)
- detailing (facts on Bhutan, its maps, etc)
- generalizing ("Bhutan and tourism", "Bhutan and mountains", etc)
- associating ("How to get to Bhutan", "How to get Bhutan visa", etc)

The way information is found has to concern information creators even more than before. Today it depends mainly on search engine algorithms, but the proposed solution relies on more precise and universal rules, which should be equivalent in any application. The tasks of an information creator have to include not only precise identification of content and association of facts, but also testing (whether the created information can be found with the corresponding queries). Classification, on the contrary, has to become less important, because classification is mainly a set of generalizations (which derive from an identifier itself). For example, if you have identified the word "Bhutan" as the specific country, you can easily infer that this word relates to geography, Earth, mountains, etc.


INFORMATION ORGANIZING

Problems of information organizing are no less important than those of search and meaning. In fact, they are mutually connected and should be resolved as a whole. These problems mainly concern information which can be structured as hypertext, but they also relate to the ways of interaction between information and information subjects (which search for and store it).

* How to search binary files?
A binary file has only a name (which is not structured, or, if you try to structure it, gets a format unknown to others, like "Braveheart_1995_Gibson") and so-called metadata (which does not exist for all kinds of information; moreover, the format of metadata is usually restricted). Because of that, we need another solution: hypertext should describe information and be attachable to a file. Thus, each copy of a file should also involve copying the attached file wrapper, which describes and identifies the file.

For that, we need one small step: a semantic form of hypertext. Possibly it is not a must, but it makes the goal of a hypertext more evident: either we use it to represent information (and use the body tag), or we use it for semantics. For example, a video stream may be represented as:


<semantics>
<s-id="urn:movie:Braveheart(1995)" s-has="urn:movies:trailer" />
</semantics>


A list of files (or URLs) may be represented as:


<semantics>
<s-id="urn:file://c:/movies/braveheart.html" />
<s-id="http://youtube.com/watch?v=vBXBtORI7pE" />
</semantics>


A similar representation may also be used for quoting (which is external markup, that is, markup made not by the information creator):


<semantics>
<s-id="http://mysite.com/blog/braveheart.html">
My opinion.
</s-id>
</semantics>


That is, if you work with a file which has such a wrapper, you don't need to identify it anew, and you don't need to think about where to save it or where to find it in the future (it will be integrated into your local system according to the identifying information). Instead of "c:/movies/braveheart.html" (where the file name and path guarantee uniqueness of this identifier inside the local system but not for an external one) you can use "urn:movie:Braveheart(1995)" as the globally unique identifier (which we use in real life too), as well as flexible classification information associated with this identifier (instead of a directory hierarchy, which represents a fixed classification).
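A minimal Python sketch of this idea, assuming a hypothetical in-memory `WrapperIndex` that stands in for the local system: wrappers are registered under the global identifier, and a file is later found by that identifier instead of by its path.

```python
from collections import defaultdict


class WrapperIndex:
    """Maps a global identifier to every place where a wrapped file lives."""

    def __init__(self):
        self._locations = defaultdict(set)

    def attach(self, urn, location):
        """Record that the file at `location` is described by `urn`."""
        self._locations[urn].add(location)

    def locate(self, urn):
        """Return all known locations for an identifier, sorted for stability."""
        return sorted(self._locations.get(urn, ()))


index = WrapperIndex()
index.attach("urn:movie:Braveheart(1995)", "c:/movies/braveheart.html")
index.attach("urn:movie:Braveheart(1995)", "http://youtube.com/watch?v=vBXBtORI7pE")

# The user asks for the movie, not for a path or a URL.
print(index.locate("urn:movie:Braveheart(1995)"))
```

A real system would have to persist the wrappers together with the files; the sketch only shows the lookup by identifier.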

Thus, all information has to form a net of connected facts (which can be stored either as hypertext or as binary files described with hypertext). And information fetching has to consist of setting boundaries and retrieving the content inside these boundaries (which can include both multiple files and only a part of a hypertext).

* How do information and information subjects interact?
Our possibilities for retrieving, transferring, and delivering information are artificially restricted by existing software. For example, if you cannot find (or reach) some information, you can do the following: (1) you ask your friend about it by email, (2) your friend searches local files, (3a) attaches them to a letter and sends it to you, or (3b) shares the files and sends you a link, (4) if a file is updated, the procedure is repeated. We are forced to do this because (1) we cannot retrieve information directly, (2) we have to know where information is, (3) versions and updates of information are available only if the corresponding functionality is provided by an application, (4) the way of information exchange is separated from the information itself.

To solve the problem of interaction between information and its subjects we need the following:
- represent computer resources (networks, computers themselves, applications, and even files) as information providers
- providers can use query routing
- providers can compare the version of information which another provider has, as well as update it and notify about changes.
By the way, communication subjects (like someone represented by an email address or an IM nickname) are information providers too (where communication protocol settings indicate the information source).

Correspondingly, in our example, instead of asking your friend and exchanging files, you just need to indicate a specific information provider. Then (a) your friend's system can provide the answer automatically, or (b) your friend can answer manually, or (c) if his or her computer is turned off, the communication method should be chosen to guarantee delivery even in such a case, for example, email.

Routing by itself has to include several aspects:
- translation of identifiers from one system to another (translation of words, correspondence of synonyms, etc)
- delegation of complex identifiers, when a part of identifier refers to other system of identifiers (eg, a hotel identifier can include a city identifier, because a hotel name may be not unique)
- delegation of subqueries (eg, a query about medieval Scotland movies may be separated into subqueries on geography, history, and art)
- delegation of correspondence between identifiers and resources, because at each level (employee - department - company - world) there are different correspondences (ie, "Project 1" means different things for an employee and outside world)
- delegation of calculations, to distribute queries between computer resources
- delegation of incomplete identifiers (you can use identifiers without NID, eg, for queries, when NID is just unknown).

That is, information providers can use quite complex routing (eg, you can use computers A and B as information providers for project 1, and computers A and C for project 2), which would be hidden from the user. For example, if you look for "Specification of project 1", a query can be routed behind the scenes as follows:
1. Look for information in the current application (if not found, route a query level up)
2. Look for information at the current computer
3. Look for information in the current domain (if not found, route a query to a provider of this domain)
4. Look for information at computer A
5. Look for information in application 1 at computer A
Finally, you can receive a document as \\computerA\app1\doc1.doc, even without knowing this specific path.
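The route above can be sketched in Python as a chain of providers. The `Provider` class and its `parent` and `delegates` fields are hypothetical; they only illustrate routing a query level up and then down to another provider, and they ignore translation of identifiers, versions, and network transport.

```python
class Provider:
    """An information provider that answers a query or routes it further."""

    def __init__(self, name, documents=None, parent=None, delegates=None):
        self.name = name
        self.documents = documents or {}
        self.parent = parent              # where to route a query "level up"
        self.delegates = delegates or []  # providers this one can route down to

    def search_down(self, query, trail):
        """Look locally, then in the providers this one delegates to."""
        trail.append(self.name)
        if query in self.documents:
            return self.documents[query]
        for delegate in self.delegates:
            found = delegate.search_down(query, trail)
            if found is not None:
                return found
        return None

    def route(self, query, trail):
        """Search here and below; if nothing is found, route a level up."""
        found = self.search_down(query, trail)
        if found is None and self.parent is not None:
            return self.parent.route(query, trail)
        return found


app1_at_a = Provider(
    "application 1 at computer A",
    {"Specification of project 1": r"\\computerA\app1\doc1.doc"},
)
computer_a = Provider("computer A", delegates=[app1_at_a])
domain = Provider("current domain", delegates=[computer_a])
current_computer = Provider("current computer", parent=domain)
current_app = Provider("current application", parent=current_computer)

trail = []
result = current_app.route("Specification of project 1", trail)
print(result)
print(" -> ".join(trail))
```

The user only names the information; the trail of providers that were asked is recorded here just to show that it follows the five steps listed above.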


ADVANCED USAGE

A man has a wolf, a goat, and a cabbage. He must cross a river with the two animals and the cabbage. There is a small rowing-boat, in which he can take only one thing with him at a time. If, however, the wolf and the goat are left alone, the wolf will eat the goat. If the goat and the cabbage are left alone, the goat will eat the cabbage. How can the man get across the river with the two animals and the cabbage?

What do we have here?
1. Objects: a man, a wolf, a goat, a cabbage, a boat.
2. A boat can cross a river.
3. A wolf eats a goat, a goat eats a cabbage.
4. Question: how can the man get across the river with the two animals and the cabbage?

Implied:
5. A river has two banks.
6. Crossing a river is moving from one bank to the other.
7. A boat is used for crossing a river, by moving someone or something into it and afterwards out of it.
8. "Take" means a man should move the wolf, the goat, and the cabbage by himself.
9. "Thing" means someone or something.
10. "At a time" means one crossing.
11. "Left" means someone or something performed some action and moved to another place, whereas something else is still here.
12. "Left alone" means "left by a man at one of the banks".
13. "Eat" means destruction of someone or something.

Is it possible to solve this problem by marking it up with semantic tags? The problem includes initial conditions (which we can identify directly) and implied conditions (which can be inferred by generalizing/detailing), and each condition is a set of identifiers linked to each other. That is, these conditions are seemingly covered by the operations described above. But is it possible to solve the problem with the resulting hypertext? From the point of view of this task, the solution is a "future" (as cause-and-effect sequences) relative to the task conditions. That is, based on the described situation and the possible actions, we should create actions which would "happen" in the small abstract world of this task. Does the creation of such a solution differ from other information? No, the creation of the task's "future" is based precisely on its conditions:
- a man takes a wolf in a boat, a goat eats a cabbage, a cabbage is destroyed -- a man cannot get across the river with the two animals and the cabbage, because we already have no cabbage
- a man takes a goat in a boat, a wolf does not eat a cabbage, a goat is at the second bank, etc
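Generating such "futures" from the conditions can be sketched in Python as a breadth-first search over the states of this small world. This is a conventional state-space search chosen for illustration, not a method taken from the article, and it encodes the conditions by hand instead of reading them from marked-up hypertext.

```python
from collections import deque

ITEMS = frozenset({"wolf", "goat", "cabbage"})
EATS = [("wolf", "goat"), ("goat", "cabbage")]  # condition 3 of the puzzle


def is_safe(bank):
    """A bank without the man is safe if nobody there can eat anything."""
    return not any(eater in bank and food in bank for eater, food in EATS)


def solve():
    """Return the shortest list of crossings as (cargo, destination bank)."""
    start = (ITEMS, "left")           # (items on the left bank, man's bank)
    goal = (frozenset(), "right")
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (left, man), path = queue.popleft()
        if (left, man) == goal:
            return path
        here = left if man == "left" else ITEMS - left
        # The man takes one thing at a time, or crosses alone (None).
        for cargo in sorted(here) + [None]:
            new_left = set(left)
            if cargo is not None:
                if man == "left":
                    new_left.discard(cargo)
                else:
                    new_left.add(cargo)
            new_left = frozenset(new_left)
            new_man = "right" if man == "left" else "left"
            # The bank the man has just left is the one without supervision.
            unattended = new_left if new_man == "right" else ITEMS - new_left
            if not is_safe(unattended):
                continue
            state = (new_left, new_man)
            if state not in seen:
                seen.add(state)
                queue.append((state, path + [(cargo, new_man)]))
    return None


for cargo, bank in solve():
    print(f"take {cargo or 'nothing'} to the {bank} bank")
```

Each discarded state corresponds to a "future" like the first one above, where something is eaten; the returned path is a "future" in which all conditions hold until the goal is reached.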


CONCLUSION

So, does the proposed solution solve the problems of search and information organizing? At least some of them. At the very least, it allows users to request information the way they need, not the way search help forces them to. Queries can be as complex as needed. We don't need to think about what the searched information would look like. Besides that, we can be sure we search for exactly what we wanted. This is a significant contrast with how search works today. Therefore only one thing remains: to start implementing it.
