Thursday, August 25, 2011

Simplified Semantic Web

The battle goes on, in heads and souls. While the whole world assumed Google could find everything, the search engine itself was being enhanced with social features, not least because machine search is not perfect. While social networks and Wikipedia show the potential of communities in the technology world, the Semantic Web (or "Web of data", which focuses on processing by intelligent agents) keeps advancing. The Web is constantly rolled over by waves of technical and social overconfidence. The conventional Web, with its personal pages, blogs, and social networks, is the refuge of social overconfidence; the Semantic Web is the refuge of the technical one.

Any overconfidence (bordering on blind faith) is a road to nowhere; the truth always lies somewhere in the middle. In fact, the processing of information by machines and by humans (like the worlds of data and text) are inseparable parts of a whole. Look at any application. Its lifecycle starts from requirements (written in natural language), which turn into design and implementation (in the form of data, models, and code), to be incarnated in a user interface and documentation (which, again, use natural language).

Even so, computer-oriented and human-oriented structures are kept in different layers of abstraction. Separation of layers is one of the main principles of development; it makes development more efficient but destroys the links between the layers of information. Afterwards, these links are emulated by manually synchronizing application components and documentation. As a result, we see numerous desynchronization problems between requirements ("what a user expects"), code ("what a developer does"), product ("what a user gets"), and so on.

There are even deeper causes of this separation. Natural language (like human thinking in general) usually generalizes information to reduce its volume and the time required for communication, whereas computer processing usually specifies everything as precisely as possible. The problem is that you cannot solve both tasks (generalization and specification) at the same time. In a sense, this has been known since the publication of Gödel's incompleteness theorems, which state, loosely put, that a sufficiently expressive formal system cannot be both complete and consistent at the same time. The same applies to information: it cannot be both ordered (by generalization) and complete (by specification) at once. Complete information includes all the signals we receive about some thing, but in this raw form it makes no sense. As soon as we start to order this information, we have to throw a part of it away, and thus we lose completeness.


Now look at data and natural language. Data formats define a set of rules that guarantees consistency (integrity) of data, but nobody can ensure completeness (moreover, data formats usually do not even aim at complete data and impose restrictive frames in advance). Natural language is based on a restricted set of rules applied to an almost infinite set of identifiers. It cannot safeguard consistency (you can often understand a grammatically incorrect sentence), but we usually expect completeness from it (completeness is not guaranteed by natural language itself, but it can be achieved with it).

Data formats are advantageous when they order homogeneous information with similar characteristics, which lets us distinguish generalized classes and properties of things. However, any generalized characteristic is always subjective, because we generalize according to some criteria or even arbitrarily. In doing so, we throw away information that is "unnecessary" from our point of view. For example, a database of countries may include the capital as just a string in a single field, whereas any capital deserves at least a book to describe it.

The first problem of data formats is borders. Any data format has to stop somewhere and draw borders (that is, "we define a capital as a string"), which makes it more efficient for machines to use, though we lose completeness. This is especially evident when formats are applied to heterogeneous information. In such cases, formats usually resort to so-called "custom fields", which use... natural language. Which proves that the problem is real and that we cannot order information as fast as we create it.
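To make the border problem concrete, here is a minimal sketch (the record layout and field names are my own illustration, not any real schema): a typed country record that reduces the capital to a single string and pushes everything that does not fit into a free-text "notes" field.

    # Hypothetical country record: the typed fields draw the borders,
    # the free-text "notes" field is the escape hatch back to natural language.
    from dataclasses import dataclass

    @dataclass
    class Country:
        name: str
        capital: str     # a book-sized topic squeezed into one string
        area_km2: int
        notes: str = ""  # the "custom field": natural language again

    india = Country(name="India", capital="New Delhi", area_km2=3_287_263,
                    notes="Everything the other fields had no room for.")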

The second problem is that data formats are monolithic: you can work with a format only if you can interpret it completely. For example, to retrieve a list of countries from a geographic database, you need to know the database format or have an application that knows it. This is coarse-grained information compatibility, as opposed to the fine-grained compatibility of natural language (if you don't know one word in a sentence, you need an interpretation only of that word).
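A tiny illustration of what "monolithic" means in practice, using a packed binary record (the layout itself is invented for the example): to read even one field, you must know the entire format.

    import struct

    # A record packed with a hypothetical layout: 16-byte name, 16-byte capital,
    # 4-byte area. Without the full format string, not a single field is readable.
    record = struct.pack("<16s16si", b"India", b"New Delhi", 3_287_263)

    name, capital, area = struct.unpack("<16s16si", record)
    print(name.rstrip(b"\x00").decode())  # "India"
    # Natural language degrades more gracefully: an unknown word still leaves
    # the rest of the sentence perfectly usable.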

However, despite these shortcomings, formats are efficient at ordering information in a way natural language cannot match: natural language does not order information so tightly, and it is usually redundant and full of ambiguities.


The same applies to search. The assumption that machine search can find everything is dubious, to say the least, because search (which is consistent) works with text (which is not). Natural language abounds with generalizations that ignore or omit a part of information. Can we resolve such ambiguities with context? More likely not than yes: natural language has questions precisely because context cannot help in every case. There are plenty of situations in which you cannot understand information completely, so you ask. Because of these restrictions, modern search succeeds only with simple or popular questions, but it fails as soon as we ask questions that link several entities ("Is India larger than Bhutan?").
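To see why such linked-entity questions are hard for text search but trivial over ordered data, here is a sketch (the mini "database" below is illustrative, not a real source):

    # Structured data answers "Is A larger than B?" directly.
    AREAS_KM2 = {"India": 3_287_263, "Bhutan": 38_394}

    def is_larger(a, b):
        """Compare two countries by area using ordered data."""
        return AREAS_KM2[a] > AREAS_KM2[b]

    print(is_larger("India", "Bhutan"))  # True
    # A text search engine can only return pages that mention both countries
    # and hope that one of them states the comparison explicitly.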

Can social search help here? Only partially, because it can cover only a fraction of the infinite questions we can ask. And don't forget that, unlike data, answers in natural language are unordered, so if you find an indirect answer, you have to reorder the information yourself, which takes additional time.


This duality of machine vs. human processing of information will remain with us for years to come. The reason is that machines are faster at processing, but they can process only what is consistent (which is not available in all areas of knowledge). Human capabilities are limited, but humans are better at ordering information. The two approaches complement rather than exclude each other.

If so, why are the conventional Web and the Semantic Web still separated? The latter is built on the good idea of providing semantics, but why should it concern only data and machines? We need semantics, here and now. And who said humans cannot deal with it? That may be true of its machine-oriented forms, but not of natural language. Quite possibly even an experienced user cannot work with complex classifications, taxonomies, and ontologies, but any human can easily identify things and concepts (otherwise he or she could not communicate in natural language).

There is an abyss between the worlds of text and data, and it is the result of machine-friendly design. We work with cryptic file names (or rather abbreviations); we have to remember file paths (if a computer is like a library, why is there no librarian who hands me a book on request instead of forcing me to remember the long path to it?); any user interface is merely a human-friendly patch that exposes application functions, and any question about its quirks is answered with "read the manual" (why should I read it if I need one function of this application once in a lifetime?).

Humans are underestimated. Information is created by humans and for humans, yet there are still no simple tools for meaning. The Semantic Web puts the accent on data (to be processed by intelligent agents) and on complex formats (to be created by experts). But what about an ordinary user? Unfortunately, even developers consider the Semantic Web too complex and cumbersome (which, by the way, also makes it too expensive).

Stop it. We need simpler forms of semantics. We need a simplified Semantic Web: an approach that mediates between the conventional Web and the Semantic Web, an approach that makes natural language more precise and data more human-friendly. It should be based on several simple ideas: rich human-friendly identification, a human-friendly representation of semantics, and a restricted set of human-friendly semantic rules.


1. The idea of human-friendly identification is simple: we should go beyond natural-language identifiers and make them precise and unambiguous. Identification is the basis of semantics: it answers the question "What is it?". All the rest (hierarchies, associations, taxonomies, ontologies) is derivable and only helps to order information. You may not know how your vehicle is classified, but you certainly know what its identifier is. The fact that your vehicle is an SUV is derivable from its identifier.

Human-friendly identification implies a simple and compact way of specifying a value. For example, the conventional Semantic Web proposes URIs, so to distinguish different meanings of the word "opera" we have to use URIs like http://en.wikipedia.org/wiki/Opera and http://en.wikipedia.org/wiki/Opera_(web_browser). The shortcoming of this approach is that we depend on computer resources (a site and a path), while humans usually know nothing about the source of an identifier (do you know where your favorite encyclopedia defines "opera"?). Instead, human-friendly identification should provide uniqueness of meaning, not uniqueness within a computer system.

That is, instead of the arbitrary URI http://en.wikipedia.org/wiki/Opera, we need identifiers that are close to natural language but more precise. For example, opera or Opera: any natural-language identifier can be made as precise as needed and without duplication. But how would even such an identifier be routed to a specific location or server? That is the responsibility of a semantic cloud, which would route identification requests, retrieve an identifier's derivables, and so on. The role of the computer is still important: it may hint which identifier is ambiguous and how to resolve it, but the result should be human-friendly.
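A toy sketch of how such identification might look (everything here is an assumption: the qualifiers in parentheses, the registry, and the resolve() function merely stand in for the semantic cloud):

    # Hypothetical registry of human-friendly identifiers: unique by meaning,
    # not by site or path.
    REGISTRY = {
        "opera (art form)": "a theatrical work combining music and drama",
        "Opera (web browser)": "a web browser developed by Opera Software",
    }

    def resolve(identifier):
        """Return the meaning of an identifier, or the candidates if it is ambiguous."""
        if identifier in REGISTRY:
            return REGISTRY[identifier]
        # The computer only hints at ambiguity; the human picks the meaning.
        return [key for key in REGISTRY if key.lower().startswith(identifier.lower())]

    print(resolve("opera"))  # ['opera (art form)', 'Opera (web browser)']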

2. The idea of a human-friendly representation of semantic relations is simple too: let's merge hypertext and semantics. It is quite a natural alliance: hypertext as a convenient form of representation, marked up with semantics. This way everyone can use and add meaning to text with zero reconfiguration (well, maybe not zero, but close to it).
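As a hedged sketch of what "hypertext marked up with semantics" could mean, here an ordinary link carries a meaning identifier instead of a server path (the "semantic:" scheme and the helper function are invented purely for illustration):

    def semantic_link(text, identifier):
        """Wrap a text fragment in a link that carries its meaning, not a location."""
        return f'<a href="semantic:{identifier}">{text}</a>'

    sentence = f'I listen to {semantic_link("opera", "opera (art form)")} at home.'
    print(sentence)
    # I listen to <a href="semantic:opera (art form)">opera</a> at home.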

3. A restricted set of rules is necessary because, unfortunately, there is no way to precisely draw borders between groups of meaning in a sentence. A human can understand the difference between "I like a movie about medieval Scotland" and "I like movies, medieval, and Scotland". The simplest rule, therefore, may simply allow identifiers to be linked into one group of meaning, as in the sketch below. An extended set of rules may add generalization, specification, and so on.
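Here is a sketch of that simplest rule; the curly-brace markup and the parse() helper are my own invention, chosen only to show how linking changes the groups of meaning:

    import re

    def parse(marked_text):
        """Extract groups of linked identifiers from {...} markup."""
        return re.findall(r"\{([^}]*)\}", marked_text)

    linked   = "I like a {movie about medieval Scotland}."
    unlinked = "I like {movies}, {medieval}, and {Scotland}."

    print(parse(linked))    # ['movie about medieval Scotland']  -> one meaning
    print(parse(unlinked))  # ['movies', 'medieval', 'Scotland'] -> three meanings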


Moreover, a simplified Semantic Web could enhance even the existing conventional Web, which has the following restrictions: (1) it links information only loosely (especially within a computer or a local network), (2) it is rarely used for binary data (the same goes for the Semantic Web, which proposes converting data into its own formats but has no simple way of semantifying already existing data), (3) it provides limited means for integrating heterogeneous applications and data, (4) creating and publishing hypertext is not easy enough.

Of course, you may say there are no such problems: why can't you describe a binary file with hypertext? First, hypertext and binary files are separate entities, even though a description makes no sense without the information it describes. Second, a hypertext editor and a Web server are not a given on every computer (especially the latter, which has to be correctly configured and operated). Third, links in hypertext differ depending on whether they are used inside or outside a Web server (precisely because they depend on computer entities like sites and paths).

What can a simplified Semantic Web do? (1) Binary files may have semantic wrappers, coupled somehow with the files themselves (and maintaining the integrity of meaning), (2) semantifying through intuitive identification is simpler than hypertext editing, (3) identification should cover all entities of both the real and the computer world, (4) as soon as identification is not tied to computer-oriented resources, publishing is facilitated too (because a fact can be represented as plain text with fragments of semantic markup). A sketch of such a wrapper follows.
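Point (1) might look like this: a sidecar description kept next to a binary file, with a hash to keep the coupling honest (the wrapper layout and the ".meaning.json" suffix are assumptions of mine, not an existing convention):

    import hashlib, json, pathlib

    def wrap(path, identifiers):
        """Write a sidecar wrapper describing `path` with human-friendly identifiers."""
        data = pathlib.Path(path).read_bytes()
        wrapper = {
            "file": path,
            "sha256": hashlib.sha256(data).hexdigest(),  # integrity of meaning
            "identifiers": identifiers,                  # what the file is about
        }
        out = pathlib.Path(path + ".meaning.json")
        out.write_text(json.dumps(wrapper, indent=2))
        return out

    # wrap("holiday.jpg", ["photo", "Scotland", "Edinburgh Castle"])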


Of course, both the human-friendly and the computer-friendly Web remain necessary. But a bridge between them is inevitable too, one way or another.