A simple, prima facie argument in favor of the Semantic Web

Friday, 26 April 2002



I do a bit with so-called Semantic Web technologies (OK, I've written a couple of articles, have a book proposal in the works, and am about to start a job as a Semantic Web researcher ... as I said, "a bit"), but I must confess to never really getting certain aspects of it. I like logic programming, I'm certainly interested in knowledge representation, and I do a bunch of web stuff, so I must be a Semantic Web person. However, some bit never clicked for me; some key shared assumption left me feeling a bit out of the flow of things. I used to put this down to having more of a logician/philosopher background, but that didn't seem quite right. During the recent Google and SOAP furor, I had a little insight that led to the following prima facie argument for the Semantic Web. I hope it helps other people "get it".

The Semantic Web

For the purpose of this argument, I consider the key feature of the Semantic Web to be the use of URIs as differentiated terms. URIs aren't just untyped pointers to web pages with (perhaps) server-specific internal semantics. Rather, they can play different grammatical roles (the canonical subject-verb-object trio), and they are used in a very rich system of typed links. The "verb" in an RDF triple is a specific sort of "hypertextual" link between the subject and the object.
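To make the contrast concrete, here is a minimal Python sketch of an untyped link next to an RDF-style triple. Everything in it is an assumption made for illustration: the `example.org` URIs, the predicate names (`author`, `cites`, `disagreesWith`, `specializes`), and the helper functions are hypothetical, and no real RDF library or vocabulary is being used.

```python
from typing import NamedTuple

class Link(NamedTuple):
    """An ordinary Web link: one page points at another, and that is all."""
    source: str
    target: str

class Triple(NamedTuple):
    """An RDF-style statement: the predicate URI says what kind of link it is."""
    subject: str
    predicate: str
    obj: str

EX = "http://example.org/"          # hypothetical namespace
CITES = EX + "terms/cites"          # hypothetical predicate URI

triples = [
    Triple(EX + "essay", EX + "terms/author", EX + "people/bijan"),
    Triple(EX + "essay", CITES, EX + "pagerank-paper"),
    Triple(EX + "review", EX + "terms/disagreesWith", EX + "essay"),
    # A predicate URI can itself be the subject of another statement.
    Triple(CITES, EX + "terms/specializes", EX + "terms/refersTo"),
]

def flatten(triples):
    """Throw the types away, leaving only 'something points at something'."""
    return [Link(t.subject, t.obj) for t in triples]

def links_of_type(triples, predicate):
    """Select only the links whose 'verb' is the given predicate URI."""
    return [t for t in triples if t.predicate == predicate]

def roles(uri, triples):
    """Return the grammatical roles a URI plays across the triples."""
    found = set()
    for t in triples:
        if t.subject == uri:
            found.add("subject")
        if t.predicate == uri:
            found.add("predicate")
        if t.obj == uri:
            found.add("object")
    return found
```

Flattening is a one-liner, but there is no function that recovers the predicates from the flattened links; that asymmetry is the sense in which typed links carry more information than their mere presence.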

This is different from my personal, long-standing view of RDF as a simple and rather awkward logic language. I think it explains why some people get so excited by the "graph nature" of RDF, which I just saw as a (personally uninteresting) notation.

Prima facie?

One quick meta-point about this argument: It's not conclusive, nor is it meant to be. Indeed, it's not the result of deep reflection or significant specific investigation. That's why it's merely a prima facie argument about the goodness of the Semantic Web. There are loads of possible defeaters out there, and I don't mean to make extravagant claims for the correctness of my conclusion. I think this argument merely puts the Semantic Web on a good initial footing in debates about the future of the Web. It presents the Semantic Web as an initially plausible contender.

The argument (in schematic form)

  1. Web links are untyped.
  2. Google does amazingly cool things with enough Web links. (I.e., PageRank derives interesting and somewhat surprising information from Web links.)
  3. The basic point of RDF and the SemWeb is that links should have "semantic" types.
  4. One should expect correspondingly Googlecool applications.

The Google Premise

I think it's safe, in this context, to presume that certain features of the Semantic Web are obviously good and, thus, unimportant to detail. For example, most everyone on either side of the debate thinks that a global, highly interconnected 'net is a good thing, and that it being trivial to publish information that references other information is also highly desirable. So we're all on the same page, I hope, on the value of the "Web" part of the Semantic Web.

Note that I'm largely not talking about ecommerce, whether between businesses or direct to consumers. eBay is interesting, but this argument isn't directed toward those aspects of the Web.

I also take it as completely uncontested that Google is the best search engine, period. Indeed, that Google makes the Web a vastly nicer place to be. A Web of 2 billion plus pages that we had to wade through with AltaVista would sort of suck, as would everyone having to join the Yahoo/Open Directory editorial staff.

But even if these were rational alternatives, Google is still striking. The quality of its results is consistently high. It makes crucial use of the mere presence of hyperlinks to infer information about webpages (aside from their more mundane use in crawling). The thrust of this argument is that if there were more information "in" the links than their mere presence, Googlesque analysis would be able to do more, perhaps much, much more.
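As a rough illustration of that thrust, here is a small Python sketch of link analysis in the style of the published PageRank idea, followed by a variant in which each link's type adjusts how much importance it passes along. This is a simplified power iteration, not Google's actual system. The four-page graph, the link types `endorses` and `criticizes`, and the `TYPE_WEIGHT` values are all hypothetical, chosen only to show that the same links can yield different results once their types are visible.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    edges: list of (source, target, weight) with non-negative weights.
    Returns {node: score}; the scores sum to 1.
    """
    nodes = sorted({n for s, t, _ in edges for n in (s, t)})
    out_weight = {n: 0.0 for n in nodes}
    for s, _, w in edges:
        out_weight[s] += w
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing weight is shared evenly.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0)
        base = (1 - damping + damping * dangling) / len(nodes)
        new = {n: base for n in nodes}
        for s, t, w in edges:
            if w > 0:
                new[t] += damping * rank[s] * w / out_weight[s]
        rank = new
    return rank

# Hypothetical typed links: (source, link type, target).
typed_links = [
    ("a", "endorses", "c"),
    ("b", "endorses", "c"),
    ("a", "criticizes", "d"),
    ("b", "criticizes", "d"),
    ("c", "endorses", "a"),
    ("d", "endorses", "b"),
]

# Untyped view: every link counts the same, whatever it "means".
plain_edges = [(s, o, 1.0) for s, _, o in typed_links]

# Typed view: an assumed weighting in which criticism passes on little rank.
TYPE_WEIGHT = {"endorses": 1.0, "criticizes": 0.1}
typed_edges = [(s, o, TYPE_WEIGHT[p]) for s, p, o in typed_links]

plain_rank = pagerank(plain_edges)
typed_rank = pagerank(typed_edges)
```

With the types ignored, pages `c` and `d` are indistinguishable, since each has two inbound links from the same sources. With the assumed weights, `c` outranks `d`. Whether any particular weighting is a good idea is a separate question; the sketch only shows that the extra information is usable.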

Naturally, PageRank isn't everything, as the folks at Google say:

"Of course, important pages mean nothing to you if they don't match your query. So, Google combines PageRank with sophisticated text-matching techniques to find pages that are both important and relevant to your search."

This suggests a second round of prima facie argument: The more machine-understandable we make the content of the pages, the more likely search results will combine harmoniously with link-derived information.

What This Argument Shows

This argument is inconclusive in a number of ways. Where it does not seem inconclusive to me is on this point: if the Google database were full of (good/correct) typed links to even simple (good/correct) "semantic" content, I find it hard to believe there'd be nothing clearly better to derive from that augmented database. It's possible, of course, that there would be some sort of overload from too much high-quality information, but even then, it seems a heck of a lot easier to dumb it down to various degrees than to smarten up what we have. In the worst-case scenario we have Google pretty much as it is.

It may be that a Semantic Google would be more vulnerable to gaming or other trash input. Or that good typed links will be too hard to add. Or so on. These seem to me to be arguments that the Semantic Web isn't achievable, rather than it not being a good thing to have.

The stronger argument is that we'd get more bang for a research buck if we just kept things the way they are and focused on further Googlesque research on using untyped links and, perhaps, on various sorts of text analysis or even natural language processing techniques. I don't think, given the possible difficulties of getting everyone to type their links correctly, even just minimally, that Semantic Web advocates can argue that such research is more pie in the sky or more likely to fail. Indeed, the Google argument gives a prima facie contrary conclusion.

Finally, nothing about this argument is exclusive. There may be far greater gains to be gotten by Semantic Web technologies (e.g., a Semantic Amazon -- or slew of mini-Semantizons -- might be a much bigger improvement on the status quo than a Semantic Google). But that's completely irrelevant to the modest aims of this argument. I think that the possibility of a Semantic Google is an appropriately sufficient motivation for a wide class of developers. If they get even more out of the tech than that, hooray! That would be a case of underpromising and overdelivering, which would be a good motto for Semantic Web advocates at the moment.

Another Aspect

I've focused on the typing of links, but there is another aspect of current RDF use that's worth noting: the finer grain of linked objects. What gets linked, predominantly, on the Web today are web "pages", which are usually fairly large "chunks" of information. Thus, Googlesque link analysis tends to yield information about those larger chunks. Since the chunks themselves have no Google-discernible internal structure (for the most part), Google can't even assert a "part of" relationship between the chunk and its constituent bits. This fundamentally (barring more sophisticated analysis of page contents, i.e., natural language processing of some sort) limits the sorts of things Google can reason about. Simply improving the granularity of what gets linked might yield significant new capabilities.
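A small Python sketch may help show what finer granularity buys. It assumes, purely for illustration, that the finer-grained things are named by URIs with fragment identiers replaced here by made-up `example.org` addresses with `#` fragments; the links and helper names are hypothetical. Only `urldefrag` from the standard library is real.

```python
from collections import Counter
from urllib.parse import urldefrag

# Hypothetical links whose targets name parts of a page, not just the page.
fine_links = [
    ("http://example.org/review", "http://example.org/essay#the-argument"),
    ("http://example.org/blog", "http://example.org/essay#the-argument"),
    ("http://example.org/notes", "http://example.org/essay#cyc"),
]

def part_of(uri):
    """Return (part, whole) if the URI names a fragment of a page, else None."""
    page, fragment = urldefrag(uri)
    return (uri, page) if fragment else None

def inbound_counts(links, coarse=False):
    """Count inbound links per target, optionally collapsing parts to pages."""
    targets = (urldefrag(t).url if coarse else t for _, t in links)
    return Counter(targets)
```

Collapsing the fine-grained counts down to whole pages is trivial (`coarse=True`), while going the other way from page-level links is not possible without analyzing page contents. That matches the earlier point that dumbing the data down is easier than smartening it up.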

The Cyc Argument

My argument is a bit like Douglas B. Lenat's Cyc argument. To wit (and very roughly): it's not the sophistication of the reasoner that matters (after a certain point) as much as the size of the knowledge base. Given a large enough collection of the facts of common sense, the Cyc engine would be able to use natural language investigations to bootstrap itself further. The main difference (granting this caricature of his view) is that Lenat argues to a fairly specific result -- human or greater than human intelligence. I'm more agnostic about what interesting things will result.

A Surprising Side-Conclusion

Working through these lines of reasoning led me to this surprising thought: Google, as it stands, is a Semantic Web application/site. This is not an enormous leap, as Google clearly makes extensive use of data mining theory and practice. Google, after all, reasons about hyperlinks, augmented by some heuristics about page composition. It does a very good job, as far as we all can tell, at determining page importance with no explicit encoding of human judgments about page importance. Well, at least, for the most part. Various Google-gaming moves are essentially attempts to add importance-focused human judgments to the mix.

One can reinterpret my argument to say, roughly: "It's worth putting some effort into improving the data fed into Google." This is, of course, not the only way the Semantic Web can move things along. A brief glance at the example applications for the Google programming contest is suggestive: There is more meaning to be discovered in the structure of the Web, have no doubt about it. The Semantic Web vision shows that there's even more to put into it.


· More by Bijan Parsia