I do a bit with so-called Semantic Web technologies (OK, I've written a couple
of articles, have a book proposal in the works, and am about to start a job
as a Semantic Web researcher ... as I said, "a bit"), but I must confess to
never really getting certain aspects of it. I like logic programming, and I'm
certainly interested in knowledge representation, and I do a bunch of web stuff
so I must be a Semantic Web person. However, some bit never clicked for
me, some key shared assumption left me feeling a bit out of the flow of things.
I used to characterize this as having more of a logician/philosopher background,
but that didn't seem quite right. During the recent Google and SOAP furor, I
had a little insight that led to the following prima facie argument
for the Semantic Web. I hope it helps other people "get it".
The Semantic Web
For the purpose of this argument, I consider the key feature of the
Semantic Web to be the use of URIs as differentiated terms. URIs aren't
just untyped pointers to web pages with (perhaps) server-specific internal semantics.
Rather, they can play different grammatical roles (the canonical subject-verb-object
trio), and they are used in a very rich system of typed links. The "verb" in
an RDF triple is a specific sort of "hypertextual"
link between the subject and the object.
This is different from my personal, long-standing view of RDF as a simple and
rather awkward logic language. I think it explains why some people
get so excited by the "graph nature" of RDF, which I just saw as a (personally
uninteresting) notation.
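To make the subject-verb-object idea concrete, here is a minimal sketch in Python. It is only an illustration: the example.org URIs, the `Triple` type, and the `links_of_type` helper are all made up for this sketch and are not taken from any RDF library.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # URI of the thing the link starts from
    predicate: str  # URI naming the type of link (the "verb")
    obj: str        # URI of the thing the link points to


# Hypothetical URIs, for illustration only.
EX = "http://example.org/"
triples = [
    Triple(EX + "essay", EX + "terms/cites", EX + "pagerank-paper"),
    Triple(EX + "essay", EX + "terms/disagreesWith", EX + "some-rant"),
    Triple(EX + "review", EX + "terms/cites", EX + "essay"),
]


def links_of_type(triples, predicate):
    """Return (subject, object) pairs joined by the given typed link."""
    return [(t.subject, t.obj) for t in triples if t.predicate == predicate]


# Only the "cites" links, ignoring every other kind of link.
cites = links_of_type(triples, EX + "terms/cites")
```

An ordinary hyperlink would record only that the essay points at two other documents; here the predicate URI says which link is a citation and which is a disagreement.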
Prima facie?
One quick meta-point about this argument: It's not conclusive, nor is it meant
to be. Indeed, it's not the result of deep reflection or significant specific
investigation. That's why it's merely a prima facie argument about the
goodness of the Semantic Web. There are loads of possible defeaters out there,
and I don't mean to make extravagant claims for the correctness of
my conclusion. I think this argument merely puts the Semantic Web on a good
initial footing in debates about the future of the Web. It presents
the Semantic Web as an initially plausible contender.
The argument (in schematic form)
- Web links are untyped.
- Google does amazingly cool things with enough Web links. (I.e., PageRank
derives interesting and somewhat surprising information from Web links.)
- The basic point of RDF and the SemWeb is that links should have "semantic" types.
- One should expect correspondingly Googlecool applications.
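The schematic argument can be sketched in code. What follows is a simplified, textbook-style PageRank by power iteration, not Google's actual implementation; the page names and the link types "recommends" and "criticizes" are hypothetical, chosen only to show that typed links permit analyses that untyped links cannot distinguish.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration over (source, target) pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for source, target in edges:
        out_links[source].append(target)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for source in nodes:
            # A page with no out-links spreads its rank over all pages.
            targets = out_links[source] or nodes
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical typed links: (source, link type, target).
typed_links = [
    ("a", "recommends", "b"),
    ("c", "recommends", "b"),
    ("a", "criticizes", "d"),
    ("b", "criticizes", "d"),
    ("c", "criticizes", "d"),
]

# Untyped view: every link counts the same, whatever it "means".
untyped = pagerank([(s, t) for s, _, t in typed_links])

# Typed view: rank only over links of one type.
recommended = pagerank(
    [(s, t) for s, kind, t in typed_links if kind == "recommends"]
)
```

In the untyped view, page "d" comes out on top simply because it is linked to the most, even though every one of those links is a criticism. Restricting the same computation to "recommends" links puts "b" on top instead.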
The Google Premise
I think it's safe, in this context, to presume that certain features of the
Semantic Web are obviously good and, thus, unimportant to detail. For example,
most everyone on either side of the debate thinks that a global, highly interconnected
'net is a good thing, and that making it trivial to publish information that references
other information is also highly desirable. So we're all on the same page, I
hope, on the value of the "Web" part of the Semantic Web.
Note that I'm largely not talking about ecommerce, whether between
businesses or direct to consumers. eBay is interesting, but this argument isn't
directed toward those aspects of the Web.
I also take it as completely uncontested that Google is the best search engine,
period. Indeed, that Google makes the Web a vastly nicer place to be. A Web
of 2 billion plus pages that we had to wade through with AltaVista would sort
of suck, as would everyone having to join the Yahoo/Open Directory editorial
staff.
But even if these were rational alternatives, Google is still striking. The
quality of its results is consistently high. It makes crucial use of the mere
presence of hyperlinks to infer information about webpages (aside from their
more mundane use in crawling).
The thrust of this argument is that if there were more information "in" the
links than their mere presence, Googlesque analysis would be able to do more,
perhaps much, much more.
Naturally, PageRank isn't everything, as the folks at Google say:
"Of course, important pages mean nothing to you if they don't match
your query. So, Google combines PageRank with sophisticated text-matching techniques
to find pages that are both important and relevant to your search."
This suggests a second round of prima facie argument: The more
machine-understandable we make the content of the pages, the more likely search results will combine
harmoniously with link-derived information.
What This Argument Shows
This argument is inconclusive in a number of ways. The one point that does not
seem inconclusive to me is this: if the Google database were full of (good/correct)
typed links to even simple (good/correct) "semantic" content, it is hard to believe
there'd be nothing clearly better to derive from that augmented database. It's possible, of course,
that there would be some sort of overload from too much high quality
information, but even then, it seems a heck of a lot easier to dumb it down
to various degrees than to smarten up what we have. In the worst case scenario
we have Google pretty much as it is.
It may be that a Semantic Google would be more vulnerable to gaming or other
trash input. Or that good typed links will be too hard to add. And so on. These
seem to me to be arguments that the Semantic Web isn't achievable, rather than
arguments that it wouldn't be a good thing to have.
The stronger argument is that we'd get more bang for a research buck if we
just kept things the way they are and focused on further Googlesque research
on using untyped links and, perhaps, on various sorts of text analysis or even
natural language processing techniques. Given the possible difficulty
of getting everyone to type their links correctly, even minimally, I don't think
Semantic Web advocates can argue that such research is more pie in
the sky or more likely to fail. Indeed, the Google argument gives a prima facie
contrary conclusion.
Finally, nothing about this argument is exclusive. There may be far
greater gains to be gotten by Semantic Web technologies (e.g., a Semantic Amazon
-- or slew of mini-Semantizons -- might be a much bigger improvement on the
status quo than a Semantic Google). But that's completely irrelevant
to the modest aims of this argument. I think that the possibility of
a Semantic Google is an appropriately sufficient motivation for a wide class
of developers. If they get even more out of the tech than that, hooray! That
would be a case of underpromising and overdelivering, which would be a good
motto for Semantic Web advocates at the moment.
Another Aspect
I've focused on the typing of links, but there is another aspect of current
RDF use that's worth noting: the finer grain of linked objects. What gets
linked, predominantly, on the Web today are web "pages", which are usually fairly
large "chunks" of information. Thus, Googlesque link analysis tends to yield information
about those larger chunks. Since the chunks themselves have no Google-discernible
internal structure (for the most part), Google can't even assert a "part of"
relationship between the chunk and its constituent bits. This fundamentally
limits the sorts of thing Google can reason about (barring more sophisticated
analysis of page contents, i.e., natural language processing of some sort).
Simply improving the granularity
of what gets linked might yield significant new capabilities.
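As a rough sketch of the granularity point, the Python below counts incoming links at two levels. It assumes, purely for illustration, that the finer-grained "bits" are addressed by URI fragments; the example.org URLs are made up, and real RDF data could identify parts of a resource in other ways.

```python
from collections import Counter
from urllib.parse import urldefrag

# Hypothetical link targets; the fragment names a finer-grained "bit" of a page.
targets = [
    "http://example.org/faq#install",
    "http://example.org/faq#install",
    "http://example.org/faq#license",
    "http://example.org/about",
]

# Page-level view: fragments are thrown away, so the FAQ is one big chunk.
page_counts = Counter(urldefrag(t).url for t in targets)

# Finer-grained view: each linked bit is counted separately, and the
# "part of" relation between a bit and its page is recorded explicitly.
bit_counts = Counter(targets)
part_of = {t: urldefrag(t).url for t in targets if urldefrag(t).fragment}
```

At page level, all that can be said is that the FAQ has three incoming links. At the finer grain, one can say which part of the FAQ attracts them and relate that part back to its page.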
The Cyc Argument
My argument is a bit like Douglas B. Lenat's Cyc argument, to wit (and very
roughly): it's not the sophistication of the reasoner that matters (after a
certain point) as much as the size of the knowledge base. Given a large enough
collection of the facts of common sense, the Cyc engine would be able to use
natural language investigations to bootstrap itself further. The main difference
(granting this caricature of his view)
is that Lenat argues to a fairly specific result -- human or greater than human
intelligence. I'm more agnostic about what interesting things will result.
A Surprising Side-Conclusion
Working through these lines of reasoning led me to this surprising thought:
Google, as it stands, is a Semantic Web application/site. This is not
an enormous leap, as Google clearly makes extensive use of data mining
theory and practice. Google, after all, reasons about hyperlinks, augmented
by some heuristics about page composition. It does a very good job, as far as
we all can tell, at determining page importance with no explicit encoding
of human judgments about page importance. Well, at least, for the most part.
Various Google-gaming moves are essentially attempts to add importance-focused
human judgments to the mix.
One can reinterpret my argument to say, roughly: "It's worth putting some effort
into improving the data fed into Google." This is, of course, not the only way
the Semantic Web can move things along. A brief glance at the example applications
for the Google programming contest is suggestive: There is more meaning to be discovered in
the structure of the Web, have no doubt about it. The Semantic Web vision shows
that there's even more to put into it.