A recent study by IBM has found that the Web is a bow tie affair, but not everybody is invited to the party. Sure, there are those pages in the know – they know, and are known by, all the other important pages. But there are also those that remain on the fringes – with few or no ties to the outside world.
IBM recently pooled its resources with Compaq in Palo Alto, Calif. and AltaVista Co. in San Mateo, Calif. in order to get a clear picture of what the Web’s structure looks like. And what they found surprised them.
“Earlier studies predicted that everything on the Web is accessible from everywhere else – you can go from any page to any other page, almost. In fact, our study shows that that is not the case. The Web is more intricate than that,” said Ravi Kumar, a researcher at IBM’s Almaden Research Center in San Jose, Calif.
What the researchers found was that the Web is divided into four distinct parts of roughly equal size, a structure they dubbed the Bow Tie Theory. First, there is the core, which makes up about 30 per cent of all publicly accessible pages on the Internet – that is, those pages that are not hidden behind firewalls. These pages are linked to each other, and it is possible to get from any one page in the core to any other page in the core by following links between pages.
Origination and termination
The second part of the Web is known as the origination node, which has about 44 million pages and makes up about 24 per cent of the Web. It is possible to get from origination pages to the core, but not the other way around. An origination page could, for example, be somebody’s home page – joeblow.com. The page could include links to more popular pages on the core, but because the core doesn’t know about Joe Blow’s home page, there are no links from the core back.
The third part of the Web is called the termination node, which also consists of about 44 million pages. These are pages that the core links to, but that don’t link back to the core. Deep within IBM’s home page, for example, there could be a link to Kumar’s own page. This page might just have his address and contact information, but no links back to the core.
The last part of the Web is made up of islands, also numbering about 44 million pages, that are completely cut off from the core. These pages were once linked to, or from, the core. But that link has, for some reason or other, been broken.
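The four-way split described above can be sketched on a toy link graph. This is an illustration only, not the researchers' method: the page names are made up, the function names are hypothetical, and the sketch assumes one page is already known to sit in the core. A page that can both reach that page and be reached from it is labelled core; one that can only reach it is origination; one that can only be reached from it is termination; everything else is lumped together as an island.

```python
from collections import deque

def reachable(graph, start):
    """Return every page reachable from start by following links forward."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for nxt in graph.get(page, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def reverse(graph):
    """Flip every link so that a forward search walks links backwards."""
    rev = {page: [] for page in graph}
    for src, targets in graph.items():
        for dst in targets:
            rev.setdefault(dst, []).append(src)
    return rev

def classify(graph, core_page):
    """Label each page relative to a page assumed to be in the core."""
    forward = reachable(graph, core_page)            # pages the core can reach
    backward = reachable(reverse(graph), core_page)  # pages that can reach the core
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    labels = {}
    for page in pages:
        if page in forward and page in backward:
            labels[page] = "core"
        elif page in backward:
            labels[page] = "origination"
        elif page in forward:
            labels[page] = "termination"
        else:
            labels[page] = "island"
    return labels

# Hypothetical toy web: page -> pages it links to.
toy_web = {
    "portal": ["news"],
    "news": ["search"],
    "search": ["portal", "contact"],
    "joeblow": ["portal"],   # links into the core, nothing links back
    "contact": [],           # linked from the core, links nowhere
    "orphan": ["orphan2"],   # no path to or from the core
    "orphan2": [],
}

labels = classify(toy_web, "portal")
for page in sorted(labels):
    print(page, labels[page])
```

On this toy graph the three mutually linked pages come out as core, "joeblow" as origination, "contact" as termination and the two orphans as islands.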
“The study shows that if I pick two random pages, then the chances that I have from going from the first one to the second one is only one in four. So, it’s not the case that the Web is completely connected. It’s not the case that I can get from any page to any other page on the Web,” Kumar said.
Within the core, however, it takes an average of 16 clicks to get from one page to another – 16 degrees of separation.
One of the important aspects of the study was that it took into account the direction of the links going from one page to another, said Steven Strogatz, an applied mathematician who studies networks at Cornell University in Ithaca, N.Y.
“A lot of the earlier works didn’t take into account the directionality of the link,” Strogatz said.
Links on the Web only go in one direction. That is, when people are on a Web page, they can only get to the pages that the page points to. They cannot travel backwards and go to the pages that point to the page they are already on.
If people could surf backwards, then pages on the core would on average only be divided by seven degrees of separation, Kumar said.
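The effect of link direction on average separation can be seen even on a very small example. The sketch below is an assumption-laden toy, not the study's measurement: it uses a hypothetical ring of six pages, each linking only to the next, and compares the average number of clicks between pages when links are followed forward only with the average when links can also be followed backwards. The function names are invented for illustration, and the code assumes every page appears as a key in the graph.

```python
from collections import deque

def distances_from(graph, start):
    """Breadth-first search: fewest clicks from start to each reachable page."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for nxt in graph[page]:
            if nxt not in dist:
                dist[nxt] = dist[page] + 1
                queue.append(nxt)
    return dist

def average_separation(graph):
    """Average click count over all ordered pairs where a path exists."""
    total = 0
    pairs = 0
    for start in graph:
        for target, clicks in distances_from(graph, start).items():
            if target != start:
                total += clicks
                pairs += 1
    return total / pairs

def as_two_way(graph):
    """Copy of the graph in which every link can also be followed backwards."""
    two_way = {page: set(targets) for page, targets in graph.items()}
    for src, targets in graph.items():
        for dst in targets:
            two_way[dst].add(src)
    return two_way

# Hypothetical core: six pages in a ring, each linking only to the next one.
ring = {i: [(i + 1) % 6] for i in range(6)}

forward_only = average_separation(ring)
with_backlinks = average_separation(as_two_way(ring))
print(forward_only, with_backlinks)
```

On this ring the forward-only average is 3 clicks, while allowing backward travel brings it down to 1.8. The real Web is far less regular, so the numbers here say nothing about the 16 and seven reported by the researchers; the sketch only shows why ignoring direction shortens paths.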
“Browsing backwards could be a much richer experience. If you actually have such a thing, it does, quite often, enhance one’s browsing experience. Knowing what points to a site tells you a lot about that site,” said Jon Kleinberg, an assistant professor of computer science at Cornell. But he said technology that would allow users to see which pages point to the page they are currently on may not be feasible. “You really need access to an index to do that.”
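Kleinberg's point about needing an index can be illustrated with a small sketch. A page lists only its own outgoing links, so finding out what points to it requires a record of every other page's links, which is the kind of data a search engine's crawl collects. The example below assumes such a crawl is available as a simple mapping; the crawl data, the URLs other than the article's joeblow.com, and the function name are all hypothetical.

```python
from collections import defaultdict

def build_backlink_index(outlinks):
    """Invert a page -> outgoing links mapping into page -> pages linking to it."""
    backlinks = defaultdict(set)
    for page, targets in outlinks.items():
        for target in targets:
            backlinks[target].add(page)
    return backlinks

# Hypothetical crawl results: each page and the pages it links to.
crawl = {
    "joeblow.com": ["ibm.com", "altavista.com"],
    "altavista.com": ["ibm.com"],
    "ibm.com": ["ibm.com/kumar"],
}

index = build_backlink_index(crawl)
pointing_to_ibm = sorted(index["ibm.com"])
print(pointing_to_ibm)
```

Building the index takes one pass over every crawled page, which is why a browser on its own, seeing one page at a time, cannot answer the question.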
Trapped in the Web
A lot of termination pages are also deliberately designed so that users can’t get back out to the rest of the Web, making the ability to search backwards undesirable for a lot of commercial Web site designers.
“Some of the commercial Web sites deliberately don’t want you to get out. It’s like the way shopping malls are deliberately designed,” Strogatz said.
“There are networks all around us in the world,” he said. These include networks of banks connected by automated teller machines, as well as power grids, nervous systems and social networks of people.
“But in very few cases do we really understand the structure of these networks. So I see this as one of the first studies, in a pretty concise way, of what giant networks are really like,” Strogatz said.
“It’s a really interesting picture that they came up with too. It makes the Web look like sort of a biological organism. It’s a very organic picture with a heart and limbs and tendrils. I thought that was striking.”
Kleinberg was also impressed by the findings.
“I think one of the things that’s interesting is that they actually measured the size of the pieces. In a sense, one could predict that the Web was going to look like this qualitatively, but the question was, how important are all these pieces, how important is their influence on the Web. And what they discovered is that these pieces are all roughly comparable in size. And therefore, each of them somehow plays a crucial role.”