Finding Stuff the Search Engines Can’t

Lee Ratzan

18 years ago

Just because a Web search engine can’t find something doesn’t mean it isn’t there. You may be looking for info in all the wrong places.

The Deep Web is a vast information repository not always indexed by automated search engines but readily accessible to enlightened individuals.

The Shallow Web, also known as the Surface Web or Static Web, is the collection of Web sites indexed by automated search engines. A search engine bot, or Web crawler, follows URL links, indexes the content and then relays the results back to search engine central, where they are consolidated and made available to user queries. Ideally, the process eventually scours the entire Web, subject to vendor time and storage constraints.
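
The follow-links-and-index loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the PAGES dictionary is a made-up, in-memory stand-in for the Web, the example host names are hypothetical, and a real crawler would fetch pages over the network and store far more than single words.

```python
from collections import deque

# Hypothetical in-memory "Web": URL -> (page text, outgoing links).
PAGES = {
    "a.example/": ("jazz history archive", ["a.example/bands", "b.example/"]),
    "a.example/bands": ("jazz bands listing", ["a.example/"]),
    "b.example/": ("virus alerts", []),
    "c.example/": ("unlinked report on jazz", []),  # nothing links to this page
}

def crawl(seed):
    """Follow links breadth-first from seed and build a word -> URLs index."""
    index = {}
    seen = {seed}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        text, links = PAGES[url]
        # Index the page: record which URLs contain each word.
        for word in text.split():
            index.setdefault(word, set()).add(url)
        # Follow only links the bot has not already queued.
        for link in links:
            if link in PAGES and link not in seen:
                seen.add(link)
                queue.append(link)
    return index

index = crawl("a.example/")
print(sorted(index["jazz"]))
```

The page at c.example/ mentions jazz, but because no crawled page links to it, the bot never reaches it and it never appears in the index. That is the Deep Web problem in miniature.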

The crux of the process lies in the indexing. A bot does not report what it can’t index. This was a minor issue when the early Web consisted primarily of static generic HTML code, but contemporary Web sites now contain multimedia, scripts and other forms of dynamic content.

The Deep Web consists of Web pages that search engines cannot or will not index. The popular term “Invisible Web” is actually a misnomer, because the information is not invisible; it is simply not indexed by bots. Depending on whom you ask, the Deep Web is five to 500 times as vast as the Shallow Web, thus making it an immense and extraordinary online resource. Do the math: If major search engines together index only 20% of the Web, then they miss 80% of the content.
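
For readers who want to do that math, here is a small sketch. It assumes the Web splits cleanly into an indexed (Shallow) part and an unindexed (Deep) part, which is a simplification; the function names are invented for this illustration.

```python
def indexed_share(deep_to_shallow_ratio):
    """Fraction of the whole Web that is indexed, if the Deep Web is
    `deep_to_shallow_ratio` times the size of the Shallow Web."""
    return 1 / (1 + deep_to_shallow_ratio)

def missed_share(indexed):
    """Fraction of the Web a searcher misses when `indexed` is covered."""
    return 1 - indexed

for ratio in (5, 500):
    print(f"Deep Web {ratio}x the Shallow Web: about {indexed_share(ratio):.1%} indexed")
print(f"20% indexed: {missed_share(0.20):.0%} missed")
```

Under this assumption, the "five to 500 times" range means search engines cover somewhere between roughly 17% and 0.2% of what is out there.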

What makes it deep?

Search engines typically do not index the following types of Web sites:

— Proprietary sites

— Sites requiring a registration

— Sites with scripts

— Dynamic sites

— Ephemeral sites

— Sites blocked by local webmasters

— Sites blocked by search engine policy

— Sites with special formats

— Searchable databases

Proprietary sites require a fee. Registration sites require a login or password. A bot can index script code (e.g., Flash, JavaScript), but it can’t always ascertain what the script actually does. Some nasty script junkies have been known to trap bots within infinite loops.
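
To see why an endless chain of generated links is a problem for a bot, consider this sketch. The trap_links function is a hypothetical stand-in for a script that manufactures a new link on every page, and the depth and page limits are one common defensive idea, not a description of how any particular search engine protects itself.

```python
def trap_links(url):
    """Hypothetical trap: every page links to one new, deeper page."""
    return [url + "next/"]

def crawl_with_limits(seed, get_links, max_depth=5, max_pages=100):
    """Visit pages reachable from seed, giving up past max_depth or max_pages."""
    seen = set()
    frontier = [(seed, 0)]
    while frontier and len(seen) < max_pages:
        url, depth = frontier.pop()
        if url in seen or depth > max_depth:
            continue  # skip repeats and anything suspiciously deep
        seen.add(url)
        for link in get_links(url):
            frontier.append((link, depth + 1))
    return seen

visited = crawl_with_limits("trap.example/", trap_links)
print(len(visited))
```

Without the two limits, the loop would never run out of "new" pages. With them, the bot stops, but it also abandons whatever legitimate content lies deeper than the cutoff.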

Dynamic Web sites are created on demand and have no existence prior to the query and limited existence afterward (e.g., airline schedules). If you have ever noticed an interesting link on a news site but were unable to find it later in the day, then you have encountered an ephemeral Web site.

Webmasters can request that their sites not be indexed (the Robots Exclusion Protocol), and some search engines skip sites based on their own inscrutable corporate policies. Not long ago, search engines could not index PDF files, thus missing an enormous quantity of vendor white papers and technical reports, not to mention government documents. Special formats become less of an issue as index engines become smarter.
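
A webmaster's exclusion request usually takes the form of a robots.txt file at the root of the site. The sketch below uses Python's standard urllib.robotparser module to show how a well-behaved bot might read one. The rules, the bot name "ExampleBot" and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt asking all bots to stay out of /private/.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A polite bot checks each URL before fetching it.
blocked = parser.can_fetch("ExampleBot", "https://www.example.com/private/report.html")
allowed = parser.can_fetch("ExampleBot", "https://www.example.com/index.html")
print(blocked, allowed)
```

The protocol is a request, not a lock. Pages under /private/ remain reachable by anyone who has the address, which is why they end up in the Deep Web rather than disappearing.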

Arguably the most valuable Deep Web resources are searchable databases. There are thousands of high-quality, authoritative online specialty databases. These resources are extremely useful for a focused search.

Many Web sites act as front ends to searchable databases. Complete Planet, IncyWincy Spider and The Librarians’ Internet Index provide quick links for quality Web database searching. This technique is called split-level searching. Enter the key phrase “searchable database” into any of these sites to find more.

You can find other subject-specific searchable databases by entering the keyword phrase “subject_name database” into your favorite search engine (e.g., “jazz database,” “virus database”).

A naive searcher typically enters a keyword into a general-purpose search engine, gets too many hits and then expends time and energy sorting through relevant and irrelevant results. Alternatively, they get no hits and wonder why. It is difficult to get all relevant hits and no irrelevant hits. (Information scientists call this the Law of Recall and Precision.)
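
Recall and precision have standard definitions: precision is the share of returned hits that are relevant, and recall is the share of all relevant items that were returned. The sketch below computes both for two made-up searches. The document IDs are invented for illustration and do not come from any real search engine.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of hits against the relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {1, 2, 3, 4}          # the documents the searcher actually wants

broad = precision_recall(range(1, 21), relevant)   # 20 hits, 4 of them relevant
focused = precision_recall({1, 2}, relevant)       # 2 hits, both relevant

print(broad, focused)
```

The broad search finds everything relevant but buries it under 16 irrelevant hits (high recall, low precision). The focused search returns only good hits but misses half of what exists (high precision, lower recall).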

Almost by definition, authoritative searchable specialty databases contain relevant information and minimal irrelevant information.

Don’t forget to bookmark a variety of special-topic searchable databases in a Deep Web folder for ready reference.

Deep Web Search Strategies

— Be aware that the Deep Web exists.

— Use a general search engine for broad topic searching.

— Use a searchable database for focused searches.

— Register on special sites and use their archives.

— Call the reference desk at a local college if you need a proprietary Web site. Many college libraries subscribe to these services and provide free on-site searching (and a friendly trained librarian to help you).

— Check the Web site of your local public library. Many libraries offer free remote online access to commercial and research databases for anyone with a library card.

Summary

The Deep Web contains valuable resources not easily accessible by automated search engines but readily available to enlightened searchers.

Make the online search process more efficient and productive with resources missed in the Shallow Web. The truth is out there.

Lee Ratzan is a system analyst at a health care agency in New Jersey and teaches library technology at Rutgers University. Contact him at lratzan@scils.rutgers.edu.
