GED 579 / GSA 579
Comparing Internet Search Engines Part II
Search Engine Components - What is an Internet Search Engine?
Invisible Web - What it is and what it is composed of.
Note the different Evaluations given the following Search Engines
What Search Engines Can't Search (Unless they have the Capability to Search "deeper")
A Search Engine has 3 Basic Parts
1. Spider (crawler, link finder): a computer program that harvests web links from page to page
2. Index: an organized, searchable database of the Spider's harvested results
3. Search and retrieval mechanism: Software that allows users to search the Index and return results in a predetermined order.
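The three parts above can be sketched in a few lines of Python. This is only a toy illustration, not how any real search engine is built: the "web" is a made-up dictionary (TOY_WEB), and the function names spider, build_index and search are our own.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, links found on the page).
TOY_WEB = {
    "a.html": ("usenet news archive", ["b.html", "c.html"]),
    "b.html": ("library catalog search", ["c.html"]),
    "c.html": ("usenet search tips", []),
    "d.html": ("unlinked page about usenet", []),  # no other page links here
}

def spider(start):
    """Part 1: harvest links from page to page, starting at one URL."""
    found, queue, seen = [], deque([start]), {start}
    while queue:
        url = queue.popleft()
        found.append(url)
        for link in TOY_WEB[url][1]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return found

def build_index(urls):
    """Part 2: map each word to the set of harvested pages that contain it."""
    index = {}
    for url in urls:
        for word in TOY_WEB[url][0].split():
            index.setdefault(word, set()).add(url)
    return index

def search(index, word):
    """Part 3: look a word up in the index; results return in alphabetical order."""
    return sorted(index.get(word.lower(), set()))

index = build_index(spider("a.html"))
print(search(index, "usenet"))  # d.html is missing: the spider never reached it
```

Note that d.html contains the word "usenet" but never shows up in the results, because no harvested page links to it. This is one small reason a search engine can miss pages.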
But the term "Search Engine" is also commonly used to refer to any software that searches an Index of Words or material types
Examples
A. "Small" page related Search Engine - To search this page for the word "Usenet", click EDIT in your browser menu, then click Find (on this page). Enter your term and search.
B. Database or Index Specific - Searches only for content within an enclosed site
Examples - Searching the www.smcvt.edu site, Online Catalogs, Periodical Indexes like Lexis-Nexis
C. Directory Search Engines - Searching for content or web pages submitted by hand. In other words, materials are found and maintained by a human being.
Examples - Yahoo, Google Directory
D. Large Search Engines - These search engines use the "3 Basic Parts" listed above. They try to find everything on the Internet and fail for a number of reasons.
Surface and Invisible Web - What it is and what it is composed of.
The Visible Web (sometimes referred to as the Surface Web, Public Web, or Fixed Web: pages stored as individual files on servers)
- Search engines are the primary means for finding information on the "surface" Web. Authors may submit their own Web pages for listing [Directories like Yahoo], or search engines "crawl" or "spider" documents by following one hypertext link to another.
According to BrightPlanet - The Invisible Web is the content that resides in searchable databases, the results from which can only be discovered by a direct query. Without the directed query, the database does not publish the result. When queried, deep Web sites post their results as dynamic Web pages in real-time. Though these dynamic pages have a unique URL address that allows them to be retrieved again later, they are not persistent.
Major Point - what can be found via a search engine like Google is much less than what exists in total on the Internet
But Search Engines might not search the "deep web" because:
1. Dynamically driven (database-driven) websites. Search Engines may have difficulty harvesting websites that are not marked up in HTML.
2. Search Engines CAN NOT search password-protected sites like EBSCOhost journal databases or online catalogs.
3. Search Engines (often) CAN NOT search Adobe PDF, Word, or PowerPoint files on a web page.
The Invisible Web is 500 to 1,000 times the size of the Surface Web.
Figure 6 displays the distribution of deep Web sites by type of content.

Figure 6. Distribution of Deep Web Sites by Content - From White Paper @ http://www.brightplanet.com/deepcontent/tutorials/DeepWeb/index.asp
Size of WWW
According to the results of the Web Characterization Project's most recent survey, the surface / public Web, as of June 2002, contained 3,080,000 Web sites, or 35 percent of the Web as a whole. Public sites accounted for approximately 1.4 billion Web pages. The average size of a public Web site was 441 pages.
Source: Trends in the Evolution of the Public Web @ http://www.dlib.org/dlib/april03/lavoie/04lavoie.html
D-Lib Magazine, April 2003
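A quick arithmetic check of the survey figures above, as a rough sketch. The variable names are ours; the three input numbers come from the paragraph above, and the page count is only the rounded "approximately 1.4 billion".

```python
public_sites = 3_080_000        # public Web sites, June 2002
public_share = 0.35             # public sites as a share of the Web as a whole
public_pages = 1_400_000_000    # "approximately 1.4 billion" public Web pages

# If 3,080,000 sites are 35 percent of the Web, the whole Web is this many sites.
total_sites = public_sites / public_share

# Average pages per public site, computed from the rounded page count.
pages_per_site = public_pages / public_sites

print(f"Implied total Web sites: {total_sites:,.0f}")
print(f"Pages per public site (from rounded figures): {pages_per_site:.0f}")
```

The rounded figures imply about 8.8 million Web sites in total and about 455 pages per public site. That is slightly above the reported average of 441, most likely because the 1.4 billion page count is itself rounded.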
------------------------------------------------------------------------------------------
Table 1.13: The size of the Internet in terabytes.

Medium                 2002 Terabytes
Surface Web                       167
Deep Web                       91,850
Email (originals)             440,606
Instant messaging                 274
TOTAL                         532,897
Source: How much information 2003 @ http://www.sims.berkeley.edu/research/projects/how-much-info-2003/execsum.htm#summary
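The figures in Table 1.13 can be compared with a short calculation. This is just a sketch: the dictionary name terabytes_2002 is ours, and the numbers are copied from the table.

```python
# 2002 terabytes per medium, copied from Table 1.13.
terabytes_2002 = {
    "Surface Web": 167,
    "Deep Web": 91_850,
    "Email (originals)": 440_606,
    "Instant messaging": 274,
}

total = sum(terabytes_2002.values())
deep_to_surface = terabytes_2002["Deep Web"] / terabytes_2002["Surface Web"]

print(f"Total: {total:,} terabytes")
print(f"Deep Web is {deep_to_surface:.0f} times the Surface Web")
for medium, tb in terabytes_2002.items():
    print(f"{medium:<20}{tb / total:8.2%}")
```

The four rows add up to the table's TOTAL of 532,897 terabytes. By these figures the Deep Web is 550 times the size of the Surface Web, which falls within the "500 to 1,000 times" range given earlier.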
Comparing and ranking different search engines is another method of evaluating search engines. But ranking the different search engines depends on the emphasis one gives the following evaluation criteria:
1. Size of the database (see http://www.searchenginewatch.com/reports/sizes.html for a report on size)
- everything included, dual numbers
- selected and reviewed content
2. File Types (allows "field" searching)
- Web Pages, Usenet News, gopher, FTP, PDF (Adobe), Word (Blogs and Wikis are findable by search engines)
- Other [software, sound, images, video]
- Material type: Location (country), language, newspapers, journals, books
3. Interface
- modes: simple or complex; see Search Engine Evaluation Part I for details of boolean searching, etc.
4. Ranking of results
- frequency of word choices found on web pages
- location: words found in meta-tags, first paragraph
- reviewed sites
- fee paid to rank sites higher in results list
- proximity of words to each other
- Link Popularity (Google, Inktomi), also known as Peer Ranking
A. PageRank Technology (Google) - ranks a site by how other highly ranked sites link to it. For example, a link from the NASA web site would boost the ranking of a web site.
B. Results based on other searchers' selections and length of time spent at site
- bundling of results into concepts, domains, and sites; example: Teoma @ www.teoma.com
5. Limits
- Language, Geography, file type, date
6. Timeliness
- Frequency of Discovery
- Timelag
- Weeding
7. Description of sources (annotations) found in hit list
8. Speed
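The link popularity idea under "Ranking of results" above (a link from a highly ranked site counts for more) can be sketched as a small iterative calculation in the spirit of PageRank. This is a simplified illustration only, not Google's actual method. The site names, the link_popularity function, and the rounds and damping settings are all assumptions made for the example.

```python
def link_popularity(links, rounds=20, damping=0.85):
    """Score pages so that a link from a high-scoring page is worth more."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    score = {p: 1.0 / n for p in pages}
    for _ in range(rounds):
        # Every page starts each round with a small base score.
        new = {p: (1 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its score among the pages it links to.
                share = damping * score[page] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                for p in pages:
                    new[p] += damping * score[page] / n
        score = new
    return score

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "portal1": ["nasa"],
    "portal2": ["nasa"],
    "portal3": ["nasa"],
    "nasa": ["site_a"],   # site_a is linked from a well-linked site
    "blog": ["site_b"],   # site_b is linked from a site nobody links to
}

scores = link_popularity(LINKS)
for page in sorted(scores, key=scores.get, reverse=True):
    print(f"{page:<8} {scores[page]:.3f}")
```

In this toy graph site_a and site_b each have exactly one incoming link, but site_a scores higher because its link comes from the heavily linked "nasa" page.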
Note the Different Ranks given the following Search Engines from different established review sites:
http://www.searchenginewatch.com/reports/sizes.html
http://www.notess.com/search/stats/size.shtml
What Search Engines Often Don't Search - the following listing is often referred to as the "Invisible Web"
1. Contents of Adobe PDF and formatted files - but search engines are increasingly including these different file types
2. The content of Sites requiring a log-in
3. CGI-Bin Output such as data requested by a form
4. Intranets
5. Commercial or proprietary indexes like ERIC, UMI, Lexis-Nexis
6. Sites that use a robots.txt to keep robots (search engines) away
7. Non-html resources: Telnet, ftp, gopher, etc.
8. Web sites that are "Database Driven". An example is the "old" SMC Web Page. Any page within the site is generated from a database, with a URL that looks like the following example - http://www.smcvt.edu/Admin3.asp?SiteAreaID=193&Level=1
Note that the URL does not end with .htm or .html. Also note the ? in the URL. Search Engine Spiders will generally not retrieve or harvest these URLs.
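Two items in the list above, robots.txt and database-driven URLs, can be sketched as a simple check a cautious spider might run before harvesting a page. This is an illustration under stated assumptions: the robots.txt content, the example.edu URLs, the spider name "ExampleSpider", and the would_harvest function are all made up. Skipping every URL that contains a ? only reflects the general behavior described above, not a rule every spider follows.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

def would_harvest(url, agent="ExampleSpider"):
    """Return True if this (hypothetical) cautious spider would fetch the URL."""
    # Skip database-driven URLs: anything after a "?" is the query string.
    if urlparse(url).query:
        return False
    # Respect the site's robots.txt rules.
    return robots.can_fetch(agent, url)

for url in [
    "http://www.example.edu/library/index.html",
    "http://www.example.edu/private/notes.html",
    "http://www.smcvt.edu/Admin3.asp?SiteAreaID=193&Level=1",
]:
    print(would_harvest(url), url)
```

Only the first URL passes both checks. The second is blocked by the robots.txt rule and the third by the ? in the URL.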
1. Robot software (spiders, crawlers) uses HTTP to request documents associated with a certain URL.
2. Robots use either a depth-first or breadth-first search strategy for following URLs.
- A depth-first robot follows the first link on the initial page, then the first link of the second page, and so on. This is used more commonly for subject specific search engines.
- A breadth-first robot follows the first link of the initial page, then returns to the initial page and follows the second link, and so on. This is used most commonly for broad search engines.
3. URLs are organized in a database.
4. The URLs from the database are "reharvested" and text from the sites is put in an index. How much text is harvested varies among the various search engines.
5. Harvesters generate text summaries. Most copy the <title> and a fixed amount of the initial text.
6. The search engine (ex. Alta Vista) uses search software to search the index created by the robot searches.
7. Algorithms are used to set each individual search engine's search parameters: boolean, wildcards, etc.
8. Algorithms are used by search engines to Rank the results of the search. Factors that may be considered in Ranking: which fields the search terms are found in (<title>, URL field), the number of times the word appears in a single document, where the search term appears in the document, and payment by companies to have their pages ranked high or first.
9. Netiquette for Robots. The root directory of a Web server can contain a file named robots.txt that lists the files and folders robots should not visit. The robot should leave these web files alone for privacy reasons. In our SMC web account, Web files may be located in a folder named "private" to prevent a "local search engine" from viewing them.
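The depth-first and breadth-first strategies in step 2 above differ only in the order pages are visited. A minimal sketch, assuming a made-up six-page site (LINKS) where each page lists its links in the order they appear on the page:

```python
from collections import deque

# Hypothetical site: page -> links in the order they appear on the page.
LINKS = {
    "home": ["A", "B"],
    "A": ["A1", "A2"],
    "B": ["B1"],
    "A1": [], "A2": [], "B1": [],
}

def depth_first(start):
    """Follow the first link, then the first link of that page, and so on."""
    order, stack, seen = [], [start], set()
    while stack:
        page = stack.pop()
        if page in seen:
            continue
        seen.add(page)
        order.append(page)
        # Push links in reverse so the first link on the page is followed first.
        stack.extend(reversed(LINKS[page]))
    return order

def breadth_first(start):
    """Visit every link on the initial page before going one level deeper."""
    order, queue, seen = [], deque([start]), {start}
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in LINKS[page]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

print("depth-first:  ", depth_first("home"))
print("breadth-first:", breadth_first("home"))
```

The depth-first robot visits home, A, A1, A2, B, B1, finishing everything under A before touching B. The breadth-first robot visits home, A, B, A1, A2, B1, covering the top level of the site first.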
Web Search Features:
Online guides
1. http://www.notess.com/search/features/