GED 579 / GSA 579 Comparing Internet Search Engines   Part II


Search Engine Components

A Search Engine has 3 Basic Parts

1. Spider (crawler, link finder):  a computer program that harvests web links from page to page

2. Index: an organized, searchable database of the Spider's harvested results

3. Search and retrieval mechanism: Software that allows users to search the Index and return results in a predetermined order.
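
As a rough illustration of how parts 2 and 3 fit together, the sketch below builds a very small index over three made-up pages and then searches it. The page URLs, the page text, and the function names build_index and search are assumptions made for this example only; real search engines use far more elaborate indexes and ranking rules.

```python
from collections import defaultdict

# Hypothetical "harvested" pages: URL -> text collected by a spider.
PAGES = {
    "http://example.org/a": "Usenet news and web search engines",
    "http://example.org/b": "Search engines build an index of web pages",
    "http://example.org/c": "Gopher and FTP predate the web",
}

def build_index(pages):
    """Part 2 (Index): map each lowercase word to the set of URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Part 3 (Search and retrieval): return the URLs that contain every
    query word, in a predetermined (here: alphabetical) order."""
    words = query.lower().split()
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)

index = build_index(PAGES)
print(search(index, "search engines"))
print(search(index, "usenet"))
```

The alphabetical ordering stands in for the "predetermined order" mentioned in part 3; an actual engine would order results by a ranking algorithm instead.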

However, the term "Search Engine" is also commonly used to refer to any software that searches an Index of words or material types.

Examples

A. "Small" page-related Search Engine - To search this page for the word "Usenet", click EDIT in your browser menu, then click "Find (on this page)". Enter your term and search.

B. Database or Index Specific - Searches only for content within an enclosed site
Examples - Searching the www.smcvt.edu site, Online Catalogs, Periodical Indexes like Lexis-Nexis

C. Directory Search Engines - Search content or web pages submitted by hand. In other words, materials are found and maintained by a human being.
Examples - Yahoo, Google Directory

D. Large Search Engines - These search engines use the "3 Basic Parts" listed above.   They try to find everything on the Internet and fail for a number of reasons.


Surface and Invisible Web - What They Are and What They Are Composed Of

The Visible Web (sometimes referred to as the Surface Web, Public Web, or Fixed Web) consists of pages stored as individual files on servers.
- Search engines are the primary means for finding information on the "surface" Web. Authors may submit their own Web pages for listing [Directories like Yahoo], or search engines "crawl" or "spider" documents by following one hypertext link to another.
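
To make "following one hypertext link to another" more concrete, here is a minimal sketch of the link-finding step, using Python's standard html.parser module. The class name LinkFinder, the sample HTML, and the example.org / example.com addresses are assumptions for illustration; an actual spider would fetch each page over HTTP and repeat this step on every link it finds.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkFinder(HTMLParser):
    """Collect the href of every <a> tag, resolved against the page's own URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Turn relative links such as "/library/" into full URLs.
                    self.links.append(urljoin(self.base_url, value))

# Hypothetical page content; a real spider would download this.
SAMPLE_HTML = """
<html><body>
  <a href="/library/">Library</a>
  <a href="http://www.example.com/news.html">News</a>
  <p>No link here.</p>
</body></html>
"""

finder = LinkFinder("http://www.example.org/index.html")
finder.feed(SAMPLE_HTML)
print(finder.links)
```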

According to BrightPlanet (the white paper cited under Figure 6), the Invisible Web is the content that resides in searchable databases, the results from which can only be discovered by a direct query. Without the directed query, the database does not publish the result. When queried, deep Web sites post their results as dynamic Web pages in real-time. Though these dynamic pages have a unique URL address that allows them to be retrieved again later, they are not persistent.

Major Point - what can be found via a search engine like Google is much less than what exists in total on the Internet

But Search Engines might not search the "deep web" because:

1. Dynamically generated (database-driven) websites. Search Engines may have difficulty harvesting websites that are not marked up in HTML.

2. Search Engines CANNOT search password-protected sites like EBSCOhost journal databases or online catalogs.

3. Search Engines (often) CANNOT search for Adobe PDF, Word, or PowerPoint files on a web page.

The Invisible Web is estimated to be 500 to 1,000 times the size of the Surface Web.

Figure 6 displays the distribution of deep Web sites by type of content.

PIE CHART

Figure 6. Distribution of Deep Web Sites by Content - From White Paper @ http://www.brightplanet.com/deepcontent/tutorials/DeepWeb/index.asp

Size of WWW

According to the results of the Web Characterization Project's most recent survey, the surface / public Web, as of June 2002, contained 3,080,000 Web sites, or 35 percent of the Web as a whole. Public sites accounted for approximately 1.4 billion Web pages. The average size of a public Web site was 441 pages.

Source: Trends in the Evolution of the Public Web, D-Lib Magazine, April 2003 @ http://www.dlib.org/dlib/april03/lavoie/04lavoie.html

------------------------------------------------------------------------------------------

Table 1.13: The size of the Internet in terabytes.

Medium                 2002 Terabytes
Surface Web                       167
Deep Web                       91,850
Email (originals)             440,606
Instant messaging                 274
TOTAL                         532,897

Source: How much information 2003 @ http://www.sims.berkeley.edu/research/projects/how-much-info-2003/execsum.htm#summary
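
The figures in Table 1.13 can be checked with a few lines of arithmetic. The sketch below only re-adds the numbers printed in the table and compares the Deep Web with the Surface Web by volume; the variable names are assumptions for this example, and the comparison is by terabytes, not by number of pages or sites.

```python
# Figures copied from Table 1.13 (2002, in terabytes).
SIZES_TB = {
    "Surface Web": 167,
    "Deep Web": 91_850,
    "Email (originals)": 440_606,
    "Instant messaging": 274,
}

total = sum(SIZES_TB.values())
ratio = SIZES_TB["Deep Web"] / SIZES_TB["Surface Web"]
surface_share = 100 * SIZES_TB["Surface Web"] / total

print(f"Total: {total:,} TB")
print(f"Deep Web is about {ratio:.0f} times the Surface Web by volume")
print(f"Surface Web share of the total: {surface_share:.2f}%")
```

The sum matches the table's TOTAL of 532,897 terabytes, and the Deep Web works out to about 550 times the Surface Web by volume, which falls inside the "500 to 1,000 times" range quoted earlier.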

 


Comparing and ranking different search engines is another method of evaluating them. But the ranking depends on the emphasis one gives the following evaluation criteria:

1. Size of the database  (see http://www.searchenginewatch.com/reports/sizes.html for a report on size)
- everything included, dual numbers
- selected and reviewed content

2. File Types (allows "field" searching)
    - Web Pages, Usenet News, gopher, FTP, PDF (Adobe), Word (Blogs and  Wikis are findable by search engines)
    - Other [software, sound, images, video]
    - Material type: Location (country), language, newspapers, journals, books

3. Interface
    - modes: simple or complex; see Search Engine Evaluation Part I for details of Boolean searching, etc.

4. Ranking of results
    - frequency of word choices found on web pages
    - location: words found in meta-tags, first paragraph
    - reviewed sites
    - fee paid to rank sites higher in results list
    - proximity of words to each other
    - Link Popularity (Google, Inktomi), also known as Peer Ranking

A. PageRank Technology (Google) - ranks a site by how many other highly ranked sites link to it. For example, a link from the NASA web site would boost the ranking of a web site.
B. Results based on other searchers' selections and the length of time spent at a site

    - bundling of results into concepts, domains, and sites; for example, Teoma @ www.teoma.com

5. Limits
    - Language, Geography, file type, date

6. Timeliness
    - Frequency of Discovery
    - Timelag
    - Weeding

7. Description of sources (annotations) found in hit list

8. Speed
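
The link-popularity idea under "Ranking of results" can be sketched with a simplified calculation in the spirit of the published PageRank algorithm. This is not Google's actual implementation: the function name link_popularity, the damping value of 0.85, the iteration count, and the made-up link graph are all assumptions for illustration. The point is only that a page linked from a well-linked page (here "nasa") ends up scoring higher than a page linked from an obscure one.

```python
def link_popularity(links, damping=0.85, iterations=50):
    """Simplified PageRank-style score. `links` maps a page to the pages it
    links to; a page scores highly when highly scored pages link to it."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    score = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on in equal shares to the pages it links to.
                share = damping * score[page] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # A page with no outgoing links spreads its score evenly over all pages.
                for t in pages:
                    new[t] += damping * score[page] / n
        score = new
    return score

# Hypothetical link graph: three blogs link to "nasa", which links to "small-site".
LINKS = {
    "nasa": ["small-site"],
    "blog-1": ["nasa"],
    "blog-2": ["nasa"],
    "blog-3": ["nasa", "other-site"],
    "small-site": [],
    "other-site": [],
}

scores = link_popularity(LINKS)
for page, value in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{page:12s} {value:.3f}")
```

In this toy graph "small-site" and "other-site" each have exactly one incoming link, but "small-site" scores higher because its link comes from the heavily linked "nasa" page.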


Note the different ranks given to Search Engines by the following established review sites:

    http://www.searchenginewatch.com/reports/sizes.html

    http://www.notess.com/search/stats/size.shtml


What Search Engines Often Don't Search
- the following listing is often referred to as the "Invisible Web"

1. Contents of Adobe PDF and other formatted files - although search engines increasingly include these file types
2. The content of Sites requiring a log-in
3. CGI-Bin Output such as data requested by a form
4. Intranets
5. Commercial or proprietary indexes like ERIC, UMI, Lexis-Nexis
6. Sites that use a robots.txt to keep robots (search engines) away
7. Non-html resources: Telnet, ftp, gopher, etc.
8. Web sites that are "Database Driven". An example is the "old" SMC Web Page. Any page within the site is generated by a database, with a URL that looks like the following example - http://www.smcvt.edu/Admin3.asp?SiteAreaID=193&Level=1
Note that the URL does not end with .htm or .html. Also note the ? in the URL. Search Engine Spiders will generally not retrieve or harvest these URLs.
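
The two URL features pointed out in item 8 can be checked mechanically. The sketch below uses Python's standard urllib.parse module; the function name describe_url and the example.org address are assumptions for illustration, and the check is only a heuristic, not a description of how any particular spider decides what to skip.

```python
from urllib.parse import urlparse

def describe_url(url):
    """Report the two URL features the notes point out."""
    parts = urlparse(url)
    return {
        # True when the URL carries a '?' followed by parameters.
        "has_query": bool(parts.query),
        # True when the path ends in .htm or .html (a "fixed" page).
        "static_extension": parts.path.lower().endswith((".htm", ".html")),
    }

for url in (
    "http://www.smcvt.edu/Admin3.asp?SiteAreaID=193&Level=1",  # example from the notes
    "http://www.example.org/library/index.html",               # hypothetical static page
):
    print(url, describe_url(url))
```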


Spiders or Robots

1. Robot software (spiders, crawlers) uses HTTP to request documents associated with a certain URL. 


2. Robots use either a depth-first or breadth-first search strategy for following URLs.
- A depth-first robot follows the first link on the initial page, then the first link on the second page, and so on, going as deep as it can before backing up. This is used more commonly for subject-specific search engines.

- A breadth-first robot follows the first link on the initial page, then returns to the initial page and follows the second link, and so on, visiting every link on a page before going deeper. This is used most commonly for broad search engines.

3. URLs are organized in a database. 

4. The URLs from the database are "reharvested" and text from the sites is put in an index.  How much text is harvested varies among the search engines.

5. Harvesters generate text summaries.  Most copy the <title> and a fixed amount of the initial text.

6. The search engine (e.g., AltaVista) uses search software to search the index created by the robot searches.

7. Algorithms are used to set each individual search engine's search parameters: Boolean, wildcards, etc.

8. Algorithms are used by search engines to Rank the results of the search.  Factors that may be considered in Ranking: the fields in which the search terms are found (<title>, URL field); the number of times the word appears in a single document; where the search term appears in the document; and payment by companies to have their pages ranked high or first.

9. Netiquette for Robots.  The root directory of a Web server can contain a file named robots.txt, which lists the files and folders that robots are asked not to visit.  The robot should leave these web files alone for privacy reasons.   In our SMC web account, Web files may be located in a folder named "private" to prevent a "local search engine" from viewing them.
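
The two crawl strategies in step 2 and the robots.txt courtesy in step 9 can be sketched together on a toy site. Everything here is an assumption for illustration: the SITE link map, the robots.txt lines, the agent name "ExampleBot", and the crawl function. The robots.txt rules are parsed locally with Python's standard urllib.robotparser module instead of being fetched over HTTP, and no real pages are requested.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

# Hypothetical site: each page maps to the links found on it, in page order.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a1", "/a2"],
    "/b": ["/b1", "/private/notes"],
    "/a1": [], "/a2": [], "/b1": [], "/private/notes": [],
}

# Hypothetical robots.txt content asking all robots to stay out of /private/.
robots = RobotFileParser()
robots.parse([
    "User-agent: *",
    "Disallow: /private/",
])

def crawl(start, strategy="breadth", agent="ExampleBot"):
    """Return pages in the order visited, skipping paths robots.txt disallows."""
    frontier = deque([start])
    seen = {start}
    order = []
    while frontier:
        # Breadth-first takes the oldest URL (a queue); depth-first the newest (a stack).
        page = frontier.popleft() if strategy == "breadth" else frontier.pop()
        order.append(page)
        links = SITE.get(page, [])
        if strategy == "depth":
            # Push links in reverse so the first link on the page is followed first.
            links = list(reversed(links))
        for link in links:
            if link not in seen and robots.can_fetch(agent, link):
                seen.add(link)
                frontier.append(link)
    return order

print(crawl("/", "breadth"))
print(crawl("/", "depth"))
```

The breadth-first run visits "/a" and "/b" before any of their sub-pages, while the depth-first run finishes everything under "/a" before turning to "/b". In both runs "/private/notes" is never visited, because the robot honors the Disallow line.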

 


Web Search Features: Online guides

1. http://www.notess.com/search/features/

2. http://searchenginewatch.com/facts/ataglance.html