Title: Internet/Searching - Indexing the Internet An essay by John Hubbard analyzing the question of what is the best way to index the Internet, with references. |
Indexing the Internet
Q: Which is stronger: your teeth or your lips?
A: Your lips, because your teeth can be broken or fall out.
- Chinese riddle
One of the things computers have not done for an organization is to be able to
store random associations between disparate things, although this is something the brain has always
done relatively well.
- Tim Berners-Lee
There is a wealth of options available for today’s Internet searcher.
Computer-driven search engines offer the ultimate in depth of indexing by completely crawling
through Websites and compiling full-text databases. Internet directories present a more
disseminable record structure by cataloging sites within a hierarchical classification scheme.
Giant portal pages are driven by databases with hundreds of millions of entries, in contrast to a
range of specialized finding tools, designed to provide comprehensive or quality coverage within
limited areas. Other search sites offer variations, combinations, or even compilations of the above
types of searching tools.
While acknowledging that no one search tool is right for all search needs, this
paper will analyze the question of what is the best way to index the Internet. Cosmetic concerns
over the best user interface design and style will be addressed only insofar as they relate to the
actual content representation of the search tools. It is my contention that in order to meet the
needs of Internet searchers with the types of human and technological resources currently
available, a reliance on human-powered indexing methods - especially the classification and
description of documents by topical experts - is and probably will always be necessary. Allowing
for the differentiation between the needs of Internet and library searchers, methods of library
cataloging can be adapted to the Internet environment, and enhanced with the power of humans over
machines to best catalog information.
Information scientists have spent thousands of years developing systems for
classifying and retrieving information. Library cataloging systems such as the Dewey Decimal or
Library of Congress classification schemes were developed in accordance with the physical
limitations of the cataloged material. The Internet explosion, in contrast, can be partially
attributed to the virtual characteristics of online information. The flexibility of being able to
offer multiple access points to information reveals some inappropriateness in using archaic schemes
like the Dewey Decimal System to catalog the Internet. Many users of the Internet, moreover, may
often want nothing to do with anything that looks like traditional library systems or research methodologies.
The sophisticated nature of library cataloging tools, often involving many
intermediary steps between searching and finding information, such as using a printed directory of
reference works to locate a subject index to find a journal article, is reflective of those tools
being developed by experts for use by expert searchers or those with help from a trained librarian.
The ease of use of online search engines, where terms can be easily entered and searched, has
generated mass appeal by offering quick responses and alleviating the psychological vulnerability
that people may feel when asking a librarian for search help. Yet in many ways a need still exists
to educate users so that they can make a more productive and efficient use of searching resources.
An analysis of search engine queries, for example, shows that searchers seldom use the available
tools to hone search statements, such as Boolean operators or quotes to form phrases (Jansen et
al., 1998), while a generally low retrieval effectiveness of search engine results has been
documented (Gordon & Pathak, 1999). One solution to improving the search results from ambiguous
queries is the enforcement of more rigorous indexing systems.
Before applying library cataloging methods to the Internet outright, the
differences between libraries and the Internet need to be considered. Without being drawn into a
debate of libraries versus the Internet, it should be understood that while the exponentially
increasing amount of material available on the Internet represents a broad range of subjects, the
current depth and density of any serious Internet collections are usually vastly outnumbered by
frivolous, dubious, or redundant indexable matter and outweighed by the caliber of printed works
found in any medium-sized public library. The resulting major difference between libraries and the
Internet is that while libraries depend on funding to supplement scholarly collections, the
Internet, as a more interactive and entertaining medium, is primarily driven by commercial or
personal interests and accompanied with virtually no bibliographic control.
As a consequence of the economic tendencies of online information, many free
Internet services, including the major search engines, are subsidized by corporate sponsorships.
This should not necessarily be taken by users as an overwhelming detriment to their functionality,
no more than is paying taxes to support libraries, as long as the advertising within Internet
indexes is not ambiguously disguised or confused with the supposedly objective editorial content.
The problem here is instead the presence of unwanted and unauthorized advertising, prevalent within
many automated Internet search tools that rely on author indexing and lack human editorial
control, resulting in deceptive and faulty database entries being successfully made for commercial gain.
The production of fake entries in search engines helps in exposing the
vulnerabilities of cataloging methods that work so well in library automation when they are instead
used in an anarchistic commercial setting. The practices of those Internet publishers ruthlessly
out for commercial gain, such as ‘spamming the index’ - done by submitting pages with
attributes, sometimes as comprehensive as a small dictionary, that do not reflect the true content
of the site with the hopes of creating more access points - are constantly being matched with the
editorial efforts of computerized databases, producing a seemingly endless technological cycle
between the opposing sides. More aggressive methods of deception, such as ‘pagejacking’ -
done by stealing and representing the work of reputable organizations but then forwarding Internet
visitors to unrelated destinations, a practice which an estimated 25 million pages employ - are
being met with limited legal countermeasures such as policing actions by the Federal Trade
Commission, done in the name of consumer protection and punishing misleading Website advertisers
(Sullivan, 1999b).
It should be emphasized that these spamming vulnerabilities of search engines are
almost entirely due to their automated nature. Efforts to present search results based not just on
author-presented data, such as the frequency, positioning, and proximity of search terms, but also
on more objective data, such as the source domain of the indexed file, how
often searchers choose the link, and especially a sophisticated type of citation analysis that
charts authoritative pages and hubs by counting the number of links pointing to a page, do hold
promise for offering more relevant search results (Brin & Page, 1998; Chakrabarti et al.,
1999; Notess, 1999). It is reasonable to assume, however, that no matter how sophisticated the
spamming countermeasures adopted by automated indexes become, new ways of fooling the machines
could be crafted.
[See Henzinger et al. (2002) for an update regarding this prediction.]
Some amount of human editorial power therefore seems necessary.
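The link-counting idea mentioned above can be illustrated with a minimal sketch. It is not the algorithm of any particular search engine: it only counts raw inbound links, and the link graph and page names are hypothetical, chosen for illustration.

```python
from collections import Counter

# Hypothetical link graph: each page maps to the pages it links to.
links = {
    "a.example/home": ["c.example/guide", "b.example/faq"],
    "b.example/faq": ["c.example/guide"],
    "d.example/blog": ["c.example/guide", "b.example/faq"],
    "spam.example/x": ["spam.example/x"],  # a self-link should not count
}


def inlink_counts(graph):
    """Count distinct pages linking to each target, ignoring self-links."""
    counts = Counter()
    for source, targets in graph.items():
        for target in set(targets):
            if target != source:
                counts[target] += 1
    return counts


def rank_by_inlinks(graph):
    """Return (page, count) pairs, most-linked first; ties broken by name."""
    counts = inlink_counts(graph)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


ranking = rank_by_inlinks(links)
```

Even this simple count ignores a page that merely links to itself, but a publisher who controls many pages could still inflate the totals, which is the kind of countermeasure-and-response cycle described above.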
As well as the need for human control against fake records, there are potentially
insurmountable difficulties in unleashing computers to comprehend language. Language is replete
with synonyms, polysemy, homonyms, spelling variations, and slang, and discourse is full of
variable contextual meanings and linguistic nuances such as puns, poetry, and sarcasm, making
full-text databases rather blunt tools in their over-reaching attempts to process natural language.
A human indexer examines a document and identifies its principal concepts with a controlled
vocabulary, using a caliber of mental comprehension unparalleled in the most advanced computer
science or any so-called artificial intelligence. Until computers can comprehend language and hold
their own in a conversation, there is a gap in their capabilities in analyzing and indexing text.
Indeed, book indexing by computers, despite all its promises of mechanized efficiency, has remained
unsuccessful (Korycinski & Newell, 1990), and continues to be the task of trained
professionals. Technological aids currently exist for humans to apply book indexing methods in
cataloging the Internet, not only to acknowledge accurately author-cataloged sites, but to
accurately map the mental structure of search terms in a cross-referenced thesaurus of subject
headings (Humphreys, 1999).
Information scientists with experience managing searchable catalogs have learned
how to enhance retrieval effectiveness by matching indexing tools with search needs, and are
familiar with the inherently related topics in the psychology of language, such as the frequency
and distribution of subjects and vocabulary terms and the high variance of natural language terms
for identical concepts, all of which illustrate the benefits to search effectiveness of presenting
a hierarchical classification of information (Bates, 1998). Another benefit to search precision is
narrowing search domains to specific subjects, accomplished by honing the scope of what information
is searched, perhaps by limiting searches to certain source domains or languages, or conducting
specialized searches in subject-oriented search engines, such as FindLaw.
One potential problem that diminishes with the hierarchical classification of
Websites by humans is the difficulty of accurately ranking the results of simple searches to
full-text databases. Although the exact ranking algorithms used by search tools are company
secrets, the relationships of the search terms to the general Webpage attributes, such as the
frequency of search terms within the document, title, headings, or metatags, are calculated before
displaying search results. A survey of the quality of results ranking by five popular search
engines, which measured the relevance of results from various topical searches, found a
“generally good” presentation ranking of results, but not without errors and
inconsistencies (Courtois & Berry, 1999). Navigators of a hierarchy using categorical searches
and site descriptions can avoid a reliance on automated results ranking, and be sure of examining
all entries within selected categories.
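As a rough illustration of ranking by term frequency within page attributes, the sketch below scores pages by weighted counts of the query terms. The field weights and sample pages are assumptions made for this example; actual ranking algorithms are, as noted, company secrets.

```python
import re

# Hypothetical field weights; real engines do not publish theirs.
FIELD_WEIGHTS = {"title": 3.0, "headings": 2.0, "body": 1.0}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(page, query):
    """Sum weighted counts of each query term across the page's fields."""
    terms = tokenize(query)
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = tokenize(page.get(field, ""))
        total += weight * sum(words.count(term) for term in terms)
    return total


def rank(pages, query):
    """Return page names by descending score, dropping pages that score zero."""
    scored = [(score(page, query), name) for name, page in pages.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [name for value, name in ordered if value > 0]


pages = {
    "catalog": {
        "title": "Library cataloging",
        "headings": "Dewey",
        "body": "cataloging rules for a library",
    },
    "stuffed": {
        "title": "Cheap deals",
        "headings": "",
        "body": "library library library library library",
    },
    "other": {"title": "Gardening", "headings": "", "body": "soil and seeds"},
}

results = rank(pages, "library")
```

In this toy data the page that simply repeats the search term outranks the page that is actually about library cataloging, which suggests why purely frequency-based ranking produces errors and inconsistencies.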
Merely classifying and categorizing the Internet, rather than compiling a full-text
index by computer, does leave users without the capability to conduct the best searches on obscure
topics, for which access to full-text databases is useful. The coverage of the largest search
engines, which by a recent analysis have indexed no more than about 16% of the Web (Lawrence &
Giles, 1999), however, still leaves much to be desired. Meta-search engines (which attempt to
integrate the unique content of individual search engines and directories, each with their own
features and interfaces, by presenting the search results from many services from just one query)
do offer timesaving features for those doing an exhaustive search on obscure topics. But because
meta-search engines cannot take full advantage of the unique features of the individual search
engines, the quality of results from meta-searches is, in a sense, only as strong as the weakest
of links offered. They may, however, help users to find which search tools are best for their type
of search needs. Conducting meta-searches may also often not be worth the effort for those seeking
basic information on general topics, as this can be found through most any search engine. Likewise,
even searchers with specific needs will find that these are better met using search tools devoted
to specialized topics.
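One simple way a meta-search tool might combine ranked lists from several services is sketched below. The round-robin interleaving and the sample result lists are assumptions for illustration only; no actual meta-search engine is known to work exactly this way.

```python
from itertools import zip_longest


def merge_results(result_lists):
    """Interleave ranked URL lists round-robin, keeping the first copy of each URL."""
    merged, seen = [], set()
    for tier in zip_longest(*result_lists):
        for url in tier:
            # zip_longest pads shorter lists with None; skip the padding.
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


# Hypothetical ranked results returned by two different services.
engine_a = ["x.example", "y.example", "z.example"]
engine_b = ["y.example", "w.example"]

merged = merge_results([engine_a, engine_b])
```

Note that the merge can only use what every service has in common, a ranked list of addresses, so any feature unique to one engine is lost along the way.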
Using specialized search engines or human-constructed directories does sacrifice
the comprehensiveness of the large full-text databases. When weighing the impact of this sacrifice
we must again consider the characteristics of the Internet and the needs of its searchers. Unlike the
doctoral student who scours all available library catalogs to exhaust the coverage on a topic,
public Internet searchers often want just one good result per search. Topical clearinghouses that
point to quality information are designed to serve these search needs, and may also hold even more
entries for their subjects than are available through comprehensive indexes. Concerns over not
having comprehensive results are therefore outweighed by the need for the quality and relevance
offered by individual Internet catalogers. [Added: Guernsey (2001) provides a good overview of this
topic.]
Even if they were to successfully survey the entire Internet, the limitation of
wholly computerized Internet indexing systems is that the automated spider robots that crawl
through Websites retrieving data can only read open-text formats, such as HTML files, and cannot
record any more than the basic file attributes of non-text format files, including PDF, sound,
image, and video files. It is more difficult to mechanically extract cataloging data from
multimedia objects because they are more complicated in format and abstract in subject. While most
comprehensive search engines do offer multimedia searching capabilities (Jacsó, 1999), the image,
sound, video, and other collections of specialized media search engines, such as Corbis and MP3.com, have devoted fuller resources towards creating databases with more
useful methods for finding genre-specific file formats. Furthermore, most search engines cannot
survey frame-based sites or dynamic pages, such as those in a database pulled with CGI or Perl
scripting, and have problems with pages in XML format (Sullivan, 1999a).
It could be argued that a reliance on individual subject catalogers and specialized
indexes results in an unacceptable variance among the array of available finding tools. A
comprehensive but automated Internet indexing system, however, also varies in composition from a
reliance on individual page owners to submit and properly code their pages. The use of keywords in
the HTML meta tag, for example, has been shown to cause a significant improvement in the
retrievability of a document (Turner & Brackbill, 1998), and more refined conventions modeled
from the database fields in library catalogs, such as the Dublin Core (Weibel, 1997), offer more
detailed descriptions and capabilities for specialized access points. Yet without a consistent
format being adopted by the millions of Internet publishers, and given the inherent
spamming vulnerabilities of an automated self-cataloging system, there remains little consistent
benefit to using automated cataloging over human selection. Any information or codes hidden from
the screen of Web browsers, such as meta tags, are just as likely to serve tricks like
pagejacking as to provide authentic cataloging information. Quality pages not properly
tagged or submitted to a search engine, or especially those restricting access from automated
indexing robots, may not be included in computerized indexes, whereas human indexers will be more
likely to include only important and relevant sites.
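To make the meta tag discussion concrete, the following sketch reads the keywords an author has declared and flags those that never appear in the visible text. It uses Python's standard html.parser module; the sample page and the substring check are assumptions for illustration, and a real index would need far more careful comparison.

```python
from html.parser import HTMLParser


class MetaKeywordParser(HTMLParser):
    """Collect the comma-separated terms from <meta name="keywords"> tags."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "keywords":
            content = attributes.get("content") or ""
            self.keywords.extend(
                term.strip().lower() for term in content.split(",") if term.strip()
            )


def extract_keywords(html):
    """Return the author-declared keywords found in an HTML document."""
    parser = MetaKeywordParser()
    parser.feed(html)
    return parser.keywords


def unsupported(keywords, visible_text):
    """Return declared keywords that never occur in the visible text (naive check)."""
    text = visible_text.lower()
    return [keyword for keyword in keywords if keyword not in text]


# Hypothetical page whose hidden keywords overstate its content.
sample = """<html><head>
<title>Widgets</title>
<meta name="keywords" content="widgets, Free Music, celebrity news">
</head><body><p>We sell widgets.</p></body></html>"""

keywords = extract_keywords(sample)
suspect = unsupported(keywords, "We sell widgets.")
```

Because the declared keywords are invisible to readers, an automated index has only heuristics like this one to judge them, whereas a human cataloger simply reads the page.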
One way to alleviate the problems of the improper use of metadata is by diminishing
the scope of indexed pages from that of Internet-wide searching services to more trusted domains.
The full capabilities of automated cataloging tools such as metadata can best be harnessed within a
realm of authors known to be responsible or inside well-regulated domains, such as corporate
intranets or academic institutions. Human-driven quality control efforts within limited-scope
search engines, such as at Noesis [Offline] - a
finding tool for online philosophical research that only allows entries by authors with a doctoral
degree - successfully demonstrate that: “it is technologically possible and economically
feasible to build a system of dissemination for academic resources that is completely administrated
by the scholarly world without the intervention of economic interests.” (Beavers, 1998).
Specialized searching services can also focus on creating a directory of their topic on the
Internet more easily than can the staffs of large search engines who must maintain a broader
coverage of links.
Considering the benefits of an Internet directory, we also find that many of the
historical difficulties of library cataloging disappear when classification systems are augmented
within a computerized format. Since a digitized hierarchy easily allows room for expansion when new
terms and categories arise, and natural language queries can be readily mapped to retrieve terms in
the classification system, many of the supposed disadvantages to using a classification scheme and
its controlled vocabulary are avoided (Mitchell, 1998). Rather than letting searchers only wade
through an unstructured mass of open text database entries (even if accompanied with automated
tools that attempt to cross-link entries, such as ‘more like this’ links, which by their
mechanized nature will produce results of variable quality and accuracy), it is preferable to allow
users to search and browse organized categories with multiple access points to information (Ellis
& Vasconcelos, 1999).
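The mapping of natural language terms onto a controlled vocabulary within an expandable hierarchy can be sketched as follows. The headings, variant terms, and function names are all hypothetical and do not come from any real classification scheme.

```python
# Hypothetical entry vocabulary: variant terms mapped to a preferred heading.
USE_FOR = {
    "automobiles": "Cars",
    "autos": "Cars",
    "cars": "Cars",
    "motorcars": "Cars",
    "films": "Motion pictures",
    "movies": "Motion pictures",
}

# Hypothetical hierarchy: each heading points to its broader category.
BROADER = {
    "Cars": "Transportation",
    "Motion pictures": "Arts",
    "Transportation": None,
    "Arts": None,
}


def resolve(term):
    """Map a searcher's word to the preferred heading, or None if unknown."""
    return USE_FOR.get(term.strip().lower())


def path_to(heading):
    """Return the category path from the top of the hierarchy down to heading."""
    path = []
    while heading is not None:
        path.append(heading)
        heading = BROADER.get(heading)
    return list(reversed(path))


def add_heading(heading, broader, variants):
    """Add a new heading under an existing one, with its variant search terms."""
    BROADER[heading] = broader
    for variant in variants:
        USE_FOR[variant.lower()] = heading


autos_path = path_to(resolve("Autos"))

# Expanding the scheme when a new category arises takes one call.
add_heading("Electric cars", "Cars", ["electric cars", "evs"])
ev_path = path_to(resolve("EVs"))
```

Here a searcher who types "Autos" or "motorcars" lands in the same category, and a new heading slots into the hierarchy without renumbering or reshelving anything.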
The need for Internet directories is exemplified by AltaVista, one of the largest, oldest, and
most reputable full-text search engines. In addition to their full-text database, AltaVista has added a
hierarchical format available for searchers, taken from the Open Directory Project. In addition,
phrased questions presented to the search engine are processed by the Ask Jeeves system - a database with human-indexed pointers to Websites
providing the answers to many commonly asked questions.
Further alternatives to both search engines and hierarchical directories will
always be available because any Website can make links to other pages. While the practice of simply
surfing through links provides searchers with random experiences (although perhaps with the benefit
of serendipity), more structured surfing methods are available. Webrings, for example, are a
collection of common sites that are all interlinked, allowing their navigators to browse through
Websites with related topics and communally maintained connections (Casey, 1998).
The great promise of automated indexing tools is that they provide a level of
detail greater than any humanly powered method of indexing. Automated searching aids are therefore
somewhat necessary to keep up with the millions of pages being added to the Internet. Finding dead
links and maintaining timely information within any Internet search tool are needs that have
steadily improving automated solutions. When it comes to sorting through all of this data,
however, and making the best sense of what is available online, the power and assurance of human
understanding and editorial control must also be called upon.
One possible difficulty of Internet finding tools driven by human power is that
they cannot keep up with the capabilities of automated systems. While they cannot do full-text
indexing, the combined efforts of the Internet publishers who maintain quality subject indexes do
in fact meet most searching needs. Virtually every Website contains a list of links maintained by
the author. Offering the authoritative and best subject indexes available to searchers can produce
a far greater information retrieval system than any single search tool. Links for Chemists, an index to
over 8,000 Chemistry-related Websites, is a cooperative example of a broad subject index that
includes subsection contributions from different editors. Such gateway pages can be found with
intermediate finding tools such as Invisible
Web, a database of over 10,000 specialized online search tools, or Argus Clearinghouse, a selective index of
quality Internet subject catalogs.
An Internet-wide example of a communally produced directory is the Open Directory Project. Maintained mostly by
volunteers, the Open Directory has cataloged over 1.2 million Websites within a hierarchical
classification system. In contrast to a staff-run directory such as Yahoo!, which seems more intent on retaining surfers for more advertising
opportunities - accomplished by offering services such as online games, fantasy league
competitions, chat, and e-mail - than it has devoted resources towards maintaining a quality
directory, the challenges of a human-maintained comprehensive Internet index seem to have been met
by the combined efforts of the over 30,000 Open Directory contributors (Dunn, 1999).
Depending on the type of information being sought, it may be best to use a large
full-text search engine with sophisticated relevance ranking abilities, such as Google, an Internet-wide hierarchical
classification system such as the Open
Directory Project (which has been added to Google), or locate a quality subject index through an
intermediate directory such as the Argus
Clearinghouse. Due to the inabilities of computers to comprehend language or practice quality editorial
control, the available capabilities of human-powered cataloging systems for now and in the foreseeable
future remain essential tools for indexing the Internet.
References
Bates, M. (1998). Indexing and access for digital libraries and the Internet: Human,
database, and domain factors. Journal of the American Society for Information Science,
49(13), 1185-1205.
Beavers, A. F. (1998). Evaluating search engine models for scholarly purposes.
D-Lib Magazine, December 1998. Available: http://www.dlib.org/dlib/december98/12beavers.html.
Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual Web
search engine. Seventh International World Wide Web Conference, Brisbane, Australia, April
14-18. Available: http://infolab.stanford.edu/~backrub/google.html.
Casey, C. (1998). Web rings: An alternative to search engines. College &
Research Libraries News, 59(10), 761-763.
Chakrabarti, S., Dom, B., Kumar, S. R., Raghavan, P., Rajagopalan, S., Tomkins, A.,
Kleinberg, J. M., & Gibson, D. (1999). Hypersearching the Web. Scientific American, June
1999.
Courtois, M. P., & Berry, M. W. (1999). Results-ranking in Web search engines.
Online, 23(3), 39-40.
Dunn, A. (1999). Open Directory in search of the best of the Web. Los Angeles
Times, 18 October 1999, C1.
Ellis, D., & Vasconcelos, A. (1999). Ranganathan and the Net: Using facet
analysis to search and organise the World Wide Web. Aslib Proceedings, 51(1), 3-10.
Gordon, M., & Pathak, P. (1999). Finding information on the World Wide Web: The
retrieval effectiveness of search engines. Information Processing & Management, 35(2), 141-180.
Guernsey, L. (2001). Mining the 'Deep Web' With Specialized Drills. New York Times, January 25, 2001. Available: http://www.nytimes.com/2001/01/25/technology/25SEAR.html.
Humphreys, N. K. (1999). Mind maps: Hot new tools proposed for cyberspace
librarians. Searcher, 7(6).
Jacsó, P. (1999). Sorting out the wheat from the chaff. Information Today,
16(6), 38.
Jansen, B. J., Spink, A., Bateman, J., & Saracevic, T. (1998). Real life
information retrieval: A study of user queries on the Web. SIGIR Forum, 32(1), 5-17.
Korycinski, C., & Newell, A. F. (1990). Natural-language processing and
automatic indexing. The Indexer, 17(1), 21-29.
Lawrence, S., & Giles, L. (1999). Accessibility of information on the Web.
Nature, 400(6740), 107-109.
Mitchell, J. S. (1998). In this age of WWW is classification redundant? Catalogue
& Index, 127, 5.
Notess, G. R. (1999). Rising relevance in search engines. Online, 23(3),
84-86.
Sullivan, D. (1999a). Crawling under the hood: An update on search engine
technology. Online, 23(3), 30-32.
Sullivan, D. (1999b). FTC steps in to stop spamming. The Search Engine
Report, 4 October 1999. Available: http://searchenginewatch.com/showPage.html?page=2167501.
Turner, T. P., & Brackbill, L. (1998). Rising to the top: Evaluating the use of
the HTML meta tag to improve retrieval of World Wide Web documents through internet search engines.
Library Resources & Technical Services, 42(4), 259-271.
Weibel, S. (1997). The Dublin Core: A simple content description model for
electronic resources. Bulletin of the American Society for Information Science, 24(1), 9-11.
This essay was written by John Hubbard for the Drexel University College of Information
Science and Technology course "INFO 622: Content Representation" in December 1999.
Although changes such as updating Web links, numbers, and appending additional references have been made, it has not
been significantly altered from the original version; the last modified date shown below indicates when
this Webpage was last uploaded in its present form.
Created, maintained and © by John Hubbard (write to me). Disclaimers.
Hosted by Dreamhost.
Last modified: August-09-2007.