Tuesday, November 4, 2008

week 10 reading notes

Web Search Engines Part 1
It was believed in 1995 that indexing the web couldn't be done because of the web's exponential growth. In order to provide the most useful and cost-effective service, search engines must reject as much low-value automated content as possible. Also, web search engines have no access to restricted content. Large search engines operate multiple geographically distributed data centers, and within each data center, services are built up from clusters of commodity PCs. The amount of web data that search engines currently crawl and index is about 400 terabytes. A simple crawling algorithm uses a queue of URLs yet to be visited and a fast mechanism for checking whether a URL has already been seen. The crawler works by making an HTTP request to get the page at the first URL in the queue; when it gets that page, it scans the content for links to other URLs and adds each unseen URL to the queue. A simple crawling algorithm must be extended to address the following issues: speed, politeness, excluded content, duplicate content, continuous crawling, and spam rejection. As can be seen, crawlers are highly complex systems.
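To make the queue-plus-seen-set idea concrete, here is a minimal Python sketch of the simple crawling algorithm described above. The fetch_links() helper is a hypothetical stand-in for the HTTP request and link extraction; a real crawler would also handle politeness delays, robots.txt, duplicate content, and so on.

```python
# Minimal sketch of the simple crawling algorithm: a queue of URLs to visit
# plus a "seen" set so each URL is fetched only once.
# fetch_links(url) is a hypothetical helper that performs the HTTP GET and
# returns the URLs found in the page.

from collections import deque

def crawl(seed_urls, fetch_links, max_pages=1000):
    queue = deque(seed_urls)      # URLs yet to be visited
    seen = set(seed_urls)         # fast check for already-seen URLs
    pages_fetched = 0

    while queue and pages_fetched < max_pages:
        url = queue.popleft()     # take the first URL in the queue
        links = fetch_links(url)  # get the page and scan it for links
        pages_fetched += 1
        for link in links:
            if link not in seen:  # only enqueue unseen URLs
                seen.add(link)
                queue.append(link)
```

This is just the core loop; the extensions listed above (speed, politeness, spam rejection, etc.) are what make production crawlers so complex.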

Web Search Engines Part 2
Search engines use an inverted file to rapidly find the documents that contain each indexing term. An indexer can create an inverted file in two phases. The first is scanning: the indexer scans the text of each input document, and for each indexable term it encounters, it writes a posting consisting of a document number and a term number to a temporary file. The second is inversion: the indexer sorts the temporary file into term-number order and records the starting point and length of each term's list in the term dictionary.
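A minimal Python sketch of that two-phase process, under the simplifying assumption that everything fits in memory (a real indexer writes postings to temporary files on disk and sorts/merges them there):

```python
# Sketch of two-phase inverted-file construction: scanning emits
# (term number, document number) postings, inversion sorts them into
# term-number order and groups each term's posting list.

def build_inverted_file(documents):
    # Phase 1: scanning
    term_dictionary = {}   # term -> term number
    postings = []
    for doc_number, text in enumerate(documents):
        for term in text.lower().split():
            term_number = term_dictionary.setdefault(term, len(term_dictionary))
            postings.append((term_number, doc_number))

    # Phase 2: inversion
    postings.sort()        # sort into term-number order
    index = {}             # term number -> list of document numbers
    for term_number, doc_number in postings:
        index.setdefault(term_number, []).append(doc_number)
    return term_dictionary, index
```

At query time the engine looks the term up in the dictionary and jumps straight to that term's posting list, which is what makes the lookup fast.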

The Deep Web: Surfacing Hidden Value
The deep web refers to information that is hidden on the web and won't come up in a search engine. This is because most web information is buried far down on dynamically generated sites, and standard search engines never find it. The deep web contains public information that is 400 to 550 times larger than the commonly defined world wide web. Search engines get their listings in two ways: one is when someone submits a site to them, and the second, mentioned in the other article, is crawling. Crawling can retrieve too many results. The goals of the study were to:
- quantify the size and importance of the deep web
- characterize the deep web's content, quality, and relevance to information seekers
- begin the process of educating the internet searching public about this heretofore hidden and valuable information storehouse
This study did not investigate non-web sources or private intranet information. The authors then came up with a common denominator for size comparisons. All of the retrieval, aggregation, and characterization in the study used BrightPlanet technology. Analyzing deep web sites involved a number of discrete tasks, such as qualifying a site as a deep website. We see from the results that existing deep web sites cover a wide range of topics. The article also mentions the differences between deep websites. The deep web is roughly 500 times larger than the surface web. It is possible for deep web information to surface and for surface web information to remain hidden.
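To make the "buried on dynamically generated sites" point concrete, here is a toy Python sketch (all names, URLs, and data are invented for illustration) contrasting static links a crawler can follow with content that only exists as the response to a query typed into a form:

```python
# Toy illustration: link-following crawlers only request URLs they discover in
# page text, so result pages generated on the fly from a search form are never
# requested and never indexed. Everything below is made up for illustration.

# Links a crawler would find embedded in a site's surface-web pages.
static_links = [
    "http://example.org/index.html",
    "http://example.org/about.html",
]

# A database-backed search endpoint: its result pages exist only when a query
# is supplied (e.g. http://example.org/search?q=whales).
def search_database(query):
    records = {
        "whales": ["Blue whale census", "Humpback migration survey"],
        "volcanoes": ["Eruption monitoring reports"],
    }
    return records.get(query, [])

# The crawler's frontier contains only the static links; it never "types" a
# query, so none of the result pages below are ever fetched or indexed.
print(static_links)
print(search_database("whales"))   # reachable only if you already know to ask
```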

Current Developments and Future Trends for the OAI Protocol for Metadata Harvesting
The OAI-PMH was initially developed as a means to federate access to diverse e-print archives through metadata harvesting and aggregation. The protocol has demonstrated its potential usefulness to a broad range of communities; there are over 300 active data providers using it. One notable use is the Open Language Archives Community, whose mission is to create a "worldwide virtual library of language resources." There are registries of OAI repositories, but they have two shortcomings: they maintain very sparse records about individual repositories, and they lack completeness. The UIUC research group built the Experimental OAI Registry to address these shortcomings. The registry is now fully operational, but there remain many improvements the group would like to make to increase its usefulness. There are still ongoing challenges for the OAI community today, such as metadata variation and metadata formats.
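For a sense of how simple the harvesting side of OAI-PMH is, here is a minimal Python sketch that asks a data provider for records in Dublin Core format and pulls out their identifiers. The base URL in the usage comment is a placeholder, and a real harvester would also handle resumption tokens, error responses, and incremental (from/until) harvesting.

```python
# Minimal sketch of OAI-PMH harvesting: send a ListRecords request with the
# oai_dc metadata prefix and read record identifiers from the XML response.

from urllib.request import urlopen
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def list_record_identifiers(base_url):
    # Ask the data provider for records in the simple Dublin Core format.
    query = urlencode({"verb": "ListRecords", "metadataPrefix": "oai_dc"})
    with urlopen(f"{base_url}?{query}") as response:
        tree = ET.parse(response)

    # Pull the identifier out of each record header.
    identifiers = []
    for header in tree.iter(f"{OAI_NS}header"):
        identifier = header.find(f"{OAI_NS}identifier")
        if identifier is not None:
            identifiers.append(identifier.text)
    return identifiers

# Example usage (placeholder repository URL):
# print(list_record_identifiers("http://example.org/oai"))
```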

2 comments:

Anonymous said...

Your overview of the Deep Web article was really excellent. I just find it incredible that such a thing as "the deep web" even exists and that most people who use the web...don't even know about it. It's sort of the dark hole of the internet.

I wonder how one really accesses the "deep web".
???

Anonymous said...

What is considered BrightPlanet technology?