Sunday, November 16, 2008

week 11 comments

https://www.blogger.com/comment.g?blogID=301150766198525940&postID=5836373957895676096&page=1

https://www.blogger.com/comment.g?blogID=1057727177405306622&postID=1506011019081134724&page=1

Friday, November 14, 2008

week 11 readings

Digital Libraries: Challenges and Influential Work
Effectively searching all digital resources over the internet remains a challenging, problem-filled task. People have been working to turn the vast amount of digital collections into true digital libraries. "Federal programmatic support for digital library research was formulated in a series of community-based planning workshops sponsored by the National Science Foundation (NSF) in 1993-1994." "The first significant federal investment in digital library research came in 1994 with the funding of six projects under the auspices of the Digital Libraries Initiative (now called DLI-1) program" DLI-1 was later followed by DLI-2. "In aggregate, between 1994 and 1999, a total of $68 million in federal research grants were awarded under DLI-1 and DLI-2." "DLI-1 funded six university-led projects to develop and implement computing and networking technologies that could make large-scale electronic test collections accessible and interoperable." Several of the projects examined issues connected with federation. "There has been a surge of interest in metasearch or federated search technologies by vendors, information content providers, and portal developers. These metasearch systems employ aggregated search (collocating content within one search engine) or broadcast searching against remote resources as mechanisms for distributed resource retrieval. Google, Google Scholar and OAI search services typify the aggregated or harvested approach."
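A rough sketch of the two distributed retrieval approaches mentioned above: broadcast searching sends the query out to each remote resource at search time, while aggregated (harvested) searching answers the query from one locally collected index. Everything in the sketch (the toy index, the function names) is a hypothetical illustration, not something from the article.

# Rough sketch of the two distributed-search mechanisms described above.
# The remote search functions and the local index are hypothetical placeholders.

# Broadcast (federated) searching: query each remote resource at search time
# and merge whatever results come back.
def broadcast_search(query, remote_search_functions):
    results = []
    for search in remote_search_functions:   # one callable per remote resource
        results.extend(search(query))
    return results

# Aggregated (harvested) searching: content/metadata is collected ahead of time
# into one local index, and queries are answered from that single index.
def aggregated_search(query, local_index):
    return [doc for doc in local_index if query.lower() in doc.lower()]

# Example with toy data:
local_index = ["Digital library metadata", "Search engine history"]
print(aggregated_search("metadata", local_index))   # -> ['Digital library metadata']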

Dewey Meets Turing: Librarians, Computer Scientists, and the Digital Libraries Initiative
The Google search engine emerged from work funded by the DLI. An interesting aspect of the DLI was how it united librarians and computer scientists. "For computer scientists NSF's DL Initiative provided a framework for exciting new work that was to be informed by the centuries-old discipline and values of librarianship" "For librarians the new Initiative was promising from two perspectives. They had observed over the years that the natural sciences were beneficiaries of large grants, while library operations were much more difficult to fund and maintain. The Initiative would finally be a conduit for much needed funds." "...the Initiative understood that information technologies were indeed important to ensure libraries' continued impact on scholarly work." "The Web's advent significantly changed many plans. The new phenomenon's rapid spread propelled computer scientists and libraries into unforeseen directions. Both partners suddenly had a somewhat undisciplined teenager on their hands without the benefit of prior toddler-level co-parenting." "The Web not only blurred the distinction between consumers and producers of information, but it dispersed most items that in the aggregate should have been collections across the world and under diverse ownership. This change undermined the common ground that had brought the two disciplines together."

Institutional Repositories: Essential Infrastructure for Scholarship in the Digital Age
"The development of institutional repositories emerged as a new strategy that allows universities to apply serious, systematic leverage..." Many things have made this possible such as the price of online storage costs dropping and the Open archives metadata harvesting protocol. "The leadership of the Massachusetts Institute of Technology (MIT) in the development and deployment of the DSpace institutional repository system http://www.dspace.org/, created in collaboration with the Hewlett Packard Corporation, has been a model pointing the way forward for many other universities." "... a university-based institutional repository is a set of services that a university offers to the members of its community for the management and dissemination of digital materials created by the institution and its community members." "At the most basic and fundamental level, an institutional repository is a recognition that the intellectual life and scholarship of our universities will increasingly be represented, documented, and shared in digital form." The author includes another use for IR's and some cautions as well. ". I have argued that research libraries must establish new collection development strategies for the digital world, taking stewardship responsibility for content that will be of future scholarly importance..." The article ends by mentioning the future of IR.

Thursday, November 13, 2008

muddiest point week 11

My muddiest point for this week is whether we only need 10 weeks' worth of comments and muddiest points. I remember hearing something like that, but I am not 100% sure and want to make sure I do not miss any assignments.

Monday, November 10, 2008

assignment

I had trouble using FTP to post my page to the Pitt server; I spent hours frustrated trying to get it to work. I ended up using other software to upload the page, but I still followed all the guidelines. Here is my site: http://www.freewebs.com/karategirl611/

Sunday, November 9, 2008

week 10 comments

Here are my comments for the week: https://www.blogger.com/comment.g?blogID=633484337573796975&postID=4757088958536674311&page=1
https://www.blogger.com/comment.g?blogID=4736393327020365268&postID=8240492140679815932&page=1

Tuesday, November 4, 2008

week 10 reading notes

Web Search Engines: Part 1
In 1995 it was believed that indexing the web couldn't be done because of the web's exponential growth. In order to provide the most useful and cost-effective service, search engines must reject as much low-value automated content as possible. Also, web search engines have no access to restricted content. Large search engines operate multiple geographically distributed data centers, and within a data center services are built up from clusters of commodity PCs. The amount of web data that search engines currently crawl and index is about 400 terabytes. A simple crawling algorithm uses a queue of URLs yet to be visited and a fast mechanism for determining whether a URL has already been seen. The crawler makes an HTTP request to get the page at the first URL in the queue; when it gets that page, it scans the content for links to other URLs and adds each unseen URL to the queue. A simple crawling algorithm must be extended to address the following issues: speed, politeness, excluded content, duplicate content, continuous crawling, and spam rejection. As can be seen, crawlers are highly complex systems.
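A minimal sketch of the simple crawling loop described above, assuming a hypothetical seed URL and a crude link extractor; a real crawler would also need the politeness delays, robots.txt handling, duplicate-content detection, and spam rejection the article mentions.

# Minimal sketch of the simple crawling algorithm described above.
# The seed URL is a hypothetical example.
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def simple_crawl(seed_url, max_pages=10):
    queue = deque([seed_url])          # URLs yet to be visited
    seen = {seed_url}                  # fast "have we already seen this URL?" check
    while queue and max_pages > 0:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except (OSError, ValueError):
            continue                   # skip pages that fail to download
        max_pages -= 1
        # Scan the page content for links and add each unseen URL to the queue.
        for link in re.findall(r'href="([^"]+)"', html):
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return seen

# Example (hypothetical seed): simple_crawl("http://www.example.com/")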

Web Search Engines: Part 2
Search engines use an inverted file to rapidly identify the documents that contain a query term. An indexer can create an inverted file in two phases. The first is scanning: the indexer scans the text of each input document, and for each indexable term it encounters it writes a posting, consisting of a document number and a term number, to a temporary file. The second is inversion: the indexer sorts the temporary file into term-number order and records the starting point and length of the postings list for each entry in the term dictionary.
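A minimal in-memory sketch of the two phases described above: scanning writes (term number, document number) postings to a temporary list, and inversion sorts them by term and records where each term's postings list starts and how long it is. The sample documents are hypothetical.

# Minimal sketch of two-phase inverted file construction.
def build_inverted_file(documents):
    term_ids = {}          # term dictionary: term -> term number
    postings = []          # temporary "file" of (term number, doc number) postings

    # Phase 1: scanning
    for doc_number, text in enumerate(documents):
        for term in text.lower().split():
            term_number = term_ids.setdefault(term, len(term_ids))
            postings.append((term_number, doc_number))

    # Phase 2: inversion
    postings.sort()                        # sort into term-number order
    dictionary = {}                        # term number -> (start, length) of its list
    start = 0
    for i in range(1, len(postings) + 1):
        if i == len(postings) or postings[i][0] != postings[i - 1][0]:
            dictionary[postings[i - 1][0]] = (start, i - start)
            start = i
    return term_ids, dictionary, postings

docs = ["the cat sat", "the dog ran", "the cat ran"]   # hypothetical documents
terms, dictionary, postings = build_inverted_file(docs)
# dictionary[terms["cat"]] gives the (start, length) of the postings list for "cat"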

The Deep Web: Surfacing Hidden Value
The deep web refers to information that is hidden on the web and won't come up in a search engine. This is because most web information is buried far down on dynamically generated sites, and therefore standard search engines never find it. The deep web contains public information that is 400 to 550 times larger than the commonly defined world wide web. Search engines get their listings in two ways: one is when someone submits a site to them, and the second, mentioned in the other article, is crawling. Crawling can retrieve too many results. The goal of the study was to
- quantify the size and importance of the deep web
- characterize the deep web's content, quality, and relevance to information seekers
- begin the process of educating the internet searching public about this heretofore hidden and valuable information storehouse
This study did not investigate non-web sources or private intranet information. They then came up with a common denominator for size comparisons. All of the retrieval, aggregation, and characterization in the study used BrightPlanet technology. Analyzing deep web sites involved a number of discrete tasks, such as qualification as a deep web site. The results show that the deep web sites that exist cover a wide range of topics. The article also mentions the differences between deep web sites. The deep web is about 500 times larger than the surface web. It is possible for deep web information to surface and for surface web information to remain hidden.
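A small illustrative sketch of why dynamically generated content stays hidden: a link-following crawler like the one in the Part 1 notes only reaches pages connected by static links, while a deep web page is generated only in response to a query submitted through a search form. The URLs and form parameter below are hypothetical.

# Illustrative sketch (hypothetical URLs and parameters): a deep web page exists
# only as the response to a query, so a crawler that never fills in the search
# form has no static URL through which to discover it.
from urllib.parse import urlencode
from urllib.request import urlopen

SURFACE_PAGE = "http://www.example.org/index.html"   # reachable through static links
QUERY_ENDPOINT = "http://www.example.org/search"     # backs a search form

def fetch_surface(url):
    """What a simple crawler does: follow a static link."""
    return urlopen(url, timeout=10).read()

def fetch_deep(query_terms):
    """What a person (or a directed query tool) does: submit the form.
    The resulting page is generated on the fly from a database."""
    url = QUERY_ENDPOINT + "?" + urlencode({"q": query_terms})
    return urlopen(url, timeout=10).read()

# A crawler seeded at SURFACE_PAGE indexes only what static links reach;
# the records behind QUERY_ENDPOINT remain in the deep web.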

Current Developments and Future Trends for the OAI Protocol for Metadata Harvesting
The OAI-PMH was initially developed as a means to federate access to diverse e-print archives through metadata harvesting and aggregation. The OAI-PMH has since demonstrated its potential usefulness to a broad range of communities, and there are over 300 active data providers using it. One notable use is the Open Language Archives Community, whose mission is to create a "worldwide virtual library of language resources." There are registries of OAI repositories, but they have two shortcomings: they maintain very sparse records about individual repositories, and they lack completeness. The UIUC research group built the Experimental OAI Registry to address these shortcomings. The registry is now fully operational, but there remain many improvements the group would like to make to increase its usefulness. There are still ongoing challenges for the OAI community today, such as metadata variations and metadata formats.
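A minimal sketch of what a metadata harvest over OAI-PMH looks like. The verb=ListRecords request and the oai_dc metadata format come from the OAI-PMH specification; the repository base URL is a hypothetical placeholder, and a real harvester would also handle resumptionTokens, error responses, and incremental (from/until) harvesting.

# Minimal OAI-PMH harvesting sketch (hypothetical base URL).
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

BASE_URL = "http://repository.example.edu/oai"   # hypothetical data provider
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"

def list_record_titles(base_url):
    # Ask the data provider for records in simple Dublin Core (oai_dc).
    params = urlencode({"verb": "ListRecords", "metadataPrefix": "oai_dc"})
    with urlopen(base_url + "?" + params, timeout=30) as response:
        tree = ET.parse(response)
    titles = []
    for record in tree.iter(OAI_NS + "record"):
        title = record.find(".//" + DC_NS + "title")
        if title is not None and title.text:
            titles.append(title.text)
    return titles

# Example: list_record_titles(BASE_URL) returns the Dublin Core titles
# from the first page of harvested records.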

muddiest point

My muddiest point has to do with writing XML. One slide of the PowerPoint mentioned that two pieces of XML (which did not display here) were the same. I do not understand why they are equivalent.