Randall's Site

Website Scraping for All
Apr 21, '08 4:40 PM
If you're interested in website scraping, here's an excerpt from a post I made this weekend:

Website Scraping Platforms

Web Harvest is an open source, Java-based platform geared towards website data extraction. As they put it, Web Harvest "offers a way to collect desired Web pages and extract useful data from them. In order to do that, it leverages well established techniques and technologies for text/xml manipulation such as XSLT, XQuery and Regular Expressions. Web Harvest mainly focuses on HTML/XML based web sites which still make vast majority of the Web content."

Web Harvest looks to be extremely powerful and flexible, and it's free, which is always nice. If you're able to write code in Java, you may want to look at it pretty closely.
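To give a feel for the kind of techniques that quote mentions, here's a rough sketch in plain Java. To be clear, this is not Web Harvest's own API or configuration format; it only uses the XPath and regular expression classes that ship with the JDK. The sample page, the class name `ScrapeSketch`, and the two helper methods are all made up for illustration, and the page is assumed to be well-formed XHTML (real-world HTML usually needs cleaning up first).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ScrapeSketch {
    // Hypothetical, well-formed sample page; a real scraper would download this.
    static final String PAGE =
        "<html><body>"
      + "<div class=\"item\"><a href=\"/a\">Widget</a> <span>Price: $19.99</span></div>"
      + "<div class=\"item\"><a href=\"/b\">Gadget</a> <span>Price: $5.00</span></div>"
      + "</body></html>";

    // XPath step: collect the href of every link inside an "item" div.
    static List<String> extractLinks(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xhtml)));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
            .evaluate("//div[@class='item']/a/@href", doc, XPathConstants.NODESET);
        List<String> links = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            links.add(nodes.item(i).getNodeValue());
        }
        return links;
    }

    // Regex step: pull out dollar amounts such as "$19.99" (digits only, no "$").
    static List<String> extractPrices(String text) {
        Matcher m = Pattern.compile("\\$(\\d+\\.\\d{2})").matcher(text);
        List<String> prices = new ArrayList<>();
        while (m.find()) {
            prices.add(m.group(1));
        }
        return prices;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extractLinks(PAGE));   // prints [/a, /b]
        System.out.println(extractPrices(PAGE));  // prints [19.99, 5.00]
    }
}
```

XPath is the better fit when the data follows the page's structure, and a regular expression is handy for picking a value out of free text. A platform like Web Harvest lets you chain steps like these together instead of wiring them up by hand.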

The Twit88 blog has two excellent tutorials on using Java and Web Harvest to extract data from websites: "Web Scraping using Web Harvest" and "Java - Writing a Web Page Scraper or Web Data Extraction Tool".

Thanks to MIT's SIMILE Project, you can use two of their programs, Piggy Bank and Solvent, to turn your copy of Mozilla Firefox into a data scraping platform. Both plugins are free under the BSD License and come with sample scrapers to help you get started.

Read the rest of the post at the BookMarkMoney.com Blog.
