
Home

Parsley is a simple language for extracting structured data from web pages. Parsley consists of a powerful Selector Language wrapped with a JSON Structure that can represent page-wide formatting.

Check out A Simple Tutorial to extract a CSV file of beers by brewery and rating in only a few lines of code.

Parsley has a Command-line Interface, Ruby Bindings, Python Bindings, and a C Interface, and can output to JSON, CSV, and XML.

The following parselet parses a Yelp business listing (no endorsement implied).

{
  "name": "h1",
  "phone": "#bizPhone",
  "address": "address",
  "reviews(.nonfavoriteReview)": [
    {
      "date": ".ieSucks .smaller",
      "user_name": ".reviewer_info a",
      "comment": "with-newlines(.review_comment)"
    }
  ]
}
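
To make the shape of that parselet easier to see, here is a minimal Python sketch that only reads the JSON and describes it. It does not use Parsley or run any selectors. It assumes that a key written as name(selector) means "output field name, scoped to selector", and that a list value means "repeat for every match"; the helper names split_key and describe are made up for this illustration.

```python
import json
import re

# The parselet from the text above, stored as a JSON string.
PARSELET = """
{
  "name": "h1",
  "phone": "#bizPhone",
  "address": "address",
  "reviews(.nonfavoriteReview)": [
    {
      "date": ".ieSucks .smaller",
      "user_name": ".reviewer_info a",
      "comment": "with-newlines(.review_comment)"
    }
  ]
}
"""

# Assumed key shape: an output name, optionally followed by "(scope selector)".
KEY_PATTERN = re.compile(r"^(?P<name>[^()]+?)(?:\((?P<scope>.*)\))?$")


def split_key(key):
    """Split a parselet key into (output_name, scope_selector_or_None)."""
    match = KEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"unrecognised parselet key: {key!r}")
    return match.group("name"), match.group("scope")


def describe(node, depth=0):
    """Return indented lines describing the structure of a parselet object."""
    lines = []
    pad = "  " * depth
    for key, value in node.items():
        name, scope = split_key(key)
        suffix = f" (scoped to {scope})" if scope else ""
        if isinstance(value, list):
            lines.append(f"{pad}{name}: list{suffix}")
            # Only the first element is inspected; it acts as the item template.
            if value and isinstance(value[0], dict):
                lines.extend(describe(value[0], depth + 1))
        elif isinstance(value, dict):
            lines.append(f"{pad}{name}: object{suffix}")
            lines.extend(describe(value, depth + 1))
        else:
            # A plain string is the selector expression for this field.
            lines.append(f"{pad}{name}: {value}{suffix}")
    return lines


if __name__ == "__main__":
    print("\n".join(describe(json.loads(PARSELET))))
```

Reading the result top to bottom gives three page-level fields plus a reviews list whose three fields are looked up inside each .nonfavoriteReview match.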

You can get JSON out by typing:

sh$: parsley businesses.let http://www.yelp.com/biz/amnesia-san-francisco

For a site-wide crawl that writes a businesses.csv and a reviews.csv (the latter with a foreign key to businesses), run:

sh$: csvget --parselet=businesses.let http://www.yelp.com/biz/amnesia-san-francisco
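
The two-file layout can be pictured with a short Python sketch that splits nested records into a parent table and a child table linked by a foreign key. This is not how csvget is implemented; it only illustrates the idea. The column names id and business_id, the helper names flatten and to_csv, and the sample records are all assumptions made up for the example.

```python
import csv
import io


def flatten(businesses):
    """Split nested business records into business rows and review rows."""
    business_rows, review_rows = [], []
    for business_id, business in enumerate(businesses, start=1):
        # Everything except the nested list goes into the parent table.
        fields = {k: v for k, v in business.items() if k != "reviews"}
        business_rows.append({"id": business_id, **fields})
        # Each review carries the id of the business it belongs to.
        for review in business.get("reviews", []):
            review_rows.append({"business_id": business_id, **review})
    return business_rows, review_rows


def to_csv(rows):
    """Serialise a list of uniform dicts to CSV text."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical records shaped like the output of the parselet above.
SAMPLE = [
    {
        "name": "Example Bar",
        "phone": "555-0100",
        "address": "1 Example St",
        "reviews": [
            {"date": "1/1/2009", "user_name": "alice", "comment": "Nice."},
            {"date": "2/1/2009", "user_name": "bob", "comment": "Loud."},
        ],
    }
]

if __name__ == "__main__":
    business_rows, review_rows = flatten(SAMPLE)
    print(to_csv(business_rows))  # would be saved as businesses.csv
    print(to_csv(review_rows))    # would be saved as reviews.csv
```

Keeping reviews in their own file avoids repeating the business columns on every review row, and the business_id column is enough to join the two files back together.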

It’s that easy. Get started with the Installation Instructions.

Sites are for example purposes. Please obey robots.txt.

Last edited by fizx, Sun Aug 30 12:11:51 -0700 2009