Project Detail

linux craigslist crawler / scraper / harvester  

linux craigslist crawler / scraper / harvester is project number 520096
posted at Freelancer.com.

 

 

Status:

Selected Providers: zeke

Budget: N/A

Created: 10/02/2009 at 16:27 EDT

Bid Count: 13

Average Bid: $355

10/07/2009 at 16:27 EDT

Project Creator: puravida
Employer Rating: 10/10 (15 reviews)

 

Description

I want this script done in Linux, to be run at the command prompt. No GUI is needed; I won't be running it from a web browser, just through shell access. You can recommend whichever programming language you feel would be best.

I have a .csv file of Craigslist URLs that I need scraped and parsed. The script will parse the email address, city, subject line of the ad, and the date that the ad was posted. I need the ability to specify a date range for the script to scrape data from, as well as the option to scrape everything. If you go to any of the links in the text file, there is usually a link at the bottom that says "next 100 postings" (http://austin.craigslist.org/sss/ is an example - just scroll down to the bottom); when the script encounters this link, it will automatically follow it and continue on to the next page, until no more of these links are found. Following the links all the way to the end is only needed if I have selected to scrape everything. If I am only scraping a specific date range, the script will still have to use the 'next 100 postings' link at times, but only until the requested range has been covered.
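A minimal sketch of the "next 100 postings" behaviour described above, in Python. The names `NextLinkFinder`, `find_next_page`, and `crawl_listing` are hypothetical, and the page layout (a plain `<a>` tag whose text contains "next 100 postings") is an assumption about the listing pages, not something confirmed here. The `fetch` argument is whatever function downloads a page, passed in so the loop can be tried without network access.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Remembers the href of the first <a> whose text contains the marker."""

    def __init__(self, marker="next 100 postings"):
        super().__init__()
        self.marker = marker
        self.next_href = None
        self._current_href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._current_href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            link_text = "".join(self._text).lower()
            if self.next_href is None and self.marker in link_text:
                self.next_href = self._current_href
            self._current_href = None


def find_next_page(html, base_url):
    """Return the absolute URL of the next listing page, or None."""
    finder = NextLinkFinder()
    finder.feed(html)
    finder.close()
    if finder.next_href is None:
        return None
    return urljoin(base_url, finder.next_href)


def crawl_listing(start_url, fetch, max_pages=None):
    """Yield (url, html) per listing page until no next link is found."""
    url = start_url
    seen = set()  # guards against a page that links back to itself
    while url and url not in seen:
        if max_pages is not None and len(seen) >= max_pages:
            break
        seen.add(url)
        html = fetch(url)
        yield url, html
        url = find_next_page(html, url)
```

Because `crawl_listing` is a generator, a date-range run can simply stop iterating once a page falls outside the requested range, while a scrape-everything run consumes it to the end.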

The script must be multi-threaded (must be able to handle up to 500 simultaneous threads), and must support the usage of http/https/socks4/socks5 proxies. I will have a text file of proxies, and the script will randomly grab a proxy for each URL that it scrapes.
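One possible shape for the threading and proxy selection, sketched in Python with the standard library's `ThreadPoolExecutor`. The names `load_proxies` and `scrape_all` are hypothetical, and `fetch(url, proxy)` stands in for the real download function, which is not shown. As far as I know the standard library only handles http/https proxies out of the box, so SOCKS4/SOCKS5 support would probably need a third-party package; that part is an assumption and is left out of the sketch.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def load_proxies(lines):
    """Return non-empty, non-comment proxy entries from an iterable of lines."""
    return [ln.strip() for ln in lines if ln.strip() and not ln.startswith("#")]


def scrape_all(urls, proxies, fetch, max_workers=500):
    """Fetch every URL on a thread pool, picking a random proxy per URL.

    Returns a list of (url, proxy, result, error) tuples in input order.
    """
    def task(url):
        proxy = random.choice(proxies)  # a new random proxy for each URL
        try:
            return url, proxy, fetch(url, proxy), None
        except Exception as exc:  # keep going if one URL or proxy fails
            return url, proxy, None, exc

    urls = list(urls)
    workers = max(1, min(max_workers, len(urls)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(task, urls))
```

Returning the error alongside each URL, rather than raising, lets a run continue past dead proxies and makes it easy to retry the failed URLs with a different proxy.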

The .csv file will have 3 columns in it:
1. The URL to begin scraping
2. The Country that is being scraped
3. The City that is being scraped
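As a rough illustration of the three-column input, the file could be read in Python like this. The names `Target` and `read_targets` are hypothetical, and the sketch assumes the file has no header row.

```python
import csv
from typing import NamedTuple


class Target(NamedTuple):
    url: str      # column 1: the URL to begin scraping
    country: str  # column 2: the country being scraped
    city: str     # column 3: the city being scraped


def read_targets(fileobj):
    """Parse rows of url,country,city; skip blank or incomplete rows."""
    targets = []
    for row in csv.reader(fileobj):
        if len(row) < 3 or not row[0].strip():
            continue
        url, country, city = (field.strip() for field in row[:3])
        targets.append(Target(url, country, city))
    return targets
```

With a real file this would be called as `read_targets(open("urls.csv", newline=""))`, where `urls.csv` is a placeholder file name.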

The script will use the country value to place the data scraped from that country into its own folder, and it will use the city value in the .csv files that it outputs after it parses each page. As an example:

http://austin.en.craigslist.org/acc,USA,Austin
http://vancouver.en.craigslist.ca/acc,Canada,Vancouver
http://canberra.en.craigslist.com/acc.au,Australia,Canberra
http://cambridge.en.craigslist.co.uk/acc,UK,Cambridge

In this sample, the script will go to http://austin.en.craigslist.org/acc, and it will see numerous posts. If I have it set to only scrape a specific date range, it will only parse the URLs that are in that date range. If not, it will parse all of those URLs, as well as go to the 'next 100 postings' link and do the same, etc.
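The date-range rule could be sketched in Python as below. The names `in_range`, `select_posts`, and `needs_next_page` are hypothetical, and the sketch assumes each listing page shows its posts newest first, which is an assumption about the page order rather than something stated here.

```python
from datetime import date


def in_range(posted, start=None, end=None):
    """True if the posting date lies in [start, end]; None means open-ended."""
    if start is not None and posted < start:
        return False
    if end is not None and posted > end:
        return False
    return True


def select_posts(posts, start=None, end=None):
    """Keep the URLs of (posted_date, url) pairs that fall inside the range."""
    return [url for posted, url in posts if in_range(posted, start, end)]


def needs_next_page(page_dates, start=None):
    """Decide whether the 'next 100 postings' link is worth following.

    With no start date (scrape everything) the answer is always yes for a
    non-empty page. Otherwise stop once the oldest post on the page is
    already older than the start of the range.
    """
    if not page_dates:
        return False
    return start is None or min(page_dates) >= start
```

For example, `select_posts(posts, start=date(2009, 9, 20), end=date(2009, 9, 25))` would keep only the ads posted between those two days, and passing no dates keeps everything.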

As of the time I wrote this, the very first link to be parsed is the "Expanding Firm Hiring - Marketing & Management" link - http://austin.craigslist.org/acc/1388754458.html. The script will parse this link and save the data to a .csv file called Austin.csv, in a folder called USA. This is what the output of the Austin.csv file will look like, just from scraping that link:

email_address_here,Austin,Expanding Firm Hiring - Marketing & Management (AUSTIN),9/23/2009
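A small Python sketch of writing that row into a per-country folder and per-city file. The function name `append_result` and the `base_dir` parameter are hypothetical.

```python
import csv
from pathlib import Path


def append_result(base_dir, country, city, email, subject, posted):
    """Append one row to <base_dir>/<country>/<city>.csv and return the path."""
    folder = Path(base_dir) / country
    folder.mkdir(parents=True, exist_ok=True)  # e.g. creates the USA folder
    out_file = folder / f"{city}.csv"          # e.g. USA/Austin.csv
    with out_file.open("a", newline="", encoding="utf-8") as fh:
        csv.writer(fh).writerow([email, city, subject, posted])
    return out_file
```

If many threads write to the same city file at once, the calls would need to be serialized, for instance with a `threading.Lock` per output file; that is left out of the sketch.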

I know that the date is shown as 2009-09-23, but whatever format the date is in, I need it converted to the format in the example above (month/day/year).
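The date conversion might look like this in Python. The name `to_us_date` is hypothetical, and the list of accepted input formats is an assumption; only 2009-09-23 style dates are known from the example.

```python
from datetime import datetime

# Input formats to try, in order. Only the first is confirmed by the example.
KNOWN_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%m/%d/%Y")


def to_us_date(text, formats=KNOWN_FORMATS):
    """Convert a date string to month/day/year without leading zeros."""
    cleaned = text.strip()
    for fmt in formats:
        try:
            parsed = datetime.strptime(cleaned, fmt)
        except ValueError:
            continue
        return f"{parsed.month}/{parsed.day}/{parsed.year}"
    raise ValueError(f"unrecognised date: {text!r}")


print(to_us_date("2009-09-23"))  # 9/23/2009
```

The result is built from the parsed month, day, and year directly because a no-leading-zero `strftime` code is not portable across platforms.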

I also need the option to scrape either all countries or just certain countries. For instance, I might want to scrape just the USA, or the USA, Canada, and Australia, etc.
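The country option could be a simple filter over the rows of the input file, sketched here in Python. The name `filter_by_country` is hypothetical, and rows are assumed to be (url, country, city) tuples; matching is made case-insensitive as a convenience, which is an assumption.

```python
def filter_by_country(rows, countries=None):
    """Keep (url, country, city) rows whose country was selected.

    Passing None or an empty list means "scrape all countries".
    """
    if not countries:
        return list(rows)
    wanted = {name.strip().lower() for name in countries}
    return [row for row in rows if row[1].strip().lower() in wanted]
```

On the command line this might be driven by a comma-separated option such as `--countries USA,Canada,Australia`; that flag name is only an illustration.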

The script will do the exact same thing for the other 3 examples, in Canada, Australia, and the UK.

I will own the exclusive rights to this script; you will not be able to re-sell it, and I will obtain full rights to this script.

If you have any questions, please don't hesitate to ask.


 

400

3 days

10-02-2009 21:31 EDT

Hi, Please see the private message. Thank You


 

400

15 days

10-03-2009 17:57 EDT

please check pmb.


 

300

3 days

10-03-2009 07:35 EDT

Dear Customer! This is my favourite kind of project and I have a lot of experience writing crawlers/scrapers/web bots/etc. Please see PMB for examples of my previous work in this field. Ready to start right now and finish as soon as possible. My bid is for fast professional service exciting my customers. Please contact in PMB to discuss details. Best Regards, Zeke


 

250

5 days

10-03-2009 18:41 EDT

Hi, Steve. I'm very much interested in this project. I'll get this job done and meet all your requirements. If you want, I can make a demo. I prefer to use Java for this scraper.


 

220

3 days

10-03-2009 05:41 EDT

I can do this in bash using wget


 

350

4 days

10-03-2009 00:07 EDT

Please check PM..


 

250

4 days

10-03-2009 15:00 EDT

Please check PM. I already have something.


 

200

5 days

10-02-2009 16:42 EDT

Please check your PMB.


 

0

0 days

10-02-2009 16:38 EDT

(No Feedback Yet)

I can do it on Perl.


 

900

20 days

10-02-2009 19:32 EDT

(No Feedback Yet)

Hi, We have already done a similar crawler for cityserch's web site using Microsoft technologies. Have all data from the same. Please feel free to call me on 001 408 218 8015 or mail me your contact information to swagatkajale(at)gmail Regards Swagat Kajale Calshtra Technologies USA / India


 

600

14 days

10-02-2009 19:51 EDT

(No Feedback Yet)

Hi, I read your requirements carefully. I have such experience and can take this job. Thanks.


 

350

7 days

10-03-2009 14:03 EDT

(No Feedback Yet)

Hi, I have no reviews to show as I have registered recently, but I have about 4 years of rich scraping experience, in which I have scraped no fewer than 500 sites of all hue and design. Entrust me with this work and you will not have to rue your decision. Thanks, ppan279.


 

400

7 days

10-04-2009 11:24 EDT

(No Feedback Yet)

pls see PM


