Project Detail

Multithreaded Java crawler  

Multithreaded Java crawler is project number 357774, posted at Freelancer.com.

 

Status: Cancelled

Selected Providers: -

Budget: $250-750

Created: 12/15/2008 at 10:05 EST

Bid Count: 25

Average Bid: $403

12/19/2008 at 10:05 EST

Project Creator: profseo08
Employer Rating: (No Feedback Yet)

Description

I need a multithreaded Java program that is controlled from the command line. I must be able to set parameters such as the number of threads, timeouts, and the maximum number of simultaneous requests to the same IP. The program must be fast, well tested, and able to resume processing from the point it had reached before a crash. It may use at most 128 MB of memory and must extract information from a huge list (millions) of URLs. For every URL, the following must be written to the output file on a single line:

URL
IP address
HTTP response code
Resolving URL if response is 301 or 302
Size of document in bytes
Content of title-tag
Content of meta-description
Content of meta-keywords
Content of all heading tags
Number of words in the body section, stripped of HTML code
Number of links
(+ some other data)
Error message if any
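
As a rough illustration of the one-line-per-URL requirement, the sketch below joins the fields into a tab-separated line and flattens whitespace inside each field, so that a title or description containing a line break cannot split a record. The tab delimiter, the class name `RecordLine`, and its methods are assumptions made for illustration; the posting does not specify a delimiter.

```java
import java.util.List;

/** Hypothetical helper: joins the per-URL fields into one tab-separated line. */
final class RecordLine {
    private RecordLine() {}

    /** Flattens a field so it cannot break the one-line-per-URL format. */
    static String clean(String field) {
        if (field == null) {
            return ""; // a missing value becomes an empty column
        }
        // Collapse every run of whitespace (tabs, CR, LF, spaces) to one space.
        return field.replaceAll("\\s+", " ").trim();
    }

    /** Builds one output line; the caller supplies fields in the listed order. */
    static String format(List<String> fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                sb.append('\t'); // assumed delimiter
            }
            sb.append(clean(fields.get(i)));
        }
        return sb.toString();
    }
}
```

Because empty columns are kept, every line has the same number of fields, which should make the output easier to load into other tools.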


Additional information submitted:

12/16/2008 at 2:43 EST:
All textual data should be decoded and 'clean' before it is written to the output file in UTF-8. I think apache.org has some pretty good decoding packages that help determine the encoding of the source code.
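
A minimal sketch of the decoding step, using only JDK classes: it takes the charset from the HTTP `Content-Type` header when one is declared and otherwise falls back to ISO-8859-1. The fallback choice, the class name `BodyDecoder`, and its methods are assumptions for illustration. It does not inspect meta tags or guess from the bytes, which is where a dedicated detection package of the kind mentioned above would go further.

```java
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.util.Locale;

/** Hypothetical helper: decodes a response body and writes text as UTF-8. */
final class BodyDecoder {
    private BodyDecoder() {}

    /** Picks the charset named in a Content-Type header, else ISO-8859-1. */
    static Charset charsetFrom(String contentType) {
        Charset fallback = Charset.forName("ISO-8859-1"); // assumed default
        if (contentType == null) {
            return fallback;
        }
        String key = "charset=";
        int idx = contentType.toLowerCase(Locale.ENGLISH).indexOf(key);
        if (idx < 0) {
            return fallback;
        }
        String name = contentType.substring(idx + key.length());
        int end = name.indexOf(';');
        if (end >= 0) {
            name = name.substring(0, end);
        }
        name = name.trim().replace("\"", "");
        try {
            return Charset.forName(name);
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return fallback;
        }
    }

    /** Decodes raw body bytes to a String using the declared charset. */
    static String decode(byte[] body, String contentType) {
        return charsetFrom(contentType).decode(ByteBuffer.wrap(body)).toString();
    }

    /** Wraps an output stream so that all text is written as UTF-8. */
    static Writer utf8Writer(OutputStream out) {
        return new OutputStreamWriter(out, Charset.forName("UTF-8"));
    }
}
```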

The program only needs to deal with HTTP, not HTTPS.

The program should run on JRE version 1.5.

It will most likely be tested on Mac OS X and deployed on a Linux box with 4 or 8 cores. I expect to process around 1000 to 1500 URLs every minute when deployed and tuned.
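
Reaching that rate (1000 to 1500 URLs per minute is roughly 17 to 25 per second) means keeping many requests in flight at once, which is where the per-IP cap asked for in the description matters. The sketch below shows one possible way to enforce such a cap with one `Semaphore` per IP address. The class name `PerIpLimiter` and its methods are hypothetical. Note that the map keeps an entry for every IP it has seen, so a real run over millions of URLs within 128 MB would probably need to evict idle entries.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.Semaphore;

/** Hypothetical limiter: at most maxPerIp simultaneous requests per IP. */
final class PerIpLimiter {
    private final int maxPerIp;
    private final ConcurrentMap<String, Semaphore> permits =
            new ConcurrentHashMap<String, Semaphore>();

    PerIpLimiter(int maxPerIp) {
        if (maxPerIp < 1) {
            throw new IllegalArgumentException("maxPerIp must be >= 1");
        }
        this.maxPerIp = maxPerIp;
    }

    /** Returns the semaphore for an IP, creating it once in a thread-safe way. */
    private Semaphore semaphoreFor(String ip) {
        Semaphore existing = permits.get(ip);
        if (existing == null) {
            Semaphore fresh = new Semaphore(maxPerIp);
            existing = permits.putIfAbsent(ip, fresh);
            if (existing == null) {
                existing = fresh; // this thread won the race
            }
        }
        return existing;
    }

    /** Non-blocking: true if a request slot for this IP was taken. */
    boolean tryAcquire(String ip) {
        return semaphoreFor(ip).tryAcquire();
    }

    /** Blocking: waits until a request slot for this IP is free. */
    void acquire(String ip) throws InterruptedException {
        semaphoreFor(ip).acquire();
    }

    /** Gives the slot back; call from a finally block after the request. */
    void release(String ip) {
        semaphoreFor(ip).release();
    }
}
```

A worker thread would call `acquire(ip)` before opening the connection and `release(ip)` in a `finally` block afterwards, so that a timeout or error does not leak a slot.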

The amount of data going in and out of the program will not fit in memory. The program must read its input from a file and write to the output file as it runs. The input file will contain from 10,000,000 to 20,000,000 URLs.
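
One way to combine streaming I/O with the crash-resume requirement is to read the input line by line, append each result to the output file, and periodically record how many input lines are finished. The single-threaded sketch below illustrates only that idea. The names `ResumableRun` and `Processor`, and the checkpoint file holding a plain line count, are assumptions. Results written after the last checkpoint may appear twice after a restart, and a multithreaded version would also have to deal with URLs completing out of order.

```java
import java.io.*;

/** Hypothetical sketch: streams URLs through a processor with a line-count checkpoint. */
final class ResumableRun {
    private ResumableRun() {}

    /** Stand-in for the real fetch-and-extract step; returns one output line. */
    interface Processor {
        String process(String url);
    }

    /** Returns the number of input lines already handled, or 0 if none. */
    static long readCheckpoint(File checkpoint) throws IOException {
        if (!checkpoint.exists()) {
            return 0L;
        }
        BufferedReader r = new BufferedReader(new FileReader(checkpoint));
        try {
            String line = r.readLine();
            return line == null ? 0L : Long.parseLong(line.trim());
        } finally {
            r.close();
        }
    }

    /** Writes the count to a temp file first, then swaps it into place. */
    static void writeCheckpoint(File checkpoint, long done) throws IOException {
        File tmp = new File(checkpoint.getPath() + ".tmp");
        FileWriter w = new FileWriter(tmp);
        try {
            w.write(Long.toString(done));
        } finally {
            w.close();
        }
        if (!tmp.renameTo(checkpoint)) {
            // Some platforms refuse to rename over an existing file.
            checkpoint.delete();
            if (!tmp.renameTo(checkpoint)) {
                throw new IOException("cannot update " + checkpoint);
            }
        }
    }

    /** Processes the remaining input; returns the total number of lines handled. */
    static long run(File input, File output, File checkpoint,
                    Processor processor, int checkpointEvery) throws IOException {
        if (checkpointEvery < 1) {
            throw new IllegalArgumentException("checkpointEvery must be >= 1");
        }
        long done = readCheckpoint(checkpoint);
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(input), "UTF-8"));
        // Open the output in append mode so earlier results are kept.
        Writer out = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(output, true), "UTF-8"));
        try {
            // Skip the lines that an earlier run already finished.
            for (long i = 0; i < done; i++) {
                if (in.readLine() == null) {
                    break;
                }
            }
            String url;
            while ((url = in.readLine()) != null) {
                out.write(processor.process(url));
                out.write('\n');
                done++;
                if (done % checkpointEvery == 0) {
                    out.flush(); // flush results before recording progress
                    writeCheckpoint(checkpoint, done);
                }
            }
            out.flush();
            writeCheckpoint(checkpoint, done);
            return done;
        } finally {
            in.close();
            out.close();
        }
    }
}
```

Only one input line and one small buffer are held in memory at a time, so memory use does not grow with the size of the input file.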

12/16/2008 at 8:55 EST:
This is a sample input file with 196 URLs


Additional files submitted:
Sample-urls.txt

Messages Posted: 4

750

2 days

12-15-2008 11:39 EST

Please refer to your PMB.

750

5 days

12-15-2008 10:30 EST

Hi, please check PMB.

450

12 days

12-15-2008 12:24 EST

Please check PMB. Thanks.

250

5 days

12-16-2008 17:30 EST

Please see more details in PM.

400

7 days

12-16-2008 04:12 EST

Please check the PMB for more information.

400

10 days

12-15-2008 18:43 EST

Hello, please check PMB.

ssw

 

250

7 days

12-15-2008 12:51 EST

Please check PMB. Thanks

250

5 days

12-16-2008 03:31 EST

Can be done, let's talk more details.

750

7 days

12-15-2008 17:04 EST

Please check PMB. Thanks.

300

10 days

12-17-2008 08:30 EST

Hello! I have experience with multithreaded Java programming and with fetching and parsing HTML code. This project will be done on time and with high quality. Regards, Serg

kci

 

250

4 days

12-18-2008 12:32 EST

Hello I would like to have the opportunity to work for you. Your complete satisfaction is guaranteed!

400

12 days

12-17-2008 19:26 EST

I can provide a spider.

350

15 days

12-18-2008 07:02 EST

Quality work. Proven experience in developing Java crawlers.

400

7 days

12-15-2008 12:25 EST

I have 4 years of experience in multithreaded and web programming. Please see PM.

600

6 days

12-15-2008 11:51 EST

(No Feedback Yet)

Hi, I am interested in working on this project. I have been working on multithreaded Java programming for the last 5 years in telecom networking for EMS. I hope to provide you with a unique solution for this. Regards, Upananda

500

5 days

12-15-2008 14:10 EST

(No Feedback Yet)

I have experience in that area: I wrote a user info grabber for YouTube.

400

14 days

12-15-2008 15:38 EST

(No Feedback Yet)

Hello. 14 days for a reliable, well-tested program with 100% control from the command line, to be run on Linux. I am new on this site but I have 2.5 years of experience in the field of IT. This bid is a realistic one. The job will be performed on a part-time schedule.

400

7 days

12-15-2008 15:43 EST

(No Feedback Yet)

Hey, I can do this for you in a week.

250

7 days

12-16-2008 07:58 EST

(No Feedback Yet)

I can do this for you in 7 days. I have been working for 3 years on a multithreaded Java web crawler similar to this one. Thanks

250

15 days

12-16-2008 02:16 EST

(No Feedback Yet)

I will deliver this within the given timeline, with the best quality and with reusable, well-tested code covering all the test cases.

700

5 days

12-16-2008 06:47 EST

(No Feedback Yet)

I have written many site scrapers for the travel agency I work with, to find the best available deals and offer better packages, and I have sound Java and J2EE experience. We are also a group of 4 people with Java and J2EE experience, so we can work very fast.

275

5 days

12-16-2008 10:01 EST

(No Feedback Yet)

I have good experience developing crawlers in Java. I developed a Japanese patent crawler project.

250

10 days

12-16-2008 13:57 EST

(No Feedback Yet)

Please check PMB

250

5 days

12-17-2008 04:18 EST

(No Feedback Yet)

I can make this application fast, stable, and easily extensible.

250

14 days

12-18-2008 13:30 EST

(No Feedback Yet)

Hi, I will be able to deliver a complete solution, tested, benchmarked, and ready for production use, in a very short time!
