Multithreaded Java crawler is project number 357774
posted at Freelancer.com.
Status: Cancelled
Selected Providers: -
Budget: $250-750
Created: 12/15/2008 at 10:05 EST
Bid Count: 25
Average Bid: $403
12/19/2008 at 10:05 EST
Project Creator: profseo08
Employer Rating: (No Feedback Yet)
12/16/2008 at 2:43 EST:
All textual data should be decoded and 'clean' before being written to the output file in UTF-8. I think apache.org has some pretty good decoding packages that help determine the encoding of a page's source.
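As a rough illustration of the decode-then-write-UTF-8 step, here is a minimal sketch that uses only JDK classes. It assumes the charset name has already been determined elsewhere (for example from a Content-Type header or a detection library); the class name `TextDecoder`, the ISO-8859-1 fallback, and the meaning of 'clean' (control characters dropped, whitespace collapsed) are all assumptions, not requirements stated in the posting.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

// Hypothetical helper; names and the fallback choice are assumptions.
public class TextDecoder {
    // Assumed fallback when the declared charset is missing or unknown.
    private static final Charset FALLBACK = Charset.forName("ISO-8859-1");

    // Map a declared charset name to a Charset, falling back if it is unusable.
    public static Charset resolve(String declared) {
        if (declared == null) {
            return FALLBACK;
        }
        try {
            String name = declared.trim();
            return Charset.isSupported(name) ? Charset.forName(name) : FALLBACK;
        } catch (IllegalArgumentException e) {
            // Covers syntactically illegal charset names such as "".
            return FALLBACK;
        }
    }

    // Decode raw page bytes; malformed input becomes replacement characters.
    public static String decode(byte[] raw, String declared) {
        return resolve(declared).decode(ByteBuffer.wrap(raw)).toString();
    }

    // One possible reading of 'clean': ASCII control characters and
    // runs of whitespace each become a single space.
    public static String clean(String text) {
        return text.replaceAll("\\p{Cntrl}+", " ").replaceAll("\\s+", " ").trim();
    }

    // Encode the cleaned text as UTF-8 bytes for the output file.
    public static byte[] toUtf8(String text) {
        ByteBuffer buf = Charset.forName("UTF-8").encode(text);
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }
}
```

Detecting the encoding when no charset is declared is a separate problem that this sketch does not attempt.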
The program needs to deal only with HTTP, not HTTPS.
The program should run on JRE version 1.5.
It will most likely be tested on Mac OS X and deployed on a Linux box with 4 or 8 cores. I expect to process around 1000 to 1500 URLs every minute when deployed and tuned.
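One way the multithreaded part might be structured is a fixed-size worker pool with a bounded queue, so that URL submission slows down when the workers fall behind. The sketch below uses only `java.util.concurrent` classes available in Java 5; the class `CrawlPool`, the `Fetcher` interface, and the queue size of four tasks per thread are hypothetical choices made for illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a bounded worker pool for fetching URLs.
public class CrawlPool {
    /** Hypothetical hook: fetch one URL and report whether it succeeded. */
    public interface Fetcher {
        boolean fetch(String url);
    }

    // Runs the fetcher over all URLs with `threads` workers and
    // returns how many fetches reported success.
    public static int crawl(Iterable<String> urls, int threads, final Fetcher fetcher) {
        final AtomicInteger ok = new AtomicInteger();
        // Bounded queue plus CallerRunsPolicy: when the queue is full, the
        // submitting thread runs the task itself, which throttles submission.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<Runnable>(threads * 4),
                new ThreadPoolExecutor.CallerRunsPolicy());
        for (final String url : urls) {
            pool.execute(new Runnable() {
                public void run() {
                    try {
                        if (fetcher.fetch(url)) {
                            ok.incrementAndGet();
                        }
                    } catch (RuntimeException e) {
                        // A failed fetch is simply not counted.
                    }
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(3600, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return ok.get();
    }
}
```

Because fetching is mostly waiting on the network, the thread count would probably be tuned well above the core count. As a back-of-envelope figure, 1500 URLs per minute is 25 per second; if a fetch took 2 seconds on average (an assumed number), roughly 50 fetches would need to be in flight at once.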
Neither the input nor the output data will fit in memory. The program must read from the input file and write to the output file as it runs. The input file will contain from 10,000,000 to 20,000,000 URLs.
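To keep memory use flat regardless of input size, the input can be consumed one line at a time and each result written out as soon as it is ready. The sketch below assumes one URL per line (the posting does not specify the file format); `UrlStream` and its `Handler` interface are hypothetical names, and the handler stands in for the real fetch-and-decode step.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;

// Hypothetical sketch: stream URLs in and results out without
// holding the whole input or output in memory.
public class UrlStream {
    /** Hypothetical per-URL step; returns the text to write for that URL. */
    public interface Handler {
        String handle(String url);
    }

    // Reads one URL per line (assumed format), skips blank lines, and
    // writes one result line per URL. Returns the number of URLs handled.
    public static long process(Reader in, Writer out, Handler handler) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        BufferedWriter writer = new BufferedWriter(out);
        long count = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            String url = line.trim();
            if (url.length() == 0) {
                continue;
            }
            writer.write(handler.handle(url));
            writer.write('\n');
            count++;
        }
        writer.flush();
        return count;
    }
}
```

For real files, the `Reader` could be an `InputStreamReader` over a `FileInputStream`, and the `Writer` an `OutputStreamWriter` over a `FileOutputStream` created with the charset name "UTF-8". If several worker threads write results, access to the shared writer would also need to be synchronized.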
12/16/2008 at 8:55 EST:
This is a sample input file with 196 URLs