GetAFreelancer.com
 
 

Multi-Threaded PHP Search Engine Scraper


Multi-Threaded PHP Search Engine Scraper is project number 279277 posted at GetAFreelancer.com.

Status: Closed (project expired)
Budget: $750-1500
Created: 06/25/2008 at 21:56 EDT
Bidding Ends: 07/02/2008 at 21:56 EDT
Project Creator: puravida
Buyer Rating: 10.00/10
(9 reviews)
Description:

Task:
Create a multi-threaded PHP script that will scrape data from search engines and record the scraped data into a SQL database. Also create a proxy scraper / tester.

Conditions:
Forum / Board url scraper

1. The script will run on a dedicated Linux server.
2. The script will include a web-browser-based GUI for total user control.
3. The script will display updated stats as the scraper progresses.
4. The script will accept an uploaded compressed file of search term files, decompress it, and loop through the search term files. Each search term file will contain one search term per line.
5. The script will be able to search Google, Google Blogsearch, Yahoo, MSN & Boardreader, with the ability to add engines in the future.
6. The script will be able to use proxies (see PROXIES below).
7. The script will include a GUI that lets the user choose proxy settings for each search engine. The proxy settings are the proxy timeout in seconds and whether or not to use proxies.
8. The script needs to allow the user to choose the IPs to broadcast to the search engine in the event no proxies are used.
9. The script needs to allow the user to set the thread timeout for each search to avoid proxy banning by the search engines.
10. The script needs to include error handling for captchas. If a captcha image is encountered, the script needs to choose a new proxy and try again with the same keyword.
11. Running stats for total urls found
12. Running stats for keyword list
Example: keyword 238 of 2,000
13. Running stats for keyword file
Example: file 57 of 400
14. Running stats of running time and time to completion
15. Pause, Resume & Stop functions
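
To illustrate condition 10, here is a rough sketch of the retry logic. It is written in Python purely for illustration (the deliverable itself is to be in PHP), and every name in it (`CaptchaEncountered`, `fetch_with_proxy_rotation`, the `fetch` callable, `max_attempts`) is hypothetical rather than part of any existing code:

```python
import random


class CaptchaEncountered(Exception):
    """Raised by a fetch function when the engine answers with a captcha page."""


def fetch_with_proxy_rotation(keyword, proxies, fetch, max_attempts=5):
    """Retry the same keyword with a different proxy whenever a captcha appears.

    `fetch(keyword, proxy)` is an assumed callable that returns the page text
    or raises CaptchaEncountered. Returns (page_or_None, burned_proxies).
    """
    available = list(proxies)
    burned = []  # proxies that hit a captcha during this call
    for _ in range(max_attempts):
        if not available:
            break
        # Pick a proxy at random and remove it so it is never reused here.
        proxy = available.pop(random.randrange(len(available)))
        try:
            return fetch(keyword, proxy), burned
        except CaptchaEncountered:
            burned.append(proxy)
    return None, burned
```

The list of burned proxies is returned so the caller can record them in the per-engine bad proxy file described under Proxy Tester.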

Proxy Scraper

1. The proxy scraper will scrape urls from Google's cache based on urls input via a text file.
2. Visit the Google cache links and retrieve both the Google "text only" cache for these links (obtained by adding "&strip=1" to the url) and the actual link to the site with the proxies.
3. Provide the choice of downloading the Google cache proxy links, the regular proxy links, or both.
4. Provide the option to use proxies when downloading the Google cache proxy links, and use proper proxy error handling as outlined above.

Proxy Tester

1. Query an environment script to check for anonymity
2. Options to set connect timeout when testing proxies.
3. Option to set time to retest proxies.
4. Create a separate local file for each search engine that encounters a captcha (see above).
Example:
If proxy 231.32.135.34:8080 encounters a captcha while attempting to scrape from Google, that proxy should then be placed in Google's bad proxy text file so that it is not used again with Google. There would be a separate file for each engine.
5. The ability to set a time to reset the files in step 4.
6. Reset the files in step 4 whenever the program is closed (not paused).
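
Steps 4 to 6 could look roughly like the following. This is a Python sketch for illustration only (the project calls for PHP); the file naming scheme `<engine>_bad_proxies.txt` and all function names are assumptions:

```python
import os


def bad_proxy_path(directory, engine):
    """Assumed naming scheme: one text file per engine, one proxy per line."""
    return os.path.join(directory, f"{engine}_bad_proxies.txt")


def mark_bad(directory, engine, proxy):
    """Append a proxy that hit a captcha to that engine's bad proxy file."""
    with open(bad_proxy_path(directory, engine), "a", encoding="utf-8") as fh:
        fh.write(proxy + "\n")


def load_bad(directory, engine):
    """Return the set of proxies currently banned for this engine."""
    path = bad_proxy_path(directory, engine)
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as fh:
        return {line.strip() for line in fh if line.strip()}


def usable_proxies(directory, engine, proxies):
    """Filter out proxies listed in the engine's bad proxy file."""
    bad = load_bad(directory, engine)
    return [p for p in proxies if p not in bad]


def reset_bad_files(directory, engines):
    """Delete the bad proxy files (on a timer, or when the program is closed)."""
    for engine in engines:
        path = bad_proxy_path(directory, engine)
        if os.path.exists(path):
            os.remove(path)
```

Keeping the files per engine means a proxy banned by Google remains available for Yahoo or MSN.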

STANDARDS:

To be coded entirely in PHP
C/C++ should be used where PHP is not possible/plausible

The Anatomy of a search:

1. Queries engine 1 for dog.
2. Collect applicable url on landing page.
3. Press the next button (if available) and collect all urls.
4. Repeat step 3 until not available.
5. Remove duplicates from applicable urls
a) To do this it would query the good url db and the timeout url db
6. Connect to all urls collected (without the use of proxies) after removing duplicates, and again scrape all applicable urls.
7. Place timeout urls in the timeout db and good urls in the good url db. (The timeout for connecting to sites is set by the user in the settings.)
8. Have the ability to retest timeout urls as a separate process.
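
Steps 1 to 5 can be sketched as follows. This is an illustrative Python outline, not the PHP implementation; `fetch_page` is an assumed callable standing in for the real engine request, and `good_db` / `timeout_db` are plain collections standing in for the two url databases:

```python
def collect_result_urls(query, fetch_page, max_pages=100):
    """Follow the "next" link until it disappears, collecting urls in order.

    `fetch_page(query, page_number)` is a hypothetical callable returning
    (list_of_urls, has_next). `max_pages` is a safety cap, also an assumption.
    """
    seen = set()
    ordered = []
    page = 1
    while page <= max_pages:
        urls, has_next = fetch_page(query, page)
        for url in urls:
            if url not in seen:  # drop repeats within this search
                seen.add(url)
                ordered.append(url)
        if not has_next:
            break
        page += 1
    return ordered


def split_new_urls(urls, good_db, timeout_db):
    """Step 5a: keep only urls found in neither the good nor the timeout db."""
    known = set(good_db) | set(timeout_db)
    return [u for u in urls if u not in known]
```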

In addition it should collect the following information and place into the DB.

1. Google PR of the actual full URL (not just the domain name).
2. Age of the domain name.
3. Number of back links that URL has going to it
4. Meta-Tag keyword
a) Create an entry in the db for every meta tag keyword associated with the url.
Example: Keyword = dog, Meta-Tags = "dogs, dog toys, dog food, dog treats". The scraper would make 4 'entries' into the sql database and record that URL 4 times, so that when I query the database for 'dog food' or for 'dog treats', that same URL would be returned, along with any other URL that also had dog food or dog treats in its meta tag keywords. If a URL has no meta tag keywords, then there are no keyword entries to record, but I still want all other available data for that URL saved into the database.
5. Date URL was scraped

All 5 variables must be searchable in the browser.
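
The one-entry-per-meta-keyword idea can be illustrated like this. The sketch uses Python with an in-memory SQLite database only so that it is self-contained (the project specifies PHP and MySQL); the table name `url_keywords`, its columns, and the function names are assumptions:

```python
import sqlite3


def keyword_rows(url, meta_keywords):
    """Split a meta keywords string into one (url, keyword) row per term.

    A url without meta keywords still yields one row with a NULL keyword,
    so its other data is not lost.
    """
    terms = [t.strip().lower() for t in meta_keywords.split(",") if t.strip()]
    return [(url, t) for t in terms] or [(url, None)]


def store(conn, url, meta_keywords):
    """Insert every row produced for this url."""
    conn.executemany(
        "INSERT INTO url_keywords (url, meta_keyword) VALUES (?, ?)",
        keyword_rows(url, meta_keywords),
    )


def urls_for(conn, keyword):
    """Return every url that lists the given meta keyword."""
    cur = conn.execute(
        "SELECT DISTINCT url FROM url_keywords WHERE meta_keyword = ? ORDER BY url",
        (keyword.lower(),),
    )
    return [row[0] for row in cur]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE url_keywords (url TEXT NOT NULL, meta_keyword TEXT)")
store(conn, "http://example.com/dogs", "dogs, dog toys, dog food, dog treats")
store(conn, "http://example.com/no-meta", "")
```

Here `urls_for(conn, "dog food")` and `urls_for(conn, "dog treats")` both return the first url, which is the behaviour asked for in the example.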

EXAMPLE:

Search for URLS with a Google PR of 3 or higher, with domains that are 1 year old or older, with 1,000 back links or more, with keyword 'dog toys' in meta tag keywords, scraped within the past 90 days.
The interfaces should have the ability to choose any or none of these terms and the ability to export the URLs returned. It should also have the ability to get urls from the database based on a count, all or by keyword.
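
A search form where every term is optional maps naturally onto a query that only includes the filters the user filled in. The following Python sketch shows the idea; the table `scraped_urls`, its column names, and the `?` placeholder style are assumptions, and a PHP/MySQL version would use its own driver's placeholders:

```python
from datetime import date, timedelta


def build_search(min_pr=None, min_domain_age_days=None, min_backlinks=None,
                 keyword=None, scraped_within_days=None, limit=None, today=None):
    """Build a parameterised query from whichever filters were supplied."""
    clauses, params = [], []
    if min_pr is not None:
        clauses.append("pagerank >= ?")
        params.append(min_pr)
    if min_domain_age_days is not None:
        clauses.append("domain_age_days >= ?")
        params.append(min_domain_age_days)
    if min_backlinks is not None:
        clauses.append("backlinks >= ?")
        params.append(min_backlinks)
    if keyword is not None:
        clauses.append("meta_keyword = ?")
        params.append(keyword)
    if scraped_within_days is not None:
        # Compare against a cutoff date computed here, not in SQL.
        cutoff = (today or date.today()) - timedelta(days=scraped_within_days)
        clauses.append("scraped_at >= ?")
        params.append(cutoff.isoformat())
    sql = "SELECT DISTINCT url FROM scraped_urls"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    if limit is not None:  # "get urls based on a count"
        sql += " LIMIT ?"
        params.append(int(limit))
    return sql, params
```

The example search above would be `build_search(min_pr=3, min_domain_age_days=365, min_backlinks=1000, keyword="dog toys", scraped_within_days=90)`, while calling it with no arguments selects all urls.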

Duplicates DB

The duplicates db should contain all applicable urls scraped, including the timeout urls. While the scraped url db itself will keep the entire url, the duplicates db should only keep the information between http:// and the first occurrence of a forward slash "/".

Example:
Scraped URL= http://redcross.com.cn/cng/forums/p/35/41.aspx
Saved to duplicates db = redcross.com.cn
Saved to scraped url db = http://redcross.com.cn/cng/forums/p/35/41.aspx
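
Extracting the part between http:// and the first slash is a one-liner in most languages. As an illustration in Python (the PHP version would differ; `duplicate_key` and `is_duplicate` are hypothetical names, and lower-casing the host is an added assumption):

```python
from urllib.parse import urlsplit


def duplicate_key(url):
    """Return the part of the url between the scheme and the first "/"."""
    return urlsplit(url).netloc.lower()


def is_duplicate(url, duplicates_db):
    """Check a url against a collection of keys standing in for the duplicates db."""
    return duplicate_key(url) in duplicates_db
```

For the url in the example above, `duplicate_key` yields `redcross.com.cn`, while the full url would go to the scraped url db unchanged.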

CONCLUSION:

If C/C++ is also used, I will receive the source code for it as well. You will not be able to sell this script in any format, nor parts of it, to any other party. I am not sure what this will cost, so I am keeping the price open, but I do have a budget. I want this scraper to be started on as soon as possible, and completed in a timely manner. If you have any questions, comments, or suggestions, please feel free to PM me. Thank you.
Job Type: PHP
Database: MySQL
Operating system: Linux
Bid count: 19
Average bid: $ 1082


 

Messages Posted: 1


Bids (service provider, bid, delivery within, time of bid, provider rating):
cssaglobal1
Bid: $1400, delivery within 45 days, bid placed 06-26-2008 02:36, provider rating 9.52/10
(54 reviews)
Hello, Please see the PMB. Regards, Bhupendra
adnansandhila
Bid: $899, delivery within 20 days, bid placed 06-27-2008 05:27, provider rating 10.00/10
(25 reviews)
ready to start. thanx
Shattenjagger
Bid: $1100, delivery within 5 days, bid placed 06-26-2008 00:21, provider rating 10.00/10
(22 reviews)
Let's do it!
pgcoding
Bid: $1200, delivery within 30 days, bid placed 06-26-2008 01:15, provider rating 9.50/10
(18 reviews)
pl check pmb.
rolanro
Bid: $1000, delivery within 10 days, bid placed 06-26-2008 17:01, provider rating 9.86/10
(14 reviews)
Have a nice day. Your project requires a dedicated person who has worked with this kind of project before. I developed a blogger generator and I know how to work with elite proxies. Regards.
zeke
Bid: $1000, delivery within 10 days, bid placed 06-26-2008 06:36, provider rating 9.71/10
(7 reviews)
Dear Sir! I am professional developer, have a lot of experience with UNIX/Linux Network Programming. This script could be written entirely in C programming language, you do not have php installed on your linux server at all. Of Course, you receive all source code. I am very interested in your project (have done similar web robots before), so please contact in PMB to discuss details. My bid is for fast professional job, ready to start right now and finish within 10 days. Best Regards, Zeke
ridhisolutions
Bid: $1100, delivery within 35 days, bid placed 06-26-2008 02:05, provider rating 10.00/10
(1 review)
Hi, we have gone through your existing site. I can do the work according to your requirements. I will show a professional design with functionality. We provide services for Web Templates, Logo & Banner Designs, and Web Development in ASP, PHP, OS commerce, .NET, JavaScript, Ajax, Joomla, and Wordpress. I assure you that you will get 100% satisfaction. Hope to hear from you soon. Thanks, Sahil
moneypro200x
Bid: $1000, delivery within 30 days, bid placed 06-30-2008 09:50, provider rating 10.00/10
(1 review)
Please check your pm
semosoftware
Bid: $1400, delivery within 45 days, bid placed 06-26-2008 07:29, provider rating 9.33/10
(3 reviews)
We have gone through the information provided by you and we understand your requirement very well. We have excellent expertise in PHP5, MYSQL, JOOMLA, HTML, CSS, AJAX, ASP.NET, VB.NET. We have worked on a variety of web applications, such as Content Management Systems, Facebook Applications, Shopping carts, RSS Feeds, PDF generators, Banner Management Modules, Social Networking portals, Video portals, IMPORT/EXPORT to excel, and ERPs. We have a very good, hard-working team in place who will be working on your project. Also, we have good bench strength. We are looking for long-term relationships and are open on price and payment terms. Please contact us right away to know more about our past work and team. We would be glad to work with you.
sinfotechindia
Bid: $1305, delivery within 40 days, bid placed 06-26-2008 02:22, provider rating 8.50/10
(2 reviews)
Sir Look at your PMB
CDNProjects
Bid: $1200, delivery within 15 days, bid placed 06-26-2008 02:51 (No Feedback Yet)
Please check PM.
jagtapinfotech
Bid: $1300, delivery within 15 days, bid placed 06-26-2008 00:48 (No Feedback Yet)
Dear Sir/Madam, We are a leading India-based ISO 9001:2000 Certified Web Development Company, located at Baroda, Gujarat. Our development teams are able to provide skills and expertise in almost every area of web design and development, like E-Commerce, OS Commerce, Graphic design, Animation, Multimedia, Flash, Photoshop, Dreamweaver, Illustrator, ASP, .NET, PHP, CSS, Mysql, HTML, Server management, Action script, JavaScript, SEO, Win 2003 server, etc. We are working with many companies in USA, UK, New Zealand, Australia, and Dubai, and we have completed over 1500 designs. Regards, Dinesh
raidarmax
Bid: $1000, delivery within 10 days, bid placed 06-26-2008 01:59 (No Feedback Yet)
Hi, We are an outsourcing firm with a panel of talented designers and coders. We have specialized in web development and internet marketing since 2003 and we have proven work delivery policies. Please see PMB.
amoltaskar
Bid: $750, delivery within 15 days, bid placed 06-26-2008 09:07 (No Feedback Yet)
Best web designing, hosting & software development available here
namtv42
Bid: $1100, delivery within 20 days, bid placed 06-26-2008 22:32 (No Feedback Yet)
I can handle this Project. Contact me now.
Samarnarendra
Bid: $1000, delivery within 21 days, bid placed 06-27-2008 06:58 (No Feedback Yet)
I am a software engineer at a reputed US MNC, and I am able to deliver you a quality product on time. So you should award this bid to me because I fulfill all your requirements... Thanks
solutionwizards
Bid: $900, delivery within 14 days, bid placed 06-29-2008 08:24 (No Feedback Yet)
Dear sir, we are very experienced in that type of project. Each of our team members has a minimum of three years of experience in their related field. Our vision is to create a well-rounded team to deliver quality products and services to clients across the globe. Thank you
raawmedia
Bid: $1100, delivery within 10 days, bid placed 06-30-2008 14:43 (No Feedback Yet)
Hello. We look forward to working with you. Thank you.
john2496
Bid: $800, delivery within 5 days, bid placed 07-01-2008 01:46 (No Feedback Yet)
This project sounds like a lot of fun! I've developed many apps similar to this in the past and already have most of the code developed! I have the highest coding & design standards and over 8 years working with php! This project idea is fun and I'd love to get started on it! :S

 
