Project Detail

Simple data mining script/ Perl REGEX  

Simple data mining script/ Perl REGEX is project number 428688
posted at Freelancer.com.

 

 

Status:

Selected Providers: edatawiz

Budget: $30-250

Created: 05/03/2009 at 10:18 EDT

Bid Count: 10

Average Bid:
$ 38

05/10/2009 at 10:18 EDT

Project Creator: docdudetheman
Employer Rating: 10/10 (1 review)

 

Description

Dear all,

I need somebody to write a simple parser in Perl. The task is quite straightforward.

I have a large data set with the following structure:

id_1 | paragraph_1
id_2 | paragraph_2
id_3 | paragraph_3
etc.

where id_n is a running id and paragraph_n is a variable containing text data.

paragraph_n:
In FY 2005 que posuere aucibulum justo. Classa. Maecenas quam sociosque nunc ultrices. Nunc ipsum accumsan vive. Vitae mollis ut tor ristique mauristibus feugiat in 1991. As of 2008 Intesque augue nunc, rutrum, urnar. Donec vel, orna et mollicies wisis fermentum aliquam. Nulla fafring. In March 2006 Nullam leo for 2008 trisuscetuer condisse vitae mollis ut tor ristique mauristibus feugiat in nissim a. Donec vel, orna et mollicies wisis fermentum 8769. mollicies wisis fermentum aliquam in FY 2007. Nullam leo trisuscetuer condisse vitae mollis ut tor before 2006. During 2007 Nullam leo trisuscetuer condisse vitae mollis ut tor. From 2007 through 2010 condisse vitae mollis ut.

This paragraph contains 9 time references in total:

In FY 2005
in 1991
As of 2008
In March 2006
for 2008
in FY 2007
before 2006
During 2007
From 2007 through 2010


The script is required to perform two different tasks:

1.) The script should extract all time references. Time references are of two different kinds: some are located at the beginning of a sentence, for example "In FY 2005", whereas others are inside a sentence, e.g. "in 1991". For each paragraph the script should extract all time references and write them to a pipe-separated output file together with the corresponding id, that is:

id | time_ref_1 | time_ref_2 | time_ref_3 | time_ref_4 | ..... | time_ref_n

Please note that each paragraph is likely to have a different number of time references (maximum is around 20). I already have a list of regular expressions identifying time references which I will supply. However, the list is incomplete, and one task of the coder would be to find more regular expressions which identify time/years/months.
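The extraction step could look roughly like the sketch below. It is written in Python rather than Perl purely to illustrate the logic; the regular-expression syntax used here carries over to Perl largely unchanged. The patterns are assumptions derived from the nine examples above (the contents of regex.txt are not shown), and the names `TIME_REF`, `extract_time_refs` and `format_record` are hypothetical.

```python
import re

# Assumed patterns, derived only from the examples in the posting;
# the poster's own regex.txt list is not available here.
MONTH = (r"(?:January|February|March|April|May|June|July|August"
         r"|September|October|November|December)")
YEAR = r"(?:19|20)\d{2}"  # assumes four-digit years 1900-2099

TIME_REF = re.compile(
    rf"\b(?:From\s+{YEAR}\s+through\s+{YEAR}"
    rf"|(?:In|As of|During|For|Before|After|Since)\s+(?:FY\s+|{MONTH}\s+)?{YEAR})\b",
    re.IGNORECASE,  # matches both "In FY 2005" and "in 1991"
)


def extract_time_refs(paragraph):
    """Return all time references in the order they occur."""
    return [m.group(0) for m in TIME_REF.finditer(paragraph)]


def format_record(line):
    """Turn 'id | paragraph' into 'id | time_ref_1 | ... | time_ref_n'."""
    record_id, paragraph = (part.strip() for part in line.split("|", 1))
    return " | ".join([record_id] + extract_time_refs(paragraph))


if __name__ == "__main__":
    sample = ("7 | In FY 2005 sales rose. Costs fell in 1991. "
              "From 2007 through 2010 nothing changed.")
    print(format_record(sample))
```

Because the year pattern is restricted to 19xx and 20xx and must follow a preposition, a stray number such as "8769" in the sample paragraph is not picked up. Extending the list of prepositions or date formats only requires adding alternatives to the pattern.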

2.) The script should also split each paragraph into subparagraphs. Whenever a sentence in a paragraph starts with a capitalized time reference (examples: In FY, In March 2003, As of, etc.), this sentence and all following sentences up to the next sentence that starts with a capitalized time reference should be parsed into a new sub_par_n variable and written to a pipe-separated output file. For the example at hand:

id | sub_par_1 | 1
id | sub_par_2 | 2
id | sub_par_3 | 3
id | sub_par_4 | 4
id | sub_par_5 | 5

where

sub_par_1:
In FY 2005 que posuere aucibulum justo. Classa. Maecenas quam sociosque nunc ultrices. Nunc ipsum accumsan vive. Vitae mollis ut tor ristique mauristibus feugiat in 1991.

sub_par_2:
As of 2008 Intesque augue nunc, rutrum, urnar. Donec vel, orna et mollicies wisis fermentum aliquam. Nulla fafring.

sub_par_3:
In March 2006 Nullam leo for 2008 trisuscetuer condisse vitae mollis ut tor ristique mauristibus feugiat in nissim a. Donec vel, orna et mollicies wisis fermentum 8769. mollicies wisis fermentum aliquam in FY 2007. Nullam leo trisuscetuer condisse vitae mollis ut tor before 2006.

sub_par_4:
During 2007 Nullam leo trisuscetuer condisse vitae mollis ut tor.

sub_par_5:
From 2007 through 2010 condisse vitae mollis ut.

Whenever a paragraph yields only a single subparagraph, sub_par equals the whole paragraph.
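The splitting rule can be sketched as a split on sentence boundaries that are followed by a capitalized time reference. As before, this is a Python illustration of logic that would be ported to Perl, not a finished solution. The pattern is an assumption based on the five subparagraphs above (it requires a year after the leading preposition), and `LEADING_REF`, `split_subparagraphs` and `format_rows` are hypothetical names.

```python
import re

# Assumed pattern for a time reference that opens a sentence.
# Case-sensitive on purpose: "in 1991" inside a sentence must not split.
MONTH = (r"(?:January|February|March|April|May|June|July|August"
         r"|September|October|November|December)")
YEAR = r"(?:19|20)\d{2}"
LEADING_REF = (rf"(?:In|As of|During|From|For|Before|After|Since)"
               rf"\s+(?:FY\s+|{MONTH}\s+)?{YEAR}\b")

# Split on the whitespace between a sentence terminator and a leading reference.
BOUNDARY = re.compile(rf"(?<=[.!?])\s+(?={LEADING_REF})")


def split_subparagraphs(paragraph):
    """Split a paragraph where a sentence starts with a capitalized time reference."""
    return [part for part in BOUNDARY.split(paragraph.strip()) if part]


def format_rows(record_id, paragraph):
    """Return one 'id | sub_par_n | n' row per subparagraph."""
    return [f"{record_id} | {sub} | {n}"
            for n, sub in enumerate(split_subparagraphs(paragraph), start=1)]


if __name__ == "__main__":
    text = "In FY 2005 a. B in 1991. As of 2008 c. D. During 2007 e."
    for row in format_rows("1", text):
        print(row)
```

A paragraph with no sentence-initial time reference comes back as a single element, which matches the rule that sub_par then equals the paragraph.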

If you require further information, please feel free to send me a private message.

Best regards

Philipp


Additional files submitted:
regex.txt

Messages Posted: 1

 

30

1 day

05-04-2009 03:27 EDT

we are ready to start it.


50

1 day

05-03-2009 13:10 EDT

Hi - Please check your PM for details.


50

2 days

05-03-2009 10:52 EDT

Hi! Please see PM.


30

0 days

05-03-2009 11:56 EDT

Not a bid; please see PMB.


30

0 days

05-03-2009 23:41 EDT

I can do this quite easily. Give me a chance. Thanks.


40

1 day

05-03-2009 11:32 EDT

(No Feedback Yet)

Please send the complete details. thnx


30

1 day

05-03-2009 19:48 EDT

(No Feedback Yet)

I can do this. Please forward me further details.


30

2 days

05-03-2009 23:57 EDT

(No Feedback Yet)

hi, pls see pm. tks.


50

2 days

05-04-2009 01:12 EDT

(No Feedback Yet)

I'm already working with Perl programming; I'm expecting your response.


40

1 day

05-04-2009 02:46 EDT

(No Feedback Yet)

pls check PM
