Show Bid Request
Faassst File Parser
Bid Request Id: 33708
Description:
I require a very fast file parser (executable) that will do the following:
1. Go over each and every subdirectory (recursively) starting from a given ROOT drive/dir.
2. For every file (that is not excluded), strip all HTML tags and JavaScript code.
3. For every stripped (text-only) file, try to find keywords from a pre-defined list. Count the occurrences of the keywords in the document.
4. If the document contains at least N% of occurrences (out of the entire distinct word list), mark the file with the "ARCHIVE" attribute. If not, make it a zero-byte file (but keep the old attributes and filename).
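Steps 2 to 4 can be sketched roughly as follows. This is only an illustration in Python, not the requested C/Perl deliverable. It assumes that "N%" means the share of distinct keywords that appear at least once in the stripped text, and that keywords are single words; both are readings of the request, not something it states. The names TextExtractor, strip_markup, count_keywords and is_relevant are hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text outside of tags, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def strip_markup(html):
    """Step 2: return the text-only content of an HTML document."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def count_keywords(text, keywords):
    """Step 3: count case-insensitive occurrences of each single-word keyword."""
    wanted = {k.lower() for k in keywords}
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(w for w in words if w in wanted)


def is_relevant(html, keywords, threshold_percent):
    """Step 4 (assumed reading): True if at least threshold_percent of the
    distinct keywords occur at least once in the stripped text."""
    distinct = {k.lower() for k in keywords}
    if not distinct:
        return False
    counts = count_keywords(strip_markup(html), keywords)
    return 100.0 * len(counts) / len(distinct) >= threshold_percent
```

A file for which is_relevant returns True would get the "ARCHIVE" attribute; any other file would be truncated to zero bytes. Setting the attribute is platform-specific and is left out of this sketch.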
Notes:
1. It should work very efficiently on thousands or millions of documents.
2. I should be able to define (in an .ini file) the following: the ROOT directory to start parsing from, the list of KEYWORDS to find, the file EXTENSIONS to exclude (e.g. .jpg/.bmp/.pdf), and the STATISTICS figure (to determine whether a file is relevant or not).
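One possible shape for the .ini file and the recursive walk is sketched below, again in Python purely as an illustration. The section name [parser], the key names root, keywords, exclude and threshold, and the sample values are all assumptions; the request does not specify a layout. The helper names load_settings and candidate_files are hypothetical too.

```python
import configparser
import os

# Hypothetical .ini layout; the request does not define section or key names.
SAMPLE_INI = r"""
[parser]
root = C:\docs
keywords = perl, parser, archive
exclude = .jpg, .bmp, .pdf
threshold = 40
"""


def load_settings(ini_text):
    """Parse the settings into plain Python values."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    section = parser["parser"]
    return {
        "root": section["root"],
        "keywords": [k.strip() for k in section["keywords"].split(",") if k.strip()],
        # Extensions are lower-cased so that .JPG and .jpg are treated alike.
        "exclude": {e.strip().lower() for e in section["exclude"].split(",") if e.strip()},
        "threshold": section.getfloat("threshold"),
    }


def candidate_files(root, excluded_extensions):
    """Yield every file under root (recursively) whose extension is not excluded."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower()
            if ext not in excluded_extensions:
                yield os.path.join(dirpath, name)
```

Yielding paths one at a time, rather than building a full list first, keeps memory use flat even when the tree holds millions of files.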
This is a personal project; do not post bids higher than $30.
Michael.
Deliverables: 1) Complete and fully-functional working program(s) in executable form as well as complete source code of all work done.
2) Installation package that will install the software (in ready-to-run condition) on the platform(s) specified in this bid request.
3) Complete ownership and distribution copyrights to all work purchased.
Platform:
c/perl
Must be 100% finished and received by buyer on:
Nov 5, 2002 EDT
Deadline legal notes: All times are expressed in the time zone of the site EDT (UT - 5). If the buyer omitted a time, then the deadline is 11:59:59 PM EDT on the indicated date.
Special Conditions / Other:
asap
Remember that contacting the other party outside of the site (by email, phone, etc.) on all business projects < $500 (before the buyer's money is escrowed) is a violation of both the software buyer and seller agreements.
We monitor all site activity for such violations and can instantly expel transgressors, so we thank you in advance for your cooperation.
If you notice a violation please help out the site and report it. Thanks for your help.
Bidding/Comments:
All monetary amounts on the site are in United States dollars.
Rent a Coder is a closed auction, so coders can only see their own bids and comments. Buyers can view every posting made on their bid requests.
Name | Bid Amount | Date | Coder Rating
This bid was accepted by the buyer! | $20 (USD) | Nov 3, 2002 6:53:45 AM EDT | 9.74 (Excellent)
Hello,
I'm glad to see that you are willing to consider Perl for this task.
Perl's find function (from the File::Find module) easily handles processing files in a directory hierarchy, and Perl's text extraction facilities are unmatched. A (free) CPAN module parses HTML tags seamlessly.
Because the code is compiled before execution, Perl scripts are very efficient. And while I might brag about being an ace programmer, I know that I'm not about to parse text more efficiently than Perl's highly optimized library.
A IDLER
Chief Software Architect
Idleswell Software Creations