Show Bid Request
Parse Large Text File (XML)
Bid Request Id: 45966
Posted by: unclejohn (4 ratings; software buyer rating 10)
Non-action Ratio: Very Good - 0.00%
Buyer Security Verifications: Good
Approved on: Jan 29, 2003 11:06:20 PM EDT
Bidding Closes: Feb 12, 2003 10:31:12 PM EDT
Viewed (by coders): 96 times
Deadline: 2/20/2003 (TIME EXPIRED)
Description:
I need a rather large XML (text) file parsed. I believe the file is about 55 MB uncompressed. I think you could use one of several techniques to parse it:
1) You could use XSLT to transform the file into its components.
2) You could use Perl or some other scripting language to rewrite the file.
There is one serious problem with using XSLT to transform the file: I think the XML file is NOT well-formed.
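Whether the file really is not well-formed can be checked before choosing a technique. The following is a minimal sketch in Python (not one of the platforms named in this request, used here only for illustration) that feeds a file to the standard library's Expat parser and reports the first error it hits. The function name `first_xml_error` and the sample data are hypothetical.

```python
import io
import xml.parsers.expat


def first_xml_error(stream):
    """Return (line, column, message) for the first well-formedness
    error in a binary file-like object, or None if it parses cleanly."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        # ParseFile reads the stream in chunks, so a large file
        # is never held in memory all at once.
        parser.ParseFile(stream)
    except xml.parsers.expat.ExpatError as err:
        return err.lineno, err.offset, xml.parsers.expat.ErrorString(err.code)
    return None


if __name__ == "__main__":
    # Hypothetical broken sample: the closing tag does not match the opening tag.
    sample = io.BytesIO(b"<authrecord>\n<Title>Jama</preferredtitles>\n</authrecord>")
    print(first_xml_error(sample))
```

For the real file, the stream would be opened in binary mode, for example `open(path, "rb")`. If an error is reported, a strict XSLT processor would most likely reject the file as well, which favors a text-based approach.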
The entire file is composed of 23,000+ records. A simplified representation of a record looks like this:
4609
<authrecord>
<Title>Journal Of The American Medical Association</Title>
<id jakeid="4609" type="serial"/><serialinfo issn="0098-7484"/>
<subject classtype="lcsh" source="lc">American Medical Association</subject>
<subject classtype="lcsh" source="lc">Medicine</subject>
<subject classtype="lcsh" source="lc">Medical Care</subject>
<preferredtitles>Jama: Journal Of The American Medical Association</preferredtitles>
<preferredtitles>Jama; Journal Of The American Medical Association</preferredtitles>
<preferredtitles>Jama</preferredtitles>
</authrecord>
Of course, each “record” is much more detailed, and I am looking to remove the extraneous information. I need these records parsed in several ways. The first way I need it parsed is by turning each individual record into several records, one for each <preferredtitles> field.
The format would look like this:
Preferredtitle, id jakeid, serialinfo issn, Title
An example of the records derived from the record above would be:
Jama: Journal Of The American Medical Association, 4609, 0098-7484, Journal Of The American Medical Association
Jama; Journal Of The American Medical Association, 4609, 0098-7484, Journal Of The American Medical Association
Jama, 4609, 0098-7484, Journal Of The American Medical Association
I need the fields in these records delimited with a “|” rather than the commas shown above.
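The first output format could be produced along these lines. This is only a sketch in Python, using regular expressions so that it does not depend on the file being well-formed. It assumes that each preferred title is wrapped in a `<preferredtitles>...</preferredtitles>` pair and that attribute values use straight double quotes. The function name `title_rows` and the sample are illustrative.

```python
import re

TITLE_RE = re.compile(r"<Title>(.*?)</Title>", re.S)
JAKEID_RE = re.compile(r'<id\b[^>]*\bjakeid="([^"]*)"')
ISSN_RE = re.compile(r'<serialinfo\b[^>]*\bissn="([^"]*)"')
PREFERRED_RE = re.compile(r"<preferredtitles>(.*?)</preferredtitles>", re.S)


def _first(pattern, record):
    """Return the first captured value for pattern, or '' if it is absent."""
    match = pattern.search(record)
    return match.group(1).strip() if match else ""


def title_rows(record, sep="|"):
    """Return one 'preferredtitle|jakeid|issn|Title' line per preferred title."""
    jakeid = _first(JAKEID_RE, record)
    issn = _first(ISSN_RE, record)
    title = _first(TITLE_RE, record)
    return [sep.join((pref.strip(), jakeid, issn, title))
            for pref in PREFERRED_RE.findall(record)]


# Sample adapted from the simplified record in this posting.
SAMPLE = """<authrecord>
<Title>Journal Of The American Medical Association</Title>
<id jakeid="4609" type="serial"/><serialinfo issn="0098-7484"/>
<preferredtitles>Jama: Journal Of The American Medical Association</preferredtitles>
<preferredtitles>Jama</preferredtitles>
</authrecord>"""

if __name__ == "__main__":
    for row in title_rows(SAMPLE):
        print(row)
```

The sketch does not decode XML entities such as `&amp;` and does not escape a “|” that appears inside a title. Both would need handling if they occur in the real data.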
The second way I need this file parsed is to create several records based on the <subject classtype="lcsh" source="lc"> field.
The format would be:
Jakeid, subject classtype="lcsh" source="lc"
An example of the records derived from the record above would be:
4609, American Medical Association
4609, Medicine
4609, Medical Care
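A sketch of the second output, again in Python with regular expressions and under the same assumption of straight-quoted attributes. Only `<subject>` elements whose `classtype` and `source` match are kept. The posting does not state a delimiter for this file, so the “|” default below is an assumption, as are the name `subject_rows` and the non-matching `mesh` subject in the sample.

```python
import re

JAKEID_RE = re.compile(r'<id\b[^>]*\bjakeid="([^"]*)"')
SUBJECT_RE = re.compile(r"<subject\b([^>]*)>(.*?)</subject>", re.S)
ATTR_RE = re.compile(r'(\w+)\s*=\s*"([^"]*)"')


def subject_rows(record, classtype="lcsh", source="lc", sep="|"):
    """Return one 'jakeid|subject' line per matching <subject> element."""
    found = JAKEID_RE.search(record)
    jakeid = found.group(1) if found else ""
    rows = []
    for attr_text, text in SUBJECT_RE.findall(record):
        # Turn the raw attribute text into a dict so attribute order does not matter.
        attrs = dict(ATTR_RE.findall(attr_text))
        if attrs.get("classtype") == classtype and attrs.get("source") == source:
            rows.append(sep.join((jakeid, text.strip())))
    return rows


# Sample adapted from this posting; the "mesh" subject is a made-up
# entry that shows the filter skipping other subject types.
SAMPLE = """<authrecord>
<id jakeid="4609" type="serial"/><serialinfo issn="0098-7484"/>
<subject classtype="lcsh" source="lc">American Medical Association</subject>
<subject classtype="mesh" source="nlm">Ignored Heading</subject>
<subject source="lc" classtype="lcsh">Medicine</subject>
</authrecord>"""

if __name__ == "__main__":
    for row in subject_rows(SAMPLE):
        print(row)
```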
I have attached a file that shows a real record.
Deliverables:
1) Complete and fully functional working program(s) in executable form, as well as complete source code of all work done.
2) Installation package that will install the software (in ready-to-run condition) on the platform(s) specified in this bid request.
3) Complete ownership and distribution copyrights to all work purchased.
4) Two parsed text files and the code that produced them.
Platform:
VB, Perl, or XSLT (your choice)
Must be 100% finished and received by buyer on:
Feb 20, 2003 EDT
Deadline legal notes: All times are expressed in the time zone of the site EDT (UT - 5). If the buyer omitted a time, then the deadline is 11:59:59 PM EDT on the indicated date.
Additional Files:
This bid request includes IMPORTANT additional attached files. Please download and read fully before bidding.
Bidding/Comments:
All monetary amounts on the site are in United States dollars.
Rent a Coder is a closed auction, so coders can only see their own bids and comments. Buyers can view every posting made on their bid requests.
This bid was accepted by the buyer!
Bid Amount: $20 (USD)
Date: Jan 30, 2003 6:21:09 PM EDT
Coder Rating: 10 (Excellent)
I have done extensive work on XML processing in Java, but I could do it in Perl instead, if you like.
In either case the file would be processed as a stream: the whole file would not be loaded into memory, and it need not be well-formed. The two resulting files would be produced in a single run of the parser.
I know that I have no rating yet, but I have many years of experience developing production-level code, as you can see from my resume.
Hope to hear from you.
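The stream approach described in this bid might look roughly like the following. This is a Python sketch rather than the bidder's actual Java or Perl code. It assumes that every record starts with a literal `<authrecord>` tag and ends with `</authrecord>`, and that neither tag is split across lines. The name `iter_records` is hypothetical.

```python
def iter_records(lines, start="<authrecord>", end="</authrecord>"):
    """Yield each start...end block as a string, one record at a time.

    `lines` can be any iterable of text lines, such as an open file,
    so only the current record is ever held in memory.
    """
    pending = ""
    for line in lines:
        pending += line
        while True:
            begin = pending.find(start)
            if begin == -1:
                # No record has started yet: discard text between records.
                pending = ""
                break
            stop = pending.find(end, begin)
            if stop == -1:
                # The record is still open: keep it and read more lines.
                pending = pending[begin:]
                break
            stop += len(end)
            yield pending[begin:stop]
            pending = pending[stop:]


if __name__ == "__main__":
    demo = [
        "stray text\n",
        "<authrecord>\n",
        "<Title>A</Title>\n",
        "</authrecord><authrecord><Title>B</Title>\n",
        "</authrecord>\n",
    ]
    for record in iter_records(demo):
        print(repr(record))
```

Because the records are located by plain text search rather than by an XML parser, a malformed record does not stop the run. Each yielded record can be used to write rows to both output files in the same pass.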