I have a project that I want to do with the encyclicals of John Paul II to make them more accessible to folks, but I need some help from someone with more scripting experience than me.
Here’s the deal: I’ve got the encyclicals in HTML format, but I need to have the footnotes and (preferably) the parenthetical Scripture references STRIPPED OUT of them.
I could do this with my limited scripting experience, but it would be a needlessly long and painful process.
THEREFORE,
If someone with more scripting experience than I would like to step up to the plate and volunteer, I’d be most appreciative.
Lemme know.
More details on request.
Much obliged!
So long, and thanks for all the fish!
Unless you want somebody to write a batch script, just open the HTML in an editor, view the source code, and use Find and Replace (sometimes just called Replace). Make sure that you have wildcards turned on (FrontPage/MS Office).
Using the Vatican’s HTML coding:
[a href=*[/a]
(when you type this use < for [ and > for ])
will replace all the superscripts with nothing
[a name=*[/a]
(when you type this use < for [ and > for ])
will replace all the footnotes with nothing
(cf.*)
will replace all the parenthetical references with nothing
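For anyone who would rather script it than click through Find and Replace, here is a rough Python sketch of the same three patterns. It assumes, as the comment above does, that the Vatican pages mark footnote call-outs with `<a href=...>` and footnote targets with `<a name=...>`; the function name `strip_references` and the sample string are made up for illustration. Note that the first pattern removes every link, not just footnote links, and the second removes only the anchor element itself, not any footnote text that follows it.

```python
import re

# These mirror the three wildcard searches above. The non-greedy ".*?" stops
# each match at the first closing tag or parenthesis instead of the last one.
LINK = re.compile(r"<a\s+href=.*?</a>", re.IGNORECASE | re.DOTALL)
ANCHOR = re.compile(r"<a\s+name=.*?</a>", re.IGNORECASE | re.DOTALL)
CF_REF = re.compile(r"\s*\(cf\..*?\)", re.IGNORECASE | re.DOTALL)


def strip_references(html: str) -> str:
    """Delete link elements, named anchors, and "(cf. ...)" parentheticals."""
    for pattern in (LINK, ANCHOR, CF_REF):
        html = pattern.sub("", html)
    return html


if __name__ == "__main__":
    sample = 'Faith seeks reason<a href="#fn3">3</a> (cf. Rom 1:20).'
    print(strip_references(sample))
```

Try it on a copy of one encyclical first and compare the result by eye before running it over the whole set.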
Send me the files, and I’ll see if I can get around to it (I’m supposed to be learning how to script in Perl for work anyway, so this would be good practice).
Excuse me…
Can I ask a “What does GOD need with a starship” question?
With all due respect, how do you propose to make them MORE ACCESSIBLE than the Vatican website, which has them in multiple languages?
Come and see. . . .
My guess: get a robot to read them and save as downloadable mp3s?
I could whip up a Perl script to not only strip that stuff out, but re-mangle the encyclicals in DocBook format. From that they can be exported to HTML, PDF, text, Microsoft Help files, whatever.
It could be a bit tricky because it’s not uniform. I took a look at Fides et Ratio as an example (Catholic geek that I am). Indeed there are times when there is just a footnote number, e.g. (3). That’s easy. Then there are times when it says (cf. Rom 1:21-21). That would be easy to find: parenthetical expressions beginning with cf. Then there are times when it skips the ‘cf.’ and just says (1 Cor 1:20) or even just (2:17). Did you want to do anything with the paragraph numbering? Or the occasional use of Roman numerals?
If I or another reader were to attempt to write it, the script would start with a regular expression that matches parenthesized regions. It would then examine the contents of each one. Just digits (and perhaps colons and commas)? Remove it. Does it open with cf.? Remove it. Does it contain an abbreviation of a Bible book? Remove it. That last one would be the trickiest, but not impossible.
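To make that concrete, here is a minimal Python sketch of the three checks. It is only a sketch: the `BOOKS` list is a hypothetical, partial set of abbreviations that would need extending to match what the encyclicals actually use, the function names are made up, and it assumes the text no longer has HTML tags inside the parentheses.

```python
import re

# Hypothetical, deliberately partial list of Bible book abbreviations.
BOOKS = ("Gen", "Ex", "Ps", "Prov", "Wis", "Sir", "Is", "Mt", "Mk", "Lk", "Jn",
         "Acts", "Rom", "Cor", "Gal", "Eph", "Phil", "Col", "Heb", "Jas", "Pet", "Rev")

# A parenthesized region with no nested parentheses, plus any leading spaces.
PAREN = re.compile(r"\s*\(([^()]*)\)")
# Footnote numbers or bare chapter:verse references such as "3" or "2:17".
DIGITS_ONLY = re.compile(r"\d[\d\s:,;\-–]*")
# An optional book number, a known abbreviation, then a chapter digit.
BOOK_REF = re.compile(r"(?:[1-3]\s*)?(?:%s)\b\.?\s*\d" % "|".join(BOOKS))


def is_reference(inner: str) -> bool:
    """Apply the three checks to the text found between the parentheses."""
    inner = inner.strip()
    if DIGITS_ONLY.fullmatch(inner):
        return True
    if inner.lower().startswith("cf."):
        return True
    return BOOK_REF.match(inner) is not None


def strip_parentheticals(text: str) -> str:
    """Remove parentheticals that look like references; keep all others."""
    return PAREN.sub(
        lambda m: "" if is_reference(m.group(1)) else m.group(0), text
    )


if __name__ == "__main__":
    sample = ("Reason (3) seeks truth (cf. Rom 1:21), as Paul says (1 Cor 1:20) "
              "and later (2:17), but not always (as noted above).")
    print(strip_parentheticals(sample))
```

Keeping the decision in `is_reference` means ordinary parenthetical remarks survive, and any new reference style found in another encyclical only needs one more check there.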
I only examined the one encyclical. Who knows how consistent the Vatican website is amongst the others. Were you looking to get rid of the footnote explanations at the bottom as well?
As far as accessibility goes, besides MP3s for audio players, they could also be linked text docs within an iPod’s Notes feature. Not sure how often I’d read ’em that way though… 😉
PS. Are you using GMT for posting times? Movabletype/Typepad does allow you to set your local (San Diego) time zone.
Forget the comment about posting times. It’s only the comment Preview script that wasn’t using your local time. Once the comment went live, it was Pacific time. As Emily Litella used to say, “Ohhh…Never mind”.
I might be able to help. Could you please provide me with additional details.
You might want to get detagger from jafsoft. I use it to strip HTML coding from EWTN (and other) pages.
http://www.jafsoft.com/detagger/
If you post a link to a particular encyclical I’ll run it through detagger on my machine and verify that it does strip out the html tags you don’t want.
Michael
Check out this file. I’ve converted it and left the footnoted URLs in (copy and paste into your browser):
http://www.onehawk.com/freefiles/Fides_et_Ratio-John_Paul_II-Encyclical_Letter_September_15_1998.txt
You can keep the linked urls as footnotes at the end of the doc or you can discard them during the conversion.
Michael
Michael’s notes do a good job of removing HTML tags — not unlike what one might get from running lynx in batch mode. I checked Michael’s txt file. However I think what Jimmy is looking for is removing the text portions that reference footnotes and scriptural references, not stripping the HTML necessarily.
If you’d be fine with removing all parenthetical expressions within the text, that’d be easy.
Have to admit I’m also in the ‘how could they be more accessible’ category. Even more than that, how could the removal of tags make them more accessible? A good explanation of the problem is critical in solving the problem.
Hi All
Are these any use?
http://www.catholic-pages.com/dir/fides_et_ratio.asp
Formatting is nice in the pdf version.
Footnotes etc still in.
I would like your thoughts on how we might pray for John Ankerberg to come out of his Reformation heresy into the light of God’s one true Holy Apostolic Roman Catholic Church? Ankerberg almost seems demon possessed in his vendetta against Rome. I just wish he would up and Cross the Tiber as it were.