HTML cleanup for ePub

Post by **gyuen** » Tue Oct 30, 2012 3:42 pm

Hey,

A lot of times in Word, because of search and replace or other edits, the HTML code can get a bit messy. For example, if some text in a Word .doc has attributes like bold or italic, and I then copy and paste, search and replace, or otherwise insert other text with the same attributes, I can get:

<b>this gets m</b><b>essy</b>

So in my ePub, I have stuff like this:

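<span class="t3"><b>Chapitre VII &#58; Du traitement de la lèpre &#40;</b></span><span class="t3"><b><i>ku</i></b></span><span class="t3"><b><i>&#7779;&#7789;ha</i></b></span><span class="t3"><b>&#41;</b></span><span class="t3"><b> </b></span><span class="t3"><b>et autres affections cutanées</b></span>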

which is really just one span class, with the whole line bold and one word in italic. Would it be possible to clean that up automatically in the future? I know of no HTML tool that does this; otherwise I'd use it afterwards.
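For readers who want to experiment outside Atlantis, here is a minimal Python sketch of the kind of cleanup being asked for. It is not something Atlantis or Word provides; the function name `merge_adjacent_inline` and the `INLINE` and `VOID` tag lists are assumptions made for illustration. It assumes a well-formed fragment and only merges an inline element that is reopened with exactly the same opening tag immediately after being closed, with no text or whitespace in between.

```python
import re

# Inline formatting elements that are safe to merge; block elements such as
# <p> or <li> are deliberately left out so paragraphs are never joined.
INLINE = {"span", "b", "i", "em", "strong", "u", "sub", "sup"}
VOID = {"br", "hr", "img", "input", "meta", "link"}

TAG = re.compile(r"(<[^>]+>)")
NAME = re.compile(r"</?\s*([A-Za-z][A-Za-z0-9]*)")


def tag_name(token):
    match = NAME.match(token)
    return match.group(1).lower() if match else ""


def merge_adjacent_inline(html):
    """Drop close/open pairs such as </b></span><span class="t3"><b>."""
    out = []      # tokens kept so far
    stack = []    # opening tags of the elements currently open
    closed = []   # opening tags of elements closed since the last text
    for token in TAG.split(html):
        if not token:
            continue
        name = tag_name(token)
        is_tag = token.startswith("<") and bool(name)
        if is_tag and token.startswith("</"):
            opener = stack.pop() if stack else None
            out.append(token)
            closed.append(opener)
        elif is_tag and name not in VOID and not token.endswith("/>"):
            if closed and closed[-1] == token and name in INLINE:
                # Same element reopened right after it was closed:
                # remove the closing tag and skip this opening tag.
                out.pop()
                closed.pop()
            else:
                out.append(token)
                closed = []
            stack.append(token)
        else:
            # Text, void elements, comments: they end a run of closing tags.
            out.append(token)
            closed = []
    return "".join(out)


messy = ('<span class="t3"><b>(</b></span>'
         '<span class="t3"><b><i>ku</i></b></span>'
         '<span class="t3"><b><i>&#7779;&#7789;ha</i></b></span>'
         '<span class="t3"><b>&#41;</b></span>')
print(merge_adjacent_inline(messy))
# <span class="t3"><b>(<i>ku&#7779;&#7789;ha</i>&#41;</b></span>
```

Because the comparison is on the full opening tag, a `<span class="t3">` is never merged with a span of a different class, so CSS classes survive. A regex tokenizer like this is fragile on real-world HTML (comments containing `>`, scripts, CDATA), so it is best tried on a copy of the ePub files.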

Post by **Robert** » Tue Oct 30, 2012 6:44 pm

Hi,
You might want to try your luck with this approach:

1. Save your code as an HTML file in a plain-text editor.
2. Display that “.htm” or “.html” file in Firefox or any similar browser.
3. Select, then copy (Ctrl+C) the Web page contents as displayed in the browser.
4. Paste (Ctrl+V) the clipboard contents into an Atlantis new document (Ctrl+N).
5. Save that new document as a Web page from Atlantis (File | Save Special > Save As Web Page…)

Saved from Atlantis, your HTML code should be a lot cleaner than the original code.

HTH.
Cheers,
Robert

Post by **gyuen** » Tue Oct 30, 2012 8:35 pm

Thanks. I could try that, though I'd lose things like CSS styles and footnotes and would have to re-apply them. I understand it's a problem with Microsoft Word; it should be smart enough to remove redundant formatting. Still, if Atlantis could fix it when making an ePub, that'd be really great.

Post by **Robert** » Tue Oct 30, 2012 8:53 pm

Atlantis does not open or edit EPUB files (code) directly.
But Atlantis saves RTF, DOC, or DOCX documents to EPUB code that is both clean and completely true to the official specifications.
All you actually need is to create appropriate RTF, DOC, or DOCX contents in Atlantis, and save them to EPUB from Atlantis.

You could also try selecting and copying the contents of your EPUB file as it displays in an EPUB reader, then paste it into Atlantis, and save these contents as EPUB (again), from Atlantis this time.

Post by **gyuen** » Tue Oct 30, 2012 8:59 pm

What I'm doing is taking a .doc from Microsoft Word, opening it in Atlantis, and converting to ePub. The ePub HTML is what is really messy. Yes, that's because MS Word isn't great at reducing redundant and unnecessary format codes. Still, Atlantis could clean it up a bit when making ePubs.

Post by **Robert** » Tue Oct 30, 2012 9:06 pm

Instead of saving the MS Word DOC files to EPUB directly, you could try this:

1. Open the MS Word DOC file in Atlantis.
2. Press Ctrl+A, then Ctrl+C.
3. Press Ctrl+N to create a new Atlantis document, then Ctrl+V, still in Atlantis.
4. Save that new document to EPUB from Atlantis.

Give it a try. Your code might be cleaner.

Post by **gyuen** » Tue Oct 30, 2012 9:18 pm

Thanks for the help, Robert. Tried it. It might be a bit cleaner; I'm not sure. I've already cleaned up the ePub for the doc in question quite a bit, so it's hard to compare. But when I tried the copy and paste you suggested, I got:

<h1 class="h11"><a id="a412"></a><a id="a413"></a><a id="a410"></a><a id="a409"></a><a id="a411"></a><a id="a414"></a>II – Section des diagnostics : Nidānasthāna</h1>

I'd guess the formatting codes really haven't changed.

I don't think any suggestion would really help the situation. What I'm hoping is that Atlantis could, in the future, clean up and make up for a shortcoming in Word (one that may never be fixed); it really does complicate the HTML and probably makes rendering slower. If Atlantis could take care of this, it would prove again that the small guys can do things the big ones like Microsoft always overlook.

By the way, I'm not sure the <a id="a412"> anchors are necessary. I often see them, but I can't find any link that refers to them, yet Atlantis seems to add that code. Maybe it comes from Word; I don't know.
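Whether such anchors are needed depends on whether anything links to them, and the thread does not establish where they come from. A rough Python sketch for checking this across the unpacked ePub files follows. The function name and the sample file names are hypothetical. It treats an id as used if any file, including the table of contents, contains an `href` or `src` value ending in `#` plus that id, and it only removes empty anchors written exactly as `<a id="…"></a>`.

```python
import re

# Empty anchors exactly as they appear in the heading quoted above.
EMPTY_ANCHOR = re.compile(r'<a id="([^"]+)"></a>')
# Fragment references, e.g. href="#a413" or src="ch2.xhtml#a413".
FRAGMENT_REF = re.compile(r'(?:href|src)="[^"#]*#([^"]+)"')


def drop_unreferenced_anchors(files):
    """files maps a file name to its text; returns a cleaned copy."""
    referenced = set()
    for text in files.values():
        referenced.update(FRAGMENT_REF.findall(text))

    def keep_if_referenced(match):
        return match.group(0) if match.group(1) in referenced else ""

    return {name: EMPTY_ANCHOR.sub(keep_if_referenced, text)
            for name, text in files.items()}


# Hypothetical two-file book: one chapter and its NCX table of contents.
book = {
    "ch2.xhtml": '<h1 class="h11"><a id="a412"></a><a id="a413"></a>II</h1>',
    "toc.ncx": '<content src="ch2.xhtml#a413"/>',
}
print(drop_unreferenced_anchors(book)["ch2.xhtml"])
# <h1 class="h11"><a id="a413"></a>II</h1>
```

The check is deliberately conservative: an id referenced from any file is kept everywhere, even if the reference points at a different chapter.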

Post by **Robert** » Wed Oct 31, 2012 5:36 pm

"What I'm doing is taking a .doc from Microsoft Word, opening it in Atlantis, and converting to ePub"

Hi,

Here is a quote from http://stackoverflow.com/questions/6796 ... -word-html:

Word 2007 has a "publish > blog" menu item on the Office menu (top left corner).
Using this feature seems to do an incredibly good job of cleaning the HTML, far better than any of the other HTML exporters built into Word (like "save as HTML Filtered").
I have actually set up a bogus free blog somewhere just to use this HTML-cleaning capability. Most long articles on Joel on Software originated in Word 2007 and were published to a fake blog just to clean up the HTML.

Edit: as pointed in comments, be sure that you enter a title for the fake post. If you don't, Word will show a generic error "Can't publish your post"
this worked great for me. i created a livespaces account. be careful not to accidentally select the link at the top of the blog that links you back to livespaces. also when creating the blog entry you need to enter a title otherwise it will give you a useless error telling you it can't be posted

Now here is still from the same Web page:

In my opinion? Don't use it.
But in the real world, I've found that FCKEditor does a decent job of cleaning up Word's fantastically hideous HTML.
I'm impressed. To use it, click "Paste from Word" button, which opens a dialog. Paste the contents of your Word document there.
this seemed to be the best option for me. just use the demo on the FCKEditor site.
I found this while Googling options to clean up MS Word HTML. OMG. At face value, this seemed a ridiculous option. So of course, I tried it. CKeditor (name has changed since OP) was AWESOME at cleaning up a very large and messy HTML file. I was very impressed at the crispness of the HTML that came out of the MS Word mess. Amazing transformation. This saved me hours of REGEX hand cleaning HTML. Got my vote.

Finally, I tried using the FCKEditor at http://ckeditor.com/demo myself.

Here is what I did:

1. I first selected then copied the code you gave as an example in this same thread:

Code:

<span class="t3"><b>Chapitre VII &#58; Du traitement de la lèpre &#40;</b></span><span class="t3"><b><i>ku</i></b></span><span class="t3"><b><i>&#7779;&#7789;ha</i></b></span><span class="t3"><b>&#41;</b></span><span class="t3"><b> </b></span><span class="t3"><b>et autres affections cutanées</b></span>

2. I pasted that code into Notepad++ (http://notepad-plus-plus.org/) and saved that “document” as an HTML file.

3. I opened that HTML file in Pale Moon, my default browser (http://www.palemoon.org/). But you could use Firefox. The aim of this is to be able to copy the rendered text as it is displayed in the browser window.

Note that you don’t need to go through steps 2 and 3 yourself. You only need to open your DOC file in MS Word, select its text, and copy it to the Windows clipboard (Ctrl+C).

I then went to http://ckeditor.com/demo, created a New Page, then used their “Paste from Word” button. The WYSIWYG display was OK. I pressed the “Source” button. I got this code instead of the code originally pasted:

Code:

<p><strong>Chapitre VII &#58; Du traitement de la l&egrave;pre &#40;</strong><strong><em>ku</em></strong><strong><em>&#7779;&#7789;ha</em></strong><strong>&#41;</strong> <strong>et autres affections cutan&eacute;es</strong></p>

Is this the kind of cleaner code you’re after?

HTH.
Cheers,
Robert
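A quick way to confirm that a cleanup pass like the one in the post above changed only the markup and not the words is to compare the visible text before and after. This is a minimal Python sketch using the standard library's `html.parser`; the class and function names are assumptions for illustration, and the snippets are shortened from the two code samples above.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects character data; entities are decoded by the parser."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def visible_text(html):
    collector = TextCollector()
    collector.feed(html)
    collector.close()  # flush any text still buffered by the parser
    # Collapse whitespace runs so line breaks in the source do not matter.
    return " ".join("".join(collector.parts).split())


before = ('<span class="t3"><b>&#40;</b></span>'
          '<span class="t3"><b><i>ku</i></b></span>'
          '<span class="t3"><b><i>&#7779;&#7789;ha</i></b></span>'
          '<span class="t3"><b>&#41;</b></span>')
after = ('<p><strong>&#40;</strong><strong><em>ku</em></strong>'
         '<strong><em>&#7779;&#7789;ha</em></strong>'
         '<strong>&#41;</strong></p>')
print(visible_text(before) == visible_text(after))  # True
```

The comparison covers text only. It says nothing about styling, so a dropped `class="t3"` or a `<b>` turned into `<strong>` passes unnoticed.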

Post by **gyuen** » Fri Nov 02, 2012 4:54 pm

Thanks Robert,

though the code still has redundant tags that haven't been removed. In my example, it could be:

<span class="t3"><b>Chapitre VII : Du traitement de la lèpre (<i>kuṣṭha</i>) et autres affections cutanées</b></span>

or in yours:

<p><strong>Chapitre VII : Du traitement de la lèpre (<em>kuṣṭha</em>) et autres affections cutanées</strong></p>

I can search and replace some (like "") but much of it, as in my example, comes from Word (maybe there's a macro out there somewhere?), and I still know of no software that can clean it up. Also, in your example, the CSS gets removed.