Hey,
A lot of the time in Word, because of search and replace or other edits, the resulting HTML code gets a bit messy. For example, if text in a Word .doc has attributes like bold or italic, and I then copy and paste, search and replace, or otherwise insert more text with the same attributes, I can get:
<b>this gets m</b><b>essy</b>
So in my ePub, I have stuff like this:
<span class="t3"><b>Chapitre VII : Du traitement de la lèpre (</b></span><span class="t3"><b><i>ku</i></b></span><span class="t3"><b><i>ṣṭha</i></b></span><span class="t3"><b>)</b></span><span class="t3"><b> </b></span><span class="t3"><b>et autres affections cutanées</b></span>
which is really just one span class, the whole line bold, and then one word in italic. Is it possible to auto clean that up in the future? I know of no HTML tool that does that; otherwise I'd use it afterwards.
HTML cleanup for ePub
Hi,
You might want to try your luck with this approach:
1. Save your code as an HTML file in a pure text editor.
2. Display that “.htm” or “.html” file in Firefox or any similar browser.
3. Select, then copy (Ctrl+C) the Web page contents as displayed in the browser.
4. Paste (Ctrl+V) the clipboard contents into a new Atlantis document (Ctrl+N).
5. Save that new document as a Web page from Atlantis (File | Save Special > Save As Web Page…)
Saved from Atlantis, your HTML code should be a lot cleaner than the original code.
HTH.
Cheers,
Robert
Atlantis does not open or edit EPUB files (code) directly.
But Atlantis saves RTF, DOC, or DOCX documents to EPUB code that is both clean and completely true to the official specifications.
All you actually need is to create appropriate RTF, DOC, or DOCX contents in Atlantis, and save them to EPUB from Atlantis.
You could also try selecting and copying the contents of your EPUB file as it displays in an EPUB reader, then paste it into Atlantis, and save these contents as EPUB (again), from Atlantis this time.
Instead of saving the MS Word DOC files to EPUB directly, you could try this:
1. Open the MS Word DOC file in Atlantis.
2. Press Ctrl+A, then Ctrl+C.
3. Press Ctrl+N to create a new Atlantis document, then Ctrl+V, still in Atlantis.
4. Save that new document to EPUB from Atlantis.
Give it a try. Your code might be cleaner.
Thanks for the help, Robert. I tried it. It might be a bit cleaner; I'm not sure. I've already cleaned the doc in question (the ePub) quite a bit, so it's hard to compare. But when I tried the copy and paste you suggested, I still got:
<h1 class="h11"><a id="a412"></a><a id="a413"></a><a id="a410"></a><a id="a409"></a><a id="a411"></a><a id="a414"></a>II – Section des diagnostics : <span class="th11">Nid</span><span class="th11">ā</span><span class="th11">nasth</span><span class="th11">ā</span><span class="th11">na</span></h1>
I'd guess the formatting codes aren't really changed.
I don't think there is any suggestion that would really help the situation. What I'm hoping for is that Atlantis could, in the future, clean this up and make up for a shortcoming in Word (one that may never be fixed); it really does complicate the HTML and probably makes rendering slower. If Atlantis could take care of this, it would prove again that the small guys can do things that the big ones like Microsoft always overlook.
btw, the <a id="a412">, I'm not sure if that's necessary? I can't find any reference that uses these anchors, which I often see, yet Atlantis seems to add that code. Maybe it's from Word, but I don't know.
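One rough way to check whether those empty anchors are used by anything is to look for links that point at their ids. Here is a minimal Python sketch, assuming well-formed XHTML without a namespace declaration (a full EPUB content file declares one, so the tag names would need adjusting). `unreferenced_anchor_ids` is a hypothetical helper name, not part of any tool mentioned here. It only sees the text you pass in, so links from other files in the EPUB, such as the table of contents, would have to be included in the input too.

```python
import xml.etree.ElementTree as ET

def unreferenced_anchor_ids(xhtml):
    """Return ids of <a id="..."> anchors that no href in the fragment points at."""
    # Wrap in a dummy root so fragments with several top-level tags still parse.
    root = ET.fromstring("<root>" + xhtml + "</root>")
    anchors = list(root.iter("a"))
    # Collect the part after "#" of every link target, e.g. "a412" from
    # href="#a412" or href="chapter2.xhtml#a412".
    targets = {a.get("href").split("#", 1)[1]
               for a in anchors if "#" in a.get("href", "")}
    return [a.get("id") for a in anchors
            if a.get("id") and a.get("href") is None
            and a.get("id") not in targets]

# Shortened, made-up variant of the heading above, with one link added.
heading = ('<h1 class="h11"><a id="a412"></a><a id="a413"></a>'
           'II <a href="#a413">link</a></h1>')
print(unreferenced_anchor_ids(heading))  # ['a412']
```

An id that shows up in this list is only a candidate for removal; whether a given reading system or the EPUB's navigation relies on it is something the sketch cannot tell.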
Hi,
What I'm doing is taking a .doc from Microsoft Word, opening it in Atlantis, and converting to ePub
Here is from http://stackoverflow.com/questions/6796 ... -word-html:
Word 2007 has a "publish > blog" menu item on the Office menu (top left corner).
Using this feature seems to do an incredibly good job of cleaning the HTML, far better than any of the other HTML exporters built into Word (like "save as HTML Filtered").
I have actually set up a bogus free blog somewhere just to use this HTML-cleaning capability. Most long articles on Joel on Software originated in Word 2007 and were published to a fake blog just to clean up the HTML.
Edit: as pointed out in comments, be sure that you enter a title for the fake post. If you don't, Word will show a generic error "Can't publish your post"
This worked great for me. I created a livespaces account. Be careful not to accidentally select the link at the top of the blog that links you back to livespaces. Also, when creating the blog entry you need to enter a title, otherwise it will give you a useless error telling you it can't be posted.
Now here is still from the same Web page:
In my opinion? Don't use it.
But in the real world, I've found that FCKEditor does a decent job of cleaning up Word's fantastically hideous HTML.
I'm impressed. To use it, click the "Paste from Word" button, which opens a dialog. Paste the contents of your Word document there.
This seemed to be the best option for me. Just use the demo on the FCKEditor site.
I found this while Googling options to clean up MS Word HTML. OMG. At face value, this seemed a ridiculous option. So of course, I tried it. CKeditor (name has changed since OP) was AWESOME at cleaning up a very large and messy HTML file. I was very impressed at the crispness of the HTML that came out of the MS Word mess. Amazing transformation. This saved me hours of REGEX hand cleaning HTML. Got my vote.
Finally, I tried using the FCKEditor at http://ckeditor.com/demo myself.
Here is what I did:
1. I first selected then copied the code you gave as an example in this same thread:
<span class="t3"><b>Chapitre VII : Du traitement de la lèpre (</b></span><span class="t3"><b><i>ku</i></b></span><span class="t3"><b><i>ṣṭha</i></b></span><span class="t3"><b>)</b></span><span class="t3"><b> </b></span><span class="t3"><b>et autres affections cutanées</b></span>
3. I opened that HTML file in Pale Moon, my default browser (http://www.palemoon.org/). But you could use Firefox. The aim of this is to be able to copy the rendered text as it is displayed in the browser window.
Note that you don’t need to go through steps 2 and 3 yourself. You only need to open your DOC file in MS Word, select its text, and copy it to the Windows clipboard (Ctrl+C).
I then went to http://ckeditor.com/demo, created a New Page, then used their “Paste from Word” button. The WYSIWYG display was OK. I pressed the “Source” button. I got this code instead of the code originally pasted:
<p><strong>Chapitre VII : Du traitement de la lèpre (</strong><strong><em>ku</em></strong><strong><em>ṣṭha</em></strong><strong>)</strong> <strong>et autres affections cutanées</strong></p>
HTH.
Cheers,
Robert
Thanks Robert,
though the code still has redundant tags that aren't removed. In my example, it could be:
<span class="t3"><b>Chapitre VII : Du traitement de la lèpre (<i>kuṣṭha</i>) et autres affections cutanées</b></span>
or in yours:
<p><strong>Chapitre VII : Du traitement de la lèpre (<em>kuṣṭha</em>) et autres affections cutanées</strong></p>
I can search and replace some of it (like "</i><i>"), but a lot of it, as in my example, comes from Word (maybe there's a macro out there somewhere?), and there is still no software I know of that can clean it up. Anyway, in your example, the CSS gets removed.
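For what it's worth, the merge described above (neighbouring tags with the same name and the same attributes collapsed into one) can be sketched in a few lines of Python. This is only a rough sketch, not something Atlantis or Word provides: the function names `merge_adjacent` and `clean_fragment` are made up for illustration, and it assumes the input is well-formed XHTML (as EPUB content should be), with no named entities such as `&nbsp;` that the standard XML parser would reject.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def merge_adjacent(parent):
    """Merge neighbouring children that share the same tag and attributes."""
    i = 0
    while i < len(parent) - 1:
        first, second = parent[i], parent[i + 1]
        same = first.tag == second.tag and first.attrib == second.attrib
        if same and not first.tail:  # nothing, not even a space, between them
            # Append the second element's leading text to the end of the first.
            if len(first):
                first[-1].tail = (first[-1].tail or "") + (second.text or "")
            else:
                first.text = (first.text or "") + (second.text or "")
            first.extend(list(second))  # move the child elements across
            first.tail = second.tail
            parent.remove(second)
        else:
            i += 1
    # Merging may have brought mergeable grandchildren together, so recurse.
    for child in parent:
        merge_adjacent(child)

def clean_fragment(fragment):
    """Parse an XHTML fragment, merge redundant neighbours, return it as text."""
    # Wrap in a dummy root so fragments with several top-level tags still parse.
    root = ET.fromstring("<root>" + fragment + "</root>")
    merge_adjacent(root)
    return escape(root.text or "") + "".join(
        ET.tostring(child, encoding="unicode") for child in root)

print(clean_fragment("<b>this gets m</b><b>essy</b>"))  # <b>this gets messy</b>
```

Elements are only merged when nothing at all sits between them, so `<b>a</b> <b>b</b>` (with a space in between) is left alone, and spans with different classes stay separate. Run on the chapter heading from the start of the thread, it should produce the single-span version with one `<i>kuṣṭha</i>` inside one `<b>`, keeping the class attribute.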