Archive The Web With MHTML Files

by JL Beeken on 2-14-2012

There’s another way to archive web pages besides saving them in personal software or printing them to PDF. It’s called MHTML.

Scrapbook is a good place to save pages. I have my whole website there. But printing them is messy.

Printing to PDF out of Firefox means you get the whole page including sidebars and advertising. And it splits text and images across page breaks and throws sidebars across the bottom.

Posts and pages can be captured in OneNote, but OneNote (2007) is painfully slow at pasting anything longer than a short snippet. It also doesn’t keep paragraph spacing, and that’s a deal breaker.

Leaving out the sidebars and pasting into EverNote 2 does the best job of it. You can copy & paste just the post and comments out of a blog, then use a PDF driver (like PDF Creator) to print to PDF. It puts a page break between each post and doesn’t split the images.

PDF Printing Kills Links

All the PDF printers I’ve seen (and that would be about a dozen) kill links unless the links are in the raw http format.

Print Friendly and Web2PDF both print web pages with the links intact. But Print Friendly only prints posts, leaving the comments out, and Web2PDF prints entire pages, including things you might not want.

MHTML Printing

When you save a web page as HTML what you get is an HTML file and a folder of all the images and other files attached to it.

MHTML is a way of saving web pages that embeds the images and other files into the page itself, so your output is just one file.
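For the curious, an MHTML file is a MIME message of type multipart/related (described in RFC 2557): the HTML page is one part, each image is another, and a Content-Location header on each part records the address it came from. Here is a minimal sketch in Python 3 that builds a tiny archive and then reads it back. The function names, the example.com addresses and the stand-in image bytes are made up for illustration; this is not how any of the browsers or EverNote produce their files.

```python
from email import message_from_bytes
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_mhtml(html, page_url, image_bytes, image_url):
    """Bundle one HTML page and one GIF into a single multipart/related message."""
    archive = MIMEMultipart("related", type="text/html")
    archive["Subject"] = "Saved page"

    # First part: the page itself, labelled with the address it came from.
    page = MIMEText(html, "html", "utf-8")
    page["Content-Location"] = page_url
    archive.attach(page)

    # Second part: the image, stored inside the same file (base64 encoded).
    image = MIMEImage(image_bytes, "gif")
    image["Content-Location"] = image_url
    archive.attach(image)

    return archive.as_bytes()


def describe_parts(mhtml_bytes):
    """Return (content type, original address) for every part in the archive."""
    archive = message_from_bytes(mhtml_bytes)
    return [
        (part.get_content_type(), part.get("Content-Location"))
        for part in archive.walk()
        if not part.is_multipart()  # skip the outer container
    ]


if __name__ == "__main__":
    # Stand-in bytes, not a real picture (assumption for the demo).
    fake_gif = b"GIF89a" + b"\x00" * 10
    data = build_mhtml(
        '<html><body><img src="http://example.com/dot.gif"></body></html>',
        "http://example.com/post.html",
        fake_gif,
        "http://example.com/dot.gif",
    )
    for content_type, location in describe_parts(data):
        print(content_type, location)
```

Writing the returned bytes to a file ending in .mht gives you the single-file layout described above. How well a given browser renders it is another matter, as the notes below show.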

MHTML in Different Browsers

Internet Explorer

IE9 can output MHTML pages but it doesn’t do it well. The sidebars get thrown around and the font is inconsistent.

MHTML, Internet Explorer

It reads MHTML produced by Firefox and EverNote 2 just fine.

Firefox

Firefox before version 4 neither produces nor reads MHTML files. From version 4 on, it can do both with an add-on called UnMHT. It makes very nice looking pages and Firefox renders them well.

MHTML Archiving With UnMHT

It can also save multiple tabs at once but makes only one page for each tab.

EverNote 2

Where I first discovered MHTML was through EverNote 2 under Export. Right-click on a note header to find this option.

Export MHTML From EverNote 2

And then choose Web Archive under the Save As options.

Saving MHTML From EverNote 2

It’s possible in EverNote 2 to select more than one note at a time and export them all into one tidy MHTML page that opens in your browser.

And it does a lovely job of it, putting a thin blue border around each note and separating the notes with a blue spacer bar.

MHTML Archived Page From EverNote 2

I exported 45 blog posts from EverNote 2 at one time for a total of 7.4 MB and it took IE9 about 15 seconds to open it.

Google Chrome

Google Chrome reads MHTML but not consistently. It likes pages produced with UnMHT in Firefox. It cannot read MHTML produced by IE9 at all. It reads MHTML produced by EverNote 2 but leaves out the images.

Of what I tested, the only methods that produce good pages right now are EverNote 2 (best read in either IE9 or Firefox) and the Firefox UnMHT plugin, which is the one to use if you also want to read the page in Chrome.

Generally, GIF images tend to come out a bit blurry. JPGs are fine. As far as I can tell the images are actually embedded in the pages, not just linked to them, which is the point.
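If you want to check the embedding for yourself, one way is to open the MHTML file as a MIME message and list the image parts stored inside it. The sketch below does that in Python 3. The sample archive is hand-written for the demo and the function name is my own; real files from IE9, UnMHT or EverNote 2 may be laid out differently, so treat this as a starting point and not a tested tool.

```python
from email import message_from_bytes

# A hand-written miniature archive: one HTML part and one base64 GIF part.
SAMPLE = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; type="text/html"; boundary="----demo"\r\n'
    b"\r\n"
    b"------demo\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Location: http://example.com/post.html\r\n"
    b"\r\n"
    b'<html><body><img src="http://example.com/dot.gif"></body></html>\r\n'
    b"------demo\r\n"
    b"Content-Type: image/gif\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"Content-Location: http://example.com/dot.gif\r\n"
    b"\r\n"
    b"R0lGODlhAQABAAAAACw=\r\n"
    b"------demo--\r\n"
)


def embedded_images(mhtml_bytes):
    """Return (original address, size in bytes) for each image stored in the file."""
    archive = message_from_bytes(mhtml_bytes)
    found = []
    for part in archive.walk():
        if part.get_content_maintype() == "image":
            # decode=True undoes the base64 encoding to get the raw image bytes.
            payload = part.get_payload(decode=True) or b""
            found.append((part.get("Content-Location"), len(payload)))
    return found


if __name__ == "__main__":
    for location, size in embedded_images(SAMPLE):
        print(location, size, "bytes")
```

To run it against a real archive, read the file in binary mode and pass the bytes to embedded_images. An image that shows up in the list with a non-zero size is stored in the file itself, not just linked.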

Text links are preserved as they are in web pages. Of course, a link to a site that no longer exists will still be dead.

Combining MHTML files

The only way I know of combining MHTML files right now is to keep my web pages or snippets in EverNote 2 and then select the ones I want and export them to MHTML together.

You can also use the Scrapbook extension for Firefox, but it’s not a great solution if there are images involved. First save each page as an MHT file and then save them with Scrapbook. Then use the Combine wizard to combine multiple pages. This only works well with text-only pages, because any images are linked from the location of each individual MHT page. If you delete or move those files, there go your images.

4 comments

Ulrich 4-08-2012 at 11:52 PM

Dear EverNote 2.2 users,

Recently I started using Evernote 2.2 again, because IMHO it is still the best portable clipper on Windows. I want to get my notes out of Evernote again so I created a script to perform the Evernote2.2 to pdf conversion with embedded clickable links, which you can grab here:
http://dl.dropbox.com/u/705149/EverSplit.zip

You will need a Python 2 installation (the script is not compatible with Python 3, because I wrote the first version in 2009). I have used PortablePython from my USB stick, where my Evernote database is installed. In that case you will have to enter the full path to the interpreter.
http://www.python.org/ or http://www.portablepython.com/

You will also need the tool
http://code.google.com/p/wkhtmltopdf/

Download the wkhtmltopdf-0.9.9-installer.exe, because version 0.11.0 had trouble with GIFs and did not work for me. After installation, put the binaries and DLLs in a directory “wkhtmltopdf” next to the script, or change the path to the binary, which is hard-coded in the script. If you trust me, this binary is already included in the zip file above.

In Evernote export some of your notes as html. Start with less than 100 because the pdf formatting will need some time. Put the EverNote.html together with the image directory next to the script.

Start the command line, change into the directory and start the program with
> python EverSplit.py Evernote.html
and you will get a nice directory structure with all your notes as pdf sorted by year and month.

Afterwards the directory itself looks a bit messy with all the split HTML files. The script is still a bit rough around the edges because it is my first and only script parsing HTML, but it gets the job done.

Good luck and tell me what you think.

Ulrich

JL 4-09-2012 at 8:37 AM

Not sure exactly what your point is here. Clickable links? If you install a pdf driver, like PDF Creator, you can select multiple notes and print them to PDF. I do it all the time. As long as the links are in http format they’re clickable. Of course, if a note is longer than one page with images, the images may be split across the page break.

As the above post suggests, an alternative is to print multiple notes to MHTML out of EverNote in one long ‘non-messy’ directory with all links clickable and all images intact.

Ulrich 4-09-2012 at 9:28 AM

I wanted all my notes converted into single pdf files sorted by date. Printing manually note by note is too cumbersome because I have hundreds of them in my database. On MacOSX printing to pdf always leaves all links within the html clickable even if the url is hidden behind an alias text or image. On Windows wkhtmltopdf was the only program I have found so far which is capable of doing this and it does not split the images on a page break.

I do not consider MHT a solution for me because it depends on the Windows platform or on Firefox extensions. So there is no preview on MacOSX and no support for indexing and searching. The same holds for .webarchive files, which are the MacOSX equivalent of MHT: well supported by MacOSX but not usable on Windows. Even Safari on Windows cannot open these files. PDF is the way to go here.

The first version of my script in 2009 split the evernote html into single html files note by note. All of them had to go into the same directory to keep the links to the image directory intact. So this is why it may get a little ‘messy’ in this step. On MacOSX I can use the textutil command to convert all these files to .webarchive(s). So I did and transferred all my notes to be used by other Software on my Mac. As I have explained above I was never really satisfied with webarchives and converting to pdf one by one was not always successful.

Recently I started using Evernote 2 again, because I was fed up with all the other cloud syncing software on Windows. Evernote 2 runs very fast on my USB stick at work. Then I discovered the wkhtmltopdf tool and decided to call it directly from the script itself. In addition I wanted all my notes sorted by year and month in a special directory structure. This is possible because of the date string Evernote 2 adds to each note, but it is not possible with Evernote 3 and up.
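The two ideas in that paragraph, calling wkhtmltopdf from a script and filing each PDF under its year and month, can be sketched roughly as follows. This is not the EverSplit script (which is written for Python 2); it is a Python 3 illustration, and the function names, folder layout and file names are assumptions. The only thing taken from wkhtmltopdf itself is its basic command form of an input file followed by an output file.

```python
import subprocess
from datetime import date
from pathlib import Path


def pdf_path(root, note_date, title):
    """Build root/YYYY/MM/title.pdf from a note's date (hypothetical layout)."""
    year = f"{note_date.year:04d}"
    month = f"{note_date.month:02d}"
    return Path(root) / year / month / f"{title}.pdf"


def wkhtmltopdf_command(binary, html_file, pdf_file):
    """Build the command line: wkhtmltopdf <input.html> <output.pdf>."""
    return [str(binary), str(html_file), str(pdf_file)]


def convert(binary, html_file, pdf_file):
    """Create the target folder if needed, then run wkhtmltopdf on one note."""
    pdf_file = Path(pdf_file)
    pdf_file.parent.mkdir(parents=True, exist_ok=True)
    # check=True raises an error if wkhtmltopdf exits with a failure code.
    subprocess.run(wkhtmltopdf_command(binary, html_file, pdf_file), check=True)


if __name__ == "__main__":
    # Only prints the planned command; nothing is converted here.
    target = pdf_path("notes", date(2012, 4, 8), "my-note")
    print(wkhtmltopdf_command("wkhtmltopdf", "my-note.html", target))
```

The date for each note would have to be parsed out of the exported HTML first. How Evernote 2 writes that date string is not shown here, so that step is left out.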

There may be other ways of printing, but this script is for exporting many notes out of EN2 in a well supported format.

JL 4-09-2012 at 10:29 AM

OK. It sounds like you’ve created a script for getting around problems with Macs. As far as I know, you can use Firefox on a Mac.

I don’t use a Mac so I haven’t run into the problems you have. MHTML is just another type of html page and can be searched like any other. Surprisingly, as noted above, it’s not supported well by all browsers, yet.

I print multiple notes as PDF from EverNote 2 all the time. Select All and Print. They’re already in dated order because they go into EverNote chronologically. Most of what I print is non-html email so the links are fine.

I’ve archived my entire website, month by month, using EverNote and export to MHTML. In this case, PDF is not the way to go because of the broken links and split images.

Thanks for your ideas. Maybe some of my readers will find them useful.
