Blog-Archive (bash)

Ever wanted to archive your blog offline? Don't know a way that is both convenient and gives you a nice individual file for each little (or big) text you wrote? Well, this little script for Ubuntu 12.10 downloads your posts and arranges them in individual ODT files using bash commands only (like grep, mkdir, sed, wget, find, and so on).

If you're using Blogger and Ubuntu, there's not much for you to do. Download the script, then set three things inside it: the main URL of your blog (there's a variable for that), the names of your blog "categories" (there's an array for that), and the name of the folder where you want your posts to be downloaded. After that, you're pretty much good to go!
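To give an idea of what these settings might look like, here is a minimal sketch. The variable names (BLOG_URL, CATEGORIES, ARCHIVE_DIR) and their values are placeholders I made up for illustration; the names used in blog-archive.sh may well differ.

```bash
#!/usr/bin/env bash
# Hypothetical settings: names and values are placeholders, not the script's own.
BLOG_URL="http://example.blogspot.com"   # main URL of the blog
CATEGORIES=("Culture" "Politics")        # names of the blog "categories"
ARCHIVE_DIR="blog-archive"               # output folder, relative to the launch directory

# Create one sub-folder per category under the archive folder.
for category in "${CATEGORIES[@]}"; do
    mkdir -p "$ARCHIVE_DIR/$category"
done
```

Adding or removing a category then only means editing the array.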

Download the script here: blog-archive.sh. Almost every line is commented to explain what's going on in the code.

What does the script do? Well, it "analyses" the HTML code of the downloaded webpages and separates each blog post from its "category" webpage. It then puts each post in its own HTML file, named after the title the post has online.
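The splitting step could be sketched roughly like this. It is not the code from blog-archive.sh: the function name split_posts is hypothetical, I used awk for brevity, and I assume (as a simplification) that every post starts with a title sitting on a single line between <h3> tags.

```bash
#!/usr/bin/env bash
# Sketch: split one category page into one HTML file per post.
# Assumption: each post starts with a single-line <h3 ...>Title</h3>.
split_posts() {
    local page="$1" outdir="$2"
    mkdir -p "$outdir"
    awk -v outdir="$outdir" '
        /<h3[^>]*>.*<\/h3>/ {
            title = $0
            sub(/^.*<h3[^>]*>/, "", title)        # drop everything up to the opening tag
            sub(/<\/h3>.*$/, "", title)           # drop the closing tag and what follows
            gsub(/<[^>]*>/, "", title)            # strip nested tags such as <a>
            gsub(/^[ \t]+|[ \t]+$/, "", title)    # trim surrounding whitespace
            gsub("/", "-", title)                 # a slash cannot appear in a file name
            if (out != "") close(out)
            out = outdir "/" title ".html"
        }
        out != "" { print > out }                 # lines before the first title are skipped
    ' "$page"
}
```

Called as split_posts Culture.html blog-archive/Culture, this would write one file per title found in Culture.html.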

Some formatting is also done to prettify the file. If the script finds images in the post body, it saves them separately. Your images will still be displayed in the final ODT files. If there is an iframe (most likely an embedded YouTube video), the script saves the link to the video and adds it to the HTML file.
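Here is one way the image and iframe links could be pulled out of a post with grep and sed. Again, this is only a sketch: extract_src is a name I invented, and it assumes the src values are wrapped in double quotes.

```bash
#!/usr/bin/env bash
# Print the src attribute of every <img> (or <iframe>) tag in an HTML file,
# one URL per line. Assumption: src values are wrapped in double quotes.
extract_src() {
    local tag="$1" file="$2"
    grep -o "<$tag[^>]*src=\"[^\"]*\"" "$file" | sed 's/.*src="\([^"]*\)"$/\1/'
}

# Hypothetical usage (needs network access for wget):
# extract_src img post.html | wget -q -i - -P images/
# extract_src iframe post.html | while IFS= read -r url; do
#     printf '<p><a href="%s">%s</a></p>\n' "$url" "$url" >> post.html
# done
```

The commented usage lines show the idea: images go to a folder next to the post, and each video link is appended to the post as a plain hyperlink.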

Then, a command runs the LibreOffice HTML/ODT converter. Of course, you need to have LibreOffice installed to convert the files! If you want to use a different word processor, feel free to change that crucial line in the script, or any other part of it.
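For reference, a conversion step along these lines can be written with LibreOffice's command-line options --headless, --convert-to and --outdir. This is a sketch, not the line from the script: convert_posts and the OFFICE_BIN override are names I made up, and depending on your LibreOffice version you may have to adjust the options.

```bash
#!/usr/bin/env bash
# Convert every HTML file under a folder to ODT, writing each ODT next to its source.
# OFFICE_BIN is a hypothetical override; it defaults to the "libreoffice" launcher.
convert_posts() {
    local dir="$1"
    find "$dir" -type f -name '*.html' -print0 |
    while IFS= read -r -d '' html; do
        # </dev/null keeps the converter from swallowing the file list on stdin
        "${OFFICE_BIN:-libreoffice}" --headless --convert-to odt \
            --outdir "$(dirname "$html")" "$html" < /dev/null
    done
}
```

Running convert_posts blog-archive would then walk the whole folder tree. To try another converter, point OFFICE_BIN at a program that accepts the same options, or edit the command itself.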

Don't forget to set the custom variables according to your own blog! You can add, remove, or change as many categories as you want. Note that all paths (excluding URLs) are relative to the directory where the script is launched. Therefore, wherever you put the script, the folder tree is always the same relative to the "root" folder (that is, the folder where the script is launched).

Enjoy!


What you need to know before you start using this script is that it is specifically designed for this blog. Like many blogs on the Internet, you can browse this one by clicking on "labels." On a newspaper website, such categories gather posts dealing with the same theme (Culture, Politics, International, ...). On this blog, there are categories, too. When you click on one, you get a page where all the posts of that category are displayed at once, with their full content. This structure is pretty useful for organizational purposes. It also makes the archiving process easier. However, if your blog doesn't use "thematic pages" (that is, if you don't have a page for each category that displays all of its posts), you'll need to adapt the script a bit in order to use it. Furthermore, this script is based on the way HTML pages are currently coded on the Blogger platform. Post titles, for example, are surrounded by <h3> tags. If your blog follows a different logic, again, you'll need to adapt the script. If you don't mind hacking the code a bit, it'll be easy for you.
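On Blogger, these category pages live under /search/label/ followed by the label name, so fetching one can be sketched as below. The helper name label_url and the example blog address are placeholders of mine, not something taken from the script.

```bash
#!/usr/bin/env bash
# Build the URL of a Blogger label ("category") page.
# Spaces in the label are percent-encoded as %20; other special
# characters are not handled in this sketch.
label_url() {
    local blog_url="${1%/}" label="$2"   # ${1%/} drops a trailing slash, if any
    printf '%s/search/label/%s\n' "$blog_url" "${label// /%20}"
}

# Hypothetical usage (needs network access):
# wget -q -O "Culture.html" "$(label_url "http://example.blogspot.com" "Culture")"
```

If your blog exposes its categories under a different address scheme, this is the part to change first.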
