SOLIDIFY


NAME

solidify - download a web site, turning it into static HTML

SYNOPSIS

solidify [options] out-dir https://site.dom/path/ [https://site.dom/path/page...]  

DESCRIPTION

Solidify downloads all URLs that start with the base URL (the second argument) and saves them in out-dir. The objective is to turn the site into static HTML so it can be served by a web server using just the downloaded files. Intra-site URLs are made relative, and pages and URLs may be altered so they can be served from static files.

If there are hidden URLs (URLs that can't be found by following links from the base URL), they can be supplied as additional URLs on the command line.


OPTIONS

--abort=status[,...]
Abort immediately if the server returns one of the numeric HTTP status codes in the comma-separated list. Normally all pages are saved and scanned for URLs.
--header=H:V
Add an HTTP header H: V to each page fetch request. Adding a Cookie header can be used to bypass a login page, and adding a User-Agent header can make some picky sites accept requests.
--progress
Print progress messages to stdout.
--replace=filename
Before scanning and saving HTML pages, update them by running the search-and-replace rules in filename. See MODIFYING THE DOWNLOADED PAGES below.
--skip=status[,...]
Don't save or scan a page for URLs if the server returns one of the numeric HTTP status codes in the comma-separated list. Normally all pages are saved and scanned.
--threads=count
Download up to count pages concurrently. The default is 5, unless --wait is also present, in which case the count must be 1.
--wait=S[:T]
Wait S seconds between each page fetch. If T is supplied, wait a random amount of time between S and T seconds.


RESTARTING

Solidify goes through multiple phases, looking at the entire site during each phase. The first phase, which is by far the longest, downloads all pages in the site to a temporary area. If interrupted, this first phase can be restarted by rerunning the program with the same parameters. Once the first phase is complete the run cannot be restarted.


MODIFYING THE DOWNLOADED PAGES

HTML pages can be modified before they are scanned for URLs and saved, using the --replace= option. Since this happens before solidify processes the page, it can be used to change which URLs solidify sees and then downloads.

The search and replace is done using Python's re.sub(search, replace, HTML) function. This means the search text is a Python regular expression, and the replacement can include text captured by the search. If you aren't familiar with this function, consult the Python re module's documentation before using this option.
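
As a concrete illustration of what re.sub does with a verbose pattern, consider the following snippet. The HTML fragment and the pattern are made up for the example; they are not taken from solidify itself:

```python
import re

# A made-up page fragment standing in for a downloaded HTML page.
html = '<a href="/wiki?action=login">Login</a> <p>Welcome</p>'

# (?x) turns on re.VERBOSE: whitespace in the pattern is ignored and
# '#' comments are allowed, so the expression can be laid out readably.
search = r'''(?x)
    <a\s [^>]* action=login [^>]*>   # the opening anchor tag
    (?P<label> [^<]* )               # capture the link text
    </a>
'''
# The replacement can refer to text captured by the search.
print(re.sub(search, r'<font color=grey>[\g<label> elided]</font>', html))
# prints: <font color=grey>[Login elided]</font> <p>Welcome</p>
```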

Each search in filename is introduced by a line containing the exact string:

.....Search-for:
That string may be preceded by blank lines and comment lines starting with a # sign. The lines following it contain the Python regular expression to search for. Regular expressions can be made more readable (for example, they can contain comments) by starting them with (?x), which turns on re.VERBOSE mode.

If present, the replace string must be immediately preceded by a line containing exactly:

.....Replace-with:
and followed by the text to replace the search string with. If no replace string is present, the text found by the search is deleted.

Example:


    #
    # Don't include revision diff's.
    #
    .....Search-for:
    (?sx)
        <input\stype="radio"\sname="rev[^>]*>
        <input\stype="radio"[^>]*>
        <a\shref="[^"]*[?]action=diff&[^>]*>[^<]*</a>
    .....Replace-with:
    <font color=grey>[diff elided]</font>
    
    #
    # Don't include the login page.
    #
    .....Search-for:
    (?sx)
        <a\s[^>]*href="[^"]*[?]action=login\b[^>]*>Login</a>
    .....Replace-with:
    <font color=grey>[Login elided]</font>
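
A rules file like the one above could be parsed and applied along the following lines. This is only a sketch of the format described in this section, not solidify's actual parser, and it takes the simplifying liberty of skipping blank and comment lines inside rule bodies as well as between rules:

```python
import re

def parse_rules(text):
    """Split a rules file into (search, replace) pairs following the
    format described above.  A sketch only; solidify's real parser may
    differ.  Blank lines and '#' comment lines are skipped everywhere,
    which is a simplification."""
    rules = []
    mode = None                      # None, 'search' or 'replace'
    for line in text.splitlines():
        if line == '.....Search-for:':
            rules.append({'search': [], 'replace': None})
            mode = 'search'
        elif line == '.....Replace-with:':
            rules[-1]['replace'] = []
            mode = 'replace'
        elif mode is None or not line.strip() or line.lstrip().startswith('#'):
            continue
        else:
            rules[-1][mode].append(line)
    return [('\n'.join(r['search']), '\n'.join(r['replace'] or ['']))
            for r in rules]

def apply_rules(rules, html):
    """Run every search-and-replace over the page, as re.sub would."""
    for search, replace in rules:
        html = re.sub(search, replace, html)
    return html

rules = parse_rules('''\
.....Search-for:
(?sx)
    <a\\s[^>]*href="[^"]*[?]action=login\\b[^>]*>Login</a>
.....Replace-with:
<font color=grey>[Login elided]</font>
''')
print(apply_rules(rules, '<a href="/w?action=login">Login</a> rest'))
# prints: <font color=grey>[Login elided]</font> rest
```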


MESSY DETAILS AND LIMITATIONS

Solidify works by downloading the first supplied URL. It discovers additional URLs to download by scanning HTML and CSS for URLs within the same site. It does not scan other file types (for example, JavaScript or JSON).

The downloaded HTML, CSS and JavaScript isn't parsed, just quickly scanned for strings that look like URLs. Solidify tries hard not to miss URLs, but that comes at the expense of mistaking random strings for URLs, which can generate unexpected 404s (HTTP page-not-found errors). 404 errors are mostly harmless; if they really bother you, the #http-error=404 suffix appended to the filename makes it easy to find and delete the resulting downloads. JavaScript is not run, so dynamic content the browser fetches with AJAX will not be seen. All this means the translation process is not an exact science; it's more of a best effort.

Usually solidify preserves URLs and page contents, but in some circumstances URLs must be changed so a static site behaves similarly to the dynamic one, regardless of what base URL it is served from:

-
All intra-site URLs are made relative, so the site can be served from any base URL.
-
Because web servers strip the query component, queries are converted into path components. For example, https://some.dom/path/file.js?query is turned into https://some.dom/path/file.%3fquery.js.
-
If the file extension does not match the Content-Type, a new file extension that does match is appended. For example, https://some.dom/path/file.php?foo=bar&x=y might be turned into: https://some.dom/path/file.%3ffoo=bar&x=y.php.html
-
A URL ending with a '/' is changed so it fetches a file called 'index' with an extension matching the HTTP Content-Type. For example, https://some.dom/path/ might be turned into: https://some.dom/path/index.html.
-
If a URL is both a page and has pages underneath it (that is, it is also a directory), then the page will be renamed. For example, if both https://site.com/dir.html/foo.html and https://site.com/dir.html exist then the latter will be renamed to: https://site.com/dir~1.html.
-
Pages that redirect are saved as HTML pages with a <meta http-equiv=Refresh content=0;redirect-url> tag to achieve the same effect.
-
URLs with filenames too long for the filesystem, such as https://site.com/a-very-long-filename.html, are truncated, for example to: https://site.com/a-very-long-file~truncated.html.
-
Any page that doesn't return an HTTP OK or redirect, such as https://site.com/i-dont-exist.html, gets saved to a file with #http-error=xxx appended, for example https://site.com/i-dont-exist.html#http-error=404. Unlike the previous cases, URLs in the downloaded pages aren't adjusted to this new form, so internal URLs that returned an HTTP error when fetched will return "404 Not Found" in the downloaded site.
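
The renaming rules above amount to a URL-to-filename mapping, which can be sketched as follows. This only illustrates the behaviour described in this section; the function name and its handling of edge cases are assumptions, not solidify's actual code:

```python
from urllib.parse import urlsplit
import posixpath

def url_to_filename(url, content_type_ext):
    """Map a URL to a static filename following the rules listed above.
    A sketch of the described behaviour, not solidify's actual code.
    content_type_ext is the extension implied by the HTTP Content-Type,
    e.g. '.html'."""
    parts = urlsplit(url)
    path = parts.path
    if path.endswith('/'):
        # A trailing '/' fetches a file called 'index'.
        path += 'index'
    if parts.query:
        # Web servers strip the query, so fold it into the filename,
        # keeping the original extension at the end.
        root, ext = posixpath.splitext(path)
        path = f'{root}.%3f{parts.query}{ext}'
    if not path.endswith(content_type_ext):
        # Append an extension matching the Content-Type.
        path += content_type_ext
    return path.lstrip('/')

print(url_to_filename('https://some.dom/path/file.js?query', '.js'))
# prints: path/file.%3fquery.js
print(url_to_filename('https://some.dom/path/', '.html'))
# prints: path/index.html
```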

All downloaded text is re-encoded as UTF-8, and HTML pages have a <meta http-equiv=Content-Type content=text/html; charset=UTF-8> tag added to reflect that.
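
A minimal sketch of that re-encoding step, assuming the page's original charset is known from the HTTP response headers; the function and the way the tag is injected are illustrative, not solidify's code:

```python
import re

def reencode_html(raw_bytes, source_charset):
    """Decode a page using its original charset, then re-emit it as
    UTF-8 with a matching <meta> tag (a sketch, not solidify's code)."""
    text = raw_bytes.decode(source_charset, errors='replace')
    meta = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
    # Inject the meta tag just after <head> if present, else prepend it.
    text, n = re.subn(r'(?i)(<head[^>]*>)', r'\1' + meta, text, count=1)
    if n == 0:
        text = meta + text
    return text.encode('utf-8')

page = '<head><title>Caf\xe9</title></head>'.encode('latin-1')
print(reencode_html(page, 'latin-1').decode('utf-8'))
```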


AUTHOR

Russell Stuart, <russell+solidify@stuart.id.au>.