SOLIDIFY
NAME
solidify - download a web site, turning it into static HTML
SYNOPSIS
solidify
[options]
out-dir
https://site.dom/path/
[https://site.dom/path/page...]
DESCRIPTION
Solidify
downloads all URLs that start with the base URL
(the second argument), and saves them in
out-dir.
The objective is to turn the site into static HTML
so it can be served by a web server
using just the files downloaded.
Intra-site URLs are made relative,
and pages and URLs may be altered so they can be served from static files.
If there are hidden URLs
(URLs that can't be found by following links from the base URL)
they can be supplied as additional URLs on the command line.
OPTIONS
- --abort=status[,...]
Abort immediately
if the server returns one of the numeric HTTP status codes in
the comma-separated list.
Normally all pages are saved and scanned for URLs.
- --header=H:V
Add an HTTP header
H: V
to each page fetch request.
Adding a
Cookie
header can be used to get past a login page,
and adding a
User-Agent
header can make some picky sites accept requests.
- --progress
Print progress messages to stdout.
- --replace=filename
Before scanning and saving HTML pages,
update them by running the search-and-replaces in
filename.
See
MODIFYING THE DOWNLOADED PAGES
below.
- --skip=status[,...]
Don't save or scan a page for URLs
if the server returns one of the numeric HTTP status codes in
the comma-separated list.
Normally all pages are saved and scanned.
- --threads=count
Download up to
count
pages concurrently.
The default is 5, unless
--wait
is also present, in which case the count must be 1.
- --wait=S[:T]
Wait
S
seconds between each page fetch.
If
T
is supplied, wait a random amount of time between
S
and
T
seconds.
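The delay behaviour described for --wait can be sketched in Python. This is only an illustration of the behaviour documented above, not solidify's actual code; the function names are hypothetical:

```python
import random
import time

def fetch_delay(s, t=None):
    """Return the delay before the next fetch: s seconds, or a random
    value between s and t seconds when t is given (--wait=S[:T])."""
    return s if t is None else random.uniform(s, t)

def wait_between_fetches(s, t=None):
    # Sleep between page fetches.  With --wait, --threads must be 1,
    # so these delays happen strictly one after another.
    time.sleep(fetch_delay(s, t))
```

The randomized variant makes the fetch pattern look less mechanical, which matters on sites that rate-limit or block regular scrapers.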
RESTARTING
Solidify
goes through multiple phases, looking at the entire site during each phase.
The first phase, which is the longest by far,
is downloading all pages in the site to a temporary area.
If interrupted, this first phase can be restarted by rerunning the program
with the same parameters.
Once the first phase is complete the run cannot be restarted.
MODIFYING THE DOWNLOADED PAGES
HTML pages can be modified before they are scanned for URLs and saved,
using the
--replace=
option.
Since this happens before
solidify
processes the page, it can be used to change what URLs
solidify
sees and then downloads.
The search and replace is done using Python's
re.sub(search, replace, HTML)
function.
This means the search text is a Python regular expression,
and the replacement can include text found by the search.
If you aren't familiar with this function
you will need to consult the Python
re
module's documentation before using this option.
Each search in
filename
is introduced by a line containing the exact string:
.....Search-for:
That string may be preceded by blank lines
and comment lines starting with a
#
sign.
The lines following contain the Python regular expression to search for.
Regular expressions can be made more readable
(for example, they can have comments)
by starting them with
(?x)
which turns on re.VERBOSE mode.
If present, the replace string
must be immediately preceded by a line containing exactly:
.....Replace-with:
and followed by the text to replace the matched text with.
If no replace string is present, the text found by the search is deleted.
Example:
#
# Don't include revision diff's.
#
.....Search-for:
(?sx)
<input\stype="radio"\sname="rev[^>]*>
<input\stype="radio"[^>]*>
<a\shref="[^"]*[?]action=diff&[^>]*>[^<]*</a>
.....Replace-with:
<font color=grey>[diff elided]</font>
#
# Don't include the login page.
#
.....Search-for:
(?sx)
<a\s[^>]*href="[^"]*[?]action=login\b[^>]*>Login</a>
.....Replace-with:
<font color=grey>[Login elided]</font>
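Each rule in the file amounts to a call of Python's re.sub. The snippet below applies a simplified version of the "Login elided" rule above to a made-up fragment of HTML (the sample markup is hypothetical, for illustration only):

```python
import re

# (?sx): DOTALL so '.' matches newlines, VERBOSE so whitespace and
# comments inside the pattern are ignored.
search = r'(?sx) <a\s[^>]*href="[^"]*[?]action=login\b[^>]*>Login</a>'
replace = '<font color=grey>[Login elided]</font>'

html = '<p><a class="nav" href="/wiki?action=login">Login</a></p>'
print(re.sub(search, replace, html))
# -> <p><font color=grey>[Login elided]</font></p>
```

Because the replacement goes through re.sub, backslashes and group references like \1 in the Replace-with text have their usual re.sub meaning.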
MESSY DETAILS AND LIMITATIONS
Solidify
works by downloading the first supplied URL.
It discovers additional URLs to download by scanning HTML and CSS
for URLs within the same site.
It does not scan other file types (for example, JavaScript or JSON).
The HTML and CSS downloaded aren't parsed,
just quickly scanned for strings that look like URLs.
Solidify
tries hard not to miss URLs,
but that comes at the expense of mistaking random strings for URLs.
This can generate unexpected 404s (HTTP "page not found" errors).
404 errors are mostly harmless.
If they really bother you, the
#http-error=404
suffix appended to the filename makes it easy to find and
delete the resulting downloads.
JavaScript is not run, so dynamic content the browser fetches with AJAX
will not be seen.
All that means the translation process is not an exact science;
it's more of a best effort.
Usually
solidify
preserves URLs and page contents,
but in some circumstances URLs must be changed
so a static site will behave similarly to the dynamic one,
regardless of what base URL it is served from:
- All intra-site URLs are made relative, so the site can be served
from any base URI.
- As web servers strip the query component, queries are converted into
path components. For example,
https://some.dom/path/file.js?query
is turned into
https://some.dom/path/file.%3fquery.js.
- If the file extension does not match the Content-Type,
a new file extension that does match will be appended.
For example,
https://some.dom/path/file.php?foo=bar&x=y
might be turned into:
https://some.dom/path/file.%3ffoo=bar&x=y.php.html
- A URL ending with a '/' is changed so it fetches a file called 'index'
with an extension matching the HTTP Content-Type.
For example,
https://some.dom/path/
might be turned into:
https://some.dom/path/index.html.
- If a URL is both a page and has pages underneath it
(that is, it is also a directory), then the page will be renamed.
For example, if both
https://site.com/dir.html/foo.html
and
https://site.com/dir.html
exist, then the latter will be renamed to:
https://site.com/dir~1.html.
- Pages that redirect are saved as HTML pages with a
<meta http-equiv=Refresh content="0; url=redirect-url">
to achieve the same effect.
- Filenames too long for the filesystem are truncated.
For example,
https://site.com/a-very-long-filename.html
might be turned into:
https://site.com/a-very-long-file~truncated.html.
- Any page that doesn't return an HTTP OK or redirect status,
for example
https://site.com/i-dont-exist.html,
gets saved to a file with
#http-error=xxx
appended, for example
https://site.com/i-dont-exist.html#http-error=404.
Unlike the previous cases, URLs in the downloaded pages
aren't adjusted to this new form,
so internal URLs that returned an HTTP error
when fetched
will return "404 Not Found" in the downloaded site.
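The renaming rules above can be approximated in a few lines of Python. This is a rough sketch of the mapping as documented, using a hypothetical helper name; it is not solidify's actual implementation, and it ignores the directory-clash, truncation and http-error cases:

```python
from urllib.parse import urlsplit

def static_name(url, content_ext=".html"):
    """Map a dynamic URL to a filename a static web server can serve.
    content_ext is the extension implied by the HTTP Content-Type."""
    parts = urlsplit(url)
    path = parts.path
    if path.endswith("/"):                 # trailing '/' -> index file
        path += "index" + content_ext
    if parts.query:                        # query string -> path component
        base, dot, ext = path.rpartition(".")
        if dot:
            path = base + ".%3f" + parts.query + "." + ext
        else:
            path += ".%3f" + parts.query
    if not path.endswith(content_ext):     # extension must match Content-Type
        path += content_ext
    return path.lstrip("/")

print(static_name("https://some.dom/path/file.js?query", ".js"))
# -> path/file.%3fquery.js
print(static_name("https://some.dom/path/file.php?foo=bar&x=y"))
# -> path/file.%3ffoo=bar&x=y.php.html
print(static_name("https://some.dom/path/"))
# -> path/index.html
```

The three printed results match the worked examples in the list above.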
All downloaded text is re-encoded as UTF-8,
and HTML pages have a
<meta http-equiv=Content-Type content="text/html; charset=UTF-8">
added to reflect that.
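The re-encoding step amounts to something like the following sketch. The source charset here is hard-coded for illustration; how solidify actually detects it is not described in this page:

```python
# A page fetched as Latin-1 (ISO-8859-1) bytes; 0xe9 is 'é'.
raw = b"<html><head></head><body>caf\xe9</body></html>"

# Decode with the charset the server declared, re-encode as UTF-8,
# and record the new charset in a <meta> tag.
text = raw.decode("iso-8859-1")
meta = '<meta http-equiv=Content-Type content="text/html; charset=UTF-8">'
text = text.replace("<head>", "<head>" + meta, 1)
utf8 = text.encode("utf-8")
print(utf8.decode("utf-8"))
```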
AUTHOR
Russell Stuart, <russell+solidify@stuart.id.au>.