# File Scraper Reference
This lists the docs that use `FileScraper` and instructions for building some of them.
If you open a PR to update one of these docs, please add/fix the instructions.
## C
Download the HTML book from https://en.cppreference.com/w/Cppreference:Archives
and copy `reference/en/c` from the ZIP file into `/path/to/devdocs/docs/c`.
## C++
Download the HTML book from https://en.cppreference.com/w/Cppreference:Archives
and copy `reference/en/cpp` from the ZIP file into `/path/to/devdocs/docs/cpp`.
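The C and C++ steps above can be sketched as one helper, assuming the ZIP has already been downloaded and unpacked. The function name `copy_cppreference` and both of its arguments are hypothetical, not part of DevDocs:

```sh
# Hypothetical helper: copy the C and C++ trees out of an already-extracted
# cppreference HTML book into a devdocs checkout.
#   $1: directory where the ZIP was unpacked (assumption)
#   $2: path to the devdocs checkout (assumption)
copy_cppreference() {
  extracted=$1
  devdocs=$2
  for lang in c cpp; do
    src="$extracted/reference/en/$lang"
    dest="$devdocs/docs/$lang"
    if [ ! -d "$src" ]; then
      echo "missing $src" >&2
      return 1
    fi
    # Copy the contents of reference/en/<lang> into docs/<lang>
    mkdir -p "$dest" && cp -R "$src"/. "$dest"/
  done
}
```

It would be called as `copy_cppreference /path/to/unpacked/book /path/to/devdocs`, where the first path is wherever the archive was unpacked.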
## Dart
Click the “API docs” link under the “Stable channel” header on
https://www.dartlang.org/tools/sdk/archive. Rename the expanded ZIP to `dart~2`
and put it in `/path/to/devdocs/docs/`.
Or run the following commands in your terminal:
```sh
curl https://storage.googleapis.com/dart-archive/channels/stable/release/$RELEASE/api-docs/dartdocs-gen-api-zip > dartApi.zip; \
unzip dartApi.zip; mv gen-dartdocs docs/dart~$VERSION
```
## date-fns
```sh
git clone https://github.com/date-fns/date-fns docs/date_fns
cd docs/date_fns
git checkout v2.29.2
yarn install
node scripts/build/docs.js
ls tmp/docs.json
```
## Django
Go to https://docs.djangoproject.com/, select the version from the
bubble in the bottom-right corner, then download the HTML version from the sidebar.
```sh
mkdir --parent docs/django\~$VERSION/; \
curl https://media.djangoproject.com/docs/django-docs-$VERSION-en.zip | \
bsdtar --extract --file - --directory=docs/django\~$VERSION/
```
## Elisp
Go to https://www.gnu.org/software/emacs/manual/elisp.html, download the HTML tarball and extract its contents into `/path/to/devdocs/docs/elisp`, or run the following command:
```sh
mkdir /path/to/devdocs/docs/elisp \
&& curl https://www.gnu.org/software/emacs/manual/elisp.html_node.tar.gz | \
tar --extract --gzip --strip-components=1 --directory=/path/to/devdocs/docs/elisp
```
## Erlang
Go to https://www.erlang.org/downloads and download the HTML documentation file.
```sh
mkdir --parent docs/erlang\~$VERSION/; \
curl http://erlang.org/download/otp_doc_html_$RELEASE.tar.gz | \
bsdtar --extract --file - --directory=docs/erlang\~$VERSION/
```
## GNU
### Bash
Go to https://www.gnu.org/software/bash/manual/, download the HTML tar file (with one web page per node) and extract its content in `/path/to/devdocs/docs/bash` or run the following command:
```sh
mkdir /path/to/devdocs/docs/bash \
&& curl https://www.gnu.org/software/bash/manual/bash.html_node.tar.gz | \
tar --extract --gzip --directory=/path/to/devdocs/docs/bash
```
### GCC
Go to https://gcc.gnu.org/onlinedocs/ and download the HTML tarballs of the GCC manual and the GCC CPP manual, or run the following commands to download and extract them:
```sh
# GCC manual
mkdir docs/gcc~${VERSION}; \
curl https://gcc.gnu.org/onlinedocs/gcc-$RELEASE/gcc-html.tar.gz | \
tar --extract --gzip --strip-components=1 --directory=docs/gcc~${VERSION}
# GCC CPP manual
mkdir docs/gcc~${VERSION}_cpp; \
curl https://gcc.gnu.org/onlinedocs/gcc-$RELEASE/cpp-html.tar.gz | \
tar --extract --gzip --strip-components=1 --directory=docs/gcc~${VERSION}_cpp
```
### GNU Fortran
Go to https://gcc.gnu.org/onlinedocs/ and download the HTML tarball of the GNU Fortran manual, or run the following commands to download and extract it:
```sh
mkdir docs/gnu_fortran~$VERSION; \
curl https://gcc.gnu.org/onlinedocs/gcc-$RELEASE/gfortran-html.tar.gz | \
tar --extract --gzip --strip-components=1 --directory=docs/gnu_fortran~$VERSION
```
## GNU Make
Go to https://www.gnu.org/software/make/manual/, download the HTML tarball and extract its content in `/path/to/devdocs/docs/gnu_make` or run the following command:
```sh
mkdir /path/to/devdocs/docs/gnu_make \
&& curl https://www.gnu.org/software/make/manual/make.html_node.tar.gz | \
tar --extract --gzip --strip-components=1 --directory=/path/to/devdocs/docs/gnu_make
```
## Gnuplot
The most recent release can be found near the bottom of
https://sourceforge.net/p/gnuplot/gnuplot-main/ref/master/tags/
```sh
DEVDOCS_ROOT=/path/to/devdocs
mkdir gnuplot-src $DEVDOCS_ROOT/docs/gnuplot
git clone -b $RELEASE --depth 1 https://git.code.sf.net/p/gnuplot/gnuplot-main ./gnuplot-src
cd gnuplot-src/
./prepare
./configure
cd docs/
make nofigures.tex
latex2html -html 5.0,math -split 4 -link 8 -long_titles 5 -dir $DEVDOCS_ROOT/docs/gnuplot -ascii_mode -no_auto_link nofigures.tex
```
To install `latex2html` on macOS: `brew install basictex latex2html`, then edit
`/usr/local/Cellar/latex2html/2019.2/l2hconf.pm` to include the path to LaTeX:
<details>
On line 21 (approximately):
```
# Give the paths to latex and dvips on your system:
#
$LATEX = '/Library/TeX/texbin/latex'; # LaTeX
$PDFLATEX = '/Library/TeX/texbin/pdflatex'; # pdfLaTeX
$LUALATEX = '/Library/TeX/texbin/lualatex'; # LuaLaTeX
$DVILUALATEX = '/Library/TeX/texbin/dvilualatex'; # dviLuaLaTeX
$DVIPS = '/Library/TeX/texbin/dvips'; # dvips
$DVIPNG = ''; # dvipng
$PDFTOCAIRO = '/usr/local/bin/pdf2svg'; # pdf to svg converter
$PDFCROP = ''; # pdfcrop
$GS = '/usr/local/opt/ghostscript/bin/gs'; # GhostScript
```
</details>
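Before editing `l2hconf.pm`, it may help to confirm that the TeX tools actually live where the configuration above expects them. A small sketch, where the helper name `check_tex_bins` is hypothetical and the default directory is taken from the paths above:

```sh
# Hypothetical helper: report which TeX tools are missing from a directory.
#   $1: directory to check (defaults to the BasicTeX location used above)
check_tex_bins() {
  texbin=${1:-/Library/TeX/texbin}
  missing=0
  for tool in latex pdflatex lualatex dvilualatex dvips; do
    if [ ! -x "$texbin/$tool" ]; then
      echo "missing: $texbin/$tool"
      missing=1
    fi
  done
  # Exit status 0 when every tool is present and executable, 1 otherwise
  return "$missing"
}
```

Running `check_tex_bins` with no argument checks `/Library/TeX/texbin`; pass another directory if your TeX distribution is installed elsewhere.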
## NumPy
```sh
mkdir --parent docs/numpy~$VERSION/; \
curl https://numpy.org/doc/$VERSION/numpy-html.zip | \
bsdtar --extract --file=- --directory=docs/numpy~$VERSION/
```
## OCaml
Download the HTML reference manual linked from https://www.ocaml.org/docs/ (for example,
https://v2.ocaml.org/releases/4.14/ocaml-4.14-refman-html.tar.gz)
and extract it as `/path/to/devdocs/docs/ocaml`:
```sh
curl https://v2.ocaml.org/releases/$VERSION/ocaml-$VERSION-refman-html.tar.gz | \
tar xz --transform 's/htmlman/ocaml/' --directory docs/
```
## OpenJDK
Search for `openjdk` at https://www.debian.org/distrib/packages, find the `openjdk-$VERSION-doc` package,
download it, extract it with `dpkg -x $PACKAGE ./` and move `./usr/share/doc/openjdk-$VERSION-jre-headless/api/`
to `path/to/devdocs/docs/openjdk~$VERSION`. For example, for OpenJDK 19 on a system without `dpkg`:
```sh
# -O saves the package to a file instead of writing it to stdout
curl -O http://ftp.at.debian.org/debian/pool/main/o/openjdk-19/openjdk-19-doc_19+36-2_all.deb && \
# A .deb is an ar archive: bsdtar can unpack it, GNU tar cannot
bsdtar -xf openjdk-19-doc_19+36-2_all.deb
bsdtar -xf data.tar.xz
mv ./usr/share/doc/openjdk-19-jre-headless/api/ path/to/devdocs/docs/openjdk~$VERSION
```
If you use or have access to a Debian-based GNU/Linux distribution, you can run the following commands instead:
```sh
apt download openjdk-$VERSION-doc
dpkg -x $PACKAGE ./
# previous command makes a directory called 'usr' in the current directory
mv ./usr/share/doc/openjdk-$VERSION-jre-headless/api/ path/to/devdocs/docs/openjdk~$VERSION
```
## Pandas
```sh
curl https://pandas.pydata.org/docs/pandas.zip | bsdtar --extract --file - --directory=docs/pandas~1
```
## PHP
Click the link under the "Many HTML files" column on https://www.php.net/download-docs.php, extract the tarball, change its name to `php` and put it in `docs/`.
Or run the following commands in your terminal:
```sh
curl https://www.php.net/distributions/manual/php_manual_en.tar.gz > php.tar; \
tar -xf php.tar; mv php-chunked-xhtml/ docs/php/
```
## Python 3.6+
```sh
mkdir docs/python~$VERSION
cd docs/python~$VERSION
curl -L https://docs.python.org/$VERSION/archives/python-$RELEASE-docs-html.tar.bz2 | \
tar xj --strip-components=1
```
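The commands above do not define `$VERSION` and `$RELEASE`. A minimal sketch, assuming `$RELEASE` is a full `x.y.z` release number and `$VERSION` is its `x.y` prefix; the helper name `python_docs_url` is hypothetical:

```sh
# Hypothetical helper: build the archive URL used above from a full release
# number. Assumption: VERSION is RELEASE with its last component removed.
python_docs_url() {
  release=$1
  version=${release%.*}   # strip the trailing ".z", e.g. 3.10.4 -> 3.10
  echo "https://docs.python.org/$version/archives/python-$release-docs-html.tar.bz2"
}

python_docs_url 3.10.4
# prints https://docs.python.org/3.10/archives/python-3.10.4-docs-html.tar.bz2
```

If this assumption holds for the release you are scraping, the same `${RELEASE%.*}` expansion can set `VERSION` before running the commands above.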
## Python < 3.6
```sh
mkdir docs/python~$VERSION
cd docs/python~$VERSION
curl -L https://docs.python.org/ftp/python/doc/$RELEASE/python-$RELEASE-docs-html.tar.bz2 | \
tar xj --strip-components=1
```
## R
```bash
DEVDOCSROOT=/path/to/devdocs/docs/r
RLATEST=https://cran.r-project.org/src/base/R-latest.tar.gz # or /R-${VERSION::1}/R-$VERSION.tar.gz
RSOURCEDIR=${TMPDIR:-/tmp}/R/latest
RBUILDDIR=${TMPDIR:-/tmp}/R/build
mkdir -p "$RSOURCEDIR" "$RBUILDDIR" "$DEVDOCSROOT"
# Download, configure, and build with static HTML pages
curl "$RLATEST" | tar -C "$RSOURCEDIR" -xzf - --strip-components=1
(cd "$RBUILDDIR" && "$RSOURCEDIR/configure" --enable-prebuilt-html --with-recommended-packages --disable-byte-compiled-packages --disable-shared --disable-java)
make _R_HELP_LINKS_TO_TOPICS_=FALSE -C "$RBUILDDIR"
# Export all built HTML documentation, both global and per-package
cp -r "$RBUILDDIR/doc" "$DEVDOCSROOT/"
ls -d "$RBUILDDIR"/library/*/html | while read orig; do
dest="$DEVDOCSROOT${orig#$RBUILDDIR}"
mkdir -p "$dest" && cp -r "$orig"/* "$dest/"
done
```
## RDoc
### Nokogiri
### Ruby / Minitest
### Ruby on Rails
* Download a release from https://github.com/rails/rails/releases or clone https://github.com/rails/rails.git (check out the branch for the Rails version that is going to be scraped)
* Open "railties/lib/rails/api/task.rb" and comment out any code related to sdoc ("configure_sdoc")
* Run "bundle install --without db && bundle exec rake rdoc" (in the Rails directory)
* Run "cd guides && bundle exec rake guides:generate:html"
* Copy the "guides/output" directory to "html/guides"
* Copy the "html" directory to "docs/rails~[version]"
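The steps above could be collected into a script along these lines. This is only a sketch: `build_rails_docs` and its three arguments are hypothetical, and the step that edits `railties/lib/rails/api/task.rb` still has to be done by hand first:

```sh
# Hypothetical helper wrapping the Rails steps listed above.
#   $1: path to the Rails checkout, with the sdoc code already commented out
#   $2: path to the devdocs checkout
#   $3: version suffix for the target directory, i.e. docs/rails~<version>
build_rails_docs() {
  rails_dir=$1
  devdocs_dir=$2
  version=$3
  (
    cd "$rails_dir" || exit 1
    # Build the API docs into ./html
    bundle install --without db && bundle exec rake rdoc || exit 1
    # Build the guides into ./guides/output
    (cd guides && bundle exec rake guides:generate:html) || exit 1
    # Place the guides next to the API docs
    cp -R guides/output html/guides
  ) || return 1
  mkdir -p "$devdocs_dir/docs"
  cp -R "$rails_dir/html" "$devdocs_dir/docs/rails~$version"
}
```

The subshell keeps the `cd` calls from changing the caller's working directory, so relative paths passed as arguments keep working.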
### Ruby
Download the tarball of Ruby from https://www.ruby-lang.org/en/downloads/, extract it, run
`./configure && make html` in your terminal (while you are in the ruby directory) and move
`.ext/html` to `path/to/devdocs/docs/ruby~$VERSION/`.
Or run the following commands in your terminal:
```sh
curl https://cache.ruby-lang.org/pub/ruby/$VERSION/ruby-$RELEASE.tar.gz > ruby.tar; \
tar -xf ruby.tar; cd ruby-$RELEASE; ./configure && make html; mv .ext/html path/to/devdocs/docs/ruby~$VERSION
```
Generating the HTML files requires running `make`, but `make html` only builds the documentation and does not install Ruby on your system,
so there is no new Ruby installation to clean up or remove afterwards.
## Scala
See `lib/docs/scrapers/scala.rb`
## SQLite
Download the docs from https://sqlite.org/download.html, unzip it, and rename
it to `/path/to/devdocs/docs/sqlite`
```sh
curl https://sqlite.org/2022/sqlite-doc-3390200.zip | bsdtar --extract --file - --directory=docs/sqlite/ --strip-components=1
```