**Table of contents:**

* [Overview](#overview)
* [Instance methods](#instance-methods)
* [Core filters](#core-filters)
* [Custom filters](#custom-filters)
  - [CleanHtmlFilter](#cleanhtmlfilter)
  - [EntriesFilter](#entriesfilter)

## Overview

Filters use the [HTML::Pipeline](https://github.com/jch/html-pipeline) library. They take an HTML string or [Nokogiri](http://nokogiri.org/) node as input, optionally modify and/or extract information from it, and then output the result. Together they form a pipeline where each filter hands its output to the next filter's input. Every documentation page passes through this pipeline before being copied to the local filesystem.

Filters are subclasses of the [`Docs::Filter`](https://github.com/freeCodeCamp/devdocs/blob/main/lib/docs/core/filter.rb) class and require a `call` method. A basic implementation looks like this:

```ruby
module Docs
  class CustomFilter < Filter
    def call
      doc
    end
  end
end
```

Filters that manipulate the Nokogiri node object (`doc` and related methods) are _HTML filters_ and must not manipulate the HTML string (`html`). Conversely, filters that manipulate the string representation of the document are _text filters_ and must not manipulate the Nokogiri node object. The two types are divided into two stacks within the scrapers. These stacks are then combined into a pipeline that calls the HTML filters before the text filters (more details [here](./scraper-reference.md#filter-stacks)). This avoids parsing the document multiple times.

The `call` method must return either `doc` or `html`, depending on the type of filter.

## Instance methods

* `doc` [Nokogiri::XML::Node]
  The Nokogiri representation of the container element.
  See [Nokogiri's API docs](http://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/XML/Node) for the list of available methods.

* `html` [String]
  The string representation of the container element.
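The pipeline behaviour described above can be sketched in plain Ruby, without HTML::Pipeline or Nokogiri. The class and method names below are illustrative stand-ins, not the real `Docs::Filter` API; real filters operate on `doc` or `html` rather than a constructor argument:

```ruby
# Illustrative sketch: each filter's #call output becomes the next
# filter's input. Plain strings stand in for the HTML/Nokogiri document.
class SketchFilter
  def initialize(input)
    @input = input
  end
end

class StripCommentsFilter < SketchFilter
  # A stand-in "HTML filter": removes HTML comments from the markup.
  def call
    @input.gsub(/<!--.*?-->/m, "")
  end
end

class SqueezeWhitespaceFilter < SketchFilter
  # A stand-in "text filter": collapses runs of whitespace in the string.
  def call
    @input.gsub(/\s+/, " ").strip
  end
end

# A pipeline is just a fold: the output of one filter feeds the next.
def run_pipeline(html, filters)
  filters.reduce(html) { |input, filter| filter.new(input).call }
end

page = "<p>Hello</p>  <!-- nav -->  <p>World</p>"
puts run_pipeline(page, [StripCommentsFilter, SqueezeWhitespaceFilter])
# => "<p>Hello</p> <p>World</p>"
```

Note the ordering: the comment-stripping ("HTML") filter runs before the whitespace ("text") filter, mirroring how the scrapers run the HTML filter stack before the text filter stack.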
* `context` [Hash] **(frozen)**
  The scraper's `options`, along with a few additional keys: `:base_url`, `:root_url`, `:root_page` and `:url`.

* `result` [Hash]
  Used to store the page's metadata and pass back information to the scraper.
  Possible keys:

  - `:path` — the page's normalized path
  - `:store_path` — the path where the page will be stored (equal to `:path` with `.html` appended)
  - `:internal_urls` — the list of distinct internal URLs found within the page
  - `:entries` — the [`Entry`](https://github.com/freeCodeCamp/devdocs/blob/main/lib/docs/core/models/entry.rb) objects to add to the index

* `css`, `at_css`, `xpath`, `at_xpath`
  Shortcuts for `doc.css`, `doc.xpath`, etc.

* `base_url`, `current_url`, `root_url` [Docs::URL]
  Shortcuts for `context[:base_url]`, `context[:url]`, and `context[:root_url]` respectively.

* `root_path` [String]
  Shortcut for `context[:root_path]`.

* `subpath` [String]
  The sub-path of the current URL relative to the base URL.
  _Example: if `base_url` equals `example.com/docs` and `current_url` equals `example.com/docs/file?raw`, the returned value is `/file`._

* `slug` [String]
  The `subpath` with any leading slash or `.html` extension removed.
  _Example: if `subpath` equals `/dir/file.html`, the returned value is `dir/file`._

* `root_page?` [Boolean]
  Returns `true` if the current page is the root page.

* `initial_page?` [Boolean]
  Returns `true` if the current page is the root page or if its subpath is one of the scraper's `initial_paths`.

## Core filters

* [`ContainerFilter`](https://github.com/freeCodeCamp/devdocs/blob/main/lib/docs/filters/core/container.rb) — changes the root node of the document (removes everything outside it)
* [`CleanHtmlFilter`](https://github.com/freeCodeCamp/devdocs/blob/main/lib/docs/filters/core/clean_html.rb) — removes HTML comments, `