ScrapyFSharp


Opening the ScrapyFSharp.CssSelectorExtensions module enables CSS selectors.

open FSharp.Data
open ScrapyFSharp.CssSelectorExtensions

Practice 1: Search something on Google

We will parse the links from a Google search for FSharp.Data, as in the HTML Parser article.

let doc = HtmlDocument.Load "https://www.google.com/search?q=FSharp.Data"

To be sure we only get search results, we parse the links inside the div with id search. We could additionally check that the HTML structure is the one the parser expects by using the direct descendant (child) selector.

let links = 
    doc.CssSelect "div#search div#ires cite"
    |> List.map (
        fun n -> 
            match n.InnerText() with
            | t when (t.StartsWith("https://") || t.StartsWith("http://"))-> t
            | t -> "http://" + t
    )

A child selector such as "li.g > div.s" skips the 4 sub-results targeting GitHub pages.

["http://fsharp.github.io/FSharp.Data/"; "https://github.com/fsharp/FSharp.Data";
 "https://www.nuget.org/packages/FSharp.Data/";
 "https://msdn.microsoft.com/fr-fr/library/hh362324.aspx";
 "http://fsharp.org/guides/data-access/";
 "http://tomasp.net/blog/fsharp-data.aspx/";
 "http://stackoverflow.com/questions/.../scripts-dont-recognize-fsharp-data";
 "http://fslab.org/"]

Next, we may want each page title paired with its URL, which List.zip does as long as both lists have the same length.

let searchResults = 
    doc.CssSelect "div#search div.g > h3"
    |> List.map (fun n -> n.InnerText())
    |> List.zip links

[("http://fsharp.github.io/FSharp.Data/", "F# Data: Library for Data Access");
 ("https://github.com/fsharp/FSharp.Data", "fsharp/FSharp.Data · GitHub");
 ("https://www.nuget.org/packages/FSharp.Data/", "NuGet Gallery | F# Data 2.2.5");
 ("https://msdn.microsoft.com/fr-fr/library/hh362324.aspx",
  "Microsoft.FSharp.Data.TypeProviders, espace de noms (F#) - MSDN");
 ("http://fsharp.org/guides/data-access/",
  "Guide - Data Access | The F# Software Foundation");
 ("http://tomasp.net/blog/fsharp-data.aspx/",
  "F# Data: New type provider library - Tomas Petricek");
 ("http://stackoverflow.com/questions/.../scripts-dont-recognize-fsharp-data",
  "f# - scripts don't recognize FSharp.Data - Stack Overflow");
 ("http://fslab.org/", "FsLab - Data science and machine learning with F#")]
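
List.zip raises an ArgumentException when the two lists differ in length, which can happen here if a result has a title but no cite element (or the reverse). The sketch below uses hypothetical stand-in data, not real search output, to contrast List.zip with Seq.zip, which stops at the end of the shorter input.

```fsharp
// Hypothetical stand-ins for the scraped lists; not real search results.
let urls = [ "http://a.example/"; "http://b.example/"; "http://c.example/" ]
let titles = [ "Page A"; "Page B" ] // one title is missing

// List.zip throws when the lengths differ, so the failure is captured as a Result.
let strictResult =
    try
        Ok (List.zip urls titles)
    with :? System.ArgumentException as e ->
        Error e.Message

// Seq.zip stops at the end of the shorter sequence instead of throwing.
let lenientResult = Seq.zip urls titles |> List.ofSeq
```

Truncating hides the mismatch rather than fixing it: if the missing element sits in the middle of the page, every later pair is misaligned. Selecting each result container once and reading both the title and the URL from it avoids the problem, at the cost of a more specific selector.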

Practice 2: Search FSharp books on Youscribe

We will parse the links from a Youscribe search result for F#.

let doc2 = HtmlDocument.Load "http://en.youscribe.com/o-reilly-media/?quick_search=f%23"

We simply make sure to match the right links using their CSS classes and their position in the DOM hierarchy.

let books = 
    doc2.CssSelect "div.document-infos a.doc-explore-title"
    |> List.map(fun a -> a.InnerText().Trim(), a.AttributeValue "href")
    |> List.filter (fun (t, _) -> t.Contains "F#")

[("Building Web, Cloud, and Mobile Solutions with F#",
  "http://en.youscribe.com/catalogue/books/professional-resources/it-systems/building-web-cloud-and-mobile-solutions-with-f-2504344");
 ("Programming F# 3.0",
  "http://en.youscribe.com/catalogue/books/professional-resources/it-systems/programming-f-3-0-2504232")]

jQuery selectors

Attribute Contains Prefix Selector

Finds all links with an English hreflang attribute.

let englishLinks = 
    """<!doctype html>
        <html lang="en">
        <body>
        <a href="example.html" hreflang="en">Some text</a>
        <a href="example.html" hreflang="en-UK">Some other text</a>
        <a href="example.html" hreflang="english">will not be outlined</a>
        </body>
        </html>""" 
        |> HtmlDocument.Parse
        |> fun html -> html.CssSelect "a[hreflang|=en]"

This matches the first two links, whose hreflang is exactly "en" or starts with "en-"; the "english" link is not matched.

Attribute Contains Selector

Finds all inputs with a name containing "man".

let case1 = 
    """<!doctype html>
        <html lang="en">
        <body>
        <input name="man-news">
        <input name="milkman">
        <input name="letterman2">
        <input name="newmilk">
        </body>
        </html>""" 
        |> HtmlDocument.Parse
        |> fun html -> html.CssSelect "input[name*='man']"

[<input name="man-news" />; <input name="milkman" />;
 <input name="letterman2" />]

Attribute Contains Word Selector

Finds all inputs with a name containing the word "man".

let case2 = 
    """<!doctype html>
        <html lang="en">
        <body>
        <input name="man-news">
        <input name="milkman">
        <input name="milk man">
        <input name="letterman2">
        <input name="newmilk">
        </body>
        </html>""" 
        |> HtmlDocument.Parse
        |> fun html -> html.CssSelect "input[name~='man']"

[<input name="milk man" />]

Attribute Ends With Selector

Finds all inputs with a name ending with "man".

let case3 = 
    """<!doctype html>
        <html lang="en">
        <body>
        <input name="newsletter">
        <input name="milkman">
        <input name="jobletter">
        </body>
        </html>""" 
        |> HtmlDocument.Parse
        |> fun html -> html.CssSelect "input[name$='man']"

[<input name="milkman" />]

Attribute Equals Selector

Finds all inputs with a name equal to "man".

let case4 = 
    """<!doctype html>
        <html lang="en">
        <body>
        <input name="newsletter">
        <input name="milkman">
        <input name="man">
        <input name="jobletter">
        </body>
        </html>""" 
        |> HtmlDocument.Parse
        |> fun html -> html.CssSelect "input[name='man']"

[<input name="man" />]

Attribute Not Equal Selector

Finds all inputs with a name different from "man".

let case5 = 
    """<!doctype html>
        <html lang="en">
        <body>
        <input name="newsletter">
        <input name="milkman">
        <input name="man">
        <input name="jobletter">
        </body>
        </html>""" 
        |> HtmlDocument.Parse
        |> fun html -> html.CssSelect "input[name!='man']"

[<input name="newsletter" />; <input name="milkman" />;
 <input name="jobletter" />]

Attribute Starts With Selector

Finds all inputs with a name starting with "man".

let case6 = 
    """<!doctype html>
        <html lang="en">
        <body>
        <input name="newsletter">
        <input name="milkman">
        <input name="manual">
        <input name="jobletter">
        </body>
        </html>""" 
        |> HtmlDocument.Parse
        |> fun html -> html.CssSelect "input[name^='man']"

[<input name="manual" />]
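
The operators used in the cases above can be summarised as plain string predicates. The following is a minimal sketch of their usual jQuery/CSS meaning, not ScrapyFSharp's actual implementation; the function name attributeMatches is hypothetical, and elements that lack the attribute altogether are outside its scope.

```fsharp
open System

/// Returns true when `value` satisfies the attribute operator `op` for `expected`.
/// Hypothetical helper for illustration only.
let attributeMatches (op: string) (expected: string) (value: string) =
    match op with
    | "="  -> value = expected
    | "!=" -> value <> expected
    | "^=" -> value.StartsWith(expected, StringComparison.Ordinal)
    | "$=" -> value.EndsWith(expected, StringComparison.Ordinal)
    | "*=" -> value.Contains(expected)
    | "~=" ->
        // Whole-word match: the value is treated as a space-separated list.
        value.Split([| ' ' |], StringSplitOptions.RemoveEmptyEntries)
        |> Array.contains expected
    | "|=" ->
        // Prefix match: exactly the expected text, or the expected text followed by a hyphen.
        value = expected || value.StartsWith(expected + "-", StringComparison.Ordinal)
    | _ -> invalidArg "op" (sprintf "Unknown operator %s" op)

let names = [ "man-news"; "milkman"; "milk man"; "letterman2"; "newmilk" ]

// ["man-news"; "milkman"; "milk man"; "letterman2"]
let containsMan = names |> List.filter (attributeMatches "*=" "man")

// ["milk man"]
let wordMan = names |> List.filter (attributeMatches "~=" "man")
```

Seen this way, the only difference between "*=" and "~=" is whether the match may fall inside a longer token, which is why "milkman" satisfies the first and not the second.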

Forms helpers

There are some syntax shortcuts to find form controls.

let htmlForm = 
    """<!doctype html>
        <html>
        <body>
        <form>
          <fieldset>
            <input type="button" value="Input Button">
            <input type="checkbox" id="check1">
            <input type="hidden" id="hidden1">
            <input type="password" id="pass1">
            <input name="email" disabled="disabled">
            <input type="radio" id="radio1">
            <input type="checkbox" id="check2" checked="checked">
            <input type="file" id="uploader1">
            <input type="reset">
            <input type="submit">
            <input type="text">
            <select><option>Option</option></select>
            <textarea class="comment box1">Type a comment here</textarea>
            <button>Go !</button>
          </fieldset>
        </form>
        </body>
        </html>"""
    |> HtmlDocument.Parse

Find all buttons.

let buttons = htmlForm.CssSelect ":button"

[<button>Go !</button>; <input type="button" value="Input Button" />]

Find all checkboxes.

let checkboxes = htmlForm.CssSelect ":checkbox"

[<input type="checkbox" id="check1" />;
 <input type="checkbox" id="check2" checked="checked" />]

Find all checked checkboxes or radio buttons.

let ``checked`` = htmlForm.CssSelect ":checked"

[<input type="checkbox" id="check2" checked="checked" />]

Find all disabled controls.

let disabled = htmlForm.CssSelect ":disabled"

[<input name="email" disabled="disabled" />]

Find all inputs with type hidden.

let hidden = htmlForm.CssSelect ":hidden"

[<input type="hidden" id="hidden1" />]

Find all inputs with type radio.

let radio = htmlForm.CssSelect ":radio"

[<input type="radio" id="radio1" />]

Find all inputs with type password.

let password = htmlForm.CssSelect ":password"

[<input type="password" id="pass1" />]

Find all file upload inputs.

let file = htmlForm.CssSelect ":file"

[<input type="file" id="uploader1" />]
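
Going by the descriptions and results above, several of these shortcuts amount to a test on an input's type attribute. The sketch below makes that reading explicit; expandShortcut is a hypothetical helper, not part of ScrapyFSharp, and whether the library treats the two forms identically is an assumption.

```fsharp
/// Expands a form shortcut into an attribute selector on the input's type.
/// Hypothetical helper: only shortcuts that this section describes as
/// "inputs of a given type" are expanded; anything else yields None.
let expandShortcut (shortcut: string) =
    match shortcut with
    | ":checkbox" | ":hidden" | ":radio" | ":password" | ":file" ->
        // Drop the leading colon and use the rest as the type value.
        Some (sprintf "input[type=%s]" (shortcut.Substring 1))
    | _ -> None

let radioSelector = expandShortcut ":radio"   // Some "input[type=radio]"
let buttonSelector = expandShortcut ":button" // None
```

":button" is left out on purpose: in the example above it returns both the button element and the input of type button, so a single attribute test does not cover it.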

Implemented and missing features

Basic CSS selectors are implemented, but some jQuery selectors are missing.

This table lists all jQuery selectors and their status.

| Selector name | Status |
|---|---|
| All Selector | TODO |
| :animated Selector | not possible |
| Attribute Contains Prefix Selector | implemented |
| Attribute Contains Selector | implemented |
| Attribute Contains Word Selector | implemented |
| Attribute Ends With Selector | implemented |
| Attribute Equals Selector | implemented |
| Attribute Not Equal Selector | implemented |
| Attribute Starts With Selector | implemented |
| :button Selector | implemented |
| :checkbox Selector | implemented |
| :checked Selector | implemented |
| Child Selector (“parent > child”) | implemented |
| Class Selector (“.class”) | implemented |
| :contains() Selector | TODO |
| Descendant Selector (“ancestor descendant”) | implemented |
| :disabled Selector | implemented |
| Element Selector (“element”) | implemented |
| :empty Selector | implemented |
| :enabled Selector | implemented |
| :eq() Selector | TODO |
| :even Selector | implemented |
| :file Selector | implemented |
| :first-child Selector | TODO |
| :first-of-type Selector | TODO |
| :first Selector | TODO |
| :focus Selector | not possible |
| :gt() Selector | TODO |
| Has Attribute Selector [name] | implemented |
| :has() Selector | TODO |
| :header Selector | TODO |
| :hidden Selector | implemented |
| ID Selector (“#id”) | implemented |
| :image Selector | implemented |
| :input Selector | implemented |
| :lang() Selector | TODO |
| :last-child Selector | TODO |
| :last-of-type Selector | TODO |
| :last Selector | TODO |
| :lt() Selector | TODO |
| Multiple Attribute Selector [name="value"][name2="value2"] | implemented |
| Multiple Selector (“selector1, selector2, selectorN”) | TODO |
| Next Adjacent Selector (“prev + next”) | TODO |
| Next Siblings Selector (“prev ~ siblings”) | TODO |
| :not() Selector | TODO |
| :nth-child() Selector | TODO |
| :nth-last-child() Selector | TODO |
| :nth-last-of-type() Selector | TODO |
| :nth-of-type() Selector | TODO |
| :odd Selector | implemented |
| :only-child Selector | TODO |
| :only-of-type Selector | TODO |
| :parent Selector | TODO |
| :password Selector | implemented |
| :radio Selector | implemented |
| :reset Selector | not possible |
| :root Selector | useless[1] |
| :selected Selector | implemented |
| :submit Selector | implemented |
| :target Selector | not possible |
| :text Selector | implemented |
| :visible Selector | not possible |

[1] The :root selector seems useless in our case because, with the HTML parser, the root is always the html node.
