Documentation ¶
Index ¶
- func OpenGraphResolver(article *Article) string
- func ReadLinesOfFile(filename string) []string
- func RegSplit(text string, reg *regexp.Regexp) []string
- func WebPageResolver(article *Article) string
- type Article
- type Cleaner
- type Configuration
- type ContentExtractor
- type Crawler
- type Goose
- type Helper
- type Parser
- type StopWords
- type VideoExtractor
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func OpenGraphResolver ¶
func OpenGraphResolver(article *Article) string
OpenGraphResolver returns OpenGraph properties
func ReadLinesOfFile ¶
func ReadLinesOfFile(filename string) []string
ReadLinesOfFile returns the lines from a file as a slice of strings
func WebPageResolver ¶
func WebPageResolver(article *Article) string
WebPageResolver fetches the main image from the HTML page
Types ¶
type Article ¶
type Article struct {
Title string `json:"title,omitempty"`
CleanedText string `json:"content,omitempty"`
MetaDescription string `json:"description,omitempty"`
MetaLang string `json:"lang,omitempty"`
MetaFavicon string `json:"favicon,omitempty"`
MetaKeywords string `json:"keywords,omitempty"`
CanonicalLink string `json:"canonicalurl,omitempty"`
Domain string `json:"domain,omitempty"`
TopNode *goquery.Selection `json:"-"`
TopImage string `json:"image,omitempty"`
Tags *set.Set `json:"tags,omitempty"`
Movies *set.Set `json:"movies,omitempty"`
FinalURL string `json:"url,omitempty"`
LinkHash string `json:"linkhash,omitempty"`
RawHTML string `json:"rawhtml,omitempty"`
Doc *goquery.Document `json:"-"`
Links []string `json:"links,omitempty"`
PublishDate string `json:"publishdate,omitempty"`
AdditionalData map[string]string `json:"additionaldata,omitempty"`
Delta int64 `json:"delta,omitempty"`
}
Article is a collection of properties extracted from the HTML body
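The struct tags on Article determine how it serialises to JSON: fields tagged omitempty are dropped when empty, and fields tagged "-" (TopNode, Doc) are never emitted. The sketch below shows this with articleSketch, a hypothetical trimmed copy of a few fields, so that it runs with the standard library alone; the Internal field and the sample values are assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// articleSketch is a hypothetical, trimmed copy of a few Article fields,
// kept only to show how the struct tags shape the JSON output.
type articleSketch struct {
	Title       string   `json:"title,omitempty"`
	CleanedText string   `json:"content,omitempty"`
	FinalURL    string   `json:"url,omitempty"`
	RawHTML     string   `json:"rawhtml,omitempty"`
	Links       []string `json:"links,omitempty"`
	Internal    string   `json:"-"` // stands in for TopNode and Doc
}

// encode marshals the sketch and returns the JSON as a string.
func encode(a articleSketch) string {
	b, err := json.Marshal(a)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	a := articleSketch{
		Title:       "Hello",
		CleanedText: "Body text",
		FinalURL:    "https://example.com/post",
		Internal:    "never serialised",
	}
	// RawHTML and Links are empty, so omitempty drops them;
	// Internal is tagged "-" and is always skipped.
	fmt.Println(encode(a))
	// prints: {"title":"Hello","content":"Body text","url":"https://example.com/post"}
}
```

Note that the JSON key differs from the Go field name in several cases (CleanedText becomes content, FinalURL becomes url), so consumers of the JSON should rely on the tag names.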
type Cleaner ¶
type Cleaner struct {
// contains filtered or unexported fields
}
Cleaner removes menus, ads, sidebars, etc. and leaves the main content
func NewCleaner ¶
func NewCleaner(config Configuration) Cleaner
NewCleaner returns a new instance of a Cleaner
type Configuration ¶
type Configuration struct {
// contains filtered or unexported fields
}
Configuration is a wrapper for various config options
func GetDefaultConfiguration ¶
func GetDefaultConfiguration(args ...string) Configuration
GetDefaultConfiguration returns safe default configuration options
type ContentExtractor ¶
type ContentExtractor struct {
// contains filtered or unexported fields
}
ContentExtractor can parse the HTML and fetch various properties
func NewExtractor ¶
func NewExtractor(config Configuration) ContentExtractor
NewExtractor returns a configured HTML parser
type Crawler ¶
type Crawler struct {
RawHTML string
// contains filtered or unexported fields
}
Crawler can fetch the target HTML page
func NewCrawler ¶
func NewCrawler(config Configuration, url string, RawHTML string) Crawler
NewCrawler returns a crawler object initialised with the URL and, optionally, the raw HTML body
type Goose ¶
type Goose struct {
// contains filtered or unexported fields
}
Goose is the main entry point of the program
func (Goose) ExtractFromRawHTML ¶
ExtractFromRawHTML returns an article object from the raw HTML content
type Helper ¶
type Helper struct {
// contains filtered or unexported fields
}
Helper is a utility struct to clean up URLs and charsets
func NewRawHelper ¶
NewRawHelper converts the text to UTF-8
type Parser ¶
type Parser struct{}
Parser is an HTML parser specialised in extraction of main content and other properties
type StopWords ¶
type StopWords struct {
// contains filtered or unexported fields
}
StopWords implements a simple language detector
func NewStopwords ¶
func NewStopwords() StopWords
NewStopwords returns an instance of a stop words detector
func (StopWords) SimpleLanguageDetector ¶
SimpleLanguageDetector returns the language code for the text, based on its stop words
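The idea of detecting a language from its stop words can be sketched as follows: count how many words of the text appear in each language's stop-word list and return the language with the most hits. This is a minimal illustration under stated assumptions, not the package's implementation; detectLanguage, stopWordLists, the three tiny word lists, and the tie-breaking rule are all hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// stopWordLists is a tiny, hypothetical sample; a real detector would
// load a much larger list for each supported language.
var stopWordLists = map[string][]string{
	"en": {"the", "and", "is", "of", "to", "in"},
	"es": {"el", "la", "y", "es", "de", "en"},
	"fr": {"le", "la", "et", "est", "de", "dans"},
}

// detectLanguage returns the code of the language whose stop words occur
// most often in text, or "" when no stop word matches at all.
// Ties are broken in favour of the alphabetically smaller code so the
// result does not depend on map iteration order.
func detectLanguage(text string) string {
	words := strings.Fields(strings.ToLower(text))
	for i, w := range words {
		words[i] = strings.Trim(w, ".,;:!?\"'()")
	}

	best, bestCount := "", 0
	for lang, stops := range stopWordLists {
		set := make(map[string]bool, len(stops))
		for _, s := range stops {
			set[s] = true
		}
		count := 0
		for _, w := range words {
			if set[w] {
				count++
			}
		}
		if count > bestCount || (count == bestCount && count > 0 && lang < best) {
			best, bestCount = lang, count
		}
	}
	return best
}

func main() {
	fmt.Println(detectLanguage("The cat is in the garden and the dog is asleep."))
	fmt.Println(detectLanguage("El perro es de la casa y el gato es de la calle."))
}
```

Short texts contain few stop words, so a counting approach like this becomes unreliable on headlines or single sentences; it works best on the cleaned article body.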
type VideoExtractor ¶
type VideoExtractor struct {
// contains filtered or unexported fields
}
VideoExtractor can extract the main video from an HTML page
func NewVideoExtractor ¶
func NewVideoExtractor() VideoExtractor
NewVideoExtractor returns a new instance of an HTML video extractor
