Documentation
Index
- type Geziyor
- func (g *Geziyor) Do(req *client.Request, callback func(g *Geziyor, r *client.Response))
- func (g *Geziyor) Get(url string, callback func(g *Geziyor, r *client.Response))
- func (g *Geziyor) GetRendered(url string, callback func(g *Geziyor, r *client.Response))
- func (g *Geziyor) Head(url string, callback func(g *Geziyor, r *client.Response))
- func (g *Geziyor) Post(url string, body io.Reader, callback func(g *Geziyor, r *client.Response))
- func (g *Geziyor) Start()
- type Options
Constants
This section is empty.
Variables
This section is empty.
Functions
This section is empty.
Types
type Geziyor
type Geziyor struct {
Opt *Options
Client *client.Client
Exports chan interface{}
// contains filtered or unexported fields
}
Geziyor is our main scraper type.
func NewGeziyor
NewGeziyor creates a new Geziyor with default values. If options are provided, they override the defaults.
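A minimal usage sketch based on the API above (the target URL and CSS selectors are illustrative, not part of this package):

```go
package main

import (
	"github.com/PuerkitoBio/goquery"
	"github.com/geziyor/geziyor"
	"github.com/geziyor/geziyor/client"
	"github.com/geziyor/geziyor/export"
)

func main() {
	// NewGeziyor takes an Options value; unset fields fall back to defaults.
	geziyor.NewGeziyor(&geziyor.Options{
		StartURLs: []string{"http://quotes.toscrape.com/"},
		ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
			// Push each extracted item into the Exports channel,
			// where the configured Exporters pick it up.
			r.HTMLDoc.Find("div.quote").Each(func(i int, s *goquery.Selection) {
				g.Exports <- map[string]interface{}{
					"text": s.Find("span.text").Text(),
				}
			})
		},
		Exporters: []export.Exporter{&export.JSON{}},
	}).Start()
}
```

Start() blocks until all requests spawned from StartURLs (and any follow-up requests made in callbacks) have completed.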
func (*Geziyor) GetRendered
GetRendered issues a GET request using a headless browser: it opens a new Chrome instance, makes the request, waits for the HTML DOM to render, and then closes the instance. Rendered requests are only supported for GET requests.
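A sketch of issuing a rendered request from StartRequestsFunc (the URL is illustrative; without BrowserEndpoint, a local Chrome instance is used):

```go
package main

import (
	"github.com/geziyor/geziyor"
	"github.com/geziyor/geziyor/client"
)

func main() {
	geziyor.NewGeziyor(&geziyor.Options{
		// Issue the initial request via the headless browser
		// instead of a plain HTTP GET.
		StartRequestsFunc: func(g *geziyor.Geziyor) {
			g.GetRendered("https://httpbin.org/anything", g.Opt.ParseFunc)
		},
		ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
			// r.Body holds the fully rendered HTML DOM.
			g.Exports <- string(r.Body)
		},
		// Optional: point at an existing Chrome runner,
		// e.g. BrowserEndpoint: "ws://localhost:3000",
	}).Start()
}
```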
type Options
type Options struct {
// AllowedDomains lists the domains that requests are allowed to target.
// If empty, any domain is allowed.
AllowedDomains []string
// Chrome headless browser WebSocket endpoint.
// If you run your own Chrome browser instance, provide its endpoint here.
// For example: ws://localhost:3000
BrowserEndpoint string
// Cache storage backends.
// - Memory
// - Disk
// - LevelDB
Cache cache.Cache
// Policies for caching.
// - Dummy policy (default)
// - RFC2616 policy
CachePolicy cache.Policy
// If true, response charset detection (for decoding to UTF-8) is disabled.
CharsetDetectDisabled bool
// Concurrent requests limit
ConcurrentRequests int
// Concurrent requests per domain limit. Uses request.URL.Host.
// Subdomains are treated as distinct from the top-level domain.
ConcurrentRequestsPerDomain int
// If true, cookies are not sent.
CookiesDisabled bool
// ErrorFunc is the callback for errors.
// If not set, all errors are logged.
ErrorFunc func(g *Geziyor, r *client.Request, err error)
// For extracting data
Exporters []export.Exporter
// Disable logging by setting this true
LogDisabled bool
// Max body reading size in bytes. Default: 1GB
MaxBodySize int64
// Maximum number of redirects to follow. Default: 10
MaxRedirect int
// Scraper metrics exporting type. See metrics.Type
MetricsType metrics.Type
// ParseFunc is the callback for responses to the StartURLs requests.
ParseFunc func(g *Geziyor, r *client.Response)
// If true, HTML parsing is disabled to improve performance.
ParseHTMLDisabled bool
// ProxyFunc sets the proxy for each request.
ProxyFunc func(*http.Request) (*url.URL, error)
// Rendered requests pre actions. Setting this will override the existing default.
// And you'll need to handle all rendered actions, like navigation, waiting, response etc.
// If you need to make custom actions in addition to the defaults, use Request.Actions instead of this.
PreActions []chromedp.Action
// Pass additional custom chromedp ExecAllocatorOptions.
// These are applied in addition to the DefaultExecAllocatorOptions.
AllocatorOptions []chromedp.ExecAllocatorOption
// Delay between requests.
RequestDelay time.Duration
// RequestDelayRandomize uses random interval between 0.5 * RequestDelay and 1.5 * RequestDelay
RequestDelayRandomize bool
// Called before requests are made, to manipulate them.
RequestMiddlewares []middleware.RequestProcessor
// Called after a response is received.
ResponseMiddlewares []middleware.ResponseProcessor
// RequestsPerSecond limits the number of requests made per second. Default: no limit.
RequestsPerSecond float64
// Which HTTP response codes to retry.
// Other errors (DNS lookup issues, connections lost, etc) are always retried.
// Default: []int{500, 502, 503, 504, 522, 524, 408}
RetryHTTPCodes []int
// Maximum number of times to retry, in addition to the first download.
// Set -1 to disable retrying
// Default: 2
RetryTimes int
// If true, disable robots.txt checks
RobotsTxtDisabled bool
// StartRequestsFunc called on scraper start
StartRequestsFunc func(g *Geziyor)
// The first requests are made (concurrently) to these URLs.
StartURLs []string
// Timeout is global request timeout
Timeout time.Duration
// Revisiting the same URLs is disabled by default; set true to enable it.
URLRevisitEnabled bool
// User Agent.
// Default: "Geziyor 1.0"
UserAgent string
}
Options is the custom options type for Geziyor.
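The throttling, caching, and retry fields above combine as in this sketch (field values and the target domain are illustrative):

```go
package main

import (
	"time"

	"github.com/geziyor/geziyor"
	"github.com/geziyor/geziyor/cache"
	"github.com/geziyor/geziyor/cache/diskcache"
	"github.com/geziyor/geziyor/client"
)

func main() {
	geziyor.NewGeziyor(&geziyor.Options{
		StartURLs:      []string{"https://example.com"},
		AllowedDomains: []string{"example.com"},

		// Throttling: at most 4 in-flight requests, with a randomized
		// delay between 0.5s and 1.5s before each one.
		ConcurrentRequests:    4,
		RequestDelay:          time.Second,
		RequestDelayRandomize: true,

		// Cache responses on disk using the RFC2616 policy.
		Cache:       diskcache.New(".cache"),
		CachePolicy: cache.RFC2616,

		RetryTimes: 3,
		UserAgent:  "my-scraper/0.1",

		ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
			g.Exports <- map[string]interface{}{"status": r.StatusCode}
		},
	}).Start()
}
```

Unset fields keep their documented defaults (e.g. MaxRedirect 10, RetryHTTPCodes as listed above).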
Directories

| Path | Synopsis |
|---|---|
| cache | Package cache provides a http.RoundTripper implementation that works as a mostly RFC-compliant cache for http responses. |
| diskcache | Package diskcache provides an implementation of cache.Cache that uses the diskv package to supplement an in-memory map with persistent storage. |
| leveldbcache | Package leveldbcache provides an implementation of cache.Cache that uses github.com/syndtr/goleveldb/leveldb. |