Documentation

Overview

Package wikiparse is a library for understanding the Wikipedia XML dump format.

The dumps are available from the Wikimedia group here: http://dumps.wikimedia.org/

In particular, I've worked mostly with the enwiki dumps from here: http://dumps.wikimedia.org/enwiki/

See the example programs in the subpackages for an idea of how I've made use of these things.
Index
- Variables
- func FindFiles(text string) []string
- func FindLinks(text string) []string
- func URLForFile(name string) string
- type Contributor
- type Coord
- type IndexEntry
- type IndexReader
- type IndexSummaryReader
- type IndexedParseSource
- type Page
- type Parser
- type ReadSeekCloser
- type Redirect
- type Revision
- type SiteInfo
Constants
This section is empty.
Variables
var ErrNoCoordFound = errors.New("no coord data found")
ErrNoCoordFound is returned from ParseCoords when there's no coordinate data found.
Functions

func FindFiles

func FindFiles(text string) []string

FindFiles finds all the File references from within an article body. This includes references found in comments, as many of those I encountered were commented out.

func FindLinks

func FindLinks(text string) []string

FindLinks finds all the links from within an article body.

func URLForFile

func URLForFile(name string) string

URLForFile gets the wikimedia URL for the given named file.
Types

type Contributor

A Contributor is a user who contributed a revision.

type Coord
type Coord struct {
Lon, Lat float64
}
Coord is a longitude/latitude pair from a coordinate match.
func ParseCoords
ParseCoords parses geographical coordinates as specified in http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Geographical_coordinates
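The {{coord}} template linked above comes in several forms. Below is a minimal standalone sketch that handles only the decimal variant, e.g. {{coord|44.112|-87.913|display=title}}; a full parser must also handle degree/minute/second forms. The function name `parseDecimalCoords` and the regular expression are mine, and `errNoCoord` mirrors the package's ErrNoCoordFound:

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strconv"
)

// Coord is a longitude/latitude pair, as in the package.
type Coord struct {
	Lon, Lat float64
}

var errNoCoord = errors.New("no coord data found")

// Matches the decimal form {{coord|lat|lon|...}} only; DMS forms
// such as {{coord|57|18|22|N|...}} are not handled by this sketch.
var coordRe = regexp.MustCompile(`\{\{[Cc]oord\|(-?[0-9.]+)\|(-?[0-9.]+)[|}]`)

func parseDecimalCoords(text string) (Coord, error) {
	m := coordRe.FindStringSubmatch(text)
	if m == nil {
		return Coord{}, errNoCoord
	}
	lat, err := strconv.ParseFloat(m[1], 64)
	if err != nil {
		return Coord{}, err
	}
	lon, err := strconv.ParseFloat(m[2], 64)
	if err != nil {
		return Coord{}, err
	}
	return Coord{Lon: lon, Lat: lat}, nil
}

func main() {
	c, err := parseDecimalCoords("text {{coord|44.112|-87.913|display=title}} more")
	fmt.Println(c, err)
}
```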
type IndexEntry
An IndexEntry is an individual article from the index.
func (IndexEntry) String
func (i IndexEntry) String() string
type IndexReader
type IndexReader struct {
// contains filtered or unexported fields
}
An IndexReader is a wikipedia multistream index reader.
func NewIndexReader
func NewIndexReader(r io.Reader) *IndexReader
NewIndexReader gets a wikipedia index reader.
func (*IndexReader) Next
func (ir *IndexReader) Next() (IndexEntry, error)
Next gets the next entry from the index stream.
This assumes the numbers were meant to be incremental.
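Each line of a multistream index file has the form offset:pageID:title, where offset is the byte position of a bzip2 stream within the data file. A standalone sketch of parsing one such line follows; the `indexEntry` type and its field names are mine, and the package's IndexEntry may differ:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// indexEntry mirrors what one multistream index line carries.
type indexEntry struct {
	Offset int64  // byte offset of the bzip2 stream in the data file
	PageID uint64 // numeric page ID
	Title  string // article title
}

func parseIndexLine(line string) (indexEntry, error) {
	// Titles may themselves contain colons, so split at most twice.
	parts := strings.SplitN(line, ":", 3)
	if len(parts) != 3 {
		return indexEntry{}, fmt.Errorf("malformed index line: %q", line)
	}
	off, err := strconv.ParseInt(parts[0], 10, 64)
	if err != nil {
		return indexEntry{}, err
	}
	id, err := strconv.ParseUint(parts[1], 10, 64)
	if err != nil {
		return indexEntry{}, err
	}
	return indexEntry{Offset: off, PageID: id, Title: parts[2]}, nil
}

func main() {
	e, err := parseIndexLine("616:10:AccessibleComputing")
	fmt.Println(e, err)
}
```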
type IndexSummaryReader
type IndexSummaryReader struct {
// contains filtered or unexported fields
}
IndexSummaryReader gets offsets and counts from an index.
If you don't want to know the individual articles, just how many and where, this is for you.
func NewIndexSummaryReader
func NewIndexSummaryReader(r io.Reader) (rv *IndexSummaryReader, err error)
NewIndexSummaryReader gets a new IndexSummaryReader from the given stream of index lines.
type IndexedParseSource
type IndexedParseSource interface {
OpenIndex() (io.ReadCloser, error)
OpenData() (ReadSeekCloser, error)
}
An IndexedParseSource provides access to a multistream xml dump and its index.
This is typically downloaded as two files, but a seekable interface such as HTTP with range requests can also serve.
type Page
type Page struct {
Title string `xml:"title"`
ID uint64 `xml:"id"`
Redir Redirect `xml:"redirect"`
Revisions []Revision `xml:"revision"`
Ns uint64 `xml:"ns"`
}
A Page in the wiki.
type Parser
type Parser interface {
// Get the next page from the parser
Next() (*Page, error)
// Get the toplevel site info from the stream
SiteInfo() SiteInfo
}
A Parser emits wiki pages.
func NewIndexedParser
NewIndexedParser gets an indexed/parallel wikipedia dump parser from the given index and data files.
func NewIndexedParserFromSrc
func NewIndexedParserFromSrc(src IndexedParseSource, numWorkers int) (Parser, error)
NewIndexedParserFromSrc creates a Parser that can parse multiple pages concurrently from a single source.
type ReadSeekCloser
type ReadSeekCloser interface {
io.ReadSeeker
io.Closer
}
ReadSeekCloser is io.ReadSeeker + io.Closer.
type Redirect
type Redirect struct {
Title string `xml:"title,attr"`
}
A Redirect to another Page.
type Revision
type Revision struct {
ID uint64 `xml:"id"`
Timestamp string `xml:"timestamp"`
Contributor Contributor `xml:"contributor"`
Comment string `xml:"comment"`
Text string `xml:"text"`
}
A Revision to a page.
type SiteInfo
type SiteInfo struct {
SiteName string `xml:"sitename"`
Base string `xml:"base"`
Generator string `xml:"generator"`
Case string `xml:"case"`
Namespaces []struct {
Key string `xml:"key,attr"`
Case string `xml:"case,attr"`
Value string `xml:",chardata"`
} `xml:"namespaces>namespace"`
}
SiteInfo is the toplevel site info describing basic dump properties.
Source Files

Directories

| Path | Synopsis |
|---|---|
| tools | |
| tools/cbload (command) | Load a wikipedia dump into CouchBase |
| tools/couchload (command) | Load a wikipedia dump into CouchDB |
| tools/esload (command) | Load a wikipedia dump into ElasticSearch |
| tools/mgoload (command) | |
| tools/traverse (command) | Sample program that finds all the geo data in wikipedia pages. |