MediaWiki XML Snapshot File Parser
A MediaWiki XML Snapshot File Parser is an XML parser for a MediaWiki XML snapshot file.
- Context:
  - It can be a Text-focused MediaWiki XML Snapshot File Parser, such as gensim.corpora.WikiCorpus [1].
  - It can be a Raw Content-focused MediaWiki XML Snapshot File Parser, such as (Heaton, 2017).
- Example(s):
  - GMRKB.ReadMWDump;
  - gensim.corpora.WikiCorpus;
  - one based on xml.etree.ElementTree, such as [2];
  - …
- Counter-Example(s):
- See: MediaWiki Markup Parser.
References
2017
- (Heaton, 2017) ⇒ Jeff Heaton. (2017). “Reading Wikipedia XML Dumps with Python.” Blog post.
- QUOTE: … The code below shows you the beginning of this file. As you can see the file is made up of page tags that contain revision tags. … To read this file it is important that the XML is streamed and not read directly into memory as a DOM parser might do. The xml.etree.ElementTree class can be used to do this. The following imports are needed for this example. For the complete source code see the following GitHub link. ...
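The streaming approach described in the quote can be sketched as follows. This is a minimal illustration, not Heaton's code: it assumes the dump uses the usual `page`, `title`, `revision` and `text` elements, and the helper names `local_name` and `iter_pages`, the namespace URI, and the in-memory `SAMPLE` document are all made up for the example.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, text) per <page>, streaming instead of building a full tree.

    `source` is a filename or a binary file object. If a page holds several
    revisions, only the text of the last one is kept (a simplification).
    """
    title, text = None, None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            page = (title, text)
            title, text = None, None
            elem.clear()  # release the children of the finished page
            yield page


# Hypothetical two-page snapshot; the namespace URI is an assumption.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Alpha</title>
    <revision><text>First article body.</text></revision>
  </page>
  <page>
    <title>Beta</title>
    <revision><text>Second article body.</text></revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
for page_title, page_text in pages:
    print(page_title, "=>", page_text)
```

For a real snapshot, the file path (or an opened `bz2` file object) would be passed to `iter_pages` in place of the `BytesIO` wrapper. Emptied `page` elements still accumulate under the root, so a very large dump may need additional cleanup.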