HTML Hyperlink Extraction System

From GM-RKB

Revision as of 02:42, 6 January 2023 by Gmelli (talk | contribs) (Text replacement - "__NOTOC__ " to "__NOTOC__ ")



An HTML Hyperlink Extraction System is an information extraction system that can extract HTML hyperlinks from an HTML item.

- …
Example(s):
- cat index.html | iconv -c -f utf-8 -t ascii | perl -pe 's/\r?\n/ /' | perl -ne 's/<font.*?>//gi; s/<span.*?>//gi; s/<td.*?>//gi; s/<img.*?>//gi; print;' | perl -ne 's/<a\b/<A/gi; s/<\/a\s*>/<\/A>/gi; s/^.*?(?=<A\b)//; s/^(.*<\/A>).*/$1/; s/<\/A>.*?<A\b/<\/A>\n<A/g; print "$_\n"'
Counter-Example(s):
- an HTML Document De-HTMLing System.
See: Web Crawler.

Retrieved from "http://www.gabormelli.com/RKB/index.php?title=HTML_Hyperlink_Extraction_System&oldid=780946"

Concept