HTML Hyperlink Extraction System

From GM-RKB
Revision as of 02:42, 6 January 2023 by Gmelli

An HTML Hyperlink Extraction System is an information extraction system that can extract HTML hyperlinks from an HTML item.

  • Example(s):
    • cat index.html | iconv -c -f utf-8 -t ascii | perl -ne 'chomp; print' | perl -ne 's/<font.*?>//gi; s/<span.*?>//gi; s/<td.*?>//gi; s/<img.*?>//gi; print ;' | perl -ne 's/<\/a.*?>/<\/A>/gi; s/(.*?)<a(.*) /<A$2/i; s/^(.*)<\/a(.*)/$1<\/A>/i; s/\/A>(.*?)<A/\/A>\n<A/g; print $_'
  • Counter-Example(s):
  • See: Web Crawler.
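
The extraction task defined above can also be sketched with an HTML parser rather than regular expressions. The following is a minimal illustration, not the method used by the Perl pipeline in the example: it assumes Python's standard-library `html.parser` and `urllib.parse` modules, and the names `LinkExtractor`, `extract_links`, and `base_url` are hypothetical, introduced only for this sketch.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects (href, anchor text) pairs from <a> elements (hypothetical helper)."""

    def __init__(self, base_url=""):
        super().__init__(convert_charrefs=True)
        self.base_url = base_url  # used to resolve relative hrefs
        self.links = []           # extracted (absolute_url, text) pairs
        self._href = None         # href of the anchor currently open, if any
        self._text = []           # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names, so <A HREF=...> matches.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                # An unclosed earlier anchor is discarded when a new one opens.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        # Text inside nested tags such as <font> or <span> is still collected.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the anchor text.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def extract_links(html, base_url=""):
    """Return a list of (url, anchor_text) pairs found in an HTML string."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Under these assumptions, calling `extract_links(open("index.html").read(), "http://example.com/")` would return one pair per hyperlink. Because the parser tracks tag boundaries itself, formatting tags do not need to be stripped beforehand.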