Wednesday, November 24, 2010

Parse Hyperlinks - Python

 While working on my previous post (downloading content from codingbat), I came across this one-liner for pulling the link URLs out of a line of HTML:


urls = re.findall(r'href=[\'"]p?([^\'" >]+)', line)

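r - marks the string as a raw string, so Python passes the backslashes through to the regex engine unchanged. The \' is still needed so that the single quote does not end the Python string; the regex engine reads \' as a plain single quote.
href=[\'"] - the match must start with the literal text href= (lowercase, since the pattern is case-sensitive unless re.IGNORECASE is passed), followed by exactly one opening quote, either single (') or double (").
p? - an optional letter p: the ? means zero or one occurrence. If the link starts with p, that p is consumed here and left out of the captured text.
([^\'" >]+) - the capture group: one or more characters that are not a single quote, double quote, space or greater-than symbol. The match stops at the first such character (normally the closing quote), and the text inside the parentheses is what findall returns for each link.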
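To see how the pieces fit together, here is a minimal sketch that runs the same pattern on a sample line. The HTML in sample_line and the helper name extract_links are made up for illustration; they are not from the codingbat script.

```python
import re

def extract_links(line):
    # Same pattern as above: href=, one opening quote, an optional p,
    # then capture everything up to the next quote, space or >.
    return re.findall(r'href=[\'"]p?([^\'" >]+)', line)

# Hypothetical sample input, assumed for this example only.
sample_line = '<a href="/prob/p187868">sleep_in</a> <a href=\'page.html\'>next</a>'

print(extract_links(sample_line))
# prints ['/prob/p187868', 'age.html']
```

Note the second result: because of p?, the leading p of page.html is swallowed and only age.html is captured. That is handy when you want to strip a known p prefix, but it is a surprise if you expected the full link.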
