Using regular expressions in Go to extract text from web pages
Task: extract the two numbers at the beginning and end of the string 1000abcd123.
Example 1: Matching the string with two capture groups
    package main

    import (
        "fmt"
        "regexp"
    )

    var digitsRegexp = regexp.MustCompile(`(\d+)\D+(\d+)`)

    func main() {
        someString := "1000abcd123"
        fmt.Println(digitsRegexp.FindStringSubmatch(someString))
    }
The above code output:
[1000abcd123 1000 123]
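The printed slice holds the whole match at index 0, followed by one entry per capture group. A minimal sketch of unpacking it safely is shown here; the helper name splitNumbers is an assumption made for illustration and is not part of the original example. FindStringSubmatch returns nil when the pattern does not match, so the helper checks for that before indexing.

```go
package main

import (
	"fmt"
	"regexp"
)

// digitsRegexp is the same pattern as in Example 1.
var digitsRegexp = regexp.MustCompile(`(\d+)\D+(\d+)`)

// splitNumbers is a hypothetical helper: it returns the two captured
// numbers and reports whether the pattern matched at all.
func splitNumbers(s string) (first, second string, ok bool) {
	m := digitsRegexp.FindStringSubmatch(s)
	if m == nil {
		// No match: FindStringSubmatch returns a nil slice.
		return "", "", false
	}
	// m[0] is the whole match; m[1] and m[2] are the capture groups.
	return m[1], m[2], true
}

func main() {
	first, second, ok := splitNumbers("1000abcd123")
	fmt.Println(first, second, ok) // 1000 123 true

	_, _, ok = splitNumbers("no digits here")
	fmt.Println(ok) // false
}
```

Checking for nil first avoids an index-out-of-range panic on input that does not match.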
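Example 2: Using named capture groups
    package main

    import (
        "fmt"
        "regexp"
    )

    // Both dots are escaped so that they match a literal "." only;
    // an unescaped "." would match any character.
    var myExp = regexp.MustCompile(`(?P<first>\d+)\.(\d+)\.(?P<second>\d+)`)

    func main() {
        fmt.Printf("%+v", myExp.FindStringSubmatch("1234.5678.9"))
    }
The above code outputs the whole match followed by every submatch, named or unnamed: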
[1234.5678.9 1234 5678 9]
The named capturing group syntax (?P<name>...) used here comes from Python and is the form Go's regexp package accepts; Java and C# write named groups as (?<name>...) instead. Go 1.22 and later also accept the (?<name>...) form.
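A named group can also be read by name without scanning all group names. The sketch below relies on the SubexpIndex method of regexp.Regexp, which is available from Go 1.15 onward; the helper namedGroup and the variable name dotted are assumptions introduced for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// dotted matches three dot-separated numbers; the first and last are named.
var dotted = regexp.MustCompile(`(?P<first>\d+)\.(\d+)\.(?P<second>\d+)`)

// namedGroup is a hypothetical helper that returns the text captured by
// the group called name. It reports false when s does not match or when
// the pattern has no group with that name.
func namedGroup(re *regexp.Regexp, s, name string) (string, bool) {
	m := re.FindStringSubmatch(s)
	i := re.SubexpIndex(name) // -1 if there is no group with this name
	if m == nil || i < 0 {
		return "", false
	}
	return m[i], true
}

func main() {
	v, ok := namedGroup(dotted, "1234.5678.9", "second")
	fmt.Println(v, ok) // 9 true
}
```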
Example 3: Extend the regexp type with a method that returns all named captures as a map, then use it.
    package main

    import (
        "fmt"
        "regexp"
    )

    // embed regexp.Regexp in a new type so we can extend it
    type myRegexp struct {
        *regexp.Regexp
    }

    // add a new method to our new regular expression type
    func (r *myRegexp) FindStringSubmatchMap(s string) map[string]string {
        captures := make(map[string]string)
        match := r.FindStringSubmatch(s)
        if match == nil {
            return captures
        }
        for i, name := range r.SubexpNames() {
            // ignore the whole regexp match and unnamed groups
            if i == 0 || name == "" {
                continue
            }
            captures[name] = match[i]
        }
        return captures
    }

    // an example regular expression
    var myExp = myRegexp{regexp.MustCompile(`(?P<first>\d+)\.(\d+).(?P<second>\d+)`)}

    func main() {
        mmap := myExp.FindStringSubmatchMap("1234.5678.9")
        ww := mmap["first"]
        fmt.Println(mmap)
        fmt.Println(ww)
    }
The output of the above code:
map[first:1234 second:9]
1234
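Because the new type embeds *regexp.Regexp, every method of regexp.Regexp is promoted to it, so the extended type can be used wherever the plain methods are needed. The following sketch illustrates this under assumed names (namedRegexp, NamedCaptures, version) that do not appear in the original example.

```go
package main

import (
	"fmt"
	"regexp"
)

// namedRegexp mirrors the embedding technique of Example 3 under a
// different, hypothetical name.
type namedRegexp struct {
	*regexp.Regexp
}

// NamedCaptures returns the named groups of the first match in s.
// It returns an empty map when s does not match.
func (r *namedRegexp) NamedCaptures(s string) map[string]string {
	out := make(map[string]string)
	m := r.FindStringSubmatch(s)
	if m == nil {
		return out
	}
	for i, name := range r.SubexpNames() {
		if i != 0 && name != "" {
			out[name] = m[i]
		}
	}
	return out
}

// version is an illustrative pattern with two named groups.
var version = namedRegexp{regexp.MustCompile(`(?P<major>\d+)\.(?P<minor>\d+)`)}

func main() {
	// MatchString and NumSubexp are promoted from the embedded *regexp.Regexp.
	fmt.Println(version.MatchString("go1.22")) // true
	fmt.Println(version.NumSubexp())           // 2

	// NamedCaptures is the method added on top.
	c := version.NamedCaptures("go1.22")
	fmt.Println(c["major"], c["minor"]) // 1 22
}
```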
Example 4: Download a web page, capture the license-plate driving-restriction information it contains, and record each match in a map.
    package main

    import (
        "fmt"
        "io"
        "net/http"
        "os"
        "regexp"

        iconv "github.com/djimenez/iconv-go"
    )

    // embed regexp.Regexp in a new type so we can extend it
    type myRegexp struct {
        *regexp.Regexp
    }

    // add a new method to our new regular expression type;
    // it returns one map of named captures per match in s
    func (r *myRegexp) FindStringSubmatchMap(s string) []map[string]string {
        captures := make([]map[string]string, 0)
        matches := r.FindAllStringSubmatch(s, -1)
        if matches == nil {
            return captures
        }
        names := r.SubexpNames()
        for _, match := range matches {
            cmap := make(map[string]string)
            for pos, val := range match {
                name := names[pos]
                // skip the whole match and unnamed groups
                if name == "" {
                    continue
                }
                cmap[name] = val
            }
            captures = append(captures, cmap)
        }
        return captures
    }

    // Regular expression that captures the plate-number restriction notice.
    // It matches text of the form "from <year>/<month>/<day> to <year>/<month>/<day>,
    // Monday to Friday the restricted plate tail numbers are: a and b, c and d, ...".
    var myExp = myRegexp{regexp.MustCompile(`自(?P<byear>[\d]{4})年(?P<bmonth>[\d]{1,2})月(?P<bday>[\d]{1,2})日至(?P<eyear>[\d]{4})年(?P<emonth>[\d]{1,2})月(?P<eday>[\d]{1,2})日,星期一至星期五限行机动车车牌尾号分别为:(?P<n11>[\d])和(?P<n12>[\d])、(?P<n21>[\d])和(?P<n22>[\d])、(?P<n31>[\d])和(?P<n32>[\d])、(?P<n41>[\d])和(?P<n42>[\d])、(?P<n51>[\d])和(?P<n52>[\d])`)}

    func ErrorAndExit(err error) {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }

    func main() {
        response, err := http.Get("http://www.bjjtgl.gov.cn/zhuanti/10weihao/index.html")
        if err != nil {
            ErrorAndExit(err)
        }
        // defer Close only after the error check: response is nil when Get fails
        defer response.Body.Close()

        input, err := io.ReadAll(response.Body)
        if err != nil {
            ErrorAndExit(err)
        }
        // the page is GB2312-encoded; the UTF-8 output can be longer than
        // the input, so the buffer is allocated with extra room
        body := make([]byte, 2*len(input))
        iconv.Convert(input, body, "gb2312", "utf-8")
        mmap := myExp.FindStringSubmatchMap(string(body))
        fmt.Println(mmap)
    }
The above code output:
[map[n32:0 n22:9 emonth:7 n11:3 n41:1 n21:4 n52:7 bmonth:4 n51:2 bday:9 n42:6 byear:2012 eday:7 eyear:2012 n12:8 n31:5] map[emonth:10 n41:5 n52:6 n31:4 byear:2012 n51:1 eyear:2012 n32:9 bmonth:7 n22:8 bday:8 n11:2 eday:6 n42:0 n21:3 n12:7] map[bday:7 n51:5 n22:7 n31:3 eday:5 n32:8 byear:2012 bmonth:10 emonth:1 eyear:2013 n11:1 n12:6 n52:0 n21:2 n42:9 n41:4] map[eyear:2013 byear:2013 n22:6 eday:10 bmonth:1 n41:3 n32:7 n31:2 n21:1 n11:5 bday:6 n12:0 n51:4 n42:8 emonth:4 n52:9]]
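Example 4 needs network access and a third-party conversion library, which makes it awkward to try out. The sketch below shows the same technique (FindAllStringSubmatch combined with SubexpNames, producing one map per match) against an in-memory string. The pattern periodExp, the helper allNamedCaptures, and the sample text are all simplified assumptions made for illustration; they are not the pattern or the page content used above.

```go
package main

import (
	"fmt"
	"regexp"
)

// periodExp is a simplified, hypothetical pattern with two named groups.
var periodExp = regexp.MustCompile(
	`from (?P<begin>\d{4}-\d{2}-\d{2}) to (?P<end>\d{4}-\d{2}-\d{2})`)

// allNamedCaptures returns one map per match in s, keyed by group name.
// It returns nil when nothing matches.
func allNamedCaptures(re *regexp.Regexp, s string) []map[string]string {
	names := re.SubexpNames()
	var out []map[string]string
	for _, m := range re.FindAllStringSubmatch(s, -1) {
		row := make(map[string]string)
		for i, v := range m {
			// names[0] is always empty (whole match), as are unnamed groups.
			if names[i] != "" {
				row[names[i]] = v
			}
		}
		out = append(out, row)
	}
	return out
}

func main() {
	// Made-up sample text standing in for the downloaded page body.
	page := "from 2012-04-09 to 2012-07-07; from 2012-07-08 to 2012-10-06"
	for _, row := range allNamedCaptures(periodExp, page) {
		fmt.Println(row["begin"], row["end"])
	}
}
```

Looking values up by key, as in the loop above, keeps the result independent of the order in which a map happens to be printed.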