Home >Operation and Maintenance >Safety >Use lexical analysis to extract domain names and IPs
Background
When analyzing the logs, I found that some log parameters contained other URLs, for example:
https://blog.csdn.net/breaksoftware/article/details/7009209. If you are interested, you can take a look. Facts have proved that following the master really improves your posture.
The original text is in C version, here I wrote a similar one in Python for your reference. Common URL classificationwww.baidu.com.
while (i < len(z) and z[i].isdigit()): i = i + 1 ip_v1 = True reti = i if i < len(z) and z[i] == '.': i = i + 1 reti = i else: tokenType = TK_OTHER reti = 1while (i < len(z) and z[i].isdigit()): i = i + 1 ip_v2 = True if i < len(z) and z[i] == '.': i = i + 1 else: if tokenType != TK_DOMAIN: tokenType = TK_OTHER reti = 1while (i < len(z) and z[i].isdigit()): i = i + 1 ip_v3 = True if i < len(z) and z[i] == '.': i = i + 1 else: if tokenType != TK_DOMAIN: tokenType = TK_OTHER reti = 1while (i < len(z) and z[i].isdigit()): i = i + 1 ip_v4 = True if i < len(z) and z[i] == ':': i = i + 1 while (i < len(z) and z[i].isdigit()): i = i + 1 if ip_v1 and ip_v2 and ip_v3 and ip_v4: self.urls.append(z[0:i]) return reti, tokenType else: if tokenType != TK_DOMAIN: tokenType = TK_OTHER reti = 1Mixed form extraction: such as 1234.com.
Scan the first half of 1234, which conforms to the characteristics of the IP form, but it is found that the code will report an exception, so the IP processing code segment needs to be added to determine whether the suffix is a top-level domain name:
https://github.com/skskevin/UrlDetect/blob/master/tool/domainExtract/domainExtract.py
The above is the detailed content of Use lexical analysis to extract domain names and IPs. For more information, please follow other related articles on the PHP Chinese website!