
[SOLVED] Improving PCRE knowledge, parsing HOSTS file for hostnames.



This is now considered solved. Thanks to jchd & UEZ as well as Spiff59 who participated in providing additional examples.

UDF List:

 
_AdapterConnections()_AlwaysRun()_AppMon()_AppMonEx()_ArrayFilter/_ArrayReduce_BinaryBin()_CheckMsgBox()_CmdLineRaw()_ContextMenu()_ConvertLHWebColor()/_ConvertSHWebColor()_DesktopDimensions()_DisplayPassword()_DotNet_Load()/_DotNet_Unload()_Fibonacci()_FileCompare()_FileCompareContents()_FileNameByHandle()_FilePrefix/SRE()_FindInFile()_GetBackgroundColor()/_SetBackgroundColor()_GetConrolID()_GetCtrlClass()_GetDirectoryFormat()_GetDriveMediaType()_GetFilename()/_GetFilenameExt()_GetHardwareID()_GetIP()_GetIP_Country()_GetOSLanguage()_GetSavedSource()_GetStringSize()_GetSystemPaths()_GetURLImage()_GIFImage()_GoogleWeather()_GUICtrlCreateGroup()_GUICtrlListBox_CreateArray()_GUICtrlListView_CreateArray()_GUICtrlListView_SaveCSV()_GUICtrlListView_SaveHTML()_GUICtrlListView_SaveTxt()_GUICtrlListView_SaveXML()_GUICtrlMenu_Recent()_GUICtrlMenu_SetItemImage()_GUICtrlTreeView_CreateArray()_GUIDisable()_GUIImageList_SetIconFromHandle()_GUIRegisterMsg()_GUISetIcon()_Icon_Clear()/_Icon_Set()_IdleTime()_InetGet()_InetGetGUI()_InetGetProgress()_IPDetails()_IsFileOlder()_IsGUID()_IsHex()_IsPalindrome()_IsRegKey()_IsStringRegExp()_IsSystemDrive()_IsUPX()_IsValidType()_IsWebColor()_Language()_Log()_MicrosoftInternetConnectivity()_MSDNDataType()_PathFull/GetRelative/Split()_PathSplitEx()_PrintFromArray()_ProgressSetMarquee()_ReDim()_RockPaperScissors()/_RockPaperScissorsLizardSpock()_ScrollingCredits_SelfDelete()_SelfRename()_SelfUpdate()_SendTo()_ShellAll()_ShellFile()_ShellFolder()_SingletonHWID()_SingletonPID()_Startup()_StringCompact()_StringIsValid()_StringRegExpMetaCharacters()_StringReplaceWholeWord()_StringStripChars()_Temperature()_TrialPeriod()_UKToUSDate()/_USToUKDate()_WinAPI_Create_CTL_CODE()_WinAPI_CreateGUID()_WMIDateStringToDate()/_DateToWMIDateString()Au3 script parsingAutoIt SearchAutoIt3 PortableAutoIt3WrapperToPragmaAutoItWinGetTitle()/AutoItWinSetTitle()CodingDirToHTML5FileInstallrFileReadLastChars()GeoIP databaseGUI - Only Close ButtonGUI 
ExamplesGUICtrlDeleteImage()GUICtrlGetBkColor()GUICtrlGetStyle()GUIEventsGUIGetBkColor()Int_Parse() & Int_TryParse()IsISBN()LockFile()Mapping CtrlIDsOOP in AutoItParseHeadersToSciTE()PasswordValidPasteBinPosts Per DayPreExpandProtect GlobalsQueue()Resource UpdateResourcesExSciTE JumpSettings INISHELLHOOKShunting-YardSignature CreatorStack()Stopwatch()StringAddLF()/StringStripLF()StringEOLToCRLF()VSCROLLWM_COPYDATAMore Examples...

Updated: 22/04/2018


Hey,

It might be solved, but since the point of the thread is to improve your knowledge of regular expressions, I thought I would share anyway.

Here is my take on it. Have a read; I explained most of it in the comments, though I admit it could have been clearer.

#include <Array.au3>


#cs
    Tested hosts files:
    1. http://winhelp2002.mvps.org/hosts.txt
    2. http://support.it-mate.co.uk/downloads/HOSTS.txt
    3. http://remember.mine.nu/Hosts
    4. http://www.autoitscript.com/forum/topic/140724-solved-improving-pcre-knowledge-parsing-hosts-file-for-hostnames/page__view__findpost__p__989981
    5. local hosts file

    Timers samples:        1 (14446)          2 (183763)         3 (70595)          4 (15)
    Example:              75.9701664916754 | 881.994120293985 | 340.773511324434 | 0.254597270136493
    Example_guinness:     159.703579821009 | 1934.34539773011 | 733.858217089146 | 0.265321733913304
    Example_jchd:                   failed |           failed |           failed | 0.241597920103995
#ce


Local $sData = FileRead(@SystemDir & '\drivers\etc\HOSTS') ; 'HOSTS.txt'

$timer = TimerInit()
$aArray = Example($sData)
ConsoleWrite(StringFormat("%-20s %-20s \t%d\n", 'Example', TimerDiff($timer), UBound($aArray, 1)))
;~ _ArrayDisplay($aArray, "Example")

$timer = TimerInit()
$aArray = Example_guinness($sData)
ConsoleWrite(StringFormat("%-20s %-20s \t%d\n", 'Example_guinness', TimerDiff($timer), UBound($aArray, 1)))
;~ _ArrayDisplay($aArray, "Example_guinness")

$timer = TimerInit()
$aArray = Example_jchd($sData) ; please note that this one is not working for some hosts files
ConsoleWrite(StringFormat("%-20s %-20s \t%d\n", 'Example_jchd', TimerDiff($timer), UBound($aArray, 1))) ; last to avoid breaking the script if it fails
;~ _ArrayDisplay($aArray, "Example_jchd")


#cs
    - Combined the replace patterns and added stripping of IPv6 entries (::1 ...).
    - Removed an unnecessary group and unnecessary escape backslashes.
    Most meta characters do not retain their special meaning inside a character class (some get a different meaning).
    The exception here is the - (dash), which expresses a character range, but if we list it last there is no need to escape it.
#ce
Func Example_guinness($sData)
    $sData = StringRegExpReplace($sData, '#.*|(?m)^\h*(::\d.*|[\d.]{7,15})', '')
    Return StringRegExp($sData, '[\w/.-]{3,}', 3)
EndFunc   ;==>Example_guinness

#cs
    jchd had the right idea: since we cannot repeatedly capture a group, even with global matching,
    we have to make the whole pattern match repeatedly, which is done by skipping the irrelevant parts before each match.
    This is where the \G sequence comes into play. On the first match it anchors at the beginning of the string,
    just like ^, but from the second match on it anchors at the end of the previous match.
    If we were to use ^ instead of \G it would only match the first entry on each line.
    We also notice that this pattern is too restrictive when parsing some hosts files; for example, it does not match
    any hostnames starting with digits.
#ce
Func Example_jchd($sData)
;~   Return StringRegExp($sData, "(?im)\G(?:(?:\s*#.*$\s*)*|(?:\s*)*)*(?:|(?:\d{1,3}\.){3}\d{1,3}\s+)((?:(?:\d{1,3}\.){3}\d{1,3}\.)?[[:alpha:]][\w.\-/]{2,})\s*", 3) ; jchd's pattern (original)
    Return StringRegExp($sData, "(?im)\G(?:\s*(?:#.*$\s*)*)*(?:(?:\d{1,3}\.){3}\d{1,3}\s+)?((?:(?:\d{1,3}\.){4})?[a-z][\w/.-]{2,})\s*", 3) ; slightly modified version of jchd's pattern
EndFunc   ;==>Example_jchd

#cs
    We can write this (jchd's) pattern a bit cleaner and more efficient if we drop the extraneous parts and simplify the rest.
    We start with the \G anchor, then optionally match, but do not capture, parts introduced by a pound sign or colon (comments/IPv6),
    followed by an optional run of 7-15 digits and dots (an IPv4 address, most likely), and then we capture the next word,
    which may contain slashes, dots and dashes.
    Looks pretty simple now, right?
#ce
Func Example($sData)
    Return StringRegExp($sData, "(?im)\G(?:\h*[#:].*$\s*)*(?:[\d.]{7,15}\h+)?([\w/.-]{2,})\s*", 3) ; simplified pattern
EndFunc   ;==>Example
Edited by Robjong
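
As a small illustration of the \G versus ^ point made in the comments above, here is a minimal sketch. The sample line is made up (three placeholder names on one line, in the multi-host layout some hosts files use) and is not taken from any of the tested files; the pattern is the simplified one from Example().

```autoit
; Made-up sample: one address followed by several names on the same line.
Local $sSample = "127.0.0.1 a.example b.example c.example"

; \G anchors each match at the end of the previous one, so every name is captured.
Local $aWithG = StringRegExp($sSample, "(?im)\G(?:\h*[#:].*$\s*)*(?:[\d.]{7,15}\h+)?([\w/.-]{2,})\s*", 3)

; ^ can only match at the start of a line, so only the first name is captured.
Local $aWithCaret = StringRegExp($sSample, "(?im)^(?:\h*[#:].*$\s*)*(?:[\d.]{7,15}\h+)?([\w/.-]{2,})\s*", 3)

ConsoleWrite("\G: " & UBound($aWithG) & " name(s)" & @CRLF)     ; 3
ConsoleWrite("^ : " & UBound($aWithCaret) & " name(s)" & @CRLF) ; 1
```

The only difference between the two calls is the anchor, which makes it easy to swap in a line from your own hosts file and compare.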

You couldn't know, but there has been a bit of talk in PMs about this.

I agree that the RE could be simpler, and along the way I made a couple of dumb mistakes (like not allowing hosts with leading digits ;) ). I also explicitly excluded IPv6 forms (that can easily be fixed once the rest works). I kept guinness' subpatterns as they were and added extras accordingly, since they perform a useful (?) format validation (albeit elementary), so I didn't look at simplifying this part further.

The actual format guinness uses looks partly like this:

127.0.0.1 08.185.87.4.liveadvert.com 08.185.87.40.liveadvert.com 08.185.87.41.liveadvert.com

and I even checked the validity of the host's IP part which, as your simplification shows, can be matched as part of the hostname in a single go without any attempt at breaking it down into pieces.

Edit: is line 11092 of example 2 valid (hostname ending with dot)?

127.0.0.1 albatross.cz.

Edited by jchd
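
To make the two entry shapes discussed here concrete, a minimal sketch follows. The sample line simply glues together two of the entries quoted above, the pattern is the simplified one from Example(), and stripping the trailing dot is only an assumption about what one might want to do with such an entry, not something the thread settles on.

```autoit
; Sample line built from two entries quoted in the thread.
Local $sLine = "127.0.0.1 08.185.87.4.liveadvert.com albatross.cz."

; The IP-like prefix of the first name is matched as part of the name in one go.
Local $aHosts = StringRegExp($sLine, "(?im)\G(?:\h*[#:].*$\s*)*(?:[\d.]{7,15}\h+)?([\w/.-]{2,})\s*", 3)

For $i = 0 To UBound($aHosts) - 1
    ; Assumption: drop a single trailing dot, if present, before using the name.
    ConsoleWrite(StringRegExpReplace($aHosts[$i], "\.$", "") & @CRLF)
Next
; Prints:
;   08.185.87.4.liveadvert.com
;   albatross.cz
```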

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)


It seems hphosts made a mistake along the way with that hostname and the trailing . (dot). Plus, I never knew about the remember.mine.nu site.

Thanks Robjong for adding to the discussion, certainly an enlightening insight into regular expressions. I will leave jchd & UEZ to continue as they know more than I do.


The pattern is not optimal if you want to make sure the file is parsed correctly (there is too much room for error, you have been warned). The implementation of regular expressions in AutoIt turned out to be too limited for my initial approach.

However, guinness stated he wanted to improve his knowledge of PCRE, which I think my script helps with a bit. I will have a look at the standards/rules for the hosts file (any good links?) and see what I come up with.

BTW, is the \G sequence for SRE not documented in the help file? Or do I just need to update again?

@jchd So that's where all the good discussions are, hehe. Did anything worth mentioning come up there? As for the entry, I think it is not valid, since dots are separators, but it does work.
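
On the "room for error" point: one more conservative route, sketched here as an assumption rather than anything proposed in the thread, is to treat the file line by line: drop the comment, split on whitespace, take the first token as the address and the rest as names. _HostsLineNames is a hypothetical helper name, and no validation of the address or of the names is attempted.

```autoit
; Hypothetical helper (not from the thread): returns the names found on one hosts-file line,
; or sets @error if the line holds no address/name pair.
Func _HostsLineNames($sLine)
    $sLine = StringRegExpReplace($sLine, "#.*", "") ; drop the comment part, if any
    Local $aTokens = StringRegExp($sLine, "\S+", 3) ; whitespace-separated tokens
    If @error Or UBound($aTokens) < 2 Then Return SetError(1, 0, 0)

    ; Token 0 is the address; everything after it is a name.
    Local $aNames[UBound($aTokens) - 1]
    For $i = 1 To UBound($aTokens) - 1
        $aNames[$i - 1] = $aTokens[$i]
    Next
    Return $aNames
EndFunc   ;==>_HostsLineNames

; Made-up sample line.
Local $aNames = _HostsLineNames("127.0.0.1 localhost loopback # made-up sample line")
If Not @error Then ConsoleWrite(UBound($aNames) & " name(s), first: " & $aNames[0] & @CRLF)
```

This trades the single-pass speed of the patterns above for a structure where each step can be checked separately.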


BTW, is the \G sequence for SRE not documented in the help file? Or do I just need to update again?

It isn't, but I'm afraid to update StringRegExp if \G isn't officially supported in the version of PCRE AutoIt uses.


Oh yes, \G has been part of PCRE from the start. It's mandatory for things like this, where splits may occur at places other than line breaks.
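
A quick way to see what \G adds is to compare a global match with and without it. This is only a sketch on a made-up string: with \G each match must start exactly where the previous one ended, so matching stops at the first position that does not fit, whereas the unanchored pattern skips ahead.

```autoit
Local $sText = "12 34 ab 56" ; made-up sample string

Local $aAnchored = StringRegExp($sText, "\G(\d+)\h*", 3) ; stops at "ab"
Local $aFree = StringRegExp($sText, "(\d+)\h*", 3)       ; skips over "ab"

ConsoleWrite("with \G:    " & UBound($aAnchored) & " match(es)" & @CRLF) ; 2
ConsoleWrite("without \G: " & UBound($aFree) & " match(es)" & @CRLF)    ; 3
```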

There is a large number of options and metacharacters in PCRE that the help file doesn't discuss. OTOH, an exact discussion of many of these options would require lengthy explanations.

Official PCRE man pages for the latest release are available here. I also make available a link to download the v8.30 html counterpart which I find easier to use.

Remember that AutoIt PCRE may be a few versions behind but the differences are mostly in incredibly dark corners with sharp angles where most of us are likely never going to wander.


I think we are far enough behind now that an update is in order. I really don't think it's necessary to update with every new version, but every 4 or 5 wouldn't be so bad.

George

Question about decompiling code? Read the decompiling FAQ and don't bother posting the question in the forums.

Be sure to read and follow the forum rules. -AKA the AutoIt Reading and Comprehension Skills test.***

The PCRE (Regular Expression) ToolKit for AutoIT - (Updated Oct 20, 2011 ver:3.0.1.13) - Please update your current version before filing any bug reports. The installer now includes both 32 and 64 bit versions. No change in version number.

Visit my Blog .. currently not active but it will soon be resplendent with news and views. Also please remove any links you may have to my website. it is soon to be closed and replaced with something else.

"Old age and treachery will always overcome youth and skill!"
