
extracting dates from different file name formats



It has taken me a while to get my head around this, and I'm still not 100% sure about the results. With so many possible variants, things get a little complicated. From what I understand, you are looking for a fuzzy algorithm. There is always a chance that a match is not actually a date, and the more variants you allow, the greater the risk of a false positive. After some analysis I came up with the idea of searching each variant from the left until a plausible date is encountered, then continuing in the same vein to try to find a time stamp after it. [EDIT - with numeric strings] ONLY one delimiter is allowed between the date and the time. [When a date and time contain valid delimiters, the previous rule is disregarded.] Times are accurate to within 60 seconds (the seconds are always set to 00), provided no false positive occurs.

I notice that you give some formats which include characters not allowed in file names. For this reason I have separated the delimiter options and created an exception, so the code can be modified fairly easily. I have not compared your original results with mine. I rewrote the whole thing, and it's a bit rough and ready in places. I think it's closer to what you want, but you will have to test it more thoroughly. The code may form the basis of a better version later. The regular expressions could probably be improved, but it seems to be working as it stands.
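To make the delimiter idea concrete, here is a minimal sketch of how a delimiter set can be dropped into a pattern, and how the \g2 backreference forces the second delimiter to be the same character as the first. The test strings and the $sPattern / $aNames variables are made up for illustration; only the delimiter set and the shape of the expression are taken from the script.

```autoit
; Illustration only: the names below are hypothetical test strings.
Local $sDateDelim = '-/_' ; same delimiter set as used in _ExtractDate()

; \Q...\E quotes the delimiters literally inside the character class,
; and (\g2) must repeat whatever delimiter group 2 captured.
Local $sPattern = '(\d{1,2})([\Q' & $sDateDelim & '\E])(\d{1,2})(\g2)(\d{4})'

Local $aNames[3] = ['12-26-2012', '12_26_2012', '12-26_2012']
For $i = 0 To UBound($aNames) - 1
    ; With the default flag, StringRegExp returns 1 for a match and 0 otherwise.
    ConsoleWrite($aNames[$i] & ' -> ' & StringRegExp($aNames[$i], $sPattern) & @CRLF)
Next
```

The first two names should report 1 and the mixed-delimiter name should report 0. Allowing another delimiter should then only need a change to $sDateDelim.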
 

#include <Array.au3>
#include <Date.au3>
#include <DTC.au3>

Local $aTest = dates_array()

; This code formats the results from _ExtractDate():
Local $sExtracted, $aException
For $i = 0 To UBound($aTest) -1
    $sExtracted = _ExtractDate($aTest[$i][0])
    If Not @error Then
        $sExtracted = StringRegExpReplace(StringReplace($sExtracted, ' ', '_'), '[/\:]', '')
        $sExtracted = _Date_Time_Convert($sExtracted, "yyyyMMdd_HHmmss", "MM/dd/yyyy hh:mm TT")

        ; An exception is made for AM and PM
        $aException = StringRegExp($aTest[$i][0], '(?i)(?: )(\d{1,2})(?:\:)(\d{1,2})( [AP]M)',3)
        If IsArray($aException) Then $sExtracted = StringLeft($sExtracted, 10) & _
        ' ' & StringFormat('%02i:%02i', $aException[0], $aException[1]) & $aException[2]

        $aTest[$i][1] = $sExtracted
    Else
        $aTest[$i][2] = "ERROR"
    EndIf
Next
_ArrayDisplay($aTest)


; This function returns dates in the format yyyy/MM/DD HH:MM:SS
Func _ExtractDate($sString)
    Local $sDateDelim = '-/_', $sTimeDelim = '-._' ; Delimiter options - can be modified.

    Local $sMDY = '(?:.*?)(\d{1,2})([\Q' & $sDateDelim & '\E])(\d{1,2})(\g2)(\d{4})' ; Formats *M?D?YYYY, *M?DD?YYYY, *MM?D?YYYY, *MM?DD?YYYY
    Local $sYMD = '(?:.*?)(\d{4})([\Q' & $sDateDelim & '\E])(\d{1,2})(\g2)(\d{1,2})' ; Formats *YYYY?M?D, *YYYY?M?DD, *YYYY?MM?D, *YYYY?MM?DD
    Local $sHMS = '(?:.*?)(\d{1,2})([\Q' & $sTimeDelim & '\E])(\d{1,2})(\g7)?(\d{1,2})?' ; Formats *HH?MM?SS
    Local $sDDD = '(?:\D*)(\d{8,})(?:\D)?(\d{4,6})?' ; Just Digits *YYYYMMDD?HHMMSS, *YYYYMMDD
    Local $sYY = '(?:\D*)(\d{4})(?:\D|\z)' ; Just 4 digits *YYYY

    Local $aFormat[6] = [$sYMD & $sHMS, $sMDY & $sHMS, $sYMD, $sMDY, $sDDD, $sYY] ; Formats $sYMD & $sHMS, $sMDY & $sHMS, $sYMD, $sMDY, $sDDD, $sYY

    Local $aSRE, $sDate, $sTime = ' 00:00:00'
    For $i = 0 To 5
        $sDate = ''
        $aSRE = StringRegExp($sString, $aFormat[$i], 3)
        If Not @error Then ; Match found
            Switch $i
                Case 0 To 3
                    $sDate = (Mod($i, 2)) ? _
                    $aSRE[4] & '/' & StringFormat('%02i/%02i', $aSRE[0], $aSRE[2]) : _
                    $aSRE[0] & '/' & StringFormat('%02i/%02i',$aSRE[2], $aSRE[4])

                    If _DateIsValid($sDate) Then
                        If $i < 2 Then
                            $sTime = ' ' & StringFormat('%02i:%02i', $aSRE[5], $aSRE[7]) & ':' & '00'
                            If Not _DateIsValid($sDate & $sTime) Then $sTime = ' 00:00:00'
                        EndIf
                        ExitLoop
                    EndIf

                Case 4 ; *YYYYMMDD*HHMMSS, *YYYYMMDD
                    For $j = 1 To StringLen($aSRE[0]) - 7
                        $sYear = StringMid($aSRE[0], $j, 4)
                        If Not __IsWithin50Years($sYear) Then ContinueLoop
                        $sDate = $sYear & '/' & StringMid($aSRE[0], $j + 4, 2) & '/' & StringMid($aSRE[0], $j + 6, 2)
                        If _DateIsValid($sDate) Then ; Search for a valid time
                            $sTime = ' ' & StringMid($aSRE[0], $j + 8, 2) & ':' & StringMid($aSRE[0], $j + 10, 2) & ':00'
                            If Stringlen($sTime) = 9 And _DateIsValid($sDate & $sTime) Then
                                ExitLoop 2
                            Else
                                If UBound($aSRE) > 1 Then
                                    $sTime = ' ' & StringLeft($aSRE[1], 2) & ':' & StringMid($aSRE[1], 3, 2) & ':00'
                                    If Not _DateIsValid($sDate & $sTime) Then $sTime = ' 00:00:00'
                                Else
                                    $sTime = ' 00:00:00'
                                EndIf
                                ExitLoop 2
                            EndIf
                        EndIf
                    Next

                Case 5 ; *YYYY
                    If __IsWithin50Years($aSRE[0]) Then
                        $sDate = $aSRE[0] & '/01/01'
                        $sTime = ' 00:00:00'
                        ExitLoop
                    EndIf
            EndSwitch
        EndIf
    Next

    $sDate &= $sTime
    If StringLen($sDate) <> 19 Then Return SetError(1) ; No date found.

    Return $sDate
EndFunc ;==> _ExtractDate

; This time window can easily be extended to further in the past.
Func __IsWithin50Years($vYear, $iRange = 50)
    Local $iCurrentYear = @YEAR
    Return $vYear < $iCurrentYear And $iCurrentYear - $vYear < $iRange
EndFunc ;==> __IsWithin50Years


#Region - original test data
Func dates_array()

    Local $array[65][3]

    ;resolved
    $array[0][0] = "2/3/2012 8:38 PM"
    $array[1][0] = "2/03/2012 08:38 PM"
    $array[2][0] = "02/3/2012 8:38 AM"
    $array[3][0] = "11/03/2012 8:38 AM"
    $array[4][0] = "11/03/2012 08:38 AM"
    $array[5][0] = "2012-12-30_14-48-34_90"
    $array[6][0] = "2012_12_30_14_48_34_90"
    $array[7][0] = "2012-12-30-14-48-34-90"
    $array[8][0] = "2012-12-30 14-48-34-90"
    $array[9][0] = "2015-04-29 03.46.36"
    $array[10][0] = "2015_04_29 03.46.36"
    $array[11][0] = "12-26-2012-bridge(1)"
    $array[12][0] = "12_26_2012-bridge(1)"
    $array[13][0] = "12-26-2012"
    $array[14][0] = "12_26_2012"
    $array[15][0] = "IMG00136-20100524-0109"
    $array[16][0] = "IMG00136_20100524_0109"
    $array[17][0] = "IMG_20000526_100019_402"
    $array[18][0] = "IMG-20120615-00028"
    $array[19][0] = "IMG_20120615_00028"
    $array[20][0] = "Texas-20111117-00060"
    $array[21][0] = "Texas_20111117_00060"
    $array[22][0] = "Southwest San Marcos Valley-20111110-00046"
    $array[23][0] = "Southwest San Marcos Valley_20111110_00046"
    $array[24][0] = "Long Island-Laketown-20110526-00023"
    $array[25][0] = "Long Island-Laketown_20110526_00023"
    $array[26][0] = "20141119_193702"
    $array[27][0] = "20141119-193702"

    ;still need to resolve - RESOLVED
    $array[28][0] = "2014071495201859"
    $array[29][0] = "2013072695195930"
    $array[30][0] = "IMG-20140619-WA0000"
    $array[31][0] = "IMG-20140402-WA0000"
    $array[32][0] = "VID-20141002-WA0001"
    $array[33][0] = "VID-20141009-WA0004"
    $array[34][0] = "IMG95201405169510533295434"
    $array[35][0] = "IMG95201310319519475695780"
    $array[36][0] = "IMG952014050695205100"
    $array[37][0] = "IMG952013010695192927"
    $array[38][0] = "Resampled952012-07-099515-09-279577"
    $array[39][0] = "Resampled952012-05-169519-32-049577"
    $array[40][0] = "Resampled952012-05-129518-02-1795365"
    $array[41][0] = "Resampled952012-06-109513-34-0395360"
    $array[42][0] = "IMG_20141003_244125_273"
    $array[43][0] = "IMG_20141003_244129_571"
    $array[44][0] = "2012-07-149519"
    $array[45][0] = "VID_20120415103537718"
    $array[46][0] = "VID_20120415103537718"
    $array[47][0] = "VN_20120520103037802"
    $array[48][0] = "VN_20121005215040254"
    $array[49][0] = "PicStory-2012-04-01-02-53"
    $array[50][0] = "2012-12-209510-42-3195121"
    $array[51][0] = "2012-12-219512-05-0395507"
    $array[52][0] = "2014-08-259507.27.29"
    $array[53][0] = "2013-01-29"

    ;should not match
    $array[54][0] = "0623112010"
    $array[55][0] = "0710122020"
    $array[56][0] = "0710122022"
    $array[57][0] = "0710122024"
    $array[58][0] = "0710122026"
    $array[59][0] = "0710122020"
    $array[60][0] = "0710122022"
    $array[61][0] = "0710122023a"
    $array[62][0] = "0710122024"
    $array[63][0] = "0710122026"
    $array[64][0] = "13659097338151"

    Return $array

EndFunc   ;==>dates_array
#EndRegion

 

Edited by czardas
Missing a space in line 83

This is working great! Except I noticed something weird: some formats resolve fine in one place, but for some reason the same formats do not resolve elsewhere.

For instance:

Array elements 26 and 27 resolve fine, but note that array elements 65-81 do not.

Also, array elements 42 and 43 resolve, but elements 82 and 83 do not.

#include <Array.au3>
#include <Date.au3>
#include <DTC.au3>

Local $aTest = dates_array()

; This code formats the results from _ExtractDate():
Local $sExtracted, $aException
For $i = 0 To UBound($aTest) -1
    $sExtracted = _ExtractDate($aTest[$i][0])

    If Not @error Then
        $aTest[$i][1] = $sExtracted
    Else
        $aTest[$i][2] = "ERROR"
    EndIf
Next
_ArrayDisplay($aTest)


; This function returns dates in the format yyyy/MM/DD HH:MM:SS
Func _ExtractDate($sString)
    Local $sDateDelim = '-/_', $sTimeDelim = '-._' ; Delimiter options - can be modified.

    Local $sMDY = '(?:.*?)(\d{1,2})([\Q' & $sDateDelim & '\E])(\d{1,2})(\g2)(\d{4})' ; Formats *M?D?YYYY, *M?DD?YYYY, *MM?D?YYYY, *MM?DD?YYYY
    Local $sYMD = '(?:.*?)(\d{4})([\Q' & $sDateDelim & '\E])(\d{1,2})(\g2)(\d{1,2})' ; Formats *YYYY?M?D, *YYYY?M?DD, *YYYY?MM?D, *YYYY?MM?DD
    Local $sHMS = '(?:.*?)(\d{1,2})([\Q' & $sTimeDelim & '\E])(\d{1,2})(\g7)?(\d{1,2})?' ; Formats *HH?MM?SS
    Local $sDDD = '(?:\D*)(\d{8,})(?:\D)?(\d{4,6})?' ; Just Digits *YYYYMMDD?HHMMSS, *YYYYMMDD
    Local $sYY = '(?:\D*)(\d{4})(?:\D|\z)' ; Just 4 digits *YYYY

    Local $aFormat[6] = [$sYMD & $sHMS, $sMDY & $sHMS, $sYMD, $sMDY, $sDDD, $sYY] ; Formats $sYMD & $sHMS, $sMDY & $sHMS, $sYMD, $sMDY, $sDDD, $sYY

    Local $aSRE, $sDate, $sTime = ' 00:00:00'
    For $i = 0 To 5
        $sDate = ''
        $aSRE = StringRegExp($sString, $aFormat[$i], 3)
        If Not @error Then ; Match found
            Switch $i
                Case 0 To 3
                    $sDate = (Mod($i, 2)) ? _
                    $aSRE[4] & '/' & StringFormat('%02i/%02i', $aSRE[0], $aSRE[2]) : _
                    $aSRE[0] & '/' & StringFormat('%02i/%02i',$aSRE[2], $aSRE[4])

                    If _DateIsValid($sDate) Then
                        If $i < 2 Then
                            $sTime = ' ' & StringFormat('%02i:%02i', $aSRE[5], $aSRE[7]) & ':' & '00'
                            If Not _DateIsValid($sDate & $sTime) Then $sTime = ' 00:00:00'
                        EndIf
                        ExitLoop
                    EndIf

                Case 4 ; *YYYYMMDD*HHMMSS, *YYYYMMDD
                    For $j = 1 To StringLen($aSRE[0]) - 7
                        $sYear = StringMid($aSRE[0], $j, 4)
                        If Not __IsWithin50Years($sYear) Then ContinueLoop
                        $sDate = $sYear & '/' & StringMid($aSRE[0], $j + 4, 2) & '/' & StringMid($aSRE[0], $j + 6, 2)
                        If _DateIsValid($sDate) Then ; Search for a valid time
                            $sTime = ' ' & StringMid($aSRE[0], $j + 8, 2) & ':' & StringMid($aSRE[0], $j + 10, 2) & ':00'
                            If Stringlen($sTime) = 9 And _DateIsValid($sDate & $sTime) Then
                                ExitLoop 2
                            Else
                                If UBound($aSRE) > 1 Then
                                    $sTime = ' ' & StringLeft($aSRE[1], 2) & ':' & StringMid($aSRE[1], 3, 2) & ':00'
                                    If Not _DateIsValid($sDate & $sTime) Then $sTime = ' 00:00:00'
                                Else
                                    $sTime = ' 00:00:00'
                                EndIf
                                ExitLoop 2
                            EndIf
                        EndIf
                    Next

                Case 5 ; *YYYY
                    If __IsWithin50Years($aSRE[0]) Then
                        $sDate = $aSRE[0] & '/01/01'
                        $sTime = ' 00:00:00'
                        ExitLoop
                    EndIf
            EndSwitch
        EndIf
    Next

    $sDate &= $sTime
    If StringLen($sDate) <> 19 Then Return SetError(1) ; No date found.

    $sDate = StringRegExpReplace(StringReplace($sDate, ' ', '_'), '[/\:]', '')
    $sDate = _Date_Time_Convert($sDate, "yyyyMMdd_HHmmss", "MM/dd/yyyy hh:mm TT")

    ; An exception is made for AM and PM
    $aException = StringRegExp($sString, '(?i)(?: )(\d{1,2})(?:\:)(\d{1,2})( [AP]M)',3)
    If IsArray($aException) Then $sDate = StringLeft($sDate, 10) & _
    ' ' & StringFormat('%02i:%02i', $aException[0], $aException[1]) & $aException[2]

    Return $sDate
EndFunc ;==> _ExtractDate

; This time window can easily be extended to further in the past.
Func __IsWithin50Years($vYear, $iRange = 50)
    Local $iCurrentYear = @YEAR
    Return $vYear < $iCurrentYear And $iCurrentYear - $vYear < $iRange
EndFunc ;==> __IsWithin50Years


#Region - original test data
Func dates_array()

    Local $array[84][3]

    ;resolved
    $array[0][0] = "2/3/2012 8:38 PM"
    $array[1][0] = "2/03/2012 08:38 PM"
    $array[2][0] = "02/3/2012 8:38 AM"
    $array[3][0] = "11/03/2012 8:38 AM"
    $array[4][0] = "11/03/2012 08:38 AM"
    $array[5][0] = "2012-12-30_14-48-34_90"
    $array[6][0] = "2012_12_30_14_48_34_90"
    $array[7][0] = "2012-12-30-14-48-34-90"
    $array[8][0] = "2012-12-30 14-48-34-90"
    $array[9][0] = "2015-04-29 03.46.36"
    $array[10][0] = "2015_04_29 03.46.36"
    $array[11][0] = "12-26-2012-bridge(1)"
    $array[12][0] = "12_26_2012-bridge(1)"
    $array[13][0] = "12-26-2012"
    $array[14][0] = "12_26_2012"
    $array[15][0] = "IMG00136-20100524-0109"
    $array[16][0] = "IMG00136_20100524_0109"
    $array[17][0] = "IMG_20000526_100019_402"
    $array[18][0] = "IMG-20120615-00028"
    $array[19][0] = "IMG_20120615_00028"
    $array[20][0] = "Texas-20111117-00060"
    $array[21][0] = "Texas_20111117_00060"
    $array[22][0] = "Southwest San Marcos Valley-20111110-00046"
    $array[23][0] = "Southwest San Marcos Valley_20111110_00046"
    $array[24][0] = "Long Island-Laketown-20110526-00023"
    $array[25][0] = "Long Island-Laketown_20110526_00023"
    $array[26][0] = "20141119_193702"
    $array[27][0] = "20141119-193702"

    ;still need to resolve - RESOLVED
    $array[28][0] = "2014071495201859"
    $array[29][0] = "2013072695195930"
    $array[30][0] = "IMG-20140619-WA0000"
    $array[31][0] = "IMG-20140402-WA0000"
    $array[32][0] = "VID-20141002-WA0001"
    $array[33][0] = "VID-20141009-WA0004"
    $array[34][0] = "IMG95201405169510533295434"
    $array[35][0] = "IMG95201310319519475695780"
    $array[36][0] = "IMG952014050695205100"
    $array[37][0] = "IMG952013010695192927"
    $array[38][0] = "Resampled952012-07-099515-09-279577"
    $array[39][0] = "Resampled952012-05-169519-32-049577"
    $array[40][0] = "Resampled952012-05-129518-02-1795365"
    $array[41][0] = "Resampled952012-06-109513-34-0395360"
    $array[42][0] = "IMG_20141003_244125_273"
    $array[43][0] = "IMG_20141003_244129_571"
    $array[44][0] = "2012-07-149519"
    $array[45][0] = "VID_20120415103537718"
    $array[46][0] = "VID_20120415103537718"
    $array[47][0] = "VN_20120520103037802"
    $array[48][0] = "VN_20121005215040254"
    $array[49][0] = "PicStory-2012-04-01-02-53"
    $array[50][0] = "2012-12-209510-42-3195121"
    $array[51][0] = "2012-12-219512-05-0395507"
    $array[52][0] = "2014-08-259507.27.29"
    $array[53][0] = "2013-01-29"

    ;should not match
    $array[54][0] = "0623112010"
    $array[55][0] = "0710122020"
    $array[56][0] = "0710122022"
    $array[57][0] = "0710122024"
    $array[58][0] = "0710122026"
    $array[59][0] = "0710122020"
    $array[60][0] = "0710122022"
    $array[61][0] = "0710122023a"
    $array[62][0] = "0710122024"
    $array[63][0] = "0710122026"
    $array[64][0] = "13659097338151"

    ;new
    $array[65][0] = "20150102_171408"
    $array[66][0] = "20150104_174204"
    $array[67][0] = "20150104_174353"
    $array[68][0] = "20150104_175104"
    $array[69][0] = "20150104_181751"
    $array[70][0] = "20150102_171408"
    $array[71][0] = "20150104_174204"
    $array[72][0] = "20150104_174353"
    $array[73][0] = "20150104_175104"
    $array[74][0] = "20150104_181751"
    $array[75][0] = "20150104_184735"
    $array[76][0] = "20150209_200557"
    $array[77][0] = "20150313_200638"
    $array[78][0] = "20150313_200914"
    $array[79][0] = "20150313_201126"
    $array[80][0] = "20150418_201504"
    $array[81][0] = "20150419_100142"

    $array[82][0] = "IMG_20150219_121547_663"
    $array[83][0] = "IMG_20150219_145706_239"


    Return $array

EndFunc   ;==>dates_array
#EndRegion

By the way, I moved this section into the _ExtractDate function. I hope you don't mind. I don't think that's making the difference, right?

$sDate = StringRegExpReplace(StringReplace($sDate, ' ', '_'), '[/\:]', '')
$sDate = _Date_Time_Convert($sDate, "yyyyMMdd_HHmmss", "MM/dd/yyyy hh:mm TT")

; An exception is made for AM and PM
$aException = StringRegExp($sString, '(?i)(?: )(\d{1,2})(?:\:)(\d{1,2})( [AP]M)',3)
If IsArray($aException) Then $sDate = StringLeft($sDate, 10) & _
' ' & StringFormat('%02i:%02i', $aException[0], $aException[1]) & $aException[2]

thanks for your help!


Duh, my fault. The problem is in the function __IsWithin50Years(). I overlooked a silly thing by focusing more attention on the other function: the comparison used < rather than <=, so a year equal to the current year was rejected. Replace it with the following version:

; This time window can easily be extended to further in the past.
Func __IsWithin50Years($vYear, $iRange = 50)
    Local $iCurrentYear = @YEAR
    Return $vYear <= $iCurrentYear And $iCurrentYear - $vYear < $iRange ; Modified
EndFunc ;==> __IsWithin50Years
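
A quick way to see the boundary case is to pass the current year in as a parameter instead of reading @YEAR. This is only a sketch: _InWindow, its $iCurrentYear and $bInclusive parameters, and the year 2015 are assumptions made for the demonstration and are not part of the script above.

```autoit
; Hypothetical helper for illustration: same test as __IsWithin50Years(),
; but the current year is supplied by the caller so any year can be checked.
Func _InWindow($iYear, $iCurrentYear, $bInclusive, $iRange = 50)
    Local $bUpper
    If $bInclusive Then
        $bUpper = ($iYear <= $iCurrentYear) ; modified comparison
    Else
        $bUpper = ($iYear < $iCurrentYear) ; original comparison
    EndIf
    Return $bUpper And ($iCurrentYear - $iYear < $iRange)
EndFunc

; Assume the script is run in 2015 (an assumption for this demo).
ConsoleWrite('2014, original: ' & _InWindow(2014, 2015, False) & @CRLF)
ConsoleWrite('2015, original: ' & _InWindow(2015, 2015, False) & @CRLF)
ConsoleWrite('2015, modified: ' & _InWindow(2015, 2015, True) & @CRLF)
```

Under that assumption, the original comparison accepts 2014 but rejects 2015, while the modified one accepts both. That would fit the report that the 2014 names resolved and the 2015 names did not.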

 

Edited by czardas

I was never quite happy with including a single 4 digit year. Try replacing line 29 (the $sYY definition) with this:

Local $sYY = '(?:\A|\D)(\d{4})(?:\D|\z)' ; Just 4 digits *YYYY

It should still be able to match names like myFile2015.txt, but not myFile20155.txt. It would be less inclined towards false positives if these four digit matches were excluded from the results altogether.

Edit: I just modified the above expression once again.
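
For anyone who wants to compare the two expressions side by side, here is a small sketch. The _FirstGroup helper is hypothetical and exists only for this comparison. The first two names come from the explanation above, and the third is one of the "should not match" entries from the test data.

```autoit
; Hypothetical helper: returns the first captured group, or 'none' if no match.
Func _FirstGroup($sName, $sPattern)
    Local $aMatch = StringRegExp($sName, $sPattern, 1)
    If @error Then Return 'none'
    Return $aMatch[0]
EndFunc

Local $sOld = '(?:\D*)(\d{4})(?:\D|\z)'   ; earlier $sYY
Local $sNew = '(?:\A|\D)(\d{4})(?:\D|\z)' ; replacement $sYY

Local $aNames[3] = ['myFile2015.txt', 'myFile20155.txt', '0623112010']
For $i = 0 To UBound($aNames) - 1
    ConsoleWrite($aNames[$i] & ': old=' & _FirstGroup($aNames[$i], $sOld) & _
            ', new=' & _FirstGroup($aNames[$i], $sNew) & @CRLF)
Next
```

The earlier expression can start its match in the middle of a longer digit run, so it should pick up 0155 from myFile20155.txt and 2010 from 0623112010. The replacement requires the start of the string or a non-digit immediately before the four digits, so it should return a match only for myFile2015.txt.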

Edited by czardas