Posted (edited)

Hi Experts,

Hope everyone is having a good day today!😊

I have a new task that involves creating XML from given data. I've been searching for how to create XML from plain text data, but I could not find a thread on it. Maybe I missed something in my search.😅 I'm posting this without any sample code because I'm still looking for a head start, and I hope you can point me to a thread or suggest where I should start.

Below is the text data that I need to convert into XML.

John J., Gracy D., Jame R., et al., (2019) This is a sample sentence here. Then another here. 5(2);101-109. doi:1001.10110/aj21.j1j.10.

 

Here is what the XML should look like:

<File xml:id="name-of-filename">
<Citation type="letter" xml:id="name-of-filename"><Person><familyName>John</familyName> <givenName>J.</givenName></Person>, <Person><familyName>Gracy</familyName>, <givenName>D.</givenName></Person>, et al. (<pubYear year="2019">2019</pubYear>). <Title>This is a sample Title sentence here</Title>. <SubTitle>Then another here</SubTitle>, <vol>5</vol>(<issue>2</issue>); <FirstPage>101</FirstPage>&ndash;<SecondPage>109</SecondPage>. <url href="https://doi.org/1001.10110/aj21.j1j.10.">doi:1001.10110/aj21.j1j.10.</url></Citation>
</File>

 

If you have any suggestions, or can refer me to any thread, that would be a big help, Experts. Thank you in advance😁

 

KS15

Edited by KickStarter15

Programming is "To make it so simple that there are obviously no deficiencies" or "To make it so complicated that there are no obvious deficiencies" by C.A.R. Hoare.

Posted

obviously, the first task at hand is to parse the text into fields - i.e. split it into an array where each element contains a single data item. that i leave to you.

once you have the individual data items, i advise you compose the XML yourself. do not use any existing UDF for that, it is overkill for such a simple task. you know the field names, data, order and hierarchy; just write it down.
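
A minimal sketch of what composing the XML by hand could look like once the fields sit in an array. The `$aFields` layout, the made-up title and the `_XmlEscape` helper are assumptions for illustration only, not something given in this thread. The escaping step is included because hand-built XML stops being well-formed as soon as a field contains `&` or `<`.

```autoit
; Hypothetical field layout, for illustration only:
; [0] family name, [1] given name, [2] year, [3] title (made up, contains "&" to show the escaping)
Local $aFields[4] = ['John', 'J.', '2019', 'Research & development here']

Local $sXml = '<Citation type="letter">' & @CRLF & _
        '<Person><familyName>' & _XmlEscape($aFields[0]) & '</familyName> ' & _
        '<givenName>' & _XmlEscape($aFields[1]) & '</givenName></Person>' & @CRLF & _
        '<pubYear year="' & _XmlEscape($aFields[2]) & '">' & _XmlEscape($aFields[2]) & '</pubYear>' & @CRLF & _
        '<Title>' & _XmlEscape($aFields[3]) & '</Title>' & @CRLF & _
        '</Citation>'

ConsoleWrite($sXml & @CRLF)

; Hypothetical helper (not a built-in): replace the characters that XML reserves.
Func _XmlEscape($sText)
    $sText = StringReplace($sText, '&', '&amp;') ; must run first, otherwise the entities added below get escaped again
    $sText = StringReplace($sText, '<', '&lt;')
    $sText = StringReplace($sText, '>', '&gt;')
    $sText = StringReplace($sText, '"', '&quot;')
    Return $sText
EndFunc   ;==>_XmlEscape
```

The same concatenation can be handed to FileWrite() instead of ConsoleWrite() once the output looks right.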

Signature - my forum contributions:

Spoiler

UDF:

LFN - support for long file names (over 260 characters)

InputImpose - impose valid characters in an input control

TimeConvert - convert UTC to/from local time and/or reformat the string representation

AMF - accept multiple files from Windows Explorer context menu

DateDuration -  literal description of the difference between given dates

Apps:

Touch - set the "modified" timestamp of a file to current time

Show For Files - tray menu to show/hide files extensions, hidden & system files, and selection checkboxes

SPDiff - Single-Pane Text Diff

 

Posted (edited)

@orbs, Thanks. I tried using StringSplit but could not get it right. Can you explain what you mean? Maybe I'm just too upset right now, knowing that I still need to learn XML creation. 😓 For now I have no idea what I should do.

First, I tried FileWrite(), but I would need to type all the names into the FileWrite() call just to generate the XML I want.

Second, I tried StringSplit, but I'm stuck in the For loop and the result is still not what I expected.

Here:

#include <MsgBoxConstants.au3>
#include <StringConstants.au3>
$data_file = @ScriptDir & "\data.txt"
$File = FileRead($data_file)
$FilePath = @ScriptDir & "\Test.xml"
XMLFile()

Func XMLFile()
    Local $sText = $File

    Local $aArray = StringSplit($sText, ' ', $STR_ENTIRESPLIT)
    For $i = 1 To $aArray[0] ; Loop through the array returned by StringSplit to display the individual values.
        MsgBox($MB_SYSTEMMODAL, "", "$aArray[" & $i & "] - " & $aArray[$i])
        $Text = $aArray[$i] ; overwritten on every pass, so only the last token is left after the loop
    Next
    FileWrite($FilePath, '<File xml:id="name-of-filename">' & @CRLF & _
            '<Citation type="letter" xml:id="name-of-filename">' & @CRLF & _
            '<Person><familyName>' & $Text & '</familyName> <givenName>' & $Text & '</givenName></Person>, ' & @CRLF & _
            '</Citation>' & @CRLF & _
            '</File>')
EndFunc

Yeah, it's funny but true😂. I put this together from the help file, hoping it would give me a head start. I really need to learn the XML UDFs so I can avoid asking for so much help.😅

 

@jdelaney, Yup, I did search the forum and Google, but everything I found is about appending to an existing XML or creating new elements in it. I need to create an XML file from the data I posted above. Maybe I did not land on the right link as you suggested; can you point me to an existing thread related to my question? Much appreciated, jdelaney. Thanks😁

 

Edited by KickStarter15


Posted

@KickStarter15
Just extract all the values from XML with SRE and then concat them, filtering them as you want :)

#include <StringConstants.au3>

Global $strFileContent = '<File xml:id="name-of-filename">' & _
                         '<Citation type="letter" xml:id="name-of-filename"><Person><familyName>John</familyName><givenName>J.</givenName></Person>,' & _
                         '<Person><familyName>Gracy</familyName>, <givenName>D.</givenName></Person>, et al. (<pubYear year="2019">2019</pubYear>).' & _
                         '<Title>This is a sample Title sentence here</Title>. <SubTitle>Then another here</SubTitle>, <vol>5</vol>(<issue>2</issue>);' & _
                         '<FirstPage>101</FirstPage>&ndash;<SecondPage>109</SecondPage>.' & _
                         '<url href="https://doi.org/1001.10110/aj21.j1j.10.">doi:1001.10110/aj21.j1j.10.</url></Citation>' & _
                         '</File>', _
      $arrResult, _
      $strResult


$arrResult = StringRegExp($strFileContent, '>([^<]+)<', $STR_REGEXPARRAYGLOBALMATCH)

For $i = 0 To UBound($arrResult) - 1 Step 1
    If $arrResult[$i] = "&ndash;" Then
        $strResult &= "-"
    Else
        $strResult &= StringReplace($arrResult[$i], ';', '.')
    EndIf
Next

ConsoleWrite($strResult & @CRLF)

 


Posted

@FrancescoDiMuro, I think you understood it in reverse😊 What I need is this:

From the data.txt:

"John J., Gracy D., Jame R., et al., (2019) This is a sample sentence here. Then another here. 5(2);101-109. doi:1001.10110/aj21.j1j.10.".

 

It should be captured in an XML file:

<File xml:id="name-of-filename">
<Citation type="letter" xml:id="name-of-filename"><Person><familyName>John</familyName> <givenName>J.</givenName></Person>, <Person><familyName>Gracy</familyName>, <givenName>D.</givenName></Person>, et al. (<pubYear year="2019">2019</pubYear>). <Title>This is a sample Title sentence here</Title>. <SubTitle>Then another here</SubTitle>, <vol>5</vol>(<issue>2</issue>); <FirstPage>101</FirstPage>&ndash;<SecondPage>109</SecondPage>. <url href="https://doi.org/1001.10110/aj21.j1j.10.">doi:1001.10110/aj21.j1j.10.</url></Citation>
</File>

 

Or maybe I misunderstood your suggestion😅.


Posted

orbs is definitely right.
First parse your string, in whichever way you want, to get a cute array. Example:

#Include <Array.au3>

$str = "John J., Gracy D., Jame R., et al., (2019) This is a sample sentence here. Then another here. 5(2);101-109. doi:1001.10110/aj21.j1j.10."

$res = StringRegExp($str, '(?x) (?|  ([[:alpha:]]+)\h(?=[A-Z])  |  ([^,]+),\h  |  \((\d+)\)\h  |  ([A-Z][^.]+)\.\h   |  (\d+)  |  (\S+)$  ) ', 3)

 _ArrayDisplay($res)

Then loop through the array and build the XML content string using conditions on each array element.
Such an XML is a custom thingy, so there is no 'generic' way to do the job.
Good luck  :)

Posted

@mikell, Thanks, that's the problem now: how can I loop through the array and assign each array element to a specific element for the XML creation?☹️

Also, can you help me take the array produced by the code you gave, add a unique delimiter for each array element, and then create the XML based on those new delimiters? Would that be possible?

 


Posted (edited)

you need to review carefully that fabulous regex magic by @mikell, to properly identify the data fields you receive. not all titles would have exactly two authors, right? and not all authors would have exactly two components of the name, right? same goes for the other data items. i advise you apply that regex on multiple different sample data strings - as many as you can find - to confirm its usefulness in all scenarios.

b.t.w. i notice the data string contains three names, yet your desired XML contains only two - "Jame R." is not included in the XML. is this intended?

once you have a verified array of identified data items, composing them into XML is easy. but let's fry one fish at a time, ok?
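
One way to run that kind of check, as a rough sketch: feed several sample strings through the same pattern and print every captured item, so a misparse is easy to spot. The pattern is copied unchanged from mikell's post above; the `$aSamples` list is an assumption and should be extended with as much real data as possible.

```autoit
#include <StringConstants.au3>

; Sample strings in the format used in this thread; add real data here
Local $aSamples[2] = [ _
        'John J., Gracy D., Jame R., et al., (2019) This is a sample sentence here. Then another here. 5(2);101-109. doi:1001.10110/aj21.j1j.10.', _
        'John J., Gracy D., Jame R., (2019) This is a sample sentence here. Then another here. 5;101-109.']

; mikell's pattern from the post above, unchanged
Local $sPattern = '(?x) (?|  ([[:alpha:]]+)\h(?=[A-Z])  |  ([^,]+),\h  |  \((\d+)\)\h  |  ([A-Z][^.]+)\.\h   |  (\d+)  |  (\S+)$  ) '

For $i = 0 To UBound($aSamples) - 1
    Local $aRes = StringRegExp($aSamples[$i], $sPattern, $STR_REGEXPARRAYGLOBALMATCH)
    If @error Then
        ; StringRegExp sets @error when nothing matched or the pattern is invalid
        ConsoleWrite('sample ' & $i & ': no match' & @CRLF)
        ContinueLoop
    EndIf
    ConsoleWrite('sample ' & $i & ': ' & UBound($aRes) & ' items' & @CRLF)
    For $j = 0 To UBound($aRes) - 1
        ConsoleWrite('  [' & $j & '] ' & $aRes[$j] & @CRLF)
    Next
Next
```

Comparing the item lists side by side shows quickly which optional parts (et al., issue, doi) shift the positions of the later items.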

Edited by orbs


Posted

Obviously, if the original string is formatted differently the regex may fail. Such an expression should be tested against many entry strings to check its reliability and to change it if needed. This first step is mandatory  ^_^

Anyway, to build such an XML, the hard work of creating and populating the fields can't be avoided, whichever way is used: the XML way as jdelaney said, or the string way as below.

#Include <Array.au3>

$str = "John J., Gracy D., Jame R., et al., (2019) This is a sample sentence here. Then another here. 5(2);101-109. doi:1001.10110/aj21.j1j.10."

;$res = StringRegExp($str, '(?x) (?|  ([^,]+),\h  |  \((\d+)\)\h  |  ([A-Z][^.]+)\.\h   |  (\d+)  |  (\S+)$  ) ', 3)

$res = StringRegExp($str, '(?x) (?|  ([[:alpha:]]+)\h(?=[A-Z])  |  ([^,]+),\h  |  \((\d+)\)\h  |  ([A-Z][^.]+\.)\h   |  (\d+)  |  (\S+)$  ) ', 3)

 _ArrayDisplay($res)


Local $i = 0, $s = '<File xml:id="name-of-filename">' & @crlf & _
                     '<Citation type="letter" xml:id="name-of-filename">'& @crlf

; each author takes two consecutive array elements: family name, then given name
While StringRegExp($res[$i], '^[A-Z]')
     $s &= '<Person>' & @crlf & '<familyName>' & $res[$i] & '</familyName>' & _
                   '<givenName>' & $res[$i+1] & '</givenName>' & @crlf & '</Person>' & @crlf
     $i += 2
Wend
  $s &= $res[$i] & @crlf    ; "et al."
  $i += 1
  $s &= '<pubYear year="' & $res[$i] & '">' & $res[$i] & '</pubYear>' & @crlf
  $i += 1

  ; and so on

  $s &= '</Citation>' & @crlf & '</File>' & @crlf

Msgbox(0,"", $s)

 

Posted

@mikell, Thanks, I see now what you mean. Sorry, I don't have a RegExp background yet and am still learning that part🤤. I tried your code, and it already displays and creates the XML output that I need. However, if there are more than three person names (for example Person 1, Person 2, Person 3, Person 4 and so on), the code stops with an error saying the range was exceeded.

Please, can you advise which part of the code I should adjust to change the range?😅


Posted

As long as the formatting is unchanged, the number of persons doesn't matter (see below). If trouble occurs, it means that something in the formatting was different, as orbs warned.
You might post some string examples.

;this works with my previous snippet

$str = "John J., Gracy D., Jame R., Starter K., Orb S., Mikell C., Delaney J., Melba S., SoOn And, et al., (2019) This is a sample sentence here. Then another here. 5(2);101-109. doi:1001.10110/aj21.j1j.10."

BTW there are other ways to parse the string, but if the format of the string is not constant then none of them will work flawlessly.

Posted (edited)

@mikell, Well, now it all makes sense to me. Thanks, it worked perfectly. However, some formats do not use "et al.," after the last person; is there an else-if for this?😅 Or should I write a different RegExp for it? Say I have three different patterns: would each pattern have its own RegExp? Please advise. I tried reading the explanation below from the RegExp site, but I only understand a little of it, so I need to learn this now.

[Attached image: the RegExp explanation mentioned above]

 

 

Edited by KickStarter15


Posted

@mikell, These are the other sample strings that I'm worried about. I already tried changing the regexp but could not get it right.😥

;with "et al.,"
John J., Gracy D., Jame R., John J., Gracy D., Jame R., John J., Gracy D., Jame R., et al., (2019) This is a sample sentence here. Then another here. 5(2);101-109. doi:1001.10110/aj21.j1j.10.

;without "et al.,"
John J., Gracy D., Jame R., John J., Gracy D., Jame R., John J., Gracy D., Jame R., (2019) This is a sample sentence here. Then another here. 5(2);101-109. doi:1001.10110/aj21.j1j.10.

;without issue number "(2)"
John J., Gracy D., Jame R., John J., Gracy D., Jame R., John J., Gracy D., Jame R., (2019) This is a sample sentence here. Then another here. 5;101-109. doi:1001.10110/aj21.j1j.10.

;without "doi:" and issue "(2)" numbers
John J., Gracy D., Jame R., John J., Gracy D., Jame R., John J., Gracy D., Jame R., (2019) This is a sample sentence here. Then another here. 5;101-109.

 


Posted (edited)

The regex can be slightly changed, but if some elements are likely to be missing then this can be managed using conditions when building the xml - as I mentioned in my first post

#Include <Array.au3>

Local $astr[4] = ["John J., Gracy D., Jame R., et al., (2019) This is a sample sentence here. Then another here. 5(2);101-109. doi:1001.10110/aj21.j1j.10.", "John J., Gracy D., Jame R., John J., Gracy D., Jame R., (2019) This is a sample sentence here. Then another here. 5(2);101-109. doi:1001.10110/aj21.j1j.10.", "Jame R., John J., Gracy D., Jame R., (2019) This is a sample sentence here. Then another here. 5;101-109. doi:1001.10110/aj21.j1j.10.", "John J., Gracy D., Jame R., (2019) This is a sample sentence here. Then another here. 5;101-109."]

$n = 0

For $str In $astr

$res = StringRegExp($str, '(?x) (?|  ' & _
    '([[:alpha:]]+)\h([A-Z].) ' & _         ; names
    ' |  ([[a-z\h.]+),\h ' & _              ; et al
    ' |  \((\d+)\)\h ' & _                  ; year
    ' |  ([A-Z][^.]+\.)\h  ' & _            ; title, subtitle
    ' |  (\d+[\(\);.-]) ' & _               ; vol, issue, pages
    ' |  (\S+)$  ) ' , 3)                   ; the rest

$n += 1
 _ArrayDisplay($res, $n)


Local $i = 0, $s = '<File xml:id="name-of-filename">' & @crlf & _ 
                   '<Citation type="letter" xml:id="name-of-filename">'& @crlf

While StringRegExp($res[$i], '^[A-Z]') 
     $s &= '<Person>' & '<familyName>' & $res[$i] & '</familyName>' & _
           '<givenName>' & $res[$i+1] & '</givenName>' & '</Person>' & @crlf 
     $i += 2
Wend

  If StringRegExp($res[$i], '^[a-z\h.]+$') Then
       $s &= $res[$i] & @crlf 
       $i += 1
  EndIf

  $s &= '<pubYear year="' & $res[$i] & '">' & $res[$i] & '</pubYear>' & @crlf 
  $i += 1
  $s &= '<Title>' & $res[$i] & '</Title>' & @crlf
  $i += 1
  $s &= '<SubTitle>' & $res[$i] & '</SubTitle>' & @crlf
  $i += 1

  If StringRegExp($res[$i], '\d+[\(;]$') Then
       $s &= '<vol>' & StringTrimRight($res[$i], 1) & '</vol>' & @crlf
       $i += 1
  EndIf

  If StringRegExp($res[$i], '\d+\)$') Then
       $s &= '<issue>' & StringTrimRight($res[$i], 1) & '</issue>' & @crlf
       $i += 1
  EndIf

  ; and so on

  $s &= '</Citation>' & @crlf

Msgbox(0,$n, $s)

Next

 

Edited by mikell

Posted (edited)

@mikell, I got the error below after checking the string without the url.

[Attached screenshots of the error]

The code also gives the same error as above if the first page and last page are changed.

[Attached screenshots]

Another issue is this: the output includes the hyphen in the first page and the period in the last page, which should not be included. I tried making some changes but could not succeed😅. Please, can you advise?

' |  (\d+[\(\);.-]) ' & _               ; vol, issue, pages

[Attached screenshots]

 

 

Edited by KickStarter15


Posted (edited)

@KickStarter15, you must be much less fuzzy in describing your conditions. if you cannot properly define the data items components and delimiters, how do you expect the computer could?

speaking for myself, i'm no regex expert - not a regex novice even - so i know i cannot maintain such an elaborate code. i would walk the direct path of string manipulation, but first i would properly define the input string structure. read the following code carefully, especially the comments - it is a bit long, but very simple to understand, troubleshoot and maintain.

; ref: https://www.autoitscript.com/forum/topic/198739-craeting-xml-file-based-on-data-text/

Global $aSample[4]
;with "et al.,"
$aSample[0] = 'John J., Gracy D., Jame R., John J., Gracy D., Jame R., John J., Gracy D., Jame R., et al., (2019) This is a sample sentence here. Then another here. 5(2);101-109. doi:1001.10110/aj21.j1j.10.'
;without "et al.,"
$aSample[1] = 'John J., Gracy D., Jame R., John J., Gracy D., Jame R., John J., Gracy D., Jame R., (2019) This is a sample sentence here. Then another here. 5(2);101-109. doi:1001.10110/aj21.j1j.10.'
;without issue number "(2)"
$aSample[2] = 'John J., Gracy D., Jame R., John J., Gracy D., Jame R., John J., Gracy D., Jame R., (2019) This is a sample sentence here. Then another here. 5;101-109. doi:1001.10110/aj21.j1j.10.'
;without "doi:" and issue "(2)" numbers
$aSample[3] = 'John J., Gracy D., Jame R., John J., Gracy D., Jame R., John J., Gracy D., Jame R., (2019) This is a sample sentence here. Then another here. 5;101-109.'

For $i = 0 To UBound($aSample) - 1
    _StringToXML($aSample[$i])
Next




Func _StringToXML($sString)
    ConsoleWrite(@CRLF)
    ConsoleWrite('-source string:' & @CRLF)
    ConsoleWrite($sString & @CRLF)


    ; declare variables for data items
    Local $s_authors, $s_year, $s_title, $s_subtitle, $s_vol, $s_issue, $s_firstpage, $s_lastpage, $s_doi
    ; declare temporary variables
    Local $iPos, $sSubstring, $aSubString, $aSubStringPartial


    ; if 'doi:' exists then it must be after the last space
    $iPos = StringInStr($sString, ' ', Default, -1)
    If $iPos = 0 Then Return SetError(1, 0, '') ; no whitespace -> something went horribly wrong!
    ; get the part of the string after the last space
    $sSubstring = StringRight($sString, StringLen($sString) - $iPos)
    ; check if it is doi. if not, then $s_doi simply remains empty
    If StringLeft($sSubstring, 4) = 'doi:' Then
        ; store the value for later use
        $s_doi = StringTrimLeft($sSubstring, 4)
        ; trim the entire substring from the string, also trim the last whitespace
        $sString = StringTrimRight($sString, StringLen($sSubstring) + 1)
    EndIf
    ; now the input string does not contain the doi part, whether it existed or not


    ; the vol/issue/pages part must be after the last space
    $iPos = StringInStr($sString, ' ', Default, -1)
    If $iPos = 0 Then Return SetError(1, 0, '') ; no whitespace -> something went horribly wrong!
    ; get the part of the string after the last space
    $sSubstring = StringRight($sString, StringLen($sString) - $iPos)
    ; remove dot from the end
    If StringRight($sSubstring, 1) = '.' Then
        $sSubstring = StringTrimRight($sSubstring, 1)
    Else
        Return SetError(1, 0, '') ; no dot -> something went horribly wrong!
    EndIf
    ; split the substring into two parts at the semicolon
    $aSubString = StringSplit($sSubstring, ';')
    If $aSubString[0] <> 2 Then Return SetError(1, 0, '') ; not two parts -> something went horribly wrong!
    ; handle the pages part
    $aSubStringPartial = StringSplit($aSubString[2], '-')
    If $aSubStringPartial[0] <> 2 Then Return SetError(1, 0, '') ; not two page numbers -> something went horribly wrong!
    ; check if page parts are numbers
    If StringIsDigit($aSubStringPartial[1]) And StringIsDigit($aSubStringPartial[2]) Then
        ; store the value for later use
        $s_firstpage = $aSubStringPartial[1]
        $s_lastpage = $aSubStringPartial[2]
        ; trim the entire substring from the string, also trim the last whitespace and dot
        $sString = StringTrimRight($sString, StringLen($sSubstring) + 2)
    Else
        Return SetError(1, 0, '') ; not numbers -> something went horribly wrong!
    EndIf
    ; handle the vol/issue part
    $aSubStringPartial = StringSplit($aSubString[1], '(')
    Switch $aSubStringPartial[0]
        Case 2 ; two parts - vol and issue exist
            If StringRight($aSubStringPartial[2], 1) = ')' Then
                $s_issue = StringTrimRight($aSubStringPartial[2], 1)
                If Not StringIsDigit($s_issue) Then Return SetError(1, 0, '') ; issue is not a number -> something went horribly wrong!
                $s_vol = $aSubStringPartial[1]
                If Not StringIsDigit($s_vol) Then Return SetError(1, 0, '') ; vol is not a number -> something went horribly wrong!
            Else
                Return SetError(1, 0, '') ; issue number does not end with ')' -> something went horribly wrong!
            EndIf
        Case 1 ; one part - only vol exists
            $s_vol = $aSubStringPartial[1]
            If Not StringIsDigit($s_vol) Then Return SetError(1, 0, '') ; vol is not a number -> something went horribly wrong!
        Case Else
            Return SetError(1, 0, '') ; neither one nor two parts -> something went horribly wrong!
    EndSwitch
    ; now the input string does not contain the doi part and the vol/issue/pages part



    ConsoleWrite('-elements: ' & @CRLF)
    ConsoleWrite('vol:       ' & $s_vol & @CRLF)
    ConsoleWrite('issue:     ' & $s_issue & @CRLF)
    ConsoleWrite('firstpage: ' & $s_firstpage & @CRLF)
    ConsoleWrite('lastpage:  ' & $s_lastpage & @CRLF)
    ConsoleWrite('doi:       ' & $s_doi & @CRLF)
    ConsoleWrite('-remaining string: ' & @CRLF)
    ConsoleWrite($sString & @CRLF)
EndFunc   ;==>_StringToXML
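
The function above stops at the "remaining string". As a rough sketch in the same plain string-function style, the next step could split that remainder into authors, year and titles. The helper name `_SplitHead` is hypothetical, and the sketch assumes the year is the first "(...)" group in the string; neither is part of the code above.

```autoit
; Assumption: the remaining string for the sample from the first post
Local $sRemaining = 'John J., Gracy D., Jame R., et al., (2019) This is a sample sentence here. Then another here.'

Local $aHead = _SplitHead($sRemaining)
Local $iErr = @error ; save it right away, before any other call resets it
If $iErr Then
    ConsoleWrite('could not split, error ' & $iErr & @CRLF)
Else
    ConsoleWrite('authors: ' & $aHead[0] & @CRLF) ; the trailing comma is left in place
    ConsoleWrite('year:    ' & $aHead[1] & @CRLF)
    ConsoleWrite('titles:  ' & $aHead[2] & @CRLF)
EndIf

; Hypothetical helper: split "<authors> (<year>) <titles>" into a 3-element array.
Func _SplitHead($sHead)
    Local $iOpen = StringInStr($sHead, '(')
    Local $iClose = StringInStr($sHead, ')')
    If $iOpen = 0 Or $iClose < $iOpen Then Return SetError(1, 0, 0) ; no "(...)" group found
    Local $sYear = StringMid($sHead, $iOpen + 1, $iClose - $iOpen - 1)
    If Not StringIsDigit($sYear) Then Return SetError(2, 0, 0) ; year is not a number
    Local $aParts[3]
    $aParts[0] = StringStripWS(StringLeft($sHead, $iOpen - 1), 3) ; 3 = strip leading and trailing whitespace
    $aParts[1] = $sYear
    $aParts[2] = StringStripWS(StringMid($sHead, $iClose + 1), 3)
    Return $aParts
EndFunc   ;==>_SplitHead
```

The authors part can then be split on ', ' and the titles part on '. ' with StringSplit, checking each piece the same way the vol/issue/pages part is checked above.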

 

oh, and when you post "code" that is not code - please, please, please select "Plain" from the dropdown list at the bottom-right corner.

 

Edited by orbs


Posted

To prevent the array error the index must be checked in the conditions. The hyphen (and other chars) is intentionally left in the array elements to allow an easy later check when building the xml
These checks could be done using a bunch of String* funcs nested or not, but the regex way remains the easier way to check for instance if a string "contains only digit(s) and a trailing hyphen"
As I said before there are many ways to parse the source string. I used one big regular expression but it could be done using several smaller ones as well, or "classic" String* funcs. But whatever the method is the purpose is the same : you have to get a list of "checkable" elements to build a consistent xml

Here is my last try

#Include <Array.au3>

Local $astr[7] = ["John J., Gracy D., Jame R., et al., (2019) This is a sample sentence here. Then another here. 5(2);101-109. doi:1001.10110/aj21.j1j.10.", _ 
"John J., Gracy D., (2019) This is a sample sentence here.", _ 
"John J., Gracy D., (2019). doi:1001.10110/aj21.j1j.10.", _ 
"John J., Gracy D., Jame R., John J., Gracy D., Jame R., (2019) This is a sample sentence here. Then another here. 5(2);101-109. doi:1001.10110/aj21.j1j.10.", _ 
"John J., Gracy D., (2019) This is a sample sentence here. 5(2);101-109. doi:1001.10110/aj21.j1j.10.", _ 
"Jame R., John J., Gracy D., Jame R., (2019) This is a sample sentence here. Then another here. 5;246-251. doi:1001.10110/aj21.j1j.10.", _ 
"John J., Gracy D., Jame R., (2019) This is a sample sentence here. Then another here. 5;101-109."]

$n = 0

For $str In $astr

$res = StringRegExp($str, '(?x) (?|  ' & _
    '([[:alpha:]]+)\h([A-Z]\.) ' & _          ; names
    ' |  ([[a-z\h.]+),\h? ' & _               ; et al
    ' |  \((\d+)\)\.?\h? ' & _                ; year
    ' |  ([A-Z][^.]+\.)\h?  ' & _             ; title, subtitle
    ' |  (\d+[\(\);.-]) ' & _                 ; vol, issue, pages
    ' |  (\w+:\S+)$  ) ' , 3)                 ; url


$n += 1
 _ArrayDisplay($res, $n)

Local $i = 0, $s = '<File xml:id="name-of-filename">' & @crlf & _ 
                     '<Citation type="letter" xml:id="name-of-filename">'& @crlf

While StringRegExp($res[$i], '^[A-Z]') 
     $s &= '<Person>' & '<familyName>' & $res[$i] & '</familyName>' & _
                   '<givenName>' & $res[$i+1] & '</givenName>' & '</Person>' & @crlf 
     $i += 2
Wend

  If StringRegExp($res[$i], '^[a-z\h.]+$') Then
       $s &= $res[$i] & @crlf 
       $i += 1
  EndIf

  If $i < UBound($res) AND StringIsDigit($res[$i]) Then
      $s &= '<pubYear year="' & $res[$i] & '">' & $res[$i] & '</pubYear>' & @crlf 
      $i += 1
  EndIf

  If $i < UBound($res) AND StringIsUpper(StringLeft($res[$i], 1)) Then
      $s &= '<Title>' & $res[$i] & '</Title>' & @crlf
      $i += 1
  EndIf

  If $i < UBound($res) AND StringIsUpper(StringLeft($res[$i], 1)) Then
      $s &= '<SubTitle>' & $res[$i] & '</SubTitle>' & @crlf
      $i += 1
  EndIf

  If $i < UBound($res) AND StringRegExp($res[$i], '^\d+[\(;]$') Then
       $s &= '<vol>' & StringTrimRight($res[$i], 1) & '</vol>' & @crlf
       $i += 1
  EndIf

  If $i < UBound($res) AND StringRegExp($res[$i], '^\d+\)$') Then
       $s &= '<issue>' & StringTrimRight($res[$i], 1) & '</issue>' & @crlf
       $i += 1
  EndIf

  If $i < UBound($res) AND StringRegExp($res[$i], '^\d+-$') Then
       $s &= '<FirstPage>' & StringTrimRight($res[$i], 1) & '</FirstPage>' & @crlf
       $i += 1
  EndIf

  If $i < UBound($res) AND StringRegExp($res[$i], '^\d+\.$') Then
       $s &= '<SecondPage>' & StringTrimRight($res[$i], 1) & '</SecondPage>' & @crlf
       $i += 1
  EndIf

   If $i < UBound($res) AND StringLeft($res[$i], 4) = "doi:" Then
       ; drop the "doi:" prefix to build the doi.org link
       $s &= '<Url href="https://doi.org/' & StringTrimLeft($res[$i], 4) & '">' & StringTrimRight($res[$i], 1) & '</Url>' & @crlf
  EndIf

  $s &= '</Citation>' & @crlf & '</File>' & @crlf

Msgbox(0, $n, $s)
Next

 

Posted

@mikell, Thank you so much. This is perfect, and all the conditions were handled correctly. I only have one more question.

Question:

I changed the expression below to allow for two initials in the given name by adding "+" after the character range A-Z. My question is: when there are initials such as "S-J." or "S -J.", where a hyphen is used, how can I add this to the expression below? I tried checking my guide found here and made some attempts, but still could not get the correct expression I need. I tried using "[^.]+", but it affects the other formats.

$res = StringRegExp($str, '(?x) (?|  ' & _
    '([[:alpha:]]+)\h([A-Z][^.]+\.) ' & _          ; names

Honestly, I really appreciate your time and attention in providing the solution I need, and I know how hard it is when someone is depending on you, but please don't leave me now.😥 There is a lot of confusion in me right now, and only you have pointed me to the right path. I hope this is not your last help, Mikell. Thank you so much!☺️

It's really hard to learn this Regular Expression thing, but I'll do my best to learn this type of coding for future needs.😅


Posted (edited)
11 minutes ago, KickStarter15 said:

([A-Z][^.]+\.)

This part of the pattern already captures a capital letter, immediately followed by one or more characters that are not a dot (from 1 to N characters, greedy), immediately followed by a dot.

Maybe post a runnable example where this pattern is not doing what you are describing :)
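
Note that `[^.]+` needs at least one character between the capital letter and the dot, so a single initial such as "J." no longer matches that part, which may be why the other formats were affected. A possible sketch for the hyphenated initials from the question follows; the test string is made up for illustration, and the pattern is only one way to do it, not the pattern from the posts above.

```autoit
#include <StringConstants.au3>

; Made-up test string: one initial, two initials, "S-J." and "S -J."
Local $sNames = 'John J., Gracy DR., Kim S-J., Lee S -J., (2019)'

; family name, a space, then: one or more capitals, optionally a hyphen
; (with an optional space before it) and more capitals, then the closing dot
Local $sPattern = '([[:alpha:]]+)\h([A-Z]+(?:\h?-[A-Z]+)?\.)'

Local $aNames = StringRegExp($sNames, $sPattern, $STR_REGEXPARRAYGLOBALMATCH)
If @error Then
    ConsoleWrite('no names found' & @CRLF)
Else
    ; every match contributes two elements: family name, then given name
    For $i = 0 To UBound($aNames) - 1 Step 2
        ConsoleWrite('family: ' & $aNames[$i] & '   given: ' & $aNames[$i + 1] & @CRLF)
    Next
EndIf
```

If this is dropped into the big branch-reset pattern, it would replace only the names alternative; the other alternatives stay as they are.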

Edited by FrancescoDiMuro
