[Solved] Creating xml file based on data text

KickStarter15 · May 6, 2019

@FrancescoDiMuro, yup, you're right about that one. I just learned from the expression library (just now) and changed that part to "([[:alpha:]]+)\h(?=[A-Z]) | ([^,]+),\h ", which actually runs correctly and doesn't change the other formats I have.

Here's the code from mikell. It is now running as expected. 😀

#Include <Array.au3>

Local $astr[7] = ["John S -J., Gracy S-J., Jame R., John J., Gracy D., Jame R., (2019) This is a sample sentence here. Then another here. 5(2);101-109. doi:1001.10110/aj21.j1j.10.", _
"John J., Gracy D., (2019) This is a sample sentence here.", _
"John J., Gracy D., (2019). doi:1001.10110/aj21.j1j.10.", _
"John J., Gracy D., Jame R., John J., Gracy D., Jame R., (2019) This is a sample sentence here. Then another here. 5(2);101-109. doi:1001.10110/aj21.j1j.10.", _
"John J., Gracy D., (2019) This is a sample sentence here. 5(2);101-109. doi:1001.10110/aj21.j1j.10.", _
"Jame R., John J., Gracy D., Jame R., (2019) This is a sample sentence here. Then another here. 5;246-251. doi:1001.10110/aj21.j1j.10.", _
"John J., Gracy D., Jame R., (2019) This is a sample sentence here. Then another here. 5;101-109."]

$n = 0

For $str In $astr
;~ '([[:alpha:]]+)\h([A-Z][^.]+\.) ' & _          ; names if hyphen name
$res = StringRegExp($str, '(?x) (?|  ' & _
    '([[:alpha:]]+)\h(?=[A-Z])  |  ([^,]+),\h ' & _          ; names
    ' |  ([[a-z\h.]+),\h? ' & _               ; et al
    ' |  \((\d+)\)\.?\h? ' & _                ; year
    ' |  ([A-Z][^.]+\.)\h?  ' & _             ; title, subtitle
    ' |  (\d+[\(\);.-]) ' & _                 ; vol, issue, pages
    ' |  (\w+:\S+)$  ) ' , 3)                 ; url

;~ $res = StringRegExp($str, '(?x) (?|  ' & _
;~     '([[:alpha:]]+)\h([A-Z]\.) ' & _          ; names
;~     ' |  ([[a-z\h.]+),\h? ' & _               ; et al
;~     ' |  \((\d+)\)\.?\h? ' & _                ; year
;~     ' |  ([A-Z][^.]+\.)\h?  ' & _             ; title, subtitle
;~     ' |  (\d+[\(\);.-]) ' & _                 ; vol, issue, pages
;~     ' |  (\w+:\S+)$  ) ' , 3)                 ; url


$n += 1
 _ArrayDisplay($res, $n)

Local $i = 0, $s = '<File xml:id="name-of-filename">' & @crlf & _
                     '<Citation type="letter" xml:id="name-of-filename">'& @crlf

While StringRegExp($res[$i], '^[A-Z]')
     $s &= '<Person>' & '<familyName>' & $res[$i] & '</familyName>' & _
                   '<givenName>' & $res[$i+1] & '</givenName>' & '</Person>' & @crlf
     $i += 2
Wend

  If StringRegExp($res[$i], '^[a-z\h.]+$') Then
       $s &= $res[$i] & @crlf
       $i += 1
  EndIf

  If $i < UBound($res) AND StringIsDigit($res[$i]) Then
      $s &= '<pubYear year="' & $res[$i] & '">' & $res[$i] & '</pubYear>' & @crlf
      $i += 1
  EndIf

  If $i < UBound($res) AND StringIsUpper(StringLeft($res[$i], 1)) Then
      $s &= '<Title>' & $res[$i] & '</Title>' & @crlf
      $i += 1
  EndIf

  If $i < UBound($res) AND StringIsUpper(StringLeft($res[$i], 1)) Then
      $s &= '<SubTitle>' & $res[$i] & '</SubTitle>' & @crlf
      $i += 1
  EndIf

  If $i < UBound($res) AND StringRegExp($res[$i], '^\d+[\(;]$') Then
       $s &= '<vol>' & StringTrimRight($res[$i], 1) & '</vol>' & @crlf
       $i += 1
  EndIf

  If $i < UBound($res) AND StringRegExp($res[$i], '^\d+\)$') Then
       $s &= '<issue>' & StringTrimRight($res[$i], 1) & '</issue>' & @crlf
       $i += 1
  EndIf

  If $i < UBound($res) AND StringRegExp($res[$i], '^\d+-$') Then
       $s &= '<FirstPage>' & StringTrimRight($res[$i], 1) & '</FirstPage>' & @crlf
       $i += 1
  EndIf

  If $i < UBound($res) AND StringRegExp($res[$i], '^\d+\.$') Then
       $s &= '<SecondPage>' & StringTrimRight($res[$i], 1) & '</SecondPage>' & @crlf
       $i += 1
  EndIf

   If $i < UBound($res) AND StringLeft($res[$i], 4) = "doi:" Then
       $s &= '<Url href="https://' & $res[$i] & '">' & StringTrimRight($res[$i], 1) & '</Url>' & @crlf
  EndIf

  ; close both elements opened above (closing tags use "/", and XML names are case-sensitive)
  $s &= '</Citation>' & @crlf & '</File>' & @crlf

Msgbox(0, $n, $s)
Next

@mikell, thank you so much for the BIG BIG HELP!!!! I've got it now; this expression under the name part, "([[:alpha:]]+)\h(?=[A-Z]) | ([^,]+),\h", is what solved my last question. I hope you'll still be around to help if I find another issue with the code you gave me. Thank you so much for now. THUMBZ Up... 😎

I've posted the working code, adapted to my requirements, to help others who need something like this. Credit to mikell and the other experts who helped me with this. LONG LIVE AUTOIT..... 🙏🙂

mikell · May 6, 2019

45 minutes ago, KickStarter15 said:

"([[:alpha:]]+)\h(?=[A-Z]) | ([^,]+),\h"

Congrats
Finding the solution yourself was nice. Learning regex is not easy, but it's worth it.

KickStarter15 · May 7, 2019

@mikell, first of all, my apologies for this. I found out just yesterday that if the subtitle starts with the word "New", the code will not get the <subTitle> correctly. With any other first word, the code has no issue. It's kind of weird 😅. I checked the expression, and every part looks okay, even after reading the explanations in the regexp library. Could you please check this one? Thank you so much.

Local $astr[7] = ["John S -J., Gracy S-J., Jame R., John J., Gracy D., Jame R., (2019) This is a sample sentence here. New Set Subtitle. 5(2);101-109. doi:1001.10110/aj21.j1j.10.", _
"John J., Gracy D., (2019) This is a sample sentence here.", _
"John J., Gracy D., (2019). doi:1001.10110/aj21.j1j.10.", _
"John J., Gracy D., Jame R., John J., Gracy D., Jame R., (2019) This is a sample sentence here. New York City. 5(2);101-109. doi:1001.10110/aj21.j1j.10.", _
"John J., Gracy D., (2019) This is a sample sentence here. 5(2);101-109. doi:1001.10110/aj21.j1j.10.", _
"Jame R., John J., Gracy D., Jame R., (2019) This is a sample sentence here. New State Academy. 5;246-251. doi:1001.10110/aj21.j1j.10.", _
"John J., Gracy D., Jame R., (2019) This is a sample sentence here. New State. 5;101-109."]

Even if I replace the subtitle with a different one starting with "New", the captured subtitle is incorrect.

Edited May 7, 2019 by KickStarter15

KickStarter15 · May 7, 2019

On 5/4/2019 at 5:03 PM, orbs said:

speaking for myself, i'm no regex expert - not a regex novice even - so i know i cannot maintain such an elaborate code, i would walk the direct path of string manipulation, but first i would properly define the input string structure. read the following code carefully - especially the comments - it is a bit long, but very simple to understand, troubleshoot and maintain.

@orbs, I tried your suggested code, but it only captures the vol, issue, pages and url.

mikell · May 7, 2019

1 hour ago, KickStarter15 said:

if the subtitle starts with the word "New", the code will not get the <subTitle> correctly

It's not because of the word "New"; it's because "New" is followed by a space and an uppercase letter, so the 'name' part of the regex matches.
This means that the part dedicated to matching the names must be more selective, so that it can't match another part of the string. Try this:

'([[:alpha:]]+)\h([A-Z][-.\h]*[A-Z]?\.) ' & _          ; names

or this

'([[:alpha:]]+)\h([A-Z][^,]+),\h ' & _          ; names
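To see the effect in isolation, here is a minimal sketch that tries the three name expressions from this thread on an author fragment and on the start of a subtitle. The two sample fragments and the variable names are assumptions made for illustration; only the patterns come from the posts above, and the (?x) flag and the other alternatives are left out.

```autoit
; Two fragments: an author entry that should be captured as a name,
; and the start of a subtitle that should NOT be captured as a name.
Local $aSamples[2] = ["Gracy D., ", "New Set Subtitle. "]

; The name expressions discussed in this thread, each tried on its own.
Local $aNamePatterns[3] = [ _
    '([[:alpha:]]+)\h(?=[A-Z])', _
    '([[:alpha:]]+)\h([A-Z][-.\h]*[A-Z]?\.)', _
    '([[:alpha:]]+)\h([A-Z][^,]+),\h']

For $sSample In $aSamples
    For $sPattern In $aNamePatterns
        ; With the default flag 0, StringRegExp returns 1 on a match and 0 otherwise.
        Local $sVerdict = StringRegExp($sSample, $sPattern) ? "match" : "no match"
        ConsoleWrite('"' & $sSample & '"  ' & $sPattern & "  -> " & $sVerdict & @CRLF)
    Next
Next
```

The lookahead version reports a match on both fragments, because "New" and "Set" are each followed by a space and an uppercase letter. The two stricter versions match only the author fragment, since they also require a trailing dot or a comma after the initials.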

Edited May 7, 2019 by mikell

KickStarter15 · May 7, 2019

@mikell, yup, it's working now and generating the correct subtitle. Thank you. One last thing: if the url is written as "https://doi.org/10.1016/j.jseaes.2007.10.021", it is not generated in the <url> element, but if it is written as "https/doi:10.1016/j.jseaes.2007.10.021", it works.

I tried changing the expression but I think I'm wrong again.

' |  (\w+:\S+)$  ) ' , 3)                 ; url

Edited May 7, 2019 by KickStarter15

mikell · May 7, 2019

There are two changes to make:

' |  (?:https?://)?(\w+\S+)$  ) ' , 3)    ; url

;....

  If $i < UBound($res) AND StringLeft($res[$i], 3) = "doi" Then  ;<<<<<<
       $s &= '<Url href="https://' & $res[$i] & '">' & StringTrimRight($res[$i], 1) & '</Url>' & @crlf
  EndIf
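As a quick check of why the colon matters, the sketch below runs the url alternatives from this thread on their own against the URL from the question. The variable names are assumptions for illustration, and the other alternatives of the full expression are left out.

```autoit
Local $sUrl = "https://doi.org/10.1016/j.jseaes.2007.10.021"

; url alternatives taken from the posts above, each tried on its own
Local $aUrlPatterns[3] = [ _
    '(\w+:\S+)$', _
    '(?:https?://)?(\w+:\S+)$', _
    '(?:https?://)?(\w+\S+)$']

For $sPattern In $aUrlPatterns
    ; Flag 1 returns the captured groups of the first match, or sets @error if nothing matches.
    Local $aMatch = StringRegExp($sUrl, $sPattern, 1)
    If @error Then
        ConsoleWrite($sPattern & "  -> no match" & @CRLF)
    Else
        ; The second value shows whether the StringLeft(..., 3) = "doi" test would pass.
        ConsoleWrite($sPattern & "  -> " & $aMatch[0] & "  (starts with doi: " & (StringLeft($aMatch[0], 3) = "doi") & ")" & @CRLF)
    EndIf
Next
```

With the colon still inside the capturing group, the only colon available is the one after "https", so the captured text starts with "https" and the "doi" test fails. Without the colon, the optional prefix is skipped and the capture starts at "doi.org/...".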

KickStarter15 · May 8, 2019

@mikell, thanks, but I'm still having the same issue. It still isn't generated in the <url> element. 😥

mikell · May 8, 2019

Please show the whole code you used and the string concerned.

KickStarter15 · May 8, 2019

@mikell, Please see below. Thanks...😅

Local $astr[2] = ["John S -J., Gracy S-J., Jame R., John J., Gracy D., Jame R., (2019) This is a sample sentence here. New Science. 5(2);101-109. https://doi.org/doi:10.1016/j.jseaes.2007.10.021", _
"John S -J., Gracy S-J., Jame R., John J., Gracy D., Jame R., (2019) This is a sample sentence here. New Science. 5(2);101-109. https/doi:10.1016/j.jseaes.2007.10.021"]

$n = 0
For $str In $astr

$res = StringRegExp($str, '(?x) (?|  ' & _
    '([[:alpha:]]+)\h([A-Z][^,]+),\h ' & _          ; names
    ' |  ([[a-z\h.]+),\h? ' & _               ; et al
    ' |  \((\d+)\)\.?\h? ' & _                ; year
    ' |  ([A-Z][^.]+\.)\h?  ' & _             ; title, subtitle
    ' |  (\d+[\(\);.-]) ' & _                 ; vol, issue, pages
    ' |  (?:https?://)?(\w+:\S+)$) ' , 3)                 ; url

$n += 1
 _ArrayDisplay($res, $n)

Local $i = 0, $s = '<File xml:id="name-of-filename">' & @crlf & _
                     '<Citation type="letter" xml:id="name-of-filename">'& @crlf

While StringRegExp($res[$i], '^[A-Z]')
     $s &= '<Person>' & '<familyName>' & $res[$i] & '</familyName>' & _
                   '<givenName>' & $res[$i+1] & '</givenName>' & '</Person>' & @crlf
     $i += 2
Wend

  If StringRegExp($res[$i], '^[a-z\h.]+$') Then
       $s &= $res[$i] & @crlf
       $i += 1
  EndIf

  If $i < UBound($res) AND StringIsDigit($res[$i]) Then
      $s &= '<pubYear year="' & $res[$i] & '">' & $res[$i] & '</pubYear>' & @crlf
      $i += 1
  EndIf

  If $i < UBound($res) AND StringIsUpper(StringLeft($res[$i], 1)) Then
      $s &= '<Title>' & $res[$i] & '</Title>' & @crlf
      $i += 1
  EndIf

  If $i < UBound($res) AND StringIsUpper(StringLeft($res[$i], 1)) Then
      $s &= '<SubTitle>' & $res[$i] & '</SubTitle>' & @crlf
      $i += 1
  EndIf

  If $i < UBound($res) AND StringRegExp($res[$i], '^\d+[\(;]$') Then
       $s &= '<vol>' & StringTrimRight($res[$i], 1) & '</vol>' & @crlf
       $i += 1
  EndIf

  If $i < UBound($res) AND StringRegExp($res[$i], '^\d+\)$') Then
       $s &= '<issue>' & StringTrimRight($res[$i], 1) & '</issue>' & @crlf
       $i += 1
  EndIf

  If $i < UBound($res) AND StringRegExp($res[$i], '^\d+-$') Then
       $s &= '<FirstPage>' & StringTrimRight($res[$i], 1) & '</FirstPage>' & @crlf
       $i += 1
  EndIf

  If $i < UBound($res) AND StringRegExp($res[$i], '^\d+\.$') Then
       $s &= '<SecondPage>' & StringTrimRight($res[$i], 1) & '</SecondPage>' & @crlf
       $i += 1
  EndIf

;~    If $i < UBound($res) AND StringLeft($res[$i], 4) = "doi:" Then
;~        $s &= '<Url href="https://' & $res[$i] & '">' & StringTrimRight($res[$i], 1) & '</Url>' & @crlf
;~   EndIf
  If $i < UBound($res) AND StringLeft($res[$i], 3) = "doi" Then  ;<<<<<<
       $s &= '<Url href="https://' & $res[$i] & '">' & StringTrimRight($res[$i], 1) & '</Url>' & @crlf
  EndIf

  ; close both elements opened above (closing tags use "/", and XML names are case-sensitive)
  $s &= '</Citation>' & @crlf & '</File>' & @crlf

Msgbox(0, $n, $s)
Next

KickStarter15 · May 8, 2019

@mikell, I think I've got it.😅

Changing the below condition from:

If $i < UBound($res) AND StringLeft($res[$i], 3) = "doi" Then  ;<<<<<<
       $s &= '<Url href="https://' & $res[$i] & '">' & StringTrimRight($res[$i], 1) & '</Url>' & @crlf
  EndIf

To this new condition:

If $i < UBound($res) AND StringInStr($res[$i], "doi") > 0 Then  ;<<<<<<
       $s &= '<url href="' & $res[$i] & '">' & $res[$i] & '</url>'
       $i += 1
  EndIf

The url is generating correctly.

mikell · May 8, 2019

Nice
But the problem was that you didn't correctly apply the first change I mentioned (removing the colon in the capturing group).

#Include <Array.au3>

Local $astr[3] = ["John S -J., Gracy S-J., Jame R., John J., Gracy D., Jame R., (2019) This is a sample sentence here. New Science. 5(2);101-109. https://doi.org/doi:10.1016/j.jseaes.2007.10.021.", _
"John S -J., Gracy S-J., Jame R., John J., Gracy D., Jame R., (2019) This is a sample sentence here. New Science. 5(2);101-109. https://doi:10.1016/j.jseaes.2007.10.021.", _
"John S -J., Gracy S-J., Jame R., John J., Gracy D., Jame R., (2019) This is a sample sentence here. New Science. 5(2);101-109. doi:10.1016/j.jseaes.2007.10.021."]

$n = 0
For $str In $astr

$res = StringRegExp($str, '(?x) (?|  ' & _
    '([[:alpha:]]+)\h([A-Z][^,]+),\h ' & _          ; names
    ' |  ([[a-z\h.]+),\h? ' & _               ; et al
    ' |  \((\d+)\)\.?\h? ' & _                ; year
    ' |  ([A-Z][^.]+\.)\h?  ' & _             ; title, subtitle
    ' |  (\d+[\(\);.-]) ' & _                 ; vol, issue, pages
    ' |  (?:https?://)?(\w+\S+)$) ' , 3)                 ; url

$n += 1
 _ArrayDisplay($res, $n)

Local $i = 0, $s = '<File xml:id="name-of-filename">' & @crlf & _
                     '<Citation type="letter" xml:id="name-of-filename">'& @crlf

While StringRegExp($res[$i], '^[A-Z]')
     $s &= '<Person>' & '<familyName>' & $res[$i] & '</familyName>' & _
                   '<givenName>' & $res[$i+1] & '</givenName>' & '</Person>' & @crlf
     $i += 2
Wend

  If StringRegExp($res[$i], '^[a-z\h.]+$') Then
       $s &= $res[$i] & @crlf
       $i += 1
  EndIf

  If $i < UBound($res) AND StringIsDigit($res[$i]) Then
      $s &= '<pubYear year="' & $res[$i] & '">' & $res[$i] & '</pubYear>' & @crlf
      $i += 1
  EndIf

  If $i < UBound($res) AND StringIsUpper(StringLeft($res[$i], 1)) Then
      $s &= '<Title>' & $res[$i] & '</Title>' & @crlf
      $i += 1
  EndIf

  If $i < UBound($res) AND StringIsUpper(StringLeft($res[$i], 1)) Then
      $s &= '<SubTitle>' & $res[$i] & '</SubTitle>' & @crlf
      $i += 1
  EndIf

  If $i < UBound($res) AND StringRegExp($res[$i], '^\d+[\(;]$') Then
       $s &= '<vol>' & StringTrimRight($res[$i], 1) & '</vol>' & @crlf
       $i += 1
  EndIf

  If $i < UBound($res) AND StringRegExp($res[$i], '^\d+\)$') Then
       $s &= '<issue>' & StringTrimRight($res[$i], 1) & '</issue>' & @crlf
       $i += 1
  EndIf

  If $i < UBound($res) AND StringRegExp($res[$i], '^\d+-$') Then
       $s &= '<FirstPage>' & StringTrimRight($res[$i], 1) & '</FirstPage>' & @crlf
       $i += 1
  EndIf

  If $i < UBound($res) AND StringRegExp($res[$i], '^\d+\.$') Then
       $s &= '<SecondPage>' & StringTrimRight($res[$i], 1) & '</SecondPage>' & @crlf
       $i += 1
  EndIf

  If $i < UBound($res) AND StringLeft($res[$i], 3) = "doi" Then  ;<<<<<<
       $s &= '<Url href="https://' & $res[$i] & '">' & StringTrimRight($res[$i], 1) & '</Url>' & @crlf
  EndIf

  ; close both elements opened above (closing tags use "/", and XML names are case-sensitive)
  $s &= '</Citation>' & @crlf & '</File>' & @crlf

Msgbox(0, $n, $s)
Next

KickStarter15 · May 9, 2019

@mikell, thank you so much, yup, it's working now. 😁🤩 I hope no other issue turns up sooner or later. 😅☺️ Thanks.

KickStarter15 · May 9, 2019

@mikell, wow, just now I've run into another issue. I checked the regexp library but could not find how to add this condition to our expressions.

In "Estrada L. A., Díaz J. A., Hernández-Ramírez V. I., ....", some letters carry an accent and one author has a hyphenated family name before the initials. Could you please advise on this one? 😅 My apologies again, mikell.

FrancescoDiMuro · May 9, 2019

@KickStarter15

Place the string "(*UCP)" (Unicode Character Properties) at the beginning of the pattern; it allows you to match Unicode digits and characters

KickStarter15 · May 9, 2019

@FrancescoDiMuro, I didn't quite get what you mean. How do we add this to the expression?

FrancescoDiMuro · May 9, 2019

@KickStarter15

Something like this:

$arrResult = StringRegExp($strString, '(*UCP)(?x)...', $STR_REGEXPARRAYGLOBALMATCH)
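A minimal sketch of the difference, using only the name part of the expression from this thread on one accented author entry. The helper _FamilyName and the variable names are hypothetical, introduced here for illustration.

```autoit
; "Díaz J. A., " built with ChrW() so the result does not depend on how the script file is encoded.
Local $sAuthor = "D" & ChrW(0xED) & "az J. A., "

ConsoleWrite("without (*UCP): " & _FamilyName($sAuthor, '([[:alpha:]]+)\h([A-Z][^,]+),\h') & @CRLF)
ConsoleWrite("with (*UCP):    " & _FamilyName($sAuthor, '(*UCP)([[:alpha:]]+)\h([A-Z][^,]+),\h') & @CRLF)

; Hypothetical helper: returns the first captured group (the family name),
; or a marker string if the pattern does not match at all.
Func _FamilyName($sText, $sPattern)
    Local $aMatch = StringRegExp($sText, $sPattern, 1)
    If @error Then Return "(no match)"
    Return $aMatch[0]
EndFunc
```

Without (*UCP), classes such as [[:alpha:]] cover ASCII letters only, so the first line should show a family name cut off at the accented letter, while the second should show the whole name. The console may need to be set to UTF-8 to display the accented character properly.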

KickStarter15 · May 9, 2019

Thanks, @FrancescoDiMuro. It now captures the accented letters; the remaining problem is the hyphenated name. 😊

FrancescoDiMuro · May 9, 2019

@KickStarter15
The code above seems to work fine.
Post the code you have and the result you obtain from the script

KickStarter15 · May 9, 2019

@FrancescoDiMuro, The code posted by mikell is the current code I'm checking. Please see below.

#Include <Array.au3>

Local $astr[3] = ["Estrada L. A., Díaz J. A., Hernández-Ramírez V. I., Tsitsigiannis I., Bok J., Andes D., Nielson K., Frisvad J., Keller N., (2005) Aspergillus cyclooxygenase-like enzymes are associated with µ prostaglandin production and virulence. Infection and Immunity. 73(8);4548–4559. https://doi.org/10.1128/IAI.73.8.4548-4559.2005.", _
"John S -J., Gracy S-J., Jame R., John J., Gracy D., Jame R., (2019) This is a sample sentence here. New Science. 5(2);101-109. https://doi:10.1016/j.jseaes.2007.10.021.", _
"John S -J., Gracy S-J., Jame R., John J., Gracy D., Jame R., (2019) This is a sample sentence here. New Science. 5(2);101-109. doi:10.1016/j.jseaes.2007.10.021."]

$n = 0
For $str In $astr

$res = StringRegExp($str, '(*UCP)(?x) (?|  ' & _ ; <<< added the (*UCP) here...
    '([[:alpha:]]+)\h([A-Z][^,]+),\h ' & _          ; names
    ' |  ([[a-z\h.]+),\h? ' & _               ; et al
    ' |  \((\d+)\)\.?\h? ' & _                ; year
    ' |  ([A-Z][^.]+\.)\h?  ' & _             ; title, subtitle
    ' |  (\d+[\(\);.-]) ' & _                 ; vol, issue, pages
    ' |  (?:https?://)?(\w+\S+)$) ' , 3)                 ; url

$n += 1
 _ArrayDisplay($res, $n)

Local $i = 0, $s = '<File xml:id="name-of-filename">' & @crlf & _
                     '<Citation type="letter" xml:id="name-of-filename">'& @crlf

While StringRegExp($res[$i], '^[A-Z]')
     $s &= '<Person>' & '<familyName>' & $res[$i] & '</familyName>' & _
                   '<givenName>' & $res[$i+1] & '</givenName>' & '</Person>' & @crlf
     $i += 2
Wend

  If StringRegExp($res[$i], '^[a-z\h.]+$') Then
       $s &= $res[$i] & @crlf
       $i += 1
  EndIf

  If $i < UBound($res) AND StringIsDigit($res[$i]) Then
      $s &= '<pubYear year="' & $res[$i] & '">' & $res[$i] & '</pubYear>' & @crlf
      $i += 1
  EndIf

  If $i < UBound($res) AND StringIsUpper(StringLeft($res[$i], 1)) Then
      $s &= '<Title>' & $res[$i] & '</Title>' & @crlf
      $i += 1
  EndIf

  If $i < UBound($res) AND StringIsUpper(StringLeft($res[$i], 1)) Then
      $s &= '<SubTitle>' & $res[$i] & '</SubTitle>' & @crlf
      $i += 1
  EndIf

  If $i < UBound($res) AND StringRegExp($res[$i], '^\d+[\(;]$') Then
       $s &= '<vol>' & StringTrimRight($res[$i], 1) & '</vol>' & @crlf
       $i += 1
  EndIf

  If $i < UBound($res) AND StringRegExp($res[$i], '^\d+\)$') Then
       $s &= '<issue>' & StringTrimRight($res[$i], 1) & '</issue>' & @crlf
       $i += 1
  EndIf

  If $i < UBound($res) AND StringRegExp($res[$i], '^\d+-$') Then
       $s &= '<FirstPage>' & StringTrimRight($res[$i], 1) & '</FirstPage>' & @crlf
       $i += 1
  EndIf

  If $i < UBound($res) AND StringRegExp($res[$i], '^\d+\.$') Then
       $s &= '<SecondPage>' & StringTrimRight($res[$i], 1) & '</SecondPage>' & @crlf
       $i += 1
  EndIf

  If $i < UBound($res) AND StringLeft($res[$i], 3) = "doi" Then  ;<<<<<<
       $s &= '<Url href="https://' & $res[$i] & '">' & StringTrimRight($res[$i], 1) & '</Url>' & @crlf
  EndIf

  ; close both elements opened above (closing tags use "/", and XML names are case-sensitive)
  $s &= '</Citation>' & @crlf & '</File>' & @crlf

Msgbox(0, $n, $s)
Next

Edited May 9, 2019 by KickStarter15
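The hyphenated family name is left open at this point in the thread. One possible direction, offered only as a sketch and not as anyone's posted solution, is to allow a literal hyphen inside the character class for the family name. The sample fragment and variable names below are assumptions, and only the name part is tried; how the change interacts with the other alternatives of the full expression would still need checking.

```autoit
; Built with ChrW() to stay independent of the script file's encoding:
; "Hernández-Ramírez V. I., Tsitsigiannis I., "
Local $sAuthors = "Hern" & ChrW(0xE1) & "ndez-Ram" & ChrW(0xED) & "rez V. I., Tsitsigiannis I., "

; Assumption: a literal "-" is added at the end of the character class for the family name.
Local $aNames = StringRegExp($sAuthors, '(*UCP)([[:alpha:]-]+)\h([A-Z][^,]+),\h', 3)

If @error Then
    ConsoleWrite("no match" & @CRLF)
Else
    ; Flag 3 returns the groups of every match in sequence: family, given, family, given, ...
    For $i = 0 To UBound($aNames) - 2 Step 2
        ConsoleWrite("family: " & $aNames[$i] & "   given: " & $aNames[$i + 1] & @CRLF)
    Next
EndIf
```

Placing the hyphen last in the class keeps it a literal character instead of a range operator, so "Hernández-Ramírez" is captured as one family name and "V. I." as the given name.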
