[Solved] Creating xml file based on data text


@FrancescoDiMuro, yup, you're right about that one. I just looked it up in the regular expression reference and changed that part to "([[:alpha:]]+)\h(?=[A-Z])  |  ([^,]+),\h ", which runs correctly and doesn't change the other formats I have.

Here's the code from mikell. It now runs as expected. 😀

#Include <Array.au3>

Local $astr[7] = ["John S -J., Gracy S-J., Jame R., John J., Gracy D., Jame R., (2019) This is a sample sentence here. Then another here. 5(2);101-109. doi:1001.10110/aj21.j1j.10.", _
"John J., Gracy D., (2019) This is a sample sentence here.", _
"John J., Gracy D., (2019). doi:1001.10110/aj21.j1j.10.", _
"John J., Gracy D., Jame R., John J., Gracy D., Jame R., (2019) This is a sample sentence here. Then another here. 5(2);101-109. doi:1001.10110/aj21.j1j.10.", _
"John J., Gracy D., (2019) This is a sample sentence here. 5(2);101-109. doi:1001.10110/aj21.j1j.10.", _
"Jame R., John J., Gracy D., Jame R., (2019) This is a sample sentence here. Then another here. 5;246-251. doi:1001.10110/aj21.j1j.10.", _
"John J., Gracy D., Jame R., (2019) This is a sample sentence here. Then another here. 5;101-109."]

$n = 0

For $str In $astr
;~ '([[:alpha:]]+)\h([A-Z][^.]+\.) ' & _          ; names if hyphen name
$res = StringRegExp($str, '(?x) (?|  ' & _
    '([[:alpha:]]+)\h(?=[A-Z])  |  ([^,]+),\h ' & _          ; names
    ' |  ([[a-z\h.]+),\h? ' & _               ; et al
    ' |  \((\d+)\)\.?\h? ' & _                ; year
    ' |  ([A-Z][^.]+\.)\h?  ' & _             ; title, subtitle
    ' |  (\d+[\(\);.-]) ' & _                 ; vol, issue, pages
    ' |  (\w+:\S+)$  ) ' , 3)                 ; url

;~ $res = StringRegExp($str, '(?x) (?|  ' & _
;~     '([[:alpha:]]+)\h([A-Z]\.) ' & _          ; names
;~     ' |  ([[a-z\h.]+),\h? ' & _               ; et al
;~     ' |  \((\d+)\)\.?\h? ' & _                ; year
;~     ' |  ([A-Z][^.]+\.)\h?  ' & _             ; title, subtitle
;~     ' |  (\d+[\(\);.-]) ' & _                 ; vol, issue, pages
;~     ' |  (\w+:\S+)$  ) ' , 3)                 ; url


$n += 1
 _ArrayDisplay($res, $n)

Local $i = 0, $s = '<File xml:id="name-of-filename">' & @crlf & _
                     '<Citation type="letter" xml:id="name-of-filename">'& @crlf

While StringRegExp($res[$i], '^[A-Z]')
     $s &= '<Person>' & '<familyName>' & $res[$i] & '</familyName>' & _
                   '<givenName>' & $res[$i+1] & '</givenName>' & '</Person>' & @crlf
     $i += 2
Wend

  If StringRegExp($res[$i], '^[a-z\h.]+$') Then
       $s &= $res[$i] & @crlf
       $i += 1
  EndIf

  If $i < UBound($res) AND StringIsDigit($res[$i]) Then
      $s &= '<pubYear year="' & $res[$i] & '">' & $res[$i] & '</pubYear>' & @crlf
      $i += 1
  EndIf

  If $i < UBound($res) AND StringIsUpper(StringLeft($res[$i], 1)) Then
      $s &= '<Title>' & $res[$i] & '</Title>' & @crlf
      $i += 1
  EndIf

  If $i < UBound($res) AND StringIsUpper(StringLeft($res[$i], 1)) Then
      $s &= '<SubTitle>' & $res[$i] & '</SubTitle>' & @crlf
      $i += 1
  EndIf

  If $i < UBound($res) AND StringRegExp($res[$i], '^\d+[\(;]$') Then
       $s &= '<vol>' & StringTrimRight($res[$i], 1) & '</vol>' & @crlf
       $i += 1
  EndIf

  If $i < UBound($res) AND StringRegExp($res[$i], '^\d+\)$') Then
       $s &= '<issue>' & StringTrimRight($res[$i], 1) & '</issue>' & @crlf
       $i += 1
  EndIf

  If $i < UBound($res) AND StringRegExp($res[$i], '^\d+-$') Then
       $s &= '<FirstPage>' & StringTrimRight($res[$i], 1) & '</FirstPage>' & @crlf
       $i += 1
  EndIf

  If $i < UBound($res) AND StringRegExp($res[$i], '^\d+\.$') Then
       $s &= '<SecondPage>' & StringTrimRight($res[$i], 1) & '</SecondPage>' & @crlf
       $i += 1
  EndIf

   If $i < UBound($res) AND StringLeft($res[$i], 4) = "doi:" Then
       $s &= '<Url href="https://' & $res[$i] & '">' & StringTrimRight($res[$i], 1) & '</Url>' & @crlf
  EndIf

  ; Close both opened elements; XML end tags use "/" and are case-sensitive
  $s &= '</Citation>' & @crlf & '</File>' & @crlf

Msgbox(0, $n, $s)
Next

 

@mikell, thank you so much for the BIG BIG HELP!!!! I've got it now; this expression, "([[:alpha:]]+)\h(?=[A-Z])  |  ([^,]+),\h", in the names part was all I needed to solve my last question. I hope you'll still be around to help if I find another issue with the code you gave me. Thanks again. THUMBZ Up...😎

I've posted the working code, adapted to my requirements, in case it helps others who need something similar. Credit to mikell and the other experts who helped me with this. LONG LIVE AUTOIT.....🙏🙂

Programming is "To make it so simple that there are obviously no deficiencies" or "To make it so complicated that there are no obvious deficiencies" by C.A.R. Hoare.


@mikell, first of all, my apologies for this. I found out just yesterday that if the subtitle starts with the word "New", the code doesn't capture the <SubTitle> correctly, but with any other first word there is no issue. It's kinda weird 😅. I checked the expression against the explanations in the regexp reference and everything looks okay to me. Could you please check this one? Thank you so much.

 

Local $astr[7] = ["John S -J., Gracy S-J., Jame R., John J., Gracy D., Jame R., (2019) This is a sample sentence here. New Set Subtitle. 5(2);101-109. doi:1001.10110/aj21.j1j.10.", _
"John J., Gracy D., (2019) This is a sample sentence here.", _
"John J., Gracy D., (2019). doi:1001.10110/aj21.j1j.10.", _
"John J., Gracy D., Jame R., John J., Gracy D., Jame R., (2019) This is a sample sentence here. New York City. 5(2);101-109. doi:1001.10110/aj21.j1j.10.", _
"John J., Gracy D., (2019) This is a sample sentence here. 5(2);101-109. doi:1001.10110/aj21.j1j.10.", _
"Jame R., John J., Gracy D., Jame R., (2019) This is a sample sentence here. New State Academy. 5;246-251. doi:1001.10110/aj21.j1j.10.", _
"John J., Gracy D., Jame R., (2019) This is a sample sentence here. New State. 5;101-109."]

Even if I replace the subtitle with another one of this kind, the captured subtitle is incorrect.

Edited by KickStarter15


On 5/4/2019 at 5:03 PM, orbs said:

speaking for myself, i'm no regex expert - not a regex novice even - so i know i cannot maintain such elaborate code. i would walk the direct path of string manipulation, but first i would properly define the input string structure. read the following code carefully, especially the comments - it is a bit long, but very simple to understand, troubleshoot and maintain.

@orbs, I tried your suggested code, but it only captures the vol, issue, pages and url.


1 hour ago, KickStarter15 said:

if the word starts with the word "New", the code will not get the <subTitle> correctly

It's not because of the word "New" itself; it's because "New" is followed by a space and an uppercase letter, so the 'names' part of the regex matches it  :)
This means the part dedicated to matching the names must be more selective, so that it can't match any other part of the string. Try this:

'([[:alpha:]]+)\h([A-Z][-.\h]*[A-Z]?\.) ' & _          ; names

or this

'([[:alpha:]]+)\h([A-Z][^,]+),\h ' & _          ; names
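
The difference can be checked in isolation. The following is only a minimal sketch, assuming a subtitle such as "New York City." from the samples above; the variable names ($sSubtitle, $aLoose, $aStrict) are made up for the illustration and the two patterns are the 'names' alternatives taken out of the full expression.

```autoit
; Sketch: compare the original 'names' alternative with the stricter one
; on a subtitle that starts with "New" followed by a capitalized word.
Local $sSubtitle = "New York City. "

; Original alternative: a word, a space, then only a lookahead for an uppercase letter
Local $aLoose = StringRegExp($sSubtitle, '([[:alpha:]]+)\h(?=[A-Z])', 3)

; Stricter alternative: word, space, uppercase letter, and the match must run up to a comma
Local $aStrict = StringRegExp($sSubtitle, '([[:alpha:]]+)\h([A-Z][^,]+),\h', 3)

; StringRegExp returns a non-array value (and sets @error) when nothing matches
ConsoleWrite("loose matches:  " & (IsArray($aLoose) ? UBound($aLoose) : 0) & @CRLF)
ConsoleWrite("strict matches: " & (IsArray($aStrict) ? UBound($aStrict) : 0) & @CRLF)
```

The loose pattern should pick up "New" and "York" as if they were family names, which would leave only "City." for the title alternative, while the stricter pattern should find nothing because the subtitle contains no comma.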

 

Edited by mikell

@mikell, yup, it's working now and generating the correct subtitle. Thank you. One last thing: if the url is written as "https://doi.org/10.1016/j.jseaes.2007.10.021", it is not output in the <url> element, but if it is written as "https/doi:10.1016/j.jseaes.2007.10.021", it works.

I tried changing the expression, but I think I got it wrong again.

' |  (\w+:\S+)$  ) ' , 3)                 ; url

 

Edited by KickStarter15


@mikell, Please see below. Thanks...😅

Local $astr[2] = ["John S -J., Gracy S-J., Jame R., John J., Gracy D., Jame R., (2019) This is a sample sentence here. New Science. 5(2);101-109. https://doi.org/doi:10.1016/j.jseaes.2007.10.021", _
"John S -J., Gracy S-J., Jame R., John J., Gracy D., Jame R., (2019) This is a sample sentence here. New Science. 5(2);101-109. https/doi:10.1016/j.jseaes.2007.10.021"]

$n = 0
For $str In $astr

$res = StringRegExp($str, '(?x) (?|  ' & _
    '([[:alpha:]]+)\h([A-Z][^,]+),\h ' & _          ; names
    ' |  ([[a-z\h.]+),\h? ' & _               ; et al
    ' |  \((\d+)\)\.?\h? ' & _                ; year
    ' |  ([A-Z][^.]+\.)\h?  ' & _             ; title, subtitle
    ' |  (\d+[\(\);.-]) ' & _                 ; vol, issue, pages
    ' |  (?:https?://)?(\w+:\S+)$) ' , 3)                 ; url

$n += 1
 _ArrayDisplay($res, $n)

Local $i = 0, $s = '<File xml:id="name-of-filename">' & @crlf & _
                     '<Citation type="letter" xml:id="name-of-filename">'& @crlf

While StringRegExp($res[$i], '^[A-Z]')
     $s &= '<Person>' & '<familyName>' & $res[$i] & '</familyName>' & _
                   '<givenName>' & $res[$i+1] & '</givenName>' & '</Person>' & @crlf
     $i += 2
Wend

  If StringRegExp($res[$i], '^[a-z\h.]+$') Then
       $s &= $res[$i] & @crlf
       $i += 1
  EndIf

  If $i < UBound($res) AND StringIsDigit($res[$i]) Then
      $s &= '<pubYear year="' & $res[$i] & '">' & $res[$i] & '</pubYear>' & @crlf
      $i += 1
  EndIf

  If $i < UBound($res) AND StringIsUpper(StringLeft($res[$i], 1)) Then
      $s &= '<Title>' & $res[$i] & '</Title>' & @crlf
      $i += 1
  EndIf

  If $i < UBound($res) AND StringIsUpper(StringLeft($res[$i], 1)) Then
      $s &= '<SubTitle>' & $res[$i] & '</SubTitle>' & @crlf
      $i += 1
  EndIf

  If $i < UBound($res) AND StringRegExp($res[$i], '^\d+[\(;]$') Then
       $s &= '<vol>' & StringTrimRight($res[$i], 1) & '</vol>' & @crlf
       $i += 1
  EndIf

  If $i < UBound($res) AND StringRegExp($res[$i], '^\d+\)$') Then
       $s &= '<issue>' & StringTrimRight($res[$i], 1) & '</issue>' & @crlf
       $i += 1
  EndIf

  If $i < UBound($res) AND StringRegExp($res[$i], '^\d+-$') Then
       $s &= '<FirstPage>' & StringTrimRight($res[$i], 1) & '</FirstPage>' & @crlf
       $i += 1
  EndIf

  If $i < UBound($res) AND StringRegExp($res[$i], '^\d+\.$') Then
       $s &= '<SecondPage>' & StringTrimRight($res[$i], 1) & '</SecondPage>' & @crlf
       $i += 1
  EndIf

;~    If $i < UBound($res) AND StringLeft($res[$i], 4) = "doi:" Then
;~        $s &= '<Url href="https://' & $res[$i] & '">' & StringTrimRight($res[$i], 1) & '<\Url>' & @crlf
;~   EndIf
  If $i < UBound($res) AND StringLeft($res[$i], 3) = "doi" Then  ;<<<<<<
       $s &= '<Url href="https://' & $res[$i] & '">' & StringTrimRight($res[$i], 1) & '</Url>' & @crlf
  EndIf

  ; Close both opened elements; XML end tags use "/" and are case-sensitive
  $s &= '</Citation>' & @crlf & '</File>' & @crlf

Msgbox(0, $n, $s)
Next

 


@mikell, I think I've got it.😅

Changing the below condition from:

If $i < UBound($res) AND StringLeft($res[$i], 3) = "doi" Then  ;<<<<<<
       $s &= '<Url href="https://' & $res[$i] & '">' & StringTrimRight($res[$i], 1) & '</Url>' & @crlf
  EndIf

To this new condition:

If $i < UBound($res) AND StringInStr($res[$i], "doi") > 0 Then  ;<<<<<<
       $s &= '<url href="' & $res[$i] & '">' & $res[$i] & '</url>'
       $i += 1
  EndIf

With this change, the url is generated correctly.


Nice
But the real problem was that you didn't correctly apply the first change I mentioned (removing the colon in the capturing group) ;)

#Include <Array.au3>

Local $astr[3] = ["John S -J., Gracy S-J., Jame R., John J., Gracy D., Jame R., (2019) This is a sample sentence here. New Science. 5(2);101-109. https://doi.org/doi:10.1016/j.jseaes.2007.10.021.", _
"John S -J., Gracy S-J., Jame R., John J., Gracy D., Jame R., (2019) This is a sample sentence here. New Science. 5(2);101-109. https://doi:10.1016/j.jseaes.2007.10.021.", _
"John S -J., Gracy S-J., Jame R., John J., Gracy D., Jame R., (2019) This is a sample sentence here. New Science. 5(2);101-109. doi:10.1016/j.jseaes.2007.10.021."]

$n = 0
For $str In $astr

$res = StringRegExp($str, '(?x) (?|  ' & _
    '([[:alpha:]]+)\h([A-Z][^,]+),\h ' & _          ; names
    ' |  ([[a-z\h.]+),\h? ' & _               ; et al
    ' |  \((\d+)\)\.?\h? ' & _                ; year
    ' |  ([A-Z][^.]+\.)\h?  ' & _             ; title, subtitle
    ' |  (\d+[\(\);.-]) ' & _                 ; vol, issue, pages
    ' |  (?:https?://)?(\w+\S+)$) ' , 3)                 ; url

$n += 1
 _ArrayDisplay($res, $n)

Local $i = 0, $s = '<File xml:id="name-of-filename">' & @crlf & _
                     '<Citation type="letter" xml:id="name-of-filename">'& @crlf

While StringRegExp($res[$i], '^[A-Z]')
     $s &= '<Person>' & '<familyName>' & $res[$i] & '</familyName>' & _
                   '<givenName>' & $res[$i+1] & '</givenName>' & '</Person>' & @crlf
     $i += 2
Wend

  If StringRegExp($res[$i], '^[a-z\h.]+$') Then
       $s &= $res[$i] & @crlf
       $i += 1
  EndIf

  If $i < UBound($res) AND StringIsDigit($res[$i]) Then
      $s &= '<pubYear year="' & $res[$i] & '">' & $res[$i] & '</pubYear>' & @crlf
      $i += 1
  EndIf

  If $i < UBound($res) AND StringIsUpper(StringLeft($res[$i], 1)) Then
      $s &= '<Title>' & $res[$i] & '</Title>' & @crlf
      $i += 1
  EndIf

  If $i < UBound($res) AND StringIsUpper(StringLeft($res[$i], 1)) Then
      $s &= '<SubTitle>' & $res[$i] & '</SubTitle>' & @crlf
      $i += 1
  EndIf

  If $i < UBound($res) AND StringRegExp($res[$i], '^\d+[\(;]$') Then
       $s &= '<vol>' & StringTrimRight($res[$i], 1) & '</vol>' & @crlf
       $i += 1
  EndIf

  If $i < UBound($res) AND StringRegExp($res[$i], '^\d+\)$') Then
       $s &= '<issue>' & StringTrimRight($res[$i], 1) & '</issue>' & @crlf
       $i += 1
  EndIf

  If $i < UBound($res) AND StringRegExp($res[$i], '^\d+-$') Then
       $s &= '<FirstPage>' & StringTrimRight($res[$i], 1) & '</FirstPage>' & @crlf
       $i += 1
  EndIf

  If $i < UBound($res) AND StringRegExp($res[$i], '^\d+\.$') Then
       $s &= '<SecondPage>' & StringTrimRight($res[$i], 1) & '</SecondPage>' & @crlf
       $i += 1
  EndIf

  If $i < UBound($res) AND StringLeft($res[$i], 3) = "doi" Then  ;<<<<<<
       $s &= '<Url href="https://' & $res[$i] & '">' & StringTrimRight($res[$i], 1) & '</Url>' & @crlf
  EndIf

  ; Close both opened elements; XML end tags use "/" and are case-sensitive
  $s &= '</Citation>' & @crlf & '</File>' & @crlf

Msgbox(0, $n, $s)
Next
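
The effect of dropping the colon from the capturing group can be seen on the url alone. This is only a rough sketch under the assumption that the url sits at the very end of the string; $sUrl, $aWithColon and $aNoColon are names invented for the example, and flag 1 is used so that only the first match's captured group is returned.

```autoit
; Sketch: same url, url alternative with and without the colon in the capturing group
Local $sUrl = "https://doi.org/10.1016/j.jseaes.2007.10.021"

; Colon still required inside the group: "doi" is followed by ".", not ":",
; so the optional (?:https?://) prefix has to be skipped and "https:" is captured instead
Local $aWithColon = StringRegExp($sUrl, '(?:https?://)?(\w+:\S+)$', 1)

; Colon removed: the prefix is consumed outside the group and the capture starts at "doi"
Local $aNoColon = StringRegExp($sUrl, '(?:https?://)?(\w+\S+)$', 1)

If IsArray($aWithColon) Then ConsoleWrite("with colon: " & $aWithColon[0] & @CRLF)
If IsArray($aNoColon) Then ConsoleWrite("no colon:   " & $aNoColon[0] & @CRLF)
```

With the colon in place the capture should begin with "https://", so the later StringLeft($res[$i], 3) = "doi" test fails; without it the capture should begin with "doi.org/" and the test passes.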


 


@mikell, thank you so much, yup, it's working now. 😁🤩 Hope no other issues come up sooner or later. 😅☺️ Thanks.


@mikell, wow, I've just run into another issue. I checked the regexp reference but couldn't find how to add this condition to our expressions.

In "Estrada L. A., Díaz J. A., Hernández-Ramírez V. I., ....", some names contain accented letters and one has a hyphenated family name before the initials. Could you please advise on this one? 😅 My apologies again, mikell.


@FrancescoDiMuro, The code posted by mikell is the current code I'm checking. Please see below.

#Include <Array.au3>

Local $astr[3] = ["Estrada L. A., Díaz J. A., Hernández-Ramírez V. I., Tsitsigiannis I., Bok J., Andes D., Nielson K., Frisvad J., Keller N., (2005) Aspergillus cyclooxygenase-like enzymes are associated with µ prostaglandin production and virulence. Infection and Immunity. 73(8);4548–4559. https://doi.org/10.1128/IAI.73.8.4548-4559.2005.", _
"John S -J., Gracy S-J., Jame R., John J., Gracy D., Jame R., (2019) This is a sample sentence here. New Science. 5(2);101-109. https://doi:10.1016/j.jseaes.2007.10.021.", _
"John S -J., Gracy S-J., Jame R., John J., Gracy D., Jame R., (2019) This is a sample sentence here. New Science. 5(2);101-109. doi:10.1016/j.jseaes.2007.10.021."]

$n = 0
For $str In $astr

$res = StringRegExp($str, '(*UCP)(?x) (?|  ' & _ ; <<< added the (*UCP) here...
    '([[:alpha:]]+)\h([A-Z][^,]+),\h ' & _          ; names
    ' |  ([[a-z\h.]+),\h? ' & _               ; et al
    ' |  \((\d+)\)\.?\h? ' & _                ; year
    ' |  ([A-Z][^.]+\.)\h?  ' & _             ; title, subtitle
    ' |  (\d+[\(\);.-]) ' & _                 ; vol, issue, pages
    ' |  (?:https?://)?(\w+\S+)$) ' , 3)                 ; url

$n += 1
 _ArrayDisplay($res, $n)

Local $i = 0, $s = '<File xml:id="name-of-filename">' & @crlf & _
                     '<Citation type="letter" xml:id="name-of-filename">'& @crlf

While StringRegExp($res[$i], '^[A-Z]')
     $s &= '<Person>' & '<familyName>' & $res[$i] & '</familyName>' & _
                   '<givenName>' & $res[$i+1] & '</givenName>' & '</Person>' & @crlf
     $i += 2
Wend

  If StringRegExp($res[$i], '^[a-z\h.]+$') Then
       $s &= $res[$i] & @crlf
       $i += 1
  EndIf

  If $i < UBound($res) AND StringIsDigit($res[$i]) Then
      $s &= '<pubYear year="' & $res[$i] & '">' & $res[$i] & '</pubYear>' & @crlf
      $i += 1
  EndIf

  If $i < UBound($res) AND StringIsUpper(StringLeft($res[$i], 1)) Then
      $s &= '<Title>' & $res[$i] & '</Title>' & @crlf
      $i += 1
  EndIf

  If $i < UBound($res) AND StringIsUpper(StringLeft($res[$i], 1)) Then
      $s &= '<SubTitle>' & $res[$i] & '</SubTitle>' & @crlf
      $i += 1
  EndIf

  If $i < UBound($res) AND StringRegExp($res[$i], '^\d+[\(;]$') Then
       $s &= '<vol>' & StringTrimRight($res[$i], 1) & '</vol>' & @crlf
       $i += 1
  EndIf

  If $i < UBound($res) AND StringRegExp($res[$i], '^\d+\)$') Then
       $s &= '<issue>' & StringTrimRight($res[$i], 1) & '</issue>' & @crlf
       $i += 1
  EndIf

  If $i < UBound($res) AND StringRegExp($res[$i], '^\d+-$') Then
       $s &= '<FirstPage>' & StringTrimRight($res[$i], 1) & '</FirstPage>' & @crlf
       $i += 1
  EndIf

  If $i < UBound($res) AND StringRegExp($res[$i], '^\d+\.$') Then
       $s &= '<SecondPage>' & StringTrimRight($res[$i], 1) & '</SecondPage>' & @crlf
       $i += 1
  EndIf

  If $i < UBound($res) AND StringLeft($res[$i], 3) = "doi" Then  ;<<<<<<
       $s &= '<Url href="https://' & $res[$i] & '">' & StringTrimRight($res[$i], 1) & '</Url>' & @crlf
  EndIf

  ; Close both opened elements; XML end tags use "/" and are case-sensitive
  $s &= '</Citation>' & @crlf & '</File>' & @crlf

Msgbox(0, $n, $s)
Next
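
One possible direction for the accented and hyphenated names, offered only as an untested sketch and not as something confirmed in this thread: (*UCP) should let [[:alpha:]] match accented letters, and adding a hyphen to the family-name character class is an assumption made here so that a double-barrelled name stays in one piece. The names $sNames and $aNames are hypothetical, and the accented letters are built with ChrW() so the snippet does not depend on the script file's encoding.

```autoit
#include <Array.au3>

; Sketch only: 'names' alternative with (*UCP) and a hyphen allowed in the family name.
; ChrW(0xED) is "í" and ChrW(0xE1) is "á".
Local $sNames = "Estrada L. A., D" & ChrW(0xED) & "az J. A., Hern" & ChrW(0xE1) & _
        "ndez-Ram" & ChrW(0xED) & "rez V. I., "

; [[:alpha:]-]+ : letters (Unicode-aware under (*UCP)) or a literal hyphen
Local $aNames = StringRegExp($sNames, '(*UCP)([[:alpha:]-]+)\h([A-Z][^,]+),\h', 3)

If IsArray($aNames) Then
    ; Expect alternating family name / initials entries
    ConsoleWrite("captures: " & UBound($aNames) & @CRLF)
    _ArrayDisplay($aNames, "names")
EndIf
```

If this behaves as intended, the array should alternate family names and initials, with the hyphenated name kept as a single family name instead of being picked up by the title alternative.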
Edited by KickStarter15
