[SOLVED] Extracting all text from a file that starts with >"text": "< and ends with >", "timestamp":<

Fr33b0w · March 5, 2022

There must be a very simple solution to this "problem". I know how I would do it if there were just a few, but I need to do it for every instance, and there might be 1000 of them.

Quote

QVzADyw", "author_thumbnail": "https://yt3.ggpht.com/ytc/AKedOLTt1rCz0emlXz_QUVNB7T1AH11QBO13oYbFZw=s176-c-k-c0x00ffffff-no-rj", "author_is_uploader": false, "parent": "root"}, {"id": "UgxjxCRsERmHwFHpnXN4AaABAg", "text": "Some Youtube comment as example", "timestamp": 1583452800, "time_text": "2 years ago", "like_count": 1, "is_favorited": false, "author": "Aramis Papadopulos", "author_id": "UCRkVkOOpOBYkdrvGLmAETLQ", "author_thumbnail": "https://yt3.ggpht.com/ytc/AKedOLSbooVjQtUXaSGBpjrxOWh3kVTfXLRpsvcmu-AP=s176-c-k-c0x00ffffff-no-rj", "author_is_uploader": false, "parent": "root"}, {"id": "UgxbPLIT8b8pFW3K3f54AaABAg", "text": "some other text example", "timestamp": 1583452800, "time_text": "2 years ago", "like_count": 0, "is_favorited": false, "author": "BL1TZ", "author_id": "UCXqbwq4gWUpk8yJC561Pqew", "author_thumbnail": "https://yt3.ggpht.com/ytc/AKedOLRmm233HPvOaL-ohfRufFpnpAaHEoaMxKcypVru=s176-c-k-c0x00ffffff-no-rj", "author_is_uploader": false, "parent": "root"}, {"id": "UgwRoRLyWM_TIiaTEZp4AaABAg", "text": "another text example and so on", "timestamp": 1583452800, "time_text": "2 years ago", "like_count": 6, "is_favorited": false,

I guess it is just a few lines of code. I am not so good with regular expressions, so a solution without them would be much appreciated, for my better understanding. I have used AutoIt for a long time, but I am not an expert, I am recovering from corona illness, and I haven't been coding for some time. So, if any good soul would give me a hint: the comments I would like to extract are between "text": " and ", "timestamp": . Anyone? Thanks!

Edited March 8, 2022 by Fr33b0w

Trong · March 5, 2022

I don't know how to use RegEx but you can use _StringBetween():

#include <String.au3>
Local $InputData = '"text": "Some Youtube comment as example", "timestamp":346230, "text": "SomeSDGs example", "timestamp": 15833460, "text": "Some YoutFGNSFGJnt as example", "timestamp": 45634572800, "'
$InputData = StringReplace($InputData, ', "', ',"')
$InputData = StringReplace($InputData, '": ', '":')
Local $textArray = _StringBetween($InputData, '"text":', ',"')
If IsArray($textArray) Then
    For $i = 0 To UBound($textArray) - 1
        ConsoleWrite($textArray[$i] & @CRLF)
    Next
EndIf

Local $timestampArray = _StringBetween($InputData, '"timestamp":', ',"')
If IsArray($timestampArray) Then
    For $i = 0 To UBound($timestampArray) - 1
        ConsoleWrite($timestampArray[$i] & @CRLF)
    Next
EndIf

Subz · March 5, 2022

Or something like:

#include <Array.au3>
Global $sText = '"text": "Some Youtube comment as example", "timestamp":346230, "text": "SomeSDGs example", "timestamp": 15833460, "text": "Some YoutFGNSFGJnt as example", "timestamp": 45634572800, "'
Global $aText = StringRegExp($sText, '(?<=\"text\": \").*?(?=\", \"timestamp\")', 3)
_ArrayDisplay($aText)
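
For readers who find the pattern opaque, here is an annotated sketch of the same call. The sample string, the variable names and the error check are illustrative additions, not part of the post above. The pattern is the same, except that the backslashes before the quotes are dropped, since a double quote needs no escaping inside a single-quoted AutoIt string.

```autoit
#include <Array.au3>

; Sample input in the same shape as the data quoted in the first post (made up for illustration).
Local $sText = '"text": "first comment", "timestamp": 1583452800, "text": "second comment", "timestamp": 1583452800'

; (?<="text": ")      look-behind: the match must be preceded by  "text": "
; .*?                 lazy match: as few characters as possible
; (?=", "timestamp")  look-ahead: the match must be followed by  ", "timestamp"
; Neither look-around is included in the returned text.
Local $sPattern = '(?<="text": ").*?(?=", "timestamp")'

; Flag 3 returns every match in a 0-based array; @error is set when nothing matches.
Local $aText = StringRegExp($sText, $sPattern, 3)
If @error Then
    ConsoleWrite("No match found" & @CRLF)
Else
    _ArrayDisplay($aText, "Extracted comments")
EndIf
```

Because the look-arounds only assert what surrounds the match, the quotes and key names themselves never end up in the result array.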

Fr33b0w · March 5, 2022

Thank you very much, VIP. This solves my problem and does exactly what I wanted to achieve. It was not as simple as I thought it would be, so sorry for that. I am learning from your example. I wish you a good and healthy life.

Reedit: Thanks Subz! This regex I can understand and learn from. Guys, thanks a lot. You made my day.

Edited March 5, 2022 by Fr33b0w

Fr33b0w · March 18, 2022

Sorry... I still have some problems with this. It won't process all the files. I tried renaming them and changing the code, but it doesn't work. It processes 223 files out of 327 and I don't know why.

The script I am trying to use is:

#include <String.au3>
#include <Array.au3>


Local $search = FileFindFirstFile("*.info.json")
DirCreate(@ScriptDir & "\comments\")
Local $dir = @ScriptDir & "\comments\"

If $search = -1 Then
    MsgBox($MB_SYSTEMMODAL, "", "Error: No files/directories matched the search pattern.")
    Return False
EndIf

While 1
    Local $file = FileFindNextFile($search)
    If @error Then ExitLoop
    Local $target = StringReplace($file, '.info.json', '.txt')
    Local $InputData = FileRead($file)

    $InputData = StringReplace($InputData, ', "', ',"')
    $InputData = StringReplace($InputData, '": ', '":')
    Local $textArray = _StringBetween($InputData, '"text":', ',"')
    If IsArray($textArray) Then
        For $i = 0 To UBound($textArray) - 1
            FileWriteLine($dir & $target, @CRLF & " * " & $textArray[$i] & @CRLF)
        Next
    EndIf

    Local $timestampArray = _StringBetween($InputData, '"timestamp":', ',"')
    If IsArray($timestampArray) Then
        For $i = 0 To UBound($timestampArray) - 1
            FileWriteLine($dir & $target, @CRLF & " * " & $textArray[$i] & @CRLF)
        Next
    EndIf
    FileClose($dir & $target)
WEnd

Exit

I have attached the files I am trying to scrape. I left them in the same folder as the target files. Thanks.

test.zip

Edited March 18, 2022 by Fr33b0w
I hadn't entered how many files were processed out of how many were targeted... Brain burnt by a non-working script...

Nine · March 18, 2022

A few suggestions for your script:

1- Use _FileListToArray instead of FileFindFirstFile/FileFindNextFile. You can then use _ArrayDisplay to make sure you got all the files in the array.

2- Your second FileWriteLine should use $timestampArray instead of $textArray.

3- FileClose on a file name is useless (see the help file: it expects a handle).

4- You should add a ConsoleWrite warning when your _StringBetween call does not work.

5- Adding traces to a script to understand what is going on is the best way to debug...
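
The suggestions above could be combined roughly as follows. This is a minimal sketch, not a drop-in fix: it assumes the .info.json files sit in @ScriptDir, it overwrites each output file instead of appending, it writes only the "text" values (the timestamp loop is left out for brevity), and all variable names are illustrative.

```autoit
#include <File.au3>
#include <String.au3>

Local $sOutDir = @ScriptDir & "\comments\"
DirCreate($sOutDir)

; 1- List all matching files up front; element [0] holds the count.
Local $aFiles = _FileListToArray(@ScriptDir, "*.info.json", $FLTA_FILES)
If @error Then
    ConsoleWrite("! No *.info.json files found in " & @ScriptDir & @CRLF)
    Exit
EndIf
ConsoleWrite("Found " & $aFiles[0] & " file(s)" & @CRLF)

For $i = 1 To $aFiles[0]
    Local $sData = FileRead(@ScriptDir & "\" & $aFiles[$i])
    $sData = StringReplace($sData, ', "', ',"')
    $sData = StringReplace($sData, '": ', '":')

    Local $aText = _StringBetween($sData, '"text":', ',"')
    If @error Then
        ; 4- Warn about files where nothing was extracted.
        ConsoleWrite("! Nothing extracted from " & $aFiles[$i] & @CRLF)
        ContinueLoop
    EndIf

    ; 3- FileClose needs the handle returned by FileOpen, not a file name.
    Local $hOut = FileOpen($sOutDir & StringReplace($aFiles[$i], ".info.json", ".txt"), $FO_OVERWRITE)
    If $hOut = -1 Then
        ConsoleWrite("! Cannot open output file for " & $aFiles[$i] & @CRLF)
        ContinueLoop
    EndIf
    For $j = 0 To UBound($aText) - 1
        FileWriteLine($hOut, " * " & $aText[$j])
    Next
    FileClose($hOut)

    ; 5- One trace line per file makes skipped files easy to spot.
    ConsoleWrite("+ " & $aFiles[$i] & ": " & UBound($aText) & " comment(s)" & @CRLF)
Next
```

Comparing the "Found" count with the number of "+" and "!" lines in the console should show which files are skipped and at which step.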

Edited March 18, 2022 by Nine

Fr33b0w · March 18, 2022

Thanks. I decided to use the second example, which I can see is better but is far beyond my level of knowledge. It works even better than the first one, but it is much harder to play with. This way it looks like the script is playing with me... The problem is that in this case I can't add @CRLF after every piece of text that is found, and I don't know how to do that. I tried using StringReplace to replace every @CRLF with two, so that I would get a blank line after every piece of text found, but I am not good with arrays and RegEx, and I got nothing.

I am still using FileFindFirstFile/FileFindNextFile instead of _FileListToArray as you suggested, but only because I would like to make this code work on ground where I am more comfortable; after that I could try to do it another way. For some people this is a piece of cake, and for me it is the rest of that cake... How do I add a @CRLF or @CR that will work?

#include <String.au3>
#include <Array.au3>
#include <File.au3>


Local $search = FileFindFirstFile("*.info.json")
DirCreate(@ScriptDir & "\comments\")
Local $dir = @ScriptDir & "\comments\"

If $search = -1 Then
    MsgBox($MB_SYSTEMMODAL, "", "Error: No files/directories matched the search pattern.")
    Return False
EndIf

While 1
    Local $file = FileFindNextFile($search)
    If @error Then ExitLoop
    Local $target = StringReplace($file, '.info.json', '.txt')
    Local $InputDataa = FileRead($file)

    Global $InputDatab = StringRegExp($InputDataa, '(?<=\"text\": \").*?(?=\", \"timestamp\")', 3)
    ;_ArrayDisplay($InputDatab)
    _FileWriteFromArray($dir & $target, $InputDatab, 1)
WEnd

Exit

Nine · March 18, 2022

Replace your _FileWriteFromArray line by this one:

FileWriteLine($dir & $target, _ArrayToString($InputDatab, "|"))

Fr33b0w · March 18, 2022

5 minutes ago, Nine said:
Replace your _FileWriteFromArray line by this one:
FileWriteLine($dir & $target, _ArrayToString($InputDatab, "|"))

I had seen that default array delimiter in the help file but wasn't sure how to use it. It now replaces the existing carriage return with "|". Any tip for that?

So, from:

Line 1

Line 2

Line 3

I am getting Line 1|Line 2|Line 3

Edited March 18, 2022 by Fr33b0w

Nine · March 18, 2022

Replace "|" by @CRLF

Fr33b0w · March 18, 2022

FileWriteLine($dir & $target, _ArrayToString($InputDatab, @CRLF & @CRLF))

Just for the record, it had to go like this. Thanks a ton, I have solved the problem and done what I wanted to do!

Fr33b0w · April 25, 2024

Hi, sorry for bumping an old post, but I have a problem again because the site's code changed. Everything worked fine, but now there is a new field which prevents this regex from working. Instead of "author_id" as the closing marker, there is now sometimes "like_count"; "author_id" is still there, but only after a lot more code that I don't need to extract. I tried to use a delimiter in the regex, but I guess regex is not easy for me... Can someone just suggest how to write a regex which says: get the text from here to (here or here)? I tried to write it like this:

Global $InputDatab = StringRegExp($InputDataa, '(?<=\"text\": \").*?(?=\", \"author_id\|like_count\")', 3)

...but it didn't work. The line it replaced took the data from "text": to "timestamp":

Global $InputDatab = StringRegExp($InputDataa, '(?<=\"text\": \").*?(?=\", \"timestamp\")', 3)

Here is an example of text which is in .info.json:

Quote

"text": "8 hours later the Fire HD 8 is $109. 99. I wish I would have gotten to watch this earlier. \nThanks for all you do Matt even if I'm late to the party.", "like_count": 1, "author_id": "UCWFKQey1WtCgGyxHPMhPtGQ", "author": "@kaceycampbell5550", "author_thumbnail": "https://yt3.ggpht.com/ytc/AOPolaSLWOprKke3uCsTselIrClAYoEM8RqDNcgadJvxBg=s176-c-k-c0x00ffffff-no-rj", "parent": "root", "_time_text": "2 years ago", "timestamp": 1630454400, "author_url": "https://www.youtube.com/channel/UCWFKQey1WtCgGyxHPMhPtGQ", "author_is_uploader": false, "is_favorited": false}, {"id": "Ugx1HNwbzMS9V0pUrgN4AaABAg", "text": "Lmao that first product is definitely photoshopped 😂", "like_count": 1, "author_id": "UCzeJMeX2bFwqvs9IJGKorfQ", "author": "@mrhappygoluckyjock", "author_thumbnail": "https://yt3.ggpht.com/ytc/AOPolaTRITy2x4xoy7aYMgIpyvmdF-ixQlv9thvtg7To=s176-c-k-c0x00ffffff-no-rj", "parent": "root", "_time_text": "2 years ago", "timestamp": 1630454400, "author_url": "https://www.youtube.com/channel/UCzeJMeX2bFwqvs9IJGKorfQ", "author_is_uploader": false, "is_favorited": false}, {"id": "Ugx1HNwbzMS9V0pUrgN4AaABAg.9HYyh7phVcw9HbAfkT-Wfr", "text": "Really, how can you tell? 
Genuinely asking, it looks too good to me", "author_id": "UCzTLWlN4pDD1jLiJJLVrfDA", "author": "@kikikiki3216", "author_thumbnail": "https://yt3.ggpht.com/ytc/AOPolaQiA8_KkqCrK7o7WNNL5qLk3C-PrOy1S591OQ=s176-c-k-c0x00ffffff-no-rj", "parent": "Ugx1HNwbzMS9V0pUrgN4AaABAg", "_time_text": "2 years ago", "timestamp": 1630454400, "author_url": "https://www.youtube.com/channel/UCzTLWlN4pDD1jLiJJLVrfDA", "author_is_uploader": false, "is_favorited": false}, {"id": "UgzhitiBqqS5dUzDfIZ4AaABAg", "text": "That echo auto does not have good user reviews", "author_id": "UC8Krza6o2IbS9zTGjYgd4jA", "author": "@soupedkid13", "author_thumbnail": "https://yt3.ggpht.com/ytc/AOPolaQbtLnxh1qhqgYU8i3LsO_6qE8lCRmBbV_OJ6f-=s176-c-k-c0x00ffffff-no-rj", "parent": "root", "_time_text": "2 years ago", "timestamp": 1630454400, "author_url": "https://www.youtube.com/channel/UC8Krza6o2IbS9zTGjYgd4jA", "author_is_uploader": false, "is_favorited": false}, {"id": "UgyWmQAcvH3gCxWMz9x4AaABAg", "text": "Merry Christmas , thank you for your videos and energy", "author_id": "UCOcfr_BebW1QqXpTI-PNEaQ", "author": "@teresafinnerty207", "author_thumbnail": "https://yt3.ggpht.com/ytc/AOPolaQDSwM9-eRu5aBKVVC1bh4xx4A6LoH2Vaompo-j=s176-c-k-c0x00ffffff-no-rj", "parent": "root", "_time_text": "2 years ago", "timestamp": 1630454400, "author_url": "https://www.youtube.com/channel/UCOcfr_BebW1QqXpTI-PNEaQ", "author_is_uploader": false, "is_favorited": false}, {"id": "UgyWmQAcvH3gCxWMz9x4AaABAg.9HYtClhcvg19HYwMNLud3_", "text": "Thanks for being here Teresa!", "author_id": "UC5Qbo0AR3CwpmEq751BIy0g", "author": "@thedealguy", "author_thumbnail": "https://yt3.ggpht.com/PHbn_ZwKQ-3PPhTtF7k6Q5t-vGBnENCPZAQc9lNe-EGCeJJ8T5DgbNIvGSSmFNVUrOCV6l3q=s176-c-k-c0x00ffffff-no-rj", "parent": "UgyWmQAcvH3gCxWMz9x4AaABAg", "_time_text": "2 years ago", "timestamp": 1630454400, "author_url": "https://www.youtube.com/channel/UC5Qbo0AR3CwpmEq751BIy0g", "author_is_uploader": true, "is_favorited": false, "author_is_verified": true}, {"id": "U

So, now there are two strings which can close the text to extract: ', "like_count":' and ', "author_id":'

How can I write a regex that does what I want? I tried on my own with examples I found online, but it does not work... Again, many thanks in advance for this.

Sorry, I just tried a bit more and solved the problem. The correct line is:

Global $InputDatab = StringRegExp($InputDataa, '(?<=\"text\": \").*?(?=\", \"author_id|\", "like_count\")', 3)

Thanks, sorry!

Edited April 25, 2024 by Fr33b0w
I had a problem which I couldn't solve, but then, while waiting for an answer, I had an idea and... solved it myself.
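
For reference, the first attempt most likely failed because `\|` is an escaped, literal pipe character, so the look-ahead waited for the text `author_id|like_count`, which never occurs in the file. Alternation needs an unescaped `|`. The sketch below shows one way to write "from here to (here or here)" with a grouped alternation; the sample data and variable names are made up for illustration.

```autoit
#include <Array.au3>

; Two comments in the newer layout: one is followed by "like_count", the other by "author_id".
Local $sData = '{"id": "a", "text": "first comment", "like_count": 1, "author_id": "x"}, ' & _
        '{"id": "b", "text": "second comment", "author_id": "y"}'

; (?:like_count|author_id) is a non-capturing group: either key name may close the match.
Local $sPattern = '(?<="text": ").*?(?=", "(?:like_count|author_id)")'

Local $aText = StringRegExp($sData, $sPattern, 3)
If @error Then
    ConsoleWrite("No match found" & @CRLF)
Else
    ; Print one extracted comment per line.
    ConsoleWrite(_ArrayToString($aText, @CRLF) & @CRLF)
EndIf
```

Grouping the alternatives keeps the shared prefix `", "` in one place, so adding a third possible closing key later only means adding another name inside the parentheses.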

Nine · April 25, 2024

Global $InputDatab = StringRegExp($InputDataa, '(?<="text": )(.+?)(?|, "like_count"|, "author_id")', 3)

Try this.

Edited April 25, 2024 by Nine
forgot to have a capturing group

Fr33b0w · April 25, 2024

Hi Nine, and thanks for trying to help. This version of your solution leaves , "like_count" and , "author_id" after every line. I am very bad at regex, so I don't know why, but I would like to see if you can correct it, because your solution looks much clearer to me.

Nine · April 25, 2024

Already done, see my edit.

Fr33b0w · April 25, 2024

Sorry, I didn't refresh. Yes, it works great! Thank you for your help! Glad to see you again.

AspirinJunkie · April 25, 2024

The string appears to be a JSON string. Have you already tried one of the corresponding JSON UDFs? This should be easier to understand and more stable than using RegEx.
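
As a rough illustration of that route, the sketch below assumes a JSON UDF that provides `_JSON_Parse` and `_JSON_Get` and returns JSON objects as Maps and JSON arrays as AutoIt arrays (which needs an AutoIt version with Map support). The include name, the file name and the top-level "comments" key are all assumptions; the samples in this thread only show the individual comment objects, so check the real file layout and the UDF's own documentation first.

```autoit
#include "JSON.au3" ; assumption: the JSON UDF is saved next to the script under this name

Local $sFile = "example.info.json" ; hypothetical file name
Local $sJson = FileRead($sFile)
Local $vData = _JSON_Parse($sJson)

; Assumption: the comment objects sit in a top-level "comments" array.
Local $aComments = _JSON_Get($vData, "comments")
If Not IsArray($aComments) Then
    ConsoleWrite('! No "comments" array found in ' & $sFile & @CRLF)
    Exit
EndIf

For $i = 0 To UBound($aComments) - 1
    ; Each element is expected to be a Map keyed by the JSON field names.
    Local $mComment = $aComments[$i]
    If IsMap($mComment) Then ConsoleWrite($mComment["text"] & @CRLF & @CRLF)
Next
```

With a parser, the order of the fields around "text" no longer matters, which is exactly the kind of change that broke the regex when "like_count" moved.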

Fr33b0w · April 25, 2024

Oh, thanks for that. I am looking forward to checking out those UDFs. I have not been around much lately. I have to start learning RegEx the proper way, but I also like what you said about JSON UDFs...
