01:59 frost joined 02:53 frost left 02:57 frost joined 03:02 m_athias left, m_athias joined
stevied in a regex, what is the equivalent of not matching a new line from perl: `[^\n]+` 03:04
03:11 codechurch joined 03:35 frost left 03:36 frost joined 04:02 codechurch left 05:23 frost left 06:06 TempIRCLogger__ left, qorg11 left, Manifest0 left, SmokeMachine left, CIAvash left, thowe left, sivoais left, mjgardner left 06:07 gfldex left, anight[m] left, m_athias left, Util left, destroycomputers left, samebchase left, lizmat left, tbrowder left, codesections left, discord-raku-bot left, MasterDuke left, camelia left 06:13 frost joined, m_athias joined, discord-raku-bot joined, lizmat joined, TempIRCLogger__ joined, Util joined, qorg11 joined, MasterDuke joined, Manifest0 joined, destroycomputers joined, gfldex joined, anight[m] joined, CIAvash joined, SmokeMachine joined, tbrowder joined, samebchase joined, mjgardner joined, thowe joined, sivoais joined, codesections joined, camelia joined
I'm totally lost with grammars 07:21
this works:
```
grammar G {
token TOP {<blah> k}
token blah { \w\w\w }
}
my $match = G.parse('duck');
say $match;
```
this doesn't match:
```
grammar G {
token TOP {<blah> k}
token blah { \w+ }
}
```
Nahita `\N+` I believe 07:27
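A minimal sketch of Nahita's suggestion: in Raku regexes `\N` matches any single character that is not a newline, so `\N+` plays the role of Perl's `[^\n]+`. The sample string and variable names are made up for illustration.

```raku
my $text = "first line\nsecond line";

# \N+ grabs everything up to (but not including) the first newline
my $first = ($text ~~ / \N+ /).Str;
say $first;   # first line

# A negated character class spells the same idea out explicitly
my $also = ($text ~~ / <-[\n]>+ /).Str;
say $also;    # first line
```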
stevied this has got to be a bug. this matches: 07:48
```
grammar G {
token TOP { 'd' <blah> '/' }
token blah { \w+ }
}
my $match = G.parse('duc/');
say $match;
```
this doesn't:
```
grammar G {
token TOP { 'd' <blah> 'k' }
token blah { \w+ }
}
```
lizmat the \w+ is probably too greedy 07:51
stevied ok, sorry, the `/` is not a `\w` character 07:52
so that makes sense
i tried making it non-greedy
didn't work: `\w+?`
lizmat please, I'm a pretty Raku grammar noob myself :-) 07:53
stevied now I don't feel so bad. 07:54
lizmat m: my token blah { \w+? }; say "foo" ~~ / f <blah> o /
camelia 「foo」
blah => 「o」
stevied maybe you can't do non-greedy in grammars?
lizmat looks to me you can ?
there's also: raku.land/github:jnthn/Grammar::Debugger 07:55
stevied but that's not a grammar, right? 07:56
m: grammar G { token TOP { 'd' <blah> 'k' } token blah { \w+? } } my $match = G.parse('duck'); say $match; 07:57
lizmat no, but a grammar is just a module of regexen really, with regexen being methods
stevied m: grammar G { token TOP { 'd' <blah> 'k' } token blah { \w+? } }; my $match = G.parse('duck'); say $match;
lizmat m: grammar G { token TOP { 'd' <blah> 'k' } token blah { \w+? } }; my $match = G.parse('duck'); say $match;
camelia ===SORRY!=== Error while compiling <tmp>
Strange text after block (missing semicolon or comma?)
at <tmp>:1
------> grammar G { token TOP { 'd' <blah> 'k' }⏏ token blah { \w+? } }; my $match = G.pa
expecting any of:
lizmat m: grammar G { token TOP { 'd' <blah> 'k' }; token blah { \w+? } }; my $match = G.parse('duck'); say $match;
camelia (Any)
lizmat m: my token blah { \w+? }; say "duck" ~~ / d <blah> k / 07:58
camelia Nil
stevied m: grammar G { token TOP { 'd' <blah> 'k' }; token blah { \w+? } }; my $match = G.parse('duck'); say $match;
07:58 frost left
lizmat hmmm 07:59
stevied i gotta get to bed. wanted to go out on a good note but getting nowhere on this 08:03
lizmat sorry, hope we'll be able to provide more clarity in the morn 08:04
stevied using that debugger
with non-greedy, it's just matching the "u" and nothing else
lizmat m: my regex blah { \w+ }; say "duck" ~~ / d <blah> k / 08:06
camelia 「duck」
blah => 「uc」
lizmat it needs to be able to backtrack, that's why it needs to be a regex
breakfast&
stevied oh the TOP needs to be a regex it looks like 08:08
I had only tried making the second block a regex
actually, they both need to be regexes, not tokens 08:09
alright, I'll have to sleep on this. I know what backtracking is but don't quite understand how it works across two different regexes like this. weird shit. 08:10
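A minimal sketch of what lizmat describes, reusing the 'duck' example (the grammar names `Ratchet` and `Backtrack` are invented here). A `token` implies `:ratchet`, so once `<blah>` has matched it never gives characters back to `TOP`; a `regex` leaves backtracking enabled, including across the subrule call.

```raku
# token: \w+ swallows "uck" and cannot hand the "k" back to TOP
grammar Ratchet {
    token TOP  { 'd' <blah> 'k' }
    token blah { \w+ }
}

# regex: the same patterns, but backtracking is allowed
grammar Backtrack {
    regex TOP  { 'd' <blah> 'k' }
    regex blah { \w+ }
}

my $ratchet-ok = so Ratchet.parse('duck');
my $match      = Backtrack.parse('duck');

say $ratchet-ok;        # False
say $match<blah>.Str;   # uc
```

The frugal `\w+?` attempt fails in a token for the same reason: `<blah>` settles for "u" and is never asked to try a longer match.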
13:55 discord-raku-bot left, discord-raku-bot joined 13:59 discord-raku-bot left 14:00 discord-raku-bot joined 15:36 frost joined 16:03 frost left
ok, got this working: 17:41
```
grammar G {
token TOP { .*? ( '<' 'a' <-[ > ]>+ '>' <hypertext> '<' '/' 'a' '>' .*? )+ .* }
token hypertext { <-[ < ]>+ }
}
my @matches = G.parse('<a href="kjsdf">blah 1</a><a href="/">blah 2</a>');
say @matches;
```
it works, but I'm gonna say that grammars are not the ideal tool for parsing html, just like with regexes
is that the common wisdom?
lizmat I think it is :-) 18:02
especially since HTML can be improperly formed and still sorta render ok in a browser 18:03
stevied I'm in the middle of posting to reddit about this right now. Let's see what happens.
lizmat stevied++
stevied right. though in my particular situation, I'm parsing an html document created from markdown using a tool. so the html should be well-formed 18:04
www.reddit.com/r/rakulang/comments..._to_parse/ 18:26
don't know who that dude in the picture is 🙂
m_athias @stevied#8273 why do you need the .*? in there? it should work just fine without them. 18:37
if you want to allow whitespace at the beginning <.ws> works. that way lies madness: writing decent rules to figure out what whitespace is relevant is a pain. 18:48
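A small sketch of the `<.ws>` suggestion, using a made-up grammar `Padded`: the leading dot calls the built-in whitespace rule without capturing it, so optional whitespace around the interesting part is skipped and stays out of the match tree.

```raku
grammar Padded {
    # <.ws> matches optional whitespace and does not create a capture
    token TOP  { <.ws> <word> <.ws> }
    token word { \w+ }
}

my $padded = Padded.parse("  duck\n");
my $word   = $padded<word>.Str;
say $word;   # duck
```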
18:55 thowe left, thowe joined
stevied @m_athias, I don't know why it's in there. I created it with lots of trial and error. I can play with it some more. 18:57
are you talking about the one at the beginning or near the end? 18:58
ok, yup. removing that worked 18:59
whoa, removing the second one worked, too 19:01
heh, i clearly don't know what I'm doing
actually, i take that back. removing those .*? breaks things. I had changed the string getting parsed to remove the text before and after the first and last anchor tags 19:03
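One possible way to cope with text before, between, and after the anchor tags without `.*?` is to let `TOP` either match an anchor or skip a single character. This is only a sketch under assumptions: the names `Anchors` and `anchor` and the sample string are invented, and like the grammar above it only handles simple, well-formed `<a ...>` tags.

```raku
grammar Anchors {
    # Try an anchor first; otherwise consume one character and retry
    token TOP       { [ <anchor> || . ]* }
    token anchor    { '<a' <-[ > ]>+ '>' <hypertext> '</a>' }
    token hypertext { <-[ < ]>+ }
}

my $html  = 'intro <a href="kjsdf">blah 1</a> middle <a href="/">blah 2</a> outro';
my $m     = Anchors.parse($html);

# Each <anchor> match carries its own <hypertext> capture
my @texts = $m<anchor>.map({ .<hypertext>.Str });
say @texts.join(', ');   # blah 1, blah 2
```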
ah, dammit. I pasted in the wrong code to reddit
good catch
ok, fixed it 19:04
alright, so what's the best way to parse html, then? 19:59
I think i'll pose this question to stackoverflow. too many requirements to outline here
stackoverflow.com/questions/708996...embed-code 20:08