01:59
frost joined
02:53
frost left
02:57
frost joined
03:02
m_athias left,
m_athias joined
|
|||
stevied | in a regex, what is the equivalent of Perl's `[^\n]+` (anything but a newline)? | 03:04 | |
03:11
codechurch joined
03:35
frost left
03:36
frost joined
04:02
codechurch left
05:23
frost left
06:06
TempIRCLogger__ left,
qorg11 left,
Manifest0 left,
SmokeMachine left,
CIAvash left,
thowe left,
sivoais left,
mjgardner left
06:07
gfldex left,
anight[m] left,
m_athias left,
Util left,
destroycomputers left,
samebchase left,
lizmat left,
tbrowder left,
codesections left,
discord-raku-bot left,
MasterDuke left,
camelia left
06:13
frost joined,
m_athias joined,
discord-raku-bot joined,
lizmat joined,
TempIRCLogger__ joined,
Util joined,
qorg11 joined,
MasterDuke joined,
Manifest0 joined,
destroycomputers joined,
gfldex joined,
anight[m] joined,
CIAvash joined,
SmokeMachine joined,
tbrowder joined,
samebchase joined,
mjgardner joined,
thowe joined,
sivoais joined,
codesections joined,
camelia joined
|
|||
stevied | I'm totally lost with grammars | 07:21 | |
this works: | |||
``` | |||
grammar G { | |||
token TOP {<blah> k} | |||
token blah { \w\w\w } | |||
} | |||
my $match = G.parse('duck'); | |||
say $match; | |||
``` | |||
this doesn't match: | |||
``` | |||
grammar G { | |||
token TOP {<blah> k} | |||
token blah { \w+ } | |||
} | |||
``` | |||
Nahita | `\N+` I believe | 07:27 | |
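A quick check of the `\N+` suggestion, as a minimal sketch: in Raku, `\N` matches any character that is not a logical newline, and the negated character class `<-[\n]>` is the closer literal translation of Perl's `[^\n]`. The variable name `$text` is only illustrative.

```raku
my $text = "first line\nsecond line";

# \N is "anything but a newline", so \N+ stops at the line break
say $text ~~ / \N+ /;        # 「first line」

# Negated character class, the direct counterpart of Perl's [^\n]+
say $text ~~ / <-[\n]>+ /;   # 「first line」
```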
stevied | this has got to be a bug. this matches: | 07:48 | |
``` | |||
grammar G { | |||
token TOP { 'd' <blah> '/' } | |||
token blah { \w+ } | |||
} | |||
my $match = G.parse('duc/'); | |||
say $match; | |||
``` | |||
this doesn't: | |||
``` | |||
grammar G { | |||
token TOP { 'd' <blah> 'k' } | |||
token blah { \w+ } | |||
} | |||
``` | |||
lizmat | the \w+ is probably too greedy | 07:51 | |
stevied | ok, sorry, the `/` is not a `\w` character | 07:52 | |
so that makes sense | |||
i tried making it non-greedy | |||
didn't work: `\w+?` | |||
lizmat | please, I'm a pretty Raku grammar noob myself :-) | 07:53 | |
stevied | now I don't feel so bad. | 07:54 | |
lizmat | m: my token blah { \w+? }; say "foo" ~~ / f <blah> o / | ||
camelia | 「foo」 blah => 「o」 |
||
stevied | maybe you can't do non-greedy in grammars? | ||
lizmat | looks to me you can ? | ||
there's also: raku.land/github:jnthn/Grammar::Debugger | 07:55 | ||
stevied | but that's not a grammar, right? | 07:56 | |
m: grammar G { token TOP { 'd' <blah> 'k' } token blah { \w+? } } my $match = G.parse('duck'); say $match; | 07:57 | ||
lizmat | no, but a grammar is just a module of regexen really, with regexen being methods | ||
stevied | m: grammar G { token TOP { 'd' <blah> 'k' } token blah { \w+? } }; my $match = G.parse('duck'); say $match; | ||
lizmat | m: grammar G { token TOP { 'd' <blah> 'k' } token blah { \w+? } }; my $match = G.parse('duck'); say $match; | ||
camelia | ===SORRY!=== Error while compiling <tmp> Strange text after block (missing semicolon or comma?) at <tmp>:1 ------> grammar G { token TOP { 'd' <blah> 'k' }⏏ token blah { \w+? } }; my $match = G.pa expecting any of: … |
||
lizmat | m: grammar G { token TOP { 'd' <blah> 'k' }; token blah { \w+? } }; my $match = G.parse('duck'); say $match; | ||
camelia | (Any) | ||
lizmat | m: my token blah { \w+? }; say "duck" ~~ / d <blah> k / | 07:58 | |
camelia | Nil | ||
stevied | m: grammar G { token TOP { 'd' <blah> 'k' }; token blah { \w+? } }; my $match = G.parse('duck'); say $match; | ||
07:58
frost left
|
|||
lizmat | hmmm | 07:59 | |
stevied | i gotta get to bed. wanted to go out on a good note but getting nowhere on this | 08:03 | |
lizmat | sorry, hope we'll be able to provide more clarity in the morn | 08:04 | |
stevied | using that debugger | ||
with non-greedy, it's just matching the "u" and nothing else | |||
lizmat | m: my regex blah { \w+ }; say "duck" ~~ / d <blah> k / | 08:06 | |
camelia | 「duck」 blah => 「uc」 |
||
lizmat | it needs to be able to backtrack, that's why it needs to be a regex | ||
breakfast& | |||
stevied | oh the TOP needs to be a regex it looks like | 08:08 | |
I had only tried making the second block a regex | |||
actually, they both need to be regexes, not tokens | 08:09 | ||
alright, I'll have to sleep on this. I know what backtracking is but don't quite understand how it works across two different regexes like this. weird shit. | 08:10 | ||
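A minimal sketch of the behaviour worked out above. A `token` implies `:ratchet`, so once `\w+` has taken `uck` it never gives the `k` back, and a frugal `\w+?` takes `u` and is never asked to extend. Declaring both rules as `regex` permits backtracking inside `blah` and from `TOP` back into `blah`. The grammar names are illustrative, and the lookahead variant is an assumption about one way to keep tokens, not something proposed in the chat.

```raku
# token implies :ratchet: \w+ swallows "uck" and never gives the "k" back
grammar Ratcheting {
    token TOP  { 'd' <blah> 'k' }
    token blah { \w+ }
}

# regex allows backtracking, both inside blah and from TOP into blah
grammar Backtracking {
    regex TOP  { 'd' <blah> 'k' }
    regex blah { \w+ }
}

# Hypothetical alternative that keeps tokens: only consume a word
# character when another word character follows it
grammar Lookahead {
    token TOP  { 'd' <blah> 'k' }
    token blah { [ \w <?before \w> ]+ }
}

say so Ratcheting.parse('duck');        # False
say Backtracking.parse('duck')<blah>;   # 「uc」
say Lookahead.parse('duck')<blah>;      # 「uc」
```

The lookahead version only works here because the trailing `k` is the last word character in the input; for anything less regular, switching the rules to `regex` is the simpler route.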
13:55
discord-raku-bot left,
discord-raku-bot joined
13:59
discord-raku-bot left
14:00
discord-raku-bot joined
15:36
frost joined
16:03
frost left
|
|||
stevied | ok, got this working: | 17:41 | |
``` | |||
grammar G { | |||
token TOP { .*? ( '<' 'a' <-[ > ]>+ '>' <hypertext> '<' '/' 'a' '>' .*? )+ .* } | |||
token hypertext { <-[ < ]>+ } | |||
} | |||
my @matches = G.parse('<a href="kjsdf">blah 1</a><a href="/">blah 2</a>'); | |||
say @matches; | |||
``` | |||
it works, but I'm gonna say that grammars are not the ideal tool for parsing html, just like with regexes | |||
is that the common wisdom? | |||
lizmat | I think it is :-) | 18:02 | |
especially since HTML can be improperly formed and still sorta render ok in a browser | 18:03 | ||
stevied | I'm in the middle of posting to reddit about this right now. Let's see what happens. | ||
lizmat | stevied++ | ||
stevied | right. though in my particular situation, I'm parsing an html document created from markdown using a tool. so the html should be well-formed | 18:04 | |
www.reddit.com/r/rakulang/comments..._to_parse/ | 18:26 | ||
don't know who that dude in the picture is 🙂 | |||
m_athias | @stevied#8273 why do you need the .*? in there? it should work just fine without them. | 18:37 | |
if you want to allow whitespace at the beginning <.ws> works. that way lies madness: writing decent rules to figure out what whitespace is relevant is a pain. | 18:48 | ||
18:55
thowe left,
thowe joined
|
|||
stevied | @m_athias, I don't know why it's in there. I created it with lots of trial and error. I can play with it some more. | 18:57 | |
are you talking about the one at the beginning or near the end? | 18:58 | ||
ok, yup. removing that worked | 18:59 | ||
whoa, removing the second one worked, too | 19:01 | ||
heh, i clearly don't know what I'm doing | |||
actually, i take that back. removing those .*? breaks things. I had changed the string getting parsed to remove the text before and after the first and last anchor tags | 19:03 | ||
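One possible way to tolerate text before, between, and after the anchors without relying on `.*?` is to let `TOP` walk the whole input, trying an anchor at each position and otherwise consuming filler. This is a sketch under assumptions, not the grammar from the chat: the names `Anchors`, `anchor`, and `filler` are hypothetical, and it assumes simple, well-formed `<a>` elements with plain-text content.

```raku
grammar Anchors {
    # Walk the whole input: at each position try an anchor first,
    # otherwise consume filler text
    token TOP       { [ <anchor> || <filler> ]* }
    token anchor    { '<a' <-[ > ]>* '>' <hypertext> '</a>' }
    token hypertext { <-[ < ]>+ }
    # Filler is a run of non-'<' characters, or a single stray '<'
    token filler    { <-[ < ]>+ || '<' }
}

my $html  = 'intro <a href="kjsdf">blah 1</a> middle <a href="/">blah 2</a> outro';
my $match = Anchors.parse($html);

# Each <anchor> inside the quantified group is collected into a list
my @texts = $match<anchor>.map({ ~$_<hypertext> });
say @texts.raku;   # ["blah 1", "blah 2"]
```

Because every rule is a `token`, nothing here depends on backtracking. `.parse` returns a single Match object, so the link texts are pulled out of its `<anchor>` captures rather than by assigning the result to an array.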
ah, dammit. I pasted in the wrong code to reddit | |||
good catch | |||
ok, fixed it | 19:04 | ||
alright, so what's the best way to parse html, then? | 19:59 | ||
I think i'll pose this question to stackoverflow. too many requirements to outline here | |||
stackoverflow.com/questions/708996...embed-code | 20:08 |