First, thank you for taking an interest in my inquiry here, and second,
a thank you to theCodeBot for the earlier input on parsing sentences.
I’m working on parsing entire documents of statements into questions about them. That is the eventual goal; for now I’m aiming for a few simple sentences. The project is an educational system. I think I’ve nailed down how to go about the parsing process in general, but as I prepare to build it some questions have arisen.
To determine parts of speech the system will use the 1913 Webster’s word list (for now; I just used what I could find). I took a copy of the XML-formatted GCIDE project files and have ripped out almost everything but the words and parts of speech. The word file set is now down to about 5 megabytes in size and may be a bit smaller when I finish.
So here’s the question:
How best to implement this data? What would be the best way to access the dictionary data?
Obviously my systems will have to load the text data, but I’m not sure what the best way to store it in memory and access it would be. An AS dictionary? An array? (I would think not.) Object properties as key-value pairs? (Did I even say that right?)
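To make the key-value option concrete, here is a rough sketch of the shape I have in mind: each word maps to the set of parts of speech it can take. It is written in Python purely for illustration, and the tab-separated "word, then part of speech" input format and the name load_lexicon are my own assumptions, not anything GCIDE provides.

```python
from collections import defaultdict

def load_lexicon(lines):
    """Build a word -> set-of-POS-labels mapping.

    Assumes each line looks like 'word<TAB>pos' (a made-up format);
    a word with several parts of speech appears on several lines.
    """
    lexicon = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        word, pos = line.split("\t", 1)
        lexicon[word.lower()].add(pos)
    return dict(lexicon)

# Made-up sample data standing in for the stripped-down word files.
sample = ["run\tv.", "run\tn.", "quick\ta.", "Dog\tn."]
lexicon = load_lexicon(sample)

print(lexicon.get("run"))           # both labels for "run"
print(lexicon.get("zebra", set()))  # empty set when the word is missing
```

A hash-based lookup like this costs about the same per word whether the table holds 200 entries or 200,000, which is why I lean toward it over scanning an array.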
The first problem I see is that we’re talking about roughly 200,000 words. Thankfully, the dictionary files are broken up alphabetically by first letter, but… And then, should the system evaluate each word of the target document one by one or…? Probably so, since the use of a word can vary from one sentence to another or even within the same sentence, and since the presence of modifiers will help identify verbs as opposed to nouns. Punctuation will have a bearing too. Hrm…
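Since the files are already split up, one idea is to load a given file only the first time a word needing it turns up. Below is a minimal sketch of that, again in Python just for illustration. LazyLexicon, load_letter and the sample data are hypothetical names of mine, and the loader is passed in as a function so the real file reading can be whatever the platform offers.

```python
class LazyLexicon:
    """Looks words up, loading each per-letter chunk on first use only."""

    def __init__(self, load_letter):
        # load_letter: function taking a letter and returning a
        # word -> set-of-POS-labels mapping for that letter's file.
        self._load_letter = load_letter
        self._loaded = {}

    def lookup(self, word):
        key = word.lower()
        if not key:
            return set()
        letter = key[0]
        if letter not in self._loaded:
            self._loaded[letter] = self._load_letter(letter)
        return self._loaded[letter].get(key, set())

# Stand-in for reading the real files; records which letters were loaded.
SAMPLE_FILES = {"d": {"dog": {"n."}}, "r": {"run": {"n.", "v."}}}
loads = []

def fake_load_letter(letter):
    loads.append(letter)
    return SAMPLE_FILES.get(letter, {})

lexicon = LazyLexicon(fake_load_letter)
lexicon.lookup("Dog")
lexicon.lookup("dig")
print(loads)  # the "d" chunk was loaded once, not twice
```

A short document may never touch some letters at all, so this could keep memory use down on older machines, at the cost of a pause the first time each letter is needed.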
In parsing the target document, an authoring system will first have to identify the part of speech of each word if it can. If there are multiple possibilities or no match, it will have to ask the material author for clarification and then add hidden instructions for that occurrence to the file, perhaps adding the new data to its own dictionary. After the file is preprocessed, it can be used by the student’s system in the same manner, but the student’s system will also build questions from the document to quiz the student on, then evaluate the answers and do some other stuff not directly related to this.
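The authoring pass described above could start out as something like the following sketch (Python for illustration only). It just sorts each word into "resolved", "ambiguous" or "unknown"; the function name, the status labels, the crude tokenizer and the tiny lexicon are all placeholders of mine, and it ignores modifiers and punctuation entirely.

```python
import re

def tag_sentence(sentence, lexicon):
    """Return (token, sorted POS labels, status) for each word.

    status is 'resolved' for exactly one label, 'ambiguous' for several,
    and 'unknown' when the word is not in the lexicon at all.
    """
    results = []
    for token in re.findall(r"[A-Za-z']+", sentence):
        tags = lexicon.get(token.lower(), set())
        if len(tags) == 1:
            status = "resolved"
        elif tags:
            status = "ambiguous"
        else:
            status = "unknown"
        results.append((token, sorted(tags), status))
    return results

# Made-up miniature lexicon for the example.
lexicon = {"the": {"def. art."}, "dog": {"n."}, "run": {"n.", "v."}}

tagged = tag_sentence("The dog can run.", lexicon)
for token, tags, status in tagged:
    print(token, tags, status)

# Words the material author would be asked about.
needs_author = [t for t, _, s in tagged if s != "resolved"]
```

Here "can" and "run" would be handed to the author, one because it is missing and the other because it has two candidate labels.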
I was already planning on building a small subset of common words to help speed up the process.
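A minimal sketch of how that subset might sit in front of the big dictionary (Python for illustration; make_lookup, COMMON and full_lookup are hypothetical names, and the entries are made up): check the small table first and only fall through to the full data on a miss.

```python
def make_lookup(common, full_lookup):
    """Return a lookup function that tries the small common-word table
    first and calls full_lookup(word) only when that misses."""
    def lookup(word):
        key = word.lower()
        if key in common:
            return common[key]
        return full_lookup(key)
    return lookup

# Made-up common-word table; a real one would hold a few hundred words.
COMMON = {"the": {"def. art."}, "a": {"indef. art."}, "and": {"conj."}}

fallback_calls = []  # records which words needed the full dictionary

def full_lookup(key):
    fallback_calls.append(key)
    return {"dog": {"n."}}.get(key, set())

lookup = make_lookup(COMMON, full_lookup)
tags_the = lookup("The")
tags_dog = lookup("dog")
print(tags_the, tags_dog, fallback_calls)
```

Whether this actually speeds anything up depends on how expensive the full lookup is; if the whole dictionary is already in one in-memory table, the gain may be small.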
Maybe having the authoring system build a custom dictionary of unique words wouldn’t be a bad idea… Hrm…
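Roughly what I am picturing for that, as a sketch under my own assumptions (Python for illustration; build_document_lexicon is a hypothetical name and the data is made up): collect the unique words in the document and copy out only their entries, so the student’s system never needs the full word set.

```python
import re

def build_document_lexicon(text, lexicon):
    """Return only the lexicon entries for words that occur in text."""
    unique_words = {w.lower() for w in re.findall(r"[A-Za-z']+", text)}
    return {w: lexicon[w] for w in unique_words if w in lexicon}

# Made-up full lexicon; "cat" never appears in the document below.
full_lexicon = {"the": {"def. art."}, "dog": {"n."}, "cat": {"n."}}

doc_lexicon = build_document_lexicon("The dog saw the other dog.", full_lexicon)
print(sorted(doc_lexicon))  # entries kept for this document
```

Words the author had to clarify by hand could be added to the same per-document table before it is saved alongside the file.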
The reason I’m asking is that I anticipate this could be a bit resource intensive. I’d like to be as resource conscious as possible so that the resulting programs can be used on older computers and won’t take too long to process for the students.
In the end the student’s system will be doing other, smaller tasks as well, but I don’t think they’ll have a significant impact.
Thoughts?
Thank you again.