[NTLUG:Discuss] Counting key presses in a file...
Robert Pearson
e2eiod at gmail.com
Mon Aug 27 21:48:57 CDT 2007
On 8/27/07, Chris Cox <cjcox at acm.org> wrote:
> Richard Geoffrion wrote:
> > Robert Citek wrote:
> >> On 08/27/2007 03:13 PM, Richard Geoffrion wrote:
> >>
> >>> I know that 'wc' can count words and lines, but how would one count individual keystrokes?
> >>> While 'a' would constitute ONE keystroke, 'A' would constitute two
> >>> (Shift + a).
> >>>
> >> Not necessarily. If you have CapsLock on, then the reverse would be
> >> true. That is, 'A' would count as one and 'a' as two. What would happen if it was a string of 'A's, as in AAA? <snip>
> >>> Carriage returns would count as one while bolding text
> >>> would constitute four keystrokes as one would need a CTRL-B to turn
> >>> bolding on and another to turn bolding off.
> >>>
> >> What if I type 'a^Hah'? What if I type 'a{left arrow}h'? What if I
> >> type 'ah'? They all produce 'ah' but are a different number of
> >> key-strokes. Lastly, what about cut and paste?
> >>
> >>
> >>> How would one go about <snip> converting [a document] into something that can be parsed and counted?
> >>>
> >> If you can't use Word, which has a built-in counter, then open the
> >> document in OpenOffice Writer, press Ctrl-A to highlight everything,
> >> press Ctrl-C to copy all the text to the clipboard, open gedit, press
> >> Ctrl-V to paste the text into gedit, and save it as a text file
> >> "foobar.txt". Then open a terminal and type 'wc foobar.txt'.
> >>
> >> Am I even close to answering your question?
> >>
> >>
> > These are all VERY good points, and maybe I should describe more of the
> > end product I'm after. The end result would be an automated script
> > that could evaluate a set of files to calculate a $$ value for a
> > document based on the number of characters typed. (One would have to
> > assign certain values to CAPS, *bold*, _underline_ et al.)
>
> You know, since this is going to vary greatly and probably be
> very error-prone.... perhaps something more probabilistic is warranted.
> Maybe just assume a percentage based on normal text.
>
> Of course, I realize you are trying to be exact, but there are
> way too many variables.
>
> >
> > So, no matter how quickly or efficiently Bob can type a document, he
> > only gets paid a set value for the document based on pre-established
> > rules. If Bob can figure out shortcuts to transcribing his documents,
> > then fine -- he makes extra money. If Bob has to correct 1/3 of the
> > text he types or chooses to use the mouse to do formatting -- well then
> > that's his problem for being woefully inaccurate or inefficient.
> >
> > So with set values assigned for each character type and formatting
> > class.....any words of wisdom to solve this issue? I've been on
> > sourceforge and freshmeat but no luck so far. I've seen a couple of
> > commercial win32 packages, but win32 apps don't lend themselves too well
> > to automation in a linux cron job. :)
>
> Those win32 apps HAVE to be making some huge assumptions. Perhaps
> I missed it. Have you defined things down to a particular app,
> particular keyboard type, particular language?
>
> >
> > If one *DID* try to parse a file manually... what would one need to do?
> > Lots of grepping, counting and stripping of control characters? Counting
> > and stripping higher-cost characters -- upper case letters (an extra Shift)
> > and characters with values > 127 (typically foreign or accented characters)?
> >
> > I'm exploring the OO macro scene. Hopefully there is a similar project
> > somewhere.
> >
>
> Ok.. so we're assuming that what is being typed is an OOo document
> of some sort. I'm still not sure if that limits the variables down
> enough though.
>
> I'm not sure that your "motivation" paragraph has sold me on
> the benefits of this (not that you have to sell me on the idea
> of course). You're wanting to be very precise... are we
> talking about VERY high volumes or something? You don't have
> to answer that... just wondering why such high precision is
> needed.
>
> ???
Here is one approach along the lines of what Chris Cox suggested:
[The file]
RobertPearsonLSA2.doc - an *.odt file saved as "Microsoft Word 97/2000/XP (.doc)"
[Byte count]
wc -c RobertPearsonLSA2.doc
161280 RobertPearsonLSA2.doc
[Character count]
wc -m RobertPearsonLSA2.doc
[error messages = 364]
wc: RobertPearsonLSA2.doc:364: Invalid or incomplete multibyte or wide character
96255 RobertPearsonLSA2.doc
[Word count]
wc -w RobertPearsonLSA2.doc
[error messages = 364]
wc: RobertPearsonLSA2.doc:364: Invalid or incomplete multibyte or wide character
1628 RobertPearsonLSA2.doc
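A side note: the "Invalid or incomplete multibyte" warnings above are just
what wc, a plain-text tool, prints when pointed at the binary .doc. As a
minimal sketch -- assuming the document has first been exported to plain text
(the RobertPearsonLSA2.txt name below is hypothetical) -- the same three
counts could be reproduced in Python:

# wc_counts.py -- byte, character, and word counts, roughly wc -c / -m / -w
import sys

path = sys.argv[1]                       # e.g. RobertPearsonLSA2.txt (hypothetical)
data = open(path, 'rb').read()           # raw bytes
text = data.decode('utf-8', 'replace')   # characters; undecodable bytes get replaced

print("%d bytes" % len(data))
print("%d characters" % len(text))
print("%d words" % len(text.split()))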
[OpenOffice.org Writer reports for the *.odt original source file]
Tools > Word Count
Whole Document----
Words: 532
Characters: 3822
File > Properties > Statistics
Number of Pages: 2
Number of Tables: 9
Number of Graphics: 0
Number of OLE Objects: 0
Number of Paragraphs: 73
Number of Words: 532
Number of Characters: 3822
Number of Lines: 91
OpenOffice.org Writer does not report these for the *.doc file.
Microsoft Word should, but I don't have it running.
If you design a script to automatically collect and analyze these
numbers against the File Type (that part is important), you can
establish a workable statistical base for doing what you want to do.
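As a minimal sketch of that kind of collector -- the directory argument, the
use of the file extension as the "File Type", and the averages chosen are my
assumptions for illustration, not something taken from the files above:

# collect_counts.py -- gather counts per file and group them by File Type
import os, sys
from collections import defaultdict

stats = defaultdict(list)    # extension -> list of (bytes, characters, words)

for root, dirs, files in os.walk(sys.argv[1]):       # e.g. ./documents (hypothetical)
    for name in files:
        path = os.path.join(root, name)
        ext = os.path.splitext(name)[1].lower()      # ".odt", ".doc", ".txt", ...
        data = open(path, 'rb').read()
        text = data.decode('utf-8', 'replace')
        stats[ext].append((len(data), len(text), len(text.split())))

for ext, rows in sorted(stats.items()):
    n = len(rows)
    avg_words = sum(r[2] for r in rows) / float(n)
    print("%-6s files=%d  avg words=%.1f" % (ext, n, avg_words))

Run it over a directory of finished documents and the per-extension averages
become the statistical base you compare new files against.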
The other alternative is to write a script (Perl, Python or Ruby
preferred) to parse the file and return these values for each File
Type. Again, the goal is to establish a workable statistical base to
make inferences from.
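For the *.odt case that parser can stay fairly small, because an .odt file is
just a zip archive whose text lives in content.xml. A rough sketch -- the tag
stripping is crude, so the counts will not match Writer's exactly:

# odt_counts.py -- word/character counts straight from an .odt (zip + content.xml)
import re, sys, zipfile

odt = zipfile.ZipFile(sys.argv[1])               # e.g. RobertPearsonLSA2.odt
xml = odt.read('content.xml').decode('utf-8')
text = re.sub(r'<[^>]+>', ' ', xml)              # crude: drop XML tags, keep the text
words = text.split()

print("%d words" % len(words))
print("%d characters (spaces excluded)" % sum(len(w) for w in words))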
The real downside of the script is maintenance.
The structure of each file as created by the application software is
changed at random to deter "reverse engineering" by competitors.
This means you have to re-determine the new file structure each time
the script breaks.
I have lost count of the number of changes in the Word file format.
Some of them are really screwy (creative?).
Another alternative is to turn on a keystroke counter and create
representative files to produce these numbers. That's a lot of typing.
Back when they had stenographer pools or secretaries we just monitored
them with the keystroke counter turned on. Saved a lot of time.
You could do the same thing today with a "served" application. Turn on
the counter every time the application is called.
You will get a lot of data in a hurry that you will have to parse and
make sense of in order to refine the process to be reliable (works
every time) and valid (produces correct results every time).
There are some "Content Typing" third-party products out there. There
is even a product called eDiscovery that will process your backup
streams and type the Content. It has to be run several times before it
"learns" your environment well enough to be reliable and valid. It
ain't cheap...
Once you know the Content you can ask someone in that LOB (Line Of
Business) how valuable that file is. You get the TCO from the IT group
and the ROI from the LOB group. After that it is just smiling and
receiving the accolades your boss will get for coming up with this
solution.